# Model Card for Code-Net Tokenizer Trained on GPT-2

This model card describes a custom tokenizer trained from the existing GPT-2 tokenizer on the CodeSearchNet dataset. The tokenizer was adapted to better handle code-specific tokenization while leveraging the scale and fine-grained vocabulary design of the GPT-2 model.

## Model Details

This tokenizer was retrained on the CodeSearchNet dataset, which contains millions of code snippets in multiple programming languages. It was initialized from the GPT-2 tokenizer and then adapted to better handle the unique characteristics of programming-language syntax and semantics; a rough sketch of the procedure follows the details below.

- **Developed by:** Aditya Ak
- **Shared by:** Aditya Ak
- **Model type:** Tokenizer
- **Language(s) (NLP):** Python
- **License:** Apache 2.0
- **Finetuned from model:** openai-community/gpt2
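
The exact training script is not published in this card. The sketch below shows how such a tokenizer is typically retrained via `train_new_from_iterator`; the dataset id, field name, vocabulary size, and batch size are illustrative assumptions, not the author's confirmed settings.

```python
# Minimal sketch, assuming the public code_search_net dataset on the Hugging Face
# Hub -- not the author's actual script. vocab_size and batch_size are arbitrary.
from datasets import load_dataset
from transformers import AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
dataset = load_dataset("code_search_net", "python", split="train")

def code_iterator(batch_size=1000):
    # Yield raw source-code strings in batches for tokenizer training.
    for start in range(0, len(dataset), batch_size):
        yield dataset[start : start + batch_size]["whole_func_string"]

# Retrain the BPE merges on code while keeping GPT-2's byte-level algorithm
# and special tokens.
code_tokenizer = base_tokenizer.train_new_from_iterator(code_iterator(), vocab_size=52000)
code_tokenizer.save_pretrained("code-net-tokenizer")
```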

## Uses

### Direct Use

The tokenizer can be used directly in any NLP task that involves source code, such as code generation, code summarization, or bug detection, simply by replacing the original GPT-2 tokenizer with this newly trained version.
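
For example (the repository id below is a placeholder, not this tokenizer's confirmed Hub path):

```python
from transformers import AutoTokenizer

# Placeholder repo id -- replace with this tokenizer's actual Hub path.
tokenizer = AutoTokenizer.from_pretrained("your-username/code-net-tokenizer")

snippet = "def add(a: int, b: int) -> int:\n    return a + b"
print(tokenizer.tokenize(snippet))       # subword tokens
print(tokenizer(snippet)["input_ids"])   # corresponding token ids
```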

### Downstream Use

When plugged into a code-generation or code-understanding pipeline, this tokenizer can help improve the model's understanding of programming languages and code structure.
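
One caveat worth spelling out: retraining changes the vocabulary and token ids, so a pretrained GPT-2 model cannot use this tokenizer as a drop-in replacement without adaptation. A minimal sketch (placeholder repo id again):

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("your-username/code-net-tokenizer")  # placeholder
model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

# The retrained vocabulary no longer matches GPT-2's original embedding matrix,
# so resize the embeddings to the new vocabulary size...
model.resize_token_embeddings(len(tokenizer))
# ...and fine-tune on code: the old embedding rows no longer correspond
# to the new token ids.
```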

### Out-of-Scope Use

This tokenizer is specifically designed for tokenizing programming code. It is not suited for general natural-language tasks such as sentiment analysis or free-form text generation outside the context of source code.

## Bias, Risks, and Limitations

Like any data-driven component, this tokenizer may reflect biases in the dataset it was trained on. For example, it might have difficulty with edge cases or rare programming-language constructs that were underrepresented in the training data.

### Recommendations

Users should be aware of potential limitations when applying this tokenizer to less-common programming languages. Additionally, it may not handle malformed code or highly unconventional syntax well.
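
A quick, illustrative sanity check is to compare fragmentation against the base GPT-2 tokenizer on a snippet from your target language; markedly heavier fragmentation suggests that language was underrepresented in training (placeholder repo id as above):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("openai-community/gpt2")
code = AutoTokenizer.from_pretrained("your-username/code-net-tokenizer")  # placeholder

snippet = "xs |> List.filter (fun x -> x mod 2 = 0)"  # an OCaml-style one-liner
print("gpt2    :", len(base.tokenize(snippet)), "tokens")
print("code-net:", len(code.tokenize(snippet)), "tokens")
```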

## How to Get Started with the Model