---
datasets:
- code-search-net/code_search_net
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
library_name: transformers
tags:
- code
---
## Detailed Model Description
A GPT-2-based tokenizer further trained on more than 400,000 Python functions from CodeSearchNet. It keeps the original byte-level BPE backbone and encodes indentation, common keywords, operators, and camel-case identifiers compactly, so it can be used in code-generation and code-understanding pipelines.
## Usage Examples:
```
example = """
class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
"""
```
Tokenizer output for the example (in GPT-2's byte-level encoding, `Ġ` stands for a space and `Ċ` for a newline):
```
['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weights', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']
```
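The token strings above use GPT-2's byte-level alphabet, in which every byte is shown as a printable character. As a reading aid, here is a minimal sketch that rebuilds that byte-to-character table and turns a token list back into plain text. The helper names `byte_level_alphabet` and `tokens_to_text` are made up for this illustration and are not part of this tokenizer or of `transformers`.

```python
def byte_level_alphabet():
    """Map each byte value (0-255) to the printable character GPT-2 shows for it."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_to_char = {b: chr(b) for b in printable}
    # Every other byte (space, newline, control bytes, ...) is shifted past U+00FF.
    shift = 0
    for b in range(256):
        if b not in byte_to_char:
            byte_to_char[b] = chr(256 + shift)
            shift += 1
    return byte_to_char


# Reverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {char: byte for byte, char in byte_level_alphabet().items()}


def tokens_to_text(tokens):
    """Concatenate byte-level token strings and decode them back to text."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


# The first six tokens of the output shown above.
sample = ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef']
print(repr(tokens_to_text(sample)))  # 'class LinearLayer():\n    def'
```

Note that `'ĊĠĠĠ'` followed by `'Ġdef'` decodes to a newline plus four spaces: one token covers the line break and most of the indentation.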
## Dataset features (train split):
```
Dataset({
    features: ['repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url', 'partition'],
    num_rows: 412178
})
```
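For readers who want to reproduce something similar, the sketch below shows one plausible way to train a tokenizer like this from the `code` column listed above, using `train_new_from_iterator` from `transformers`. It is not necessarily the procedure used for this model: the `"python"` configuration name, the vocabulary size of 52,000, the batch size, and the helper names `batched_column` and `train_code_tokenizer` are all assumptions. Depending on your `datasets` version, loading the corpus may need extra arguments, and the text column may be named differently.

```python
from itertools import islice


def batched_column(rows, column="code", batch_size=1000):
    """Yield lists of up to `batch_size` strings taken from `column` of each row."""
    iterator = iter(rows)
    while True:
        batch = [row[column] for row in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch


def train_code_tokenizer(column="code", vocab_size=52000):
    """Train a new vocabulary with GPT-2's tokenization pipeline (assumed settings)."""
    # Imported lazily so batched_column can be used without these packages.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Assumption: the "python" configuration and its train split.
    train_split = load_dataset(
        "code-search-net/code_search_net", "python", split="train"
    )
    base = AutoTokenizer.from_pretrained("openai-community/gpt2")
    # Same byte-level BPE algorithm, new merges learned from the code corpus.
    return base.train_new_from_iterator(
        batched_column(train_split, column=column), vocab_size=vocab_size
    )
```

Calling `train_code_tokenizer()` downloads the dataset and the base tokenizer; the result can then be stored with `save_pretrained` under a directory of your choice. Feeding the corpus in batches keeps memory use flat, since only one batch of strings is materialized at a time.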