---
language: code
thumbnail: https://cdn-media.huggingface.co/CodeBERTa/CodeBERTa.png
datasets:
- code_search_net
library_name: transformers.js
base_model:
- huggingface/CodeBERTa-small-v1
pipeline_tag: fill-mask
---



# CodeBERTa-small-v1 (ONNX)


This is an ONNX version of [huggingface/CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1). It was automatically converted and uploaded using [this Hugging Face Space](https://huggingface.co/spaces/onnx-community/convert-to-onnx).


## Usage with Transformers.js


See the Transformers.js documentation for the [`fill-mask` pipeline](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FillMaskPipeline).


---


# CodeBERTa

CodeBERTa is a RoBERTa-like model trained on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset from GitHub.

Supported languages:

```shell
"go"
"java"
"javascript"
"php"
"python"
"ruby"
```

The **tokenizer** is a Byte-level BPE tokenizer trained on the corpus using Hugging Face `tokenizers`.

Because it is trained on a corpus of code (vs. natural language), it encodes that corpus efficiently: sequences are 33% to 50% shorter than the same corpus tokenized by the gpt2/roberta tokenizers.
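As a rough sanity check of that claim, you can compare token counts on a small snippet (a minimal sketch; `roberta-base` is used here as the natural-language baseline and the snippet is illustrative, not part of the original evaluation):

```python
from transformers import AutoTokenizer

# Code-trained byte-level BPE vs. the standard RoBERTa tokenizer
code_tok = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
text_tok = AutoTokenizer.from_pretrained("roberta-base")

snippet = "def get_user(session, user_id):\n    return session.query(User).filter_by(id=user_id).first()"

# The code-trained tokenizer typically yields noticeably fewer tokens on source code.
print("CodeBERTa tokens:", len(code_tok.tokenize(snippet)))
print("RoBERTa tokens:  ", len(text_tok.tokenize(snippet)))
```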

The (small) **model** is a 6-layer, 84M-parameter, RoBERTa-like Transformer (the same number of layers and heads as DistilBERT), initialized with the default settings and trained from scratch on the full corpus (~2M functions) for 5 epochs.
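To double-check the reported size locally, you can load the checkpoint and count its parameters (a minimal sketch; the exact total depends on whether the LM head is counted):

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("huggingface/CodeBERTa-small-v1")

print("layers:", model.config.num_hidden_layers)              # 6
print("params:", sum(p.numel() for p in model.parameters()))  # roughly 84M
```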

### Tensorboard for this training ⤵️

[![tb](https://cdn-media.huggingface.co/CodeBERTa/tensorboard.png)](https://tensorboard.dev/experiment/irRI7jXGQlqmlxXS0I07ew/#scalars)

## Quick start: masked language modeling prediction

```python
PHP_CODE = """
public static <mask> set(string $key, $value) {
	if (!in_array($key, self::$allowedKeys)) {
		throw new \InvalidArgumentException('Invalid key given');
	}
	self::$storedValues[$key] = $value;
}
""".lstrip()
```

### Does the model know how to complete simple PHP code?

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="huggingface/CodeBERTa-small-v1",
    tokenizer="huggingface/CodeBERTa-small-v1"
)

fill_mask(PHP_CODE)

## Top 5 predictions:
#
# ' function'   # prob 0.9999827146530151
# 'function'
# ' void'
# ' def'
# ' final'
```

### Yes! That was easy 🎉 What about some Python (warning: this is going to be meta)

```python
PYTHON_CODE = """
def pipeline(
    task: str,
    model: Optional = None,
    framework: Optional[<mask>] = None,
    **kwargs
) -> Pipeline:
	pass
""".lstrip()
```

Results:
```python
'framework', 'Framework', ' framework', 'None', 'str'
```

> This program can auto-complete itself! 😱

### Just for fun, let's try to mask natural language (not code):

```python
fill_mask("My name is <mask>.")

# {'sequence': '<s> My name is undefined.</s>', 'score': 0.2548016905784607, 'token': 3353}
# {'sequence': '<s> My name is required.</s>', 'score': 0.07290805131196976, 'token': 2371}
# {'sequence': '<s> My name is null.</s>', 'score': 0.06323737651109695, 'token': 469}
# {'sequence': '<s> My name is name.</s>', 'score': 0.021919190883636475, 'token': 652}
# {'sequence': '<s> My name is disabled.</s>', 'score': 0.019681859761476517, 'token': 7434}
```

This (kind of) works because code contains comments (which contain natural language).

Of course, the most frequent name for a computer scientist must be undefined 🤓.


## Downstream task: [programming language identification](https://huggingface.co/huggingface/CodeBERTa-language-id)

See the model card for **[`huggingface/CodeBERTa-language-id`](https://huggingface.co/huggingface/CodeBERTa-language-id)** 🤯.
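As a quick sketch of how that downstream model can be called (assuming the standard `text-classification` pipeline works for that checkpoint; see its model card for the recommended usage and exact label set):

```python
from transformers import pipeline

lang_id = pipeline(
    "text-classification",
    model="huggingface/CodeBERTa-language-id",
    tokenizer="huggingface/CodeBERTa-language-id",
)

# Expected to return something like [{'label': 'python', 'score': ...}]
lang_id("def f(x):\n    return x ** 2")
```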

<br>

## CodeSearchNet citation

<details>

```bibtex
@article{husain_codesearchnet_2019,
	title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
	shorttitle = {{CodeSearchNet} {Challenge}},
	url = {http://arxiv.org/abs/1909.09436},
	urldate = {2020-03-12},
	journal = {arXiv:1909.09436 [cs, stat]},
	author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
	month = sep,
	year = {2019},
	note = {arXiv: 1909.09436},
}
```

</details>