---
license: cc-by-4.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- code
metrics:
- accuracy
- f1
---
# CodeBERT-SO
Repository for CodeBERT fine-tuned on Stack Overflow snippets, covering NL-PL pairs in six languages (Python, Java, JavaScript, PHP, Ruby, Go).
## Training Objective
This model is initialized from [CodeBERT-base](https://huggingface.co/microsoft/codebert-base) and trained to classify whether a user will drop out, given their posts and code snippets.
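A minimal sketch of loading a checkpoint for this binary classification objective with the `transformers` library. The checkpoint name below points at the base model this repository starts from, and `num_labels=2` reflects the binary dropout label; adjust the name to this repository's identifier to load the fine-tuned weights.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Base checkpoint named above; swap in this repository's id for the
# fine-tuned model. num_labels=2 matches the binary dropout objective.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2
)
```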
## Training Regime
Input texts were preprocessed with unicode normalisation (NFC form), removal of extraneous whitespace, removal of punctuation (except within links), lowercasing, and stopword removal.
In-line comments and docstrings were also stripped from code snippets (cf. the main manuscript). The RoBERTa tokenizer was used, as it is the tokenizer built into the original CodeBERT implementation.
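The preprocessing steps above can be sketched in plain Python as below. The stopword list and the link-protection rule are illustrative assumptions; the pipeline described in the manuscript may differ in detail.

```python
import re
import unicodedata

STOPWORDS = {"a", "an", "the", "is", "to", "of"}  # illustrative subset only
URL_RE = re.compile(r"https?://\S+")

def preprocess(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # unicode normalisation
    text = text.lower()                        # lowercasing
    # Protect links so their punctuation survives the strip below.
    links = URL_RE.findall(text)
    text = URL_RE.sub(" \x00 ", text)
    text = re.sub(r"[^\w\s\x00]", "", text)    # remove punctuation
    for link in links:                         # restore links in order
        text = text.replace("\x00", link, 1)
    words = text.split()                       # also collapses whitespace
    return " ".join(w for w in words if w not in STOPWORDS)
```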
Training ran for 8 epochs with a batch size of 8, a learning rate of 1e-5, and an Adam epsilon (weight-update denominator) of 1e-8.
A random 20% sample of the entire dataset was held out as the validation set.
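The hyperparameters and the random 80/20 split can be summarised as follows; the seed value is an assumption for reproducibility, not taken from the source.

```python
import random

# Hyperparameters reported above.
HPARAMS = {"epochs": 8, "batch_size": 8, "learning_rate": 1e-5, "adam_epsilon": 1e-8}

def split_dataset(dataset, val_fraction=0.2, seed=42):
    """Random train/validation split; the seed is an illustrative assumption."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_fraction)
    return items[n_val:], items[:n_val]  # (train, validation)
```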
## Performance
* Final validation accuracy: 0.822
* Final validation F1: 0.809
* Final validation loss: 0.5
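For reference, the reported accuracy and F1 can be computed from validation predictions as in this pure-Python sketch of the standard binary definitions:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions matching the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```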