# CodeColBERT
This model serves as the base for our semantic code retrieval system SELMA. It can be used for indexing and retrieval via the PyTerrier bindings for ColBERT.
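At retrieval time, ColBERT-style models score candidates with "MaxSim" late interaction: each query token embedding is matched against its most similar document token embedding, and the per-token maxima are summed. The sketch below illustrates this scoring rule with toy embedding vectors (plain Python, no model involved); it is an illustration of the mechanism, not the production code path.

```python
def maxsim_score(query_emb, doc_emb):
    """ColBERT-style late interaction over token embedding lists.

    For each query token embedding, take its maximum dot-product
    similarity across all document token embeddings, then sum
    these per-token maxima into one document score.
    """
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_emb)
        for qv in query_emb
    )

# Toy 2-dimensional token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]          # two query tokens
doc_partial = [[1.0, 0.0], [0.0, 0.0]]    # matches only the first token
doc_full = [[1.0, 0.0], [0.0, 1.0]]       # matches both tokens

print(maxsim_score(query, doc_partial))   # 1.0
print(maxsim_score(query, doc_full))      # 2.0
```

The document covering more query tokens receives the higher score, which is the intuition behind ColBERT's per-token matching.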
## Training Details
This model was trained for code retrieval, using CodeBERT as its base. Training uses the official ColBERTv2 code
([GitHub](https://github.com/stanford-futuredata/ColBERT)).
Our data source is the [CodeSearchNet Challenge](https://github.com/github/CodeSearchNet).
Training ColBERT requires triples of a query, a positive example, and a negative example. As queries, we used the documentation
provided for each sample in the CodeSearchNet dataset, while its code snippet serves as the positive example. Negative examples were
sampled randomly from the corpus. In total, we trained for 400,000 steps.