# GenerRNA
GenerRNA is a generative RNA language model based on a Transformer decoder-only architecture. It was pre-trained on 30M sequences, encompassing 17B nucleotides.
Here, you can find all the relevant scripts for running GenerRNA on your machine. GenerRNA enables you to generate RNA sequences in a zero-shot manner to explore the RNA space, or to fine-tune the model on a specific dataset to generate RNAs belonging to a particular family or possessing specific characteristics.
# Requirements
A CUDA environment with a minimum of 8GB VRAM is required.
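The snippet below is a quick environment sanity check (not part of the repository) to confirm that CUDA is visible to PyTorch and that the GPU has enough VRAM:
```
import torch

# Check that a CUDA device is present and report its VRAM.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device detected; GenerRNA requires a CUDA environment.")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
if vram_gb < 8:
    print("Warning: less than 8GB of VRAM may be insufficient.")
```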
### Dependencies
```
torch>=2.0
numpy
transformers==4.33.0.dev0
datasets==2.14.4
tqdm
```
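One way to install these (an assumption, not an official install recipe: it presumes a pip-based environment, and the pinned `transformers` development version is not published on PyPI, so it may need to be installed from the Hugging Face GitHub repository):
```
pip install "torch>=2.0" numpy "datasets==2.14.4" tqdm
pip install git+https://github.com/huggingface/transformers.git
```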
# Usage
First, recombine the split model weights with the command `cat model.pt.part-* > model.pt.recombined`.
#### Directory tree
```
.
├── LICENSE
├── README.md
├── configs
│   ├── example_finetuning.py
│   └── example_pretraining.py
├── experiments_data
├── model.pt.part-aa       # split binary data of the *HISTORICAL* model (shorter context window, lower VRAM consumption)
├── model.pt.part-ab
├── model.pt.part-ac
├── model.pt.part-ad
├── model_updated.pt       # *NEWER* model, with a longer context window, trained on a deduplicated dataset
├── model.py               # defines the architecture
├── sampling.py            # script to generate sequences
├── tokenization.py        # prepares data
├── tokenizer_bpe_1024
│   ├── tokenizer.json
│   └── ...
└── train.py               # script for training/fine-tuning
```
### De novo Generation in a zero-shot fashion
Usage example:
```
python sampling.py \
--out_path {output_file_path} \
--max_new_tokens 256 \
--ckpt_path {model.pt} \
--tokenizer_path {path_to_tokenizer_directory, e.g. ./tokenizer_bpe_1024}
```
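For instance, with the recombined historical model and the bundled tokenizer (the output file name here is only an example):
```
python sampling.py \
--out_path generated_sequences.txt \
--max_new_tokens 256 \
--ckpt_path model.pt.recombined \
--tokenizer_path ./tokenizer_bpe_1024
```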
### Pre-training or Fine-tuning on your own sequences
First, tokenize your sequence data, ensuring each sequence is on a separate line and there is no header.
```
python tokenization.py \
--data_dir {path_to_the_directory_containing_sequence_data} \
--file_name {file_name_of_sequence_data} \
--tokenizer_path {path_to_tokenizer_directory} \
--out_dir {directory_to_save_tokenized_data} \
--block_size 256
```
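If your sequences are stored in FASTA format, they first need to be flattened into this headerless one-sequence-per-line layout. A minimal sketch using only the Python standard library (file names are placeholders, not files shipped with GenerRNA):
```
# Flatten a FASTA file into one sequence per line, dropping the ">" header lines.
with open("sequences.fasta") as fin, open("sequences.txt", "w") as fout:
    seq = []
    for line in fin:
        line = line.strip()
        if line.startswith(">"):      # header line: write out the previous record
            if seq:
                fout.write("".join(seq) + "\n")
                seq = []
        elif line:
            seq.append(line)
    if seq:                           # write out the final record
        fout.write("".join(seq) + "\n")
```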
Next, refer to `./configs/example_**.py` to create a config file for the GPT model.
Lastly, execute the following command:
```
python train.py \
--config {path_to_your_config_file}
```
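For example, starting from one of the bundled config files:
```
python train.py \
--config configs/example_finetuning.py
```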
### Train your own tokenizer
Usage example:
```
python train_BPE.py \
--txt_file_path {path_to_training_file (txt, each sequence on a separate line)} \
--vocab_size 50256 \
--new_tokenizer_path {directory_to_save_trained_tokenizer}
```
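The bundled tokenizer is stored as a `tokenizer.json` file (see `tokenizer_bpe_1024/`), so a tokenizer trained this way can presumably be loaded through the Hugging Face fast-tokenizer API. A sketch, assuming that layout (the path is illustrative):
```
from transformers import PreTrainedTokenizerFast

# Point tokenizer_file at the tokenizer.json produced by train_BPE.py.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer_bpe_1024/tokenizer.json")
print(tokenizer.tokenize("AUGGCUACGUAGCUAGCUA"))
```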
# License
The source code is licensed under the MIT License. See `LICENSE`.