Update README.md
README.md CHANGED

@@ -30,29 +30,39 @@ Details of the dataset will be shared in the supplementary materials of the paper
 ```
 3. **Example Code:**
 ```python
-# Example gene sequence
-seq_list = ["ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"]
-
-# Initialize the tokenizer
-tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=4096)
-tokenized_output = tokenizer.kmer_tokenize(seq_list)
-
-# Convert tokenized output to tensor
-inputs = torch.tensor(tokenized_output)
-
-# Load the pre-trained BigBird model
-model = AutoModel.from_pretrained("MsAlEhR/MetaBERTa-bigbird-gene", output_hidden_states=True)
-
-# Generate hidden states
-hidden_states = model(inputs)[0]
-
-# Compute mean and max pooling of the hidden states
-embedding_mean = torch.mean(hidden_states[-1], dim=1)
-embedding_max = torch.max(hidden_states[-1], dim=1)
+from KmerTokenizer import KmerTokenizer
+from transformers import AutoModel
+import torch
+
+# Example gene sequence
+seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"
+
+# Initialize the tokenizer
+tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=4096)
+tokenized_output = tokenizer.kmer_tokenize(seq)
+pad_token_id = 2  # Set pad token ID
+
+# Create attention mask (1 for tokens, 0 for padding)
+attention_mask = torch.tensor([1 if token != pad_token_id else 0 for token in tokenized_output], dtype=torch.long).unsqueeze(0)
+
+# Convert tokenized output to LongTensor and add batch dimension
+inputs = torch.tensor([tokenized_output], dtype=torch.long)
+
+# Load the pre-trained BigBird model
+model = AutoModel.from_pretrained("MsAlEhR/MetaBERTa-bigbird-gene", output_hidden_states=True)
+
+# Generate hidden states
+outputs = model(input_ids=inputs, attention_mask=attention_mask)
+
+# Get embeddings from the last hidden state
+embeddings = outputs.hidden_states[-1]
+
+# Expand attention mask to match the embedding dimensions
+expanded_attention_mask = attention_mask.unsqueeze(-1)
+
+# Compute mean sequence embeddings
+mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1)
 ```

 **Citation:**
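The updated example stops at the mean-pooled sequence embedding. As a minimal usage sketch (not part of the commit), assuming the snippet above has already been run so that `tokenizer`, `model`, and `pad_token_id` are in scope, the pooled embeddings of two sequences could be compared with cosine similarity; the `embed` helper and the second sequence below are illustrative assumptions, not repository code.

```python
import torch
import torch.nn.functional as F

def embed(seq: str) -> torch.Tensor:
    # Reuse the same tokenize / mask / mean-pool recipe as the README example.
    tokens = tokenizer.kmer_tokenize(seq)
    ids = torch.tensor([tokens], dtype=torch.long)
    mask = torch.tensor([[1 if t != pad_token_id else 0 for t in tokens]], dtype=torch.long)
    hidden = model(input_ids=ids, attention_mask=mask).hidden_states[-1]
    mask = mask.unsqueeze(-1)
    return torch.sum(mask * hidden, dim=1) / torch.sum(mask, dim=1)  # shape: [1, hidden_size]

emb_a = embed("ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC")
emb_b = embed("ATGCGTACGTTAGCATTGCA")  # hypothetical second sequence
print(F.cosine_similarity(emb_a, emb_b, dim=1).item())
```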