biomedical_ner_roberta_base / README.md

Shoriful025

Create README.md

a2331f8 verified 13 days ago

metadata

language:
  - en
tags:
  - ner
  - biomedical
  - token-classification
  - roberta
license: apache-2.0
datasets:
  - bc5cdr
  - ncbi_disease

biomedical_ner_roberta_base

Overview

biomedical_ner_roberta_base is a token classification model fine-tuned for Named Entity Recognition (NER) in the biomedical domain. It extracts entities from scientific abstracts, clinical notes, and other medical literature.

The model identifies three entity types using the BIO labeling scheme, in which a B- tag marks the first token of an entity, an I- tag marks each following token of the same entity, and O marks tokens outside any entity:

DISEASE: Pathological conditions, signs, and symptoms.
CHEMICAL: Drugs, medications, and chemical compounds.
GENE: Genes, proteins, and related molecular structures.
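To make the BIO scheme concrete, here is a minimal sketch of how a tagged token sequence collapses into entity spans. The helper name `bio_to_spans` and the hand-tagged example are illustrative assumptions, not part of the model or of the transformers library.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse parallel token/tag lists into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_tokens, current_type
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity and drop the stray token.
            flush()
    flush()
    return spans


# Hand-tagged example (not model output).
tokens = ["metformin", "for", "type", "2", "diabetes"]
tags = ["B-CHEMICAL", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

In practice the model tags sub-word tokens rather than whole words, so a decoder applied to real predictions would also need to merge sub-word pieces.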

Model Architecture

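This model is built on roberta-base and fine-tuned with RobertaForTokenClassification. It was trained on a composite dataset that includes BC5CDR (the BioCreative V Chemical Disease Relation corpus, annotated for chemical and disease mentions) and the NCBI Disease corpus (annotated for disease mentions). Neither of these two corpora annotates genes, so the GENE labels presumably come from another part of the composite dataset that is not named here.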

Base Model: RoBERTa Base (12 layers, hidden size 768, 12 attention heads, about 125M parameters).
Task: Token Classification (7 labels: O, B-DISEASE, I-DISEASE, B-CHEMICAL, I-CHEMICAL, B-GENE, I-GENE).
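The seven labels map naturally onto the id2label / label2id dictionaries that a token classification head expects. The sketch below builds them in the order listed above; the actual index order stored in this model's config is an assumption and should be checked against the published configuration.

```python
# Entity types in the order the label list above presents them.
ENTITY_TYPES = ["DISEASE", "CHEMICAL", "GENE"]

# "O" first, then a B-/I- pair per entity type (assumed ordering).
labels = ["O"] + [
    f"{prefix}-{entity_type}"
    for entity_type in ENTITY_TYPES
    for prefix in ("B", "I")
]

id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)
```

Dictionaries of this shape are what one would typically pass as id2label and label2id when loading a base checkpoint for fine-tuning, so that predictions are reported with readable label names.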

Intended Use

This model is intended for researchers and developers working with biomedical text data.

Information Extraction: Automated parsing of PubMed abstracts to identify key biomedical concepts.
Knowledge Graph Construction: Linking genes, drugs, and diseases discovered in text to structured knowledge bases.
Clinical Text Mining: Assisting in extracting relevant information from unstructured electronic health records (EHRs).
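As a rough illustration of the knowledge graph use case, the sketch below turns the entities found in one sentence into candidate chemical-disease edges. The function name `cooccurrence_edges`, the `co_occurs_with` relation name, and the hand-written input are hypothetical. Co-occurrence is only a candidate signal and does not establish a relation between the two entities.

```python
from itertools import product
from typing import Dict, List, Tuple


def cooccurrence_edges(
    entities: List[Dict[str, str]],
    source_type: str = "CHEMICAL",
    target_type: str = "DISEASE",
) -> List[Tuple[str, str, str]]:
    """Pair every source-type entity with every target-type entity."""

    def names(entity_type: str) -> List[str]:
        # Normalize surface forms and drop duplicates.
        return sorted(
            {e["word"].strip().lower() for e in entities if e["entity_group"] == entity_type}
        )

    return [
        (source, "co_occurs_with", target)
        for source, target in product(names(source_type), names(target_type))
    ]


# Hand-written entities shaped like aggregated NER pipeline output.
sentence_entities = [
    {"word": "metformin", "entity_group": "CHEMICAL"},
    {"word": "type 2 diabetes", "entity_group": "DISEASE"},
    {"word": "SLC22A1", "entity_group": "GENE"},
]
edges = cooccurrence_edges(sentence_entities)
print(edges)
```

Linking the surface forms to identifiers in a structured knowledge base is a separate normalization step that this model does not perform.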

How to use

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Replace "your_username" with the Hub account that hosts the model.
model_name = "your_username/biomedical_ner_roberta_base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" groups adjacent tokens that share an entity type into one span.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "The patient was treated with metformin for type 2 diabetes, but showed resistance related to the SLC22A1 gene variant."
results = nlp(text)

# Each result is a dict with "entity_group", "score", "word", "start", and "end".
for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.4f}")

# Expected output structure (scores are illustrative):
# Entity: metformin, Label: CHEMICAL, Score: 0.99...
# Entity: type 2 diabetes, Label: DISEASE, Score: 0.98...
# Entity: SLC22A1, Label: GENE, Score: 0.97...
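A common next step is to drop low-confidence predictions and group the remaining entities by type. The sketch below assumes results shaped like the aggregated pipeline output (dicts with "word", "entity_group", and "score"). The helper name `group_entities`, the threshold, and the sample scores are hypothetical and are not real model output.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_entities(results: List[Dict[str, Any]], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Keep entities scoring at least min_score and group their text by entity type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            # Strip whitespace that byte-level tokenizers can leave on merged words.
            grouped[entity["entity_group"]].append(entity["word"].strip())
    return dict(grouped)


# Hand-written stand-in for pipeline output; the scores are made up.
sample_results = [
    {"word": " metformin", "entity_group": "CHEMICAL", "score": 0.99},
    {"word": " type 2 diabetes", "entity_group": "DISEASE", "score": 0.98},
    {"word": " SLC22A1", "entity_group": "GENE", "score": 0.42},
]
grouped = group_entities(sample_results, min_score=0.9)
print(grouped)
```

A suitable threshold depends on the application and is best chosen on held-out data rather than fixed in advance.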