Named Entity Recognition (NER) in Scientific Texts: Custom Rule-Based Pipelines
This notebook extracts structured, domain-specific entities from multi-disciplinary research abstracts using the standard spaCy pipeline and custom pattern-matching extensions. It is built for indexing unstructured literature datasets and cataloging metadata.
1. Setup and Load spaCy
This section loads spaCy and the default English model. In an HPC or local environment, install the model first with:
pip install spacy
python -m spacy download en_core_web_sm
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
print("spaCy version:", spacy.__version__)
print("Pipeline components:", nlp.pipe_names)
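If the model has not been downloaded, `spacy.load` raises an `OSError`. The sketch below shows one way to guard the load. The helper name `load_pipeline` and the fallback to a blank English pipeline are assumptions made for illustration and are not part of the original notebook. A blank pipeline only tokenizes, so it supports rule-based matching but provides no statistical NER.

```python
import spacy


def load_pipeline(model_name="en_core_web_sm"):
    """Load a packaged spaCy model, falling back to a tokenizer-only pipeline.

    Hypothetical helper: the fallback behaviour is an illustrative choice.
    """
    try:
        return spacy.load(model_name)
    except OSError:
        # spaCy raises OSError when the named package or path cannot be found.
        print(f"Model '{model_name}' not found; using a blank English pipeline.")
        return spacy.blank("en")


# A deliberately missing model name exercises the fallback path.
demo_nlp = load_pipeline("model_name_that_is_not_installed")
print("Pipeline components:", demo_nlp.pipe_names)
```

In a shared environment it may be preferable to let the error propagate instead, so that a missing model is noticed rather than silently replaced.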
2. Sample Scientific Text
Define a few abstracts covering particle physics, genomics, and satellite hydrology. These examples show the kind of domain text that might be processed in an NMS faculty collaboration.
What is Named Entity Recognition?
Named Entity Recognition (NER) is a task that finds and labels important items in text, like organizations, people, locations, dates, or scientific terms. In a research workflow, NER helps transform unstructured text into structured data that can be searched, filtered, or linked to other datasets.
texts = [
    "The Higgs boson measurements from the Large Hadron Collider at CERN continue to refine the Standard Model of particle physics.",
    "Using CRISPR-Cas9 gene editing, researchers at Ohio State University are studying genomic signatures in Arabidopsis thaliana to improve trait selection.",
    "The SWOT mission collaboration between NASA and CNES provides high-resolution river surface elevation data for global hydrology research.",
]
for i, text in enumerate(texts, 1):
    print(f"Abstract {i}: {text}")
    print()
3. Run the NER Pipeline
Process each abstract and extract entities along with their labels and explanations. This is the core of a simple NLP extraction pipeline.
spaCy uses standardized entity labels like ORG for organizations, GPE for geopolitical entities, PERSON for people, and DATE for dates. The spacy.explain() helper lets you see a readable description for each label.
for i, text in enumerate(texts, 1):
    doc = nlp(text)
    print(f"\nAbstract {i} entities:")
    for ent in doc.ents:
        print(f" {ent.text:<25} | {ent.label_:<10} | {spacy.explain(ent.label_)}")
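Printing is fine for inspection, but indexing and cataloging need the entities as records. The sketch below collects one dictionary per entity and uses `nlp.pipe`, which streams texts through the pipeline in batches. The function name `extract_records`, the record fields, and the two demo sentences are assumptions for illustration. A blank pipeline with two hand-written rules stands in for `en_core_web_sm` so the example runs without a model download; in the notebook you would pass the loaded `nlp` and `texts` instead.

```python
import spacy

# Assumption: a tokenizer-only pipeline plus two rules replaces the trained model.
demo_nlp = spacy.blank("en")
demo_ruler = demo_nlp.add_pipe("entity_ruler")
demo_ruler.add_patterns([
    {"label": "ORG", "pattern": "NASA"},
    {"label": "ORG", "pattern": "CERN"},
])

demo_texts = [
    "The SWOT mission is a collaboration between NASA and CNES.",
    "The Large Hadron Collider is operated by CERN.",
]


def extract_records(pipeline, texts):
    """Return one dict per entity, keyed by the abstract's position in `texts`."""
    records = []
    # pipeline.pipe yields Doc objects in the same order as the input texts.
    for abstract_id, doc in enumerate(pipeline.pipe(texts), 1):
        for ent in doc.ents:
            records.append({
                "abstract": abstract_id,
                "text": ent.text,
                "label": ent.label_,
                "start_char": ent.start_char,
                "end_char": ent.end_char,
            })
    return records


records = extract_records(demo_nlp, demo_texts)
for record in records:
    print(record)
```

Keeping the character offsets lets a downstream index link each entity back to its exact position in the source abstract.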
4. Visualize Detected Entities
Use spaCy's displaCy to render entities visually inside the notebook. This can be useful for presenting extraction results to faculty or stakeholders.
doc = nlp(texts[1])
displacy.render(doc, style="ent", jupyter=True)
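Outside a notebook, for example in a batch job on an HPC node, `displacy.render` can return the markup as a string when called with `jupyter=False`, and that string can be saved for sharing. The sketch below assumes a hand-annotated entity so that it does not depend on a trained model; the output file name `entities.html` and the colour value are arbitrary illustrative choices.

```python
from pathlib import Path

import spacy
from spacy import displacy

demo_nlp = spacy.blank("en")
demo_text = "Researchers at Ohio State University use CRISPR-Cas9."
demo_doc = demo_nlp(demo_text)

# Mark one entity by hand (assumption: stands in for model output).
phrase = "Ohio State University"
start = demo_text.index(phrase)
demo_doc.ents = [demo_doc.char_span(start, start + len(phrase), label="ORG")]

html = displacy.render(
    demo_doc,
    style="ent",
    jupyter=False,  # return the markup instead of displaying it inline
    page=True,      # wrap the markup in a complete HTML page
    options={"colors": {"ORG": "#aa9cfc"}},  # per-label highlight colour
)

# Hypothetical output location; adjust to your project layout.
output_path = Path("entities.html")
output_path.write_text(html, encoding="utf-8")
print(f"Wrote {len(html)} characters to {output_path}")
```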
5. Add Custom Domain Entities
Many research domains use terms not recognized by a general English model. An EntityRuler lets us add domain-specific terms such as SWOT, CRISPR-Cas9, and Higgs boson.
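# Place the rule-based matcher ahead of the statistical NER so its matches take priority.
# The guard keeps this cell safe to re-run: adding a second "entity_ruler" raises an error.
if "entity_ruler" in nlp.pipe_names:
    ruler = nlp.get_pipe("entity_ruler")
else:
    ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = [
    {"label": "INSTRUMENT", "pattern": "SWOT"},
    {"label": "GENE_TOOL", "pattern": "CRISPR-Cas9"},
    {"label": "PARTICLE", "pattern": "Higgs boson"},
]
ruler.add_patterns(patterns)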
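doc = nlp(texts[2])
for ent in doc.ents:
    # spacy.explain() returns None for custom labels such as INSTRUMENT.
    description = spacy.explain(ent.label_) or "custom label"
    print(f"{ent.text:<20} | {ent.label_:<10} | {description}")
displacy.render(doc, style="ent", jupyter=True)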
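String patterns match the exact surface form only, so "higgs Boson" with different casing would be missed. The EntityRuler also accepts token-level patterns, where each dictionary describes one token. The sketch below is a minimal illustration using a blank pipeline; the `SPECIES` label and the demo sentence are assumptions and do not appear in the original notebook.

```python
import spacy

demo_nlp = spacy.blank("en")
demo_ruler = demo_nlp.add_pipe("entity_ruler")
demo_ruler.add_patterns([
    # LOWER compares against the lowercased token text, so casing is ignored.
    {"label": "PARTICLE", "pattern": [{"LOWER": "higgs"}, {"LOWER": "boson"}]},
    # Hypothetical label for a genus followed by a species name.
    {"label": "SPECIES", "pattern": [{"LOWER": "arabidopsis"}, {"LOWER": "thaliana"}]},
])

demo_doc = demo_nlp("New higgs Boson results complement work on Arabidopsis thaliana.")
matches = [(ent.text, ent.label_) for ent in demo_doc.ents]
for text, label in matches:
    print(f"{text:<22} | {label}")
```

Token patterns cost a little more to write than plain strings, but they tolerate the inconsistent capitalization that is common in abstracts and figure captions.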