Decoding Cancer Research

How AI Unlocks Hidden Patterns in Billion-Dollar Funding

The Billion-Dollar Puzzle

Every year, the National Cancer Institute (NCI) invests billions to conquer cancer: $6.38 billion in 2020 alone, with over $500 million directed to radiology and radiation oncology. Yet for decades, understanding exactly what this money funds required Herculean manual effort. Researchers pored over thousands of grant abstracts, painstakingly sorting them into broad topics like "lung cancer" or "genetics." This approach was slow, subjective, and incapable of detecting nuanced trends 5 .

Funding Breakdown

NCI's $6.38 billion budget in 2020 included over $500 million for radiology and radiation oncology research.

AI Solution

Natural language processing (NLP) transforms how we analyze and understand cancer research funding patterns.

Enter natural language processing (NLP)—a branch of artificial intelligence that teaches machines to "read" and interpret human language. By applying NLP to decades of NCI grant data, scientists are now revealing hidden research priorities, emerging fields, and overlooked gaps. This isn't just academic curiosity; it's about steering the future of cancer breakthroughs 1 5 .

The Language of Science: How AI Reads Grants

Key Concepts: From Words to Wisdom

  • Natural Language Processing (NLP): Algorithms analyze text by identifying patterns in word usage and context. For grant abstracts, NLP converts complex sentences into mathematical vectors, capturing semantic relationships (e.g., "immunotherapy" relates to "T cells" and "tumor response") 3 .
  • Word Embeddings: Words like "biomarker" or "radiopharmaceutical" are mapped as points in multidimensional space. Similar terms cluster together, revealing thematic connections invisible to human readers 1 .
  • Unsupervised Clustering: Algorithms group grants into topics based on semantic similarity. A 2025 study used this to distill 5,874 grants into 60 distinct research themes—like "imaging biomarkers" or "radiopharmaceutical therapy"—without human bias 5 .
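
To make the embedding idea concrete, here is a minimal Python sketch under stated assumptions. The three-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration (real pretrained embeddings have many more dimensions), and averaging word vectors into a document vector is one simple, common choice, not necessarily the method used in the study cited above.

```python
import math

# Toy vectors, invented for illustration only (hypothetical values).
TOY_EMBEDDINGS = {
    "immunotherapy": [0.9, 0.1, 0.0],
    "tcell":         [0.8, 0.2, 0.1],
    "mri":           [0.1, 0.9, 0.2],
    "imaging":       [0.0, 0.8, 0.3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def document_vector(tokens, embeddings):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Semantically related words should score higher than unrelated ones.
related = cosine_similarity(TOY_EMBEDDINGS["immunotherapy"], TOY_EMBEDDINGS["tcell"])
unrelated = cosine_similarity(TOY_EMBEDDINGS["immunotherapy"], TOY_EMBEDDINGS["mri"])
```

With these toy numbers, `related` comes out higher than `unrelated`, which is the "similar terms cluster together" behavior described in the bullets above.
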

NLP Process Flow

1. Data Collection: Extract and clean grant abstracts from NIH RePORTER
2. Semantic Mapping: Convert words to vectors using BioWordVec embeddings
3. Topic Clustering: Group grants into thematic clusters based on similarity
4. Trend Analysis: Track funding changes across clusters over time
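
As a rough illustration of the cleaning step in the first stage, the sketch below lowercases an abstract, strips punctuation, and drops stopwords. The stopword list and the two sample abstracts are made up for this example; a real pipeline would use a much longer stopword list and abstracts downloaded from NIH RePORTER.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption, not the study's list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "we", "will", "is"}

def tokenize(abstract):
    """Lowercase the text, keep alphanumeric runs, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", abstract.lower())
    return [w for w in words if w not in STOPWORDS]

def build_vocabulary(abstracts):
    """Count how often each token appears across all abstracts."""
    counts = Counter()
    for text in abstracts:
        counts.update(tokenize(text))
    return counts

# Invented sample abstracts, for demonstration only.
sample_abstracts = [
    "We will develop a PET tracer for imaging of tumor hypoxia.",
    "Synthesis of a novel PET tracer, and validation in the clinic.",
]
vocabulary = build_vocabulary(sample_abstracts)
```

The size of `vocabulary` is the number of unique words left after cleaning, the same kind of count as the "25,000+ unique words" figure reported for the full grant set.
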

Why It Matters:

Traditional categorization relied on manual labels, limiting analysis to ~200 grants every few years 5 . NLP processes all available data, uncovering granular trends:

  • Which topics are surging or declining?
  • How do physics-focused grants differ from biology-driven ones?
  • Where are funding gaps?

The Breakthrough Experiment: Mapping 21 Years of NCI Funding

A landmark 2025 study analyzed every non-education R-type NCI grant awarded to radiation oncology/radiology departments from 2000–2020—totaling $1.9 billion. Here's how they did it 1 5 :

Step-by-Step Methodology:

1. Data Collection
  • 5,874 grants extracted from NIH RePORTER.
  • Abstracts cleaned (stopwords and punctuation removed) and tokenized, yielding 25,000+ unique words.
2. Semantic Mapping
  • Pretrained BioWordVec embeddings converted words to vectors, capturing biomedical context (e.g., "MRI" linked to "imaging" and "diagnosis") 5 .
3. Topic Clustering
  • A sequential algorithm grouped grants into 60 clusters based on vector similarity.
  • Top words (e.g., "PET," "tracer," "synthesis") and representative grants defined each cluster.
4. Validation
  • Human experts labeled clusters; NLP-human agreement matched inter-rater agreement among scientists.
5. Analysis
  • Funding changes per cluster were tracked annually.
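
The article does not spell out the study's sequential clustering algorithm, so the sketch below is only one plausible reading: a simple "leader" scheme that walks through the grant vectors in order, adds each one to the most similar existing cluster if the cosine similarity clears a threshold, and otherwise starts a new cluster. The function name `sequential_cluster`, the `threshold` value, and the two-dimensional vectors are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def sequential_cluster(vectors, threshold):
    """Assign each vector to the most similar cluster, or open a new one."""
    centroids = []  # running mean vector of each cluster
    sizes = []      # number of members in each cluster
    labels = []     # cluster index assigned to each input vector
    for vec in vectors:
        best_idx, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine_similarity(vec, centroid)
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            # Nothing is similar enough: this vector seeds a new cluster.
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the chosen cluster.
            n = sizes[best_idx]
            centroids[best_idx] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best_idx], vec)
            ]
            sizes[best_idx] = n + 1
            labels.append(best_idx)
    return labels

# Invented 2-D "grant vectors": three point one way, two point another.
grant_vectors = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.85, 0.15],
]
labels = sequential_cluster(grant_vectors, threshold=0.9)
```

In a scheme like this the number of clusters is not fixed in advance; it falls out of the threshold and the order of the data, so a target such as 60 clusters would have to be reached by tuning or by a different algorithm.
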

Results That Rewrote Priorities

Top Funding Trends (2000–2020)

Topic                       | Funding Trend       | Annual Change
Imaging biomarkers          | Sharp increase      | +$218,000
Informatics software        | Rapid growth        | +$218,000
Radiopharmaceuticals        | Steady rise         | +$218,000
Cellular stress response    | Significant decline | –$110,000
Imaging hardware technology | Sharp decline       | –$110,000

Dominant Research Axes

Axis        | Physics-Focused       | Biology-Focused
Therapeutic | Radiopharmaceuticals  | Immunotherapy targets
Diagnostic  | MRI hardware advances | Genomic biomarkers

Surprising Insights:

  • Therapeutic/physics clusters (e.g., radiopharmaceuticals) grew 30% faster than diagnostic/biology topics.
  • Hotspots like AI-driven "informatics software" emerged post-2015, while hardware-focused grants dwindled 1 5 .
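
One common way to turn a cluster's yearly funding into a single "annual change" figure is the slope of a least-squares line through the yearly totals. Whether the study computed its numbers this way is an assumption here, and the funding values below are invented purely to show the arithmetic.

```python
def annual_change(years, funding):
    """Least-squares slope of funding against year, in dollars per year."""
    n = len(years)
    if n < 2 or n != len(funding):
        raise ValueError("need at least two (year, funding) pairs of equal length")
    mean_year = sum(years) / n
    mean_funding = sum(funding) / n
    numerator = sum(
        (y - mean_year) * (f - mean_funding) for y, f in zip(years, funding)
    )
    denominator = sum((y - mean_year) ** 2 for y in years)
    if denominator == 0:
        raise ValueError("years must not all be identical")
    return numerator / denominator

# Invented yearly totals for one hypothetical cluster.
years = [2000, 2001, 2002, 2003, 2004]
funding = [1_000_000, 1_200_000, 1_400_000, 1_600_000, 1_800_000]
slope = annual_change(years, funding)
```

A positive slope marks a growing topic and a negative slope a shrinking one, which is how a rising or declining cluster can be summarized in one number.
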

The Scientist's Toolkit: NLP Essentials for Funding Analysis

Key Tools & Resources

Tool/Resource              | Function                                     | Source/Access
BioWordVec Embeddings      | Pre-trained word vectors for biomedical NLP  | Publicly available datasets
GDC Data Portal            | Genomic/clinical data for AI training        | NCI Genomic Data Commons 4
SEER Program Data          | Cancer surveillance records for NLP testing  | NCI Surveillance Systems
High-Performance Computing | DOE supercomputers for complex NLP tasks     | NCI-DOE Collaborations 3

Beyond Funding: How This Tech Fights Cancer

NLP's impact extends far beyond grant analysis:

Early Diagnosis

Algorithms scan pathology reports to identify precancerous lesions (e.g., cervical cancer) faster than humans 2 .

Drug Discovery

AI models predict T-cell responses to tumors, accelerating immunotherapy design 4 .

Health Equity

NLP extracts social determinants of health (e.g., income, location) from clinical notes to address care disparities 2 .

Challenges Ahead: Bias and the "Black Box"

Data Bias

Models trained on non-diverse data may overlook minority health needs. NCI now mandates representative training sets 2 .

Explainability

Clinicians distrust "black box" AI. New interpretable NLP methods trace decisions (e.g., linking "radiopharmaceutical" grants to drug delivery keywords) 5 .

Conclusion: The Future Is Language-Aware

NLP has transformed funding from a ledger of dollars into a map of scientific priorities. As one researcher notes: "We're no longer just reading grants—we're reading the collective mind of cancer research." The next frontier? Real-time topic tracking to dynamically allocate funds to emerging fields like AI-guided drug repurposing. With cancer's complexity, decoding our own language might be the key to defeating it 1 4 5 .

"Science is a conversation. NLP lets us hear the chorus."

Dr. Mark Nguyen, Lead Author, Semiautomated Extraction of Research Topics 5

References