Title: Rule-based Human Gene Normalization in Biomedical Text with Confidence Estimation
1 Rule-based Human Gene Normalization in Biomedical Text with Confidence Estimation
- William W. Lau, Center for Information Technology
- Kevin G. Becker, National Institute on Aging
- Calvin A. Johnson, Center for Information Technology
- National Institutes of Health
2 Outline
- Motivation
- Algorithm
- Datasets
- Performance
- Analysis
- Demo
3 Motivation
- To build a text mining tool for genetic association studies
- Basic keyword searches are not good enough
- Prerequisite: accurate term mapping to unique genes in a referent data source (gene normalization)
5 Algorithm
- A two-step process
- Step 1
  - Search for gene mentions in the text against the EntrezGene lexicon
  - Goal is high recall
- Step 2
  - Calculate a confidence score based on features of the mention
  - Filter based on the confidence score
  - Goal is high precision
[Figure: example mentions labeled as true positives, a false negative, and a false positive]
6 Step 1A: Generate gene mention lexicon
- Retrieve official and alias gene names and symbols from the EntrezGene database.
- Create new names by abbreviating common chemical names
  - E.g. fatty-acid-Coenzyme A ligase → fatty-acid-CoA ligase
- Generate new terms by interchanging Greek letters between short form and long form.
  - E.g. CHKB → CHK beta
- Generate new symbols by interchanging Roman and Arabic numerals.
  - E.g. GAL4 → GAL IV
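The three expansion rules on this slide can be sketched in a few lines of Python. This is only an illustration: the lookup tables, the function names, and the regular expressions are assumptions made for the example, not the lists or rules used by the authors.

```python
import re

# Illustrative tables; the entries are assumptions, not the authors' lists.
GREEK = {"A": "alpha", "B": "beta", "G": "gamma", "D": "delta"}
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}
CHEMICAL_ABBREVIATIONS = {"Coenzyme A": "CoA"}


def abbreviate_chemicals(name):
    """Replace long chemical names with their common abbreviations."""
    for long_form, short_form in CHEMICAL_ABBREVIATIONS.items():
        name = name.replace(long_form, short_form)
    return name


def expand_greek_suffix(symbol):
    """Turn a trailing Greek-letter code into its long form (CHKB -> CHK beta)."""
    match = re.fullmatch(r"([A-Z]{2,})([ABGD])", symbol)
    if match is None:
        return None
    return f"{match.group(1)} {GREEK[match.group(2)]}"


def arabic_to_roman_suffix(symbol):
    """Swap a trailing Arabic digit for a Roman numeral (GAL4 -> GAL IV)."""
    match = re.fullmatch(r"([A-Za-z]+)([1-5])", symbol)
    if match is None:
        return None
    return f"{match.group(1)} {ROMAN[match.group(2)]}"


def generate_variants(term):
    """Return the original term plus any variants the rules produce."""
    variants = {term, abbreviate_chemicals(term)}
    for rule in (expand_greek_suffix, arabic_to_roman_suffix):
        variant = rule(term)
        if variant is not None:
            variants.add(variant)
    return variants


print(sorted(generate_variants("GAL4")))
```

Rules this loose will over-generate (any symbol ending in B gets a "beta" variant), which fits a first step that aims for high recall and leaves filtering to the confidence score.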
7 Step 1B: Pattern matching for symbols
- Using regular expressions
- Macro matching (is there a match?)
- Micro matching (what matches?)
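One possible reading of the macro/micro distinction is a cheap yes/no test followed by extraction of the exact matching spans. The sketch below follows that reading; the pattern construction (optional hyphen or space between letter and digit runs) and all function names are assumptions for illustration, since the slides do not show the actual expressions.

```python
import re


def build_symbol_pattern(symbol):
    """Compile a regex that allows an optional hyphen or whitespace
    between the letter runs and digit runs of a gene symbol."""
    parts = re.findall(r"[A-Za-z]+|\d+", symbol)
    if not parts:
        raise ValueError("symbol must contain letters or digits")
    body = r"[-\s]?".join(re.escape(part) for part in parts)
    # Lookarounds keep the match from starting or ending inside a longer token.
    return re.compile(rf"(?<![A-Za-z0-9]){body}(?![A-Za-z0-9])")


def macro_match(pattern, text):
    """Macro matching: is there a match at all?"""
    return pattern.search(text) is not None


def micro_match(pattern, text):
    """Micro matching: what exactly matches, and where?"""
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


pattern = build_symbol_pattern("IL1")
text = "Expression of IL-1 and IL 1 was measured."
if macro_match(pattern, text):
    print(micro_match(pattern, text))
```

Running the boolean check first lets a system skip the more expensive span extraction for documents that cannot contain the symbol.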
8 Step 1C: Approximate Name Matching
- Difficulties with gene names
  - Word ordering
    - E.g. Type II IL-1 receptor vs. IL-1 receptor Type II
  - Flexibility in choice of words
    - E.g. Transcription factor 2 (liver) vs. Hepatic transcription factor 2
  - Boundaries
    - E.g. human homologue of mouse Ly9
- Using ABNER [1] to find potential mentions first to speed up the process
[1] B. Settles (2005). ABNER: an open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191-3192.
9 Step 1C: Approximate Name Matching
- Tokenization of the gene name and the text
- Find the longest match with lists of prohibited and allowed missing words
- Increase flexibility with a list of connecting words
E.g. eukaryotic translation initiation factor 4 gamma, 1 (EntrezGene ID 1981)
"... Type 1 of the eukaryotic initiation factor 4-G ..."
$r_m = \frac{\#\,\text{tokens in candidate}}{\#\,\text{tokens to match}} = \frac{6}{7} \approx 86\%$
10 Step 2: Measuring the level of confidence
- The confidence score of a mention is a weighted linear combination of 6 feature scores, plus a boosting factor as an exponent.
- The confidence and feature scores range from 0 to 1.
- Leveraging statistical, morphological, and contextual information available to the system.
[Formula components: coverage, inverse distance, uniqueness, mention type, # of mentions, HUGO status; boosting factor as the exponent]
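A minimal sketch of such a score, assuming the weighted sum is normalized by the total weight and then raised to the boosting factor. The exact formula is not recoverable from the slide, so the normalization, the direction of the exponent, and every name below are assumptions.

```python
FEATURES = (
    "coverage",
    "inverse_distance",
    "uniqueness",
    "mention_type",
    "num_mentions",
    "hugo_status",
)


def confidence_score(feature_scores, weights, boost=1.0):
    """Weighted average of the six feature scores, raised to `boost`.

    All feature scores are expected in [0, 1]. Under the assumed form,
    a boost below 1 raises the score and a boost above 1 lowers it.
    """
    total_weight = sum(weights[name] for name in FEATURES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = sum(weights[name] * feature_scores[name] for name in FEATURES)
    return (combined / total_weight) ** boost


def filter_mentions(scored_mentions, threshold):
    """Keep (mention, score) pairs whose score reaches the threshold."""
    return [(m, s) for m, s in scored_mentions if s >= threshold]


equal_weights = {name: 1.0 for name in FEATURES}
half_scores = {name: 0.5 for name in FEATURES}
print(confidence_score(half_scores, equal_weights, boost=0.5))
```

Raising the threshold passed to `filter_mentions` trades recall for precision, which is the tradeoff the second step is meant to control.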
11 Coverage
- Heuristic: the longer the mention, the higher the confidence
- For names
[Formulas (word based and character based) with terms: length of candidate; length of matching term; occurrence frequency of the least common missing word; minimum occurrence frequency threshold for any missing words not in the allowed list]
12 Coverage
- For symbols
- Scaling factor, s, shifts the burden to name matching when the symbol is enclosed in parentheses. E.g. WBS critical region gene 14 (WBSCR14).
[Formula term: character length of the candidate]
13 Inverse Distance
- A measure of the degree of variation between the candidate and the dictionary term
- Heuristic: the candidate mention and the corresponding gene term in the knowledge base should be fairly similar for a true gene mention.
- For names
- For symbols
[Formula term: edit distance of the candidate to the matching term]
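The edit distance named above is commonly computed as Levenshtein distance. The sketch below implements that and maps it to a score in (0, 1] with 1 / (1 + d). That mapping and the case-insensitive comparison are assumptions for illustration, since the slide's own formula did not survive.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def inverse_distance_score(candidate, term):
    """Assumed mapping: identical strings score 1.0, each edit lowers the score."""
    distance = edit_distance(candidate.lower(), term.lower())
    return 1.0 / (1.0 + distance)


print(inverse_distance_score("IL-1", "IL1"))
```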
14 Uniqueness
- A statistical measure
- Heuristic: frequency of occurrence should be low for a true gene mention.
- For names
- For symbols, scale the score by s.
[Formula terms: # of documents containing the term; size of population (MEDLINE); frequency threshold for the max. # of documents the term can appear in]
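One way to turn these three quantities into a score in [0, 1] is a linear falloff that reaches zero at the frequency threshold. The functional form below is purely an assumption, as are the parameter names. The slide's own formula is not recoverable, and the scaling factor s for symbols is not modeled.

```python
def uniqueness_score(doc_count, population_size, max_fraction):
    """Score a term by how rarely it occurs in the document collection.

    doc_count:       number of documents containing the term
    population_size: number of documents in the collection (e.g. MEDLINE)
    max_fraction:    assumed threshold, the largest fraction of documents
                     a plausible gene term may appear in
    """
    if population_size <= 0 or max_fraction <= 0:
        raise ValueError("population_size and max_fraction must be positive")
    fraction = doc_count / population_size
    # Linear falloff: 1.0 for unseen terms, 0.0 at or above the threshold.
    return max(0.0, 1.0 - fraction / max_fraction)


print(uniqueness_score(doc_count=5, population_size=1000, max_fraction=0.01))
```

A common English word that doubles as a gene alias would exceed the threshold and score 0.0, while a rare symbol would score close to 1.0.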
15 Discrete Features
- Mention type: Is the gene mention an official term?
- # of mentions: Is there more than one unique mention for this gene?
- HUGO status: Has the gene been approved by the committee?
- Boosting factor
  - Components
    - (Counter-)indicating words
    - Chromosome location
    - Accession number
17 BioCreative II challenge [2]
- Gene normalization task
- Data are MEDLINE abstracts that have been annotated manually
  - 90% inter-annotator agreement
- Training set had 281 citations
- Test set had 262 citations
- System results were computed by comparing the system-generated EntrezGene IDs to the gold standard.
- Goal is to achieve the highest F-measure
- 20 international groups participated in this task
$F = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$
[2] A.A. Morgan and L. Hirschman (2007). Overview of BioCreative II Gene Normalization. Proceedings of the Second BioCreative Challenge Evaluation Workshop, pp. 17-27.
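As a quick worked example, the F-measure is the harmonic mean of recall and precision. The function name and the zero-division guard are choices made for this sketch. The input values are the first-step recall and precision reported on the Performance slide.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (balanced F-measure)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# First-step figures from the Performance slide: 85.9% recall, 51.1% precision.
print(round(f_measure(0.859, 0.511), 2))  # prints 0.64
```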
19 Performance
- F-measure was 0.640 after the first step (85.9% recall at 51.1% precision)
- Nelder-Mead optimization to find the optimal feature weights using the training data
- Objective function combines F-measure and area under the P-R curve
20 Performance
[Figure: F-measure and AUC results]
22 F-measure Feature Analysis
Large improvements seen with the coverage, inverse distance, and uniqueness features
An F-score above 0.77 could be achieved
23 AUC Feature Analysis
A different landscape!!
24 Strengths and Improvements
- High recall at reasonable precision
- Customizable tradeoff between precision and recall
- Flexible, intuitive features
- The highest F-measure in BioCreative II GN Task was 0.81
- Why didn't we do as well?
  - Because other groups have
    - Better species identification
    - Better detection. E.g. freac-1 to freac-7
    - Better pruning of the dictionary
    - Better disambiguation using background knowledge
- How good is good enough?
26 Web Interface: http://gaint.cit.nih.gov