
Title: Rule-based Human Gene Normalization in Biomedical Text with Confidence Estimation


1
Rule-based Human Gene Normalization in Biomedical
Text with Confidence Estimation
  • William W. Lau, Center for Information Technology
  • Kevin G. Becker, National Institute on Aging
  • Calvin A. Johnson, Center for Information
    Technology
  • National Institutes of Health

2
Outline
  • Motivation
  • Algorithm
  • Datasets
  • Performance
  • Analysis
  • Demo

3
Motivation
  • To build a text mining tool for genetic
    association studies
  • Basic keyword search is not good enough

Prerequisite: Accurate term mapping to unique
genes in a referent data source (gene
normalization)
4
Outline
  • Motivation
  • Algorithm
  • Datasets
  • Performance
  • Analysis
  • Demo

5
Algorithm
  • A two-step process
  • Step 1
  • Searching for gene mentions in the text against
    the EntrezGene lexicon
  • Goal is high recall
  • Step 2
  • Calculate confidence score based on features of
    the mention
  • Filtering based on confidence score
  • Goal is high precision

[Figure: example mentions labeled as true positives, a false negative, and a false positive]
6
Step 1A: Generate gene mention lexicon
  • Retrieve official and alias gene names and
    symbols from the EntrezGene database.
  • Create new names by abbreviating common chemical
    names
  • E.g. fatty-acid-Coenzyme A ligase →
    fatty-acid-CoA ligase
  • Generate new terms by interchanging Greek
    letters between short form and long form.
  • E.g. CHKB → CHK beta
  • Generate new symbols by interchanging Roman and
    Arabic numerals.
  • E.g. GAL4 → GAL IV
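
The three expansion rules on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the lookup tables are small hypothetical subsets, and the function names (`abbreviate_chemicals`, `greek_long_form`, `roman_form`, `variants`) are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
GREEK = {"A": "alpha", "B": "beta", "G": "gamma", "D": "delta"}
ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV", 5: "V",
         6: "VI", 7: "VII", 8: "VIII", 9: "IX", 10: "X"}


def abbreviate_chemicals(name):
    """Abbreviate a common chemical name inside a gene name."""
    return name.replace("Coenzyme A", "CoA")


def greek_long_form(symbol):
    """Expand a trailing Greek short form, e.g. CHKB -> 'CHK beta'."""
    m = re.fullmatch(r"([A-Z0-9]{2,})([A-Z])", symbol)
    if m and m.group(2) in GREEK:
        return f"{m.group(1)} {GREEK[m.group(2)]}"
    return None


def roman_form(symbol):
    """Swap a trailing Arabic numeral for a Roman one, e.g. GAL4 -> 'GAL IV'."""
    m = re.fullmatch(r"([A-Za-z]+)(\d+)", symbol)
    if m and int(m.group(2)) in ROMAN:
        return f"{m.group(1)} {ROMAN[int(m.group(2))]}"
    return None


def variants(term):
    """Return the term together with every variant the rules produce."""
    out = {term, abbreviate_chemicals(term)}
    for rule in (greek_long_form, roman_form):
        variant = rule(term)
        if variant:
            out.add(variant)
    return out


print(variants("CHKB"))
print(variants("GAL4"))
print(variants("fatty-acid-Coenzyme A ligase"))
```

Applying every rule blindly over-generates (many symbols end in a letter that is not a Greek short form), which fits the stated goal of Step 1: favor recall and leave filtering to Step 2.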

7
Step 1B: Pattern matching for symbols
  • Using regular expressions
  • Macro matching (is there a match?)
  • Micro matching (what matches?)
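
One way to read the macro/micro distinction is as a cheap yes/no test over the whole text followed by a pass that recovers which symbols matched. The sketch below assumes that reading; the lexicon entries carry placeholder IDs, and `macro_match` and `micro_match` are names invented for this example.

```python
import re

# Hypothetical lexicon: symbol -> placeholder identifier (not real EntrezGene IDs).
LEXICON = {"CHKB": "id-1", "GAL4": "id-2", "WBSCR14": "id-3"}

# One alternation over all symbols, longest first so longer symbols win.
_alternation = "|".join(
    re.escape(s) for s in sorted(LEXICON, key=len, reverse=True)
)
# Lookarounds keep a symbol from matching inside a longer alphanumeric token.
SYMBOL_RE = re.compile(rf"(?<![A-Za-z0-9])(?:{_alternation})(?![A-Za-z0-9])")


def macro_match(text):
    """Is there any lexicon symbol in the text at all?"""
    return SYMBOL_RE.search(text) is not None


def micro_match(text):
    """Which symbols matched, at what offset, and what do they map to?"""
    return [(m.group(0), m.start(), LEXICON[m.group(0)])
            for m in SYMBOL_RE.finditer(text)]


sentence = "Expression of CHKB and GAL4 was reduced."
if macro_match(sentence):
    for symbol, offset, gene_id in micro_match(sentence):
        print(symbol, offset, gene_id)
```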

8
Step 1C: Approximate Name Matching
  • Difficulties with gene names
  • Word ordering
  • E.g. Type II IL-1 receptor vs. IL-1 receptor Type
    II
  • Flexibility in choice of words
  • E.g. Transcription factor 2 (liver) vs. Hepatic
    transcription factor 2
  • Boundaries
  • E.g. human homologue of mouse Ly9
  • Using ABNER [1] to find potential mentions first to
    speed up the process

[1] B. Settles (2005). ABNER: an open source tool
for automatically tagging genes, proteins, and
other entity names in text. Bioinformatics,
21(14):3191-3192.
9
Step 1C: Approximate Name Matching
  • Tokenization of the gene name and the text
  • Find the longest match with lists of prohibited
    and allowed missing words
  • Increase flexibility with a list of connecting
    words

E.g. eukaryotic translation initiation factor 4
gamma, 1 (EntrezGene ID 1981)
"... Type 1 of the eukaryotic initiation factor 4-G
..."
# tokens in candidate: 6
# tokens to match: 7
rm = 6/7 ≈ 86%
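
The token counts in this example can be reproduced with a simple order-insensitive token overlap. This sketch covers only that one idea: it leaves out the prohibited, allowed-missing, and connecting word lists from the slide, and the `EQUIVALENT` table and function names are assumptions made for illustration.

```python
import re

# Hypothetical equivalence table mapping Greek short forms to long forms.
EQUIVALENT = {"g": "gamma"}


def tokenize(text):
    """Lowercase, split into alphabetic and numeric tokens, normalize Greek."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [EQUIVALENT.get(t, t) for t in tokens]


def match_ratio(gene_name, passage):
    """Return (matched tokens, tokens to match, ratio), ignoring word order."""
    name_tokens = tokenize(gene_name)
    passage_tokens = set(tokenize(passage))
    matched = [t for t in name_tokens if t in passage_tokens]
    return len(matched), len(name_tokens), len(matched) / len(name_tokens)


name = "eukaryotic translation initiation factor 4 gamma, 1"
passage = "Type 1 of the eukaryotic initiation factor 4-G"
matched, total, ratio = match_ratio(name, passage)
print(matched, total, round(ratio * 100))
```

Here the only unmatched token is "translation", so 6 of the 7 name tokens are found, in line with the figures on the slide.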



10
Step 2: Measuring the level of confidence
  • The confidence score of a mention is a weighted
    linear combination of 6 feature scores, plus a
    boosting factor as an exponent.
  • The confidence and feature scores range from 0 to
    1.
  • Leveraging statistical, morphological, and
    contextual information available to the system.

[Equation labels: boosting factor, inverse distance, mention type, coverage, uniqueness, # of mentions, HUGO status]
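
The exact equation did not survive in this transcript, so the following is only one plausible reading of "a weighted linear combination of 6 feature scores, plus a boosting factor as an exponent". It assumes the weights are normalized to sum to 1 and that the boosting factor is a positive exponent applied to the combined score; the feature keys and the `confidence` function are hypothetical names.

```python
FEATURES = ("coverage", "inverse_distance", "uniqueness",
            "mention_type", "n_mentions", "hugo_status")


def confidence(scores, weights, boost=1.0):
    """Weighted mean of six feature scores in [0, 1], raised to `boost`.

    Assumption: boost < 1 raises the score, boost > 1 lowers it, and
    boost == 1 leaves the linear combination unchanged.
    """
    total_weight = sum(weights[name] for name in FEATURES)
    base = sum(weights[name] * scores[name] for name in FEATURES) / total_weight
    return base ** boost


equal_weights = {name: 1.0 for name in FEATURES}
scores = {name: 0.25 for name in FEATURES}
print(confidence(scores, equal_weights))             # plain combination
print(confidence(scores, equal_weights, boost=0.5))  # boosted upward
```

Because the base stays in [0, 1] and the exponent is positive, the result also stays in [0, 1], which matches the range stated on the slide.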
11
Coverage
  • Heuristic: the longer the mention, the higher the
    confidence
  • For names

[Equation not recovered; it had word-based and character-based variants. Annotated terms:]
  • Length of candidate, length of matching term
  • Occurrence frequency of the least common missing
    word
  • Minimum occurrence frequency threshold for any
    missing words not in the allowed list
12
Coverage
  • For symbols
  • Scaling factor, s, shifts the burden to name
    matching when the symbol is enclosed. E.g. WBS
    critical region gene 14 (WBSCR14).

[Equation not recovered; annotated term: character length of the candidate]
13
Inverse Distance
  • A measure of the degree of variation between the
    candidate and the dictionary term
  • Heuristic: the candidate mention and the
    corresponding gene term in the knowledge base
    should be fairly similar for a true gene mention.
  • For names
  • For symbols

[Equations not recovered; annotated term: edit distance of the candidate to the matching term]
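
Since the slide's formulas are missing, the sketch below only shows how an edit distance can be turned into a similarity score in [0, 1]. The normalization by the longer string's length is an assumption, not the authors' formula, and `inverse_distance` is a name chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def inverse_distance(candidate, term):
    """Assumed normalization: 1 for identical strings, 0 for fully different."""
    longest = max(len(candidate), len(term))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(candidate, term) / longest


print(inverse_distance("GAL4", "GAL4"))
print(inverse_distance("CHKB", "CHKA"))
```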
14
Uniqueness
  • A statistical measure
  • Heuristic: frequency of occurrence should be low
    for a true gene mention.
  • For names
  • For symbols, scale the score by s.

[Equation not recovered. Annotated terms:]
  • # of documents containing the term
  • Size of population (MEDLINE)
  • Frequency threshold: the max. # of documents the
    term can appear in
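
The annotated terms suggest a score driven by document frequency relative to a cutoff. The linear form below is an assumption made purely for illustration (the original equation is not recoverable from this transcript), and the parameter names are hypothetical.

```python
def uniqueness(n_docs_with_term, population_size, max_fraction):
    """Assumed linear score: 1 when the term never occurs, falling to 0
    once its document fraction reaches `max_fraction` of the population."""
    if population_size <= 0 or max_fraction <= 0:
        raise ValueError("population_size and max_fraction must be positive")
    fraction = n_docs_with_term / population_size
    if fraction >= max_fraction:
        return 0.0
    return 1.0 - fraction / max_fraction


# Toy numbers, not MEDLINE statistics.
print(uniqueness(0, 1000, 0.1))
print(uniqueness(50, 1000, 0.1))
print(uniqueness(100, 1000, 0.1))
```

Under this assumption a rare term scores close to 1 and a common English word scores 0, in line with the heuristic that true gene mentions should occur infrequently.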
15
Discrete Features
  • Mention type: Is the gene mention an official
    term?
  • # of mentions: Is there more than one unique
    mention for this gene?
  • HUGO status: Has the gene been approved by the
    committee?
  • Boosting factor
  • Components
  • (Counter-)indicating words
  • Chromosome location
  • Accession number

16
Outline
  • Motivation
  • Algorithm
  • Datasets
  • Performance
  • Analysis
  • Demo

17
BioCreative II challenge [2]
  • Gene normalization task
  • Data are MEDLINE abstracts that have been
    annotated manually
  • 90% inter-annotator agreement
  • Training set had 281 citations
  • Test set had 262 citations
  • System results were computed by comparing the
    system-generated EntrezGene IDs to the gold
    standard.
  • Goal is to achieve the highest F-measure
  • 20 international groups participated in this task

F-measure = (2 × Recall × Precision) / (Recall + Precision)
[2] A.A. Morgan and L. Hirschman (2007). Overview
of BioCreative II Gene Normalization. Proceedings
of the Second BioCreative Challenge Evaluation
Workshop, pp. 17-27.
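
The F-measure used for scoring is the harmonic mean of precision and recall. As a quick check, the sketch below applies it to the first-step figures reported in the Performance section (read here as 85.9% recall at 51.1% precision); the function name is chosen for this example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First-step figures reported in the presentation.
print(f_measure(precision=0.511, recall=0.859))
```

The result is about 0.64, consistent with the reported first-step F-measure.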
18
Outline
  • Motivation
  • Algorithm
  • Datasets
  • Performance
  • Analysis
  • Demo

19
Performance
  • F-measure was .640 after the first step (85.9%
    recall at 51.1% precision)
  • Nelder-Mead optimization to find the optimal
    feature weights using the training data
  • Objective function combines F-measure and Area
    under P-R curve

20
Performance
[Figure: F-measure and AUC results]
21
Outline
  • Motivation
  • Algorithm
  • Datasets
  • Performance
  • Analysis
  • Demo

22
F-measure: Feature Analysis
Large improvements seen with coverage, inverse
distance, and uniqueness features
> .77 F-score could be achieved
23
AUC: Feature Analysis
A different landscape!!
24
Strengths and Improvements
  • High recall at reasonable precision
  • Customizable tradeoff between precision and
    recall
  • Flexible, intuitive features
  • The highest F-measure in BioCreative II GN Task
    was .81
  • Why didn't we do as well?
  • Because other groups have:
  • Better species identification
  • Better detection. E.g. freac-1 to freac-7
  • Better pruning of the dictionary
  • Better disambiguation using background knowledge
  • How good is good enough?

25
Outline
  • Motivation
  • Algorithm
  • Datasets
  • Performance
  • Analysis
  • Demo

26
Web Interface: http://gaint.cit.nih.gov
27
Web Interface: http://gaint.cit.nih.gov
28
Web Interface: http://gaint.cit.nih.gov
29
Web Interface: http://gaint.cit.nih.gov