Title: Gleaning Relational Information from Biomedical Text
1. Gleaning Relational Information from Biomedical Text
- Mark Goadrich
- Computer Sciences Department
- University of Wisconsin - Madison
- Joint Work with Jude Shavlik and Louis Oliphant
- CIBM Seminar - Dec 5th 2006
2. Outline
- The Vacation Game
- Formalizing with Logic
- Biomedical Information Extraction
- Evaluating Hypotheses
- Gleaning Logical Rules
- Experiments
- Current Directions
3. The Vacation Game
4. The Vacation Game
- Positive
- Apple
- Feet
- Luggage
- Mushrooms
- Books
- Wallet
- Beekeeper
- Negative
- Pear
- Socks
- Car
- Fungus
- Novel
- Money
- Hive
5. The Vacation Game
- My Secret Rule: the word must have two adjacent letters which are the same letter.
- Found by using inductive logic:
- Positive and Negative Examples
- Formulating and Eliminating Hypotheses
- Evaluating Success and Failure
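The secret rule is easy to check mechanically. A minimal Python sketch (the function name `has_adjacent_double` is ours, not from the talk) separates the game's positive and negative words:

```python
def has_adjacent_double(word):
    """True if the word contains two adjacent identical letters."""
    return any(a == b for a, b in zip(word, word[1:]))

positives = ["apple", "feet", "luggage", "mushrooms", "books", "wallet", "beekeeper"]
negatives = ["pear", "socks", "car", "fungus", "novel", "money", "hive"]

# Every positive word has a doubled letter (pp, ee, gg, ...);
# no negative word does.
assert all(has_adjacent_double(w) for w in positives)
assert not any(has_adjacent_double(w) for w in negatives)
```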
6. Inductive Logic Programming
- Machine Learning
- Classify data into categories
- Divide data into train and test sets
- Generate hypotheses on train set and then measure performance on test set
- In ILP, data are Objects
- person, block, molecule, word, phrase, ...
- and Relations between them
- grandfather, has_bond, is_member, ...
7. Formalizing with Logic
[figure: the word "apple" represented as object w2169, built from the letters a-z]
8. Formalizing with Logic
- word(w2169). letter(w2169_1).
- has_letter(w2169, w2169_2).
- has_letter(w2169, w2169_3).
- next(w2169_2, w2169_3).
- letter_value(w2169_2, p).
- letter_value(w2169_3, p).
- pos(X) :- has_letter(X, A), has_letter(X, B), next(A, B), letter_value(A, C), letter_value(B, C).
9. Biomedical Information Extraction
image courtesy of SEER Cancer Training Site
10. Biomedical Information Extraction
- http://www.geneontology.org
11. Biomedical Information Extraction
- NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
- ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
- Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.
12. Biomedical Information Extraction
- The dog running down the street tackled and bit
my little sister.
13. Biomedical Information Extraction
- NPL3 encodes a nuclear protein with
14. MedDict Background Knowledge
- http://cancerweb.ncl.ac.uk/omd/
15. MeSH Background Knowledge
- http://www.nlm.nih.gov/mesh/MBrowser.html
16. GO Background Knowledge
- http://www.geneontology.org
17. Some Prolog Predicates
- Biomedical Predicates
- phrase_contains_medDict_term(Phrase, Word, WordText)
- phrase_contains_mesh_term(Phrase, Word, WordText)
- phrase_contains_mesh_disease(Phrase, Word, WordText)
- phrase_contains_go_term(Phrase, Word, WordText)
- Lexical Predicates
- internal_caps(Word), alphanumeric(Word)
- Look-ahead Phrase Predicates
- few_POS_in_phrase(Phrase, POS)
- phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
- phrase_contains_some_marked_up_arg(Phrase, Arg, Word, Fold)
- Relative Location of Phrases
- protein_before_location(ExampleID)
- word_pair_in_between_target_phrases(ExampleID, W1, W2)
18. Still More Predicates
- High-scoring words in protein phrases
- bifunction, repress, pmr1, ...
- High-scoring words in location phrases
- golgi, cytoplasm, er
- High-scoring words BETWEEN protein and location phrases
- across, cofractionate, inside, ...
19. Biomedical Information Extraction
- Given: Medical Journal abstracts tagged with biological relations
- Do: Construct system to extract related phrases from unseen text
- Our Gleaner Approach
- Develop fast ensemble algorithms focused on recall and precision evaluation
20. Using Modes to Chain Relations
[figure: modes chain relations from Word to Phrase to Sentence]
21. Growing Rules From Seed
- NPL3 encodes a nuclear protein with ...
- prot_loc(ab1392078_sen7_ph0, ab1392078_sen7_ph2, ab1392078_sen7).
- phrase_contains_novelword(ab1392078_sen7_ph0, ab1392078_sen7_ph0_w0).
- phrase_next(ab1392078_sen7_ph0, ab1392078_sen7_ph1).
- noun_phrase(ab1392078_sen7_ph2).
- word_child(ab1392078_sen7_ph2, ab9018277_sen5_ph11_w3).
- avg_length_sentence(ab1392078_sen7).
22. Growing Rules From Seed
- prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
23. Rule Evaluation
- Prediction vs Actual
- Positive or Negative, True or False
- Focus on positive examples
- Recall = TP / (TP + FN)
- Precision = TP / (TP + FP)
- F1 Score = (2 x Precision x Recall) / (Precision + Recall)
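These three metrics are straightforward to compute from the counts of true positives, false positives, and false negatives; a minimal Python sketch (function names are ours):

```python
def recall(tp, fn):
    # Fraction of actual positives the rule recovers.
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    # Fraction of predicted positives that are correct.
    return tp / (tp + fp) if tp + fp else 0.0

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, a rule with precision 0.51 and recall 0.15 gets F1 = 2(0.51)(0.15)/(0.51+0.15) ≈ 0.23, matching the rule scores on the following slides.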
24. Protein Localization Rule 1
- prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
- 0.15 Recall, 0.51 Precision, 0.23 F1 Score
25. Protein Localization Rule 2
- prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_marked_up_arg2(Location,C),
    phrase_contains_some_internal_cap_word(Protein,_),
    word_previous(C,_).
- 0.86 Recall, 0.12 Precision, 0.21 F1 Score
26. Precision-Focused Search
27. Recall-Focused Search
28. F1-Focused Search
29. Aleph - Learning
- Aleph learns theories of rules (Srinivasan, v4, 2003)
- Pick positive seed example
- Use heuristic search to find best rule
- Pick new seed from uncovered positives and repeat until threshold of positives covered
- Learning theories is time-consuming
- Can we reduce time with ensembles?
30. Gleaner
- Definition of Gleaner
- One who gathers grain left behind by reapers
- Key Ideas of Gleaner
- Use Aleph as underlying ILP rule engine
- Search rule space with Rapid Random Restart
- Keep wide range of rules usually discarded
- Create separate theories for diverse recall
31. Gleaner - Learning
- Create B Bins
- Generate Clauses
- Record Best per Bin
[figure: generated clauses plotted in precision-recall space, with the best clause kept per recall bin]
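The bin-keeping step can be sketched in a few lines of Python. This is a simplified stand-in: we assume B equal-width recall bins and judge "best" by precision, whereas Gleaner's actual scoring heuristic may differ.

```python
def glean(clauses, B=20):
    """Keep the best clause seen in each of B recall bins.

    `clauses` is an iterable of (recall, precision, clause) triples
    produced during rule search. "Best" here means highest precision,
    a simplifying stand-in for Gleaner's real heuristic.
    """
    bins = [None] * B
    for rec, prec, clause in clauses:
        b = min(int(rec * B), B - 1)  # which recall bin this clause falls in
        if bins[b] is None or prec > bins[b][1]:
            bins[b] = (rec, prec, clause)
    return bins
```

Because every clause considered during search is a candidate for some bin, rules that a single-objective search would discard are retained across the whole recall range.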
32. Gleaner - Learning
[figure: recall bins recorded separately for each of Seed 1, Seed 2, Seed 3, ..., Seed K]
33. Gleaner - Ensemble
- Rules from bin 5 are applied to each example
- e.g. pos1 prot_loc(): score 12
34. Gleaner - Ensemble
- Examples ranked by Score:

    Example                Score
    pos3    prot_loc()      55
    neg28   prot_loc()      52
    pos2    prot_loc()      47
    ...
    neg4    prot_loc()      18
    neg475  prot_loc()      17
    pos9    prot_loc()      17
    neg15   prot_loc()      16
    ...
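The ensemble scoring above can be sketched as follows: each example's score is the number of rules in the chosen bin that match it, and examples are ranked by that score. The helper name `rank_examples` and the rule representation (plain predicates) are our simplifications.

```python
def rank_examples(rules, examples):
    """Score each example by how many rules match it; highest first.

    `rules` is a list of boolean functions of an example; the
    match count plays the role of the bin's ensemble vote.
    """
    scored = [(sum(1 for r in rules if r(ex)), ex) for ex in examples]
    scored.sort(key=lambda pair: -pair[0])
    return scored
```

Thresholding this ranked list at different scores then traces out a recall-precision curve for the bin.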
35. Gleaner - Overlap
- For each bin, take the topmost curve
36. How to use Gleaner
- Generate Test Curve
- User Selects Recall Bin
- Return Classifications Ordered By Their Score
[figure: test recall-precision curve; e.g. at Recall 0.50, Precision 0.70]
37. Aleph Ensembles
- We compare to ensembles of theories
- Algorithm (Dutra et al., ILP 2002)
- Use K different initial seeds
- Learn K theories containing C rules
- Rank examples by the number of theories
- Need to balance C for high performance
- Small C leads to low recall
- Large C leads to converging theories
38. Evaluation Metrics
- Area Under Recall-Precision Curve (AURPC)
- All curves standardized to cover full recall range
- Averaged AURPC over 5 folds
- Number of clauses considered
- Rough estimate of time
[figure: recall-precision axes from 0 to 1.0]
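As a rough sketch, AURPC over a standardized curve can be approximated with trapezoids. Note this is only an approximation we supply for illustration: correct interpolation between recall-precision points is nonlinear, unlike in ROC space.

```python
def aurpc(points):
    """Approximate area under a recall-precision curve.

    `points` is a list of (recall, precision) pairs covering the full
    recall range. Simple trapezoidal rule; proper interpolation
    between PR points is nonlinear, so treat this as an estimate.
    """
    pts = sorted(points)  # order by recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

Averaging this area over the five cross-validation folds gives a single number per system, which is how the curves on the following slides are compared.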
39. YPD Protein Localization
- Hand-labeled dataset (Ray & Craven '01)
- 7,245 sentences from 871 abstracts
- Examples are phrase-phrase combinations
- 1,810 positive, 279,154 negative
- 1.6 GB of background knowledge
- Structural, Statistical, Lexical and Ontological
- In total, 200 distinct background predicates
40. Experimental Methodology
- Performed five-fold cross-validation
- Variation of parameters
- Gleaner (20 recall bins)
- seeds: 25, 50, 75, 100
- clauses: 1K, 10K, 25K, 50K, 100K, 250K, 500K
- Ensembles (0.75 minacc, 1K and 35K nodes)
- theories: 10, 25, 50, 75, 100
- clauses per theory: 1, 5, 10, 15, 20, 25, 50
41. PR Curves - 100,000 Clauses
42. PR Curves - 1,000,000 Clauses
43. Protein Localization Results
44. Genetic Disorder Results
45. Current Directions
- Learn diverse rules across seeds
- Calculate probabilistic scores for examples
- Directed Rapid Random Restarts
- Cache rule information to speed scoring
- Transfer learning across seeds
- Explore Active Learning within ILP
46. Take-Home Message
- Biology, Gleaner and ILP
- Challenging problems in biology can be naturally formulated for Inductive Logic Programming
- Many rules are constructed and evaluated in ILP hypothesis search
- Gleaner makes use of those rules that are not the highest scoring ones for improved speed and performance
47. Acknowledgements
- USA DARPA Grant F30602-01-2-0571
- USA Air Force Grant F30602-01-2-0571
- USA NLM Grant 5T15LM007359-02
- USA NLM Grant 1R01LM07050-01
- UW Condor Group
- David Page, Vitor Santos Costa, Ines Dutra,
Soumya Ray, Marios Skounakis, Mark Craven, Burr
Settles, Jesse Davis, Sarah Cunningham, David
Haight, Ameet Soni