Title: Gleaning Relational Information from Biomedical Text
1. Gleaning Relational Information from Biomedical Text
- Mark Goadrich
- Computer Sciences Department
- University of Wisconsin - Madison
- Joint Work with Jude Shavlik and Louis Oliphant
- CIBM Seminar - Dec 5th 2006
2. Outline
- The Vacation Game
- Formalizing with Logic
- Biomedical Information Extraction
- Evaluating Hypotheses
- Gleaning Logical Rules
- Experiments
- Current Directions
3. The Vacation Game
4. The Vacation Game
- Positive
- Apple
- Feet
- Luggage
- Mushrooms
- Books
- Wallet
- Beekeeper
- Negative
- Pear
- Socks
- Car
- Fungus
- Novel
- Money
- Hive
5. The Vacation Game
- My Secret Rule: the word must have two adjacent letters which are the same letter.
- Found by using inductive logic:
- Positive and Negative Examples
- Formulating and Eliminating Hypotheses
- Evaluating Success and Failure
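The secret rule is easy to check mechanically. A minimal Python sketch (the function name `has_adjacent_double` is ours, not from the talk) separates the game's positive and negative words:

```python
def has_adjacent_double(word):
    """True if the word contains two adjacent identical letters."""
    return any(a == b for a, b in zip(word, word[1:]))

positives = ["apple", "feet", "luggage", "mushrooms", "books", "wallet", "beekeeper"]
negatives = ["pear", "socks", "car", "fungus", "novel", "money", "hive"]

# Every positive word has a doubled letter (pp, ee, gg, ...);
# no negative word does.
assert all(has_adjacent_double(w) for w in positives)
assert not any(has_adjacent_double(w) for w in negatives)
```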
6. Inductive Logic Programming
- Machine Learning
- Classify data into categories
- Divide data into train and test sets
- Generate hypotheses on train set and then measure performance on test set
- In ILP, data are Objects
- person, block, molecule, word, phrase, ...
- and Relations between them
- grandfather, has_bond, is_member, ...
7. Formalizing with Logic
[figure: the word "apple" represented as object w2169, built from the letters a-z]
8. Formalizing with Logic
- word(w2169). letter(w2169_1).
- has_letter(w2169, w2169_2).
- has_letter(w2169, w2169_3).
- next(w2169_2, w2169_3).
- letter_value(w2169_2, p).
- letter_value(w2169_3, p).
- pos(X) :- has_letter(X, A), has_letter(X, B), next(A, B), letter_value(A, C), letter_value(B, C).
9. Biomedical Information Extraction
image courtesy of SEER Cancer Training Site
10. Biomedical Information Extraction
- http://www.geneontology.org
11. Biomedical Information Extraction
- NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
- ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
- Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.
12. Biomedical Information Extraction
- The dog running down the street tackled and bit
my little sister.
13. Biomedical Information Extraction
- NPL3 encodes a nuclear protein with
14. MedDict Background Knowledge
- http://cancerweb.ncl.ac.uk/omd/
15. MeSH Background Knowledge
- http://www.nlm.nih.gov/mesh/MBrowser.html
16. GO Background Knowledge
- http://www.geneontology.org
17. Some Prolog Predicates
- Biomedical Predicates
- phrase_contains_medDict_term(Phrase, Word, WordText)
- phrase_contains_mesh_term(Phrase, Word, WordText)
- phrase_contains_mesh_disease(Phrase, Word, WordText)
- phrase_contains_go_term(Phrase, Word, WordText)
- Lexical Predicates
- internal_caps(Word), alphanumeric(Word)
- Look-ahead Phrase Predicates
- few_POS_in_phrase(Phrase, POS)
- phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
- phrase_contains_some_marked_up_arg(Phrase, Arg, Word, Fold)
- Relative Location of Phrases
- protein_before_location(ExampleID)
- word_pair_in_between_target_phrases(ExampleID, W1, W2)
18. Still More Predicates
- High-scoring words in protein phrases
- bifunction, repress, pmr1, ...
- High-scoring words in location phrases
- golgi, cytoplasm, er
- High-scoring words BETWEEN protein and location phrases
- across, cofractionate, inside, ...
19. Biomedical Information Extraction
- Given: Medical Journal abstracts tagged with biological relations
- Do: Construct system to extract related phrases from unseen text
- Our Gleaner Approach
- Develop fast ensemble algorithms focused on recall and precision evaluation
20. Using Modes to Chain Relations
[figure: modes chain relations from Word to Phrase to Sentence]
21. Growing Rules From Seed
- NPL3 encodes a nuclear protein with ...
- prot_loc(ab1392078_sen7_ph0, ab1392078_sen7_ph2, ab1392078_sen7).
- phrase_contains_novelword(ab1392078_sen7_ph0, ab1392078_sen7_ph0_w0).
- phrase_next(ab1392078_sen7_ph0, ab1392078_sen7_ph1).
- noun_phrase(ab1392078_sen7_ph2).
- word_child(ab1392078_sen7_ph2, ab9018277_sen5_ph11_w3).
- avg_length_sentence(ab1392078_sen7).
22. Growing Rules From Seed
- prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
23. Rule Evaluation
- Prediction vs Actual
- Positive or Negative, True or False
- Focus on positive examples
- Recall = TP / (TP + FN)
- Precision = TP / (TP + FP)
- F1 Score = (2 x Precision x Recall) / (Precision + Recall)
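These three metrics are straightforward to compute from the counts of true positives, false positives, and false negatives; a minimal Python sketch (function names are ours):

```python
def recall(tp, fn):
    # Fraction of actual positives the rule recovers.
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    # Fraction of predicted positives that are correct.
    return tp / (tp + fp) if tp + fp else 0.0

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, a rule with precision 0.51 and recall 0.15 gets F1 = 2(0.51)(0.15)/(0.51+0.15) ≈ 0.23, matching the rule scores on the following slides.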
24. Protein Localization Rule 1
- prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
- 0.15 Recall, 0.51 Precision, 0.23 F1 Score
25. Protein Localization Rule 2
- prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_marked_up_arg2(Location,C),
    phrase_contains_some_internal_cap_word(Protein,_),
    word_previous(C,_).
- 0.86 Recall, 0.12 Precision, 0.21 F1 Score
26. Precision-Focused Search
27. Recall-Focused Search
28. F1-Focused Search
29. Aleph - Learning
- Aleph learns theories of rules (Srinivasan, v4, 2003)
- Pick positive seed example
- Use heuristic search to find best rule
- Pick new seed from uncovered positives and repeat until threshold of positives covered
- Learning theories is time-consuming
- Can we reduce time with ensembles?
30. Gleaner
- Definition of Gleaner
- One who gathers grain left behind by reapers
- Key Ideas of Gleaner
- Use Aleph as underlying ILP rule engine
- Search rule space with Rapid Random Restart
- Keep wide range of rules usually discarded
- Create separate theories for diverse recall
31. Gleaner - Learning
- Create B Bins
- Generate Clauses
- Record Best per Bin
[figure: generated clauses plotted in precision-recall space, with the best clause kept per recall bin]
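The bin-keeping step can be sketched in a few lines of Python. This is a simplified stand-in: we assume B equal-width recall bins and judge "best" by precision, whereas Gleaner's actual scoring heuristic may differ.

```python
def glean(clauses, B=20):
    """Keep the best clause seen in each of B recall bins.

    `clauses` is an iterable of (recall, precision, clause) triples
    produced during rule search. "Best" here means highest precision,
    a simplifying stand-in for Gleaner's real heuristic.
    """
    bins = [None] * B
    for rec, prec, clause in clauses:
        b = min(int(rec * B), B - 1)  # which recall bin this clause falls in
        if bins[b] is None or prec > bins[b][1]:
            bins[b] = (rec, prec, clause)
    return bins
```

Because every clause considered during search is a candidate for some bin, rules that a single-objective search would discard are retained across the whole recall range.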
32. Gleaner - Learning
[figure: recall bins recorded separately for each of Seed 1, Seed 2, Seed 3, ..., Seed K]
33. Gleaner - Ensemble
- Rules from bin 5 are applied to each example
- e.g. pos1 prot_loc(): score 12
34. Gleaner - Ensemble
- Examples ranked by Score:

    Example                Score
    pos3    prot_loc()      55
    neg28   prot_loc()      52
    pos2    prot_loc()      47
    ...
    neg4    prot_loc()      18
    neg475  prot_loc()      17
    pos9    prot_loc()      17
    neg15   prot_loc()      16
    ...
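The ensemble scoring above can be sketched as follows: each example's score is the number of rules in the chosen bin that match it, and examples are ranked by that score. The helper name `rank_examples` and the rule representation (plain predicates) are our simplifications.

```python
def rank_examples(rules, examples):
    """Score each example by how many rules match it; highest first.

    `rules` is a list of boolean functions of an example; the
    match count plays the role of the bin's ensemble vote.
    """
    scored = [(sum(1 for r in rules if r(ex)), ex) for ex in examples]
    scored.sort(key=lambda pair: -pair[0])
    return scored
```

Thresholding this ranked list at different scores then traces out a recall-precision curve for the bin.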
35. Gleaner - Overlap
- For each bin, take the topmost curve
36. How to use Gleaner
- Generate Test Curve
- User Selects Recall Bin
- Return Classifications Ordered By Their Score
[figure: test recall-precision curve; e.g. at Recall 0.50, Precision 0.70]
37. Aleph Ensembles
- We compare to ensembles of theories
- Algorithm (Dutra et al., ILP 2002)
- Use K different initial seeds
- Learn K theories containing C rules
- Rank examples by the number of theories
- Need to balance C for high performance
- Small C leads to low recall
- Large C leads to converging theories
38. Evaluation Metrics
- Area Under Recall-Precision Curve (AURPC)
- All curves standardized to cover full recall range
- Averaged AURPC over 5 folds
- Number of clauses considered
- Rough estimate of time
[figure: recall-precision axes from 0 to 1.0]
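As a rough sketch, AURPC over a standardized curve can be approximated with trapezoids. Note this is only an approximation we supply for illustration: correct interpolation between recall-precision points is nonlinear, unlike in ROC space.

```python
def aurpc(points):
    """Approximate area under a recall-precision curve.

    `points` is a list of (recall, precision) pairs covering the full
    recall range. Simple trapezoidal rule; proper interpolation
    between PR points is nonlinear, so treat this as an estimate.
    """
    pts = sorted(points)  # order by recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

Averaging this area over the five cross-validation folds gives a single number per system, which is how the curves on the following slides are compared.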
39. YPD Protein Localization
- Hand-labeled dataset (Ray & Craven '01)
- 7,245 sentences from 871 abstracts
- Examples are phrase-phrase combinations
- 1,810 positive, 279,154 negative
- 1.6 GB of background knowledge
- Structural, Statistical, Lexical and Ontological
- In total, 200 distinct background predicates
40. Experimental Methodology
- Performed five-fold cross-validation
- Variation of parameters
- Gleaner (20 recall bins)
- seeds: 25, 50, 75, 100
- clauses: 1K, 10K, 25K, 50K, 100K, 250K, 500K
- Ensembles (0.75 minacc, 1K and 35K nodes)
- theories: 10, 25, 50, 75, 100
- clauses per theory: 1, 5, 10, 15, 20, 25, 50
41. PR Curves - 100,000 Clauses
42. PR Curves - 1,000,000 Clauses
43. Protein Localization Results
44. Genetic Disorder Results
45. Current Directions
- Learn diverse rules across seeds
- Calculate probabilistic scores for examples
- Directed Rapid Random Restarts
- Cache rule information to speed scoring
- Transfer learning across seeds
- Explore Active Learning within ILP
46. Take-Home Message
- Biology, Gleaner and ILP
- Challenging problems in biology can be naturally formulated for Inductive Logic Programming
- Many rules are constructed and evaluated in ILP hypothesis search
- Gleaner makes use of those rules that are not the highest scoring ones for improved speed and performance
47. Acknowledgements
- USA DARPA Grant F30602-01-2-0571
- USA Air Force Grant F30602-01-2-0571
- USA NLM Grant 5T15LM007359-02
- USA NLM Grant 1R01LM07050-01
- UW Condor Group
- David Page, Vitor Santos Costa, Ines Dutra,
Soumya Ray, Marios Skounakis, Mark Craven, Burr
Settles, Jesse Davis, Sarah Cunningham, David
Haight, Ameet Soni