Transcript and Presenter's Notes

Title: Gleaning Relational Information from Biomedical Text


1
Gleaning Relational Information from Biomedical
Text
  • Mark Goadrich
  • Computer Sciences Department
  • University of Wisconsin - Madison
  • Joint Work with Jude Shavlik and Louis Oliphant
  • CIBM Seminar - Dec 5th 2006

2
Outline
  • The Vacation Game
  • Formalizing with Logic
  • Biomedical Information Extraction
  • Evaluating Hypotheses
  • Gleaning Logical Rules
  • Experiments
  • Current Directions

3
The Vacation Game
  • Positive
  • Negative

4
The Vacation Game
  • Positive
  • Apple
  • Feet
  • Luggage
  • Mushrooms
  • Books
  • Wallet
  • Beekeeper
  • Negative
  • Pear
  • Socks
  • Car
  • Fungus
  • Novel
  • Money
  • Hive

5
The Vacation Game
  • My Secret Rule
  • The word must contain two adjacent letters that are the same.
  • Found by using inductive logic:
  • Positive and Negative Examples
  • Formulating and Eliminating Hypotheses
  • Evaluating Success and Failure

6
Inductive Logic Programming
  • Machine Learning
  • Classify data into categories
  • Divide data into train and test sets
  • Generate hypotheses on train set and then measure
    performance on test set
  • In ILP, data are Objects
  • person, block, molecule, word, phrase, ...
  • and Relations between them
  • grandfather, has_bond, is_member, ... (see the sketch below)
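As an illustration of objects and relations in this sense, here is a tiny, self-contained Prolog sketch; the family facts and the grandfather rule below are invented purely for illustration and are not part of the biomedical task:

% Objects: people, described by facts.
parent(tom, bob).
parent(bob, ann).
male(tom).
male(bob).

% A relation defined in terms of other relations:
% X is a grandfather of Z if X is a male parent of a parent of Z.
grandfather(X, Z) :-
    male(X),
    parent(X, Y),
    parent(Y, Z).

% ?- grandfather(tom, Who).
% Who = ann.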

7
Formalizing with Logic
[Figure: the word "apple" represented as the logical object w2169, shown above the alphabet a-z]
8
Formalizing with Logic
  • word(w2169). letter(w2169_1).
  • has_letter(w2169, w2169_2).
  • has_letter(w2169, w2169_3).
  • next(w2169_2, w2169_3).
  • letter_value(w2169_2, p).
  • letter_value(w2169_3, p).

[Figure: "apple" (w2169) and the alphabet again, as on the previous slide]

pos(X) :-
    has_letter(X, A), has_letter(X, B),
    next(A, B),
    letter_value(A, C), letter_value(B, C).
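For completeness, the encoding above can be written out as a small program that loads in any Prolog system. The slide lists only the facts for the two adjacent p's; the remaining letter facts below are filled in by analogy, so treat them as an assumed reconstruction:

% The word "apple" as object w2169 with letter objects w2169_1 .. w2169_5.
word(w2169).
has_letter(w2169, w2169_1).
has_letter(w2169, w2169_2).
has_letter(w2169, w2169_3).
has_letter(w2169, w2169_4).
has_letter(w2169, w2169_5).
next(w2169_1, w2169_2).
next(w2169_2, w2169_3).
next(w2169_3, w2169_4).
next(w2169_4, w2169_5).
letter_value(w2169_1, a).
letter_value(w2169_2, p).
letter_value(w2169_3, p).
letter_value(w2169_4, l).
letter_value(w2169_5, e).

% The secret rule: a word is positive if two adjacent letters share a value.
pos(X) :-
    has_letter(X, A), has_letter(X, B),
    next(A, B),
    letter_value(A, C), letter_value(B, C).

% ?- pos(w2169).
% true, via the adjacent p's w2169_2 and w2169_3.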
9
Biomedical Information Extraction
image courtesy of SEER Cancer Training Site
10
Biomedical Information Extraction
  • http://www.geneontology.org

11
Biomedical Information Extraction
  • NPL3 encodes a nuclear protein with an RNA
    recognition motif and similarities to a family of
    proteins involved in RNA metabolism.
  • ykuD was transcribed by SigK RNA polymerase from
    T4 of sporulation.
  • Mutations in the COL3A1 gene have been
    implicated as a cause of type IV Ehlers-Danlos
    syndrome, a disease leading to aortic rupture in
    early adult life.

12
Biomedical Information Extraction
  • The dog running down the street tackled and bit
    my little sister.

13
Biomedical Information Extraction
  • NPL3 encodes a nuclear protein with

14
MedDict Background Knowledge
  • http://cancerweb.ncl.ac.uk/omd/

15
MeSH Background Knowledge
  • http://www.nlm.nih.gov/mesh/MBrowser.html

16
GO Background Knowledge
  • http://www.geneontology.org

17
Some Prolog Predicates
  • Biomedical Predicates
  • phrase_contains_medDict_term(Phrase, Word,
    WordText)
  • phrase_contains_mesh_term(Phrase, Word, WordText)
  • phrase_contains_mesh_disease(Phrase, Word,
    WordText)
  • phrase_contains_go_term(Phrase, Word, WordText)
  • Lexical Predicates
  • internal_caps(Word) alphanumeric(Word)
  • Look-ahead Phrase Predicates
  • few_POS_in_phrase(Phrase, POS)
  • phrase_contains_specific_word_triple(Phrase, W1,
    W2, W3)
  • phrase_contains_some_marked_up_arg(Phrase, Arg,
    Word, Fold)
  • Relative Location of Phrases
  • protein_before_location(ExampleID)
  • word_pair_in_between_target_phrases(ExampleID,
    W1, W2)

18
Still More Predicates
  • High-scoring words in protein phrases
  • bifunction, repress, pmr1, ...
  • High-scoring words in location phrases
  • golgi, cytoplasm, er, ...
  • High-scoring words BETWEEN the protein and location phrases
  • across, cofractionate, inside, ...

19
Biomedical Information Extraction
  • Given: Medical-journal abstracts tagged with biological relations
  • Do: Construct a system to extract related phrases from unseen text
  • Our Gleaner Approach
  • Develop fast ensemble algorithms focused on
    recall and precision evaluation

20
Using Modes to Chain Relations
[Figure: mode declarations chain the object types Sentence, Phrase, and Word]
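In Aleph, this chaining is expressed with mode declarations, which state which predicates may appear in a clause and how their arguments of type sentence, phrase, and word connect as inputs (+), outputs (-), or constants (#). The declarations below are hypothetical examples in that style (they require Aleph to be loaded and are not the actual declarations used in these experiments):

% Head mode: learn clauses for prot_loc/3 over two phrases and a sentence.
:- modeh(1, prot_loc(+phrase, +phrase, +sentence)).

% Body modes: each consumes an object introduced earlier in the clause,
% letting literals chain from Phrase objects down to Word objects.
:- modeb(*, phrase_next(+phrase, -phrase)).
:- modeb(*, word_child(+phrase, -word)).
:- modeb(*, phrase_contains_go_term(+phrase, -word, #text)).
:- modeb(*, internal_caps(+word)).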
21
Growing Rules From Seed
  • NPL3 encodes a nuclear protein with
  • prot_loc(ab1392078_sen7_ph0, ab1392078_sen7_ph2,
    ab1392078_sen7).

phrase_contains_novelword(ab1392078_sen7_ph0, ab1392078_sen7_ph0_w0).
phrase_next(ab1392078_sen7_ph0, ab1392078_sen7_ph1).
noun_phrase(ab1392078_sen7_ph2).
word_child(ab1392078_sen7_ph2, ab9018277_sen5_ph11_w3).
avg_length_sentence(ab1392078_sen7).
22
Growing Rules From Seed
prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_alphanumeric(Protein, E),
    phrase_contains_some_internal_cap_word(Protein, E),
    phrase_next(Protein, _),
    different_phrases(Protein, Location),
    one_POS_in_phrase(Location, noun),
    phrase_contains_some_arg2_10x_word(Location, _),
    phrase_previous(Location, _),
    avg_length_sentence(Sentence).

23
Rule Evaluation
  • Prediction vs Actual
  • Positive or Negative
  • True or False
  • Focus on positive examples
  • Recall
  • Precision

F1 Score
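Written out, these are the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN):

\[
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]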
24
Protein Localization Rule 1
prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_alphanumeric(Protein, E),
    phrase_contains_some_internal_cap_word(Protein, E),
    phrase_next(Protein, _),
    different_phrases(Protein, Location),
    one_POS_in_phrase(Location, noun),
    phrase_contains_some_arg2_10x_word(Location, _),
    phrase_previous(Location, _),
    avg_length_sentence(Sentence).

  • 0.15 Recall, 0.51 Precision, 0.23 F1 Score
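As a quick check, plugging this rule's precision and recall into the F1 formula above reproduces the reported score:

\[
F_1 = \frac{2 \cdot 0.51 \cdot 0.15}{0.51 + 0.15} = \frac{0.153}{0.66} \approx 0.23
\]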

25
Protein Localization Rule 2
prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_marked_up_arg2(Location, C),
    phrase_contains_some_internal_cap_word(Protein, _),
    word_previous(C, _).

  • 0.86 Recall, 0.12 Precision, 0.21 F1 Score

26
Precision-Focused Search
27
Recall-Focused Search
28
F1-Focused Search
29
Aleph - Learning
  • Aleph learns theories of rules (Srinivasan, v4,
    2003)
  • Pick positive seed example
  • Use heuristic search to find best rule
  • Pick new seed from uncovered positives and repeat until a threshold of positives is covered (see the sketch below)
  • Learning theories is time-consuming
  • Can we reduce time with ensembles?
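A minimal sketch of the covering loop described in the bullets above; best_rule/2 and covered_by/2 are hypothetical stand-ins, defined trivially here so the snippet loads and runs, whereas in Aleph the former is the saturation-and-search step and the latter tests coverage against the background knowledge. The stopping threshold on covered positives is omitted:

:- use_module(library(apply)).   % for exclude/3

% learn(+Positives, -Theory): pick a seed, find a rule for it,
% drop the remaining positives that the rule covers, and repeat.
learn([], []).
learn([Seed|Rest], [Rule|Rules]) :-
    best_rule(Seed, Rule),
    exclude(covered_by(Rule), Rest, StillUncovered),
    learn(StillUncovered, Rules).

% Trivial stand-ins so the sketch is self-contained:
% each "rule" covers exactly its own seed example.
best_rule(Seed, rule_for(Seed)).
covered_by(rule_for(Seed), Example) :- Example == Seed.

% ?- learn([p1, p2, p3], Theory).
% Theory = [rule_for(p1), rule_for(p2), rule_for(p3)].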

30
Gleaner
  • Definition of Gleaner
  • One who gathers grain left behind by reapers
  • Key Ideas of Gleaner
  • Use Aleph as underlying ILP rule engine
  • Search rule space with Rapid Random Restart
  • Keep wide range of rules usually discarded
  • Create separate theories for diverse recall

31
Gleaner - Learning
  • Create B Bins
  • Generate Clauses
  • Record Best per Bin

[Figure: generated clauses plotted in recall-precision space; the best clause is recorded in each recall bin]
32
Gleaner - Learning
[Figure: a separate set of recall bins is kept for each seed, Seed 1 through Seed K, along the recall axis]
33
Gleaner - Ensemble
[Figure: the rules saved in bin 5 are applied to each example; e.g. example pos1 prot_loc() receives a score of 12]
34
Gleaner - Ensemble
Examples                Score
pos3    prot_loc()       55
neg28   prot_loc()       52
pos2    prot_loc()       47
...
neg4    prot_loc()       18
neg475  prot_loc()       17
pos9    prot_loc()       17
neg15   prot_loc()       16
...
35
Gleaner - Overlap
  • For each bin, take the topmost curve

36
How to use Gleaner
  • Generate Test Curve
  • User Selects Recall Bin
  • Return Classifications Ordered By Their Score

[Figure: test-set recall-precision curve; at the selected bin, Recall 0.50 and Precision 0.70]
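Under an assumed representation where examples are held as Example-Score pairs (as in the ranked table two slides back), this last step can be sketched as: keep the examples whose score meets the user-chosen threshold and return them ordered by score. This is an illustration of the idea, not Gleaner's actual code:

% classify(+Scored, +Threshold, -Ranked):
% keep examples scoring at least Threshold, highest score first.
classify(Scored, Threshold, Ranked) :-
    findall(Score-Ex, (member(Ex-Score, Scored), Score >= Threshold), Pairs),
    sort(0, @>=, Pairs, Sorted),      % SWI-Prolog sort/4: descending, ties kept
    findall(Ex, member(_-Ex, Sorted), Ranked).

% ?- classify([pos3-55, neg28-52, pos2-47, neg4-18], 40, Ranked).
% Ranked = [pos3, neg28, pos2].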
37
Aleph Ensembles
  • We compare to ensembles of theories
  • Algorithm (Dutra et al., ILP 2002)
  • Use K different initial seeds
  • Learn K theories containing C rules
  • Rank examples by the number of theories that cover them
  • Need to balance C for high performance
  • Small C leads to low recall
  • Large C leads to converging theories

38
Evaluation Metrics
  • Area Under Recall-Precision Curve (AURPC)
  • All curves standardized to cover full recall
    range
  • Averaged AURPC over 5 folds
  • Number of clauses considered
  • Rough estimate of time

[Figure: recall-precision axes, each running from 0 to 1.0]
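Written as a formula, the area under the recall-precision curve is the integral of precision over the full recall range, which is why each curve is first standardized to span recall from 0 to 1:

\[
\mathrm{AURPC} = \int_0^1 \mathrm{Precision}(r)\, dr
\]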
39
YPD Protein Localization
  • Hand-labeled dataset (Ray & Craven '01)
  • 7,245 sentences from 871 abstracts
  • Examples are phrase-phrase combinations
  • 1,810 positive and 279,154 negative examples
  • 1.6 GB of background knowledge
  • Structural, Statistical, Lexical and Ontological
  • In total, 200 distinct background predicates

40
Experimental Methodology
  • Performed five-fold cross-validation
  • Variation of parameters
  • Gleaner (20 recall bins)
  • seeds: 25, 50, 75, 100
  • clauses: 1K, 10K, 25K, 50K, 100K, 250K, 500K
  • Ensembles (0.75 minacc, 1K and 35K nodes)
  • theories: 10, 25, 50, 75, 100
  • clauses per theory: 1, 5, 10, 15, 20, 25, 50

41
PR Curves - 100,000 Clauses
42
PR Curves - 1,000,000 Clauses
43
Protein Localization Results
44
Genetic Disorder Results
45
Current Directions
  • Learn diverse rules across seeds
  • Calculate probabilistic scores for examples
  • Directed Rapid Random Restarts
  • Cache rule information to speed scoring
  • Transfer learning across seeds
  • Explore Active Learning within ILP

46
Take-Home Message
  • Biology, Gleaner and ILP
  • Challenging problems in biology can be naturally
    formulated for Inductive Logic Programming
  • Many rules are constructed and evaluated during ILP hypothesis search
  • Gleaner retains rules that are not the highest-scoring and puts them to use, improving both speed and performance

47
Acknowledgements
  • USA DARPA Grant F30602-01-2-0571
  • USA Air Force Grant F30602-01-2-0571
  • USA NLM Grant 5T15LM007359-02
  • USA NLM Grant 1R01LM07050-01
  • UW Condor Group
  • David Page, Vitor Santos Costa, Ines Dutra,
    Soumya Ray, Marios Skounakis, Mark Craven, Burr
    Settles, Jesse Davis, Sarah Cunningham, David
    Haight, Ameet Soni