Title: Natural Language Processing in Bioinformatics: Uncovering Semantic Relations
1Natural Language Processing in Bioinformatics
Uncovering Semantic Relations
- Barbara Rosario
- SIMS
- UC Berkeley
2Outline of Talk
- Goal: extract semantics from text
- Information and relation extraction
- Protein-protein interactions
3Text Mining
- Text Mining is the discovery by computers of new,
previously unknown information, via automatic
extraction of information from text
5Text Mining
- Text
- Stress is associated with migraines
- Stress can lead to loss of magnesium
- Calcium channel blockers prevent some migraines
- Magnesium is a natural calcium channel blocker
1. Extract semantic entities from text
- Stress
- Migraine
- Magnesium
- Calcium channel blockers
6Text Mining (cont.)
- Text
- Stress is associated with migraines
- Stress can lead to loss of magnesium
- Calcium channel blockers prevent some migraines
- Magnesium is a natural calcium channel blocker
2. Classify relations between entities
7Text Mining (cont.)
- Text
- Stress is associated with migraines
- Stress can lead to loss of magnesium
- Calcium channel blockers prevent some migraines
- Magnesium is a natural calcium channel blocker
3. Do reasoning: find new correlations
(Figure: graph over Stress, Migraine, Magnesium, and Calcium channel blockers, with edges labeled "Prevent" and "Subtype-of (is a)")
8Text Mining (cont.)
- Text
- Stress is associated with migraines
- Stress can lead to loss of magnesium
- Calcium channel blockers prevent some migraines
- Magnesium is a natural calcium channel blocker
4. Do reasoning: infer causality
(Figure: graph over Stress, Migraine, Magnesium, and Calcium channel blockers, with edges labeled "No prevention", "Prevent", and "Subtype-of (is a)")
Deficiency of magnesium → migraine
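The reasoning chain on this slide can be sketched in a few lines of Python. This is only an illustration, not the system described in the talk: the relation names (`is_a`, `prevents`, `may_prevent`, `may_contribute_to`) and the two inference rules are assumptions chosen to mirror the four example sentences.

```python
# Toy relation triples read off the four example sentences (hand-written,
# not produced by any extraction system).
TRIPLES = {
    ("stress", "associated_with", "migraine"),
    ("stress", "leads_to_loss_of", "magnesium"),
    ("calcium channel blocker", "prevents", "migraine"),
    ("magnesium", "is_a", "calcium channel blocker"),
}


def inherit_prevention(triples):
    """Assumed rule: if X is_a Y and Y prevents Z, hypothesize X may_prevent Z."""
    derived = set()
    for x, rel1, y in triples:
        if rel1 != "is_a":
            continue
        for y2, rel2, z in triples:
            if rel2 == "prevents" and y2 == y:
                derived.add((x, "may_prevent", z))
    return derived


def deficiency_hypotheses(triples):
    """Assumed rule: if X may_prevent Z, a deficiency of X may contribute to Z."""
    return {
        (f"deficiency of {x}", "may_contribute_to", z)
        for x, _, z in inherit_prevention(triples)
    }


hypotheses = deficiency_hypotheses(TRIPLES)
print(hypotheses)
```

The output is a candidate hypothesis to be checked, not an established causal claim.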
9My research
Information Extraction
- Stress is associated with migraines
- Stress can lead to loss of magnesium
- Calcium channel blockers prevent some migraines
- Magnesium is a natural calcium channel blocker
10My research
Relation extraction
11Information and relation extraction
- Problems
- Given biomedical text
- Find all the treatments and all the diseases
- Find the relations that hold between them
12Hepatitis Examples
- Cure
- These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
- Prevent
- A two-dose combined hepatitis A and B vaccine would facilitate immunization programs
- Vague
- Effect of interferon on hepatitis B
13Two tasks
- Relationship extraction
- Identify the several semantic relations that can occur between the entities disease and treatment in bioscience text
- Information extraction (IE)
- Related problem: identify such entities
14Outline of IE
- Data and semantic relations
- Quick intro to graphical models
- Models and results
- Features
- Conclusions
15Data and Relations
- MEDLINE, abstracts and titles
- 3662 sentences labeled
- Relevant: 1724
- Irrelevant: 1771
- e.g., "Patients were followed up for 6 months"
- 2 types of entities
- treatment and disease
- 7 relationships between these entities
The labeled data are available at http://biotext.berkeley.edu
16Semantic Relationships
- Cure (810 sentences)
- Intravenous immune globulin for recurrent spontaneous abortion
- Only Disease (616 sentences)
- Social ties and susceptibility to the common cold
- Only Treatment (166 sentences)
- Fluticasone propionate is safe in recommended doses
- Prevent (63 sentences)
- Statins for prevention of stroke
17Semantic Relationships
- Vague (36 sentences)
- Phenylbutazone and leukemia
- Side Effect (29 sentences)
- Malignant mesodermal mixed tumor of the uterus following irradiation
- Does NOT cure (4 sentences)
- Evidence for double resistance to permethrin and malathion in head lice
18Outline of IE
- Data and semantic relations
- Quick intro to graphical models
- Models and results
- Features
- Conclusions
19Graphical Models
- Unifying framework for developing Machine Learning algorithms
- Graph theory plus probability theory
- Widely used
- Error correcting codes
- Systems diagnosis
- Computer vision
- Filtering (Kalman filters)
- Bioinformatics
20(Quick intro to) Graphical Models
- Nodes are random variables
- Edges are annotated with conditional probabilities
- Absence of an edge between nodes implies conditional independence
- Probabilistic database
22Graphical Models
- Define a joint probability distribution
- $P(X_1, \ldots, X_N) = \prod_i P(X_i \mid \mathrm{Par}(X_i))$
- $P(A, B, C, D) = P(A)\,P(D)\,P(B \mid A)\,P(C \mid A, D)$
- Learning
- Given data, estimate $P(A)$, $P(B \mid A)$, $P(D)$, $P(C \mid A, D)$
- Inference: compute conditional probabilities, e.g., $P(A \mid B, D)$
- Inference: probabilistic queries. General inference algorithms (Junction Tree)
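A minimal sketch of the factorization and of an inference query for the four-node example, assuming binary variables. The probability values below are made up for illustration, and the query is answered by brute-force enumeration rather than by the junction tree algorithm named on the slide.

```python
from itertools import product

# Hypothetical conditional probability tables (all variables are binary).
P_A = {True: 0.3, False: 0.7}
P_D = {True: 0.6, False: 0.4}
P_B_GIVEN_A = {True: 0.9, False: 0.2}  # P(B=True | A)
P_C_GIVEN_AD = {                       # P(C=True | A, D)
    (True, True): 0.95, (True, False): 0.7,
    (False, True): 0.4, (False, False): 0.05,
}


def bern(p_true, value):
    """Probability of `value` under a Bernoulli with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d):
    """P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)."""
    return (P_A[a] * P_D[d]
            * bern(P_B_GIVEN_A[a], b)
            * bern(P_C_GIVEN_AD[(a, d)], c))


def posterior_a(b, d):
    """P(A=True | B=b, D=d), obtained by summing the joint over C."""
    num = sum(joint(True, b, c, d) for c in (True, False))
    den = sum(joint(a, b, c, d) for a in (True, False) for c in (True, False))
    return num / den


total = sum(joint(*values) for values in product((True, False), repeat=4))
print(total, posterior_a(True, True))
```

Enumeration is exponential in the number of variables, which is why general-purpose algorithms such as the junction tree are used on larger graphs.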
23Naïve Bayes models
- Simple graphical model
- The $x_i$ depend on $Y$
- Naïve Bayes assumption: all $x_i$ are independent given $Y$
- Currently used for text classification and spam detection
(Figure: node Y with edges to x1, x2, x3)
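As a rough sketch of the Naïve Bayes model for text classification, the following trains on a handful of toy token lists and predicts a label. The training documents and the add-one smoothing are assumptions made for the example; they are not the data or the smoothing used in the talk.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (label, tokens). Returns (log-priors, word counts, vocab)."""
    label_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    for label, tokens in docs:
        word_counts[label].update(tokens)
    vocab = {w for _, tokens in docs for w in tokens}
    log_prior = {y: math.log(n / len(docs)) for y, n in label_counts.items()}
    return log_prior, word_counts, vocab


def predict_nb(model, tokens):
    """Return argmax_y of log P(y) + sum_i log P(x_i | y), add-one smoothed."""
    log_prior, word_counts, vocab = model
    best, best_score = None, -math.inf
    for y in log_prior:
        total = sum(word_counts[y].values())
        score = log_prior[y]
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = y, score
    return best


# Toy training set (hypothetical, loosely based on the example titles).
DOCS = [
    ("cure", ["immune", "globulin", "for", "abortion"]),
    ("cure", ["treatment", "ameliorated", "hepatitis"]),
    ("prevent", ["statins", "for", "prevention", "of", "stroke"]),
    ("prevent", ["vaccine", "prevention", "programs"]),
]
MODEL = train_nb(DOCS)
print(predict_nb(MODEL, ["prevention", "of", "stroke"]))
```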
24Dynamic Graphical Models
- Graphical model composed of repeated segments
- HMMs (Hidden Markov Models)
- POS tagging, speech recognition, IE
(Figure: chain of tag nodes up to tN, each generating a word node up to wN)
26HMMs
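- Joint probability distribution
- $P(t_1, \ldots, t_N, w_1, \ldots, w_N) = P(t_1)\,P(w_1 \mid t_1) \prod_{i=2}^{N} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$
- Estimate $P(t_1)$, $P(t_i \mid t_{i-1})$, $P(w_i \mid t_i)$ from labeled data
- Inference: $P(t_i \mid w_1, w_2, \ldots, w_N)$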
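The inference query on this slide, the posterior over each tag given all the words, can be computed with the forward-backward algorithm. The sketch below assumes a first-order HMM with dictionary-based parameters; the tag names and probability values are invented for the example.

```python
def forward_backward(words, tags, p_start, p_trans, p_emit):
    """Return a list whose i-th entry maps tag t to P(t_i = t | w_1..w_N)."""
    n = len(words)
    # Forward pass: alpha[i][t] = P(w_1..w_i, t_i = t)
    alpha = [{t: p_start[t] * p_emit[t].get(words[0], 0.0) for t in tags}]
    for i in range(1, n):
        alpha.append({
            t: p_emit[t].get(words[i], 0.0)
               * sum(alpha[i - 1][s] * p_trans[s][t] for s in tags)
            for t in tags
        })
    # Backward pass: beta[i][s] = P(w_{i+1}..w_N | t_i = s)
    beta = [dict.fromkeys(tags, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = {
            s: sum(p_trans[s][t] * p_emit[t].get(words[i + 1], 0.0) * beta[i + 1][t]
                   for t in tags)
            for s in tags
        }
    evidence = sum(alpha[-1][t] for t in tags)  # P(w_1..w_N)
    if evidence == 0.0:
        raise ValueError("the word sequence has zero probability under the model")
    return [{t: alpha[i][t] * beta[i][t] / evidence for t in tags} for i in range(n)]


# Hypothetical two-tag model.
TAGS = ["TREAT", "DIS"]
P_START = {"TREAT": 0.6, "DIS": 0.4}
P_TRANS = {"TREAT": {"TREAT": 0.3, "DIS": 0.7}, "DIS": {"TREAT": 0.4, "DIS": 0.6}}
P_EMIT = {"TREAT": {"statins": 0.8, "stroke": 0.2},
          "DIS": {"statins": 0.1, "stroke": 0.9}}

posteriors = forward_backward(["statins", "stroke"], TAGS, P_START, P_TRANS, P_EMIT)
print(posteriors)
```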
27Graphical Models for IE
- Different dependencies between the features and the relation nodes
(Figure: two groups of models, labeled "Dynamic" and "Static")
28Graphical Model
- Relation node
- Semantic relation (cure, prevent, none, ...) expressed in the sentence
- The relation generates the state sequence and the observations
29Graphical Model
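- Markov sequence of states (roles)
- Role nodes
- $\mathrm{Role}_t \in \{\text{treatment}, \text{disease}, \text{none}\}$
(Figure: chain of role nodes $\mathrm{Role}_{t-1}$, $\mathrm{Role}_t$, $\mathrm{Role}_{t+1}$)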
30Graphical Model
- Roles generate multiple observations
- Feature nodes (observed)
- word, POS, MeSH
31Graphical Model
- Inference: find the Relation and the Roles given the observed features
(Figure: graphical model with the nodes to be inferred marked "?")
32Features
- Word
- Part of speech
- Phrase constituent
- Orthographic features
- is number, all letters are capitalized, first letter is capitalized
- Semantic features (MeSH)
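The three orthographic features listed above are simple to compute per token. The following is one possible reading of them; the exact definitions (for instance, whether "TJ-135" counts as all capitalized) are assumptions, since the slide does not spell them out.

```python
def orthographic_features(token):
    """Compute the three orthographic features for a single token."""
    letters = [c for c in token if c.isalpha()]
    return {
        # digits with at most one decimal point, e.g. "135" or "0.5"
        "is_number": token.replace(".", "", 1).isdigit(),
        # every alphabetic character is upper case (non-letters are ignored)
        "all_caps": bool(letters) and all(c.isupper() for c in letters),
        # the first character is an upper-case letter
        "init_cap": token[:1].isupper(),
    }


for tok in ["Statins", "MEDLINE", "TJ-135", "6", "stroke"]:
    print(tok, orthographic_features(tok))
```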
33MeSH
- MeSH Tree Structures
- 1. Anatomy A
- 2. Organisms B
- 3. Diseases C
- 4. Chemicals and Drugs D
- 5. Analytical, Diagnostic and Therapeutic Techniques and Equipment E
- 6. Psychiatry and Psychology F
- 7. Biological Sciences G
- 8. Physical Sciences H
- 9. Anthropology, Education, Sociology and Social Phenomena I
- 10. Technology and Food and Beverages J
- 11. Humanities K
- 12. Information Science L
- 13. Persons M
- 14. Health Care N
- 15. Geographic Locations Z
34MeSH (cont.)
- 1. Anatomy A
- Body Regions A01
- Musculoskeletal System A02
- Digestive System A03
- Respiratory System A04
- Urogenital System A05
- Endocrine System A06
- Cardiovascular System A07
- Nervous System A08
- Sense Organs A09
- Tissues A10
- Cells A11
- Fluids and Secretions A12
- Animal Structures A13
- Stomatognathic System A14
- (..)
- Body Regions A01
- Abdomen A01.047
- Groin A01.047.365
- Inguinal Canal A01.047.412
- Peritoneum A01.047.596
- Umbilicus A01.047.849
- Axilla A01.133
- Back A01.176
- Breast A01.236
- Buttocks A01.258
- Extremities A01.378
- Head A01.456
- Neck A01.598
- (.)
35Use of lexical Hierarchies in NLP
- Big problem in NLP: few words occur a lot, most of them occur very rarely (Zipf's law)
- Difficult to do statistics
- One solution: use lexical hierarchies
- Another example: WordNet
- Statistics on classes of words instead of words
36Mapping Words to MeSH Concepts
- headache pain
- C23.888.592.612.441 G11.561.796.444
- C23.888 G11.561
- Neurologic Manifestations; Nervous System Physiology
- C23 G11
- Pathological Conditions, Signs and Symptoms; Musculoskeletal, Neural, and Ocular Physiology
- headache recurrence
- C23.888.592.612.441 C23.550.291.937
- breast cancer cells
- A01.236 C04 A11
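The generalization step shown above, from a full MeSH tree number to a coarser ancestor, amounts to truncating the dotted code. A small sketch, assuming a word-to-code dictionary built from the examples on this slide (a real system would look codes up in MeSH itself; `MESH_CODES` and the function names are hypothetical):

```python
# Word-to-tree-number pairs taken from the slide examples.
MESH_CODES = {
    "headache": "C23.888.592.612.441",
    "pain": "G11.561.796.444",
    "recurrence": "C23.550.291.937",
    "breast": "A01.236",
    "cancer": "C04",
    "cells": "A11",
}


def truncate_mesh(code, level):
    """Keep the first `level` dot-separated components of a MeSH tree number."""
    return ".".join(code.split(".")[:level])


def map_phrase(words, level):
    """Map each known word to its MeSH code, generalized to the given level."""
    return [truncate_mesh(MESH_CODES[w], level) for w in words if w in MESH_CODES]


print(map_phrase(["headache", "pain"], 2))  # second level of the hierarchy
print(map_phrase(["headache", "pain"], 1))  # top-level categories
```

Truncating to a shallower level trades specificity for more counts per class, which is the point of using the hierarchy against sparse word statistics.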
37Graphical Model
- Joint probability distribution over relation, roles and features nodes
- Parameters estimated with maximum likelihood and absolute discounting smoothing
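Absolute discounting subtracts a fixed amount from every non-zero count and gives the freed probability mass to unseen events. The sketch below shows one common variant that spreads the reserved mass uniformly over unseen vocabulary items; the discount value and this redistribution scheme are assumptions, as the slide does not specify them.

```python
def absolute_discounting(counts, vocab, d=0.5):
    """Smoothed distribution over `vocab` from raw counts, with discount 0 < d < 1."""
    total = sum(counts.get(w, 0) for w in vocab)
    if total == 0:
        raise ValueError("need at least one observation")
    seen = [w for w in vocab if counts.get(w, 0) > 0]
    unseen = [w for w in vocab if counts.get(w, 0) == 0]
    reserved = d * len(seen) / total  # mass removed from the seen events
    probs = {w: (counts[w] - d) / total for w in seen}
    if unseen:
        for w in unseen:
            probs[w] = reserved / len(unseen)
    else:
        # Nothing is unseen: return the reserved mass to the seen events.
        for w in seen:
            probs[w] += reserved / len(seen)
    return probs


# Hypothetical counts of words observed with one role.
VOCAB = ["statins", "vaccine", "interferon", "aspirin"]
probs = absolute_discounting({"statins": 3, "vaccine": 1}, VOCAB, d=0.5)
print(probs)
```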
38Graphical Model
- Inference: find the Relation and the Roles given the observed features
(Figure: graphical model with the nodes to be inferred marked "?")
39Relation extraction
- Results in terms of classification accuracy (with and without irrelevant sentences)
- 2 cases
- Roles given
- Roles hidden (only features)
40Relation classification Results
                               Accuracy
Sentences       Input          Baseline   GM D2
Only rel.       only feat.     46.7       72.6
Only rel.       roles given    46.7       76.6
Rel. + irrel.   only feat.     50.6       74.9
Rel. + irrel.   roles given    50.6       82.0
- Good results for a difficult task
- One of the few systems to tackle several DIFFERENT relations between the same types of entities; it thus differs from the problem statement of other work on relations
41Role Extraction Results
- Junction tree algorithm
- $F\text{-measure} = \dfrac{2 \cdot \mathrm{Prec} \cdot \mathrm{Recall}}{\mathrm{Prec} + \mathrm{Recall}}$
- (Related work extracting diseases and genes reports F-measure of 0.50)
Sentences       F-measure
Only rel.       0.73
Rel. + irrel.   0.71
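For reference, the F-measure used here is the harmonic mean of precision and recall. A short helper, checked against the recall and precision figures in the protein extraction results table later in the talk (the function name is our own):

```python
def f_measure(precision, recall):
    """Balanced F-measure: 2 * P * R / (P + R), defined as 0 when P + R = 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs from the protein extraction results table.
for name, prec, rec in [("All", 0.85, 0.74), ("Papers", 0.83, 0.56),
                        ("Citations", 0.84, 0.75)]:
    print(name, round(f_measure(prec, rec), 2))
```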
42Features impact Role extraction
- Most important features: 1) Word 2) MeSH
Features        Rel. + irrel.    Only rel.
All features    0.71             0.73
No word         0.61 (-14.1%)    0.66 (-9.6%)
No MeSH         0.65 (-8.4%)     0.69 (-5.5%)
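The negative figures on this slide appear to be relative changes in F-measure with respect to the all-features setting. A quick check, under that assumption:

```python
def relative_change(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


# F-measures with all features vs. with the word feature removed.
print(round(relative_change(0.71, 0.61), 1))
print(round(relative_change(0.73, 0.66), 1))
```

Recomputing from the two-decimal scores shown on the slide can differ from the reported change in the last digit, presumably because the originals were computed from unrounded scores.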
43Features impact Relation classification
- Most important features: Roles
Features                    Accuracy
All feat. + roles           82.0
All feat. - roles           74.9 (-8.7%)
All feat. + roles - Word    79.8 (-2.8%)
All feat. + roles - MeSH    84.6 (+3.1%)
(rel. + irrel.)
44Features impact Relation classification
- Most realistic case: Roles not known
- Most important features: 1) Word 2) MeSH
Features                    Accuracy
All feat. - roles           74.9
All feat. - roles - Word    66.1 (-11.8%)
All feat. - roles - MeSH    72.5 (-3.2%)
(rel. + irrel.)
45Conclusions
- Classification of subtle semantic relations in bioscience text
- Graphical models for the simultaneous extraction of entities and relationships
- Importance of MeSH, lexical hierarchy
46Outline of Talk
- Goal: extract semantics from text
- Information and relation extraction
- Protein-protein interactions: using an existing database to gather labeled data
47Protein-Protein interactions
- One of the most important challenges in modern genomics, with many applications throughout biology
- There are several protein-protein interaction databases (BIND, MINT,..), all manually curated
48Protein-Protein interactions
- Supervised systems require manually labeled data, while purely unsupervised systems have yet to be proven effective for these tasks.
- Some other approaches: semi-supervised learning, active learning, co-training.
- We propose the use of resources developed in the biomedical domain to address the problem of gathering labeled data for the task of classifying interactions between proteins
49HIV-1, Protein Interaction Database
- Documents interactions between HIV-1 proteins and:
- host cell proteins
- other HIV-1 proteins
- disease associated with HIV/AIDS
- 2224 pairs of interacting proteins, 65 interaction types
http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions
50HIV-1, Protein Interaction Database
Protein 1    Protein 2    Paper ID                 Interaction Type
Tat, p14     AKT3         11156964, 11994280..     activates
AIP1         Gag, Pr55    14519844,                binds
Tat, p14     CDK2         9223324                  induces
Tat, p14     CDK2         7716549                  enhances
Tat, p14     CDK2         9525916                  downregulates
.
51Most common interactions
53Protein-Protein interactions
- Idea: use this to label data
Protein 1    Protein 2    Interaction    Paper ID
Tat, p14     AKT3         activates      11156964
- Extract from the paper all the sentences with Protein 1 and Protein 2
- Label them with the interaction given in the database
(Figure: the extracted sentences, each labeled "activates")
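The labeling idea can be sketched as follows: take one database record, keep the sentences from the corresponding paper that mention both proteins, and give each the record's interaction as its label. The record layout, the alias lists, and the example sentences are hypothetical; only the Tat/AKT3 "activates" entry comes from the slide.

```python
import re


def mentions(sentence, aliases):
    """True if any alias occurs in the sentence as a whole word (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(alias) + r"\b", sentence, re.IGNORECASE)
        for alias in aliases
    )


def label_sentences(sentences, record):
    """Pair every sentence that mentions both proteins with the record's interaction."""
    return [
        (s, record["interaction"])
        for s in sentences
        if mentions(s, record["protein1"]) and mentions(s, record["protein2"])
    ]


RECORD = {"protein1": ["Tat", "p14"], "protein2": ["AKT3"],
          "interaction": "activates", "paper_id": "11156964"}

# Made-up sentences standing in for the text of the paper.
SENTENCES = [
    "Tat was found to stimulate AKT3 in infected cells.",
    "AKT3 is widely expressed.",
    "The p14 protein increased AKT3 activity.",
]
labeled = label_sentences(SENTENCES, RECORD)
print(labeled)
```

Labels obtained this way are noisy: a sentence can mention both proteins without expressing the database interaction.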
54Protein-Protein interactions
- Use citations
- Find all the papers that cite the papers in the database
Protein 1    Protein 2    Interaction    Paper ID
Tat, p14     AKT3         activates      11156964
ID 9918876
ID 9971769
55Protein-Protein interactions
- From the papers, extract the citation sentences
- From these, extract the sentences with Protein 1 and Protein 2
- Label them
Protein 1    Protein 2    Interaction    Paper ID
Tat, p14     AKT3         activates      11156964
56Examples of sentences
- Papers
- The interpretation of these results was slightly complicated by the fact that AIP-1/ALIX depletion by using siRNA likely had deleterious effects on cell viability , because a Western blot analysis showed slightly reduced Gag expression at later time points (fig. 5C ).
- Citations
- They also demonstrate that the GAG protein from membrane - containing viruses , such as HIV , binds to Alix / AIP1 , thereby recruiting the ESCRT machinery to allow budding of the virus from the cell surface (TARGET_CITATION CITATION ) .
5710 Interaction types
58Protein-Protein interactions
- Tasks
- Given sentences from Paper ID, and/or citation sentences to ID
- Predict the interaction type given in the HIV database for Paper ID
- Extract the proteins involved
- 10-way classification problem
59Protein-Protein interactions
- Models
- Dynamic graphical model
- Naïve Bayes
60Graphical Models
61Evaluation
- Evaluation at document level
- All (sentences from papers + citations)
- Papers (only sentences from papers)
- Citations (only citation sentences)
- Trigger word approach
- List of keywords (e.g., for inhibits: inhibitor, inhibition, inhibit, etc.)
- If a keyword is present, assign the corresponding interaction
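A minimal sketch of such a trigger-word baseline. Only the keyword list for "inhibits" comes from the slide; the other lists, the substring matching, and the first-match rule are assumptions made for illustration.

```python
TRIGGERS = {
    "inhibits": ["inhibitor", "inhibition", "inhibit"],  # from the slide
    "activates": ["activator", "activation", "activate"],  # hypothetical
    "binds": ["binding", "bind"],                          # hypothetical
}


def trigger_label(sentence, triggers=TRIGGERS):
    """Return the first interaction whose keyword occurs in the sentence, else None."""
    text = sentence.lower()
    for interaction, keywords in triggers.items():
        if any(keyword in text for keyword in keywords):
            return interaction
    return None


print(trigger_label("Tat is a potent inhibitor of CDK2"))
print(trigger_label("Gag binds to AIP1"))
print(trigger_label("Patients were followed up for 6 months"))
```

Substring matching lets "inhibit" also cover "inhibits" and "inhibited", at the cost of occasional false matches.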
62Results
- Accuracies on interaction classification
Model                All     Papers   Citations
Markov Model         60.5    57.8     53.4
Naïve Bayes          58.1    57.8     55.7
Baselines
Most freq. inter.    21.8    11.1     26.1
TriggerW             20.1    24.4     20.4
TriggerW BO          25.8    40.0     26.1
(Roles hidden)
63Results confusion matrix
For "All". Overall accuracy: 60.5
64Hiding the protein names
- Replaced protein names with tokens PROT_NAME
- Selective CXCR4 antagonism by Tat
- Selective PROT_NAME antagonism by PROT_NAME
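The replacement step can be sketched with a regular expression, assuming the protein names occurring in the sentence are already known (for example, from the database record). The function name and the whole-word matching are our own choices.

```python
import re


def mask_proteins(sentence, names, token="PROT_NAME"):
    """Replace every whole-word occurrence of a protein name with `token`."""
    if not names:
        return sentence
    # Longest names first, so a longer name wins over a name it contains.
    ordered = sorted(names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
    return re.sub(pattern, token, sentence)


print(mask_proteins("Selective CXCR4 antagonism by Tat", ["CXCR4", "Tat"]))
```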
65Results with no protein names
Model           Papers           Citations
Markov Model    44.4 (-23.1%)    52.3 (-2.0%)
Naïve Bayes     46.7 (-19.2%)    53.4 (-4.1%)
66Protein extraction
- (Protein name tagging, role extraction)
- The identification of all the proteins present in the sentence that are involved in the interaction
- These results suggest that Tat - induced phosphorylation of serine 5 by CDK9 might be important after transcription has reached the 36 position, at which time CDK7 has been released from the complex.
- Tat might regulate the phosphorylation of the RNA polymerase II carboxyl - terminal domain in pre - initiation complexes by activating CDK7
67Protein extraction results
             Recall   Precision   F-measure
All          0.74     0.85        0.79
Papers       0.56     0.83        0.67
Citations    0.75     0.84        0.79
No dictionary used
68Conclusions of protein-protein interaction project
- Encouraging results for the automatic classification of protein-protein interactions
- Use of an existing database for gathering labeled data
- Use of citations
69Conclusion
- Machine Learning methods for NLP tasks
- Three lines of research in this area, state-of-the-art results
- Information and relation extraction for treatments and diseases
- Protein-protein interactions
- (Noun compounds)
70Thank you!
- Barbara Rosario
- SIMS, UC Berkeley
rosario_at_sims.berkeley.edu