Natural Language Processing in Bioinformatics: Uncovering Semantic Relations

About This Presentation

Description:

Natural Language Processing in Bioinformatics: Uncovering Semantic Relations Barbara Rosario SIMS UC Berkeley – PowerPoint PPT presentation

Slides: 71

Provided by: rosario

Learn more at: https://biotext.berkeley.edu

Transcript and Presenter's Notes

1
Natural Language Processing in Bioinformatics
Uncovering Semantic Relations

Barbara Rosario
SIMS
UC Berkeley

2
Outline of Talk

Goal: extract semantics from text
Information and relation extraction
Protein-protein interactions

3
Text Mining

Text Mining is the discovery by computers of new,
previously unknown information, via automatic
extraction of information from text

4
Text Mining

Text
Stress is associated with migraines
Stress can lead to loss of magnesium
Calcium channel blockers prevent some migraines
Magnesium is a natural calcium channel blocker

1 Extract semantic entities from text
5
Text Mining

Text
Stress is associated with migraines
Stress can lead to loss of magnesium
Calcium channel blockers prevent some migraines
Magnesium is a natural calcium channel blocker

1 Extract semantic entities from text
Stress
Migraine
Magnesium
Calcium channel blockers
6
Text Mining (cont.)

Text
Stress is associated with migraines
Stress can lead to loss of magnesium
Calcium channel blockers prevent some migraines
Magnesium is a natural calcium channel blocker

2 Classify relations between entities
7
Text Mining (cont.)

Text
Stress is associated with migraines
Stress can lead to loss of magnesium
Calcium channel blockers prevent some migraines
Magnesium is a natural calcium channel blocker

3. Do reasoning: find new correlations
Stress
Migraine
Prevent
Magnesium
Calcium channel blockers
Subtype-of (is a)
8
Text Mining (cont.)

Text
Stress is associated with migraines
Stress can lead to loss of magnesium
Calcium channel blockers prevent some migraines
Magnesium is a natural calcium channel blocker

4. Do reasoning: infer causality
Stress
Migraine
No prevention
Prevent
Subtype-of (is a)
Magnesium
Calcium channel blockers
Deficiency of magnesium → migraine
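
The reasoning steps on these slides can be illustrated with a small sketch. The relation names (`is_a`, `prevents`, `causes_loss_of`, `may_promote`) and the two inference rules are assumptions made for this illustration; the talk does not specify how the reasoning is implemented.

```python
# Relations hand-copied from the four example sentences (hypothetical encoding).
FACTS = {
    ("stress", "causes_loss_of", "magnesium"),
    ("calcium channel blocker", "prevents", "migraine"),
    ("magnesium", "is_a", "calcium channel blocker"),
}


def infer(facts):
    """Apply two toy rules repeatedly until no new fact is produced."""
    facts = set(facts)
    while True:
        new = set()
        for (a, r1, b) in facts:
            for (c, r2, d) in facts:
                if b != c:
                    continue
                # Rule 1: X is_a Y and Y prevents Z  =>  X prevents Z
                if r1 == "is_a" and r2 == "prevents":
                    new.add((a, "prevents", d))
                # Rule 2: W causes_loss_of X and X prevents Z  =>  W may_promote Z
                if r1 == "causes_loss_of" and r2 == "prevents":
                    new.add((a, "may_promote", d))
        if new <= facts:
            return facts
        facts |= new


derived = infer(FACTS) - FACTS
```

Here `derived` holds the two hypotheses suggested by the slides: magnesium prevents migraine, and stress (through loss of magnesium) may promote migraine.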
9
My research
Information Extraction

Stress is associated with migraines
Stress can lead to loss of magnesium
Calcium channel blockers prevent some migraines
Magnesium is a natural calcium channel blocker

10
My research
Relation extraction
11
Information and relation extraction

Problems
Given biomedical text
Find all the treatments and all the diseases
Find the relations that hold between them

12
Hepatitis Examples

Cure
These results suggest that con A-induced
hepatitis was ameliorated by pretreatment with
TJ-135.
Prevent
A two-dose combined hepatitis A and B vaccine
would facilitate immunization programs
Vague
Effect of interferon on hepatitis B

13
Two tasks

Relationship extraction
Identify the several semantic relations that can
occur between the entities disease and treatment
in bioscience text
Information extraction (IE)
Related problem: identify such entities

14
Outline of IE

Data and semantic relations
Quick intro to graphical models
Models and results
Features
Conclusions

15
Data and Relations

MEDLINE, abstracts and titles
3662 sentences labeled
Relevant 1724
Irrelevant 1771
e.g., Patients were followed up for 6 months
2 types of Entities
treatment and disease
7 Relationships between these entities

The labeled data are available at
http://biotext.berkeley.edu
16
Semantic Relationships

810 Cure
Intravenous immune globulin for recurrent
spontaneous abortion
616 Only Disease
Social ties and susceptibility to the common cold
166 Only Treatment
Fluticasone propionate is safe in recommended
doses
63 Prevent
Statins for prevention of stroke

17
Semantic Relationships

36 Vague
Phenylbutazone and leukemia
29 Side Effect
Malignant mesodermal mixed tumor of the uterus
following irradiation
4 Does NOT cure
Evidence for double resistance to permethrin and
malathion in head lice

18
Outline of IE

Data and semantic relations
Quick intro to graphical models
Models and results
Features
Conclusions

19
Graphical Models

Unifying framework for developing Machine
Learning algorithms
Graph theory plus probability theory
Widely used
Error correcting codes
Systems diagnosis
Computer vision
Filtering (Kalman filters)
Bioinformatics

20
(Quick intro to) Graphical Models

Nodes are random variables
Edges are annotated with conditional
probabilities
Absence of an edge between nodes implies
conditional independence
Probabilistic database

A
21
Graphical Models

Define a joint probability distribution
$P(X_1, \ldots, X_N) = \prod_i P(X_i \mid \mathrm{Par}(X_i))$
$P(A, B, C, D) = P(A)\,P(D)\,P(B \mid A)\,P(C \mid A, D)$
Learning
Given data, estimate $P(A)$, $P(B \mid A)$, $P(D)$, $P(C \mid A, D)$

A
22
Graphical Models

Define a joint probability distribution
$P(X_1, \ldots, X_N) = \prod_i P(X_i \mid \mathrm{Par}(X_i))$
$P(A, B, C, D) = P(A)\,P(D)\,P(B \mid A)\,P(C \mid A, D)$
Learning
Given data, estimate $P(A)$, $P(B \mid A)$, $P(D)$, $P(C \mid A, D)$

Inference: compute conditional probabilities,
e.g., $P(A \mid B, D)$
Inference answers probabilistic queries. General
inference algorithms exist (Junction Tree)
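
As a rough illustration of the factorization and of inference, the sketch below evaluates the joint for the four-node example and answers the query P(A | B, D) by brute-force enumeration. The probability values are made up, the variables are assumed binary, and enumeration stands in for a general algorithm such as Junction Tree.

```python
from itertools import product

# Hypothetical conditional probability tables for binary A, B, C, D.
P_A = {1: 0.3, 0: 0.7}
P_D = {1: 0.6, 0: 0.4}
P_B_GIVEN_A = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}}  # [a][b]
P_C_GIVEN_AD = {                                           # [(a, d)][c]
    (1, 1): {1: 0.8, 0: 0.2},
    (1, 0): {1: 0.5, 0: 0.5},
    (0, 1): {1: 0.4, 0: 0.6},
    (0, 0): {1: 0.1, 0: 0.9},
}


def joint(a, b, c, d):
    """P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)."""
    return P_A[a] * P_D[d] * P_B_GIVEN_A[a][b] * P_C_GIVEN_AD[(a, d)][c]


def posterior_a(b, d):
    """P(A=1 | B=b, D=d), obtained by summing the joint over C (and over A)."""
    numerator = sum(joint(1, b, c, d) for c in (0, 1))
    evidence = sum(joint(a, b, c, d) for a in (0, 1) for c in (0, 1))
    return numerator / evidence


total_mass = sum(joint(*values) for values in product((0, 1), repeat=4))
```

Enumeration is exponential in the number of variables, which is why general-purpose algorithms are used for larger graphs.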

23
Naïve Bayes models

Simple graphical model
Each $X_i$ depends on Y
Naïve Bayes assumption: all $X_i$ are independent
given Y
Currently used for text classification and spam
detection
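
A minimal Naïve Bayes text classifier might look like the following. The add-one smoothing, the function names, and the four-sentence training set (built from phrases quoted earlier in the talk) are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, label). Returns (log prior, log likelihood, vocab)."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    log_prior = {y: math.log(n / len(docs)) for y, n in label_counts.items()}
    log_like = {}
    for y in label_counts:
        # Add-one smoothing (an assumption, chosen for brevity).
        total = sum(word_counts[y].values()) + len(vocab)
        log_like[y] = {w: math.log((word_counts[y][w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab


def predict_nb(model, tokens):
    """Pick the label Y maximizing log P(Y) + sum_i log P(x_i | Y)."""
    log_prior, log_like, vocab = model
    scores = {
        y: log_prior[y] + sum(log_like[y][w] for w in tokens if w in vocab)
        for y in log_prior
    }
    return max(scores, key=scores.get)


TOY_DOCS = [
    ("statins for prevention of stroke".split(), "prevent"),
    ("vaccine would facilitate immunization".split(), "prevent"),
    ("hepatitis was ameliorated by pretreatment".split(), "cure"),
    ("immune globulin for recurrent abortion".split(), "cure"),
]
toy_model = train_nb(TOY_DOCS)
toy_prediction = predict_nb(toy_model, "prevention of stroke".split())
```

The sum of per-word log probabilities is exactly where the independence assumption enters: each word contributes separately given the label.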

Y
x1
x2
x3
24
Dynamic Graphical Models

Graphical model composed of repeated segments
HMMs (Hidden Markov Models)
POS tagging, speech recognition, IE

tN
wN
25
HMMs

Joint probability distribution
$P(t_1, \ldots, t_N, w_1, \ldots, w_N) = P(t_1) \prod_i P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$
Estimate $P(t_1)$, $P(t_i \mid t_{i-1})$, $P(w_i \mid t_i)$ from labeled
data

26
HMMs

Joint probability distribution
$P(t_1, \ldots, t_N, w_1, \ldots, w_N) = P(t_1) \prod_i P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$
Estimate $P(t_1)$, $P(t_i \mid t_{i-1})$, $P(w_i \mid t_i)$ from labeled
data
Inference: $P(t_i \mid w_1, w_2, \ldots, w_N)$
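
The HMM joint probability and the posterior query can be sketched as below, using the standard forward-backward recursion. All parameter values and the three-word vocabulary are invented for the example, and the code follows the usual convention that the first word's emission probability is part of the product.

```python
from itertools import product

STATES = ("treatment", "disease", "none")
# Hypothetical parameters; in the talk they are estimated from labeled data.
INIT = {"treatment": 0.3, "disease": 0.3, "none": 0.4}
TRANS = {
    "treatment": {"treatment": 0.5, "disease": 0.1, "none": 0.4},
    "disease": {"treatment": 0.1, "disease": 0.5, "none": 0.4},
    "none": {"treatment": 0.3, "disease": 0.3, "none": 0.4},
}
EMIT = {
    "treatment": {"statins": 0.7, "for": 0.1, "stroke": 0.2},
    "disease": {"statins": 0.1, "for": 0.1, "stroke": 0.8},
    "none": {"statins": 0.1, "for": 0.8, "stroke": 0.1},
}


def joint(tags, words):
    """P(t_1..t_N, w_1..w_N) for one tag sequence."""
    p = INIT[tags[0]] * EMIT[tags[0]][words[0]]
    for prev, cur, w in zip(tags, tags[1:], words[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][w]
    return p


def posteriors(words):
    """P(t_i | w_1..w_N) for every position i, via forward-backward."""
    fwd = [{s: INIT[s] * EMIT[s][words[0]] for s in STATES}]
    for w in words[1:]:
        prev = fwd[-1]
        fwd.append({s: EMIT[s][w] * sum(prev[r] * TRANS[r][s] for r in STATES)
                    for s in STATES})
    bwd = [{s: 1.0 for s in STATES}]
    for w in reversed(words[1:]):
        nxt = bwd[0]
        bwd.insert(0, {s: sum(TRANS[s][r] * EMIT[r][w] * nxt[r] for r in STATES)
                       for s in STATES})
    z = sum(fwd[-1].values())  # P(w_1..w_N)
    return [{s: fwd[i][s] * bwd[i][s] / z for s in STATES}
            for i in range(len(words))]


WORDS = ["statins", "for", "stroke"]
POST = posteriors(WORDS)
```

On short inputs the result can be checked against a brute-force sum of `joint` over all tag sequences, which is a convenient sanity test.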

tN
wN
27
Graphical Models for IE

Different dependencies between the features and
the relation nodes

Dynamic
Static
28
Graphical Model

Relation node
Semantic relation (cure, prevent, none..)
expressed in the sentence
The relation generates the state sequence and the
observations

Relation
29
Graphical Model

Markov sequence of states (roles)
Role nodes
Role_t: treatment, disease, none

Role_{t-1}
Role_t
Role_{t+1}
30
Graphical Model

Roles generate multiple observations
Feature nodes (observed)
word, POS, MeSH

Features
31
Graphical Model

Inference: find the Relation and Roles given the
observed features

32
Features

Word
Part of speech
Phrase constituent
Orthographic features
is number, all letters are capitalized,
first letter is capitalized
Semantic features (MeSH)
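
The orthographic features listed here are simple to compute per token. The following is one possible sketch; the function name and feature keys are hypothetical.

```python
def orthographic_features(token):
    """Boolean orthographic features for a single token."""
    return {
        # Digits only, allowing one decimal point (e.g. "6", "3.5").
        "is_number": token.replace(".", "", 1).isdigit(),
        # Every cased character is uppercase (e.g. "TJ-135", "HIV").
        "all_caps": token.isupper(),
        # First character is an uppercase letter (e.g. "Stress").
        "init_cap": token[:1].isupper(),
    }
```

For example, `orthographic_features("TJ-135")` marks the token as all-caps and initially capitalized but not as a number.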

33
MeSH

MeSH Tree Structures
1. Anatomy A
2. Organisms B
3. Diseases C
4. Chemicals and Drugs D
5. Analytical, Diagnostic and Therapeutic
Techniques and Equipment E
6. Psychiatry and Psychology F
7. Biological Sciences G
8. Physical Sciences H
9. Anthropology, Education, Sociology and
Social Phenomena I
10. Technology and Food and Beverages J
11. Humanities K
12. Information Science L
13. Persons M
14. Health Care N
15. Geographic Locations Z

34
MeSH (cont.)

1. Anatomy A
Body Regions A01
Musculoskeletal System A02
Digestive System A03
Respiratory System A04
Urogenital System A05
Endocrine System A06
Cardiovascular System A07
Nervous System A08
Sense Organs A09
Tissues A10
Cells A11
Fluids and Secretions A12
Animal Structures A13
Stomatognathic System A14
(..)

Body Regions A01
Abdomen A01.047
Groin A01.047.365
Inguinal Canal A01.047.412
Peritoneum A01.047.596
Umbilicus A01.047.849
Axilla A01.133
Back A01.176
Breast A01.236
Buttocks A01.258
Extremities A01.378
Head A01.456
Neck A01.598
(.)

35
Use of lexical Hierarchies in NLP

Big problem in NLP: a few words occur a lot, most
of them occur very rarely (Zipf's law)
Difficult to do statistics
One solution: use lexical hierarchies
Another example: WordNet
Statistics on classes of words instead of words

36
Mapping Words to MeSH Concepts

Words                 MeSH tree numbers
headache pain         C23.888.592.612.441   G11.561.796.444
                      C23.888 (Neurologic Manifestations)   G11.561 (Nervous System Physiology)
                      C23 (Pathological Conditions, Signs and Symptoms)   G11 (Musculoskeletal, Neural, and Ocular Physiology)
headache recurrence   C23.888.592.612.441   C23.550.291.937
breast cancer cells   A01.236   C04   A11
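
Generalizing a MeSH tree number to a higher level of the hierarchy amounts to keeping a prefix of its dot-separated components. A small sketch, assuming a hypothetical word-to-code lookup table (the talk does not say how the lookup itself is done):

```python
# Hypothetical lookup built from the slide's "headache pain" example.
WORD_TO_MESH = {
    "headache": "C23.888.592.612.441",
    "pain": "G11.561.796.444",
}


def truncate_mesh(code, levels):
    """Keep the first `levels` dot-separated components of a MeSH tree number."""
    return ".".join(code.split(".")[:levels])


def mesh_features(words, levels):
    """Map each known word to its MeSH code truncated to `levels` components."""
    return [truncate_mesh(WORD_TO_MESH[w], levels) for w in words if w in WORD_TO_MESH]
```

With `levels=2` the phrase "headache pain" becomes `C23.888 G11.561`, and with `levels=1` it becomes `C23 G11`, so statistics are gathered over classes rather than individual words.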

37
Graphical Model

Joint probability distribution over relation,
roles and features nodes
Parameters estimated with maximum likelihood and
absolute discounting smoothing
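
Absolute discounting subtracts a fixed amount from every non-zero count and redistributes the freed probability mass. The sketch below is one common variant; the discount value `d=0.5` and the uniform redistribution over the vocabulary are assumptions, since the slide names the technique without giving details.

```python
from collections import Counter


def absolute_discounting(counts, vocab, d=0.5):
    """Smoothed P(w) from raw counts; requires at least one observation."""
    n = sum(counts.values())
    seen_types = sum(1 for c in counts.values() if c > 0)
    reserved = d * seen_types / n  # probability mass freed by discounting
    return {
        w: max(counts.get(w, 0) - d, 0) / n + reserved / len(vocab)
        for w in vocab
    }


toy_counts = Counter({"hepatitis": 3, "vaccine": 1})
toy_probs = absolute_discounting(toy_counts, ["hepatitis", "vaccine", "stroke"])
```

The unseen word "stroke" receives a small non-zero probability, and the distribution still sums to one.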

38
Graphical Model

Inference: find the Relation and Roles given the
observed features

39
Relation extraction

Results in terms of classification accuracy (with
and without irrelevant sentences)
2 cases
Roles given
Roles hidden (only features)

40
Relation classification Results
Sentences       Input          Baseline accuracy   GM D2 accuracy
Only rel.       only feat.     46.7                72.6
Only rel.       roles given    46.7                76.6
Rel. + irrel.   only feat.     50.6                74.9
Rel. + irrel.   roles given    50.6                82.0

Good results for a difficult task
One of the few systems to tackle several
DIFFERENT relations between the same types of
entities; this differs from the problem statement
of other work on relations

41
Role Extraction Results

Junction tree algorithm
F-measure = $2 \cdot \mathrm{Prec} \cdot \mathrm{Recall} / (\mathrm{Prec} + \mathrm{Recall})$
(Related work extracting diseases and genes
reports F-measure of 0.50)

Sentences       F-measure
Only rel.       0.73
Rel. + irrel.   0.71
42
Features impact Role extraction

Most important features: 1) Word 2) MeSH

F-measure      Rel. + irrel.    Only rel.
All features   0.71             0.73
No word        0.61 (-14.1%)    0.66 (-9.6%)
No MeSH        0.65 (-8.4%)     0.69 (-5.5%)


43
Features impact Relation classification

Most important features: Roles

(rel. + irrel.)              Accuracy
All feat. + roles            82.0
All feat. - roles            74.9 (-8.7%)
All feat. + roles - Word     79.8 (-2.8%)
All feat. + roles - MeSH     84.6 (+3.1%)
44
Features impact Relation classification

Most realistic case: roles not known
Most important features: 1) Word 2) MeSH

(rel. + irrel.)              Accuracy
All feat. - roles            74.9
All feat. - roles - Word     66.1 (-11.8%)
All feat. - roles - MeSH     72.5 (-3.2%)
45
Conclusions

Classification of subtle semantic relations in
bioscience text
Graphical models for the simultaneous extraction
of entities and relationships
Importance of MeSH, lexical hierarchy

46
Outline of Talk

Goal: extract semantics from text
Information and relation extraction
Protein-protein interactions using an existing
database to gather labeled data

47
Protein-Protein interactions

One of the most important challenges in modern
genomics, with many applications throughout
biology
There are several protein-protein interaction
databases (BIND, MINT,..), all manually curated

48
Protein-Protein interactions

Supervised systems require manually labeled data,
while purely unsupervised systems have yet to be
proven effective for these tasks.
Some other approaches: semi-supervised, active
learning, co-training.
We propose the use of resources developed in the
biomedical domain to address the problem of
gathering labeled data for the task of
classifying interactions between proteins

49
HIV-1, Protein Interaction Database

Documents interactions between HIV-1 proteins and
host cell proteins
other HIV-1 proteins
disease associated with HIV/AIDS
2224 pairs of interacting proteins, 65 types

http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions
50
HIV-1, Protein Interaction Database
Protein 1   Protein 2   Paper ID                Interaction Type
Tat, p14    AKT3        11156964, 11994280..    activates
AIP1        Gag, Pr55   14519844,               binds
Tat, p14    CDK2        9223324                 induces
Tat, p14    CDK2        7716549                 enhances
Tat, p14    CDK2        9525916                 downregulates
.
51
Most common interactions
52
Protein-Protein interactions

Idea: use this to label data

Protein 1 Protein 2 Interaction Paper ID
Tat, p14 AKT3 activates 11156964
53
Protein-Protein interactions

Idea: use this to label data

Protein 1 Protein 2 Interaction Paper ID
Tat, p14 AKT3 activates 11156964
Extract from the paper all the sentences with
Protein 1 and Protein 2
activates
activates

Label them with the interaction given in the
database
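
The labeling idea can be sketched in a few lines: keep the sentences that mention both proteins of a database record and give them that record's interaction type. The `Interaction` class, the whole-word matching, and the example sentences are assumptions for illustration; real protein-name matching needs more care with variants and spelling.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    protein1: tuple   # name variants, e.g. ("Tat", "p14")
    protein2: tuple
    interaction: str
    paper_id: str


def mentions(sentence, names):
    """True if any name variant occurs as a whole word (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(name) + r"\b", sentence, re.IGNORECASE)
        for name in names
    )


def label_sentences(record, sentences):
    """Label every sentence containing both proteins with the record's interaction."""
    return [
        (s, record.interaction)
        for s in sentences
        if mentions(s, record.protein1) and mentions(s, record.protein2)
    ]


RECORD = Interaction(("Tat", "p14"), ("AKT3",), "activates", "11156964")
# Made-up sentences standing in for the text of the paper.
TOY_SENTENCES = [
    "Tat was found to act on AKT3 in infected cells.",
    "AKT3 expression was measured after 24 hours.",
]
LABELED = label_sentences(RECORD, TOY_SENTENCES)
```

Only the first toy sentence mentions both proteins, so it alone is labeled "activates". The labels are noisy by construction, since a sentence may mention both proteins without expressing the recorded interaction.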
54
Protein-Protein interactions

Use citations
Find all the papers
that cite the papers
in the database

Protein 1 Protein 2 Interaction Paper ID
Tat, p14 AKT3 activates 11156964
ID 9918876
ID 9971769
55
Protein-Protein interactions

From the papers, extract
the citation sentences
from these extract the
sentences with Protein 1
and Protein 2
Label them

Protein 1 Protein 2 Interaction Paper ID
Tat, p14 AKT3 activates 11156964
56
Examples of sentences

Papers
The interpretation of these results was slightly
complicated by the fact that AIP-1/ALIX depletion
by using siRNA likely had deleterious effects on
cell viability , because a Western blot analysis
showed slightly reduced Gag expression at later
time points (fig. 5C ).
Citations
They also demonstrate that the GAG protein from
membrane - containing viruses , such as HIV ,
binds to Alix / AIP1 , thereby recruiting the
ESCRT machinery to allow budding of the virus
from the cell surface (TARGET_CITATION CITATION
) .

57
10 Interaction types
58
Protein-Protein interactions

Tasks
Given sentences from Paper ID, and/or citation
sentences to ID
Predict the interaction type given in the HIV
database for Paper ID
Extract the proteins involved
10-way classification problem

59
Protein-Protein interactions

Models
Dynamic graphical model
Naïve Bayes

60
Graphical Models
61
Evaluation

Evaluation at document level
All (sentences from papers + citations)
Papers (only sentences from papers)
Citations (only citation sentences)
Trigger word approach
List of keywords (e.g., for inhibits: inhibitor,
inhibition, inhibit, etc.)
If a keyword is present, assign the corresponding
interaction
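
A trigger-word baseline of this kind reduces to keyword lookup. In the sketch below, only the "inhibits" keyword list comes from the slide; the "binds" list, the substring matching, and the first-match tie-breaking are assumptions.

```python
TRIGGERS = {
    "inhibits": ("inhibitor", "inhibition", "inhibit"),  # from the slide
    "binds": ("bind", "binding", "bound"),               # hypothetical list
}


def trigger_word_label(sentence, triggers=TRIGGERS):
    """Return the first interaction whose keyword occurs in the sentence, else None."""
    lowered = sentence.lower()
    for interaction, keywords in triggers.items():
        if any(keyword in lowered for keyword in keywords):
            return interaction
    return None
```

Applied to the citation sentence quoted earlier ("... binds to Alix / AIP1 ..."), this returns "binds"; a sentence with no listed keyword gets no label.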

62
Results

Accuracies on interaction classification

Model               All    Papers  Citations
Markov Model        60.5   57.8    53.4
Naïve Bayes         58.1   57.8    55.7
Baselines
Most freq. inter.   21.8   11.1    26.1
TriggerW            20.1   24.4    20.4
TriggerW BO         25.8   40.0    26.1
(Roles hidden)
63
Results: confusion matrix
For All. Overall accuracy: 60.5
64
Hiding the protein names

Replaced protein names with tokens PROT_NAME
Selective CXCR4 antagonism by Tat
Selective PROT_NAME antagonism by PROT_NAME
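
Masking protein names can be done with a single regular-expression substitution. This sketch assumes the list of protein names is already known; how the names were identified in the original experiment is not stated here.

```python
import re


def mask_proteins(sentence, protein_names, token="PROT_NAME"):
    """Replace every whole-word occurrence of a protein name with `token`."""
    if not protein_names:
        return sentence
    # Longest names first so that a longer name is not partially replaced.
    names = sorted(protein_names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(token, sentence)


MASKED = mask_proteins("Selective CXCR4 antagonism by Tat", ["CXCR4", "Tat"])
```

`MASKED` reproduces the slide's example, so the classifier sees only the context around the proteins and not their identities.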

65
Results with no protein names
Model          Papers           Citations
Markov Model   44.4 (-23.1%)    52.3 (-2.0%)
Naïve Bayes    46.7 (-19.2%)    53.4 (-4.1%)
66
Protein extraction

(Protein name tagging, role extraction)
The identification of all the proteins present in
the sentence that are involved in the interaction
These results suggest that Tat - induced
phosphorylation of serine 5 by CDK9 might be
important after transcription has reached the 36
position, at which time CDK7 has been released
from the complex.
Tat might regulate the phosphorylation of the RNA
polymerase II carboxyl - terminal domain in pre -
initiation complexes by activating CDK7

67
Protein extraction results
            Recall   Precision   F-measure
All         0.74     0.85        0.79
Papers      0.56     0.83        0.67
Citations   0.75     0.84        0.79
No dictionary used
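
The F-measure column follows from the recall and precision columns through the harmonic-mean definition given earlier in the talk. A short check, with the table values copied into a dictionary (the names `f_measure` and `ROWS` are just for this example):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (recall, precision) per evaluation setting, copied from the table above.
ROWS = {"All": (0.74, 0.85), "Papers": (0.56, 0.83), "Citations": (0.75, 0.84)}
F_SCORES = {
    name: round(f_measure(precision, recall), 2)
    for name, (recall, precision) in ROWS.items()
}
```

Rounded to two decimals, the computed values agree with the reported F-measures for all three settings.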
68
Conclusions of protein-protein interaction project

Encouraging results for the automatic
classification of protein-protein interactions
Use of an existing database for gathering labeled
data
Use of citations

69
Conclusion