Natural Language Processing in Bioinformatics: Uncovering Semantic Relations



1
Natural Language Processing in Bioinformatics
Uncovering Semantic Relations
  • Barbara Rosario
  • SIMS
  • UC Berkeley

2
Outline of Talk
  • Goal: extract semantics from text
  • Information and relation extraction
  • Protein-protein interactions

3
Text Mining
  • Text Mining is the discovery by computers of new,
    previously unknown information, via automatic
    extraction of information from text

4
Text Mining
  • Text
  • Stress is associated with migraines
  • Stress can lead to loss of magnesium
  • Calcium channel blockers prevent some migraines
  • Magnesium is a natural calcium channel blocker

1. Extract semantic entities from text
5
Text Mining
  • Text
  • Stress is associated with migraines
  • Stress can lead to loss of magnesium
  • Calcium channel blockers prevent some migraines
  • Magnesium is a natural calcium channel blocker

1. Extract semantic entities from text
Stress
Migraine
Magnesium
Calcium channel blockers
6
Text Mining (cont.)
  • Text
  • Stress is associated with migraines
  • Stress can lead to loss of magnesium
  • Calcium channel blockers prevent some migraines
  • Magnesium is a natural calcium channel blocker

2. Classify relations between entities
7
Text Mining (cont.)
  • Text
  • Stress is associated with migraines
  • Stress can lead to loss of magnesium
  • Calcium channel blockers prevent some migraines
  • Magnesium is a natural calcium channel blocker

3. Do reasoning: find new correlations
Stress
Migraine
Prevent
Magnesium
Calcium channel blockers
Subtype-of (is a)
8
Text Mining (cont.)
  • Text
  • Stress is associated with migraines
  • Stress can lead to loss of magnesium
  • Calcium channel blockers prevent some migraines
  • Magnesium is a natural calcium channel blocker

4. Do reasoning: infer causality
Stress
Migraine
No prevention
Prevent
Subtype-of (is a)
Magnesium
Calcium channel blockers
Deficiency of magnesium → migraine
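The causality inference sketched above can be illustrated as simple chaining over extracted relation tuples (Swanson-style ABC inference). This is a hypothetical toy, not the system from the talk; the entity and relation names come from the example sentences.

```python
# Hypothetical sketch: chain extracted relations to hypothesize a new
# link, as in the magnesium/migraine example above.
relations = {
    ("stress", "migraine"): "associated_with",
    ("stress", "magnesium_loss"): "leads_to",
    ("calcium_channel_blockers", "migraine"): "prevents",
    ("magnesium", "calcium_channel_blockers"): "subtype_of",
}

def infer_prevention(relations):
    """If X is a subtype of Y and Y prevents Z, hypothesize X prevents Z."""
    inferred = []
    for (x, y), rel in relations.items():
        if rel == "subtype_of":
            for (a, z), rel2 in relations.items():
                if a == y and rel2 == "prevents":
                    inferred.append((x, "prevents", z))
    return inferred

print(infer_prevention(relations))
# [('magnesium', 'prevents', 'migraine')] -- hence a magnesium
# deficiency may plausibly contribute to migraines
```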
9
My research
Information Extraction
  • Stress is associated with migraines
  • Stress can lead to loss of magnesium
  • Calcium channel blockers prevent some migraines
  • Magnesium is a natural calcium channel blocker

10
My research
Relation extraction
11
Information and relation extraction
  • Problems
  • Given biomedical text
  • Find all the treatments and all the diseases
  • Find the relations that hold between them

12
Hepatitis Examples
  • Cure
  • These results suggest that con A-induced
    hepatitis was ameliorated by pretreatment with
    TJ-135.
  • Prevent
  • A two-dose combined hepatitis A and B vaccine
    would facilitate immunization programs
  • Vague
  • Effect of interferon on hepatitis B

13
Two tasks
  • Relationship extraction
  • Identify the several semantic relations that can
    occur between the entities disease and treatment
    in bioscience text
  • Information extraction (IE)
  • Related problem: identify such entities

14
Outline of IE
  • Data and semantic relations
  • Quick intro to graphical models
  • Models and results
  • Features
  • Conclusions

15
Data and Relations
  • MEDLINE abstracts and titles
  • 3662 sentences labeled
  • Relevant: 1724
  • Irrelevant: 1771
  • e.g., Patients were followed up for 6 months
  • 2 types of entities: treatment and disease
  • 7 relationships between these entities

The labeled data are available at
http://biotext.berkeley.edu
16
Semantic Relationships
  • 810 Cure
  • Intravenous immune globulin for recurrent
    spontaneous abortion
  • 616 Only Disease
  • Social ties and susceptibility to the common cold
  • 166 Only Treatment
  • Flucticasone propionate is safe in recommended
    doses
  • 63 Prevent
  • Statins for prevention of stroke

17
Semantic Relationships
  • 36 Vague
  • Phenylbutazone and leukemia
  • 29 Side Effect
  • Malignant mesodermal mixed tumor of the uterus
    following irradiation
  • 4 Does NOT cure
  • Evidence for double resistance to permethrin and
    malathion in head lice

18
Outline of IE
  • Data and semantic relations
  • Quick intro to graphical models
  • Models and results
  • Features
  • Conclusions

19
Graphical Models
  • Unifying framework for developing Machine
    Learning algorithms
  • Graph theory plus probability theory
  • Widely used
  • Error correcting codes
  • Systems diagnosis
  • Computer vision
  • Filtering (Kalman filters)
  • Bioinformatics

20
(Quick intro to) Graphical Models
  • Nodes are random variables
  • Edges are annotated with conditional
    probabilities
  • Absence of an edge between nodes implies
    conditional independence
  • Probabilistic database

21
Graphical Models
  • Define a joint probability distribution
  • P(X1, ..., XN) = ∏i P(Xi | Par(Xi))
  • P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)
  • Learning
  • Given data, estimate P(A), P(B|A), P(D), P(C|A,D)
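A quick numeric check of this factorization for binary variables; all probability tables below are invented for illustration.

```python
# Toy check of P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D) for binary
# variables; the conditional probability tables are made up.
P_A = [0.6, 0.4]                               # P(A=a)
P_D = [0.7, 0.3]                               # P(D=d)
P_B_A = [[0.9, 0.1],                           # P(B=b | A=0)
         [0.2, 0.8]]                           # P(B=b | A=1)
P_C_AD = [[[0.8, 0.2], [0.6, 0.4]],            # P(C=c | A=0, D=d)
          [[0.3, 0.7], [0.1, 0.9]]]            # P(C=c | A=1, D=d)

def joint(a, b, c, d):
    """Joint probability via the graphical-model factorization."""
    return P_A[a] * P_D[d] * P_B_A[a][b] * P_C_AD[a][d][c]

# A valid factored joint must sum to 1 over all assignments
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1))
print(total)
```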

22
Graphical Models
  • Define a joint probability distribution
  • P(X1, ..., XN) = ∏i P(Xi | Par(Xi))
  • P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)
  • Learning
  • Given data, estimate P(A), P(B|A), P(D), P(C|A,D)
  • Inference: compute conditional probabilities,
    e.g., P(A | B, D)
  • Inference: probabilistic queries; general
    inference algorithms (Junction Tree)

23
Naïve Bayes models
  • Simple graphical model
  • The Xi depend on Y
  • Naïve Bayes assumption: all Xi are independent
    given Y
  • Currently used for text classification and spam
    detection
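A minimal Naïve Bayes text classifier can be sketched in a few lines (word counts plus add-one smoothing). This is a toy illustration, not the talk's system; the three-sentence "training set" reuses example titles from earlier slides with invented labels.

```python
# Minimal Naive Bayes text classifier sketch: multinomial counts with
# add-one smoothing. The tiny training corpus is illustrative only.
from collections import Counter
import math

train = [
    ("intravenous immune globulin for recurrent abortion", "cure"),
    ("statins for prevention of stroke", "prevent"),
    ("flucticasone propionate is safe in recommended doses", "only_treatment"),
]

class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    best, best_lp = None, -math.inf
    for label in class_counts:
        lp = math.log(class_counts[label] / len(train))   # prior
        total = sum(word_counts[label].values())
        for w in text.split():                            # likelihoods
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("statins for stroke prevention"))
# prevent
```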

[Diagram: class node Y with feature nodes x1, x2, x3]
24
Dynamic Graphical Models
  • Graphical model composed of repeated segments
  • HMMs (Hidden Markov Models)
  • POS tagging, speech recognition, IE

[Diagram: HMM tag nodes t1..tN emitting words w1..wN]
25
HMMs
  • Joint probability distribution
  • P(t1, ..., tN, w1, ..., wN) = P(t1) ∏i P(ti | ti-1) P(wi | ti)
  • Estimate P(t1), P(ti | ti-1), P(wi | ti) from
    labeled data

26
HMMs
  • Joint probability distribution
  • P(t1, ..., tN, w1, ..., wN) = P(t1) ∏i P(ti | ti-1) P(wi | ti)
  • Estimate P(t1), P(ti | ti-1), P(wi | ti) from
    labeled data
  • Inference: P(ti | w1, w2, ..., wN)

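As a concrete toy instance of the HMM joint above, the sketch below scores one tag sequence for a two-word input. The tag set (DISEASE/TREAT) and the probability tables are invented for illustration.

```python
# Sketch of the HMM joint P(t1..tN, w1..wN) = P(t1) ∏ P(ti|ti-1) P(wi|ti).
# Toy tag set and probability tables; rows of P_trans/P_emit sum to 1.
P_init = {"DISEASE": 0.5, "TREAT": 0.5}
P_trans = {("DISEASE", "TREAT"): 0.4, ("DISEASE", "DISEASE"): 0.6,
           ("TREAT", "DISEASE"): 0.3, ("TREAT", "TREAT"): 0.7}
P_emit = {("DISEASE", "hepatitis"): 0.8, ("DISEASE", "vaccine"): 0.2,
          ("TREAT", "hepatitis"): 0.1, ("TREAT", "vaccine"): 0.9}

def joint(tags, words):
    """Score a tag sequence for a word sequence under the HMM."""
    p = P_init[tags[0]] * P_emit[(tags[0], words[0])]
    for i in range(1, len(tags)):
        p *= P_trans[(tags[i - 1], tags[i])] * P_emit[(tags[i], words[i])]
    return p

# 0.5 * 0.8 * 0.4 * 0.9 = 0.144
print(joint(["DISEASE", "TREAT"], ["hepatitis", "vaccine"]))
```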
27
Graphical Models for IE
  • Different dependencies between the features and
    the relation nodes

[Diagram: dynamic vs. static model structures]
28
Graphical Model
  • Relation node
  • Semantic relation (cure, prevent, none..)
    expressed in the sentence
  • The relation generates the state sequence and
    the observations

29
Graphical Model
  • Markov sequence of states (roles)
  • Role nodes
  • Role_t ∈ {treatment, disease, none}

[Diagram: Role_{t-1} → Role_t → Role_{t+1} Markov chain]
30
Graphical Model
  • Roles generate multiple observations
  • Feature nodes (observed)
  • word, POS, MeSH

31
Graphical Model
  • Inference: find Relation and Roles given the
    observed features

32
Features
  • Word
  • Part of speech
  • Phrase constituent
  • Orthographic features
  • e.g., is a number, all letters capitalized,
    first letter capitalized
  • Semantic features (MeSH)

33
MeSH
  • MeSH Tree Structures
  • 1. Anatomy A
  • 2. Organisms B
  • 3. Diseases C
  • 4. Chemicals and Drugs D
  • 5. Analytical, Diagnostic and Therapeutic
    Techniques and Equipment E
  • 6. Psychiatry and Psychology F
  • 7. Biological Sciences G
  • 8. Physical Sciences H
  • 9. Anthropology, Education, Sociology and
    Social Phenomena I
  • 10. Technology and Food and Beverages J
  • 11. Humanities K
  • 12. Information Science L
  • 13. Persons M
  • 14. Health Care N
  • 15. Geographic Locations Z

34
MeSH (cont.)
  • 1. Anatomy A
  • Body Regions A01
  • Musculoskeletal System A02
  • Digestive System A03
  • Respiratory System A04
  • Urogenital System A05
  • Endocrine System A06
  • Cardiovascular System A07
  • Nervous System A08
  • Sense Organs A09
  • Tissues A10
  • Cells A11
  • Fluids and Secretions A12
  • Animal Structures A13
  • Stomatognathic System A14
  • (..)
  • Body Regions A01
  • Abdomen A01.047
  • Groin A01.047.365
  • Inguinal Canal A01.047.412
  • Peritoneum A01.047.596
  • Umbilicus A01.047.849
  • Axilla A01.133
  • Back A01.176
  • Breast A01.236
  • Buttocks A01.258
  • Extremities A01.378
  • Head A01.456
  • Neck A01.598
  • (.)

35
Use of lexical Hierarchies in NLP
  • A big problem in NLP: few words occur often, most
    occur very rarely (Zipf's law)
  • Difficult to do statistics
  • One solution: use lexical hierarchies
  • Another example: WordNet
  • Statistics on classes of words instead of words

36
Mapping Words to MeSH Concepts
  • headache → C23.888.592.612.441
  • C23.888: Neurologic Manifestations
  • C23: Pathological Conditions, Signs and Symptoms
  • pain → G11.561.796.444
  • G11.561: Nervous System Physiology
  • G11: Musculoskeletal, Neural, and Ocular
    Physiology
  • headache recurrence → C23.888.592.612.441,
    C23.550.291.937
  • breast cancer cells → A01.236, C04, A11
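Because MeSH tree codes encode the hierarchy positionally, generalizing a word to its ancestors is just truncating the code at each dot, as in the headache example above. The tiny lookup table below is illustrative; a real system would use the full MeSH vocabulary.

```python
# Sketch: map a word to its MeSH code, then generalize by truncating
# the tree code. The two-entry dictionary is illustrative only.
mesh_code = {
    "headache": "C23.888.592.612.441",
    "pain": "G11.561.796.444",
}

def ancestors(code):
    """All prefixes of a MeSH tree code, most specific first."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]

print(ancestors(mesh_code["headache"]))
# ['C23.888.592.612.441', 'C23.888.592.612', 'C23.888.592',
#  'C23.888', 'C23']
```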

37
Graphical Model
  • Joint probability distribution over relation,
    roles and features nodes
  • Parameters estimated with maximum likelihood and
    absolute discounting smoothing

38
Graphical Model
  • Inference: find Relation and Roles given the
    observed features

39
Relation extraction
  • Results in terms of classification accuracy (with
    and without irrelevant sentences)
  • 2 cases
  • Roles given
  • Roles hidden (only features)

40
Relation classification Results
Sentences      Input        Accuracy (Base.)  Accuracy (GM D2)
Only rel.      only feat.   46.7              72.6
Only rel.      roles given  46.7              76.6
Rel. + irrel.  only feat.   50.6              74.9
Rel. + irrel.  roles given  50.6              82.0
  • Good results for a difficult task
  • One of the few systems to tackle several
    DIFFERENT relations between the same types of
    entities; this differs from the problem statement
    of other work on relations

41
Role Extraction Results
  • Junction tree algorithm
  • F-measure = (2 × Prec × Recall) / (Prec + Recall)
  • (Related work extracting diseases and genes
    reports F-measure of 0.50)

Sentences      F-measure
Only rel.      0.73
Rel. + irrel.  0.71
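A quick worked instance of the F-measure formula, using, e.g., precision 0.85 and recall 0.74 (the values later reported for protein extraction on all sentences):

```python
# Worked instance of F = 2*P*R / (P + R).
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.85, 0.74), 2))
# 0.79
```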
42
Features impact Role extraction
  • Most important features: 1) Word, 2) MeSH

              Rel. + irrel.  Only rel.
All features  0.71           0.73
No word       0.61 (-14.1%)  0.66 (-9.6%)
No MeSH       0.65 (-8.4%)   0.69 (-5.5%)

43
Features impact Relation classification
  • Most important feature: Roles

Accuracy (rel. + irrel.)
All feat. + roles         82.0
All feat. - roles         74.9 (-8.7)
All feat. + roles - Word  79.8 (-2.8)
All feat. + roles - MeSH  84.6 (+3.1)
44
Features impact Relation classification
  • Most realistic case: Roles not known
  • Most important features: 1) Word, 2) MeSH

Accuracy (rel. + irrel.)
All feat. - roles         74.9
All feat. - roles - Word  66.1 (-11.8)
All feat. - roles - MeSH  72.5 (-3.2)
45
Conclusions
  • Classification of subtle semantic relations in
    bioscience text
  • Graphical models for the simultaneous extraction
    of entities and relationships
  • Importance of MeSH, lexical hierarchy

46
Outline of Talk
  • Goal: extract semantics from text
  • Information and relation extraction
  • Protein-protein interactions: using an existing
    database to gather labeled data

47
Protein-Protein interactions
  • One of the most important challenges in modern
    genomics, with many applications throughout
    biology
  • There are several protein-protein interaction
    databases (BIND, MINT,..), all manually curated

48
Protein-Protein interactions
  • Supervised systems require manually labeled data,
    while purely unsupervised methods have yet to
    prove effective for these tasks
  • Other approaches: semi-supervised learning,
    active learning, co-training
  • We propose using resources developed in the
    biomedical domain to gather labeled data for the
    task of classifying interactions between proteins

49
HIV-1, Protein Interaction Database
  • Documents interactions between HIV-1 proteins and
  • host cell proteins
  • other HIV-1 proteins
  • diseases associated with HIV/AIDS
  • 2224 pairs of interacting proteins, 65 types

http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions
50
HIV-1, Protein Interaction Database
Protein 1  Protein 2  Paper ID              Interaction Type
Tat, p14   AKT3       11156964, 11994280..  activates
AIP1       Gag, Pr55  14519844              binds
Tat, p14   CDK2       9223324               induces
Tat, p14   CDK2       7716549               enhances
Tat, p14   CDK2       9525916               downregulates
...
51
Most common interactions
52
Protein-Protein interactions
  • Idea: use this to label data

Protein 1 Protein 2 Interaction Paper ID
Tat, p14 AKT3 activates 11156964
53
Protein-Protein interactions
  • Idea: use this to label data

Protein 1 Protein 2 Interaction Paper ID
Tat, p14 AKT3 activates 11156964
Extract from the paper all the sentences with
Protein 1 and Protein 2

Label them with the interaction given in the
database
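The labeling step described above can be sketched as: split the paper into sentences, keep those mentioning both proteins, and tag them with the database interaction. The sample sentences below are invented, and the period-based sentence splitting is deliberately naive.

```python
# Sketch of weak labeling from the HIV-1 interaction database: label
# every sentence mentioning both proteins with the DB interaction.
# The paper text here is invented for illustration.
db_entry = {"p1": "Tat", "p2": "AKT3", "interaction": "activates"}

paper_text = (
    "Tat activates AKT3 in infected cells. "
    "Control cells showed no change. "
    "These data confirm that AKT3 is a target of Tat."
)

def label_sentences(text, entry):
    labeled = []
    for sent in text.split(". "):          # naive sentence splitter
        if entry["p1"] in sent and entry["p2"] in sent:
            labeled.append((sent.strip(". "), entry["interaction"]))
    return labeled

print(label_sentences(paper_text, db_entry))
```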
54
Protein-Protein interactions
  • Use citations
  • Find all the papers that cite the papers in the
    database

Protein 1 Protein 2 Interaction Paper ID
Tat, p14 AKT3 activates 11156964
ID 9918876
ID 9971769
55
Protein-Protein interactions
  • From the papers, extract the citation sentences
  • From these, extract the sentences with Protein 1
    and Protein 2
  • Label them

Protein 1 Protein 2 Interaction Paper ID
Tat, p14 AKT3 activates 11156964
56
Examples of sentences
  • Papers
  • The interpretation of these results was slightly
    complicated by the fact that AIP-1/ALIX depletion
    by using siRNA likely had deleterious effects on
    cell viability , because a Western blot analysis
    showed slightly reduced Gag expression at later
    time points (fig. 5C ).
  • Citations
  • They also demonstrate that the GAG protein from
    membrane - containing viruses , such as HIV ,
    binds to Alix / AIP1 , thereby recruiting the
    ESCRT machinery to allow budding of the virus
    from the cell surface (TARGET_CITATION CITATION
    ) .

57
10 Interaction types
58
Protein-Protein interactions
  • Tasks
  • Given sentences from Paper ID, and/or citation
    sentences to ID
  • Predict the interaction type given in the HIV
    database for Paper ID
  • Extract the proteins involved
  • 10-way classification problem

59
Protein-Protein interactions
  • Models
  • Dynamic graphical model
  • Naïve Bayes

60
Graphical Models
61
Evaluation
  • Evaluation at document level
  • All (sentences from papers citations)
  • Papers (only sentences from papers)
  • Citations (only citation sentences)
  • Trigger word approach
  • List of keywords (e.g., for "inhibits":
    inhibitor, inhibition, inhibit, etc.)
  • If a keyword is present, assign the corresponding
    interaction
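The trigger-word baseline above amounts to a keyword lookup. The sketch below shows the idea; the keyword lists are abbreviated examples, not the ones actually used in the evaluation.

```python
# Sketch of the trigger-word baseline: assign the first interaction
# whose keyword list matches the sentence. Keyword lists abbreviated.
triggers = {
    "inhibits": ["inhibitor", "inhibition", "inhibit"],
    "activates": ["activation", "activator", "activate"],
    "binds": ["binding", "bind"],
}

def trigger_classify(sentence, default="unknown"):
    s = sentence.lower()
    for interaction, keywords in triggers.items():
        if any(k in s for k in keywords):
            return interaction
    return default

print(trigger_classify("Tat enhances CDK9-mediated inhibition of transcription"))
# inhibits
```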

62
Results
  • Accuracies on interaction classification

Model              All   Papers  Citations
Markov Model       60.5  57.8    53.4
Naïve Bayes        58.1  57.8    55.7
Baselines:
Most freq. inter.  21.8  11.1    26.1
TriggerW           20.1  24.4    20.4
TriggerW BO        25.8  40.0    26.1
(Roles hidden)
63
Results confusion matrix
For All: overall accuracy 60.5
64
Hiding the protein names
  • Replaced protein names with tokens PROT_NAME
  • Selective CXCR4 antagonism by Tat
  • Selective PROT_NAME antagonism by PROT_NAME
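The name-hiding step above is a token replacement; a word-boundary regex over the known protein list does it cleanly. The two-protein list here is just the pair from the example; a real run would use the full database protein list.

```python
# Sketch of hiding protein names: replace known protein mentions with
# the PROT_NAME token. The protein list is illustrative.
import re

proteins = ["CXCR4", "Tat"]
pattern = re.compile(r"\b(" + "|".join(map(re.escape, proteins)) + r")\b")

def mask(sentence):
    return pattern.sub("PROT_NAME", sentence)

print(mask("Selective CXCR4 antagonism by Tat"))
# Selective PROT_NAME antagonism by PROT_NAME
```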

65
Results with no protein names
Model Papers Citations
Markov Model 44.4 (-23.1) 52.3 (-2.0)
Naïve Bayes 46.7 (-19.2) 53.4 (-4.1 )
66
Protein extraction
  • (Protein name tagging, role extraction)
  • The identification of all the proteins present in
    the sentence that are involved in the interaction
  • These results suggest that Tat-induced
    phosphorylation of serine 5 by CDK9 might be
    important after transcription has reached the 36
    position, at which time CDK7 has been released
    from the complex.
  • Tat might regulate the phosphorylation of the RNA
    polymerase II carboxyl-terminal domain in
    pre-initiation complexes by activating CDK7

67
Protein extraction results
           Recall  Precision  F-measure
All        0.74    0.85       0.79
Papers     0.56    0.83       0.67
Citations  0.75    0.84       0.79
No dictionary used
68
Conclusions of protein-protein interaction project
  • Encouraging results for the automatic
    classification of protein-protein interactions
  • Use of an existing database for gathering labeled
    data
  • Use of citations

69
Conclusion
  • Machine Learning methods for NLP tasks
  • Three lines of research in this area, with
    state-of-the-art results
  • Information and relation extraction for
    treatments and diseases
  • Protein-protein interactions
  • (Noun compounds)

70
Thank you!
  • Barbara Rosario
  • SIMS, UC Berkeley

rosario@sims.berkeley.edu