Semantic Relation Detection in Bioscience Text

1
Semantic Relation Detection in Bioscience Text
  • Marti Hearst
  • SIMS, UC Berkeley
  • http://biotext.berkeley.edu
  • Supported by NSF DBI-0317510 and a gift from
    Genentech

2
BioText Project Goals
  • Provide flexible, intelligent access to
    information for use in biosciences applications.
  • Focus on
  • Textual Information from Journal Articles
  • Tightly integrated with other resources
  • Ontologies
  • Record-based databases

3
Project Team
  • Project Leaders
  • PI: Marti Hearst
  • Co-PI: Adam Arkin
  • Computational Linguistics
  • Barbara Rosario (graduated)
  • Preslav Nakov
  • Database Research
  • Ariel Schwartz
  • Gaurav Bhalotia (graduated)
  • User Interface / IR
  • Rowena Luk
  • Dr. Emilia Stoica
  • Bioscience
  • Dr. TingTing Zhang
  • Janice Hamer

Supported primarily by NSF DBI-0317510 and a
gift from Genentech
4
BioText Architecture
Sophisticated Text Analysis
Annotations in Database
Improved Search Interface
5
The Nature of Bioscience Text
  • Claim
  • Bioscience semantics are simultaneously easier
    and harder than general text.
  • Easier: fewer subtleties, fewer ambiguities,
    systematic meanings
  • Harder: enormous terminology, complex sentence
    structure
6
Entity-Entity Relation Recognition
7
Two tasks
  • Relationship Extraction
  • Identify the several semantic relations that can
    occur between two entities (in this case, protein
    names) in bioscience text.
  • Entity extraction
  • Related problem: identify the entities

8
The Approach
  • Data: MEDLINE abstracts and titles
  • Graphical models
  • Combine in one framework both relation and entity
    extraction
  • Both static and dynamic models
  • Simple discriminative approach
  • Neural network
  • Lexical, syntactic and semantic features

9
Protein-Protein interactions
  • Tasks
  • Given sentences from Paper ID, and/or citation
    sentences to ID
  • Predict the interaction type given in the HIV
    database for Paper ID
  • Extract the proteins involved
  • 10-way classification problem

10
Protein-Protein interactions
  • Models
  • Dynamic graphical model
  • Naïve Bayes
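
To make the Naïve Bayes option concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier over bag-of-words features with add-one smoothing. The slides do not give the actual features or implementation, so the training sentences, the two labels, and the function names `train_nb` and `classify_nb` are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Collect per-label word counts from (sentence, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in examples:
        words = sentence.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify_nb(model, sentence):
    """Return the label with the highest log posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in sentence.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
toy_data = [
    ("PROT_NAME inhibits PROT_NAME", "inhibits"),
    ("inhibition of PROT_NAME by PROT_NAME", "inhibits"),
    ("PROT_NAME binds PROT_NAME", "binds"),
    ("binding of PROT_NAME to PROT_NAME", "binds"),
]
model = train_nb(toy_data)
prediction = classify_nb(model, "PROT_NAME binds to PROT_NAME")
```

In the task described on the slides the label set would be the ten interaction types rather than the two toy labels used here.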

11
Graphical Models
12
Evaluation
  • Evaluation at document level
  • All (sentences from papers and citations)
  • Papers (only sentences from papers)
  • Citations (only citation sentences)
  • Trigger word approach
  • List of keywords (e.g., for inhibits: inhibitor,
    inhibition, inhibit, etc.)
  • If a keyword is present, assign the corresponding
    interaction
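
The trigger word baseline can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: only the keywords for "inhibits" come from the slide, the "binds" entry is invented, and substring matching is a simplification chosen for brevity.

```python
# Hypothetical keyword lists; only the "inhibits" entries come from the slide.
TRIGGER_WORDS = {
    "inhibits": ["inhibitor", "inhibition", "inhibit"],
    "binds": ["bind", "binding"],  # invented for illustration
}


def trigger_word_label(sentence, triggers=TRIGGER_WORDS):
    """Return the first interaction whose keyword occurs in the sentence, else None."""
    text = sentence.lower()
    for interaction, keywords in triggers.items():
        if any(keyword in text for keyword in keywords):
            return interaction
    return None


label = trigger_word_label("PROT_NAME is an inhibitor of PROT_NAME")
```

When several interactions have matching keywords, this sketch returns whichever comes first in the dictionary; the slides do not say how such ties were resolved.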

13
Results
  • Accuracies on interaction classification

Model               All    Papers  Citations
Markov Model        60.5   57.8    53.4
Naïve Bayes         58.1   57.8    55.7
Baselines
Most freq. inter.   21.8   11.1    26.1
TriggerW            20.1   24.4    20.4
TriggerW BO         25.8   40.0    26.1
(Roles hidden)
14
Results: confusion matrix
For All. Overall accuracy: 60.5
15
Hiding the protein names
  • Replaced protein names with the token PROT_NAME
  • Selective CXCR4 antagonism by Tat
  • Selective PROT_NAME antagonism by PROT_NAME
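
A minimal sketch of the replacement step, assuming the protein names in each sentence are already known (for example from the database annotation; the slides do not say how they were located). The function name `hide_protein_names` is hypothetical.

```python
import re


def hide_protein_names(sentence, protein_names, token="PROT_NAME"):
    """Replace each whole-word occurrence of a known protein name with a token."""
    if not protein_names:
        return sentence
    # Longest names first so a longer name is not partially replaced by a shorter one.
    names = sorted(protein_names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
    return re.sub(pattern, token, sentence)


masked = hide_protein_names("Selective CXCR4 antagonism by Tat", ["CXCR4", "Tat"])
```

Applied to the slide's example, `masked` is the second sentence shown above.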

16
Results with no protein names
Model          Papers          Citations
Markov Model   44.4 (-23.1)    52.3 (-2.0)
Naïve Bayes    46.7 (-19.2)    53.4 (-4.1)
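
The slide does not define the figures in parentheses. One reading that agrees with them to within about 0.1 is the relative change, in percent, from the accuracies with protein names (slide 13) to the accuracies without them. The check below is a sketch under that assumption; `relative_change` is a name introduced here.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# (accuracy with names, accuracy without names, value printed on the slide)
rows = {
    ("Markov Model", "Papers"): (57.8, 44.4, -23.1),
    ("Markov Model", "Citations"): (53.4, 52.3, -2.0),
    ("Naïve Bayes", "Papers"): (57.8, 46.7, -19.2),
    ("Naïve Bayes", "Citations"): (55.7, 53.4, -4.1),
}

for (model_name, subset), (before, after, reported) in rows.items():
    computed = relative_change(before, after)
    print(f"{model_name:12s} {subset:9s} computed {computed:6.2f}  reported {reported:6.1f}")
```

The small remaining differences could come from rounding of the published accuracies, which is also an assumption.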
17
Protein extraction
  • (Protein name tagging, role extraction)
  • The identification of all the proteins present in
    the sentence that are involved in the interaction
  • These results suggest that Tat-induced
    phosphorylation of serine 5 by CDK9 might be
    important after transcription has reached the 36
    position, at which time CDK7 has been released
    from the complex.
  • Tat might regulate the phosphorylation of the RNA
    polymerase II carboxyl-terminal domain in
    pre-initiation complexes by activating CDK7

18
Protein extraction results
            Recall  Precision  F-measure
All         0.74    0.85       0.79
Papers      0.56    0.83       0.67
Citations   0.75    0.84       0.79
No dictionary used
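
The F-measure column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition, which the slide does not state explicitly.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, F-measure) as reported on the slide
reported = {
    "All": (0.74, 0.85, 0.79),
    "Papers": (0.56, 0.83, 0.67),
    "Citations": (0.75, 0.84, 0.79),
}

for name, (recall, precision, f_reported) in reported.items():
    print(name, round(f_measure(precision, recall), 2), f_reported)
```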
19
Conclusions of protein-protein interaction project
  • Encouraging results for the automatic
    classification of protein-protein interactions
  • Use of an existing database for gathering labeled
    data
  • Use of citations

20
Acquiring Labeled Data using Citances
21
BioScience Researchers
  • Read A LOT!
  • Cite A LOT!
  • Curate A LOT!
  • Are interested in specific relations, e.g.
  • What is the role of this protein in that pathway?
  • Show me articles in which a comparison between
    two values is significant.

22
Acquiring Labeled Data using Citances
23
A discovery is made
A paper is written
24
That paper is cited
and cited
and cited
as the evidence for some fact(s) F.
25
Each of these in turn is cited for some fact(s)
until it is the case that all important facts
in the field can be found in citation sentences
alone!
26
Citances
  • Nearly every statement in a bioscience journal
    article is backed up with a cite.
  • It is quite common for papers to be cited 30-100
    times.
  • The text around the citation tends to state
    biological facts. (Call these citances.)
  • Different citances will state the same facts in
    different ways
  • so can we use these for creating models of
    language expressing semantic relations?

27
Using Citances
  • Potential uses of citation sentences (citances)
  • creation of training and testing data for
    semantic analysis,
  • synonym set creation,
  • database curation,
  • document summarization,
  • and information retrieval generally.
  • Some preliminary results
  • Citances to a document align well with a
    hand-built curation.
  • Citances are good candidates for paraphrase
    creation.

28
Issues for Processing Citances
  • Text span
  • Identification of the appropriate phrase, clause,
    or sentence that constitutes a citance.
  • Correct mapping of citations when shown as lists
    or groups (e.g., 22-25).
  • Grouping citances by topic
  • Citances that cite the same document should be
    grouped by the facts they state.
  • Normalizing or paraphrasing citances
  • For IR, summarization, learning synonyms,
    relation extraction, question answering, and
    machine translation.
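
As one small example of the citation-mapping issue, a grouped marker such as 22-25 has to be expanded into individual reference numbers before each citance can be linked to the papers it cites. The sketch below assumes numeric citation markers separated by commas or semicolons; the function name `expand_citation_group` is hypothetical.

```python
import re


def expand_citation_group(group):
    """Expand a citation group such as "22-25" or "3, 7-9" into reference numbers."""
    numbers = []
    for part in re.split(r"[,;]", group):
        part = part.strip()
        # A range written with a hyphen or an en dash, e.g. "22-25".
        match = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            numbers.extend(range(start, end + 1))
        elif part.isdigit():
            numbers.append(int(part))
    return numbers


expanded = expand_citation_group("22-25")
```

Author-year citation styles would need a different matcher; this sketch covers numbered references only.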

29
Early results: Paraphrase Creation from Citances
30
Sample Sentences
  • NGF withdrawal from sympathetic neurons induces
    Bim, which then contributes to death.
  • Nerve growth factor withdrawal induces the
    expression of Bim and mediates Bax dependent
    cytochrome c release and apoptosis.
  • The proapoptotic Bcl-2 family member Bim is
    strongly induced in sympathetic neurons in
    response to NGF withdrawal.
  • In neurons, the BH3 only Bcl2 member, Bim, and
    JNK are both implicated in apoptosis caused by
    nerve growth factor deprivation.

31
Their Paraphrases
  • NGF withdrawal induces Bim.
  • Nerve growth factor withdrawal induces the
    expression of Bim.
  • Bim has been shown to be upregulated following
    nerve growth factor withdrawal.
  • Bim implicated in apoptosis caused by nerve
    growth factor deprivation.
  • They all paraphrase:
  • Bim is induced after NGF withdrawal.

32
Paraphrase Creation Algorithm
  • 1. Extract the sentences that cite the target.
  • 2. Mark the NEs of interest (genes/proteins, MeSH
    terms) and normalize.
  • 3. Dependency parse (MiniPar).
  • 4. For each parse, for each pair of NEs of
    interest:
  •    i. Extract the path between them.
  •    ii. Create a paraphrase from the path.
  • 5. Rank the candidates for a given pair of NEs.
  • 6. Select only the ones above a threshold.
  • 7. Generalize.
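
Step 4 can be sketched as a shortest-path search over the dependency parse, treated as an undirected graph. This is an illustration under assumptions, not the authors' code: the parse below is written by hand rather than produced by MiniPar, the normalized token NGF_withdrawal is invented, and emitting the path tokens in sentence order is just one simple way to turn a path into a paraphrase.

```python
from collections import deque


def dependency_path(edges, source, target):
    """Shortest path between two token indices in an undirected view of the parse."""
    neighbors = {}
    for head, dependent in edges:
        neighbors.setdefault(head, []).append(dependent)
        neighbors.setdefault(dependent, []).append(head)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # the two tokens are not connected


def paraphrase_from_path(tokens, path):
    """Emit the tokens on the path in their original sentence order."""
    return " ".join(tokens[i] for i in sorted(path))


# Hand-written toy parse as (head index, dependent index); not MiniPar output.
tokens = ["NGF_withdrawal", "from", "sympathetic", "neurons", "induces", "Bim"]
edges = [(4, 0), (0, 1), (1, 3), (3, 2), (4, 5)]
path = dependency_path(edges, 0, 5)
candidate = paraphrase_from_path(tokens, path)
```

On this toy parse the modifier "from sympathetic neurons" falls off the path, leaving a candidate close to the first paraphrase on the previous slide.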

33
Relevant Papers
  • Citances: Citation Sentences for Semantic
    Analysis of Bioscience Text, Preslav Nakov, Ariel
    Schwartz, and Marti Hearst, in the SIGIR'04
    workshop on Search and Discovery in
    Bioinformatics.
  • Classifying Semantic Relations in Bioscience
    Text, Barbara Rosario and Marti Hearst, in ACL
    2004.
  • The Descent of Hierarchy, and Selection in
    Relational Semantics, Barbara Rosario, Marti
    Hearst, and Charles Fillmore, in ACL 2002.

34
Thank you!
  • Marti Hearst
  • SIMS, UC Berkeley
  • http://biotext.berkeley.edu