Classifying Semantic Relations in Bioscience Texts

Adapted from by Lucian Vlad Lita. 2. Natural Language Processing ... Adapted from by Lucian Vlad Lita. 11. Semantic Roles. Define roles to be extracted ... – PowerPoint PPT presentation



1
Classifying Semantic Relations in Bioscience Texts
  • Barbara Rosario
  • Marti Hearst
  • SIMS, UC Berkeley
  • http://biotext.berkeley.edu
  • Supported by NSF DBI-0317510 and a gift from
    Genentech

2
Natural Language Processing
  • Goal: deep understanding of broad language
  • It'd be great if machines could
  • Translate for us
  • Write up our research
  • Find out information for us
  • Summarize
  • But they can't
  • Language is ambiguous, flexible, complex, subtle

3
NLP in practice
  • Syntactic analysis
  • Part-of-Speech Tagging
  • Parsing
  • Shallow parsing
  • Applications
  • Text Classification
  • (sort of) Question Answering
  • Spelling Correction
  • (sort of) Machine Translation
  • Information retrieval
  • Information Extraction

4
Information Extraction
  • Identification and classification of small units
    within documents

5
Extracting Job Openings from the Web
6
What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
7
What is Information Extraction
As a task: Filling slots in a database from sub-segments of
text.
(same news text as on slide 6)
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
8
What is Information Extraction
As a family of techniques
Information Extraction = segmentation + classification + association
(same news text as on slide 6)
Segments found in the text:
  • Microsoft Corporation
  • CEO
  • Bill Gates
  • Microsoft
  • Gates
  • Microsoft
  • Bill Veghte
  • Microsoft
  • VP
  • Richard Stallman
  • founder
  • Free Software Foundation
aka named entity extraction
11
Semantic Roles
  • Define roles to be extracted
  • Application dependent
  • JobTitle, Employer, JobCategory, JobLocation
  • But we would like them to be more general
  • Linguistic theories, granularity of roles
  • Proto-agent, proto-patient
  • Fillmore's case theory has 9 roles (agent,
    patient, location, experiencer, etc.)
  • Extreme view: each verb has its own set of roles
  • Buyer, bought_thing, seller, sold_thing
  • Middle view: roles are particular to a semantic
    frame (like transaction)

12
Roles in the Biomedical domain
  • Treatment and Disease
  • A two-dose combined hepatitis A and B vaccine
    would facilitate immunization programs
  • Proteins
  • A caveolin - 1 - dependent coupling of PrPc to
    the tyrosine kinase Fyn was observed

13
Relations
  • Person-affiliation: Affiliation(Gates, Microsoft) =
    CEO
  • Location: Location(Microsoft) = Redmond
  • Protein1 inhibits (or activates, releases)
    protein2

14
Problem: Which relations hold between 2 entities?
Cure?
Prevent?
Side Effect?
15
Hepatitis Examples
  • Cure
  • These results suggest that con A-induced
    hepatitis was ameliorated by pretreatment with
    TJ-135.
  • Prevent
  • A two-dose combined hepatitis A and B vaccine
    would facilitate immunization programs
  • Vague
  • Effect of interferon on hepatitis B

16
Two tasks
  • Relationship Extraction
  • Identify the several semantic relations that can
    occur between the entities disease and treatment
    in bioscience text
  • Entity extraction
  • Related problem: identify such entities
  • Much of the important, late-breaking bioscience
    information is found only in textual form.
  • We need both tasks to extract useful information
    from text and to make inferences

17
The Approach
  • Data: MEDLINE abstracts and titles
  • Collection of 4,600 biomedical journals
  • Graphical models and neural networks
  • Lexical, syntactic and semantic features

18
Data and Relations
  • MEDLINE, abstracts and titles
  • 3662 sentences labeled
  • Relevant: 1724
  • Irrelevant: 1771
  • e.g., "Patients were followed up for 6 months"
  • 2 types of Entities, many instances
  • treatment and disease
  • 7 Relationships between these entities

The labeled data is available at
http://biotext.berkeley.edu
19
Semantic Relationships
  • 810 Cure
  • Intravenous immune globulin for recurrent
    spontaneous abortion
  • 616 Only Disease
  • Social ties and susceptibility to the common cold
  • 166 Only Treatment
  • Fluticasone propionate is safe in recommended
    doses
  • 63 Prevent
  • Statins for prevention of stroke

20
Semantic Relationships
  • 36 Vague
  • Phenylbutazone and leukemia
  • 29 Side Effect
  • Malignant mesodermal mixed tumor of the uterus
    following irradiation
  • 4 Does NOT cure
  • Evidence for double resistance to permethrin and
    malathion in head lice

21
Features
  • Word
  • Part of speech
  • Phrase constituent
  • Orthographic features
  • is number, all letters are capitalized,
    first letter is capitalized
  • MeSH (semantic features)
  • Replace words, or sequences of words, with
    generalizations via MeSH categories
  • Peritoneum -> Abdomen
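
The orthographic features listed on this slide can be sketched as a small function. This is only an illustration: the function name, the dictionary keys, and the handling of decimals are assumptions, not the authors' implementation.

```python
def orthographic_features(token: str) -> dict:
    """Compute the three orthographic features named on the slide.

    Hypothetical helper: the feature names and the treatment of
    decimals such as "3.5" are illustrative choices.
    """
    letters = [ch for ch in token if ch.isalpha()]
    return {
        # "is number": digits, optionally with one decimal point
        "is_number": token.replace(".", "", 1).isdigit(),
        # "all letters are capitalized": at least one letter, all upper case
        "all_caps": bool(letters) and all(ch.isupper() for ch in letters),
        # "first letter is capitalized"
        "init_cap": token[:1].isupper(),
    }


if __name__ == "__main__":
    for tok in ["MEDLINE", "Hepatitis", "6", "vaccine"]:
        print(tok, orthographic_features(tok))
```

Each token then contributes these boolean values alongside its word, part of speech, and MeSH features.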

22
Features (cont.) MeSH
  • MeSH Tree Structures
  • 1. Anatomy A
  • 2. Organisms B
  • 3. Diseases C
  • 4. Chemicals and Drugs D
  • 5. Analytical, Diagnostic and Therapeutic
    Techniques and Equipment E
  • 6. Psychiatry and Psychology F
  • 7. Biological Sciences G
  • 8. Physical Sciences H
  • 9. Anthropology, Education, Sociology and
    Social Phenomena I
  • 10. Technology and Food and Beverages J
  • 11. Humanities K
  • 12. Information Science L
  • 13. Persons M
  • 14. Health Care N
  • 15. Geographic Locations Z

23
Features (cont.) MeSH
  • 1. Anatomy A
  • Body Regions A01
  • Musculoskeletal System A02
  • Digestive System A03
  • Respiratory System A04
  • Urogenital System A05
  • Endocrine System A06
  • Cardiovascular System A07
  • Nervous System A08
  • Sense Organs A09
  • Tissues A10
  • Cells A11
  • Fluids and Secretions A12
  • Animal Structures A13
  • Stomatognathic System A14
  • (...)
  • Body Regions A01
  • Abdomen A01.047
  • Groin A01.047.365
  • Inguinal Canal A01.047.412
  • Peritoneum A01.047.596
  • Umbilicus A01.047.849
  • Axilla A01.133
  • Back A01.176
  • Breast A01.236
  • Buttocks A01.258
  • Extremities A01.378
  • Head A01.456
  • Neck A01.598
  • (...)
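
The tree numbers above suggest one simple way to generalize a word: look up its MeSH tree number and keep only the first few levels. The sketch below assumes that truncation scheme and uses a tiny lookup table copied from this slide; the function names and the choice of depth are hypothetical, and the authors' actual mapping may differ.

```python
# Tree numbers copied from the slide; a real system would load the full MeSH tree.
MESH_TREE = {
    "Body Regions": "A01",
    "Abdomen": "A01.047",
    "Groin": "A01.047.365",
    "Peritoneum": "A01.047.596",
    "Umbilicus": "A01.047.849",
    "Back": "A01.176",
}
NAME_OF = {number: name for name, number in MESH_TREE.items()}


def truncate(tree_number: str, levels: int) -> str:
    """Keep the first `levels` dot-separated components of a tree number."""
    return ".".join(tree_number.split(".")[:levels])


def mesh_feature(word: str, levels: int = 2) -> str:
    """Return a generalized MeSH tree number, or the word itself if unknown."""
    tree_number = MESH_TREE.get(word)
    if tree_number is None:
        return word  # no MeSH entry: fall back to the surface form
    return truncate(tree_number, levels)


if __name__ == "__main__":
    code = mesh_feature("Peritoneum", levels=2)
    print(code, NAME_OF[code])  # Peritoneum generalizes to Abdomen
```

A smaller `levels` value gives a coarser feature: with `levels=1`, every term on this slide maps to A01 (Body Regions).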

24
Models
  • Graphical Models (static and dynamic)
  • Neural networks

25
Graphical Models
  • Graph theory plus probability theory
  • Nodes are variables
  • Edges are conditional probabilities
  • Absence of an edge between nodes implies
    conditional independence between the variables of
    the nodes

Example (node A with edges to B and C):
P(A, B, C) = P(A) P(B|A) P(C|A)
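
As a worked illustration of the factorization P(A) P(B|A) P(C|A), the sketch below builds the joint distribution for three binary variables and checks that B and C are conditionally independent given A. The graph shape (A as the parent of B and C) is read off the formula, and all probability values are made up for the example.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
p_c_given_a = {True: {True: 0.6, False: 0.4}, False: {True: 0.5, False: 0.5}}


def joint(a: bool, b: bool, c: bool) -> float:
    """P(A=a, B=b, C=c) under the factorization P(A) P(B|A) P(C|A)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]


def conditional_bc_given_a(a: bool, b: bool, c: bool) -> float:
    """P(B=b, C=c | A=a), computed from the joint."""
    p_of_a = sum(joint(a, bb, cc) for bb, cc in product([True, False], repeat=2))
    return joint(a, b, c) / p_of_a


if __name__ == "__main__":
    total = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))
    print("sum over all assignments:", total)
    # No edge between B and C, so P(B, C | A) factors into P(B | A) P(C | A).
    lhs = conditional_bc_given_a(True, True, False)
    rhs = p_b_given_a[True][True] * p_c_given_a[True][False]
    print(lhs, rhs)
```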
26
Graphical Models for Role and Relation Extraction
Static
Dynamic
27
Graphical Models
  • Relation node
  • Semantic relation (cure, prevent, none..)
    expressed in the sentence

28
Graphical Models
  • Role nodes
  • 3 choices: treatment, disease, or none

29
Graphical Models
  • Feature nodes (observed)
  • word, POS, MeSH

30
Graphical Models
  • Joint probability distribution over relation,
    roles and features nodes
  • Parameters estimated with maximum likelihood and
    absolute discounting smoothing
  • Task: Find P(Role | observable features) and
    P(Relation | observable features)
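
To make the task P(Relation | observable features) concrete, here is a deliberately simplified sketch: a naive-Bayes-style model in which each feature depends only on the relation. This is an assumption for illustration and is not one of the models from the talk, which also include role nodes. All probabilities are invented.

```python
import math

# Hypothetical parameters; in practice they would be estimated from labeled data.
prior = {"cure": 0.6, "prevent": 0.4}
likelihood = {
    "cure": {"ameliorated": 0.30, "vaccine": 0.05},
    "prevent": {"ameliorated": 0.05, "vaccine": 0.40},
}


def posterior(features: list) -> dict:
    """Return P(relation | features) assuming features are independent given the relation."""
    log_scores = {
        rel: math.log(p) + sum(math.log(likelihood[rel][f]) for f in features)
        for rel, p in prior.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {rel: math.exp(s - top) for rel, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {rel: v / z for rel, v in unnormalized.items()}


if __name__ == "__main__":
    print(posterior(["vaccine"]))
    print(posterior(["ameliorated"]))
```

The predicted relation is the one with the highest posterior; with these toy numbers the feature "vaccine" favors "prevent" and "ameliorated" favors "cure".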

31
Neural Networks
  • Feed-forward network (MATLAB)
  • Same features

32
Relation extraction
  • Results in terms of classification accuracy (with
    and without irrelevant sentences)
  • 2 cases
  • Roles hidden
  • Roles given

33
Relation classification: Results
34
Relation classification: Results
35
Role extraction
  • Results in terms of F-measure
  • NN: Couldn't run it (feature vectors too large)
  • Graphical models can do role extraction and
    relationship classification simultaneously
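
For reference, the F-measure used to report role extraction combines precision and recall. The sketch below assumes the balanced form (beta = 1) unless another weight is passed; the slides do not say which weighting was used, and the counts in the example are invented.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0  # no correct predictions: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 8 role tokens found correctly, 2 spurious, 2 missed.
    print(f_measure(tp=8, fp=2, fn=2))
```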

36
Role Extraction Results
  • F-measures

37
Feature impact: Role Extraction
  • Most important features: 1) Word, 2) MeSH
  • Model: Dynamic (rel. + irrel.)

Features       F-measure   Change
All features   0.71
No word        0.61        -14.1%
No MeSH        0.65        -8.4%
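
The change figures on this slide appear to be relative changes with respect to the all-features score. The sketch below shows that arithmetic under this assumption; the function name is hypothetical.

```python
def relative_change(baseline: float, ablated: float) -> float:
    """Percentage change of `ablated` relative to `baseline`."""
    return 100.0 * (ablated - baseline) / baseline


if __name__ == "__main__":
    # F-measures from the slide: all features 0.71, no word 0.61, no MeSH 0.65.
    print(round(relative_change(0.71, 0.61), 1))  # -14.1, as on the slide
    print(round(relative_change(0.71, 0.65), 2))
```

The second value comes out close to, but not exactly, the slide's -8.4, presumably because the slide's figure was computed from unrounded F-measures.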
38
Feature impact: Relation classification
  • Most important features: Roles
  • Accuracy (rel. + irrel.); relative change from the first row in parentheses

Features                     GM              NN
All feat. + roles            82.0            96.9
All feat. - roles            74.9 (-8.7%)    79.6 (-17.8%)
All feat. + roles - Word     79.8 (-2.8%)    96.4 (-0.5%)
All feat. + roles - MeSH     84.6 (+3.1%)    97.3 (+0.4%)
39
Feature impact: Relation classification
  • Most realistic case: Roles not known
  • Most important features: MeSH for NN and word
    for GM
  • Accuracy (rel. + irrel.); relative change from the first row in parentheses

Features                     GM              NN
All feat. - roles            74.9            79.6
All feat. - roles - Word     66.1 (-11.8%)   76.2 (-4.3%)
All feat. - roles - MeSH     72.5 (-3.2%)    74.1 (-6.9%)
40
Conclusions
  • Classification of subtle semantic relations in
    bioscience text
  • Discriminative model (neural network) achieves
    high classification accuracy
  • Graphical models for the simultaneous extraction
    of entities and relationships
  • Importance of lexical hierarchy
  • Future work
  • A new collection of disease/treatment data
  • Different entities/relations
  • Unsupervised learning to discover relation types

41
Thank you!
  • Barbara Rosario
  • Marti Hearst
  • SIMS, UC Berkeley
  • http://biotext.berkeley.edu

42
Additional slides
43
Several DIFFERENT Relations between the Same
Types of Entities
  • Thus differs from the problem statement of other
    work on relations
  • Many find one relation which holds between two
    entities (many based on ACE)
  • Agichtein and Gravano (2000), lexical patterns
    for location of
  • Zelenko et al. (2002) SVM for person affiliation
    and organization-location
  • Hasegawa et al. (ACL 2004) Person-Organization -
    President relation
  • Craven (1999, 2001) HMM for subcellular-location
    and disorder-association
  • Doesn't identify the actual relation

44
Related work: Bioscience
  • Many hand-built rules
  • Feldman et al. (2002),
  • Friedman et al. (2001)
  • Pustejovsky et al. (2002)
  • Saric et al. (this conference)

45
MUC: the genesis of IE
  • DARPA funded significant efforts in IE in the
    early to mid 1990s.
  • Message Understanding Conference (MUC) was an
    annual event/competition where results were
    presented.
  • Focused on extracting information from news
    articles
  • Terrorist events
  • Industrial joint ventures
  • Company management changes
  • Information extraction of particular interest to
    the intelligence community (CIA, NSA). (Note:
    early 90s)

46
Message Understanding Conference (MUC)
  • Named entity
  • Person, Organization, Location
  • Co-reference
  • Clinton <-> President Bill Clinton
  • Template element
  • Perpetrator, Target
  • Template relation
  • Incident
  • Multilingual

47
MUC: Typical Text
  • Bridgestone Sports Co. said Friday it has set up
    a joint venture in Taiwan with a local concern
    and a Japanese trading house to produce golf
    clubs to be shipped to Japan. The joint venture,
    Bridgestone Sports Taiwan Co., capitalized at 20
    million new Taiwan dollars, will start production
    of 20,000 iron and metal wood clubs a month

49
MUC Templates
  • Relationship: tie-up
  • Entities: Bridgestone Sports Co, a local concern, a
    Japanese trading house
  • Joint venture company: Bridgestone Sports Taiwan Co
  • Activity: ACTIVITY 1
  • Amount: NT$20,000,000

50
MUC Templates
  • ACTIVITY 1
  • Activity: Production
  • Company: Bridgestone Sports Taiwan Co
  • Product: Iron and metal wood clubs
  • Start Date: January 1990

51
Graphical Models
  • Different dependencies between the features and
    the relation nodes

52
Relation classification: Confusion Matrix
  • Computed for the model D2, rel. + irrel., only
    features

53
Smoothing: absolute discounting
  • Lower the probability of seen events by
    subtracting a constant from their count (ML
    estimate)
  • The remaining probability is evenly divided among
    the unseen events
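
The two bullets above can be sketched as follows. This is a minimal illustration that assumes a discount between 0 and 1, counts of at least 1 for seen events, and a known, finite set of possible events; the function name and the example counts are hypothetical.

```python
def absolute_discounting(counts: dict, vocabulary: list, discount: float = 0.5) -> dict:
    """Smooth maximum-likelihood estimates by absolute discounting.

    Each seen event loses `discount` from its count; the freed probability
    mass is shared evenly among the events in `vocabulary` that were not seen.
    """
    total = sum(counts.values())
    if total == 0:
        # Nothing observed: fall back to a uniform distribution.
        return {event: 1.0 / len(vocabulary) for event in vocabulary}
    seen = [e for e in vocabulary if counts.get(e, 0) > 0]
    unseen = [e for e in vocabulary if counts.get(e, 0) == 0]
    if not unseen:
        # No unseen events to receive mass: keep the ML estimates.
        return {e: counts[e] / total for e in seen}
    reserved = discount * len(seen) / total
    probs = {e: (counts[e] - discount) / total for e in seen}
    probs.update({e: reserved / len(unseen) for e in unseen})
    return probs


if __name__ == "__main__":
    relations = ["cure", "prevent", "vague", "side_effect"]
    print(absolute_discounting({"cure": 3, "prevent": 1}, relations, discount=0.5))
```

With these toy counts, "cure" drops from 0.75 to 0.625 and "prevent" from 0.25 to 0.125, and the 0.25 that is freed is split equally between the two unseen relations.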

54
F-measures for role extraction as a function of
smoothing factors
55
Relation accuracies as a function of smoothing
factors