Title: Classifying Semantic Relations in Bioscience Texts
1Classifying Semantic Relations in Bioscience Texts
- Barbara Rosario
- Marti Hearst
- SIMS, UC Berkeley
- http://biotext.berkeley.edu
- Supported by NSF DBI-0317510 and a gift from
Genentech
2Natural Language Processing
- Goal: deep understanding of broad language
- It'd be great if machines could
- Translate for us
- Write up our research
- Find out information for us
- Summarize
- But they can't
- Language is ambiguous, flexible, complex, subtle
3NLP in practice
- Syntactic analysis
- Part-of-Speech Tagging
- Parsing
- Shallow parsing
- Applications
- Text Classification
- (sort of) Question Answering
- Spelling Correction
- (sort of) Machine Translation
- Information retrieval
- Information Extraction
4Information Extraction
- Identification and classification of small units
within documents
5Extracting Job Openings from the Web
6What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
7What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
8What is Information Extraction
As a family of techniques
Information Extraction = segmentation + classification + association
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft,
Bill Veghte, Microsoft, VP, Richard Stallman, founder,
Free Software Foundation
aka "named entity extraction"
11Semantic Roles
- Define roles to be extracted
- Application dependent
- JobTitle, Employer, JobCategory, JobLocation
- But we would like them to be more general
- Linguistic theories: granularity of roles
- Proto-agent, proto-patient
- Fillmore's case theory has 9 roles (agent,
patient, location, experiencer, etc.)
- Extreme view: each verb has its own set of roles
- Buyer, bought_thing, seller, sold_thing
- Middle view: roles are particular to a semantic
frame (like transaction)
12Roles in the Biomedical domain
- Treatment and Disease
- A two-dose combined hepatitis A and B vaccine
would facilitate immunization programs
- Proteins
- A caveolin-1-dependent coupling of PrPc to
the tyrosine kinase Fyn was observed
13Relations
- Person-affiliation: Affiliation(Gates, Microsoft) = CEO
- Location: Location(Microsoft) = Redmond
- Protein1 inhibits (or activates, releases)
protein2
14Problem: Which relations hold between 2 entities?
Cure?
Prevent?
Side Effect?
15Hepatitis Examples
- Cure
- These results suggest that con A-induced
hepatitis was ameliorated by pretreatment with
TJ-135.
- Prevent
- A two-dose combined hepatitis A and B vaccine
would facilitate immunization programs
- Vague
- Effect of interferon on hepatitis B
16Two tasks
- Relationship Extraction
- Identify the several semantic relations that can
occur between the entities disease and treatment
in bioscience text
- Entity extraction
- Related problem: identify such entities
- Much of the important, late-breaking bioscience
information is found only in textual form.
- We need both tasks to extract useful information
from text and to make inferences
17The Approach
- Data: MEDLINE abstracts and titles
- Collection of 4,600 biomedical journals
- Graphical models and neural networks
- Lexical, syntactic and semantic features
18Data and Relations
- MEDLINE, abstracts and titles
- 3662 sentences labeled
- Relevant: 1724
- Irrelevant: 1771
- e.g., "Patients were followed up for 6 months"
- 2 types of entities, many instances
- treatment and disease
- 7 relationships between these entities
The labeled data is available at
http://biotext.berkeley.edu
19Semantic Relationships
- 810 Cure
- Intravenous immune globulin for recurrent
spontaneous abortion
- 616 Only Disease
- Social ties and susceptibility to the common cold
- 166 Only Treatment
- Fluticasone propionate is safe in recommended
doses
- 63 Prevent
- Statins for prevention of stroke
20Semantic Relationships
- 36 Vague
- Phenylbutazone and leukemia
- 29 Side Effect
- Malignant mesodermal mixed tumor of the uterus
following irradiation
- 4 Does NOT cure
- Evidence for double resistance to permethrin and
malathion in head lice
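The per-relation counts on the two Semantic Relationships slides can be checked against the totals on the Data and Relations slide. The sketch below uses only numbers shown on the slides. The majority-class baseline is an added illustration, not a result reported in the talk.

```python
# Relation counts as listed on the two "Semantic Relationships" slides.
relation_counts = {
    "Cure": 810,
    "Only Disease": 616,
    "Only Treatment": 166,
    "Prevent": 63,
    "Vague": 36,
    "Side Effect": 29,
    "Does NOT cure": 4,
}
irrelevant = 1771  # from the "Data and Relations" slide

# The seven relation counts should add up to the number of relevant sentences.
relevant = sum(relation_counts.values())

# Illustrative majority-class baselines (an assumption, not from the talk):
# always predict the most frequent label.
baseline_relevant_only = max(relation_counts.values()) / relevant
baseline_with_irrelevant = max(irrelevant, *relation_counts.values()) / (
    relevant + irrelevant
)

print(relevant)
print(round(baseline_relevant_only, 3))
print(round(baseline_with_irrelevant, 3))
```

The strong skew toward Cure and Only Disease means a classifier has to beat a baseline of roughly one half before its accuracy says much.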
21Features
- Word
- Part of speech
- Phrase constituent
- Orthographic features
- is number, all letters are capitalized,
first letter is capitalized
- MeSH (semantic features)
- Replace words, or sequences of words, with
generalizations via MeSH categories
- Peritoneum -> Abdomen
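A minimal sketch of the three orthographic features named above, in Python. The function name, the feature keys, and the exact definition of "is number" (a regular expression for digits with optional separators) are assumptions made for illustration; the talk does not specify them.

```python
import re

# Assumption: a "number" is a run of digits with optional , or . separators.
_NUMBER = re.compile(r"[+-]?\d+(?:[.,]\d+)*")


def orthographic_features(token: str) -> dict:
    """Return the three orthographic cues listed on the slide for one token."""
    letters = [ch for ch in token if ch.isalpha()]
    return {
        "is_number": _NUMBER.fullmatch(token) is not None,
        # True only if the token has letters and every letter is upper case.
        "all_caps": bool(letters) and all(ch.isupper() for ch in letters),
        "init_cap": token[:1].isupper(),
    }


for tok in ("TJ-135", "Hepatitis", "vaccine", "4,600"):
    print(tok, orthographic_features(tok))
```

Features like these help with drug and protein names such as TJ-135, which are rare as words but have a recognizable surface shape.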
22Features (cont.) MeSH
- MeSH Tree Structures
- 1. Anatomy A
- 2. Organisms B
- 3. Diseases C
- 4. Chemicals and Drugs D
- 5. Analytical, Diagnostic and Therapeutic
Techniques and Equipment E
- 6. Psychiatry and Psychology F
- 7. Biological Sciences G
- 8. Physical Sciences H
- 9. Anthropology, Education, Sociology and
Social Phenomena I
- 10. Technology and Food and Beverages J
- 11. Humanities K
- 12. Information Science L
- 13. Persons M
- 14. Health Care N
- 15. Geographic Locations Z
23Features (cont.) MeSH
- 1. Anatomy A
- Body Regions A01
- Musculoskeletal System A02
- Digestive System A03
- Respiratory System A04
- Urogenital System A05
- Endocrine System A06
- Cardiovascular System A07
- Nervous System A08
- Sense Organs A09
- Tissues A10
- Cells A11
- Fluids and Secretions A12
- Animal Structures A13
- Stomatognathic System A14
- (..)
- Body Regions A01
- Abdomen A01.047
- Groin A01.047.365
- Inguinal Canal A01.047.412
- Peritoneum A01.047.596
- Umbilicus A01.047.849
- Axilla A01.133
- Back A01.176
- Breast A01.236
- Buttocks A01.258
- Extremities A01.378
- Head A01.456
- Neck A01.598
- (.)
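Because a MeSH tree number encodes the path from the root, a term can be generalized by truncating its number. The sketch below illustrates this with a few entries copied from the slide. The function name and the `levels` parameter are hypothetical, and a real system would load the full MeSH tree rather than this toy table.

```python
# Tree numbers copied from the slide (a small excerpt of the Anatomy branch).
MESH_TREE = {
    "Body Regions": "A01",
    "Abdomen": "A01.047",
    "Groin": "A01.047.365",
    "Inguinal Canal": "A01.047.412",
    "Peritoneum": "A01.047.596",
    "Umbilicus": "A01.047.849",
    "Axilla": "A01.133",
    "Back": "A01.176",
}
NAME_BY_NUMBER = {number: name for name, number in MESH_TREE.items()}


def generalize(term, levels=2):
    """Truncate a term's tree number to `levels` components.

    Returns (truncated_number, name_of_that_node), or None if the term
    is not in the toy table.
    """
    number = MESH_TREE.get(term)
    if number is None:
        return None
    truncated = ".".join(number.split(".")[:levels])
    return truncated, NAME_BY_NUMBER.get(truncated)


print(generalize("Peritoneum"))        # second level of the hierarchy
print(generalize("Groin", levels=1))   # top-level category
```

The choice of `levels` trades specificity for coverage: a shallow level groups more words under one feature value, while a deep level keeps distinctions that may matter for some relations.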
24Models
- Graphical Models (static and dynamic)
- Neural networks
25Graphical Models
- Graph theory plus probability theory
- Nodes are variables
- Edges are conditional probabilities
- Absence of an edge between nodes implies
conditional independence between the variables of
the nodes
Example with nodes A, B, C: P(A, B, C) = P(A) P(B|A) P(C|A)
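The factorization P(A) P(B|A) P(C|A) and the independence it implies can be checked numerically. This is a minimal sketch for three binary variables. The probability tables are made-up numbers chosen for illustration.

```python
from itertools import product

# Toy conditional probability tables for binary variables (invented values).
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_a = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}  # p_c_given_a[a][c]


def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b|a) * P(c|a)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]


def prob(a=None, b=None, c=None):
    """Marginal probability of an assignment; unset variables are summed out."""
    return sum(
        joint(x, y, z)
        for x, y, z in product((0, 1), repeat=3)
        if (a is None or x == a) and (b is None or y == b) and (c is None or z == c)
    )


def b_and_c_independent_given_a(tol=1e-12):
    """Check P(B, C | A) == P(B | A) * P(C | A) for every assignment."""
    for a, b, c in product((0, 1), repeat=3):
        lhs = prob(a, b, c) / prob(a)
        rhs = (prob(a, b) / prob(a)) * (prob(a=a, c=c) / prob(a))
        if abs(lhs - rhs) > tol:
            return False
    return True


print(prob())                          # the joint sums to 1 (up to rounding)
print(b_and_c_independent_given_a())
```

With no edge between B and C, the check succeeds for any choice of tables, which is what the missing edge expresses.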
26Graphical Models for Role and Relation Extraction
Static
Dynamic
27Graphical Models
- Relation node
- Semantic relation (cure, prevent, none..)
expressed in the sentence
28Graphical Models
- Role nodes
- 3 choices: treatment, disease, or none
29Graphical Models
- Feature nodes (observed)
- word, POS, MeSH
30Graphical Models
- Joint probability distribution over relation,
role and feature nodes
- Parameters estimated with maximum likelihood and
absolute discounting smoothing
- Task: find P(Role | observable features)
- P(Relation | observable features)
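To make P(Relation | observable features) concrete, here is a minimal sketch under a strong simplifying assumption: each feature depends only on the relation node (a naive-Bayes-like static structure), and there are no role nodes. This is not necessarily any of the models in the talk. The training pairs are invented, and the estimates are unsmoothed maximum likelihood.

```python
from collections import Counter, defaultdict

# Invented training pairs: (relation label, observed features of a sentence).
training = [
    ("cure", ["treat", "ameliorated"]),
    ("cure", ["treat", "effective"]),
    ("prevent", ["vaccine", "prevention"]),
    ("prevent", ["vaccine", "effective"]),
]

relation_counts = Counter(label for label, _ in training)
feature_counts = defaultdict(Counter)
for label, feats in training:
    feature_counts[label].update(feats)


def posterior(features):
    """P(Relation | features) via Bayes' rule with maximum likelihood estimates."""
    scores = {}
    for label, n in relation_counts.items():
        score = n / len(training)                      # P(relation)
        total = sum(feature_counts[label].values())
        for f in features:
            score *= feature_counts[label][f] / total  # P(feature | relation)
        scores[label] = score
    z = sum(scores.values())
    # If every relation scores zero, return the unnormalized zeros.
    return {label: s / z for label, s in scores.items()} if z else scores


print(posterior(["effective"]))
print(posterior(["effective", "vaccine"]))
```

Note that "vaccine" was never seen with "cure", so the unsmoothed estimate rules that relation out entirely. This is the situation the absolute discounting smoothing mentioned above is meant to soften.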
31Neural Networks
- Feed-forward network (MATLAB)
- Same features
32Relation extraction
- Results in terms of classification accuracy (with
and without irrelevant sentences)
- 2 cases:
- Roles hidden
- Roles given
33Relation classification Results
34Relation classification Results
35Role extraction
- Results in terms of F-measure
- NN: couldn't run it (feature vectors too large)
- Graphical models can do role extraction and
relationship classification simultaneously
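For reference, the F-measure combines precision and recall into one score. The sketch below assumes the balanced form (beta = 1), since the slides do not say which weighting was used. The counts in the example are hypothetical.

```python
def f_measure(true_pos, false_pos, false_neg, beta=1.0):
    """F-measure from raw counts; beta=1 weights precision and recall equally."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 role tokens found correctly, 2 spurious, 4 missed.
print(round(f_measure(8, 2, 4), 3))
```

For role extraction the counts would be taken over tokens labeled treatment or disease, so a model that labels everything "none" scores zero despite high raw accuracy.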
36Role Extraction Results
37Features impact Role Extraction
- Most important features: 1) Word, 2) MeSH
- Models: Dynamic
Features       F-measure   Change
All features   0.71
No word        0.61        -14.1%
No MeSH        0.65        -8.4%
(rel. + irrel.)
38Features impact Relation classification
- Most important features: Roles
Accuracy                      GM      NN
All feat. + roles             82.0    96.9
All feat. - roles             74.9    79.6
                              -8.7%   -17.8%
All feat. + roles - Word      79.8    96.4
                              -2.8%   -0.5%
All feat. + roles - MeSH      84.6    97.3
                              +3.1%   +0.4%
(rel. + irrel.)
39Features impact Relation classification
- Most realistic case: Roles not known
- Most important features: MeSH for NN and Word for GM
Accuracy                      GM      NN
All feat. - roles             74.9    79.6
All feat. - roles - Word      66.1    76.2
                              -11.8%  -4.3%
All feat. - roles - MeSH      72.5    74.1
                              -3.2%   -6.9%
(rel. + irrel.)
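The signed numbers under each ablation row appear to be relative changes in accuracy with respect to the all-features row. That reading is an assumption, but recomputing from the rounded accuracies on this slide agrees with the reported figures to within 0.1 percentage points. A minimal sketch:

```python
def relative_change(baseline, ablated):
    """Percentage change of `ablated` relative to `baseline`."""
    return 100.0 * (ablated - baseline) / baseline


# Accuracies from the slide (roles not known).
baseline = {"GM": 74.9, "NN": 79.6}
without_word = {"GM": 66.1, "NN": 76.2}
without_mesh = {"GM": 72.5, "NN": 74.1}

changes = {
    (name, model): relative_change(baseline[model], ablated[model])
    for name, ablated in (("Word", without_word), ("MeSH", without_mesh))
    for model in ("GM", "NN")
}

for (name, model), change in changes.items():
    print(f"without {name}, {model}: {change:+.1f}%")
```

Small differences in the last digit are expected if the slide values were computed from unrounded accuracies.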
40Conclusions
- Classification of subtle semantic relations in
bioscience text
- Discriminative model (neural network) achieves
high classification accuracy
- Graphical models for the simultaneous extraction
of entities and relationships
- Importance of lexical hierarchy
- Future work
- A new collection of disease/treatment data
- Different entities/relations
- Unsupervised learning to discover relation types
41Thank you!
- Barbara Rosario
- Marti Hearst
- SIMS, UC Berkeley
- http://biotext.berkeley.edu
42Additional slides
43Several DIFFERENT Relations between the Same
Types of Entities
- Thus differs from the problem statement of other
work on relations
- Many find one relation which holds between two
entities (many based on ACE)
- Agichtein and Gravano (2000): lexical patterns
for location of
- Zelenko et al. (2002): SVM for person affiliation
and organization-location
- Hasegawa et al. (ACL 2004): Person-Organization ->
President relation
- Craven (1999, 2001): HMM for subcellular-location
and disorder-association
- Doesn't identify the actual relation
44Related work Bioscience
- Many hand-built rules
- Feldman et al. (2002)
- Friedman et al. (2001)
- Pustejovsky et al. (2002)
- Saric et al., this conference
45MUC the genesis of IE
- DARPA funded significant efforts in IE in the
early to mid 1990s.
- Message Understanding Conference (MUC) was an
annual event/competition where results were
presented.
- Focused on extracting information from news
articles:
- Terrorist events
- Industrial joint ventures
- Company management changes
- Information extraction of particular interest to
the intelligence community (CIA, NSA). (Note:
early 90s)
46Message Understanding Conference (MUC)
- Named entity
- Person, Organization, Location
- Co-reference
- Clinton <-> President Bill Clinton
- Template element
- Perpetrator, Target
- Template relation
- Incident
- Multilingual
47MUC Typical Text
- Bridgestone Sports Co. said Friday it has set up
a joint venture in Taiwan with a local concern
and a Japanese trading house to produce golf
clubs to be shipped to Japan. The joint venture,
Bridgestone Sports Taiwan Co., capitalized at 20
million new Taiwan dollars, will start production
of 20,000 iron and metal wood clubs a month
49MUC Templates
- Relationship
- tie-up
- Entities
- Bridgestone Sports Co, a local concern, a
Japanese trading house
- Joint venture company
- Bridgestone Sports Taiwan Co
- Activity
- ACTIVITY 1
- Amount
- NT$20,000,000
50MUC Templates
- ACTIVITY 1
- Activity
- Production
- Company
- Bridgestone Sports Taiwan Co
- Product
- Iron and metal wood clubs
- Start Date
- January 1990
51Graphical Models
- Different dependencies between the features and
the relation nodes
52Relation classification Confusion Matrix
- Computed for the model D2, rel. + irrel., only
features
53Smoothing absolute discounting
- Lower the probability of seen events by
subtracting a constant from their count (ML
estimate)
- The remaining probability is evenly divided
among the unseen events
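A minimal sketch of absolute discounting as described in the two bullets above. The function name, the explicit `vocabulary` argument, and the default constant `delta=0.5` are assumptions for illustration; the talk's own formula and smoothing factors are not recoverable from the slide text.

```python
def absolute_discounting(counts, vocabulary, delta=0.5):
    """Subtract `delta` from each seen count and share the freed
    probability mass evenly among the unseen events in `vocabulary`.

    Assumes integer counts, 0 < delta < 1, and that every key of
    `counts` is in `vocabulary`.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observation")
    seen = [e for e in vocabulary if counts.get(e, 0) > 0]
    unseen = [e for e in vocabulary if counts.get(e, 0) == 0]
    if not unseen:
        # Nothing to redistribute to: fall back to the ML estimate.
        return {e: counts[e] / total for e in seen}
    freed = delta * len(seen) / total
    probs = {e: (counts[e] - delta) / total for e in seen}
    probs.update({e: freed / len(unseen) for e in unseen})
    return probs


counts = {"treat": 3, "cure": 1}
vocab = ["treat", "cure", "vaccine", "dose"]
smoothed = absolute_discounting(counts, vocab)
print(smoothed)
print(sum(smoothed.values()))
```

Larger values of `delta` move more mass to unseen events, which is the trade-off explored when results are plotted against the smoothing factor.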
54F-measures for role extraction as a function of
smoothing factors
55Relation accuracies as a function of smoothing
factors