Title: The BioText Project
1 The BioText Project
- Myers Seminar
- Sept 22, 2003
- Marti Hearst
- Associate Professor
- SIMS, UC Berkeley
- Project sponsored by NSF DBI-0317510,
ARDA AQUAINT, and a gift from Genentech
2 BioText Project Goals
- Provide fast, flexible, intelligent access to
information for use in biosciences applications.
- Focus on
- Textual Information
- Tightly integrated with other resources
- Ontologies
- Record-based databases
3 People
- Project Leaders
- PI: Marti Hearst; Co-PI: Adam Arkin
- Computational Linguistics
- Barbara Rosario
- Preslav Nakov
- Database Research
- Ariel Schwartz
- Gaurav Bhalotia (graduated)
- User Interface / Information Retrieval
- Kevin Li
- Emilia Stoica
- Bioscience
- Dr. TingTing Zhang
4 Outline
- Main Goals
- System Architecture
- Apoptosis problem statement
- Recent results in
- Abbreviation definition recognition
- Semantic relation recognition (from text)
- Search User Interfaces
- Hierarchical grouping of journals
5 BioText Main Goals
Sophisticated Text Analysis
Annotations in Database
Improved Search Interface
6 Recent Result (Schwartz & Hearst 03)
- Fast, simple algorithm for recognizing
abbreviation definitions.
- Simpler and faster than the rest
- Higher precision and recall
- Idea: Work backwards from the end
- Examples
- In eukaryotes, the key to transcriptional
regulation of the Heat Shock Response is the Heat
Shock Transcription Factor (HSF).
- Gcn5-related N-acetyltransferase (GNAT)
- Idea: use redundancy across abstracts to figure
out abbreviation meaning even when definition is
not present.
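The "work backwards from the end" idea can be sketched in a few lines of Python. This is a minimal sketch, not the published implementation: the function names `find_long_form` and `extract_definitions` are hypothetical, the candidate window of min(|A| + 5, 2|A|) words before the parenthesis is an assumption about a reasonable search span, and validity checks on the short form are left out.

```python
import re


def find_long_form(short_form, candidate):
    """Scan right to left, matching each short-form character in `candidate`.

    Returns the matching long form, or None if some character has no match.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches. The first character of the
        # short form must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Extend the match to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_definitions(sentence):
    """Pair each parenthesized short form with a long form found before it."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short = match.group(1).strip()
        if not short:
            continue
        words = sentence[: match.start()].split()
        # Assumed search window: only a few words before the parenthesis.
        window = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(words[-window:])
        long_form = find_long_form(short, candidate)
        if long_form:
            pairs.append((short, long_form))
    return pairs
```

Applied to the two example sentences on this slide, `extract_definitions` pairs HSF with "Heat Shock Transcription Factor" and GNAT with "Gcn5-related N-acetyltransferase". The second idea on the slide (using redundancy across abstracts when no definition is present) is not covered by this sketch.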
7 BioText: A Two-Sided Approach
Empirical Computational Linguistics Algorithms
Sophisticated Database Design Algorithms
8 Apoptosis Network
Survival Factors Signaling
Genotoxic Stress
Loss of Attachment, Cell Cycle stress, etc.
ER Stress
Initiator Caspases (8, 10)
P53 pathway
BH3 only
Bcl-2 like
NFkB
Bax, Bak
Mitochondria Cytochrome c
Smac
Caspase 12
IAPs
Apaf 1
AIF
Caspase 9
Apoptosis
Slide courtesy TingTing Zhang
9 The issues (courtesy TingTing Zhang)
- The network nodes are deduced from reading and
processing of experimental knowledge by experts.
Every month >1000 apoptosis papers are published.
- The supporting experimental data are gathered in
different organs, tissues, cells using various
techniques.
- There are various levels of uncertainty
associated with different techniques used to
answer certain questions.
- Depending on the expression patterns for the
players in the network, the observation may or
may not be extended to other contexts.
- We need to keep track of ALL the information in
order to understand the system better.
10 Simple cases
- Mouse Bim proteins (isoforms EL, L, S) bind to
human Bcl-2 (bacteriophage screening using cDNA
expression library from T-Lymphoma cell line
KO52DA20).
- Human BimEL protein is 89% identical to mouse
BimEL; human BimL is 85% identical to mouse BimL
(hybridization of mouse bim cDNA to human fetal
spleen and peripheral blood cDNA library).
- Bim mRNA is detected in B and T lymphoid cells
(Northern blot analysis of mouse KO52DA20, WEHI
703, WEHI 707, WEHI7.1, CH1, WEHI231, WEHI415,
B6.23.16BW2 cell extracts).
- BimL protein interacts with Bcl-2 OR Bcl-XL OR
Bcl-w proteins (immunoprecipitation (anti-Bcl-2
OR Bcl-XL OR Bcl-w) followed by Western blot
(anti-EEtag) using extracts of human 293T cells
co-transfected with EE-tagged BimL AND (bcl-2 OR
bcl-XL OR bcl-w) plasmids).
- BimL deleted of the BH3 domain does not bind to
Bcl-2 OR Bcl-XL OR Bcl-w proteins (under
experimental conditions mentioned above).
11 Computational Language Goals
- Recognizing and annotating entities within
textual documents
- Identifying semantic relations among entities
- To (eventually) be used in tandem with
semi-automated reasoning systems.
12 Main Ideas for NLP Approach
- Assign Semantics using
- Statistics
- Hierarchical Lexical Ontologies to generalize
- Redundancy in the data
- Build up Layers of Representation
- Syntactic and Semantic
- Use these in a feedback loop
13 Computational Linguistics Goals
- Mark up text with semantic relations
14 Recent Result: Descent of Hierarchy
- Idea
- Use the top levels of a lexical hierarchy to
identify semantic relations
- Hypothesis
- A particular semantic relation holds between all
2-word Noun Compounds that can be categorized by
a MeSH pair.
15 Definition
- NC: Any sequence of nouns that itself functions
as a noun
- asthma hospitalizations
- health care personnel hand wash
- Technical text is rich with NCs
- Open-labeled long-term study of the
subcutaneous sumatriptan efficacy and
tolerability in acute migraine treatment.
16 NCs: Three tasks
- Identification
- Syntactic analysis (attachments)
- Baseline headache frequency
- Tension headache patient
- Our Goal: Semantic analysis
- Headache treatment → treatment for
headache
- Corticosteroid treatment → treatment that
uses corticosteroid
17 Main Idea
- Top-level MeSH categories can be used to indicate
which relations hold between noun compounds
- headache recurrence
- C23.888.592.612.441 C23.550.291.937
- headache pain
- C23.888.592.612.441 G11.561.796.444
- breast cancer cells
- A01.236 C04 A11
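Forming a category pair from two MeSH tree codes can be sketched as below. This is an illustrative sketch only: the helper names `truncate` and `category_pair` are hypothetical, and treating each dot-separated component as one level of the hierarchy is an assumption about how the descent is counted.

```python
def truncate(code, level):
    """Keep the first component of a MeSH tree code plus `level` more.

    Assumption: each dot-separated component counts as one hierarchy level.
    """
    return ".".join(code.split(".")[: level + 1])


def category_pair(modifier_code, head_code, mod_level=0, head_level=0):
    """Return the (modifier, head) category pair at the requested depths."""
    return (truncate(modifier_code, mod_level), truncate(head_code, head_level))


# The two-word examples from the slide, cut at the top level.
headache_recurrence = category_pair("C23.888.592.612.441", "C23.550.291.937")
headache_pain = category_pair("C23.888.592.612.441", "G11.561.796.444")

# Descending one level on both sides gives a finer-grained pair.
finer = category_pair(
    "C23.888.592.612.441", "C23.550.291.937", mod_level=1, head_level=1
)
```

At the top level the two compounds fall into the pairs (C23, C23) and (C23, G11), so they can be assigned different relations without looking at the full codes. Descending further is only needed when one top-level pair mixes several relations.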
18 Linguistic Motivation
- Can cast NC into head-modifier relation, and
assume head noun has an argument and qualia
structure.
- (used-in) kitchen knife
- (made-of) steel knife
- (instrument-for) carving knife
- (used-on) putty knife
- (used-by) butcher's knife
19 Distribution of Frequent Category Pairs
20 How Far to Descend?
- Anatomy: 250 category pairs (CPs)
- 187 (75%) remain first level
- 56 (22%) descend one level
- 7 (3%) descend two levels
- Natural Science (H01): 21 CPs
- 1 (4%) remain first level
- 8 (39%) descend one level
- 12 (57%) descend two levels
- Neoplasm (C04): 3 CPs
- 3 (100%) descend one level
21 Evaluation
- Apply the rules to a test set
- Accuracy
- Anatomy: 91% accurate
- Natural Science: 79%
- Diseases: 100%
- Total
- 89.6% via intra-category averaging
- 90.8% via extra-category averaging
22 Summary of NC Work
- Lexical hierarchy useful for inferring semantic
relations
- Works because semantics are constrained and word
sense ambiguity is not too much of a problem
- Can it be extended to other types of relations?
- Preliminary results on one set of relations are
promising.
23 Database Research Issues
- Efficiently and effectively combining
- Relational databases & Text
- Hierarchical Ontologies
- Layers of Annotations
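One way to picture layers of annotations stored alongside text in a relational database is a stand-off table of character spans. The sketch below uses Python's built-in sqlite3 module. The schema, the table and column names, and the `annotated_text` helper are all hypothetical and chosen only for illustration. They are not the project's actual design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document (
        doc_id INTEGER PRIMARY KEY,
        body   TEXT NOT NULL
    );
    -- Hypothetical stand-off annotations: each row marks a character span
    -- of a document and says which layer produced it.
    CREATE TABLE annotation (
        doc_id    INTEGER NOT NULL REFERENCES document(doc_id),
        layer     TEXT    NOT NULL,
        start_off INTEGER NOT NULL,
        end_off   INTEGER NOT NULL,
        label     TEXT    NOT NULL
    );
    """
)

body = "Gcn5-related N-acetyltransferase (GNAT)"
long_form = "Gcn5-related N-acetyltransferase"
conn.execute("INSERT INTO document VALUES (1, ?)", (body,))

short_start = body.index("GNAT")
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
    [
        (1, "abbreviation", 0, len(long_form), "long_form"),
        (1, "abbreviation", short_start, short_start + 4, "short_form"),
    ],
)


def annotated_text(conn, layer):
    """Return (label, covered text) for every span in one annotation layer."""
    rows = conn.execute(
        # SQLite's substr() is 1-indexed, hence the + 1 on the start offset.
        "SELECT a.label, substr(d.body, a.start_off + 1, a.end_off - a.start_off) "
        "FROM annotation a JOIN document d USING (doc_id) "
        "WHERE a.layer = ? ORDER BY a.start_off",
        (layer,),
    )
    return rows.fetchall()
```

Keeping the spans separate from the text means a new analysis layer (for example, semantic relations) can be added as more rows without rewriting the documents. Linking labels to an ontology would need an additional table, which this sketch leaves out.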
24 Interface Issues
- Create intuitive, appealing interfaces that are
better than what's currently out there.
- Start with existing assigned metadata
- As text analysis improves, incorporate the
results into the interface.
32 Some Recent Work
- Organizing BioScience Journal Names
- Currently there are > 3500
- Idea
- Group them into faceted hierarchies
semi-automatically
- Using clustering of title terms, synonym
similarity via WordNet, and other techniques
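Grouping journal names by shared title terms can be illustrated with a very small sketch. It is not the project's method: it uses plain Jaccard overlap of title words and a greedy single pass, it produces flat groups rather than faceted hierarchies, and it leaves out WordNet synonym similarity entirely. The stopword list, the threshold, and the function names are assumptions made for the example.

```python
# Assumed stopword list: words too common in journal titles to be informative.
STOPWORDS = {"of", "the", "and", "in", "for", "journal"}


def title_terms(title):
    """Lowercased content words of a journal title."""
    return {word for word in title.lower().split() if word not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_titles(titles, threshold=0.3):
    """Greedily add each title to the first group holding a similar title."""
    groups = []
    for title in titles:
        terms = title_terms(title)
        for group in groups:
            if any(jaccard(terms, title_terms(other)) >= threshold for other in group):
                group.append(title)
                break
        else:
            groups.append([title])
    return groups


# A few journal names used purely as sample input.
sample = [
    "Journal of Cell Biology",
    "Journal of Molecular Biology",
    "Cancer Research",
    "Clinical Cancer Research",
]
groups = group_titles(sample)
```

With this sample the two biology titles end up in one group and the two cancer titles in another. A semi-automatic workflow would have a person review and name such groups before arranging them into facets.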
35 Summary
- BioText aims to improve access to bioscience
information via
- Sophisticated language analysis
- Integration of results into
- Annotated database
- Flexible user interface
- Eventual goal
- Semi-automated mining and discovery
36 There's lots to do!
For more information