Title: Text Mining for Bioscience Applications: The State of the Art
2. Text Mining for Bioscience Applications: The State of the Art
- Marti Hearst
- University of California, Berkeley
3. Outline
- Search vs. Discovery
- Why is text analysis difficult?
- Some current approaches
- Future directions
4. My Background
- Computer Scientist by training
- NOT a biologist
- Professor in an interdisciplinary program
- School of Information Management and Systems (SIMS)
- Affiliated with the UCSF Bioinformatics Grad Group
- Research fields are
- Computational Linguistics
- Search (Information Retrieval)
- User Interfaces and Information Visualization
- Have focused for a while on bioscience text
- Have received research support from Genentech
5. Search vs. Discovery
Search: Finding hay in a haystack
Discovery: Creating a new kind of hay
6. Search Goals
- More accurate results
- More comprehensive results
- Thesaurus expansion
- Intelligent summaries of results
- Organize results along biologically relevant lines
- Better user interfaces
7. Knowledge Discovery from Text
- How to discover new information
- As opposed to looking up what's already known.
- Method
- Create hypotheses
- Use large text collections to gather evidence to refute or support hypotheses
- Do lab tests to verify promising results
8. Discovery Goals
- Genomics
- Automatically build gene networks
- Discover gene functions
- Pharmacology
- Help determine which drugs can help cure a disease
- Help determine which genetic traits will lead to a reaction to a drug
- Etiology
- Discover underlying causes of disease
9. Why is Automated Text Analysis Difficult?
10. Why is automated text analysis difficult?
- Avastin, developed by South San Francisco-based Genentech (DNA), was approved for advanced colorectal cancer and for patients who haven't received other chemotherapy, according to the Food and Drug Administration.
- What is "approved" doing in this sentence?
- John was approved for advancement -> gets a promotion.
- Avastin was approved for cancer -> to fight cancer.
- Avastin was approved for patients -> to consume to fight cancer.
- What kind of patients is it approved for?
- Ambiguous. Could be for anyone who hasn't received chemotherapy, or only those patients with advanced colorectal cancer who haven't received chemotherapy.
11. Why is automated text analysis difficult?
- "This could easily be a multibillion-dollar drug," McCamant says.
- Refers to concepts mentioned in earlier sentences.
12. Why is automated text analysis difficult?
- "Avastin opens up this new gateway for cancer care," says William Li, president of the Angiogenesis Foundation in Massachusetts. "It's the first in a fleet of other drugs."
- Is Avastin a vehicle? It opens gateways and travels in a fleet!
13. Why is automated text analysis difficult?
- There are many indirect ways to say things
- A two-dose combined hepatitis A and B vaccine would facilitate immunization programs.
- The vaccine helps prevent hep B.
- These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
- The treatment TJ-135 helps cure hep.
- Effect of interferon on hepatitis B.
- There is an unspecified effect of interferon on hep B.
14. What do we do?
- Solve sub-problems
- Extract certain types of entities
- Gene/protein names
- Abbreviation definitions
- Classify the noun phrases using ontologies
- MeSH, LocusLink, GO, etc.
- Define relationship types and try to recognize them.
- Many other subproblems are actively being worked on
- Word sense disambiguation
- Co-reference resolution
15. Two Main Approaches
Hand-built Rules
Machine Learning
16. Two Main Approaches
- Hand-built rules
- Can be very accurate
- Are also very brittle
- Don't scale
- Machine learning
- Usually requires labeled training data
- Unsupervised methods under development
- Can be made to scale
- Is the way of the future
17. Abbreviation Definition Recognition
- A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Ariel Schwartz and Marti Hearst, PSB 2003, Kauai, Jan 2003
- Fast, simple algorithm for recognizing abbreviation definitions.
- Simpler and faster than the rest
- Other approaches are cubic or quadratic in time
- Higher precision and recall
- Idea: Work backwards from the end
- Examples
- In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
- Gcn5-related N-acetyltransferase (GNAT)
- In future
- Use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
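The "work backwards from the end" idea can be sketched in a few lines of Python. This is a simplified illustration in the spirit of the algorithm named on the slide, not the authors' released code: the function names are made up for this example, the candidate window size (a few words before the parenthesis) is an assumption, and the filters a real system would apply to reject bad short or long forms are left out.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text."""
    i = len(candidate) - 1
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():
            continue  # ignore punctuation inside the short form
        # Move left until the character matches. The first character of the
        # short form must also fall at the beginning of a word.
        while (i >= 0 and candidate[i].lower() != ch) or (
            s == 0 and i > 0 and candidate[i - 1].isalnum()
        ):
            i -= 1
        if i < 0:
            return None  # ran out of text before matching everything
        i -= 1
    # Extend the match to the start of the word that holds the first character.
    start = candidate.rfind(" ", 0, i + 1) + 1
    return candidate[start:]


def extract_abbreviations(sentence):
    """Return (short form, long form) pairs for 'long form (SHORT)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        if not any(c.isalpha() for c in short_form):
            continue  # e.g. "(1)" is not an abbreviation
        words = sentence[: match.start()].split()
        # Assumed window: only look at a few words before the parenthesis.
        window = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-window:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


print(extract_abbreviations(
    "In eukaryotes, the key to transcriptional regulation of the Heat Shock "
    "Response is the Heat Shock Transcription Factor (HSF)."
))
print(extract_abbreviations("Gcn5-related N-acetyltransferase (GNAT)"))
```

Each short-form character is matched in a single right-to-left pass over the candidate text, which is what keeps this kind of approach fast compared with alignment-style methods.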
18. Gene name co-occurrence
- A literature network of human genes for high-throughput analysis of gene expression. Jenssen TK, Laegreid A, Komorowski J, Hovig E. Nat Genet. 2001 May;28(1):21-8.
- PubGene Assumption
- If two genes are co-mentioned in a MEDLINE
record, there is an underlying biological
relationship.
Example: Genes highly upregulated at time point 6 h (6H) in the fibroblast serum response.
Green: upregulation. Red: downregulation.
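The co-mention assumption above translates directly into a counting procedure. The following is a minimal sketch, not the PubGene implementation: it assumes gene names have already been recognized in each record, and the gene symbols and records are invented placeholders.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(records):
    """Count, for every pair of genes, the records that mention both."""
    edges = Counter()
    for genes in records:
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(genes)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical records: each list holds the gene symbols found in one record.
records = [
    ["GENE_A", "GENE_B"],
    ["GENE_B", "GENE_A", "GENE_C"],
    ["GENE_C"],
]

for (a, b), weight in sorted(cooccurrence_network(records).items()):
    print(a, b, weight)
```

The edge weight (number of shared records) gives a natural threshold for trading precision against coverage.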
19. Gene name co-occurrence
- A literature network of human genes for high-throughput analysis of gene expression. Jenssen TK, Laegreid A, Komorowski J, Hovig E. Nat Genet. 2001 May;28(1):21-8.
- Evaluation
- 29-40% of the pairs were incorrect
- 45% of OMIM pairs found
- 51% of DIP pairs found (DIP: Database of Interacting Proteins)
20. How to find functions of genes?
- Have the genetic sequence
- Don't know what it does
- But
- Know which genes it coexpresses with
- Some of these have known function
- So infer function based on function of co-expressed genes
- This is a problem suggested by Michael Walker and others at Incyte Pharmaceuticals
21. Gene Co-expression: Role in the genetic pathway
[Figure: alternative pathway diagrams placing the unknown genes g? and h? relative to Kall., PSA, and PAP]
Other possibilities as well
22. Make use of the literature
- Look up what is known about the other genes.
- Different articles in different collections
- Look for commonalities
- Similar topics indicated by Subject Descriptors
- Similar words in titles and abstracts
- adenocarcinoma, neoplasm, prostate, prostatic
neoplasms, tumor markers, antibodies ...
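Looking for commonalities can be approximated by counting which descriptors recur across the literature of the co-expressed genes. A minimal sketch, assuming the subject descriptors for each known gene have already been collected; the gene labels and the assignment of descriptors to genes below are invented for illustration only.

```python
from collections import Counter


def shared_descriptors(descriptors_by_gene, min_genes=2):
    """Rank descriptors by how many distinct genes they are attached to."""
    counts = Counter()
    for terms in descriptors_by_gene.values():
        counts.update(set(terms))  # count each descriptor once per gene
    return [(term, n) for term, n in counts.most_common() if n >= min_genes]


# Hypothetical input: descriptors gathered from articles about each known gene.
descriptors_by_gene = {
    "known_gene_1": ["prostate", "tumor markers"],
    "known_gene_2": ["prostate", "antibodies"],
    "known_gene_3": ["prostate", "tumor markers"],
}

print(shared_descriptors(descriptors_by_gene))
```

Descriptors shared by most of the known genes are the ones worth carrying forward into a hypothesis about the unknown gene.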
23. Formulate a Hypothesis
- Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
- New tack: do some lab tests
- See if mystery gene is similar in molecular structure to the others
- If so, it might do some of the same things they do
24. Etiology Example
- Complementary structures in disjoint science literatures. Don R. Swanson. In Proceedings of SIGIR '91
- Goal: find cause of disease
- Magnesium-migraine connection
- Given
- medical titles and abstracts
- a problem (incurable rare disease)
- some medical expertise
- Find causal links among titles
- symptoms
- drugs
- results
25. Gathering Evidence
[Diagram labels: migraine]
26. Gathering Evidence
[Diagram labels: migraine, magnesium]
27. Swanson's Linking Approach
- Two of his hypotheses have received some experimental verification.
- His technique
- Only partially automated
- Required medical expertise
- Recently others have made progress automating it.
28. Automating Swanson-style Discovery
- Text Mining: Generating Hypotheses from MEDLINE, Padmini Srinivasan. To appear in JASIST.
- UMLS defines Semantic Types
- Every MeSH term is assigned one or more Semantic Types
- Interferon type II falls within both
- Immunologic Factor and
- Pharmacologic Substance
- Each PubMed article is assigned a set of MeSH terms
- The idea is to characterize a set of articles according to which semantic types their MeSH terms fall into.
29. Automating Swanson-style Discovery
- Text Mining: Generating Hypotheses from MEDLINE, Padmini Srinivasan. To appear in JASIST.
- Approach
- User inputs topic T of interest
- User selects 2 sets from a small number of sets of UMLS semantic types
- System
- Searches PubMed for articles about T
- Selects out the important MeSH terms as determined by the user-chosen semantic type categories
- Searches PubMed for articles that contain these MeSH terms
- Combines the MeSH terms that result from these retrieved documents
- Call this result C
- If a PubMed search on words from T together with a term c from C returns nothing, place c as a candidate in a final result set R
- Report those terms in R that fall into the second user-selected semantic type set.
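The steps above can be sketched over a small in-memory corpus. This is an illustration of the control flow only, under several assumptions: each article is reduced to a set of index terms, a set-membership filter stands in for a PubMed search, the "important terms" weighting is omitted, and all term names and semantic type labels are placeholders rather than real MeSH or UMLS entries.

```python
def docs_with(corpus, *terms):
    """Stand-in for a search: articles whose term set contains every term."""
    return [doc for doc in corpus if all(t in doc for t in terms)]


def open_discovery(corpus, sem_types, topic, link_types, target_types):
    def has_type(term, wanted):
        return bool(sem_types.get(term, set()) & wanted)

    # Step 1: terms from articles about the topic, filtered by the first type set.
    b_terms = {
        t for doc in docs_with(corpus, topic) for t in doc
        if t != topic and has_type(t, link_types)
    }
    # Step 2: pool the terms of articles retrieved for those linking terms (C).
    c_terms = {
        t for b in b_terms for doc in docs_with(corpus, b) for t in doc
        if t != topic and t not in b_terms
    }
    # Step 3: keep candidates that never co-occur with the topic (R).
    r_terms = {c for c in c_terms if not docs_with(corpus, topic, c)}
    # Step 4: report only candidates in the second user-selected type set.
    return sorted(c for c in r_terms if has_type(c, target_types))


# Hypothetical corpus: each article is the set of terms assigned to it.
corpus = [
    {"topic_T", "link_B1"},
    {"topic_T", "link_B2", "known_C"},
    {"link_B1", "novel_C"},
    {"link_B2", "known_C"},
    {"link_B1", "other_X"},
]
sem_types = {
    "link_B1": {"type_link"},
    "link_B2": {"type_link"},
    "known_C": {"type_target"},
    "novel_C": {"type_target"},
    "other_X": {"type_other"},
}

print(open_discovery(corpus, sem_types, "topic_T", {"type_link"}, {"type_target"}))
```

In this toy corpus, `known_C` is dropped because it already appears with the topic, and `other_X` is dropped because it is not in the target type set, leaving only the candidate that is connected to the topic indirectly.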
30. Automating Swanson-style Discovery
- Text Mining: Generating Hypotheses from MEDLINE, Padmini Srinivasan. To appear in JASIST.
- Results: successfully reproduced the 7 examples they tried, with very little manual intervention
- Example: input topic is Raynaud's disease
31. Main Ideas for NLP Approach
- Assign Semantics using
- Statistics
- Hierarchical Lexical Ontologies to generalize
- Redundancy in the data
- Build up Layers of Representation
- Syntactic and Semantic
- Use these in a feedback loop
32. Automated Relation Assignment
- Recall the problem
- A two-dose combined hepatitis A and B vaccine would facilitate immunization programs.
- The vaccine helps prevent hep B.
- Identified 7 relations that can hold between Treatments and Diseases
- Used Machine Learning to address this
- Graphical models
- Neural nets
- Marked up the text with syntactic and semantic information
- MeSH labels turn out to be very important
33. Automated Relation Assignment
- Use Machine Learning to address this
- Graphical models
- Neural nets
- Mark up the text with syntactic and semantic information
- MeSH labels turn out to be very important
34. Automated Relation Assignment
35. Future Directions
- In text analysis
- Move away from hand-built rules
- More focus on labeling with semantics
- In problems tackled
- There are so many possibilities!
- Help with automated curation
36. Thank you!
- Visit our site
- biotext.berkeley.edu