Text Mining for Bioscience Applications: The State of the Art

Learn more at: https://people.ischool.berkeley.edu

2
Text Mining for Bioscience Applications: The
State of the Art

Marti Hearst
University of California, Berkeley

3
Outline

Search vs. Discovery
Why is text analysis difficult?
Some current approaches
Future directions

4
My Background

Computer Scientist by training
NOT a biologist
Professor in an interdisciplinary program
School of Information Management & Systems (SIMS)
Affiliated with the UCSF Bioinformatics Grad
Group
Research fields are
Computational Linguistics
Search (Information Retrieval)
User Interfaces and Information Visualization
Have focused for a while on bioscience text
Have received research support from Genentech

5
Search vs. Discovery
Search: Finding hay in a haystack
Discovery: Creating a new kind of hay
6
Search Goals

More accurate results
More comprehensive results
Thesaurus expansion
Intelligent summaries of results
Organize results along biologically relevant
lines
Better user interfaces

7
Knowledge Discovery from Text

How to discover new information
As opposed to looking up what's already known.
Method
Create hypotheses
Use large text collections to gather evidence to
refute or support hypotheses
Do lab tests to verify promising results

8
Discovery Goals

Genomics
Automatically build gene networks
Discover gene functions
Pharmacology
Help determine which drugs can help cure a
disease
Help determine which genetic traits will lead to
a reaction to a drug
Etiology
Discover underlying causes of disease

9
Why is Automated Text Analysis Difficult?
10
Why is automated text analysis difficult?

Avastin, developed by South San Francisco-based
Genentech (DNA), was approved for advanced
colorectal cancer and for patients who haven't
received other chemotherapy, according to the
Food and Drug Administration.
What is "approved" doing in this sentence?
"John was approved for advancement" -> gets a
promotion.
"Avastin was approved for cancer" -> to fight
cancer.
"Avastin was approved for patients" -> to consume
to fight cancer.
What kind of patients is it approved for?
Ambiguous. Could be for anyone who hasn't
received chemotherapy, or only those patients
with advanced colorectal cancer who haven't
received chemotherapy.

11
Why is automated text analysis difficult?

"This could easily be a multibillion-dollar
drug," McCamant says.
Refers to concepts mentioned in earlier
sentences.

12
Why is automated text analysis difficult?

"Avastin opens up this new gateway for cancer
care," says William Li, president of the
Angiogenesis Foundation in Massachusetts. "It's
the first in a fleet of other drugs."
Is Avastin a vehicle? It opens gateways and
travels in a fleet!

13
Why is automated text analysis difficult?

There are many indirect ways to say things
A two-dose combined hepatitis A and B vaccine
would facilitate immunization programs.
The vaccine helps prevent hep B.
These results suggest that con A-induced
hepatitis was ameliorated by pretreatment with
TJ-135.
The treatment TJ-135 helps cure hep.
Effect of interferon on hepatitis B.
There is an unspecified effect of interferon on
hep B.

14
What do we do?

Solve sub-problems
Extract certain types of entities
Gene/protein names
Abbreviation definitions
Classify the noun phrases using ontologies
MeSH, LocusLink, GO, etc.
Define relationship types and try to recognize them.
Many other subproblems are actively being worked
on
Word sense disambiguation
Co-reference resolution

15
Two Main Approaches
Hand-built Rules
Machine Learning
16
Two Main Approaches

Hand-built rules
Can be very accurate
Are also very brittle
Don't scale
Machine learning
Usually requires labeled training data
Unsupervised methods under development
Can be made to scale
Is the way of the future

17
Abbreviation Definition Recognition

A Simple Algorithm for Identifying Abbreviation
Definitions in Biomedical Text, Ariel Schwartz
and Marti Hearst, PSB 2003 Kauai, Jan 2003
Fast, simple algorithm for recognizing
abbreviation definitions.
Simpler and faster than the rest
Other approaches are cubic or quadratic in time
Higher precision and recall
Idea: Work backwards from the end
Examples
In eukaryotes, the key to transcriptional
regulation of the Heat Shock Response is the Heat
Shock Transcription Factor (HSF).
Gcn5-related N-acetyltransferase (GNAT)
In future:
Use redundancy across abstracts to figure out
abbreviation meaning even when definition is not
present.
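
The "work backwards from the end" idea can be sketched roughly as follows. This is a simplified illustration, not the published implementation: the function names, the parenthesis pattern, and the word-window heuristic are assumptions made for this sketch.

```python
import re


def find_best_long_form(short_form, candidate):
    """Scan both strings right to left; return the matching long form or None."""
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        curr = short_form[s_index].lower()
        if curr.isalnum():
            # Move left until the character matches. The first character of
            # the short form must also sit at the start of a word.
            while (l_index >= 0 and candidate[l_index].lower() != curr) or (
                s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
            ):
                l_index -= 1
            if l_index < 0:
                return None
            l_index -= 1
        s_index -= 1
    # Start the long form at the beginning of the word where matching stopped.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_abbreviation(sentence):
    """Find the first "long form (SHORT)" pair in a sentence, if any."""
    match = re.search(r"\(([^()]+)\)", sentence)
    if match is None:
        return None
    short_form = match.group(1).strip()
    if not short_form:
        return None
    words = sentence[:match.start()].split()
    # Assumed heuristic: only look at a few words before the parenthesis.
    window = min(len(short_form) + 5, len(short_form) * 2)
    candidate = " ".join(words[-window:])
    long_form = find_best_long_form(short_form, candidate)
    return (short_form, long_form) if long_form else None
```

Each character is visited at most once, so the scan is linear in the length of the candidate text.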

18
Gene name co-occurrence

A literature network of human genes for
high-throughput analysis of gene expression.
Jenssen TK, Laegreid A, Komorowski J, Hovig E.
Nat Genet. 2001 May;28(1):21-8.
PubGene Assumption
If two genes are co-mentioned in a MEDLINE
record, there is an underlying biological
relationship.
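
A minimal sketch of building a co-mention network under this assumption. The input format (one list of gene symbols per record) and the function name are hypothetical; this is not PubGene's actual pipeline, which also has to recognize gene names in the text first.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(records):
    """Count, for every gene pair, how many records mention both genes."""
    edges = Counter()
    for genes in records:
        # Deduplicate and sort so each unordered pair has a single key.
        for a, b in combinations(sorted(set(genes)), 2):
            edges[(a, b)] += 1
    return edges
```

The edge weight is the number of records in which the pair is co-mentioned; a threshold on that weight is one simple way to trade precision against recall.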

Example: Genes highly upregulated at time point 6
h (6H) in the fibroblast serum response.
Green: upregulation. Red: downregulation.
19
Gene name co-occurrence

A literature network of human genes for
high-throughput analysis of gene expression.
Jenssen TK, Laegreid A, Komorowski J, Hovig E.
Nat Genet. 2001 May;28(1):21-8.
Evaluation
29-40% of the pairs were incorrect
45% of OMIM pairs found
51% of DIP pairs found (DB of Interacting
Proteins)

20
How to find functions of genes?

Have the genetic sequence
Don't know what it does
But
Know which genes it coexpresses with
Some of these have known function
So infer function based on function of
co-expressed genes
This is a problem suggested by Michael Walker and
others at Incyte Pharmaceuticals

21
Gene Co-expression: Role in the genetic pathway
(Diagram: possible positions of unknown genes g and h
relative to Kall., PSA, and PAP)
Other possibilities as well
22
Make use of the literature

Look up what is known about the other genes.
Different articles in different collections
Look for commonalities
Similar topics indicated by Subject Descriptors
Similar words in titles and abstracts
adenocarcinoma, neoplasm, prostate, prostatic
neoplasms, tumor markers, antibodies ...
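
One rough way to "look for commonalities" is to count which descriptors recur across the literature of the co-expressed genes. The sketch below assumes the terms per gene have already been collected; the data structure, function name, and threshold are illustrative assumptions.

```python
from collections import Counter


def common_terms(terms_by_gene, min_fraction=0.5):
    """Return terms that appear for at least min_fraction of the genes."""
    counts = Counter()
    for terms in terms_by_gene.values():
        counts.update(set(terms))  # count each term once per gene
    threshold = min_fraction * len(terms_by_gene)
    return sorted(t for t, n in counts.items() if n >= threshold)
```

Terms shared by most of the known genes then become candidate topics for the mystery gene.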

23
Formulate a Hypothesis

Hypothesis: mystery gene has to do with
regulation of expression of genes leading to
prostate cancer
New tack: do some lab tests
See if mystery gene is similar in molecular
structure to the others
If so, it might do some of the same things they
do

24
Etiology Example

Complementary structures in disjoint science
literatures. Don R. Swanson. In Proceedings of
SIGIR '91
Goal: find cause of disease
Magnesium-migraine connection
Given
medical titles and abstracts
a problem (incurable rare disease)
some medical expertise
Find causal links among titles
symptoms
drugs
results

25
Gathering Evidence
migraine
26
Gathering Evidence
migraine
magnesium
27
Swanson's Linking Approach

Two of his hypotheses have received some
experimental verification.
His technique
Only partially automated
Required medical expertise
Recently others have made progress automating it.

28
Automating Swanson-style Discovery

Text Mining: Generating Hypotheses from MEDLINE,
Padmini Srinivasan. To appear in JASIST.
UMLS defines Semantic Types
Every MeSH term is assigned one or more Semantic
Types
Interferon type II falls within both
Immunologic Factor and
Pharmacologic Substance
Each PubMed article is assigned a set of MeSH
terms
The idea is to characterize a set of articles
according to which semantic types their MeSH
terms fall into.
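
A small sketch of that characterization, assuming each article is given as a list of MeSH terms and that a term-to-semantic-type mapping is available. Both inputs and the function name are hypothetical stand-ins for the real MeSH and UMLS resources.

```python
from collections import Counter


def semantic_profile(articles, semantic_types):
    """Count how often each semantic type occurs across a set of articles."""
    profile = Counter()
    for mesh_terms in articles:
        for term in mesh_terms:
            # A term may carry several semantic types; unknown terms add none.
            profile.update(semantic_types.get(term, ()))
    return profile


# Mapping taken from the example above.
EXAMPLE_TYPES = {
    "Interferon type II": {"Immunologic Factor", "Pharmacologic Substance"},
}
```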

29
Automating Swanson-style Discovery

Text Mining: Generating Hypotheses from MEDLINE,
Padmini Srinivasan. To appear in JASIST.
Approach
User inputs topic T of interest
User selects 2 sets from a small number of sets
of UMLS semantic types
System
Searches PubMed for articles about T
Selects out the important MeSH terms as
determined by the user-chosen semantic type
categories
Searches PubMed for articles that contain these
MeSH terms
Combines the MeSH terms that result from these
retrieved documents
Call this result C
If a PubMed search on words from T together with
a term c from C returns nothing, place c as a
candidate in a final result set R
Report those terms in R that fall into the second
user-selected semantic type set.
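
The steps above can be sketched on a toy in-memory corpus. Everything here is an assumption for illustration: documents are modeled as sets of MeSH terms, PubMed search is replaced by a membership test, and the toy corpus and semantic type assignments are made up (loosely modeled on the Raynaud's example), not real MEDLINE or UMLS data.

```python
def open_discovery(topic, corpus, semantic_types, link_types, target_types):
    """Return candidate terms linked to topic only through intermediate terms."""

    def search(term):
        # Stand-in for a PubMed query: documents indexed with the term.
        return [doc for doc in corpus if term in doc]

    def has_type(term, types):
        return bool(semantic_types.get(term, set()) & types)

    # Articles about T; keep terms in the first user-chosen type set.
    b_terms = {t for doc in search(topic) for t in doc
               if t != topic and has_type(t, link_types)}
    # Articles containing those terms; combine their terms into C.
    c_terms = {t for b in b_terms for doc in search(b) for t in doc}
    c_terms -= b_terms | {topic}
    # Keep c only if no document mentions both T and c.
    r = {c for c in c_terms
         if not any(topic in doc and c in doc for doc in corpus)}
    # Report terms in R that fall into the second type set.
    return sorted(c for c in r if has_type(c, target_types))


# Hypothetical toy data, for illustration only.
TOY_CORPUS = [
    {"Raynaud Disease", "Blood Viscosity"},
    {"Raynaud Disease", "Platelet Aggregation"},
    {"Fish Oils", "Blood Viscosity"},
    {"Fish Oils", "Platelet Aggregation"},
    {"Aspirin", "Platelet Aggregation", "Raynaud Disease"},
]
TOY_TYPES = {
    "Blood Viscosity": {"Physiologic Function"},
    "Platelet Aggregation": {"Physiologic Function"},
    "Fish Oils": {"Lipid"},
    "Aspirin": {"Pharmacologic Substance"},
}
```

Calling `open_discovery("Raynaud Disease", TOY_CORPUS, TOY_TYPES, {"Physiologic Function"}, {"Lipid"})` keeps only terms that never co-occur with the topic, which is what makes a candidate a new hypothesis and not an already documented link.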

30
Automating Swanson-style Discovery

Text Mining: Generating Hypotheses from MEDLINE,
Padmini Srinivasan. To appear in JASIST.
Results: successfully reproduced the 7
examples they tried, with very little manual
intervention
Example: input topic is Raynaud's disease

31
Main Ideas for NLP Approach

Assign Semantics using
Statistics
Hierarchical Lexical Ontologies to generalize
Redundancy in the data
Build up Layers of Representation
Syntactic and Semantic
Use these in a feedback loop

32
Automated Relation Assignment

Recall the problem
A two-dose combined hepatitis A and B vaccine
would facilitate immunization programs.
The vaccine helps prevent hep B.
Identified 7 relations that can hold between
Treatments and Diseases
Used Machine Learning to address this
Graphical models
Neural nets
Marked up the text with syntactic and semantic
information
MeSH labels turn out to be very important

33
Automated Relation Assignment