Title: Text Mining for Bioscience Applications: The State of the Art


2
Text Mining for Bioscience Applications: The State of the Art
  • Marti Hearst
  • University of California, Berkeley

3
Outline
  • Search vs. Discovery
  • Why is text analysis difficult?
  • Some current approaches
  • Future directions

4
My Background
  • Computer Scientist by training
  • NOT a biologist
  • Professor in an interdisciplinary program
  • School of Information Management & Systems (SIMS)
  • Affiliated with the UCSF Bioinformatics Grad
    Group
  • Research fields are
  • Computational Linguistics
  • Search (Information Retrieval)
  • User Interfaces and Information Visualization
  • Have focused for a while on bioscience text
  • Have received research support from Genentech

5
Search vs. Discovery
Search: Finding hay in a haystack
Discovery: Creating a new kind of hay
6
Search Goals
  • More accurate results
  • More comprehensive results
  • Thesaurus expansion
  • Intelligent summaries of results
  • Organize results along biologically relevant
    lines
  • Better user interfaces

7
Knowledge Discovery from Text
  • How to discover new information
  • As opposed to looking up what's already known.
  • Method
  • Create hypotheses
  • Use large text collections to gather evidence to
    refute or support hypotheses
  • Do lab tests to verify promising results

8
Discovery Goals
  • Genomics
  • Automatically build gene networks
  • Discover gene functions
  • Pharmacology
  • Help determine which drugs can help cure a
    disease
  • Help determine which genetic traits will lead to
    a reaction to a drug
  • Etiology
  • Discover underlying causes of disease

9
Why is Automated Text Analysis Difficult?
10
Why is automated text analysis difficult?
  • Avastin, developed by South San Francisco-based
    Genentech (DNA), was approved for advanced
    colorectal cancer and for patients who haven't
    received other chemotherapy, according to the
    Food and Drug Administration.
  • What is "approved" doing in this sentence?
  • John was approved for advancement -> gets a
    promotion.
  • Avastin was approved for cancer -> to fight
    cancer.
  • Avastin was approved for patients -> to consume
    to fight cancer.
  • What kind of patients is it approved for?
  • Ambiguous. Could be for anyone who hasn't
    received chemotherapy, or only those patients
    with advanced colorectal cancer who haven't
    received chemotherapy.

11
Why is automated text analysis difficult?
  • "This could easily be a multibillion-dollar
    drug," McCamant says.
  • Refers to concepts mentioned in earlier
    sentences.

12
Why is automated text analysis difficult?
  • "Avastin opens up this new gateway for cancer
    care," says William Li, president of the
    Angiogenesis Foundation in Massachusetts. "It's
    the first in a fleet of other drugs."
  • Is Avastin a vehicle? It opens gateways and
    travels in a fleet!

13
Why is automated text analysis difficult?
  • There are many indirect ways to say things
  • A two-dose combined hepatitis A and B vaccine
    would facilitate immunization programs.
  • The vaccine helps prevent hep B.
  • These results suggest that con A-induced
    hepatitis was ameliorated by pretreatment with
    TJ-135.
  • The treatment TJ-135 helps cure hep.
  • Effect of interferon on hepatitis B.
  • There is an unspecified effect of interferon on
    hep B.

14
What do we do?
  • Solve sub-problems
  • Extract certain types of entities
  • Gene/protein names
  • Abbreviation definitions
  • Classify the noun phrases using ontologies
  • MeSH, LocusLink, GO, etc.
  • Define relationship types and try to recognize them.
  • Many other subproblems are actively being worked
    on
  • Word sense disambiguation
  • Co-reference resolution

15
Two Main Approaches
Hand-built Rules
Machine Learning
16
Two Main Approaches
  • Hand-built rules
  • Can be very accurate
  • Are also very brittle
  • Don't scale
  • Machine learning
  • Usually requires labeled training data
  • Unsupervised methods under development
  • Can be made to scale
  • Is the way of the future

17
Abbreviation Definition Recognition
  • A Simple Algorithm for Identifying Abbreviation
    Definitions in Biomedical Text, Ariel Schwartz
    and Marti Hearst, PSB 2003 Kauai, Jan 2003
  • Fast, simple algorithm for recognizing
    abbreviation definitions.
  • Simpler and faster than the rest
  • Other approaches are cubic or quadratic in time
  • Higher precision and recall
  • Idea: Work backwards from the end
  • Examples
  • In eukaryotes, the key to transcriptional
    regulation of the Heat Shock Response is the Heat
    Shock Transcription Factor (HSF).
  • Gcn5-related N-acetyltransferase (GNAT)
  • In future
  • Use redundancy across abstracts to figure out
    abbreviation meaning even when definition is not
    present.
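
The "work backwards from the end" idea can be sketched roughly as follows. This is a simplified illustration written for this transcript, not the authors' released code: the function names, the regular expression, and the size of the candidate window before the parenthesis are assumptions of the sketch.

```python
import re


def find_long_form(short_form, candidate):
    """Match short_form against candidate text, scanning right to left.

    Returns the matched long form, or None if no match is found.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches. The first character of
        # the short form must also sit at the start of a word.
        while l >= 0 and (
            candidate[l].lower() != ch
            or (s == 0 and l > 0 and candidate[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Extend back to the start of the word that holds the first match.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_definitions(sentence):
    """Find "long form (SHORT)" pairs in one sentence."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = m.group(1).strip()
        if not short_form:
            continue
        words = sentence[: m.start()].split()
        # Assumed window: a few more words than the short form has characters.
        n = min(len(short_form) + 5, len(short_form) * 2)
        long_form = find_long_form(short_form, " ".join(words[-n:]))
        if long_form:
            pairs.append((short_form, long_form))
    return pairs
```

Applied to the two example phrases on this slide, the sketch pairs HSF with "Heat Shock Transcription Factor" and GNAT with "Gcn5-related N-acetyltransferase". Each character of the candidate text is visited at most once, which is what keeps this style of matching fast.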

18
Gene name co-occurrence
  • A literature network of human genes for
    high-throughput analysis of gene expression.
    Jenssen TK, Laegreid A, Komorowski J, Hovig E.
    Nat Genet. 2001 May;28(1):21-8.
  • PubGene Assumption
  • If two genes are co-mentioned in a MEDLINE
    record, there is an underlying biological
    relationship.

Example: Genes highly upregulated at time point 6 h (6H) in the fibroblast serum response.
Green: upregulation. Red: downregulation.
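
The co-mention assumption above can be sketched as a simple pair count. This is a toy illustration, not the PubGene implementation: the function name is made up, and the input is assumed to already be a list of gene symbols per MEDLINE record, which skips the hard step of recognizing gene names in text.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how many records mention each unordered pair of genes.

    records: iterable of gene-symbol lists, one list per record.
    Returns a Counter keyed by (gene_a, gene_b) with gene_a < gene_b.
    """
    counts = Counter()
    for genes in records:
        # set() so a gene repeated within one record is counted once.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


# Placeholder symbols, not real genes.
toy_records = [["A", "B"], ["B", "A", "C"], ["C"]]
network = cooccurrence_counts(toy_records)
```

Each key in `network` is an edge of the literature network, and the count can serve as an edge weight when filtering out weakly supported pairs.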
19
Gene name co-occurrence
  • A literature network of human genes for
    high-throughput analysis of gene expression.
    Jenssen TK, Laegreid A, Komorowski J, Hovig E.
    Nat Genet. 2001 May;28(1):21-8.
  • Evaluation
  • 29-40% of the pairs were incorrect
  • 45% of OMIM pairs found
  • 51% of DIP pairs found (DB of Interacting
    Proteins)
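
The OMIM and DIP figures above measure how many known pairs the literature network recovers. A minimal sketch of that kind of check, assuming both the network and the reference database are available as lists of gene pairs (the function name and data are hypothetical):

```python
def fraction_found(network_pairs, reference_pairs):
    """Fraction of reference pairs that also appear in the network.

    Pairs are treated as unordered, so ("A", "B") equals ("B", "A").
    """
    network = {tuple(sorted(p)) for p in network_pairs}
    reference = {tuple(sorted(p)) for p in reference_pairs}
    if not reference:
        raise ValueError("reference set is empty")
    return len(network & reference) / len(reference)


# Placeholder pairs, not real data.
found = fraction_found(
    [("A", "B"), ("C", "B")],
    [("B", "A"), ("B", "C"), ("A", "D"), ("D", "C")],
)
```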

20
How to find functions of genes?
  • Have the genetic sequence
  • Don't know what it does
  • But
  • Know which genes it coexpresses with
  • Some of these have known function
  • So infer function based on function of
    co-expressed genes
  • This is a problem suggested by Michael Walker and
    others at Incyte Pharmaceuticals

21
Gene Co-expression: Role in the genetic pathway
[Diagram: alternative pathway placements of unknown genes g? and h? relative to Kall., PSA, and PAP]
Other possibilities as well
22
Make use of the literature
  • Look up what is known about the other genes.
  • Different articles in different collections
  • Look for commonalities
  • Similar topics indicated by Subject Descriptors
  • Similar words in titles and abstracts
  • adenocarcinoma, neoplasm, prostate, prostatic
    neoplasms, tumor markers, antibodies ...
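
Looking for commonalities can be sketched as counting which descriptors recur across the literature of the known co-expressed genes. The function name and the term assignments below are invented for illustration; only the gene labels and example words come from these slides.

```python
from collections import Counter


def common_terms(terms_by_gene, min_genes=2):
    """Return (term, gene_count) for terms shared by at least min_genes genes.

    terms_by_gene: dict mapping a gene name to the descriptors or title
    words found in articles about that gene.
    """
    counts = Counter()
    for terms in terms_by_gene.values():
        # Count each term once per gene, ignoring case.
        counts.update({t.lower() for t in terms})
    shared = [(t, n) for t, n in counts.items() if n >= min_genes]
    # Most widely shared first, ties broken alphabetically.
    return sorted(shared, key=lambda item: (-item[1], item[0]))


# Illustrative assignments only.
toy_terms = {
    "PSA": ["prostate", "tumor markers"],
    "PAP": ["Prostate", "antibodies"],
    "Kall.": ["prostate", "tumor markers"],
}
shared = common_terms(toy_terms)
```

Terms that recur across most of the known genes are the raw material for the hypothesis about the unknown gene.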

23
Formulate a Hypothesis
  • Hypothesis: mystery gene has to do with
    regulation of expression of genes leading to
    prostate cancer
  • New tack: do some lab tests
  • See if mystery gene is similar in molecular
    structure to the others
  • If so, it might do some of the same things they
    do

24
Etiology Example
  • Complementary structures in disjoint science
    literatures. Don R. Swanson. In Proceedings of
    SIGIR '91
  • Goal: find cause of disease
  • Magnesium-migraine connection
  • Given
  • medical titles and abstracts
  • a problem (incurable rare disease)
  • some medical expertise
  • Find causal links among titles
  • symptoms
  • drugs
  • results

25
Gathering Evidence
migraine
26
Gathering Evidence
migraine
magnesium
27
Swanson's Linking Approach
  • Two of his hypotheses have received some
    experimental verification.
  • His technique
  • Only partially automated
  • Required medical expertise
  • Recently others have made progress automating it.

28
Automating Swanson-style Discovery
  • Text Mining: Generating Hypotheses from MEDLINE,
    Padmini Srinivasan. To appear in JASIST.
  • UMLS defines Semantic Types
  • Every MeSH term is assigned one or more Semantic
    Types
  • Interferon type II falls within both
  • Immunologic Factor and
  • Pharmacologic Substance
  • Each PubMed article is assigned a set of MeSH
    terms
  • The idea is to characterize a set of articles
    according to which semantic types their MeSH
    terms fall into.
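
Characterizing a set of articles by semantic type can be sketched as a tally over their MeSH terms. The lookup table here is a hand-written stand-in for the UMLS mapping, and apart from the Interferon Type II entry taken from this slide, its contents are illustrative assumptions.

```python
from collections import Counter


def semantic_type_profile(articles, types_by_term):
    """Tally semantic types over the MeSH terms of a set of articles.

    articles: iterable of MeSH-term lists, one list per article.
    types_by_term: dict mapping a MeSH term to its semantic types.
    """
    profile = Counter()
    for mesh_terms in articles:
        for term in mesh_terms:
            # A term with several semantic types counts toward each one.
            profile.update(types_by_term.get(term, ()))
    return profile


toy_types = {
    "Interferon Type II": {"Immunologic Factor", "Pharmacologic Substance"},
    "Hepatitis B": {"Disease or Syndrome"},  # illustrative assignment
}
toy_articles = [["Interferon Type II", "Hepatitis B"], ["Interferon Type II"]]
profile = semantic_type_profile(toy_articles, toy_types)
```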

29
Automating Swanson-style Discovery
  • Text Mining: Generating Hypotheses from MEDLINE,
    Padmini Srinivasan. To appear in JASIST.
  • Approach
  • User inputs topic T of interest
  • User selects 2 sets from a small number of sets
    of UMLS semantic types
  • System
  • Searches PubMed for articles about T
  • Selects out the important MeSH terms as
    determined by the user-chosen semantic type
    categories
  • Searches PubMed for articles that contain these
    MeSH terms
  • Combines the MeSH terms that result from these
    retrieved documents
  • Call this result C
  • If a PubMed search on words from T together with
    a term c from C returns nothing, place c as a
    candidate in a final result set R
  • Report those terms in R that fall into the second
    user-selected semantic type set.
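
The steps above can be sketched over a toy in-memory corpus. This is an assumption-heavy illustration, not the published system: PubMed searches are replaced by scans over a list of MeSH-term sets, term weighting is omitted, and the corpus, function name, and semantic type labels ("Function", "Substance") are invented stand-ins.

```python
def discover(topic, docs, types_by_term, link_types, target_types):
    """Suggest terms linked to topic only indirectly, via shared terms.

    docs: list of sets of MeSH terms, one set per article.
    link_types / target_types: the two user-selected semantic type sets.
    """
    def has_type(term, wanted):
        return bool(types_by_term.get(term, set()) & wanted)

    # 1. Articles about the topic T.
    topic_docs = [d for d in docs if topic in d]
    # 2. Their terms that fall in the first semantic type set.
    links = {t for d in topic_docs for t in d
             if t != topic and has_type(t, link_types)}
    # 3. Articles containing any of those linking terms.
    linked_docs = [d for d in docs if d & links]
    # 4. Combined terms of the retrieved articles: the set C.
    candidates = {t for d in linked_docs for t in d} - links - {topic}
    # 5. Keep c only if no article mentions both T and c: the set R.
    novel = {c for c in candidates
             if not any(topic in d and c in d for d in docs)}
    # 6. Report terms of R in the second semantic type set.
    return sorted(c for c in novel if has_type(c, target_types))


# Invented toy corpus echoing the magnesium-migraine example.
toy_docs = [
    {"Migraine", "Spreading Depression"},
    {"Spreading Depression", "Magnesium"},
    {"Migraine", "Aspirin"},
]
toy_types = {
    "Spreading Depression": {"Function"},
    "Magnesium": {"Substance"},
    "Aspirin": {"Substance"},
}
hypotheses = discover("Migraine", toy_docs, toy_types,
                      {"Function"}, {"Substance"})
```

In this toy run the only suggestion is "Magnesium": it never co-occurs with the topic but shares a linking term with it, while "Aspirin" is excluded because it already appears alongside the topic.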

30
Automating Swanson-style Discovery
  • Text Mining: Generating Hypotheses from MEDLINE,
    Padmini Srinivasan. To appear in JASIST.
  • Results: have successfully reproduced the 7
    examples they tried, with very little manual
    intervention
  • Example: input topic is Raynaud's disease

31
Main Ideas for NLP Approach
  • Assign Semantics using
  • Statistics
  • Hierarchical Lexical Ontologies to generalize
  • Redundancy in the data
  • Build up Layers of Representation
  • Syntactic and Semantic
  • Use these in a feedback loop

32
Automated Relation Assignment
  • Recall the problem
  • A two-dose combined hepatitis A and B vaccine
    would facilitate immunization programs.
  • The vaccine helps prevent hep B.
  • Identified 7 relations that can hold between
    Treatments and Diseases
  • Used Machine Learning to address this
  • Graphical models
  • Neural nets
  • Marked up the text with syntactic and semantic
    information
  • MeSH labels turn out to be very important

33
Automated Relation Assignment
  • Use Machine Learning to address this
  • Graphical models
  • Neural nets
  • Mark up the text with syntactic and semantic
    information
  • MeSH labels turn out to be very important

34
Automated Relation Assignment
  • Results

35
Future Directions
  • In text analysis
  • Move away from hand-built rules
  • More focus on labeling with semantics
  • In problems tackled
  • There are so many possibilities!
  • Help with automated curation

36
Thank you!
  • Visit our site
  • biotext.berkeley.edu