The BioText Project - PowerPoint PPT Presentation

Slides: 47
Transcript and Presenter's Notes


1
The BioText Project
  • SIMS Affiliates Meeting
  • Nov 14, 2003
  • Marti Hearst
  • Associate Professor
  • SIMS, UC Berkeley
  • Project sponsored by NSF DBI-0317510,
  • ARDA AQUAINT, and a gift from Genentech

2
BioText Project Goals
  • Provide fast, flexible, intelligent access to
    information for use in biosciences applications.
  • Better search results
  • Text mining
  • Focus on
  • Textual Information
  • Tightly integrated with other resources
  • Ontologies
  • Record-based databases

3
People
  • Project Leaders
  • PI: Marti Hearst; Co-PI: Adam Arkin
  • Computational Linguistics
  • Barbara Rosario
  • Preslav Nakov
  • Database Research
  • Ariel Schwartz
  • Gaurav Bhalotia (graduated)
  • User Interface / Information Retrieval
  • Kevin Li
  • Dr. Emilia Stoica
  • Bioscience
  • Dr. TingTing Zhang

4
Outline
  • Main Goals
  • Text Mining Examples
  • System Architecture
  • Apoptosis problem statement
  • Recent results in
  • Abbreviation definition recognition
  • Semantic relation recognition (from text)
  • Search User Interfaces
  • Hierarchical grouping of journals

5
Text Mining Example 1
  • How to discover new information
  • As opposed to discovering which statistical
    patterns characterize occurrence of known
    information.
  • Method
  • Use large text collections to gather evidence to
    support (or refute) hypotheses
  • Make Connections
  • Gather Evidence

6
Etiology Example
  • Don Swanson example, 1991
  • Goal: find cause of disease
  • Magnesium-migraine connection
  • Given
  • medical titles and abstracts
  • a problem (incurable rare disease)
  • some medical expertise
  • find causal links among titles
  • symptoms
  • drugs
  • results

7
Gathering Evidence
[Diagram: evidence chain linking migraine to magnesium through the intermediate terms stress, CCB, PA, and SCD]
8
Gathering Evidence
[Diagram nodes: migraine, magnesium]
9
Swanson's Linking Approach
  • Two of his hypotheses have received some
    experimental verification.
  • His technique
  • Only partially automated
  • Required medical expertise
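The linking idea can be sketched in a few lines. This is only an illustration under stated assumptions: the titles and the term list below are made up for the example (they are not Swanson's data), and term matching is plain word lookup. The sketch finds intermediate terms that co-occur with both a start term and a target term even though the two never appear together.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(titles, terms):
    """Map each term to the set of other terms it shares a title with."""
    links = defaultdict(set)
    for title in titles:
        words = set(title.lower().split())
        present = [t for t in terms if t in words]
        for a, b in combinations(present, 2):
            links[a].add(b)
            links[b].add(a)
    return links


def linking_terms(titles, terms, start, target):
    """Return terms that co-occur with both `start` and `target`."""
    links = cooccurrence(titles, terms)
    return sorted((links[start] & links[target]) - {start, target})


# Hypothetical toy titles, invented for illustration only.
TITLES = [
    "stress can trigger migraine attacks",
    "stress leads to loss of magnesium",
    "ccb drugs may prevent migraine",
    "magnesium acts as a natural ccb",
]
TERMS = ["migraine", "magnesium", "stress", "ccb"]

bridges = linking_terms(TITLES, TERMS, "migraine", "magnesium")
direct = "magnesium" in cooccurrence(TITLES, TERMS)["migraine"]
print(bridges, direct)
```

In this toy collection no title mentions both migraine and magnesium, so the connection only surfaces through the intermediate terms. Deciding which bridges are medically meaningful is the part that still required expertise.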

10
Text Mining Example 2
  • How to find functions of genes?
  • Have the genetic sequence
  • Don't know what it does
  • But
  • Know which genes it coexpresses with
  • Some of these have known function
  • So infer function based on function of
    co-expressed genes
  • This is a problem suggested by Michael Walker and
    others at Incyte Pharmaceuticals

11
Gene Co-expression: Role in the genetic pathway
[Diagram: known genes Kall., PSA, and PAP, with unknown genes g? and h? at candidate positions in the pathway]
Other possibilities as well
12
Make use of the literature
  • Look up what is known about the other genes.
  • Different articles in different collections
  • Look for commonalities
  • Similar topics indicated by Subject Descriptors
  • Similar words in titles and abstracts
  • adenocarcinoma, neoplasm, prostate, prostatic
    neoplasms, tumor markers, antibodies ...
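A minimal sketch of the "look for commonalities" step, assuming each co-expressed gene has already been mapped to the subject descriptors of the articles that mention it. The descriptor assignments below are invented for illustration and are not taken from any real collection; the function name `shared_descriptors` is likewise hypothetical.

```python
from collections import Counter


def shared_descriptors(gene_to_descriptors, min_genes=2):
    """Rank descriptors by how many genes' literature they appear in."""
    counts = Counter()
    for descriptors in gene_to_descriptors.values():
        counts.update(set(descriptors))  # count each descriptor once per gene
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(d, n) for d, n in ranked if n >= min_genes]


# Hypothetical descriptor sets for the known co-expressed genes.
KNOWN_GENES = {
    "Kall.": ["prostate", "tumor markers", "antibodies"],
    "PSA": ["prostate", "prostatic neoplasms", "tumor markers"],
    "PAP": ["prostate", "adenocarcinoma", "prostatic neoplasms"],
}

common = shared_descriptors(KNOWN_GENES)
print(common)
```

Descriptors shared by most of the known genes are the candidates for a hypothesis about the unknown gene.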

13
(No Transcript)
14
Formulate a Hypothesis
  • Hypothesis: mystery gene has to do with
    regulation of expression of genes leading to
    prostate cancer
  • New tack: do some lab tests
  • See if mystery gene is similar in molecular
    structure to the others
  • If so, it might do some of the same things they
    do

15
Outline
  • Main Goals
  • Text Mining Examples
  • System Architecture
  • Apoptosis problem statement
  • Recent results in
  • Abbreviation definition recognition
  • Semantic relation recognition (from text)
  • Search User Interfaces
  • Hierarchical grouping of journals

16
BioText Architecture
Sophisticated Text Analysis
Annotations in Database
Improved Search Interface
17
Recent Result (Schwartz & Hearst 03)
  • Fast, simple algorithm for recognizing
    abbreviation definitions.
  • Simpler and faster than the rest
  • Higher precision and recall
  • Idea: Work backwards from the end
  • Examples
  • In eukaryotes, the key to transcriptional
    regulation of the Heat Shock Response is the Heat
    Shock Transcription Factor (HSF).
  • Gcn5-related N-acetyltransferase (GNAT)
  • Idea: use redundancy across abstracts to figure
    out abbreviation meaning even when definition is
    not present.
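The "work backwards from the end" idea can be sketched as follows. This is a simplified reading of the matching step only: it scans the short form and the text before the parenthesis from right to left, and requires the first character of the short form to start a word. Candidate-window limits and validity filters are left out, and the helper `extract_definitions` with its regular expression is an assumption added for the example.

```python
import re


def find_best_long_form(short_form, long_form):
    """Match short-form characters right to left against the candidate text."""
    l = len(long_form) - 1
    for s in range(len(short_form) - 1, -1, -1):
        c = short_form[s].lower()
        if not c.isalnum():
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while l >= 0 and (
            long_form[l].lower() != c
            or (s == 0 and l > 0 and long_form[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
    # Extend the match to the start of the word it falls in.
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]


def extract_definitions(sentence):
    """Return (short form, long form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short = m.group(1).strip()
        candidate = sentence[: m.start()].strip()
        long_form = find_best_long_form(short, candidate)
        if long_form:
            pairs.append((short, long_form))
    return pairs


SENTENCE = (
    "In eukaryotes, the key to transcriptional regulation of the Heat Shock "
    "Response is the Heat Shock Transcription Factor (HSF)."
)
print(extract_definitions(SENTENCE))
print(find_best_long_form("GNAT", "Gcn5-related N-acetyltransferase"))
```

Note that characters may match inside a word (the S of HSF matches the "s" in "Transcription"); only the first character is tied to a word boundary.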

18
BioText: A Two-Sided Approach
Empirical Computational Linguistics Algorithms
Sophisticated Database Design Algorithms
19
Apoptosis Network
Survival Factors Signaling
Genotoxic Stress
Loss of Attachment, Cell Cycle stress, etc.
ER Stress
Initiator Caspases (8, 10)
P53 pathway
BH3 only
Bcl-2 like
NFkB
Bax, Bak
Mitochondria Cytochrome c
Smac
Caspase 12
IAPs
Apaf 1
AIF
Caspase 9
Apoptosis
Slide courtesy TingTing Zhang
20
The issues (courtesy TingTing Zhang)
  • The network nodes are deduced from reading and
    processing of experimental knowledge by experts.
    Every month >1000 apoptosis papers are published.
  • The supporting experimental data are gathered in
    different organs, tissues, cells using various
    techniques.
  • There are various levels of uncertainty
    associated with different techniques used to
    answer certain questions.
  • Depending on the expression patterns for the
    players in the network, the observation may or
    may not be extended to other contexts.
  • We need to keep track of ALL the information in
    order to understand the system better.

21
  • Simple cases
  • Mouse Bim proteins (isoforms EL, L, S) bind to
    human Bcl-2 (bacteriophage screening using cDNA
    expression library from T-Lymphoma cell line
    KO52DA20).
  • Human BimEL protein is 89% identical to mouse
    BimEL, Human BimL is 85% identical to mouse BimL
    (Hybridization of mouse bim cDNA to human fetal
    spleen and peripheral blood cDNA library).
  • Bim mRNA is detected in B and T lymphoid cells
    (Northern blot analysis of mouse KO52DA20, WEHI
    703, WEHI 707, WEHI7.1, CH1, WEHI231, WEHI415,
    B6.23.16BW2 cell extracts).
  • BimL protein interacts with Bcl-2 OR Bcl-XL, or
    Bcl-w proteins (Immuno-precipitation (anti-Bcl-2
    OR Bcl-XL OR Bcl-w) followed by Western blot
    (anti-EEtag) using extracts of human 293T cells
    co-transfected with EE-tagged BimL AND (bcl-2 OR
    bcl-XL OR bcl-w) plasmids)
  • BimL deleted of the BH3 domain does not bind to
    Bcl-2 OR Bcl-XL, or Bcl-w proteins (under
    experimental conditions mentioned above)
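One way to keep each finding together with its experimental context is a small record type. The sketch below is a hypothetical representation, not the project's schema: the class and field names (`Assertion`, `Evidence`, `technique`, `source`) are assumptions, and only two of the cases above are encoded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Evidence:
    technique: str  # experimental method used
    source: str     # organism, tissue, or cell line the data came from


@dataclass
class Assertion:
    subject: str
    relation: str
    target: str
    evidence: List[Evidence] = field(default_factory=list)


def with_technique(assertions, keyword):
    """Select assertions supported by a technique containing `keyword`."""
    keyword = keyword.lower()
    return [
        a for a in assertions
        if any(keyword in e.technique.lower() for e in a.evidence)
    ]


assertions = [
    Assertion(
        "mouse Bim (EL, L, S)", "binds", "human Bcl-2",
        [Evidence("bacteriophage screening",
                  "cDNA expression library, T-lymphoma cell line KO52DA20")],
    ),
    Assertion(
        "BimL", "interacts with", "Bcl-2, Bcl-XL, or Bcl-w",
        [Evidence("immunoprecipitation followed by Western blot",
                  "human 293T cell extracts")],
    ),
]

blot = with_technique(assertions, "western blot")
print([a.subject for a in blot])
```

Keeping technique and source next to each claim makes it possible to filter or weight findings by how they were obtained.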

22
Computational Language Goals
  • Recognizing and annotating entities within
    textual documents
  • Identifying semantic relations among entities
  • To (eventually) be used in tandem with
    semi-automated reasoning systems.

23
Main Ideas for NLP Approach
  • Assign Semantics using
  • Statistics
  • Hierarchical Lexical Ontologies to generalize
  • Redundancy in the data
  • Build up Layers of Representation
  • Syntactic and Semantic
  • Use these in a feedback loop

24
Computational Linguistics Goals
  • Mark up text with semantic relations

25
Recent Result: Descent of Hierarchy
  • Idea
  • Use the top levels of a lexical hierarchy to
    identify semantic relations
  • Hypothesis
  • A particular semantic relation holds between all
    2-word Noun Compounds that can be categorized by
    a MeSH pair.

26
Definition
  • NC: Any sequence of nouns that itself functions
    as a noun
  • asthma hospitalizations
  • health care personnel hand wash
  • Technical text is rich with NCs
  • Open-labeled long-term study of the
    subcutaneous sumatriptan efficacy and
    tolerability in acute migraine treatment.

27
NCs: Three tasks
  • Identification
  • Syntactic analysis (attachments)
  • Baseline headache frequency
  • Tension headache patient
  • Our Goal: Semantic analysis
  • Headache treatment → treatment for headache
  • Corticosteroid treatment → treatment that uses
    corticosteroid

28
Main Idea
  • Top-level MeSH categories can be used to indicate
    which relations hold between noun compounds
  • headache recurrence
  • C23.888.592.612.441 C23.550.291.937
  • headache pain
  • C23.888.592.612.441 G11.561.796.444
  • breast cancer cells
  • A01.236 C04 A11
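A minimal sketch of how a category pair could be derived from the tree numbers above, assuming that "level" means a dot-separated segment of the MeSH code. The function names and the `depth` parameter are hypothetical; the tree numbers are the ones shown on the slide.

```python
from collections import defaultdict


def truncate(tree_number, depth):
    """Keep the first `depth` dot-separated levels of a MeSH tree number."""
    return ".".join(tree_number.split(".")[:depth])


def category_pair(modifier_code, head_code, depth=(1, 1)):
    """Generalize a two-word noun compound to a pair of MeSH categories."""
    return (truncate(modifier_code, depth[0]), truncate(head_code, depth[1]))


def group_by_category_pair(compounds, depth=(1, 1)):
    """Group noun compounds that fall under the same category pair."""
    groups = defaultdict(list)
    for name, modifier_code, head_code in compounds:
        groups[category_pair(modifier_code, head_code, depth)].append(name)
    return dict(groups)


COMPOUNDS = [
    ("headache recurrence", "C23.888.592.612.441", "C23.550.291.937"),
    ("headache pain", "C23.888.592.612.441", "G11.561.796.444"),
]

groups = group_by_category_pair(COMPOUNDS)
print(groups)
print(category_pair("C23.888.592.612.441", "C23.550.291.937", depth=(2, 1)))
```

Raising a value in `depth` corresponds to descending one more level of the hierarchy for that half of the pair.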

29
Linguistic Motivation
  • Can cast NC into head-modifier relation, and
    assume head noun has an argument and qualia
    structure.
  • (used-in) kitchen knife
  • (made-of) steel knife
  • (instrument-for) carving knife
  • (used-on) putty knife
  • (used-by) butcher's knife

30
Distribution of Frequent Category Pairs
31
How Far to Descend?
  • Anatomy: 250 CPs
  • 187 (75%) remain first level
  • 56 (22%) descend one level
  • 7 (3%) descend two levels
  • Natural Science (H01): 21 CPs
  • 1 (4%) remain first level
  • 8 (39%) descend one level
  • 12 (57%) descend two levels
  • Neoplasm (C04): 3 CPs
  • 3 (100%) descend one level

32
Evaluation
  • Apply the rules to a test set
  • Accuracy
  • Anatomy: 91% accurate
  • Natural Science: 79%
  • Diseases: 100%
  • Total
  • 89.6% via intra-category averaging
  • 90.8% via extra-category averaging

33
Summary of NC Work
  • Lexical hierarchy useful for inferring semantic
    relations
  • Works because semantics are constrained and word
    sense ambiguity is not too much of a problem
  • Can it be extended to other types of relations?
  • Preliminary results on one set of relations are
    promising.

34
Database Research Issues
  • Efficiently and effectively combining
  • Relational databases and text
  • Hierarchical Ontologies
  • Layers of Annotations
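One common way to combine text with layered annotations in a relational database is stand-off annotation: the text is stored once, and each layer records character offsets into it. The sketch below uses Python's built-in sqlite3 module. The table and column names are assumptions made for illustration and do not describe the project's actual schema.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a document and an annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE document (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
        CREATE TABLE annotation (
            id INTEGER PRIMARY KEY,
            doc_id INTEGER NOT NULL REFERENCES document(id),
            layer TEXT NOT NULL,       -- e.g. 'mesh', 'syntax', 'relation'
            start_pos INTEGER NOT NULL,
            end_pos INTEGER NOT NULL,
            label TEXT NOT NULL
        );
        """
    )
    return conn


def annotate(conn, doc_id, layer, start, end, label):
    """Add one stand-off annotation covering text[start:end]."""
    conn.execute(
        "INSERT INTO annotation (doc_id, layer, start_pos, end_pos, label) "
        "VALUES (?, ?, ?, ?, ?)",
        (doc_id, layer, start, end, label),
    )


def spans(conn, doc_id, layer):
    """Return (covered text, label) pairs for one layer, in text order."""
    text = conn.execute(
        "SELECT text FROM document WHERE id = ?", (doc_id,)
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT start_pos, end_pos, label FROM annotation "
        "WHERE doc_id = ? AND layer = ? ORDER BY start_pos",
        (doc_id, layer),
    )
    return [(text[s:e], label) for s, e, label in rows]


conn = build_db()
conn.execute("INSERT INTO document (id, text) VALUES (1, 'headache recurrence')")
annotate(conn, 1, "mesh", 0, 8, "C23.888.592.612.441")
annotate(conn, 1, "mesh", 9, 19, "C23.550.291.937")
mesh_layer = spans(conn, 1, "mesh")
print(mesh_layer)
```

Because layers are just rows, a new kind of analysis can be added without touching the stored text or the existing layers.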

35
Interface Issues
  • Create intuitive, appealing interfaces that are
    better than what's currently out there.
  • Start with existing assigned metadata
  • As text analysis improves, incorporate the
    results into the interface.

36
(No Transcript)
37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
Some Recent Work
  • Organizing BioScience Journal Names
  • Currently there are > 3500

41
(No Transcript)
42
(No Transcript)
43
Some Recent Work
  • Organizing BioScience Journal Names
  • Currently there are > 3500
  • Idea
  • Group them into faceted hierarchies
    semi-automatically
  • Using clustering of title terms, synonym
    similarity via WordNet, and other techniques
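As a much simpler stand-in for the clustering step, journal names can be grouped by the title terms they share. The sketch below builds an inverted index from term to journal names. It leaves out the WordNet synonym step entirely, and the stopword list and the short sample of journal names are chosen only for illustration.

```python
from collections import defaultdict

# Assumed stopword list for journal titles.
STOPWORDS = {"journal", "of", "the", "and", "in"}


def title_terms(name):
    """Return the content words of a journal name."""
    return {w for w in name.lower().split() if w not in STOPWORDS}


def group_by_term(names, min_size=2):
    """Map each title term to the journals containing it."""
    index = defaultdict(list)
    for name in names:
        for term in title_terms(name):
            index[term].append(name)
    return {t: js for t, js in sorted(index.items()) if len(js) >= min_size}


JOURNALS = [
    "Journal of Cell Biology",
    "Cell",
    "Journal of Molecular Biology",
    "Molecular Cell",
    "Cancer Research",
]

facets = group_by_term(JOURNALS)
print(facets)
```

A journal can land in several groups at once, which is the property a faceted hierarchy relies on.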

44
(No Transcript)
45
(No Transcript)
46
Summary
  • BioText aims to improve access to bioscience
    information via
  • Sophisticated language analysis
  • Integration of results into
  • Annotated database
  • Flexible user interface
  • Eventual goal
  • Semi-automated mining and discovery

47
There's lots to do!
For more information
  • biotext.berkeley.edu