Title: Vladimir Bajic


1
Deeper insights from text-mining: Dragon Exploratory System
Vladimir Bajic, InCoB 2009, Singapore
2
Information exploratory systems: Motivation
3
Search for complete information is tedious
Biomedical field
Multitude of information types
Multiple structures of information records
Tools for information search and retrieval need improvements
Exploratory systems
Extreme volume of information
Distributed information repositories
Variety of modes to access information
4
Information exploratory systems: Characteristics
5
Can generate reports
Biomedical field
Integrate information from various resources
Allow for exploring information from various viewpoints
Links with other information repositories
Exploratory systems
Navigate easily through information retrieved
Convenient graphical representation
Convenient tabular representation
6
Knowledge extraction: Information exploratory systems
Knowledge extraction
Exploratory systems
Computers
7
Knowledge Extraction: Why?
  • In the biomedical field the number of published
    scientific reports increases dramatically every
    year (PubMed holds 19 million documents and grows
    by more than 0.5 million per year)
  • The volume of experimentally generated and
    computationally derived data is tremendous
  • Still the amount of facts related to a specific
    topic provided in a summarized format is very
    small
  • For this reason databases of curated information
    of specialized topics are still easily publishable

8
Knowledge Extraction: Why? (2)
  • Summarized collections of accurate
    information/facts on a specialized topic are rare
  • These are difficult to compile: they require
    curation and are slow and costly to build
  • What can we do about it?
  • We can automate, or semi-automate, the
    extraction of KNOWLEDGE from various
    resources, most frequently textual ones

9
Knowledge Extraction: How?
  • We define what we mean by knowledge in a way
    suitable for computer analysis
  • By KNOWLEDGE we mean information that
    accurately links two concepts A and B
  • A Knowledge Edge is the pair (A, B) in which
    A relates to B in a specific way under specific
    conditions

10
Knowledge Edge
[Diagram: concept A is linked to concept B by a relation that holds under specific conditions]
11
Knowledge Edge: Examples
A: gene (EntrezGeneID 9821)
B: 2h after stimulation
Relation: highly expressed
Other examples: A binds to B; A blocks activity of B; etc.
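A knowledge edge as described on the previous slides can be held in a small record type. The sketch below is only an illustration: the class name `KnowledgeEdge`, its field names, and the example condition string are assumptions, not part of the Dragon system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeEdge:
    """Hypothetical record: concept A, relation, concept B, conditions."""
    concept_a: str
    relation: str
    concept_b: str
    conditions: str = ""  # empty string when no condition is recorded

    def describe(self) -> str:
        text = f"{self.concept_a} {self.relation} {self.concept_b}"
        return f"{text} ({self.conditions})" if self.conditions else text


# Frozen dataclasses are hashable, so identical edges collapse in a set.
edges = {
    KnowledgeEdge("A", "binds to", "B"),
    KnowledgeEdge("A", "blocks activity of", "B"),
    KnowledgeEdge("A", "binds to", "B"),
}
```

Making the record immutable is a design choice that lets sets of edges act as de-duplicated building blocks.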
12
Knowledge Extraction: Methods
  • Most rewarding: text analysis

Inputs: textual data; dictionaries of concepts of interest
  1. Find concepts of interest in textual data
  2. Determine how concepts are linked
  3. Organize results for convenient utilization
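The text-analysis steps on this slide (find dictionary concepts, determine how they are linked, organize the results) can be sketched as follows. This is a minimal illustration, not the Dragon implementation: the toy dictionary, the function names, and the choice of sentence-level co-occurrence as the "link" are all assumptions.

```python
import re
from itertools import combinations

# Hypothetical toy dictionary of concepts of interest (lower-cased terms).
DICTIONARY = {"pea3", "mmp-13", "promoter"}


def find_concepts(sentence, dictionary):
    """Return the dictionary terms that occur as tokens in the sentence."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    return [t for t in tokens if t in dictionary]


def link_concepts(sentences, dictionary):
    """Map each co-occurring concept pair to the sentences supporting it."""
    links = {}
    for sentence in sentences:
        found = sorted(set(find_concepts(sentence, dictionary)))
        for pair in combinations(found, 2):
            links.setdefault(pair, []).append(sentence)
    return links


example = ["Activated PEA3 binds to MMP-13 promoter and activates its expression."]
links = link_concepts(example, DICTIONARY)
```

The resulting dictionary keeps the supporting sentences per pair, so each link can be traced back to its source text.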
13
Knowledge Extraction: Methods (2)
Data resources
AI system for refining results
Text-mining/data-mining tools
Preliminary results
Final results
Curators
Other data resources
14
Knowledge Extraction
How is the relation between concepts A and B derived (in text-mining)?
  • Location: document set, document, part of document, sentence
  • Form/rule: defined template patterns
  • Artificial intelligence: trained system
  • Possibly curator assessment
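The "location" criterion can be illustrated by testing co-occurrence at two scopes, whole document versus single sentence. The sketch below is an assumption-laden toy: the function names and the naive sentence splitter are hypothetical, and real abstracts need more careful sentence segmentation.

```python
import re


def tokens(text):
    """Lower-cased alphanumeric tokens of a text fragment."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def cooccur(document, a, b, level="sentence"):
    """True if terms a and b appear together within one unit of the given level."""
    if level == "document":
        units = [document]
    elif level == "sentence":
        # Naive split after sentence-final punctuation (assumption).
        units = re.split(r"(?<=[.!?])\s+", document.strip())
    else:
        raise ValueError("level must be 'document' or 'sentence'")
    return any(a in tokens(u) and b in tokens(u) for u in units)


doc = "Stress can lead to loss of Mg. Stress is associated with migraine."
```

A narrower scope gives fewer but usually more reliable candidate links, which is one way to read the different accuracy levels mentioned in the talk.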
15
Knowledge Extraction
  • These provide different levels of accuracy of
    the derived Knowledge Edges, from exploratory
    (i.e. potential) to accurate
  • Sets of knowledge edges can be used as building
    blocks in the generation of knowledge networks
  • This supports various fields of life sciences
16
High-accuracy knowledge edge sets
Knowledge base for links to other resources
Biomedical field
Knowledge networks generation
Knowledge extraction systems
Automatic generation of biological databases
Hypotheses generation
Curator Support systems
17
How does it look in practice?
18
Bookshelf of US Surgeon General Joseph Lovell
  • In 1818 Joseph Lovell became the surgeon general
    of the US army medical department.
  • Occasionally he purchased medical journals and
    books for his office bookshelf.

19
General Thomas Lawson - the first catalog
  • Lovell's successor, Surgeon General Thomas
    Lawson, continued and expanded the office
    collection.
  • In 1840 he wrote a small catalog containing 134
    titles and called the collection the "Library
    of the Surgeon General's Office."

20
The times of John Shaw Billings
  • Early in the 1870s Billings started a card file.
  • Years later, in 1895, he completed a
    sixteen-volume index catalog.
  • This was the most comprehensive guide to the
    medical literature of the 19th century.

21
MEDLARS
  • The catalog evolved from printed, to
    photographic in the 1950s, to the computerized
    Medical Literature Analysis and Retrieval
    System (MEDLARS) in the 1960s.
  • For each requested abstract the entire set of
    magnetic tapes had to be searched sequentially,
    which in 1964 took about 40 minutes, and a
    summary was mailed back to the library member.
  • In 1971 the library started providing online
    access and in 1993 through the website.

22
National Library of Medicine
  • Today, the National Library of Medicine (NLM) is
    the biggest repository of biomedical
    information worldwide.
  • Via the National Center for Biotechnology
    Information (NCBI) NLM houses and provides access
    not only to bibliography but also to a number of
    biomedical databases.

23
Entrez
  • Access to biomedical literature is provided
    by NCBI's PubMed service.
  • PubMed uses the cross-database search and
    retrieval system Entrez.
  • The Entrez query system is also used for services
    including the NLM catalog, nucleotide and protein
    databases, genome sequences, and many others.
  • Information from published documents alone is
    insufficient

http://www.ncbi.nlm.nih.gov/sites
24
PubMed
  • The main PubMed component is Medical Literature
    Analysis and Retrieval System Online (MEDLINE)
  • MEDLINE is a bibliographic database that contains
    citations and abstracts of journal articles in
    life sciences with a focus on biomedicine.
  • In addition PubMed contains citations and links
    to full text articles from other biomedical
    related life science journals.
  • In the year 2000 a new service, PubMed Central
    (PMC) was established as a digital archive of
    full text journal articles.
  • Today PubMed contains more than 19,000,000
    citations.

25
Text Mining System Components
26
Text Mining with the Dragon Exploration System
  • Six design principles
  • Submit
  • Annotate
  • Explore
  • Visualize
  • Hypothesize
  • Mine

30
Text Mining with the Dragon Exploration System
Tobacco and tobacco smoke (inhibits monoamine oxidase)
Controls aggression in obsessive-compulsive disorder?
Beneficial for Parkinson's disease?
Beneficial for bipolar depression?
Link to Borderline Mental Retardation and Idiopathic Epilepsy?
32
or how we did it?
33
Main resources
34
  • Set of curated dictionaries on various topics
  • Pre-indexed PubMed
  • Pre-indexed EntrezGene, Reactome
  • Pre-indexed UniProt (Swiss-Prot, TrEMBL)
  • Internal annotated human promoter database
    (200,000 TSSs, 2 sources of experimental
    evidence, 300 TFs, association of TSSs with
    expression libraries)
  • Pre-indexed and annotated human regulatory
    regions for SNPs and affected TFBSs
  • PPI data
  • Hypotheses generation module
  • Rule-based knowledge extraction module

35
Text mining characteristics
36
  • Synonym resolution module
  • Acronym resolution module
  • Free selection of index dictionaries
  • Text is re-indexed on the fly
  • Color-coded index presentation in text
  • Link to the source

37
List of associations per concept
38
Search for an explicit term
39
Text Mining with the Dragon Exploration System
Frequency of entities table.
40
Tabular presentation of all terms found with
frequencies
41
Find documents with pair of concepts
42
Text Mining with the Dragon Exploration System
Frequency of pairs table.
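Frequency tables of entities and of entity pairs can be built with plain counters. The following is a sketch under assumptions: each document is taken to be already annotated as a list of entity mentions, entity frequency counts every mention, and pair frequency counts documents containing both entities. The slides do not state which counting convention the system uses.

```python
from collections import Counter
from itertools import combinations


def frequency_tables(annotated_docs):
    """Return (entity mention counts, per-document pair counts)."""
    entity_freq = Counter()
    pair_freq = Counter()
    for entities in annotated_docs:
        entity_freq.update(entities)  # every mention is counted
        for pair in combinations(sorted(set(entities)), 2):
            pair_freq[pair] += 1  # each document counts once per pair
    return entity_freq, pair_freq


# Hypothetical annotated documents (lists of recognized entities).
docs = [["mg", "stress", "mg"], ["stress", "migraine"], ["mg", "stress"]]
entity_freq, pair_freq = frequency_tables(docs)
```

Calling `pair_freq.most_common()` then yields the rows of a pair table sorted by frequency.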
43
Text Mining with the Dragon Exploration System
Recommended reading: top documents in annotated
and original PubMed format.
44
Text Mining with the Dragon Exploration System
Links to external databases, e.g. GO
45
Text Mining with the Dragon Exploration System
Links to external databases, e.g. Reactome
46
Text Mining with the Dragon Exploration System
Links to external databases: genes, proteins,
pathways,
47
Text Mining with the Dragon Exploration System
Entities are linked to external databases, e.g.
genes, proteins, pathways,
48
Text Mining with the Dragon Exploration System
Document clustering: weight of the entity in the
cluster = frequency / number of documents.
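The cluster weight of an entity (frequency divided by number of documents) can be computed as below. This sketch assumes that "frequency" means the number of documents in the cluster that mention the entity; the slide does not say whether repeated mentions within one document count more than once, and the function name is hypothetical.

```python
from collections import Counter


def entity_weights(cluster_docs):
    """Weight of each entity = documents mentioning it / documents in cluster."""
    n_docs = len(cluster_docs)
    if n_docs == 0:
        return {}
    doc_counts = Counter(e for doc in cluster_docs for e in set(doc))
    return {entity: count / n_docs for entity, count in doc_counts.items()}


# Hypothetical cluster of four annotated documents.
cluster = [["mg", "migraine"], ["mg"], ["stress", "mg"], ["migraine"]]
weights = entity_weights(cluster)
```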
49
Hypotheses generation
50
Enter the Dragon: Swanson ABC model
  • Sets of publications A and C have no articles in
    common, but they are linked through intermediate
    articles.
  • This structure may contain unnoticed information
    that can be obtained by combining pairs of
    intersections ABi and BiC.

Magnesium literature | Migraine literature
Mg is a calcium channel blocker. | Calcium channel blockers can prevent migraine attacks.
Stress can lead to loss of Mg. | Stress is associated with migraine.
Mg has anti-inflammatory properties. | Migraine may involve inflammation of the cerebral blood vessels.
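The ABC idea (two literatures with no shared articles, linked through intermediate concepts) reduces to a set intersection once each article is annotated with its concepts. The sketch below uses the magnesium and migraine example from this slide; the data layout and the function names are assumptions made for illustration.

```python
# Each article is represented by the set of concepts found in it (toy data).
magnesium_lit = [
    {"magnesium", "calcium channel blocker"},
    {"magnesium", "stress"},
    {"magnesium", "inflammation"},
]
migraine_lit = [
    {"migraine", "calcium channel blocker"},
    {"migraine", "stress"},
    {"migraine", "inflammation"},
    {"migraine", "aura"},
]


def terms(literature):
    """All concepts mentioned anywhere in a literature."""
    return set().union(*literature)


def linking_terms(lit_a, lit_c, a, c):
    """Intermediate concepts B that occur in both literatures."""
    return (terms(lit_a) & terms(lit_c)) - {a, c}


b_terms = linking_terms(magnesium_lit, migraine_lit, "magnesium", "migraine")
```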
51
Enter the Dragon: Open and closed discovery process
  • Open discovery process: the researcher searches
    for interesting concepts (B) that link the
    researched topic (A) with a concept (C).
    A: Magnesium literature. B: Stress. C: Migraine.
  • Closed discovery process: the search starts in
    both directions, resulting in overlapping
    concepts. A: Magnesium literature.
    C: Migraine literature. B: Stress.
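Open discovery, which starts from topic A alone, can be sketched as a two-step walk over co-occurrences: collect the B concepts that appear with A, then rank the C concepts that appear with those B concepts but never directly with A. The toy documents, the function names, and the ranking by number of linking B concepts are assumptions for illustration only.

```python
from collections import Counter


def cooccurring(docs, term):
    """Count concepts that share a document with the given term."""
    counts = Counter()
    for doc in docs:
        if term in doc:
            counts.update(doc - {term})
    return counts


def open_discovery(docs, a):
    """Rank candidate C concepts by how many B concepts link them to A."""
    b_terms = set(cooccurring(docs, a))
    candidates = Counter()
    for b in b_terms:
        for c in cooccurring(docs, b):
            if c != a and c not in b_terms:
                candidates[c] += 1
    return candidates


# Hypothetical concept-annotated documents.
docs = [
    {"magnesium", "stress"},
    {"magnesium", "calcium channel blocker"},
    {"stress", "migraine"},
    {"calcium channel blocker", "migraine"},
    {"stress", "insomnia"},
]
ranked = open_discovery(docs, "magnesium")
```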
52
Open discovery process with intermediate links
The researcher searches for interesting
concepts (D) that are linked to the researched
topic (A) via intermediate links (B, C).
53
Hypotheses generation
54
Rule-based extraction of knowledge
55
Enter the Dragon: Concept-based knowledge discovery
Real world (problem description): search for
transcription factor binding to a gene's promoter
Concept world (concept description): protein/gene,
interaction, protein/gene, promoter
56
Text Mining with the Dragon Exploration System
Knowledge finder: set template and mine for
knowledge.
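Mining with a template can be sketched as checking whether a sentence contains the template's concept types in order. The example below uses the protein/gene, interaction, protein/gene, promoter template from the talk; the toy type dictionary and the in-order (subsequence) matching rule are assumptions, not a description of how the Dragon knowledge finder works internally.

```python
import re

# Hypothetical dictionary mapping lower-cased terms to concept types.
TYPES = {
    "pea3": "pr/gn",
    "mmp-13": "pr/gn",
    "binds": "interaction",
    "promoter": "promoter",
}
TEMPLATE = ["pr/gn", "interaction", "pr/gn", "promoter"]


def concept_sequence(sentence):
    """Concept types of the recognized terms, in sentence order."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    return [TYPES[t] for t in tokens if t in TYPES]


def matches_template(sentence, template=TEMPLATE):
    """True if the template occurs as an ordered subsequence of concept types."""
    remaining = iter(concept_sequence(sentence))
    return all(any(tag == wanted for tag in remaining) for wanted in template)


hit = "Activated PEA3 binds to MMP-13 promoter and activates its expression."
miss = "PEA3 is highly expressed in tumours."
```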
57
Transcription regulation
Depending on how TFs interact, different
PEs participate in the transcription initiation
process. Some combinations of events will initiate
transcription, and some will not. These show the
combinatorial nature of transcription activation
and the promoter's ability to address different
requirements for timing, tissue specificity, and
transcription rate and levels.
58
TF-1
Distant unrelated pathways
59
  • The key information needed to understand how
    transcription is controlled is the set of edges
    of the type "TF binds to the promoter of a gene"

H: 2000 TFs, 30,000 genes, 200,000 promoters
Fewer than 8000 known TF-(promoter)-gene edges
60
Question: Find TFs that bind to the promoter of a gene
61
Find TFs that bind to promoter of a gene
Knowledge Extraction
62
Find TFs that bind to promoter of a gene
Knowledge Extraction
  • In one day a curator can verify 200
    associations of the type "TF binds to the
    promoter of a gene"
  • In this way we can quickly build repositories of
    curated (thus accurate) information on specific
    types of knowledge edges
63
Enter the Dragon: Machine learning
  1. Supply ML algorithm with classified examples.
  2. ML algorithm learns a prediction rule.
  3. Classify new instance of an unknown class by
    using the rule.

Transcription factor binding to gene's promoter? | Class
Activated PEA3 binds to MMP-13 promoter and activates its expression. | Yes
A HNF-1 binding site was identified in the NNMT basal promoter region. | Yes
Runx2 directly binds to the OSE2 elements and transactivates the human NELL-1 promoter. | No
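The three steps (supply classified examples, learn a rule, classify a new instance) can be illustrated with a deliberately simple stand-in learner. The sketch below is a 1-nearest-neighbour classifier over word sets with Jaccard similarity; this technique is swapped in for illustration and is not the algorithm or feature set used in the talk. The training sentences and labels are the three from the slide; the new sentence is made up.

```python
import re


def words(sentence):
    """Set of lower-cased tokens in a sentence."""
    return set(re.findall(r"[a-z0-9-]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Step 1: classified examples (taken from the slide).
TRAINING = [
    ("Activated PEA3 binds to MMP-13 promoter and activates its expression.", "Yes"),
    ("A HNF-1 binding site was identified in the NNMT basal promoter region.", "Yes"),
    ("Runx2 directly binds to the OSE2 elements and transactivates the human NELL-1 promoter.", "No"),
]


def classify(sentence, training=TRAINING):
    """Steps 2 and 3: the 'rule' is to copy the label of the most similar example."""
    target = words(sentence)
    best = max(training, key=lambda ex: jaccard(target, words(ex[0])))
    return best[1]


prediction = classify("PEA3 binds to the MMP-13 promoter.")
```

With only three examples this is a toy; it shows the shape of the workflow, not a usable model.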
64
Enter the Dragon: ML learning examples
  1. 145,168 abstracts produced 1,049,949 sentences,
    e.g. "Activated PEA3 binds to MMP-13 promoter
    and activates its expression."
  2. 3,321 sentences matched the concept
    pr/gn, interaction, pr/gn, promoter
  3. Content evaluation
  4. Expert classification
65
From the real world to an abstract feature vector space
66
The best performing feature
Levenshtein distance
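For reference, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A standard dynamic-programming sketch follows; how the distance was turned into a feature in this work is not stated on the slide, so only the distance itself is shown.

```python
def levenshtein(s, t):
    """Edit distance between strings s and t (unit cost per edit)."""
    previous = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Keeping only two rows of the table limits memory to the length of the second string.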
67
ML algorithms comparison
Algorithm      Precision  Recall  F-measure  AUC
K              0.709      0.710   0.710      0.741
MLP-NN         0.713      0.712   0.713      0.787
J-48           0.735      0.739   0.735      0.775
Naïve Bayes    0.786      0.788   0.785      0.841
Random Forest  0.795      0.796   0.795      0.837
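As a reminder of what the first three columns measure, the sketch below computes precision, recall, and F-measure for a single class from confusion counts. The counts in the example are made up, and the table's figures may be averaged over classes, so this code is not expected to reproduce them; AUC is not covered here.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure (harmonic mean) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
```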
68
K with Synthetic Minority Oversampling
Experiment              Precision  Recall  F-measure  AUC
7 features              0.899      0.892   0.893      0.962
57 features             0.912      0.908   0.909      0.970
68 features             0.963      0.961   0.961      0.993
117 features            0.960      0.959   0.959      0.988
Chowdhary et al., 2009  0.920      0.710   0.740      0.870
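Synthetic minority oversampling creates new minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that core idea in plain Python; the function name, parameters, and toy points are assumptions, and it is a simplified illustration rather than the implementation used for the experiments above.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority feature tuples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical two-dimensional minority samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(minority, n_new=5)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.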
69
Knowledge discovery
70
Creation of biology related databases
  • For a set of documents and for a selected set of
    dictionaries
  • Also, for template-related sentence types

Some publications (different aspects of DES)
  • Essack M et al. DDEC: Dragon database of
    genes implicated in esophageal cancer, BMC Cancer,
    2009
  • Kaur M et al. Database for exploration of
    functional context of genes implicated in ovarian
    cancer, Nucleic Acids Research, 2009
  • Sagar S et al. DDESC: Dragon Database for
    Exploration of Sodium Channels in Human, BMC
    Genomics, 2008
71
Text Mining with the Dragon Exploration System
Create database by filtering sentences by
dictionary and selected keywords.
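Filtering sentences by dictionary and keywords and storing the hits in a database can be sketched with Python's built-in `sqlite3` module. The table layout, the function name, the toy sentences, and the use of simple substring matching are all assumptions made for illustration.

```python
import sqlite3


def build_database(sentences, dictionary, keywords):
    """Store (term, sentence) rows for sentences that pass both filters."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (term TEXT, sentence TEXT)")
    for sentence in sentences:
        lowered = sentence.lower()
        # Keyword filter: keep only sentences mentioning a selected keyword.
        if not any(keyword in lowered for keyword in keywords):
            continue
        # Dictionary filter: one row per dictionary term found (substring match).
        for term in dictionary:
            if term in lowered:
                conn.execute("INSERT INTO hits VALUES (?, ?)", (term, sentence))
    conn.commit()
    return conn


# Hypothetical input sentences, dictionary and keywords.
sentences = [
    "GENE1 encodes a sodium channel subunit.",
    "GENE1 is expressed in several tissues.",
    "The sodium channel opens on depolarization.",
]
conn = build_database(sentences, dictionary=["gene1"], keywords=["sodium channel"])
rows = conn.execute("SELECT term, sentence FROM hits").fetchall()
```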
72
  • Thank you
  • Shukran