Real-time Text Mining for the Biomedical Literature

Provided by: mmg7
1
Real-time Text Mining for the Biomedical Literature
A collaboration between Discovery Net and myGrid
Rob Gaizauskas, Department of Computer Science,
University of Sheffield
Moustafa M. Ghanem, Department of Computing,
Imperial College London
2
Outline
  • Context
  • Workflows, Services and Text Mining
  • Discovery Net and myGrid
  • Aims and Objectives of New Project
  • Architecture of New System
  • Integration of Existing Components
  • Approach to Text Mining
  • Data Resources and Evaluation
  • Techniques for GO Tagging
  • Interface and Results Presentation
  • Lessons Learnt So Far, Conclusions and Broader
    Applicability of Work

3
Workflows, Web Services and Text Mining for
Bioinformatics
  • Workflows
    • useful computational models for processes that
      require repeated execution of a series of complex
      analytical tasks
    • e.g. a biologist researching the genetic basis of a
      disease repeatedly:
      • maps a reactive spot in microarray data to a gene
        sequence
      • uses a sequence alignment tool to find
        proteins/DNA of similar structure
      • mines information about these homologues from
        remote DBs
      • annotates the unknown gene sequence with the
        discovered information

4
Workflows, Web Services and Text Mining for
Bioinformatics
  • Web services
    • Processing resources that
      • are available via the Internet
      • use standardised messaging formats, such as XML
      • enable communication between applications without
        being tied to a particular operating
        system/programming language
    • Useful for bioinformatics, where the data used in
      research is
      • heterogeneous in nature: DB records, numerical
        results, NL texts
      • distributed across the Internet in research
        institutions around the world
      • available on a variety of platforms and via
        non-uniform interfaces

5
Workflows, Web Services and Text Mining for
Bioinformatics
  • Text mining
    • any process of revealing information
      (regularities, patterns or trends) in textual
      data
    • includes more established research areas such as
      information extraction (IE), information
      retrieval (IR), natural language processing
      (NLP), knowledge discovery from databases (KDD)
      and traditional data mining (DM)
    • relevant to bioinformatics because of
      • explosive growth of the biomedical literature
      • availability of some information in textual form
        only, e.g. clinical records

6
Workflows, Web Services and Text Mining for
Bioinformatics
7
Discovery Net and myGrid
  • Discovery Net: An e-Science Testbed for High
    Throughput Informatics
    • £2.2M EPSRC Pilot Project
    • Started October 2001, ended March 2005
    • Service-based infrastructure/workflow model for
      Life Sciences, Environmental Modelling and
      Geo-hazard Modelling
    • Infrastructure for mixed data mining / text
      mining
    • Machine learning methods for text mining
  • myGrid: Directly Supporting the e-Scientist
    • £3.5M EPSRC Pilot Project
    • Started October 2001, ends June 2005
    • Service-based infrastructure/workflow model for
      Life Sciences
    • Infrastructure for Text Collection Server, Text
      Services Workflow Server and Interface/Browsing
      Client
    • Service-based Terminology Servers

8
myGrid
  • Overall aim: develop an e-biologist's workbench,
    a platform allowing biologists to execute,
    analyze and repeat multi-stage in silico experiments
    involving distributed data, code and processing
    resources
    • Workflow model for composing/executing processing
      components
    • Web services for distribution
  • Problem: how to integrate text mining into a
    biological workflow?
    • Most text mining runs off-line and supports
      interactive browsing of results
    • Most workflows run end to end with no user
      intervention
    • What are the inputs to text mining to be?
  • Solution: tap off the result of a workflow step and
    treat it as an implicit query

9
A myGrid example: studying the Genetic Basis of
Disease
  • Graves' disease
    • an autoimmune condition affecting tissues in the
      thyroid and orbit
    • being investigated using micro-array methods
    • micro-array shows which genes are differentially
      expressed in normal patients vs patients with
      the disease: the candidate genes
    • sequence alignment search (e.g. BLAST) finds
      genes/proteins with similar structure
    • function of these homologues may suggest
      function of the candidate gene
  • key step for text mining follows the BLAST search
    • for homologous proteins, the BLAST report contains
      references to proteins in the SWISSPROT protein
      database
    • SWISSPROT records contain ids of abstracts
      describing the protein in the Medline abstract
      database
    • abstracts can be mined directly or used as
      "seed" documents to assemble a set of related
      abstracts
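
The chain described on this slide (BLAST report, then SWISSPROT records, then Medline abstract ids) can be sketched as below. This is only an illustration under stated assumptions, not the myGrid implementation: the report layout, the regular expression, the accession numbers and the abstract ids are all invented, and the SWISSPROT lookup is an in-memory dictionary standing in for a remote service.

```python
import re

# Hypothetical stand-in for a SWISSPROT lookup service:
# accession -> ids of Medline abstracts cited in the record (invented values).
SWISSPROT_TO_PMIDS = {
    "P00001": ["1000001", "1000002"],
    "P00002": ["1000002", "1000003"],
}

# Assumed shape of a SWISSPROT reference inside a BLAST report line,
# e.g. "sp|P00001|AAAA_HUMAN".
SP_REF = re.compile(r"sp\|([A-Z0-9]+)\|")


def accessions_from_blast(report: str) -> list[str]:
    """Return SWISSPROT accessions in order of first appearance."""
    seen: list[str] = []
    for accession in SP_REF.findall(report):
        if accession not in seen:
            seen.append(accession)
    return seen


def seed_abstract_ids(report: str) -> list[str]:
    """Collect abstract ids reachable from the homologues in a BLAST report."""
    pmids: list[str] = []
    for accession in accessions_from_blast(report):
        for pmid in SWISSPROT_TO_PMIDS.get(accession, []):
            if pmid not in pmids:
                pmids.append(pmid)
    return pmids


report = """\
sp|P00001|AAAA_HUMAN  score 120
sp|P00002|BBBB_HUMAN  score 98
sp|P00001|AAAA_HUMAN  score 75
"""

print(seed_abstract_ids(report))
```

The returned ids play the role of the implicit query: they can be mined directly or used as seeds for retrieving related abstracts.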

10
myGrid Text Services Architecture
11
myGrid Text Services Architecture
  • 3-way division of labour: a sensible way to deliver
    distributed text mining services
    • Providers of e-archives, such as Medline, will
      make archives available via a web-services
      interface
      • Cannot offer tailored services for every
        application
      • Will provide core, common services
    • Specialist workflow designers will add value to
      basic services from the archive to meet their
      organization's needs
    • Users will prefer to execute predefined workflows
      via standard light clients such as a browser
  • Architecture appropriate for many research areas,
    not just bioinformatics

12
myGrid Interface/Browsing Client
13
Discovery Net: Adding text mining to e-Science
workflows
  • DNet Workflow server executes a DPML workflow and
    uses Discovery Net's InfoGrid data access and
    integration wrappers and web services

14
Text Mining in e-Science workflows
  • Problem: how to develop new distributed text
    mining applications using a workflow?
    • Most text mining applications require the
      integration of a mixture of components (services)
      for text processing tasks (e.g. parsing and
      cleaning), natural language processing (e.g.
      named entity recognition), statistics and data
      mining (e.g. classification, clustering).
    • There are many design alternatives, and end users
      may want to prototype and compare alternative
      implementations.
    • Once an application is developed, most workflows
      run end to end with no user intervention.
  • Solution: extend the service infrastructure to allow
    composition of text mining services.

15
Building text mining applications from workflows
  • Using workflow technologies to build text mining
    applications and services using finer grain
    components/services
  • Text Mining Pipelines
    • Retrieve and organize relevant documents
    • Pre-process documents to enhance the ease of
      feature extraction
    • Features are summarized into vector forms which
      are suitable for data mining
    • Results can be document characterization or
      hidden relationship extraction
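
The pipeline stages named on this slide (retrieve, pre-process, summarize features into vectors) can be sketched as a chain of small functions. This is a sketch under assumptions rather than the Discovery Net components: the documents, the stop-word list and all function names are invented, and retrieval is a simple substring match over an in-memory collection.

```python
import re
from collections import Counter

# Toy collection standing in for a remote document archive (invented texts).
DOCS = {
    "d1": "Thyroid autoimmune disease and the thyroid receptor.",
    "d2": "Yeast gene expression in microarray data.",
    "d3": "Autoimmune response genes in thyroid tissue.",
}

STOPWORDS = {"and", "the", "in", "of", "a"}


def retrieve(query: str) -> dict[str, str]:
    """Stage 1: keep the documents that mention the query term."""
    return {k: v for k, v in DOCS.items() if query.lower() in v.lower()}


def preprocess(text: str) -> list[str]:
    """Stage 2: lower-case, tokenise and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def vectorise(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Stage 3: summarise a document as term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def run_pipeline(query: str) -> tuple[list[str], dict[str, list[int]]]:
    """Chain the three stages and return the vocabulary and document vectors."""
    docs = {k: preprocess(v) for k, v in retrieve(query).items()}
    vocabulary = sorted({t for tokens in docs.values() for t in tokens})
    return vocabulary, {k: vectorise(tokens, vocabulary) for k, tokens in docs.items()}


vocabulary, vectors = run_pipeline("thyroid")
print(vocabulary)
print(vectors)
```

The resulting vectors are the kind of input a classification or clustering step would consume to produce the final results.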
16
Simplified Document Classification Workflow
Predictive accuracy of relevance prediction using
Support Vector Machine classification: overall
accuracy 84.5%, precision 78.11%, recall 73.40%
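
For reference, the three measures quoted on this slide are conventionally computed from the counts of true/false positives and negatives. The sketch below uses invented counts; the slides do not give the confusion matrix behind the reported figures.

```python
def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all documents labelled correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of documents predicted relevant that really are relevant.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of relevant documents that were found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented counts, for illustration only.
scores = evaluate(tp=30, fp=10, fn=20, tn=40)
print(scores)
```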
17
Text Meta Data Model
  • Build the classifier (training phase) using a
    workflow co-ordinating distributed services
  • Build the prediction phase using a workflow
    co-ordinating distributed services
  • Metadata model: service interfaces only tell you
    how to invoke a remote service; it is up to you to
    decide what information flows between services!
18
Aims and Objectives of New Project
  • Aim: to develop a unified real-time e-Science
    text-mining infrastructure that leverages the
    technologies and methods developed by both
    Discovery Net and myGrid
  • Software engineering challenge: integrate
    complementary service-based text mining
    capabilities with different metadata models into
    a single framework
  • Application challenge: annotate biomedical
    abstracts with semantic categories from the Gene
    Ontology
  • Deliverables
    • D1: A GO Annotation Service
    • D2: A Generic Shared Infrastructure for
      Grid-enabled Biomedical Document Categorization
    • D3: Infrastructure for Semantic Document
      Annotation
    • D4: A Detailed Case Study (analysing/evaluating
      the GO annotator)
    • D5: Developing a common framework for
      representing and exchanging information about
      1. Data: biomedical documents/doc collections
         metadata, biomedical dictionaries
      2. Intermediate data: document indexes and
         document feature vectors
      3. Text analysis results

19
GO TAG: A Novel Application
  • The GO TAG Application: Automatic Assignment of
    GO (Gene Ontology) Codes to Medline Documents

20
A Machine Learning Approach
21
Run-time System
22
GO Annotator Version 1
  • Version 1a
    • Direct search for GO Annotation descriptions and
      synonyms in document text
    • If a description is found, the document is
      labelled with this GO Annotation
    • The description is also marked up in the document
  • Version 1b
    • As 1a, plus a search for gene names extracted from
      the yeast genome DB
    • If a gene name is found, the document is labelled
      with the GO annotation(s) associated with the gene
      in the DB
    • The gene name is also marked up in the document
  • The Termino web service, hosted at Sheffield,
    provides the lookup capability
  • This is wrapped in a Discovery Net workflow to
    include PubMed query, results visualization and
    performance calculations
  • The workflow is deployed as a web application for
    end users, which includes an applet to interactively
    browse results
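
The lookup behaviour described for Versions 1a and 1b can be sketched with a plain dictionary match. This is an assumption-laden illustration, not the Termino service: the two small lexicons are hand-written for the example, the gene-to-GO association is illustrative, and square brackets stand in for whatever mark-up the real system produces.

```python
import re

# Version 1a lexicon: GO description or synonym -> GO ids (tiny illustrative subset).
LEXICON_1A = {
    "dna repair": ["GO:0006281"],
    "translation": ["GO:0006412"],
}

# Version 1b adds gene names mapped to the GO ids associated with the gene
# (illustrative association, standing in for a yeast genome DB lookup).
GENES_1B = {
    "RAD51": ["GO:0006281"],
}


def annotate(text: str, lexicon: dict[str, list[str]]) -> tuple[list[str], str]:
    """Label a document with GO ids and mark up each matched phrase in brackets."""
    found: list[str] = []
    marked = text
    for phrase, go_ids in lexicon.items():
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        if pattern.search(marked):
            marked = pattern.sub(lambda m: "[" + m.group(0) + "]", marked)
            for go_id in go_ids:
                if go_id not in found:
                    found.append(go_id)
    return found, marked


doc = "RAD51 is required for DNA repair in yeast."

ids_1a, text_1a = annotate(doc, LEXICON_1A)
ids_1b, text_1b = annotate(doc, {**LEXICON_1A, **GENES_1B})

print(ids_1a, text_1a)
print(ids_1b, text_1b)
```

Exact string matching of this kind is cheap but brittle, which is consistent with the low recall and precision the slides later report for the lookup-based system.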

23
GO Annotator Version 1: Underlying Discovery Net
Workflow
24
GO Annotator Version 1: Underlying Discovery Net
Workflow
Enter query and retrieve abstracts from PubMed.
25
GO Annotator Version 1: Underlying Discovery Net
Workflow
Use Termino to mark up abstracts with GO
Annotations when a match for a GO Annotation
description is found.
26
GO Annotator Version 1: Underlying Discovery Net
Workflow
Tabulate GO Annotations by PMID.
27
GO Annotator Version 1: Underlying Discovery Net
Workflow
Join PMIDs and matching GO Annotations with
abstracts and titles.
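
The last two workflow steps (tabulating GO Annotations by PMID, then joining them with abstracts and titles) amount to a group-by followed by a key join. A minimal sketch, assuming invented PMIDs and placeholder titles and abstracts; the function names are hypothetical and do not correspond to Discovery Net nodes.

```python
from collections import defaultdict

# Invented (PMID, GO id) matches as they might come out of the mark-up step.
matches = [
    ("111", "GO:0006281"),
    ("222", "GO:0006412"),
    ("111", "GO:0006412"),
    ("111", "GO:0006281"),  # duplicate match in the same abstract
]

# Invented records standing in for the retrieved PubMed results.
abstracts = {
    "111": {"title": "Title A", "abstract": "Abstract A"},
    "222": {"title": "Title B", "abstract": "Abstract B"},
}


def tabulate_by_pmid(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group GO ids by PMID, dropping duplicates and sorting for stable output."""
    table: dict[str, set[str]] = defaultdict(set)
    for pmid, go_id in pairs:
        table[pmid].add(go_id)
    return {pmid: sorted(ids) for pmid, ids in sorted(table.items())}


def join_with_abstracts(table: dict[str, list[str]], records: dict[str, dict]) -> list[dict]:
    """Attach title and abstract to each PMID that has GO annotations."""
    rows = []
    for pmid, go_ids in table.items():
        record = records.get(pmid)
        if record is None:
            continue  # no abstract retrieved for this PMID
        rows.append({"pmid": pmid, "title": record["title"],
                     "abstract": record["abstract"], "go_annotations": go_ids})
    return rows


table = tabulate_by_pmid(matches)
rows = join_with_abstracts(table, abstracts)
print(rows)
```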
28
Workflow Deployment
29
GO Annotator Version 2
  • Use the Saccharomyces (Yeast) Genome Database as a
    source of papers expertly curated with GO
    Annotations
  • Train a classifier using these papers
  • Hierarchical classification
  • Training data sufficient to classify over 2000 GO
    Annotations
  • The classifier is then applied to assign GO
    Annotations to unseen papers
  • Main Issues
    • Choice of features to be extracted from the
      training documents
    • Choice of feature reduction methods to produce
      accurate classification
    • Choice of classification algorithm to be used

30
GO Annotator Version 2: Underlying Discovery Net
Workflow
31
GO Annotator Version 2: Underlying Discovery Net
Workflow
Papers expertly curated with GO Annotations from
SGD database.
32
GO Annotator Version 2: Underlying Discovery Net
Workflow
Generate vector of features (frequent phrases)
for each paper. This is used to train classifier.
33
GO Annotator Version 2: Underlying Discovery Net
Workflow
Generate a Naïve Bayesian classification model.
34
GO Annotator Version 2: Underlying Discovery Net
Workflow
Generate vector of features (frequent phrases)
for each paper in test data set. This is used to
test the classifier.
35
GO Annotator Version 2: Underlying Discovery Net
Workflow
Apply classification model to test data to
evaluate classification accuracy.
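
The train, model and test steps of the Version 2 workflow can be illustrated with a small multinomial Naïve Bayes classifier. This is a simplified sketch, not the project's classifier: it is flat and single-label rather than hierarchical, it uses single words rather than frequent phrases as features, it applies add-one smoothing (an assumption), and the training and test documents are invented.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Count label frequencies and per-label word frequencies."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n_docs)
        for token in tokens:
            if token in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1)
                                  / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training and test documents, labelled with GO ids.
training = [
    ("dna repair damage".split(), "GO:0006281"),
    ("repair recombination dna".split(), "GO:0006281"),
    ("ribosome translation protein".split(), "GO:0006412"),
    ("translation trna ribosome".split(), "GO:0006412"),
]
test_set = [
    ("dna damage".split(), "GO:0006281"),
    ("ribosome protein".split(), "GO:0006412"),
]

model = train(training)
accuracy = sum(predict(model, tokens) == label
               for tokens, label in test_set) / len(test_set)
print(accuracy)
```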
36
Interface and Results Presentation
37
Achievements to date
  • Infrastructure Interoperability
    • More than just remote web service invocation:
      interoperable metadata models
  • Mark 1 System Implemented
    • Annotation based on terminology lookups
    • 15% recall, 5% precision (exact matches for
      18,000 GO terms)
    • Measures inadequate due to incompleteness of gold
      standard
  • In Process of Finalising Training Data Sets and
    Evaluation Metrics
    • 4,922 papers referencing 2,455 GO terms
  • Mark 2 Systems in Progress
    • Naïve Bayesian approach
    • 41% recall and 27% precision
  • User Interfaces
  • Mark 3, 4, Systems and Evaluation

38
Implementation Options
  • Feature Vector Options
    • Bag of words
    • Frequent Phrases
    • Key Phrases (Gene Names, Protein Names, MeSH
      terms, etc.)
  • Classifier Options
    • Bayesian Classifiers
    • Support Vector Machines
    • Drag Push (a novel centroid-based method)
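
To make the "frequent phrases" feature option concrete, the sketch below treats a phrase as a word bigram and calls it frequent when it occurs in at least two documents. Both choices are assumptions made for illustration; the slides do not define how frequent phrases are selected, and the documents are invented.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return all contiguous n-word phrases in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_phrases(docs: list[list[str]], n: int = 2, min_docs: int = 2) -> list[str]:
    """Keep phrases that occur in at least min_docs different documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(ngrams(tokens, n)))  # count each phrase once per document
    return sorted(p for p, count in doc_freq.items() if count >= min_docs)


def phrase_vector(tokens: list[str], phrases: list[str], n: int = 2) -> list[int]:
    """Represent a document by how often each frequent phrase occurs in it."""
    counts = Counter(ngrams(tokens, n))
    return [counts[p] for p in phrases]


docs = [
    "dna repair pathway in yeast".split(),
    "yeast dna repair genes".split(),
    "protein folding in yeast".split(),
]

phrases = frequent_phrases(docs)
vectors = [phrase_vector(tokens, phrases) for tokens in docs]
print(phrases)
print(vectors)
```

Compared with a bag of words, phrase features keep some word order ("dna repair" rather than "dna" and "repair" separately) at the cost of a sparser representation.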

39
Lessons Learnt and Challenges to Face
  • Infrastructure
    • Interoperability Issues
    • Performance Issues
      • Communication vs Persistence of remote server
      • Off-line vs on-line feature extraction
  • Text Mining
    • Usability Issues
    • Evaluation Issues