Life Sciences: Data Revolution - PowerPoint PPT Presentation


Title: Life Sciences: Data Revolution


1
Life Sciences: Data Revolution
Session 40382
  • Building Gene Expression Databases

Mahendra Navarange
Microarray Centre MRC Clinical Sciences Centre
and Imperial College, UK
2
Agenda
  • What is Life Science?
  • MiMiR database for gene expression data
  • Data acquisition process and data characteristics
  • System requirements
  • Design issues
  • Code snippets

3
What is Life Sciences ?
  • Includes
  • Biology
  • BioTechnology
  • Chemistry
  • Pharmaceuticals
  • Agriculture / Plant Science
  • Environmental Sciences
  • ????
  • Objective
  • Understand the molecular and evolutionary basis
    of living organisms

4
Focus Areas
  • Genomics
  • Human Genome Project
  • Draft announced in 2000 (published February 2001)
  • Finished version on 14 April 2003
  • Sequencing data doubles every year
  • Transcriptomics
  • Study of transcription (gene expression)
  • Proteomics
  • Study of translation (protein synthesis)

Courtesy F. Hoffmann-La Roche Ltd.
5
Data, Data, Data
  • Sanger Centre 5 TB
  • Celera 100 TB (2001)

6
Data Revolution in Life Sciences
  • Impact of technology
  • High throughput platforms (HTP)
  • Robotics
  • Miniaturisation
  • Data driven science
  • Datawarehousing technologies
  • Data mining and visualisation software

Information Technology
Life Sciences
7
Databases
  • Genomics
  • Sanger
  • NCBI
  • TIGR
  • KEGG
  • Transcriptomics
  • ArrayExpress
  • Proteomics
  • Protein Data Bank (PDB)
  • SWISS-PROT
  • Entrez

8
Using Life Sciences Data
  • identify causes of genetic diseases
  • discover new drug compounds
  • personalised medicine
  • develop new diagnostics

Figure: Drug Discovery Pipeline (labels: Target Identification, HTP Screening, Hits, Leads, Clinical Trials, FDA)
9
Life Sciences: The Future
  • "...biology is changing from a purely
    laboratory-based science to an information-based
    science."
  • Eric Lander, Director, Whitehead Institute, MIT

10
Agenda
  • What is Life Sciences ?
  • MiMiR database for gene expression data
  • Data acquisition process and data characteristics
  • System requirements
  • Design issues
  • Code snippets

11
Transcriptomics
  • Comparing gene expression across databases
  • Collaborate to share expertise
  • Benefits
  • Diagnostics
  • Screen target drug compounds
  • Identify toxic side effects
  • Screen patients for clinical trials

12
Workflow
Figure labels: Literature, Experiment design, Data, HTP, Preliminary Analysis, Further Analysis, Local DB, GO, Collaboration, NCBI
13
HTP Microarray Platform Hardware
Courtesy Affymetrix Inc., Dell Inc
14
Microarray Data Acquisition
Courtesy Affymetrix Inc.
Courtesy Fisher Scientific
15
Microarray Data
  • High density microarray
  • 500,000 spots of 18 µm size
  • >20,000 genes
  • Typical file size: 45 MB
  • Number of files produced in a typical experiment:
    10-20

Courtesy Affymetrix Inc.
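
As a rough illustration of the figures on this slide, the raw data volume of one experiment can be estimated by multiplying the typical file size by the number of files. This is only a back-of-envelope sketch; the class and method names are hypothetical and not part of MiMiR.

```java
/** Back-of-envelope estimate of raw data volume per microarray experiment. */
final class ExperimentVolume {

    /** Returns the total size in MB for a per-file size (MB) and a file count. */
    static long totalMegabytes(long fileSizeMb, int fileCount) {
        return fileSizeMb * fileCount;
    }

    public static void main(String[] args) {
        long low = totalMegabytes(45, 10);   // 10 files of 45 MB each
        long high = totalMegabytes(45, 20);  // 20 files of 45 MB each
        System.out.println("Per experiment: " + low + " to " + high + " MB");
    }
}
```

With the slide's numbers this gives 450 to 900 MB of raw files per experiment.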
17
Life Sciences Data Explosion
  • Data Characteristics
  • Image data generated by HTP platforms, annotation
    by researchers
  • Large volume and size
  • Varied data types
  • Datawarehousing challenges
  • Non-summarisable
  • High dimensionality
  • Limited knowledge of underlying biological
    processes
  • No standard industry data models or best practices

18
Agenda
  • What is Life Sciences ?
  • MiMiR database for gene expression data
  • Data acquisition process and data characteristics
  • System requirements
  • Design issues
  • Code snippets

19
System Requirements
  • Seamless data integration
  • Handle wide range of datatypes
  • Processor intensive and I/O intensive
  • Exponential growth in data storage
  • Open architecture, collaboration

20
System Requirements
  • Rapid changes: new databases, technologies and
    instruments
  • Competitive pressures, quick response, low access
    times
  • Plug and play capability
  • Security

21
MIcroarray Data MIning Resource
  • MiMiR Microarray Datawarehouse
  • 250GB. Expected to double in next few months
  • 2500 images, over 1500 BioAssays
  • 52 tables, largest table 15GB
  • Infrastructure
  • Oracle 9i Release 1 on Windows 2000
  • Dell PowerEdge Quad Processor, 2 GB memory, 400
    GB hard disk
  • 1 TB NAS capacity
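
To put the growth figure in context, the sketch below counts how many doublings a 250 GB warehouse can go through before it no longer fits on the 1 TB NAS. It assumes 1 TB = 1024 GB and makes no assumption about how long one doubling takes (the slide only says "next few months"). The class and method names are hypothetical.

```java
/** Rough capacity planning for a warehouse whose size keeps doubling. */
final class GrowthSketch {

    /** Counts the doublings needed before sizeGb is larger than capacityGb. */
    static int doublingsUntilExceeds(double sizeGb, double capacityGb) {
        if (sizeGb <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        int doublings = 0;
        while (sizeGb <= capacityGb) {
            sizeGb *= 2;   // one doubling period passes
            doublings++;
        }
        return doublings;
    }

    public static void main(String[] args) {
        // 250 GB today, 1 TB NAS (assumed to be 1024 GB)
        System.out.println(doublingsUntilExceeds(250, 1024));
    }
}
```

Under these assumptions the warehouse outgrows the NAS at the third doubling (250, 500, 1000, then 2000 GB).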

22
Requirements vs. Solutions
  • Integrate different types of data sources
  • Use of XML for data exchange
  • Use of Oracle UltraSearch
  • Efficient data retrieval
  • Stringent response time standards on procedures
  • Index-Organized Tables (IOTs), Partitioning
  • Security
  • Firewall
  • Single Sign-On servers (in progress)
  • Rapid change management
  • BC4J framework, JDeveloper
  • Extreme programming, prototyping

23
MiMiR System Architecture
Figure labels: MiMiR, Application Server, XSQL, XSU, XDK, BC4J, JClient, JSP, ArrayExpress, Private
24
Oracle Products Used
  • Oracle 9i Database Server/Client (Release 1)
  • Partitioning
  • Join indexing
  • Oracle 9i JDeveloper (9.0.2)
  • Oracle 9i Application Server (BC4J)
  • Oracle XML features
  • Oracle PL/SQL packages for XML
  • Oracle XSQL publishing framework
  • XDK (DOMParser and SAXParser)
  • XSU
  • Oracle Data Mining (Future)
  • Oracle Collaboration Suite (Future)

25
Why Oracle ?
  • Readily scalable
  • Manage wide variety of data types
  • Integrated development tools
  • Support XML and Java
  • High performance middleware
  • Secure collaboration

26
Agenda
  • What is Life Sciences ?
  • MiMiR database for gene expression data
  • Data acquisition and profiling
  • System requirements
  • Design issues
  • Code snippets

27
Oracle and XML: Design Issues
  • Storage
  • Storing XML in tables
  • Storing XML in CLOBs
  • Hybrid
  • Generation
  • XDK for Java, PL/SQL
  • XSU
  • Transformation
  • XSL Stylesheet
  • Views
  • Processing
  • XDK DOMParser
  • XDK SAXParser
  • Searching
  • XPATH
  • Oracle Text
  • Publishing
  • XSQL publishing framework
  • XSL
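
The processing and searching options listed above differ mainly in how much of the document is held in memory. The sketch below contrasts them using the JDK's standard JAXP classes as a stand-in for the Oracle XDK DOMParser and SAXParser, which are not used here. The sample document and all names are made up for illustration and only mimic the ROWSET/ROW shape that Oracle's XML utilities emit.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class XmlProcessingSketch {

    // Hypothetical sample data in ROWSET/ROW form (not taken from MiMiR).
    static final String SAMPLE =
        "<ROWSET>"
      + "<ROW num=\"1\"><CHIP_TYPE>MG-U74Av2</CHIP_TYPE><PAIRS>20</PAIRS></ROW>"
      + "<ROW num=\"2\"><CHIP_TYPE>MG-U74Bv2</CHIP_TYPE><PAIRS>16</PAIRS></ROW>"
      + "</ROWSET>";

    /** DOM: builds the whole tree in memory, then counts ROW elements. */
    static int countRowsDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("ROW").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** SAX: streams parse events, keeping only a counter in memory. */
    static int countRowsSax(String xml) {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if ("ROW".equals(qName)) {
                            count[0]++;
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    /** XPath: selects the CHIP_TYPE text of the row with the given num attribute. */
    static String chipTypeOfRow(String xml, int num) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(
                "/ROWSET/ROW[@num='" + num + "']/CHIP_TYPE",
                new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countRowsDom(SAMPLE));
        System.out.println(countRowsSax(SAMPLE));
        System.out.println(chipTypeOfRow(SAMPLE, 2));
    }
}
```

For result sets with hundreds of thousands of rows, the streaming (SAX) style is usually the safer default, since the DOM variant needs the full tree in memory.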

28
Oracle and XML: XSQL Example
<?xml version="1.0" encoding='windows-1252'?>
<!--
  Uncomment the following processing instruction and replace
  the stylesheet name to transform output of your XSQL Page using XSLT
  <?xml-stylesheet type="text/xsl" href="YourStylesheet.xsl" ?>
-->
<?xml-stylesheet type="text/xsl" href="mimirArray.xsl"?>
<xsql:query connection="micro" xmlns:xsql="urn:oracle-xsql">
  select * from array
</xsql:query>
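
The stylesheet mimirArray.xsl referenced by the page is not shown in the slides. As a rough sketch of what such a transformation step does, the following applies a made-up stylesheet to a made-up one-row ROWSET using the JDK's javax.xml.transform API. Both documents and all names are assumptions for illustration; in the XSQL framework the transformation is triggered by the xml-stylesheet processing instruction rather than by hand-written code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class StylesheetSketch {

    // Hypothetical query result in ROWSET/ROW form.
    static final String ROWSET =
        "<ROWSET><ROW num=\"1\"><CHIP_TYPE>MG-U74Av2</CHIP_TYPE></ROW></ROWSET>";

    // Hypothetical stylesheet: prints "num: CHIP_TYPE;" for every row as plain text.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/ROWSET\">"
      + "<xsl:for-each select=\"ROW\">"
      + "<xsl:value-of select=\"@num\"/>: <xsl:value-of select=\"CHIP_TYPE\"/>;"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the given XSLT stylesheet to the given XML and returns the output. */
    static String transform(String xml, String xsl) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(ROWSET, XSL));
    }
}
```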
29
Oracle and XML: Design Issues
30
Agenda
  • What is Life Sciences ?
  • MiMiR database for gene expression data
  • Data profiling
  • System requirements
  • Design issues
  • Code snippets

31
An Example
  • Creating XML from 500,000 records in the database

32
Solution 1
  • Using the XSU Java API to get an XML DOM:
      // import oracle.xml.sql.query.OracleXMLQuery;
      // import oracle.xml.parser.v2.XMLDocument;
      conn = createConnection.createConnection();
      String query = "SELECT * FROM IMAGE_QUANTITATION i "
                   + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
      OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
      q1.keepCursorState(true);
      // Build the full result as an in-memory DOM tree
      XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
      xmlDoc.print(out);

33
Solution 2
  • Using the XSU Java API to get an XML string:
      conn = createConnection.createConnection();
      String query = "SELECT * FROM IMAGE_QUANTITATION i "
                   + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
      OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
      q1.keepCursorState(true);
      XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
      xmlDoc.print(out);
      // Return the result as one XML string instead of a DOM tree
      System.out.println(q1.getXMLString());

34
Solution 3
  • Using the dbms_xmlquery package to get XML output
    from SQL:
      SELECT dbms_xmlquery.getXML(
        'select * from IMAGE_QUANTITATION
          where quant_filename = ''PMB2002011001Aaa''')
      FROM dual;
  • Output (as far as shown on the slide):
      <?xml version = '1.0'?>
      <ROWSET>
      <ROW num="1">
      <IMAGE_ID>PMB2002011003Aaa</IMAGE_ID>
      <CHIP_TYPE>MG-U74Av2</CHIP_TYPE>
      <ELE_SET_NAME>AFFX-MurIL2_at</ELE_SET_NAME>
      <POSITIVE>2</POSITIVE>
      <NEGATIVE>5</NEGATIVE>
      <PAIRS>20</PAIRS>
      <PAIRS_USED>20</PAIRS_USED>
      <PAIRS_IN_AVG>19</PAIRS_IN_AVG>
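
All three solutions produce the same ROWSET/ROW layout. For readers without the Oracle libraries at hand, the sketch below writes that layout with the JDK's streaming XMLStreamWriter, which emits each row as it is visited instead of building a tree first. It is not XSU or dbms_xmlquery; the class, the method and the in-memory row list are hypothetical stand-ins for a database cursor.

```java
import java.io.StringWriter;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

final class RowsetWriterSketch {

    /** Writes rows (column name -> value) as ROWSET/ROW XML, one row at a time. */
    static String toRowsetXml(List<Map<String, String>> rows) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("ROWSET");
            int num = 1;
            for (Map<String, String> row : rows) {
                w.writeStartElement("ROW");
                w.writeAttribute("num", Integer.toString(num++));
                for (Map.Entry<String, String> col : row.entrySet()) {
                    w.writeStartElement(col.getKey());   // column name as element
                    w.writeCharacters(col.getValue());   // value, escaped by the writer
                    w.writeEndElement();
                }
                w.writeEndElement(); // ROW
            }
            w.writeEndElement(); // ROWSET
            w.close();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two columns borrowed from the slide's sample output
        Map<String, String> row = new LinkedHashMap<>();
        row.put("CHIP_TYPE", "MG-U74Av2");
        row.put("PAIRS", "20");
        System.out.println(toRowsetXml(Collections.singletonList(row)));
    }
}
```

In a real setting the writer would wrap a file or network stream rather than a StringWriter, so that memory use stays flat regardless of the row count.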

35
Summary
  • The life sciences are generating enormous amounts
    of data using HTP platforms
  • The data is non-summarisable, distributed and has
    varied data types
  • Data integration and secure collaboration are key
    to success
  • MiMiR

36
Acknowledgements
  • Dr. Helen Causton
  • Prof. Tim Aitman
  • Dr. Laurence Game
  • Helen Banks
  • Nicola Cooley
  • Vihar Wadekar
  • Helen Figueira
  • MGED Data Society (www.mged.org)

37
Life Sciences: Data Revolution
Session 40382
Building Gene Expression Databases
What Next: Opportunities for collaboration on the
development of Knowledge Management Systems
for Drug Discovery
Contact: mahendra.navarange_at_csc.mrc.ac.uk
http://microarray.csc.mrc.ac.uk