Databases as Analytical Engines for Drug Discovery - PowerPoint PPT Presentation

About This Presentation
Title:

Databases as Analytical Engines for Drug Discovery

Description:

Databases as Analytical Engines for Drug Discovery Susie Stephens Principal Product Manager, Life Sciences Oracle Corporation susie.stephens_at_oracle.com – PowerPoint PPT presentation

Slides: 45

Transcript and Presenter's Notes

Title: Databases as Analytical Engines for Drug Discovery


1
Databases as Analytical Engines for Drug Discovery
  • Susie Stephens, Principal Product Manager, Life
    Sciences
  • Oracle Corporation
  • susie.stephens_at_oracle.com

2
Outline
  • Data Challenges
  • Case Studies
  • Summary

3
Access Distributed Data
[Diagram: data sources and access methods. Sources: external sites, MySQL, flat files, Sybase, SRS. Access methods: UltraSearch, distributed query, DBlinks, Transparent Gateway, Generic Connectivity, External Table.]
4
Integrate a Variety of Data Types
  • CLOBs
  • XML
  • Text
  • Images
  • Video
  • Relational
  • User-Defined Objects
  • Nucleotide Sequences
  • Gene Expression Data
  • Papers
  • Cell Histology Images
  • Protein Folding Video
  • SwissProt
  • KEGG
  • Chemical Structures

5
Manage Vast Quantities of Data
  • Partitioning
  • Oracle Data Guard
  • Real Application Clusters (RAC)
  • Automated Storage Management
  • Adaptive Instance Tuning
  • Automated Application and SQL Tuning
  • Automated Database Diagnostic Monitor (ADDM)
  • Scheduling

6
Collaborate Securely
  • Integrated communications
  • Single enterprise search
  • Flexible access
  • Fine grained access control
  • Auditing
  • Workflow
  • Personalized portal

7
Find Patterns and Insights
  • Oracle Data Mining
  • Find relationships and clusters
  • Oracle Discoverer and Oracle OLAP
  • Interactive query and drill-down
  • Statistics
  • mean, stdev, median, correlations, linear
    regression
  • Oracle Text
  • Cluster and classify documents of interest
  • Table Functions
  • Implement complex algorithms within the database

8
Outline
  • Data Challenges
  • Case Studies
  • Summary

9
Regular Expression Searches
  • A powerful method of describing both simple and
    complex patterns for searching and manipulating
  • Multilingual regular expression support for SQL
    and PL/SQL string types
  • Follows POSIX-style Regexp syntax
  • Supports standard Regexp operators
  • Includes common extensions such as
    case-insensitive matching, sub-expression
    back-references, etc.
  • Compatible with popular Regexp implementations
    like GNU, Perl, Awk

10
Case Study: Retrieve Protein Data from SGD using
Regular Expressions
Case study courtesy of Prolexys Pharmaceuticals,
Inc.
11
HTTP Raw Data
</script> </head><body><body bgcolor='FFFFFF'>
<table cellpadding="2" width="100" cellspacing="0" border="0">
<tr><td colspan="4"><hr width="100" /></td></tr>
<tr><td valign="middle" align="right"><a href="http://www.yeastgenome.org/"><img alt="SGD" border="0" src="http://www.yeastgenome.org/images/SGD-to.gif" /></a></td>
<th valign="middle" nowrap="1">Quick Search</th>
<td valign="middle" align="left"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/search/quickSearch" enctype="application/x-www-form-urlencoded"> <input type="text" name="query" size="13" /><input type="submit" name="Submit" value="Submit" /> </form></td>
<th valign="middle" align="left"><a href="http://www.yeastgenome.org/sitemap.html">Site Map</a> <a href="http://www.yeastgenome.org/HelpContents.shtml">Help</a> <a href="http://www.yeastgenome.org/SearchContents.shtml">Full Search</a> <a href="http://www.yeastgenome.org/">Home</a></th></tr>
<tr><td align="left" colspan="4"><table cellpadding="1" width="100" cellspacing="0" border="0"><tr align="center" bgcolor="navajowhite">
<td><font size="-1"><a href="http://www.yeastgenome.org/ComContents.shtml">Community Info</a></font></td>
<td><font size="-1"><a href="http://www.yeastgenome.org/SubmitContents.shtml">Submit Data</a></font></td>
<td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd">BLAST</a></font></td>
<td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/web-primer">Primers</a></font></td>
<td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/PATMATCH/nph-patmatch">PatMatch</a></font></td>
<td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/seqTools">Gene/Seq Resources</a></font></td>
<td><font size="-1"><a href="http://www.yeastgenome.org/Vl-yeast.shtml">Virtual Library</a></font></td>
<td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Contact SGD</a></font></td>
</tr></table></td></tr>
<tr><td colspan="4"><hr width="100" /></td></tr></table>
<table cellpadding="0" width="100" cellspacing="0" border="0"><tr><td width="10"><br /></td><td valign="middle" align="center" width="80"><h1>Sequence for a region of YDR099W/BMH2</h1></td><td valign="middle" align="right" width="10"></td></tr></table>
<p /><center><a target="infowin" href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Send questions or suggestions to SGD</a></center><p /><p />
<center><a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd?name=YDR099W&suffix=prot">BLAST search</a> <a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-fastasgd?name=YDR099W&suffix=prot">FASTA search</a></center>
<p /><center><hr width="35" /></center><p />
<font color="FF0000"><strong>Protein translation of the coding sequence.</strong></font><p /><p />
Other Formats Available <a href="http://db.yeastgenome.org/cgi-bin/SGD/getSeq?map=pmap&seq=YDR099W&flankl=0&flankr=0&rev=">GCG</a>
<pre>>YDR099W Chr 4
MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRAS
WRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESK
VFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFS
VFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISES
GQEDQQQQQQQQQQQQQQQQAPAEQTQGEPTK
</pre><hr size="2" width="75">
<table width="100"><tr><td valign="top" align="left"><a href="http://www.yeastgenome.org/"><img border="0" src="http://www.yeastgenome.org/images/arrow.small.up.gif" />Return to SGD</a></td>
<td valign="bottom" align="right"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/suggestion" enctype="application/x-www-form-urlencoded" target="infowin" name="suggestion">
<input type="hidden" name="script_name" value="/cgi-bin/SGD/getSeq" /><input type="hidden" name="server_name" value="db.yeastgenome.org" /><input type="hidden" name="query_string" value="seq=YDR099W&flankl=0&flankr=0&map=p3map" />
<a href="javascript:document.suggestion.submit()">Send a Message to the SGD Curators<img border="0" src="http://www.yeastgenome.org/images/mail.gif" /></a> </form></td></tr></table></body></html>
12
Function to Parse out AA Sequence
create or replace function orf2seq (
  p_orf in varchar2
) return varchar2 is
  v_stream clob;
  strt     number;
begin
  -- Retrieve the HTTP stream
  v_stream := httpuritype.getclob(httpuritype.createuri(
    'http://db.yeastgenome.org/cgi-bin/SGD/getSeq?seq=' || p_orf ||
    '&flankl=0&flankr=0&map=p3map'));
  -- Trim off the head of the stream
  strt := dbms_lob.instr(v_stream, 'Submit', 1, 1);
  -- Strip out control characters, new lines, etc.
  v_stream := regexp_replace(dbms_lob.substr(v_stream, 4000, strt),
                             '[[:cntrl:]]', '');
  -- Return the AA sequence
  return(regexp_substr(dbms_lob.substr(v_stream, 4000, strt),
                       '[[:upper:]]{10,}'));
end;
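The parsing idea behind orf2seq can be tried outside the database. The block below is a minimal Python sketch, not the presenter's code: it works on an HTML string already in memory and applies the same three steps (skip to the first 'Submit', strip control characters, take the first run of ten or more uppercase letters). The name extract_aa_sequence, the window default and the sample page are assumptions made for illustration, and the character classes are only rough stand-ins for POSIX [[:cntrl:]] and [[:upper:]].

```python
import re
from typing import Optional

# Rough ASCII stand-ins for the POSIX classes used in the PL/SQL function.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")  # ~ [[:cntrl:]]
AA_RUN = re.compile(r"[A-Z]{10,}")              # ~ [[:upper:]]{10,}


def extract_aa_sequence(html: str, window: int = 4000) -> Optional[str]:
    """Return the first run of 10+ uppercase letters after 'Submit'."""
    start = html.find("Submit")
    if start < 0:
        return None
    # Keep a fixed-size window after the marker and drop newlines etc.,
    # so that a sequence wrapped over several lines becomes one string.
    chunk = CONTROL_CHARS.sub("", html[start:start + window])
    match = AA_RUN.search(chunk)
    return match.group(0) if match else None


# Hypothetical, heavily shortened page used only to exercise the function.
sample_page = (
    '<input type="submit" name="Submit" value="Submit" />\n'
    "<pre>>YDR099W Chr 4\n"
    "MSQTREDSVY\n"
    "LAKLAEQAER\n"
    "</pre>"
)

print(extract_aa_sequence(sample_page))
```

Doing the same work inside the database, as on the slide, avoids moving the page text to a separate tool before the sequence is stored.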
13
AA Sequence for ORF YDR099W
SQL> select orf2seq('YDR099W') from dual;

ORF2SEQ('YDR099W')
--------------------------------------------------------------------------------
MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEE
RNLLSVAYKNVIGARRASWRIVSSIEQKEESKEKSEHQVELIRSYRSKIE
TELTKISDDILSVLDSHLIPSATTGESKVFYYKMKGDYHRYLAEFSSGDA
REKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFSVFYYEIQNSPDK
ACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISESGQ
EDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK

Elapsed: 00:00:01.24

SQL> insert into pseq (orf_id, sequence)
  2  values ('YDR099W', orf2seq('YDR099W'));
14
Case Study: Motif Searching in Proteins
  • PROSITE database of protein sequence motifs
  • ID TYR_PHOSPHO_SITE PATTERN
  • AC PS00007
  • DT APR-1990 (CREATED) APR-1990 (DATA UPDATE)
    APR-1990 (INFO UPDATE)
  • DE Tyrosine kinase phosphorylation site
  • PA [RK]-x(2,3)-[DE]-x(2,3)-Y
  • CC /TAXO-RANGE=??E?V
  • CC /SITE=5,phosphorylation
  • CC /SKIP-FLAG=TRUE
  • DO PDOC00007
  • Source: http://www.expasy.org/prosite/ps_frequent_patterns.txt
  • TKP Pattern: [RK]-x(2,3)-[DE]-x(2,3)-Y
  • R=Arginine, K=Lysine, D=Aspartate, E=Glutamate,
    Y=Tyrosine, x=any AA
  • Oracle10g Regular Expression Equivalent:
  • [RK].{2,3}[DE].{2,3}Y

Case study courtesy of Prolexys Pharmaceuticals,
Inc.
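The step from the PROSITE pattern to its regular expression equivalent is mechanical, and a short Python sketch can make it explicit. This is an illustration under stated assumptions rather than anything shown in the presentation: the function name prosite_to_regex is hypothetical, and only the most common PROSITE elements are handled (single residues, x for any residue, [..] alternatives, {..} exclusions and (n) or (n,m) repeat counts).

```python
import re

# One PROSITE element: a core (residue, x, [..] or {..}) plus optional repeat.
ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into regular expression syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported element: " + element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # x means any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # {..} lists excluded residues
        if low is not None:
            bounds = low if high is None else low + "," + high
            core += "{" + bounds + "}"        # x(2,3) becomes .{2,3}
        parts.append(core)
    return "".join(parts)


tkp = prosite_to_regex("[RK]-x(2,3)-[DE]-x(2,3)-Y")
print(tkp)
print(bool(re.search(tkp, "MMKAADAAYGG")))
```

The resulting string has the same form as the pattern passed to regexp_like and regexp_instr in the queries of this case study.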
15
SQL to Retrieve All Proteins Interacting with TKP
select distinct
  substr(a.refseq_id, 1, 9) refseq_id,
  length(a.seq_string_varchar) seq_length,
  regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}Y', 1, 1) motif_offs1,
  regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}Y', 1, 2) motif_offs2,
  regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}Y', 1, 3) motif_offs3,
  regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}Y', 1, 4) motif_offs4
from target_db a, y2h_interaction_p b
where a.refseq_id like 'NP%'
  and regexp_like(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}Y')
  and (substr(a.refseq_id,1,9) = b.bait_refseq or
       substr(a.refseq_id,1,9) = b.prey_refseq)
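To see what the four regexp_instr columns report, the following Python sketch computes comparable values for a single sequence: the 1-based start position of the first to fourth non-overlapping match, or 0 when there is no such occurrence. It is an approximation for illustration only. The function name motif_offsets and the test sequence are made up, and Python's re module is not guaranteed to agree with Oracle's matcher in every edge case.

```python
import re

TKP_MOTIF = re.compile(r"[RK].{2,3}[DE].{2,3}Y")


def motif_offsets(sequence: str, occurrences: int = 4) -> list:
    """1-based start of each of the first N matches, padded with 0."""
    starts = [m.start() + 1 for m in TKP_MOTIF.finditer(sequence)]
    starts = starts[:occurrences]
    # Report 0 for occurrences that do not exist, as in the query results.
    return starts + [0] * (occurrences - len(starts))


# Hypothetical sequence with two motif hits: KAADAAY and RTTTEAAAY.
example = "MMKAADAAYGGGRTTTEAAAYC"
print(motif_offsets(example))
```

A row such as 241, 0, 0, 0 in the query results therefore means one motif occurrence, starting at residue 241.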
16
Query Results
REFSEQ_ID   SEQ_LENGTH  MOTIF1_OFFS  MOTIF2_OFFS  MOTIF3_OFFS  MOTIF4_OFFS
----------  ----------  -----------  -----------  -----------  -----------
NP_003961         1465           14          202          347          537
NP_003968          330          241            0            0            0
NP_003983          490            8           50           62           93
NP_004001         3562         3085            0            0            0
...
MHHCKRYRSPEPDPYLSYRWKRRRSYSREHEGRLRYPSRREPPPRRSRS
RSHDRLPYQRRYRERRDSDTYRCEERSPSFGEDYYGPSRSRHRRRSRERG
PYRTRKHAHHCHKRRTRSCSSASSRSQQSSKRTGRSVEDDKEGHLVCRIG
DWLQERYEIVGNLGEGTFGKVVECLDHARGKSQVALKIIRNVGKYREAAR
LEINVLKKIKEKDKENKFLCVLMSDWFNFHGHMCIAFELLGKNTFEFLKE
NNFQPYPLPHVRHMAYQLCHALRFLHENQLTHTDLKPENILFVNSEFETL
YNEHKSCEEKSVKNTSIRVADFGSATFDHEHHTTIVATRHYRPPEVILEL
GWAQPCDVWSIGCILFEYYRGFTLFQTHENREHLVMMEKILGPIPSHMIH
RTRKQKYFYKGGLVWDENSSDGRYVKENCKPLKSYMLQDSLEHVQLFDLM
RRMLEFDPAQRITLAEALLHPFFAGLTPEERSFHTSRNPSR
17
SQL to Retrieve Motif Frequency by Protein
select c.refseq_id "Refseq ID",
       rs2desc(c.refseq_id) "Protein Description",
       a.cnt "Repetitions",
       b.ps_ac "Prosite AC",
       b.descr "Motif Description"
from motif_data a, ps_data b, target_dbp c
where a.ps_ac = b.ps_ac
  and a.sequence_id = c.sequence_id
order by 3 desc, 1
18
Query Results
Refseq ID    Protein Description                                             Repetitions  Prosite AC  Motif Description
-----------  --------------------------------------------------------------  -----------  ----------  -------------------------------------
NP_055995.2  spectrin repeat containing, nuclear envelope 2                          145  PS00006     Casein kinase II phosphorylation site
NP_056363.1  bullous pemphigoid antigen 1, 230/240kDa                                132  PS00006     Casein kinase II phosphorylation site
NP_001139.2  ankyrin 2, neuronal                                                     115  PS00006     Casein kinase II phosphorylation site
NP_066267.1  ankyrin 3, node of Ranvier (ankyrin G)                                  110  PS00006     Casein kinase II phosphorylation site
NP_056363.1  bullous pemphigoid antigen 1, 230/240kDa                                102  PS00005     Protein kinase C phosphorylation site
NP_005520.2  heparan sulfate proteoglycan 2 (perlecan)                                97  PS00008     N-myristoylation site
NP_066267.1  ankyrin 3, node of Ranvier (ankyrin G)                                   97  PS00005     Protein kinase C phosphorylation site
NP_001139.2  ankyrin 2, neuronal                                                      96  PS00005     Protein kinase C phosphorylation site
NP_115495.1  monogenic, audiogenic seizure susceptibility 1 homolog (mouse)           95  PS00006     Casein kinase II phosphorylation site
...
19
Regular Expression Searches Quote
  • "Thanks to Oracle 10g's Regular Expressions (RE)
    query support, it's no longer necessary to export
    data from the database, process it with a RE
    enabled tool and then import the data back into
    the database. Now, RE processing can be handled
    with a single query." - Marcel Davidson, Head of
    Database Administration, Myriad Proteomics

20
Oracle Data Mining BLAST
  • Implemented using a table function interface
  • BLAST search functions can be placed in SQL
    queries
  • Different functions for match and align
  • Combination of SQL queries and BLAST is very
    powerful and flexible

21
Case Study: BLAST as a Sequence Identification
Tool
  • Identify proteins with high sequence similarity
    and their functional class
select function, COUNT(seq_id) f_count
from (select t.seq_id, t.score, t.expect,
             g.function
      from SwissProt_DB g,
           Table(BLASTP_MATCH(
             'AEQAERYDDMAAAMKRY',
             cursor (select seq_id, sequence
                     from SwissProt_DB),
             5)) t  /* expect_value */
      where t.seq_id = g.seq_id)
group by function  /* swissprot kw */
order by f_count

[Diagram: query flow. SwissProt_DB plus query_sequence and parameters feed BLASTP_MATCH, which returns seq_id, score, expect; the join with SwissProt_DB on t.seq_id = g.seq_id gives seq_id, function; GROUP BY gives function, f_count.]
22
Case Study: Homology Search between Yeast and
Human Data
[Diagram: Yeast Protein Interactome and Human Protein Interactome linked by Homology Mapping. Nodes A, B, C and X, Y, Z; interactions within each interactome are determined experimentally with Y2H; cross-species links are inferred through BLAST.]
Interlogs (AX, BY) and (AX, BZ)
Case study courtesy of Prolexys Pharmaceuticals,
Inc.
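The interlog idea on this slide can be written down as a small set operation. The Python sketch below is an assumption-laden illustration, not the case study's implementation: it takes a set of experimentally observed interactions in one species and a homolog mapping obtained elsewhere (for example from BLAST hits), and proposes interactions between the homologs in the other species. The names infer_interlogs, observed and homolog_of are hypothetical, and A, B, X, Y, Z are the placeholder labels from the diagram.

```python
from collections import defaultdict


def infer_interlogs(interactions, homolog_of):
    """Propose interactions between homologs of known interacting pairs.

    interactions: set of (protein, protein) pairs seen in species 1.
    homolog_of:   dict mapping a species-2 protein to its species-1 homolog.
    """
    # Invert the mapping: species-1 protein -> set of species-2 homologs.
    homologs = defaultdict(set)
    for other, known in homolog_of.items():
        homologs[known].add(other)

    inferred = set()
    for first, second in interactions:
        for x in homologs[first]:
            for y in homologs[second]:
                if x != y:
                    inferred.add(tuple(sorted((x, y))))
    return inferred


# A interacts with B; X is homologous to A, and Y and Z are homologous to B.
observed = {("A", "B")}
homolog_of = {"X": "A", "Y": "B", "Z": "B"}
print(sorted(infer_interlogs(observed, homolog_of)))
```

With these inputs the sketch proposes the pairs X-Y and X-Z, which corresponds to the two interlogs named on the slide.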
23
Batch BLAST: Human (query) vs. Yeast (subject)
-- c1: cursor over the human query sequences (declaration not shown on the slide)
for v1 in c1 loop
  insert into yeast_human_homolog (
    human_refseq,
    yeast_orf_name,
    score,
    expect
  )
  select
    v1.refseq_id,
    t.t_seq_id,
    t.score,
    t.expect
  from
    table ( blastp_match (
      v1.sequence_string,
      cursor ( select a.yeast_acn, a.yeast_seq
               from yeast_prot_seq a )
    )
    ) t;
end loop;

24
BLAST Results
Yeast    Yeast    Human        Human        Expect 1  Expect 2
Gene 1   Gene 2   Refseq 1     Refseq 2
-------  -------  -----------  -----------  --------  --------
YAR018C  YIL061C  NP_XXXXX1.1  NP_YYYYY1.1  4.79E-12  4.58E-06
YBL016W  YDL159W  NP_XXXXX2.1  NP_YYYYY2.1  1.11E-08  5.25E-10
YBL016W  YDL159W  NP_XXXXX3.1  NP_YYYYY3.1  2.63E-10  9.04E-11
YBL016W  YDL159W  NP_XXXXX4.1  NP_YYYYY4.1  4.57E-07  8.33E-09
YBL016W  YDL159W  NP_XXXXX5.1  NP_YYYYY5.1  1.57E-22  1.11E-08
YBL063W  YIL061C  NP_XXXXX6.1  NP_YYYYY6.1  3.17E-64  8.67E-06
YBL063W  YIL061C  NP_XXXXX7.1  NP_YYYYY7.1  2.30E-06  4.58E-06
YBR109C  YDR356W  NP_XXXXX8.1  NP_YYYYY8.1  1.78E-07  7.74E-11
YBR109C  YDR356W  NP_XXXXX9.1  NP_YYYYY9.1  1.24E-08  7.74E-11
YBR109C  YDR356W  NP_XXXX10.1  NP_YYYY10.1  5.19E-07  2.80E-20
YBR109C  YDR356W  NP_XXXX11.1  NP_YYYY11.1  3.92E-10  4.39E-11
YBR109C  YFR014C  NP_XXXX12.1  NP_YYYY12.1  3.67E-48  6.91E-17
YBR109C  YOL016C  NP_XXXX13.1  NP_YYYY13.1  3.67E-48  1.82E-17
Callouts: Yeast Interactors, Human Interactors, Interlogs
25
BLAST Quote
  • "Oracle 10g's new BLAST feature will enable us to
    easily integrate multiple types of genomic and
    proteomic data for complicated queries used in
    the mining of our proprietary protein-protein
    interaction and cDNA sequence datasets." - Jake
    Chen, Principal Bioinformatics Scientist, Myriad
    Proteomics

26
Spatial Network Data Model
  • Data model for managing graph (link-node)
    structures
  • Rich graph analysis functions
  • Supports a variety of network structures
    (hierarchical, directed, undirected, random,
    scale-free)
  • Framework for applying network constraints and
    rules (e.g. path length, cost, minimum bounding
    rectangle)
  • Bundled Java visualiser APIs for 3rd party
    tools, application development

27
Case Study: Integration Architecture
Native Formats: NREF, EMBL, GO, KEGG, BIND, AFCS
Distributed Database layer
NDM layer (semantic layer): Nodes, Edges, Graph,
Network Route
  • Data type determines available routes
  • Routes can be determined using semantics

Case study courtesy of Beyond Genomics, Inc.
28
Network Data Model Quote
  • "Beyond Genomics, Inc., as a leading systems
    biology company, believes that Oracle 10g's
    network data model will significantly advance the
    integration of metabolomic, proteomic,
    transcriptomic, and clinical data sets and the
    applications that derive value from these data."
    - Eric Neumann, Vice President Strategic
    Informatics, Beyond Genomics, Inc.

29
Oracle Data Mining
  • Unsupervised Learning
  • Hierarchical K-means Cluster
  • O-Cluster
  • Non-Negative Matrix Factorization
  • Apriori
  • Supervised Learning
  • Naïve Bayes
  • Adaptive Bayes Network
  • Support Vector Machines
  • Predictor Variance
  • ODM can mine structured data, text data, or
    structured and text data

30
K-Means Clustering
  • Hierarchical k-means produces tree of clusters
  • All splits are binary
  • Each cluster has a centroid and a histogram
  • Achieves a reliable solution in a single run
  • Ranked rules that describe attributes for cluster
  • Cluster assignments are probabilistic using a
    Bayesian model
  • Operates on very deep datasets by using a
    summarization module
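The first two bullets (a tree of clusters built from binary splits) describe a scheme similar to bisecting k-means. The Python sketch below illustrates that generic technique only. It is not ODM's algorithm, whose split criterion, probabilistic assignment and summarization step are not described in enough detail here to reproduce. All names (two_means, bisecting_kmeans, the toy points) are assumptions, and the seeding rule is a simple deterministic choice made for this example.

```python
from math import dist  # Python 3.8+


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def scatter(points):
    """Sum of squared distances to the cluster centroid."""
    center = centroid(points)
    return sum(dist(p, center) ** 2 for p in points)


def two_means(points, iterations=10):
    """Split one cluster in two with a few k-means (k=2) iterations."""
    center = centroid(points)
    a = max(points, key=lambda p: dist(p, center))  # seed 1: far from center
    b = max(points, key=lambda p: dist(p, a))       # seed 2: far from seed 1
    left, right = list(points), []
    for _ in range(iterations):
        new_left = [p for p in points if dist(p, a) <= dist(p, b)]
        new_right = [p for p in points if dist(p, a) > dist(p, b)]
        if not new_left or not new_right:
            break
        left, right = new_left, new_right
        a, b = centroid(left), centroid(right)
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly split the cluster with the largest scatter (binary splits)."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=scatter)
        if scatter(worst) == 0:
            break  # nothing left to split
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters


# Hypothetical 2-D toy data with three visible groups.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0)]
for cluster in bisecting_kmeans(toy, 3):
    print(cluster)
```

Because every step splits one existing cluster in two, the order of the splits defines a binary tree, which is the structure shown in the node diagrams of the brain tumor case study.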

31
Case Study: Brain Tumor Clustering
  • Collection of 42 Human Brain tumors and 7,129
    gene expression profiles
  • Clustering of samples according to their gene
    expression profiles
  • It is an example of class and taxonomy discovery
  • Does the data cluster according to the known
    biological classes?
  • 42 Tumor Samples
  • Normal Cerebellum MD (4)
  • Malignant Gliomas MGlio (10)
  • Medulloblastomas MD (10)
  • Rhabdoid tumors Rhabdoid (10)
  • Primitive Neuroectodermal PNET (8)

Pomeroy et al Nature 415, 24, p436 (2002).
32
ODM Hierarchical k-Means Clustering
[Figure: binary cluster tree with Nodes 1 to 7; leaf clusters labeled Glioblastoma Cluster, Normal Cluster, Medulloblastoma Cluster and Rhabdoid Cluster.]
33
Literature Results using Hierarchical Clustering
From Pomeroy et al Nature 415, 24, p436 (2002).
34
Association Rules
  • Captures frequent co-occurrences of
    items/attribute values
  • (A, B) => C: occurrence of A and B together
    implies C
  • Can be applied in different scenarios
  • Market basket analysis
  • Pattern discovery
  • Predictive applications
  • ODM uses SQL-based implementation of Apriori
    algorithm
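The confidence values quoted with association rules follow from simple counting. As a minimal Python sketch (not ODM's SQL-based Apriori implementation), the function below computes support and confidence for one candidate rule over a list of records. The name rule_stats and the four toy records are made up for illustration and are not taken from the clinical data set.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    support    = share of records containing antecedent and consequent
    confidence = share of antecedent records that also contain consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = sum(1 for r in records if antecedent <= r)
    with_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = with_both / len(records)
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical records: each is a set of attribute=value items.
records = [
    {"Stage=M0", "Outcome=S"},
    {"Stage=M0", "Outcome=S"},
    {"Stage=M0", "Outcome=F"},
    {"Stage=M1", "Outcome=F"},
]

support, confidence = rule_stats(records, {"Stage=M0"}, {"Outcome=S"})
print(support, confidence)
```

Apriori adds an efficient search for all item sets above a minimum support. The sketch only scores a single rule that is already given.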

35
Case Study: Analysis of Trends in a Patient Group
Clinical table of 60 medulloblastoma patients with
7 clinical attributes
  • Subtype: classic or desmoplastic medulloblastoma
  • Size (tumor size): T1-T4
  • Stage: M0-M4
  • Sex: M, F
  • Age (range): 0-5, 5-10, 10-15
  • Outcome: S (treatment success), F (treatment
    failure)
  • Chemo (regime type): 0,1,2,3,4,5,6
Pomeroy et al Nature 415, 24, p436 (2002).
36
Association Rules Results
Over 100 rules reflecting factual or known
relationships in data
  • Age=1 THEN Sex=M (confidence 0.8)
  • Interpretation: Most 5-10 year-old patients are
    male
  • Subtype=Desmoplastic THEN Stage=M0 (confidence
    0.79)
  • Interpretation: Most desmoplastic patients in
    the study have stage M0
37
Association Rules Results
Other interesting trends
  • Stage=M0 THEN Outcome=S (confidence 0.74)
  • Interpretation: Stage M0 vs non-M0 is a
    predictor of treatment outcome
  • Stage=M0 AND Size=T3 AND Chemo=1 THEN Outcome=S
    (confidence 0.92)
  • Interpretation: Most patients with stage M0 and
    size T3 who received chemo regime 1 had a good
    response to treatment
38
Support Vector Machines
  • SVM provides a very general multi-purpose and
    powerful classifier
  • SVM does not require feature selection and can
    work well with thousands of input features
  • SVM is accurate and can approximate complex
    functional relationships
  • SVM works in binary, multi-class, sparse (text)
    classification and regression
  • SVM is easy to train and apply and can be used
    in discovery mode or in production automated
    methodologies


39
Case Study: Classification of Normal Human Tissue
and Tumors
  • Multiple examples (14) of normal human tissue and
    tumors
  • Could a single model distinguish normal vs
    cancer?
  • Train set 200 samples, test set 80 samples
  • Microarray profiles for 7,129 genes

Normal Tissue vs. Cancer
S. Ramaswamy et al, Proc. Natl. Acad. Sci. USA
98:15149-15154 (2001)
40
Support Vector Machines Results
Normal vs. Cancer (Multiple types): SVM Test Set
Predictions
                 Predicted Normal   Predicted Cancer
Actual Normal    16                 10
Actual Cancer    3                  51
Test set accuracy: 83.75%
(Naïve Bayes: 75%)
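The reported test set accuracy can be checked directly from the four counts in the table. The Python sketch below does that arithmetic and also derives the per-class recall. The recall figures are not stated on the slide, and the names confusion, accuracy and recall are chosen for this example.

```python
# Keys are (actual, predicted); counts are taken from the table above.
confusion = {
    ("Normal", "Normal"): 16,
    ("Normal", "Cancer"): 10,
    ("Cancer", "Normal"): 3,
    ("Cancer", "Cancer"): 51,
}


def accuracy(matrix):
    """Share of all samples whose predicted class equals the actual class."""
    correct = sum(n for (actual, predicted), n in matrix.items()
                  if actual == predicted)
    return correct / sum(matrix.values())


def recall(matrix, label):
    """Share of samples of one actual class that were predicted correctly."""
    actual_total = sum(n for (actual, _), n in matrix.items()
                       if actual == label)
    return matrix[(label, label)] / actual_total


print(f"accuracy      = {accuracy(confusion):.4f}")
print(f"recall cancer = {recall(confusion, 'Cancer'):.4f}")
print(f"recall normal = {recall(confusion, 'Normal'):.4f}")
```

The accuracy works out to (16 + 51) / 80 = 0.8375, matching the 83.75% on the slide. The same counts show that most of the errors are normal samples predicted as cancer.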
41
Classification of Multiple Tumor Types
DNA Microarray Data for 14 Tumor Classes
Published Datasets
  • S. Ramaswamy et al, Proc. Natl. Acad. Sci. USA
    98:15149-15154 (2001)
  • C. Yeang et al, Procs. of ISMB 2001.
    Bioinformatics Discovery Note, 11-7, (2001)

42
Results of Multiple Tumor Type Analysis
  • Gene expression profiles for 7,129 genes
  • Dataset tumor type composition
  • 9 minutes training time on 500MHz Netra
  • 78.3% accuracy for multi-tumor molecular
    classification

Tumor Class       Train  Test    Tumor Class        Train  Test
Breast (BR)           8     3    Uterus (UT)            8     2
Prostate (PR)         8     2    Leukemia (LE)         24     6
Lung (LU)             8     3    Renal (RE)             8     3
Colorectal (CO)       8     5    Pancreas (PA)          8     3
Lymphoma (LY)        16     6    Ovary (OV)             8     3
Bladder (BL)          8     3    Mesothelioma (MS)      8     3
Melanoma (ML)         8     2    Brain (BR)            16     4
43
Outline
  • Data Challenges
  • Case Studies
  • Summary

44
Summary
  • Databases have functionality to access and
    integrate distributed data
  • There are data management, performance and
    security benefits to performing analytics in
    databases
  • A range of analytical functionality is now
    available in databases