talk in bioitworld2002 - PowerPoint PPT Presentation

Provided by: Limsoo

1
From Informatics to Bioinformatics
Limsoon Wong, Institute for Infocomm Research, Singapore
2
What is Bioinformatics?
3
Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
4
Benefits of Bioinformatics
To the patient: better drug, better treatment
To the pharma: save time, save cost, make more
To the scientist: better science
5
From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
  • Protein Interactions Extraction (PIES)
  • MHC-Peptide Binding (PREDICT)
  • Gene Expression & Medical Record Datamining (PCL)
  • Cleansing & Warehousing (FIMM)
  • Gene Feature Recognition (Dragon)
  • Integration Technology (Kleisli)
  • Venom Informatics
Timeline axis: 1994, 1996, 1998, 2000, 2002
Institutions: ISS, KRDL, LIT/I2R
6
Data Integration
A DOE "impossible query": for each gene on a given cytogenetic band,
find its non-human homologs.
7
Data Integration Results
sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select
  #accn: g.#genbank_ref, #nonhuman-homologs: H
from
  L as c, E as g,
  {select u
   from g.#genbank_ref.na-get-homolog-summary as u
   where not(u.#title string-islike "%Human%")
     andalso not(u.#title string-islike "%H.sapien%")} as H
where
  c.#chrom_num = "22" andalso
  g.#object_id = c.#locus_id andalso
  not (H = { });
  • Using Kleisli
  • Clear
  • Succinct
  • Efficient
  • Handles heterogeneity and complexity

8
Data Warehousing
(#uid: 6138971,
 #title: "Homo sapiens adrenergic ...",
 #accession: "NM_001619",
 #organism: "Homo sapiens",
 #taxon: 9606,
 #lineage: ["Eukaryota", "Metazoa", ...],
 #seq: "CTCGGCCTCGGGCGCGGC...",
 #feature: {
   (#name: "source",
    #continuous: true,
    #position: [
      (#accn: "NM_001619",
       #start: 0, #end: 3602,
       #negative: false)],
    #anno: [
      (#anno_name: "organism",
       #descr: "Homo sapiens"), ...]), ...})
  • Motivation
  • efficiency
  • availability
  • denial of service
  • data cleansing
  • Requirements
  • efficient to query
  • easy to update
  • model data naturally

9
Data Warehousing Results
! Log in
oracle-cplobj-add (#name: "db", ...);
! Define table
create table GP (#uid: "NUMBER", #detail: "LONG") using db;
! Populate table with GenPept reports
select #uid: x.#uid, #detail: x
into GP
from aa-get-seqfeat-general "PTP" as x
using db;
! Map GP to that table
create view GP from GP using db;
! Run a query to get title of 131470
select x.#detail.#title
from GP as x
where x.#uid = 131470;
A relational DBMS on its own is insufficient because it forces
us to fragment data into 3NF. Kleisli turns a flat relational
DBMS into a nested relational DBMS: it can use flat relational
DBMSs such as Sybase, Oracle, and MySQL as its updatable complex
object store.
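The idea of using a flat table as a complex object store can be sketched in Python. This is not Kleisli: it is a minimal illustration that assumes SQLite as the flat DBMS and JSON as the serialization of the nested value, mirroring the two-column GP table (uid, detail) on this slide. The record is abbreviated from the Data Warehousing slide, and the helper names `store` and `title_of` are hypothetical.

```python
import json
import sqlite3

# Hypothetical stand-in for one nested GenPept-style report (abbreviated).
report = {
    "uid": 6138971,
    "title": "Homo sapiens adrenergic ...",
    "accession": "NM_001619",
    "feature": [
        {"name": "source",
         "position": [{"accn": "NM_001619", "start": 0, "end": 3602}]},
    ],
}

def store(conn, obj):
    """Keep the whole nested object in one row instead of fragmenting it."""
    conn.execute("INSERT INTO GP (uid, detail) VALUES (?, ?)",
                 (obj["uid"], json.dumps(obj)))

def title_of(conn, uid):
    """Analogue of the slide's last query: look up a report's title by uid."""
    row = conn.execute("SELECT detail FROM GP WHERE uid = ?", (uid,)).fetchone()
    return json.loads(row[0])["title"] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")
store(conn, report)
print(title_of(conn, 6138971))
```

The flat table only ever sees a number and a long text value; the nesting is restored when the row is read back.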
10
Epitope Prediction
TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYS
E EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIH
LYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDA
LLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKI
AVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAV
CVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CE
EERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPN
PEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNP
EDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQ
SDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREE
HE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPY
AGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
11
Epitope Prediction Results
  • Prediction by our ANN model for HLA-A11
  • 29 predictions
  • 22 epitopes
  • 76% specificity
  • Prediction by BIMAS matrix for HLA-A1101

Number of experimental binders, by rank by BIMAS:
19 (52.8%), 5 (13.9%), 12 (33.3%)
12
Transcription Start Prediction
13
Transcription Start Prediction Results
14
Medical Record Analysis
  • Looking for patterns that are
  • valid
  • novel
  • useful
  • understandable

15
Gene Expression Analysis
  • Classifying gene expression profiles
  • find stable differentially expressed genes
  • find significant gene groups
  • derive coordinated gene expression

16
Medical Record & Gene Expression Analysis Results
  • PCL, a novel emerging pattern method
  • Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI
    benchmarks
  • Works well for gene expressions

Cancer Cell, March 2002, 1(2)
17
Protein Interaction Extraction
What are the protein-protein interaction
pathways from the latest reported discoveries?
18
Protein Interaction Extraction Results
  • Rule-based system for processing free texts in
    scientific abstracts
  • Specialized in
  • extracting protein names
  • extracting protein-protein interactions

Jak1
19
Behind the Scene
  • Allen Chong
  • Judice Koh
  • SPT Krishnan
  • Huiqing Liu
  • Seng Hong Seah
  • Soon Heng Tan
  • Guanglan Zhang
  • Zhuo Zhang
  • Vladimir Bajic
  • Vladimir Brusic
  • Jinyan Li
  • See-Kiong Ng
  • Limsoon Wong
  • Louxin Zhang

and many more students, folks from
geneticXchange, MolecularConnections, and other
collaborators.
20
Using Feature Generation & Feature Selection for
Accurate Prediction of Translation Initiation
Sites
A more detailed example of post-genome knowledge
discovery
21
Translation Initiation Recognition
22
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation
initiation site?
23
Approach
  • Training data gathering
  • Signal generation
  • k-grams, distance, domain know-how, ...
  • Signal selection
  • Entropy, χ², CFS, t-test, domain know-how, ...
  • Signal integration
  • SVM, ANN, PCL, CART, C4.5, kNN, ...

24
Training & Testing Data
  • Vertebrate dataset of Pedersen & Nielsen,
    ISMB97
  • 3312 sequences
  • 13503 ATG sites
  • 3312 (24.5%) are TIS
  • 10191 (75.5%) are non-TIS
  • Use for 3-fold x-validation expts
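A 3-fold cross-validation split can be sketched as follows. The transcript does not say how the folds were drawn (by sequence or by ATG site, stratified or not), so the round-robin assignment and the name `k_fold_indices` are assumptions made only for illustration.

```python
def k_fold_indices(n, k=3):
    """Yield (train, test) index lists; each index is tested exactly once."""
    # Round-robin assignment of sample indices to k folds (an assumption).
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_indices(6, k=3):
    print(train, test)
```

Each model is trained on two folds and evaluated on the remaining one, and the three test results are then combined.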

25
Signal Generation
  • K-grams (i.e., k consecutive letters)
  • k = 1, 2, 3, 4, 5, ...
  • Window size vs. fixed position
  • Up-stream, downstream vs. any where in window
  • In-frame vs. any frame
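One of these signal families, in-frame k-gram counts upstream and downstream of a candidate ATG, can be sketched as below. The window length of 99 bases, the function name `kgram_counts`, and the choice to skip upstream k-grams that would overlap the ATG are illustrative assumptions, not details taken from the talk.

```python
from collections import Counter

def kgram_counts(seq, atg_pos, k, window=99):
    """Count in-frame k-grams on each side of a candidate ATG.

    seq: nucleotide string; atg_pos: 0-based index of the 'A' of the ATG.
    Returns a Counter keyed by (side, kgram), side being "up" or "down".
    """
    counts = Counter()
    # Downstream: start right after the ATG and step by 3 to stay in frame.
    end = min(len(seq), atg_pos + 3 + window)
    for i in range(atg_pos + 3, end - k + 1, 3):
        counts[("down", seq[i:i + k])] += 1
    # Upstream: walk backwards in steps of 3 from the ATG.
    start = max(0, atg_pos - window)
    for i in range(atg_pos - 3, start - 1, -3):
        if i + k <= atg_pos:  # skip k-grams that would run into the ATG
            counts[("up", seq[i:i + k])] += 1
    return counts

print(kgram_counts("GCCATGGCTTAA", atg_pos=3, k=3))
```

Each (side, k-gram) pair then becomes one numeric feature of the candidate ATG.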

26
Too Many Signals
  • For each value of k, there are
  • 4^k × 3 × 2 k-grams
  • If we use k = 1, 2, 3, 4, 5, we have
  • 4 + 24 + 96 + 384 + 1536 + 6144 = 8188
  • features!
  • This is too many for most machine learning
    algorithms
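The arithmetic on this slide can be checked directly, reading the per-k count as 4^k × 3 × 2. Note that the five per-k terms sum to 8184; the slide's total of 8188 includes an additional leading term of 4, which the sketch adds without interpreting it.

```python
def num_kgram_features(k):
    # 4**k possible k-grams, times the factors 3 and 2 given on the slide.
    return 4 ** k * 3 * 2

per_k = [num_kgram_features(k) for k in range(1, 6)]
print(per_k)           # [24, 96, 384, 1536, 6144]
print(sum(per_k))      # 8184
print(4 + sum(per_k))  # 8188, the slide's total
```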

27
Signal Selection (Basic Idea)
  • Choose a signal w/ low intra-class distance
  • Choose a signal w/ high inter-class distance
  • Which of the following 3 signals is good?

28
Signal Selection (e.g., t-statistics)
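One common way to score a single signal is the two-sample t-statistic with unpooled variances. The sketch below assumes that form, which may differ from the exact variant used in the talk; the function name and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(values_a, values_b):
    """Absolute two-sample t-statistic with unpooled variances.

    values_a, values_b: the signal's values in each class (2+ samples each).
    Raises ZeroDivisionError if both classes have zero variance.
    """
    na, nb = len(values_a), len(values_b)
    se = sqrt(variance(values_a) / na + variance(values_b) / nb)
    return abs(mean(values_a) - mean(values_b)) / se

# A signal whose class means are far apart relative to its spread scores high.
print(t_statistic([1, 2, 3], [4, 5, 6]))
print(t_statistic([1, 5, 3], [2, 6, 4]))
```

A large value means the class means are far apart (high inter-class distance) relative to the within-class spread (low intra-class distance).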
29
Signal Selection (e.g., CFS)
  • Instead of scoring individual signals, how about
    scoring a group of signals as a whole?
  • CFS
  • A good group contains signals that are highly
    correlated with the class, and yet uncorrelated
    with each other
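The group score used by correlation-based feature selection can be sketched with the standard merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. Using absolute Pearson correlation as the correlation measure is an assumption of this sketch; CFS implementations often use other measures for discrete data.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cfs_merit(features, labels):
    """features: list of equal-length numeric columns; labels: 0/1 list."""
    k = len(features)
    r_cf = mean(abs(pearson(f, labels)) for f in features)
    pairs = list(combinations(features, 2))
    r_ff = mean(abs(pearson(a, b)) for a, b in pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

labels = [0, 0, 1, 1]
f1 = [0, 0, 1, 1]  # perfectly correlated with the class
f2 = [0, 1, 0, 1]  # uncorrelated with the class
print(cfs_merit([f1], labels), cfs_merit([f1, f1], labels),
      cfs_merit([f1, f2], labels))
```

Adding a redundant copy of a signal does not raise the merit, and adding a signal unrelated to the class lowers it.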

30
Sample k-grams Selected by CFS
  • Position -3 (Kozak consensus)
  • in-frame upstream ATG (leaky scanning)
  • in-frame downstream
  • TAA, TAG, TGA (stop codon)
  • CTG, GAC, GAG, and GCC (codon bias?)
31
Signal Integration
  • kNN
  • Given a test sample, find the k training samples
    that are most similar to it. Let the majority
    class win.
  • SVM
  • Given a group of training samples from two
    classes, determine a separating hyperplane that
    maximises the margin between the two classes.
  • Naïve Bayes, ANN, C4.5, ...
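The kNN rule described above can be sketched in a few lines. Hamming distance over binary signal vectors and the toy training set are assumptions for illustration; any distance function can be passed in.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, test_vec, k=3, dist=hamming):
    """train: list of (feature_vector, label) pairs; majority class wins."""
    nearest = sorted(train, key=lambda item: dist(item[0], test_vec))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three binary signals per candidate ATG.
train = [((1, 1, 0), "TIS"), ((1, 0, 0), "TIS"), ((0, 0, 1), "non-TIS"),
         ((0, 1, 1), "non-TIS"), ((1, 1, 1), "TIS")]
print(knn_predict(train, (1, 1, 0)))
print(knn_predict(train, (0, 0, 1)))
```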

32
Results (3-fold x-validation)
33
Improvement by Voting
  • Apply any 3 of Naïve Bayes, SVM, Neural Network,
    Decision Tree. Decide by majority.
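Majority voting over three classifiers can be sketched as below. The three lambda "classifiers" are hypothetical stand-ins for trained models, each mapping a sample to True (TIS) or False (non-TIS).

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label; use an odd number of binary voters."""
    return Counter(predictions).most_common(1)[0][0]

def vote(classifiers, sample):
    """Apply every classifier to the sample and decide by majority."""
    return majority_vote([clf(sample) for clf in classifiers])

# Hypothetical stand-ins for three trained models.
classifiers = [
    lambda s: s["score_nb"] > 0.5,
    lambda s: s["score_svm"] > 0.0,
    lambda s: s["score_tree"] > 0.5,
]
sample = {"score_nb": 0.9, "score_svm": -0.2, "score_tree": 0.7}
print(vote(classifiers, sample))
```

With three binary voters there is always a strict majority, so no tie-breaking rule is needed.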

34
Improvement by Scanning
  • Apply Naïve Bayes or SVM left-to-right until the
    first ATG predicted as positive. That's the TIS.
  • The Naïve Bayes & SVM models were trained using TIS
    vs. up-stream ATG
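The scanning rule can be sketched as a left-to-right search over ATGs. The predicate `is_tis` stands in for a trained Naïve Bayes or SVM model; the toy predicate used here (a purine three bases upstream) is only a placeholder, not the model from the talk.

```python
def scan_for_tis(seq, is_tis):
    """Return the index of the first ATG accepted by is_tis, else None."""
    pos = seq.find("ATG")
    while pos != -1:
        if is_tis(seq, pos):
            return pos
        pos = seq.find("ATG", pos + 1)
    return None

def toy_is_tis(seq, pos):
    # Placeholder predicate: accept an ATG with A or G three bases upstream.
    return pos >= 3 and seq[pos - 3] in "AG"

print(scan_for_tis("CCATGTTACCATGG", toy_is_tis))
```

Scanning stops at the first positive prediction, so later ATGs are never examined.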

35
Performance Comparisons
result not directly comparable
36
Technique Comparisons
  • Our approach
  • Explicit feature generation
  • Explicit feature selection
  • Use any machine learning method w/o any form of
    complicated tuning
  • Scanning rule is optional
  • Pedersen & Nielsen, ISMB97
  • Neural network
  • No explicit features
  • Zien, Bioinformatics00
  • SVM + kernel engineering
  • No explicit features
  • Hatzigeorgiou, Bioinformatics02
  • Multiple neural networks
  • Scanning rule
  • No explicit features

37
Acknowledgements
  • A.G. Pedersen
  • H. Nielsen
  • Roland Yap
  • Fanfan Zeng