Title: Data Mining and Bioinformatics
1Data Mining and Bioinformatics
- Wei Wang
- Assistant Professor
- 329 Sitterson Hall, weiwang_at_cs.unc.edu, www.cs.unc.edu/weiwang
2What is Data Mining?
- Techniques that can extract valuable knowledge from massive data
- Clustering
- Anomaly detection
- Data modeling
- Compression
- Classification
- Association/correlation analysis
- Similarity search
- Trend detection
- Wide applications
- Bioinformatics
- Image processing
- Traffic engineering
- Security
- E-commerce
3Bioinformatics
- Biological data are abundant and information rich
- Data produced at different levels
- molecules, cells, organs, organisms, populations
- Data obtained from different channels
- Structure: sequence, shape, energy, ...
- Function: gene expression, pathway, phenotypic and clinical data, ...
6Bioinformatics
- Challenges
- Highly complex
- Noisy
- Inconsistent
- Redundant
- Data mining can help!
7What We Are Doing
- Proteomics
- Protein structure modeling and analysis
- with Jan Prins (CS), Alex Tropsha (Pharmacology)
- Gene expression and pathways
- OP-Cluster: tendency-based gene expression analysis
- Classification based on association
- Discriminative feature selection
- Classification on one class and unlabelled data
- with Andrew Nobel (Statistics), Peter Petrusz
(Medicine), UIUC
8What We Are Doing
- Semantic integration of heterogeneous genome databases
- Similarity queries across heterogeneous microarray data
- UIUC
9Protein Data Bank (PDB) Growth
Can we find patterns from the exponentially
growing PDB?
10Protein Structure Visualization
11Protein Classification
12Computer Understands Numbers
13Graph Representation of Proteins
- We represent a protein by an undirected labeled graph
- Every node corresponds to a residue in the protein, labeled by its type.
- (u, v) is an edge in the graph iff
- there is a peptide bond between residues u and v (peptide edge), or
- the distance between u and v (represented by their two Cα atoms) is less than 10 Å (proximity edge)
ATOM 820 CA THR 115 -7.108 8.835 6.640 1.00 8.21
ATOM 1280 CA THR 175 -19.567 2.837 0.682 1.00 14.73
ATOM 1671 CA ARG 229 -15.242 -4.327 0.885 1.00 6.50
ATOM 1707 CA SER 233 -15.989 -6.491 -4.881 1.00 6.86
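The graph definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the whitespace-separated record layout shown on the slide (real PDB files use fixed columns and carry a chain ID), and it assumes that consecutive residue numbers indicate a peptide bond. The function names `parse_ca_atoms` and `build_protein_graph` are hypothetical.

```python
import math

# The four C-alpha records shown on the slide, one per line.
# Assumption: whitespace-separated fields (record, serial, atom name,
# residue type, residue number, x, y, z, occupancy, B-factor).
RECORDS = """\
ATOM 820 CA THR 115 -7.108 8.835 6.640 1.00 8.21
ATOM 1280 CA THR 175 -19.567 2.837 0.682 1.00 14.73
ATOM 1671 CA ARG 229 -15.242 -4.327 0.885 1.00 6.50
ATOM 1707 CA SER 233 -15.989 -6.491 -4.881 1.00 6.86
"""

PROXIMITY_CUTOFF = 10.0  # angstroms, as stated on the slide


def parse_ca_atoms(text):
    """Return {residue_number: (residue_type, (x, y, z))} for CA atoms."""
    residues = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "ATOM" or fields[2] != "CA":
            continue
        res_type, res_num = fields[3], int(fields[4])
        coords = tuple(float(v) for v in fields[5:8])
        residues[res_num] = (res_type, coords)
    return residues


def build_protein_graph(residues, cutoff=PROXIMITY_CUTOFF):
    """Return (node_labels, edges); edges maps (u, v), u < v, to a type."""
    labels = {num: res_type for num, (res_type, _) in residues.items()}
    edges = {}
    numbers = sorted(residues)
    for i, u in enumerate(numbers):
        for v in numbers[i + 1:]:
            if v == u + 1:
                # Assumption: consecutive residue numbers share a peptide bond.
                edges[(u, v)] = "peptide"
            elif math.dist(residues[u][1], residues[v][1]) < cutoff:
                edges[(u, v)] = "proximity"
    return labels, edges


labels, edges = build_protein_graph(parse_ca_atoms(RECORDS))
```

For the four records above, this yields two proximity edges (175-229 and 229-233) and no peptide edges, since none of the residue numbers are consecutive.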
14Graph Representation of Proteins
15Finding All Frequent Subgraphs
NP-hard!
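The hardness comes from matching larger patterns, because testing whether one graph contains another is the NP-complete subgraph isomorphism problem. The smallest case, single labeled edges, is easy and shows what "frequent" means: a pattern is frequent if it occurs in at least a minimum number of graphs. The sketch below covers only that base case and is not the mining algorithm used in this work. The function names and the toy graphs are hypothetical.

```python
from collections import Counter


def edge_patterns(labels, edges):
    """Return the set of labeled-edge patterns present in one graph.

    labels: {node: label}; edges: {(u, v): edge_type}.
    A pattern is (smaller_label, larger_label, edge_type), so it does not
    depend on node identities or on edge direction.
    """
    patterns = set()
    for (u, v), edge_type in edges.items():
        a, b = sorted((labels[u], labels[v]))
        patterns.add((a, b, edge_type))
    return patterns


def frequent_edge_patterns(graphs, min_support):
    """Return {pattern: support} for patterns in >= min_support graphs."""
    support = Counter()
    for labels, edges in graphs:
        # Each pattern is counted at most once per graph.
        support.update(edge_patterns(labels, edges))
    return {p: s for p, s in support.items() if s >= min_support}


# Toy graphs (hypothetical, not taken from the slides).
g1 = ({1: "THR", 2: "ARG", 3: "SER"},
      {(1, 2): "proximity", (2, 3): "proximity"})
g2 = ({1: "ARG", 2: "THR", 3: "GLY"},
      {(1, 2): "proximity", (2, 3): "peptide"})
g3 = ({1: "SER", 2: "ARG"},
      {(1, 2): "proximity"})

frequent = frequent_edge_patterns([g1, g2, g3], min_support=2)
```

Here the ARG-THR and ARG-SER proximity edges each appear in two graphs and are kept, while the GLY-THR peptide edge appears in only one and is dropped.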
16Classifying Proteins from SCOP
SCOP classifies proteins at five levels: Class, Fold, Superfamily, Family, and individual proteins. We formed three datasets from SCOP.
Accuracy is defined as (true positives + true negatives) / total samples. The results are reported as average values of ten-fold cross-validation.
We used the LibSVM classifier from http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Parameters: C-SVM classification model with the linear kernel, leaving the others as default.
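The evaluation protocol can be sketched in plain Python: the accuracy formula as defined above, plus a ten-fold split whose per-fold accuracies are averaged. This is an illustration under assumptions, not the original experiment code. `train_fn` is a hypothetical stand-in for the LibSVM training step, and the folds here are contiguous, whereas a real experiment would normally shuffle or stratify the samples first.

```python
def accuracy(true_labels, predicted_labels, positive=1):
    """(true positives + true negatives) / total samples."""
    tp = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            if truth == positive:
                tp += 1
            else:
                tn += 1
    return (tp + tn) / len(true_labels)


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validated_accuracy(features, labels, train_fn, k=10):
    """Average accuracy over k folds (requires len(labels) >= k).

    train_fn(train_features, train_labels) must return a predict function.
    """
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        predict = train_fn([features[i] for i in train],
                           [labels[i] for i in train])
        predicted = [predict(features[i]) for i in test]
        scores.append(accuracy([labels[i] for i in test], predicted))
    return sum(scores) / len(scores)
```

Any classifier can be plugged in through `train_fn`, so the same loop works whether the model is a linear SVM or a simple baseline.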
17Fingerprints in Prokaryotic Serine Protease
[Figure: fingerprints G1 and G2. Backbone: Achromobacter lyticus protease I (PDB ID 1ARB).]
18An Even Larger Fingerprint
ASP  2    2    0    0    0    0    0
2    GLY  0    2    0    0    0    0
2    0    GLY  0    0    0    0    0
0    2    0    PHE  2    0    0    0
0    0    0    2    LEU  2    2    0
0    0    0    0    2    ALA  0    2
0    0    0    0    2    0    VAL  0
0    0    0    0    0    2    0    ALA
2: proximity edge.
19Clustering and Classification
[Figure: microarray sample preparation workflow. Cells, poly(A)/total RNA, cDNA, IVT (Biotin-UTP, Biotin-CTP), labeled transcript, fragmentation (heat, Mg2+), labeled fragments, hybridization (16 hours), wash and stain, scan.]
20Clustering
21We are looking for solutions to...
- Gene Discovery
- screening technique to identify regulated genes, e.g. transcriptional response of yeast to environmental stresses (cold, saline, nutrient starvation, ...)
- transcript profiles of diseases, e.g. cancer
- -> identification of single gene products, establishment of tumor markers for diagnostic purposes
- -> drug development only affecting expressed genes
- -> cancer classification
- Toxicological research, drug discovery
- genetic network inference
- ...
22We work on numbers again
[Figure: expression matrix with axes labeled "samples" and "genes".]
23OP-Clustering on mouse gene expression
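One way to read "tendency-based" clustering is that genes belong together when their expression values rise and fall in the same order across a subset of samples, regardless of magnitude. The sketch below checks only that property, using strict ordering. It is an assumption-laden illustration, not the OP-Cluster algorithm itself, and the function names and expression values are hypothetical.

```python
def rank_order(values, columns):
    """Return the given columns sorted by ascending expression value."""
    return tuple(sorted(columns, key=lambda c: values[c]))


def is_order_preserving(expression, genes, columns):
    """True if every listed gene ranks the columns in the same order."""
    orders = {rank_order(expression[g], columns) for g in genes}
    return len(orders) == 1


# Toy expression matrix (hypothetical values): gene -> value per sample.
expression = {
    "gene_a": [10.0, 40.0, 25.0, 5.0],
    "gene_b": [1.0, 9.0, 3.0, 0.5],
    "gene_c": [8.0, 2.0, 6.0, 7.0],
}

# gene_a and gene_b differ in scale but share one ordering of all samples.
same_tendency = is_order_preserving(expression, ["gene_a", "gene_b"], [0, 1, 2, 3])
# All three genes agree only on a subset of the samples (columns 0 and 3).
all_on_subset = is_order_preserving(expression, list(expression), [0, 3])
all_on_full = is_order_preserving(expression, list(expression), [0, 1, 2, 3])
```

The example shows why such clusters live in subspaces: `gene_c` disagrees with the other two genes over all four samples but shares their ordering on samples 0 and 3.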
24Database Integration
[Figure: Schema Integration]
25Want to learn more?
- COMP 290-90 Research Seminar: Data Mining and its Applications
- 3:30-4:45 PM Tuesday and Thursday, SN 325
- http://www.cs.unc.edu/weiwang/
- BCB module class covers
- Information theory, machine learning, software engineering
- Sequence analysis
- Database, ontology, digital library
- Data mining
- Biostatistics
- Mathematical metabolic and cell modeling
26Broader communities
- Carolina Center for Genome Sciences
- http://genomics.unc.edu/
- Carolina Database Group
- http://www.cs.duke.edu/dbgroup/cdb/
27I can be reached at
- weiwang_at_cs.unc.edu
- 329 Sitterson Hall
- 919-962-1744
- Out of office Aug 23 to Aug 28
Thank You!