Data Mining and Bioinformatics

1
Data Mining and Bioinformatics
  • Wei Wang
  • Assistant Professor

329 Sitterson Hall, weiwang_at_cs.unc.edu, www.cs.unc.edu/weiwang
2
What is Data Mining?
  • Techniques that can extract valuable knowledge
    from massive data
  • Clustering
  • Anomaly detection
  • Data modeling
  • Compression
  • Classification
  • Association/correlation analysis
  • Similarity search
  • Trend detection
  • Wide applications
  • Bioinformatics
  • Image processing
  • Traffic engineering
  • Security
  • E-commerce

3
Bioinformatics
  • Biological data are abundant and information rich
  • Data produced at different levels
  • molecules, cells, organs, organisms, populations
  • Data obtained from different channels
  • Structure: sequence, shape, energy, ...
  • Function: gene expression, pathway, phenotypic
    and clinical data, ...

4
Bioinformatics
  • Molecular level

5
Bioinformatics
  • Molecular level

6
Bioinformatics
  • Challenges
  • Highly complex
  • Noisy
  • Inconsistent
  • Redundant
  • Data mining can help!

7
What We Are Doing
  • Proteomics
  • Protein structure modeling and analysis
  • with Jan Prins (CS), Alex Tropsha (Pharmacology)
  • Gene expression and pathways
  • OP-Cluster: tendency-based gene expression
    analysis
  • Classification based on association
  • Discriminative feature selection
  • Classification on one class and unlabelled data
  • with Andrew Nobel (Statistics), Peter Petrusz
    (Medicine), UIUC

8
What We Are Doing
  • Semantic integration of heterogeneous genome
    databases
  • Similarity queries across heterogeneous
    microarray data
  • UIUC

9
Protein Data Bank (PDB) Growth
Can we find patterns from the exponentially
growing PDB?
10
Protein Structure Visualization
11
Protein Classification
12
Computer Understands Numbers
13
Graph Representation of Proteins
  • We represent a protein as an undirected labeled
    graph
  • Every node corresponds to a residue in the
    protein, labeled by its type.
  • (u, v) is an edge in the graph iff
  • there is a peptide bond between residues u and v
    (peptide edge), or
  • the distance between u and v (represented by
    their two Cα atoms) is less than 10 Å
    (proximity edge)

ATOM 820 CA THR 115 -7.108 8.835 6.640 1.00 8.21
ATOM 1280 CA THR 175 -19.567 2.837 0.682 1.00 14.73
ATOM 1671 CA ARG 229 -15.242 -4.327 0.885 1.00 6.50
ATOM 1707 CA SER 233 -15.989 -6.491 -4.881 1.00 6.86
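
The graph construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the whitespace-separated ATOM layout shown above (real PDB files are fixed-column and usually carry a chain ID), and it treats residues with consecutive sequence numbers as peptide-bonded, which is a simplification. The names parse_ca and build_graph are hypothetical.

```python
import math
from itertools import combinations

# The four C-alpha records from the slide, one record per line.
RECORDS = """\
ATOM 820 CA THR 115 -7.108 8.835 6.640 1.00 8.21
ATOM 1280 CA THR 175 -19.567 2.837 0.682 1.00 14.73
ATOM 1671 CA ARG 229 -15.242 -4.327 0.885 1.00 6.50
ATOM 1707 CA SER 233 -15.989 -6.491 -4.881 1.00 6.86
"""

def parse_ca(text):
    """Map residue number -> (residue type, (x, y, z)) for each CA atom."""
    residues = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 8 and f[0] == "ATOM" and f[2] == "CA":
            residues[int(f[4])] = (f[3], tuple(float(v) for v in f[5:8]))
    return residues

def build_graph(residues, cutoff=10.0):
    """Return (node labels, edges); edges map (u, v) with u < v to an edge type."""
    labels = {num: name for num, (name, _) in residues.items()}
    edges = {}
    for u, v in combinations(sorted(residues), 2):
        if v - u == 1:
            # Assumption: consecutive residue numbers share a peptide bond.
            edges[(u, v)] = "peptide"
        elif math.dist(residues[u][1], residues[v][1]) < cutoff:
            # C-alpha atoms closer than the cutoff (in angstroms).
            edges[(u, v)] = "proximity"
    return labels, edges

labels, edges = build_graph(parse_ca(RECORDS))
print(labels)
print(edges)
```

For these four residues the sketch finds no peptide edges (none are sequence neighbors) and two proximity edges, THR 175 to ARG 229 and ARG 229 to SER 233, whose Cα atoms are roughly 8.4 Å and 6.2 Å apart.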
14
Graph Representation of Proteins
15
Finding All Frequent Subgraphs
NP-hard!
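
The hard part of frequent subgraph mining is testing whether a larger pattern occurs in each graph, which requires subgraph isomorphism checks. The cheapest level, counting single-edge patterns, is easy and is sketched below purely as an illustration. It is not the mining algorithm used in this work. The graph encoding (a label dictionary plus a dictionary of typed edges), the example graphs, and the function names are all assumptions made for this example.

```python
from collections import Counter

def edge_patterns(labels, edges):
    """Distinct single-edge patterns in one graph: (sorted label pair, edge type)."""
    return {
        (tuple(sorted((labels[u], labels[v]))), kind)
        for (u, v), kind in edges.items()
    }

def frequent_edge_patterns(graphs, min_support):
    """Keep the patterns that occur in at least min_support graphs."""
    support = Counter()
    for labels, edges in graphs:
        # A set is passed in, so each graph counts at most once per pattern.
        support.update(edge_patterns(labels, edges))
    return {p: s for p, s in support.items() if s >= min_support}

# Hypothetical toy graphs, not taken from any real protein.
graphs = [
    ({1: "THR", 2: "ARG", 3: "SER"}, {(1, 2): "proximity", (2, 3): "proximity"}),
    ({1: "ARG", 2: "THR", 3: "GLY"}, {(1, 2): "proximity", (2, 3): "peptide"}),
    ({1: "SER", 2: "ARG"}, {(1, 2): "proximity"}),
]
frequent = frequent_edge_patterns(graphs, min_support=2)
print(frequent)
```

Growing these one-edge patterns into larger frequent subgraphs is where the cost rises sharply, because every candidate must be matched against every graph.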
16
Classifying Proteins from SCOP
SCOP classifies proteins at five levels: Class,
Fold, Superfamily, Family, and individual
proteins. We formed three datasets from SCOP.
Accuracy is defined as (true positives + true
negatives) / total samples. The results are
reported as averages over ten-fold cross-
validation. We used the LibSVM classifier from
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Parameters: C-SVM classification model with the
linear kernel, leaving all others at their defaults.
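
The evaluation protocol on this slide (accuracy averaged over ten folds) can be sketched with the Python standard library alone. This is an illustration under stated assumptions: the fold assignment (every k-th sample) and the fit_predict callable are hypothetical, and a majority-class baseline stands in for the LibSVM classifier, which is not called here.

```python
def accuracy(y_true, y_pred):
    """(true positives + true negatives) / total samples, for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return (tp + tn) / len(y_true)

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; fold i holds out samples i, i+k, i+2k, ..."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test

def cross_validated_accuracy(X, y, fit_predict, k=10):
    """Average accuracy over k folds; assumes len(y) >= k so no fold is empty."""
    scores = []
    for train, test in k_fold_indices(len(y), k):
        y_pred = fit_predict(
            [X[j] for j in train], [y[j] for j in train], [X[j] for j in test]
        )
        scores.append(accuracy([y[j] for j in test], y_pred))
    return sum(scores) / len(scores)

def majority_baseline(X_train, y_train, X_test):
    """Hypothetical stand-in classifier: always predict the most common training label."""
    majority = max(set(y_train), key=y_train.count)
    return [majority] * len(X_test)

# Hypothetical data: 20 samples, every fourth one labeled 0, the rest 1.
y = [0 if j % 4 == 0 else 1 for j in range(20)]
X = [[float(j)] for j in range(20)]
mean_acc = cross_validated_accuracy(X, y, majority_baseline, k=10)
print(mean_acc)
```

Any classifier can be plugged in through fit_predict as long as it takes training features, training labels, and test features, and returns one predicted label per test sample.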
17
Fingerprints in Prokaryotic Serine Protease
[Figure: fingerprints G1 and G2]
Backbone: Achromobacter lyticus protease I (PDB
ID 1ARB).
18
An Even Larger Fingerprint
ASP  2    2    0    0    0    0    0
2    GLY  0    2    0    0    0    0
2    0    GLY  0    0    0    0    0
0    2    0    PHE  2    0    0    0
0    0    0    2    LEU  2    2    0
0    0    0    0    2    ALA  0    2
0    0    0    0    2    0    VAL  0
0    0    0    0    0    2    0    ALA
2: proximity edge.
19
Clustering and Classification
  • Gene expression data

[Figure: microarray sample preparation workflow. Cells; poly(A)/total RNA; cDNA; IVT (Biotin-UTP, Biotin-CTP); labeled transcript; fragment (heat, Mg2+); labeled fragments; hybridize (16 hours); wash and stain; scan.]
20
Clustering
21
We are looking for solutions to ...
  • Gene discovery
  • Screening technique to identify regulated genes,
    e.g. transcriptional response of yeast to
    environmental stresses (cold, saline,
    nutrient starvation, ...)
  • Transcript profiles of diseases, e.g. cancer
  • -> identification of single gene products;
    establishment of tumor markers for diagnostic
    purposes
  • -> drug development affecting only expressed
    genes
  • -> cancer classification
  • Toxicological research, drug discovery
  • Genetic network inference
  • ...

22
We work on numbers again
[Figure: gene expression matrix; axis labels: samples, genes]
23
OP-Clustering on mouse gene expression
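
As a rough illustration of what "tendency-based" means, the sketch below groups genes whose expression values rise and fall in the same order across a chosen set of conditions. It is a deliberate simplification and not the OP-Cluster algorithm itself: it requires an exact match of the ordering, it does not search over subsets of conditions, and ties are broken by condition index. The gene names, values, and function names are hypothetical.

```python
def condition_order(profile, conditions):
    """Return the chosen conditions sorted by increasing expression value."""
    return tuple(sorted(conditions, key=lambda c: profile[c]))

def group_by_tendency(expression, conditions):
    """Group genes that induce the same ordering on the chosen conditions."""
    groups = {}
    for gene, profile in expression.items():
        groups.setdefault(condition_order(profile, conditions), []).append(gene)
    return groups

# Hypothetical expression matrix: one row per gene, one column per sample.
expression = {
    "g1": [1.0, 5.0, 3.0, 9.0],
    "g2": [10.0, 50.0, 30.0, 20.0],
    "g3": [4.0, 2.0, 8.0, 1.0],
}
groups = group_by_tendency(expression, conditions=(0, 1, 2))
print(groups)
```

Here g1 and g2 fall into one group even though their magnitudes differ by a factor of ten, because both are lowest in sample 0, higher in sample 2, and highest in sample 1.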
24
Database Integration
Schema Integration
25
Want to learn more?
  • COMP 290-90 Research Seminar: Data Mining and its
    Applications
  • 3:30-4:45 PM Tuesday and Thursday, SN 325
  • http://www.cs.unc.edu/weiwang/
  • BCB module class covers
  • Information theory, Machine learning, software
    engineering
  • Sequence analysis
  • Database, ontology, digital library
  • Data mining
  • Biostatistics
  • Mathematical metabolic and cell modeling

26
Broader communities
  • Carolina Center for Genome Sciences
  • http://genomics.unc.edu/
  • Carolina Database Group
  • http://www.cs.duke.edu/dbgroup/cdb/

27
I can be reached at
  • weiwang_at_cs.unc.edu
  • 329 Sitterson Hall
  • 919-962-1744
  • Out of office Aug 23 - Aug 28

Thank You!