Title: PROTERAN:
1PROTERAN
- ANIMATED TERRAIN EVOLUTION FOR VISUAL ANALYSIS OF
PROTEIN FOLDING TRAJECTORY
2The need for Bioinformatics
- Bioinformatics: the application of computational
techniques to the management and analysis of
biological information.
- Clustering techniques applied to the data are not enough.
A good visual representation is also needed.
3Agenda
- Microarrays
- Review of existing clustering and visualization
techniques on gene expression data
- The need for a customized visualization tool for
use by Dr. Laxmi Parida and Dr. Ruhong Zhou of the
computational biology group at the IBM Watson
Research Center for visual analysis of protein
characteristics
- Introduce our new technique that makes use of an
animated terrain, implemented in the program
called PROTERAN
4Function of Genes and Proteins
- Through the proteins they encode, genes
orchestrate the mysteries of life
- Protein functions vary widely, from mechanical
support to transportation to regulation.
5Still a lot of work ahead
- Traditional methods of discovering gene
functions worked on a gene-by-gene basis, so
throughput was low.
- Many genes are believed to work together, which
a one-by-one approach does not reveal.
6Microarrays
- Solve the throughput problem
- Allow scientists to see genes on a genomic level
7Expression Matrix
8Clustering and Visualization Techniques Review
9Clustering
- Clustering: the act of grouping similar objects
together
- Applied to gene expression in order to find the
function of unknown genes
- Many different clustering techniques exist in the
literature. Representative techniques are discussed
next.
10Determining similarity between two genes
- Choose a similarity (distance) measure to compare genes
- e.g. Euclidean distance
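As a minimal sketch of the distance named on this slide, the Euclidean distance between two gene expression profiles can be computed as below. The function name and the sample values are made up for illustration and do not come from the talk.

```python
import math

def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two equal-length expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same experiments")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(profile_a, profile_b)))

# Hypothetical expression values for two genes over three experiments.
gene_a = [0.0, 3.0, 1.0]
gene_b = [4.0, 0.0, 1.0]
print(euclidean_distance(gene_a, gene_b))  # 5.0
```

A smaller distance means the two genes behave more alike across the experiments.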
11Hierarchical Clustering
- Create a distance matrix of all genes in relation
to each other
- Find the two closest genes
- Merge these two genes and recompute the distance matrix
- Repeat steps 2-3 until only one cluster is left
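The merge loop on this slide can be sketched as follows. The slide does not say how the distance to a merged cluster is recomputed, so single linkage (distance between the closest members) is assumed here. All function names are illustrative.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, profiles):
    # Assumption: single linkage, i.e. the distance between the closest members.
    return min(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)

def agglomerate(profiles):
    """Return the merges in order, each as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(profiles))]  # every gene starts alone
    merges = []
    while len(clusters) > 1:
        # Steps 1-2: compare every pair of clusters and pick the closest pair.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], profiles))
        merges.append((a, b, cluster_distance(a, b, profiles)))
        # Step 3: merge the pair; distances are recomputed on the next pass.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical profiles: genes 0 and 1 are similar, as are genes 2 and 3.
for merge in agglomerate([[0, 0], [0, 1], [5, 5], [5, 6.5]]):
    print(merge)
```

The recorded merge order is what a dendrogram draws. Recomputing all pairwise distances on every pass keeps the sketch short but is slow for large gene sets.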
12Dendrogram
- Binary tree with a distinguished root, which has
all the data items at the leaves
- Re-orders the expression matrix to place similar
genes beside each other
13Example
Agglomerative Hierarchical Clustering
14Advantages
- Familiar to biologists
- Few parameters to specify
15Disadvantages
- Requires fast CPUs and large amounts of memory
- Does not identify important clusters
- Only represents hierarchically organized data
- Does not scale up
16Disadvantages (cont.)
- A dendrogram always offers 2^(n-1) equivalent representations
(where n = number of elements)
17Self Organizing Maps (SOMs)
- User picks the number of clusters, called nodes
- Nodes are randomly mapped to M-dimensional space (M =
number of experiments)
- Node values are adjusted by random vectors picked
from the original data
- After node values settle, vectors are clustered to
the closest node
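A rough sketch of the procedure on this slide is given below. It makes several assumptions the slide does not state: the nodes form a one-dimensional chain, the winning node's chain neighbours move half as far, and the learning rate decays linearly. Names and parameter values are illustrative only.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(data, n_nodes, iterations=2000, rate=0.5, seed=0):
    """Train a 1-D chain of nodes on M-dimensional vectors."""
    rng = random.Random(seed)
    dims = len(data[0])
    lo = [min(v[d] for v in data) for d in range(dims)]
    hi = [max(v[d] for v in data) for d in range(dims)]
    # Nodes start at random positions inside the data's bounding box.
    nodes = [[rng.uniform(lo[d], hi[d]) for d in range(dims)]
             for _ in range(n_nodes)]
    for t in range(iterations):
        vec = rng.choice(data)                # random vector from the data
        alpha = rate * (1 - t / iterations)   # learning rate decays over time
        winner = min(range(n_nodes), key=lambda k: sq_dist(nodes[k], vec))
        # Move the winner toward the vector; chain neighbours move half as far.
        for k, weight in ((winner, 1.0), (winner - 1, 0.5), (winner + 1, 0.5)):
            if 0 <= k < n_nodes:
                nodes[k] = [n + alpha * weight * (v - n)
                            for n, v in zip(nodes[k], vec)]
    return nodes

def assign(data, nodes):
    """After the nodes settle, cluster each vector to its closest node."""
    return [min(range(len(nodes)), key=lambda k: sq_dist(nodes[k], v))
            for v in data]

# Hypothetical expression vectors over two experiments.
data = [[0.0, 0.0], [0.5, 0.2], [10.0, 10.0], [9.5, 9.8]]
nodes = train_som(data, n_nodes=2)
print(assign(data, nodes))
```

The number of nodes, the iteration count, and the learning rate all have to be chosen by the user, which is the "many parameters" drawback listed for this technique.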
18Visualization
- Dendrogram
- Error Bar Representation
19Visualization
20Advantages
- User has partial control over structure
- Fuzzy Clusters
- Variety of visual techniques applicable
21Disadvantages
- Knowledge of number of clusters beforehand
- Many parameters to specify
22Principal Component Analysis (PCA)
- Mathematical technique that can be used to reduce
the number of dimensions of data
Principal component analysis
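To make the dimension reduction concrete, the sketch below finds only the first principal component with power iteration and projects each data row onto it. Power iteration is used here to keep the example dependency-free; it is not claimed to be how any tool in this talk computes PCA. The data values are made up.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, per-row scores) for the first component."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration. Assumption: the start vector is not orthogonal
    # to the dominant direction, otherwise the norm below can reach zero.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Each row is reduced to one number: its coordinate along the component.
    scores = [sum(c * v for c, v in zip(r, vec)) for r in centered]
    return vec, scores

# Hypothetical two-dimensional data lying on the line y = 2x.
direction, scores = first_principal_component([[0, 0], [1, 2], [2, 4], [3, 6]])
print(direction, scores)
```

Here two dimensions collapse to one without losing information, because all of the variation lies along a single direction.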
23Visualization
24Advantages
- No parameters required
- 3D Visualization
25Disadvantages
- Little control over structure
- Running time of O(N^3)
- Not applicable when input is a distance matrix
26Biclustering
- Clustering of both rows and columns
simultaneously
27Available Software
Software Name | Description | Available at
F-Scan | Quantification and analysis of fluorescently probed microarrays, scatterplots, multiple image comparison | http://abs.cit.nih.gov/fscan/
TIGR SpotFinder | Spot identification | http://www.tigr.org/software/
Cluster | Hierarchical clustering, K-means clustering, Self-Organizing Map (SOM), PCA | http://rana.lbl.gov/EisenSoftware.htm
Genesis | A Java suite containing various tools such as filters, normalization, visualization tools, common clustering algorithms, SOM, K-means, PCA | http://genome.tugraz.at/Software/GenesisCenter.html
J-Express Pro 2.0 | Hierarchical clustering, K-means, Principal Component Analysis, Self-organizing maps, Profile similarity search, Normalization and filtering, Raw data import, Project organization | http://www.molmine.com/frameset/frm_jexpress.htm
TreeView | Cluster output visualization | http://rana.lbl.gov/EisenSoftware.htm
28Protein Folding
29Reaction Coordinates
- Folding determines the function of a protein
- All-atom recreation of a protein is unrealistic
- Reaction coordinates are used to describe protein
structure:
  - Fraction of native contacts
  - Radius of gyration
  - RMSD from the native structure
  - Number of beta-strand hydrogen bonds
  - Number of alpha-helix turns
  - Hydrophobic core radius of gyration
  - Principal components
30Protein States
- While folding, a protein goes through certain
states
- The raw data is similar to microarray data.
- Dr. Parida and Dr. Zhou have developed their own
techniques and clustered β-hairpin data.
31Reaction Coordinates used on the β-Hairpin
- Number of native β-strand hydrogen bonds
- Radius of gyration of the hydrophobic core
residues
- Radius of gyration of the entire protein
- Fraction of native contacts
- Principal component 1
- Principal component 2
- Root mean square deviation (RMSD) from the native
structure.
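Two of the listed reaction coordinates have simple closed forms, sketched below. The sketch makes two assumptions: all atoms are weighted equally in the radius of gyration, and the RMSD is taken without first superimposing the structure on the native one (practical pipelines usually align first). The coordinates are made up.

```python
import math

def radius_of_gyration(coords):
    """Root mean square distance of the atoms from their centroid."""
    n = len(coords)
    center = [sum(p[d] for p in coords) / n for d in range(3)]
    total = sum((p[d] - center[d]) ** 2 for p in coords for d in range(3))
    return math.sqrt(total / n)

def rmsd(coords, native):
    """RMSD between matching atoms; no alignment is done (an assumption)."""
    if len(coords) != len(native):
        raise ValueError("structures must have the same number of atoms")
    total = sum((a[d] - b[d]) ** 2
                for a, b in zip(coords, native) for d in range(3))
    return math.sqrt(total / len(coords))

# Hypothetical two-atom structures.
print(radius_of_gyration([(1, 0, 0), (-1, 0, 0)]))          # 1.0
print(rmsd([(0, 0, 0), (0, 0, 0)], [(3, 4, 0), (3, 4, 0)]))  # 5.0
```

Evaluating such functions on every frame of a folding trajectory gives one column of the raw data per reaction coordinate.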
32Raw Data
33Patterned Cluster
2 0 0.1 4 0.23
3 23 26 27 23 26 27
- RED: Number of columns in the pattern (also
defined as the Pattern Type)
- WHITE: Column number
- PURPLE: Column value
- YELLOW: Number of occurrences
- GREEN: Occurrences
34Sample Patterned Cluster File
2 0 7.335 1 0.735
1006 59728 87235 94826-94831 95748-95752 95761-95763 120424-120426
2 0 7.335 1 0.736
1003 59728 87235 94826-94831 95748-95752 95761-95763 95769
3 0 7.335 4 -5.881 6 3.292
1036 59728 72071 87235 94826 94828-94831 95761-95763
3 0 7.335 4 -5.881 5 2.214
1056 59728 72071 87235 94826 94828-94831 95761-95763
5 2 8.144 3 0.899 4 -3.855 5 -33.574 6 3.292
1089 45533 59728 72071 87235 94826 95748-95752
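A hedged sketch of reading this file layout follows. It assumes each patterned cluster takes two lines: a header with the number of columns followed by (column number, column value) pairs, then a line with the number of occurrences followed by the occurrence times, where `a-b` denotes an inclusive range. This is inferred from the sample above and is not a documented format. The function names are illustrative.

```python
def parse_occurrences(tokens):
    """Expand tokens such as '94826' or '94828-94831' into a list of times."""
    times = []
    for tok in tokens:
        if "-" in tok:
            start, end = tok.split("-")
            times.extend(range(int(start), int(end) + 1))
        else:
            times.append(int(tok))
    return times

def parse_patterned_clusters(lines):
    """Parse alternating header and occurrence lines into dictionaries."""
    rows = [ln.split() for ln in lines if ln.strip()]
    clusters = []
    for header, occ in zip(rows[0::2], rows[1::2]):
        n_cols = int(header[0])
        pairs = header[1:1 + 2 * n_cols]
        columns = {int(pairs[i]): float(pairs[i + 1])
                   for i in range(0, len(pairs), 2)}
        clusters.append({
            "pattern_type": n_cols,
            "columns": columns,
            # The declared count is kept as-is and not checked against the list.
            "declared_count": int(occ[0]),
            "occurrences": parse_occurrences(occ[1:]),
        })
    return clusters

sample = [
    "3 0 7.335 4 -5.881 6 3.292",
    "1036 59728 72071 87235 94826 94828-94831 95761-95763",
]
print(parse_patterned_clusters(sample))
```

In the sample rows the listed occurrences are far fewer than the declared count, presumably because the slide shows only the start of each line, so the sketch does not enforce that the two agree.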
35The need for Visual Analysis of Patterned Cluster
Data
- β-Hairpin file is approx. 500 MB
- Difficult to study the textual representation and
get a global view
- Very difficult to see the interaction of all
patterned clusters in relation to each other
- Also very difficult to remember all patterned
clusters and their occurrence in time
36Visual Requirements
- Global View
- Navigation and Focus
- Relative growth
- Details of characteristics on demand
37Need for Customized Tool
- All of the existing visualization techniques for
microarrays had one or more drawbacks
- None could provide a visual depiction of the
relative growth of clusters.
38Terrain Metaphor
- Has been shown to be a useful technique for
searching a corpus of documents
- Very recently the idea has been applied to gene
expression, with high-density clusters
represented as mountains
39Using a Landscape Metaphor to solve our
requirements
- Each mountain represents a patterned cluster
- Mountain growth represents the evolution of a patterned
cluster
- Clicking on a mountain returns the details of its
patterned cluster
40PROTERAN
41Mapping of Patterned Cluster Data into Terrain
Geometry
42Mapping of Patterned Cluster data into Terrain
Geometry
- Pattern Type: Number of columns in a patterned
cluster
- Column Combination: Unique number that
identifies a combination of columns
2 0 0.1 4 0.23
3 23 26 27 23 26 27
43Column Combinations
- $\frac{c!}{(c - t)!\, t!}$
- c = number of characteristics
- t = pattern type (number of columns in the pattern)
Pattern Type | Number of Column Combinations
2 | 21
3 | 35
4 | 35
5 | 21
6 | 7
7 | 1
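The counts in this table match the binomial coefficient with c = 7, which is the number of reaction coordinates listed earlier for the hairpin. A short check is sketched below. The string labels built from column numbers are an illustrative choice that happens to match the labels used in the layout.

```python
import math
from itertools import combinations

def column_combination_counts(c):
    """Number of column combinations for each pattern type t = 2..c."""
    return {t: math.comb(c, t) for t in range(2, c + 1)}

def combination_labels(c, t):
    """Labels such as '01' or '0123' for every combination of t columns."""
    return ["".join(str(col) for col in combo)
            for combo in combinations(range(c), t)]

print(column_combination_counts(7))
print(combination_labels(7, 2)[:3])  # ['01', '02', '03']
```

`math.comb` requires Python 3.8 or later.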
44Layout
- We first thought of using an automated layout
technique.
- However, one of Dr. Zhou's requirements was that
the same patterned cluster should appear in the
same position for consistent interpretation.
- Another was that larger pattern types (6 and 7
columns) must be placed very distinguishably.
- Hence it was decided to use a manual layout
design, described next.
45Layout
01 02 03 01234 01235 01236 012 013 014 015 016
04 05 06 01245 01246 01256 023 024 025 026 034
12 13 14 01345 01346 01356 035 036 045 046 056
15 16 23 01456 02345 02346 123 124 125 126 134
24 25 26 02356 02456 03456 135 136 145 146 156
34 35 36 12345 12346 12356 234 235 236 245 246
45 46 56 12456 13456 23456 256 345 346 356 456
0123 0124 0125 0126 0134
0135 0136 0145 0146 0156
012345 012346 012356 0234 0235 0236 0245 0246
0123456 012456 013456 023456 0256 0345 0346 0356 0456
123456 1234 1235 1236 1245 1246
1256 1345 1346 1356 1456
2345 2346 2356 2456 3456
46Top Patterned Clusters Visualized
- Final requirement by Dr. Parida and Dr. Zhou is
that only the top 10 largest patterned clusters
of each column combination should be visualized
10TH Highest Occurrence of combination 01
9TH Highest Occurrence of combination 01 | 2ND Highest Occurrence of combination 01 | 3RD Highest Occurrence of combination 01
8TH Highest Occurrence of combination 01 | Highest Occurrence of combination 01 | 4TH Highest Occurrence of combination 01
7TH Highest Occurrence of combination 01 | 6TH Highest Occurrence of combination 01 | 5TH Highest Occurrence of combination 01
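Selecting the clusters to draw can be sketched as below, assuming "largest" means the highest number of occurrences, as the labels on this slide suggest. The dictionary keys `combination` and `count` are hypothetical names, not PROTERAN's data model.

```python
from collections import defaultdict

def top_clusters(clusters, limit=10):
    """Keep the `limit` clusters with the most occurrences per combination."""
    by_combo = defaultdict(list)
    for cluster in clusters:
        by_combo[cluster["combination"]].append(cluster)
    return {
        combo: sorted(group, key=lambda c: c["count"], reverse=True)[:limit]
        for combo, group in by_combo.items()
    }

# Hypothetical input: twelve clusters for combination 01, one for 23.
clusters = [{"combination": "01", "count": n} for n in range(1, 13)]
clusters.append({"combination": "23", "count": 4})
selected = top_clusters(clusters)
print([c["count"] for c in selected["01"]])
```

The rank within each group (highest first) can then decide which cell of the fixed layout a mountain occupies.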
47PROTERAN LAYOUT
48Animated Terrain Evolution
- Time proceeds from 0 to the maximum number of
experiments
- At each time unit, all patterned clusters are checked
- If there is an occurrence, the mountain's height
is increased
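The growth rule on this slide can be sketched as a generator that yields one frame of mountain heights per time unit. The fixed height increment per occurrence and the function name are assumptions; the slide does not say how much a mountain grows.

```python
def terrain_heights(clusters, max_time, step=1.0):
    """Yield (time, heights); heights[i] is the height of mountain i.

    `clusters` is a list of occurrence-time lists, one per patterned cluster.
    """
    occurrence_sets = [set(times) for times in clusters]
    heights = [0.0] * len(clusters)
    for t in range(max_time + 1):
        # Check every patterned cluster at this time unit.
        for i, occurrences in enumerate(occurrence_sets):
            if t in occurrences:
                heights[i] += step   # an occurrence raises the mountain
        yield t, list(heights)

# Hypothetical clusters occurring at times 23, 26, 27 and at times 0, 1.
for t, heights in terrain_heights([[23, 26, 27], [0, 1]], max_time=27):
    if t in (1, 25, 27):
        print(t, heights)
```

Rendering each yielded frame in sequence produces the animation, and comparing heights across frames shows the relative growth of clusters.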
49Mountains of PROTERAN
50Results and Extensions
51Results
- Very encouraging feedback
- Easy-to-use layout; the interface allows:
  - Identification of states
  - Obtaining the values of patterned clusters
  - Seeing the relation of patterned clusters to each other as
they grow over time
- On first use, Dr. Zhou said that
he was able to find that the hydrophobic core is
largely formed before the beta-strand hydrogen
bonds are formed.
52Future of PROTERAN
- Introduced at Intelligent Systems for
Molecular Biology (ISMB) in Scotland; very
well received
- Robert-Cedergren Bioinformatics Colloquium at the
University of Montreal (Sept 23-24)
53Extensions
- Analyze with different types of protein data
- More generic layout with more characteristics
- Application with different types of data
54Summary
- Review of existing techniques to cluster and
visualize gene expression data
- Protein characteristics data is similar to
gene expression data
- None of the existing techniques was applicable, hence the
need for a customized visual
- Terrain metaphor to solve our requirements,
implemented in the program PROTERAN
55Questions