PROTERAN

Provided by: kushk
Learn more at: https://cs.nyu.edu

1
PROTERAN
  • ANIMATED TERRAIN EVOLUTION FOR VISUAL ANALYSIS OF
    PROTEIN FOLDING TRAJECTORY

2
The need for Bioinformatics
  • Bioinformatics: the application of computational
    techniques to the management and analysis of
    biological information.
  • Clustering techniques applied to the data are not
    enough. A good visual representation is needed.

3
Agenda
  • Microarrays
  • Review of existing clustering and visualization
    techniques on gene expression data
  • The need for a customized visualization tool for
    use by Dr. Laxmi Parida and Dr. Ruhong Zhou of the
    computational biology group at the IBM Watson
    Research Center for visual analysis of protein
    characteristics
  • Introduce our new technique that makes use of an
    animated terrain, implemented in the program
    called PROTERAN

4
Function of Genes and Proteins
  • Through the proteins they encode, genes
    orchestrate the mysteries of life
  • Protein functions vary widely from mechanical
    support to transportation to regulation.

5
Still a lot of work ahead
  • Traditional methods of discovering their
    functions were done on a gene-by-gene basis, thus
    throughput was low.
  • Many genes are believed to work together; this
    cannot be observed one gene at a time.

6
Microarrays
  • Solve the throughput problem
  • Allow scientists to see genes on a genomic level

7
Expression Matrix
8
Clustering and Visualization Techniques Review
9
Clustering
  • Clustering: the act of grouping similar objects
    together
  • Applied to gene expression in order to find the
    function of unknown genes
  • Many different clustering techniques exist in the
    literature. Representative techniques are
    discussed next.

10
Determining similarity between two genes
  • Choose a similarity (distance) measure to compare
    genes
  • e.g. Euclidean distance
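
As a minimal sketch of the distance measure named above, the function below computes the Euclidean distance between two expression profiles. The function name and the sample values are illustrative assumptions, not taken from the slides.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same experiments")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical expression values for two genes across three experiments.
gene_a = [1.0, 2.0, 3.0]
gene_b = [4.0, 6.0, 3.0]
distance = euclidean_distance(gene_a, gene_b)  # sqrt(9 + 16 + 0) = 5.0
```

A smaller distance means the two genes behave more alike across the experiments.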

11
Hierarchical Clustering
  1. Create distance matrix of all genes in relation
    to each other
  2. Find the two closest genes
  3. Merge these two genes and redo distance matrix
  4. Repeat steps 2-3 until only one cluster left
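
The four steps above can be sketched as follows. This is a naive illustration, not the implementation used by any of the tools reviewed here. It assumes Euclidean distance and average linkage (the slide does not say how the distance to a merged cluster is recomputed), and all names and data are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(profiles):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(i,) for i in range(len(profiles))]  # every gene starts alone
    merges = []
    while len(clusters) > 1:                          # step 4: repeat
        # Steps 1-2: compare every pair of clusters and pick the closest.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], profiles),
        )
        merges.append((a, b, average_linkage(a, b, profiles)))
        # Step 3: replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical two-experiment profiles for four genes.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
history = agglomerate(profiles)
```

Recomputing every pairwise distance on each pass keeps the sketch short but is expensive, which is consistent with the CPU and memory cost that makes the technique hard to scale.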

12
Dendrogram
  • Binary tree with a distinguished root, which has
    all the data items at the leaves
  • Re-orders the expression matrix to place similar
    genes beside each other

13
Example
Agglomerative Hierarchical Clustering
14
Advantages
  • Familiar to biologists
  • Few parameters to specify

15
Disadvantages
  • Requires fast CPUs and large amounts of memory
  • Does not identify important clusters
  • Only represents hierarchically organized data
  • Does not scale up

16
Disadvantages (cont.)
  • A dendrogram always offers 2^(n-1) equivalent
    representations (where n = number of elements),
    since each of its n-1 internal nodes can be
    flipped

17
Self Organizing Maps (SOMs)
  • User picks the number of clusters, called nodes
  • Nodes are randomly mapped to M-dimensional space
    (M = number of experiments)
  • Node values are adjusted by random vectors picked
    from the original data
  • After the node values settle, vectors are
    clustered to the closest node
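
The procedure above can be illustrated with a small, simplified SOM. This is a sketch under stated assumptions: the nodes form a 1-D chain, the learning rate decays linearly, and direct neighbours of the winning node move half as far as the winner. None of these choices come from the slides, and the data is hypothetical.

```python
import math
import random

def closest_node(vector, nodes):
    """Index of the node nearest to the vector."""
    return min(range(len(nodes)), key=lambda i: math.dist(vector, nodes[i]))

def train_som(data, n_nodes, steps=200, seed=0):
    """Train a 1-D chain of SOM nodes and return the final node vectors."""
    rng = random.Random(seed)
    dims = len(data[0])
    lows = [min(v[d] for v in data) for d in range(dims)]
    highs = [max(v[d] for v in data) for d in range(dims)]
    # Nodes start at random positions inside the bounding box of the data.
    nodes = [[rng.uniform(lows[d], highs[d]) for d in range(dims)]
             for _ in range(n_nodes)]
    for step in range(steps):
        rate = 0.5 * (1 - step / steps)   # learning rate shrinks over time
        sample = rng.choice(data)         # random vector from the data
        winner = closest_node(sample, nodes)
        for idx, node in enumerate(nodes):
            gap = abs(idx - winner)
            if gap > 1:
                continue                  # only the winner and its neighbours move
            pull = rate if gap == 0 else rate / 2
            for d in range(dims):
                node[d] += pull * (sample[d] - node[d])
    return nodes

# Hypothetical expression vectors (M = 2 experiments).
data = [[0.0, 0.1], [0.2, 0.0], [9.8, 10.0], [10.0, 9.9]]
nodes = train_som(data, n_nodes=2)
assignments = [closest_node(v, nodes) for v in data]
```

The number of nodes, the number of steps, the learning rate, and the neighbourhood rule all have to be chosen by the user, which is the "many parameters" drawback of SOMs.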

18
Visualization
  1. Dendrogram
  2. Error Bar Representation

19
Visualization
  • U-Matrix

20
Advantages
  • User has partial control over structure
  • Fuzzy Clusters
  • Variety of visual techniques applicable

21
Disadvantages
  • Number of clusters must be known beforehand
  • Many parameters to specify

22
Principal Component Analysis (PCA)
  • Mathematical technique that can be used to reduce
    the number of dimensions of data

23
Visualization
24
Advantages
  • No parameters required
  • 3D Visualization

25
Disadvantages
  • Little control over structure
  • Running time of O(N^3)
  • Not applicable when input is a distance matrix

26
Biclustering
  • Clustering of both rows and columns
    simultaneously

27
Available Software
Software Name | Description | Available at
F-Scan | Quantification and analysis of fluorescently probed microarrays; scatterplots; multiple image comparison | http://abs.cit.nih.gov/fscan/
TIGR SpotFinder | Spot identification | http://www.tigr.org/software/
Cluster | Hierarchical clustering, k-means clustering, Self-Organizing Map (SOM), PCA | http://rana.lbl.gov/EisenSoftware.htm
Genesis | A Java suite containing various tools such as filters, normalization, visualization tools, common clustering algorithms, SOM, k-means, PCA | http://genome.tugraz.at/Software/GenesisCenter.html
J-Express Pro 2.0 | Hierarchical clustering, k-means, Principal Component Analysis, Self-Organizing Maps, profile similarity search, normalization and filtering, raw data import, project organization | http://www.molmine.com/frameset/frm_jexpress.htm
TreeView | Cluster output visualization | http://rana.lbl.gov/EisenSoftware.htm
28
Protein Folding
29
Reaction Coordinates
  • Folding determines the function of a protein
  • An all-atom recreation of a protein is unrealistic
  • Reaction coordinates are used to describe protein
    structure:
      • Fraction of native contacts
      • Radius of gyration
      • RMSD from the native structure
      • Number of beta-strand hydrogen bonds
      • Number of alpha-helix turns
      • Hydrophobic core radius of gyration
      • Principal components
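
To make one of these reaction coordinates concrete, the sketch below computes a radius of gyration as the root-mean-square distance of the atoms from their centroid. It assumes equal atomic masses for simplicity (a mass-weighted version would weight each atom), and the coordinates are hypothetical rather than taken from the folding data.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of 3-D points from their centroid.

    Assumes every atom has the same mass.
    """
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    mean_sq = sum(
        sum((p[k] - center[k]) ** 2 for k in range(3)) for p in coords
    ) / n
    return math.sqrt(mean_sq)

# Four hypothetical atoms, each one unit away from the origin.
atoms = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
rg = radius_of_gyration(atoms)  # 1.0
```

Applying the same function to only the hydrophobic core residues would give the hydrophobic core radius of gyration.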

30
Protein States
  • While folding, a protein goes through certain
    states
  • The raw data is similar to microarray data.
  • Dr. Parida and Dr. Zhou have developed their own
    techniques and clustered β-hairpin data.

31
Reaction Coordinates used on the β-Hairpin
  • Number of native β-strand hydrogen bonds
  • Radius of gyration of the hydrophobic core
    residues
  • Radius of gyration of entire protein
  • Fraction of native contacts
  • Principal component 1
  • Principal component 2
  • Root mean square deviation (RMSD) from the native
    structure.

32
Raw Data
33
Patterned Cluster
2  0  0.1  4  0.23
3  23  26  27
  • RED: Number of columns in the pattern, also
    defined as the Pattern Type (here 2)
  • WHITE: Column number (here 0 and 4)
  • PURPLE: Column value (here 0.1 and 0.23)
  • YELLOW: Number of occurrences (here 3)
  • GREEN: Occurrences (here 23, 26 and 27)
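
A minimal sketch of reading one record in this layout is shown below. It assumes a record is exactly two lines (pattern line, then occurrence line) and that a token such as 94826-94831 in the occurrence line denotes an inclusive range. The class and function names are hypothetical and this is not the PROTERAN parser.

```python
from dataclasses import dataclass

@dataclass
class PatternedCluster:
    columns: dict        # column number -> column value
    declared_count: int  # number of occurrences stated in the record
    occurrences: list    # time steps at which the pattern occurs

def parse_record(pattern_line, occurrence_line):
    fields = pattern_line.split()
    n_columns = int(fields[0])            # the pattern type
    pairs = fields[1:]
    if len(pairs) != 2 * n_columns:
        raise ValueError("expected one column number and value per column")
    columns = {int(pairs[i]): float(pairs[i + 1])
               for i in range(0, len(pairs), 2)}

    tokens = occurrence_line.split()
    occurrences = []
    for token in tokens[1:]:
        if "-" in token:                  # assumed inclusive range
            start, end = map(int, token.split("-"))
            occurrences.extend(range(start, end + 1))
        else:
            occurrences.append(int(token))
    return PatternedCluster(columns, int(tokens[0]), occurrences)

cluster = parse_record("2 0 0.1 4 0.23", "3 23 26 27")
```

The declared count is stored separately rather than checked against the list, because a truncated listing can show fewer occurrences than it declares.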

34
Sample Patterned Cluster File
2 0 7.335 1 0.735
1006 59728 87235 94826-94831 95748-95752 95761-95763 120424-120426
2 0 7.335 1 0.736
1003 59728 87235 94826-94831 95748-95752 95761-95763 95769
3 0 7.335 4 -5.881 6 3.292
1036 59728 72071 87235 94826 94828-94831 95761-95763
3 0 7.335 4 -5.881 5 2.214
1056 59728 72071 87235 94826 94828-94831 95761-95763

5 2 8.144 3 0.899 4 -3.855 5 -33.574 6 3.292
1089 45533 59728 72071 87235 94826 95748-95752
35
The need for Visual Analysis of Patterned Cluster Data
  • The β-hairpin file is approximately 500 MB
  • Difficult to study the textual representation and
    get a global view
  • Very difficult to see interaction of all
    patterned clusters in relation to each other
  • Also very difficult to remember all patterned
    clusters and their occurrence in time

36
Visual Requirements
  • Global View
  • Navigation and Focus
  • Relative growth
  • Details of characteristics on demand

37
Need for Customized Tool
  • All of the existing visualization techniques on
    microarrays had one or more drawbacks
  • None were able to provide a visual for depicting
    relative growth of clusters.

38
Terrain Metaphor
  • Has been shown to be a useful technique in
    searching a corpus of documents
  • Very recently the idea has been applied to gene
    expression with high density clusters
    representing mountains

39
Using a Landscape Metaphor to solve our requirements
  • Each mountain represents a patterned cluster
  • Mountain growth represents evolution of patterned
    cluster
  • Clicking on mountains returns details of
    patterned cluster

40
PROTERAN
41
Mapping of Patterned Cluster Data into Terrain Geometry
42
Mapping of Patterned Cluster Data into Terrain Geometry
  • Pattern Type: number of columns in a patterned
    cluster
  • Column Combination: unique number that
    identifies a combination of columns

2  0  0.1  4  0.23
3  23  26  27
43
Column Combinations
  • Number of column combinations:
    $\binom{c}{t} = \frac{c!}{(c - t)!\, t!}$
  • $c$ = number of characteristics
  • $t$ = pattern type

Pattern Type | Number of Column Combinations
2 | 21
3 | 35
4 | 35
5 | 21
6 | 7
7 | 1
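
The counts in this table match the binomial coefficient for seven characteristics, which agrees with the seven reaction coordinates listed for the β-hairpin. The sketch below reproduces the counts and enumerates the combinations. Numbering the columns 0 to 6 and joining the digits into a label are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

N_CHARACTERISTICS = 7  # assumed: columns numbered 0 through 6

def column_combinations(pattern_type, n=N_CHARACTERISTICS):
    """All column combinations of one pattern type, as digit-string labels."""
    return ["".join(map(str, combo))
            for combo in combinations(range(n), pattern_type)]

# Number of column combinations for pattern types 2 through 7.
counts = {t: comb(N_CHARACTERISTICS, t) for t in range(2, 8)}
# counts == {2: 21, 3: 35, 4: 35, 5: 21, 6: 7, 7: 1}

two_column_labels = column_combinations(2)  # starts "01", "02", "03", ...
```
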
44
Layout
  • We first thought of using an automated layout
    technique.
  • However, one of Dr. Zhou's requirements was that
    the same patterned cluster should always appear
    in the same position for consistent
    interpretation.
  • Another was that the larger pattern types (6 and
    7 columns) must be placed so that they stand out
    clearly.
  • Hence we decided to use a manual layout design,
    described next.

45
Layout
01 02 03 01234 01235 01236 012 013 014 015 016
04 05 06 01245 01246 01256 023 024 025 026 034
12 13 14 01345 01346 01356 035 036 045 046 056
15 16 23 01456 02345 02346 123 124 125 126 134
24 25 26 02356 02456 03456 135 136 145 146 156
34 35 36 12345 12346 12356 234 235 236 245 246
45 46 56 12456 13456 23456 256 345 346 356 456
0123 0124 0125 0126 0134
0135 0136 0145 0146 0156
012345 012346 012356 0234 0235 0236 0245 0246
0123456 012456 013456 023456 0256 0345 0346 0356 0456
123456 1234 1235 1236 1245 1246
1256 1345 1346 1356 1456
2345 2346 2356 2456 3456
46
Top Patterned Clusters Visualized
  • A final requirement from Dr. Parida and Dr. Zhou
    was that only the 10 largest patterned clusters
    of each column combination should be visualized

Placement of the ten highest-occurrence patterned clusters of combination 01 (1st = highest occurrence):

10th
 9th   2nd   3rd
 8th   1st   4th
 7th   6th   5th
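
The top-ten requirement can be sketched as a simple group-and-sort. The input shape, a sequence of (column combination label, occurrence count) pairs, and the function name are assumptions for illustration only.

```python
from collections import defaultdict

def top_clusters(clusters, limit=10):
    """Keep the `limit` largest occurrence counts per column combination."""
    groups = defaultdict(list)
    for combination, occurrence_count in clusters:
        groups[combination].append(occurrence_count)
    return {combo: sorted(counts, reverse=True)[:limit]
            for combo, counts in groups.items()}

# Hypothetical data: twelve clusters for combination "01", one for "02".
clusters = [("01", n) for n in range(1, 13)] + [("02", 5)]
top = top_clusters(clusters)
```

Here the two smallest clusters of combination 01 are dropped, while combination 02 keeps its single cluster.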

47
PROTERAN LAYOUT
48
Animated Terrain Evolution
  • Time proceeds from 0 to the maximum number of
    experiments
  • At each time unit, all patterned clusters are
    checked
  • If there is an occurrence, the mountain's height
    is increased
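
The animation loop described above can be sketched as a generator that yields the mountain heights after each time unit. The fixed growth increment per occurrence and all names are assumptions. The slides do not state how much a mountain grows per occurrence.

```python
def evolve_heights(occurrences_by_cluster, max_time, growth=1.0):
    """Yield (time, heights) after each time unit from 0 to max_time."""
    occurrence_sets = {name: set(times)
                       for name, times in occurrences_by_cluster.items()}
    heights = {name: 0.0 for name in occurrence_sets}
    for t in range(max_time + 1):
        # Check every patterned cluster at this time unit.
        for name, times in occurrence_sets.items():
            if t in times:
                heights[name] += growth  # the mountain grows on an occurrence
        yield t, dict(heights)           # copy, so each frame is a snapshot

# Hypothetical occurrence times for two patterned clusters.
frames = list(evolve_heights({"A": [0, 2, 3], "B": [3]}, max_time=3))
```

Each yielded frame could drive one redraw of the terrain, so mountains that rise earlier correspond to patterns that form earlier in the trajectory.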

49
Mountains of PROTERAN
50
Results and Extensions
51
Results
  • Very encouraging feedback
  • Easy-to-use layout; the interface allows:
      • Identification of states
      • Obtaining values of patterned clusters
      • Seeing the relation of patterned clusters to
        each other as they grow over time
  • Even in his first use, Dr. Zhou reported finding
    that the hydrophobic core is largely formed
    before the beta-strand hydrogen bonds are formed.

52
Future of PROTERAN
  • Introduced at the Intelligent Systems for
    Molecular Biology (ISMB) conference in Scotland;
    very well received
  • Robert-Cedergren Bioinformatics Colloquium at the
    University of Montreal (Sept 23-24)

53
Extensions
  • Analyze with different types of protein data
  • More generic layout with more characteristics
  • Application with different types of data

54
Summary
  • Review of existing techniques to cluster and
    visualize gene expression data
  • Protein characteristics data is similar to that
    of gene expression data
  • None of the existing techniques met our
    requirements, hence the need for a customized
    visualization
  • Terrain Metaphor to solve our requirements
    implemented in the program PROTERAN

55
Questions