PUMAdb: Data Analysis Tutorial

Provided by: johnm109

1
PUMAdb Data Analysis Tutorial
  • June 1, 2004

2
User Help: Help, Tutorials and Workshops
  • Help and FAQ
  • http://puma.princeton.edu/help/
  • http://puma.princeton.edu/help/FAQ.shtml
  • Tutorials regularly scheduled
  • Welcome tutorial
  • Data analysis, Normalization and Clustering
  • Interested? Email array@genomics.princeton.edu
  • Hybridization and Scanning: Individual Instruction
  • Email dstorton@molbio.princeton.edu

3
PUMAdb Data Analysis
  • Data Analysis Background
  • Data normalization
  • Clustering algorithms
  • Data centering
  • Using the Database's Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation

4
Data Analysis Background
  • Data normalization
  • Transforms data for cross-array comparison, by
    eliminating or compensating for some biases.
  • Clustering algorithms
  • Identify and reveal patterns within the data.
  • Data centering
  • Transforms data for within-array comparison.

5
What is data normalization?
  • Normalization is an attempt to correct for
    systematic bias in data.
  • Normalization allows you to compare data from one
    array to another.
  • In practice we do not always understand the data
    - inevitably some biology will be removed too (or
    at least not revealed).

6
[Figure labels: Tumor; Pool of Cell Lines]
7
Such biases have consequences
  • Plotting the frequency of un-normalized
    intensities reveals the differential effect
    between the two channels.

8
How do we deal with this?
  • Normalization
  • In general, an assumption is made that the
    average gene does not change.
  • You need to understand your data, to know if that
    is an appropriate assumption or not.
  • The number of reporters (clones or genes) you
    are assaying will affect this.

9
Normalization
10
Effect on log ratios
[Figure: frequency histograms of log ratios, un-normalized vs. normalized]
11
Total Intensity Normalization
  • For those spots that are thought to be well
    measured, calculate mean or median log ratio.
  • Use this as a normalization factor to adjust all
    log ratios.
  • Equivalent to assuming same total intensity in
    both channels.
  • Our current software
  • provides two simple methods for selecting well-measured
    spots: pixel-by-pixel regression, and foreground over
    background intensity.
  • calculates normalized values for all channel 2
    measurements, and ratios.
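
The total intensity approach above can be sketched as follows. This is a minimal illustration, not PUMAdb's own code: the function name `normalize_log_ratios`, the `well_measured` flags and the sample values are all assumptions made for the example, and the median is used as the normalization factor.

```python
from statistics import median


def normalize_log_ratios(log_ratios, well_measured):
    """Shift all log ratios so the well-measured spots have a median of zero."""
    reference = [lr for lr, ok in zip(log_ratios, well_measured) if ok]
    if not reference:
        raise ValueError("no well-measured spots to normalize against")
    factor = median(reference)  # the normalization factor, in log space
    return [lr - factor for lr in log_ratios], factor


# Hypothetical log2(red/green) values; the last spot failed quality checks.
raw = [0.9, 1.1, 1.0, 1.4, 5.0]
flags = [True, True, True, True, False]
normalized, factor = normalize_log_ratios(raw, flags)
```

Subtracting a constant in log space is the same as dividing every ratio by a single factor, which is why the adjustment applies to all spots, including those not used to compute it.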

12
Normalization by Subset
  • Housekeeping genes
  • Calculate normalization based on biologically
    determined stable genes.
  • Not always valid: even very stable genes can
    respond to some conditions.
  • Spiking or doping controls
  • Calculate based on introduced DNA species.
  • Requires careful measurement of total DNA in each
    channel.
  • Our software accepts a global (per array),
    user-defined normalization factor for this
    purpose.

13
PUMAdb Data Analysis
  • Data Analysis Background
  • Data normalization
  • Clustering algorithms
  • Data centering
  • Using the Database's Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation

14
Clustering Algorithms
In microarray studies, we often use clustering
algorithms to help us identify patterns in
complex data. For example, we can randomize the
data used to represent this painting and see if
clustering will help us visualize the pattern.
15
Clustering algorithms
?
The painting is sliced into rows which are then
randomized.
16
Clustering algorithms
Rows ordered by hierarchical clustering with
nodes flipped to optimize ordering
17
Clustering algorithms
Rows ordered by Self-Organizing Maps
18
Clustering: Random vs. Biological Data
From Eisen MB, et al., PNAS 1998; 95(25):14863-8
19
How does clustering work?
  1. Compare all expression patterns to each other.
  2. Join patterns that are the most similar out of
    all patterns.
  3. Compare all joined and unjoined patterns.
  4. Go to step 2, and repeat until all patterns are
    joined.

20
How do we compare expression profiles?
  • Treat expression data for a gene as a
    multidimensional vector.
  • Decide on a distance metric to compare the
    vectors.
  • Plenty to choose from:
  • Pearson correlation, Euclidean distance,
    Manhattan distance, etc.

21
Expression Vectors
  • Crucial concept for understanding clustering
  • Each gene is represented by a vector where
    coordinates are its values (log(ratio)) in each
    experiment
  • x = log(ratio) in expt 1
  • y = log(ratio) in expt 2
  • z = log(ratio) in expt 3
  • etc.

22
Distance Metrics
  • Distances are measured between expression
    vectors
  • Distance metrics define the way we measure
    distances
  • Many different ways to measure distance
  • Euclidean distance
  • Pearson correlation coefficient(s)
  • Manhattan distance
  • Mutual information
  • Kendall's tau
  • etc.
  • Each has different properties and can reveal
    different features of the data

23
Euclidean distance
  • The Euclidean distance metric detects similar
    vectors by identifying those that are closest in
    space.
  • In this example, A and C are closest to one
    another.

24
Pearson correlation
  • The Pearson correlation disregards the magnitude
    of the vectors but instead compares their
    directions.
  • In this example, Gene A and Gene B have the same
    slope, so would be most similar to each other.

25
Distance Metric: Pearson vs. Euclidean
[Figure: expression profiles of genes A, B and C]
  • By Euclidean distance, A and B are most similar.
  • By Pearson correlation, A and C are most similar.
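
The contrast between the two metrics can be reproduced with a few made-up profiles. The vectors below are hypothetical and are not the ones plotted on the slide; they are chosen so that A is nearest to B by Euclidean distance but best correlated with C.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(u, v):
    """Pearson correlation; raises ZeroDivisionError for a constant vector."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    numerator = sum(a * b for a, b in zip(du, dv))
    denominator = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return numerator / denominator


gene_a = [1.0, 2.0, 1.0, 2.0]
gene_b = [2.0, 2.0, 1.0, 1.0]  # close to A in space, different shape
gene_c = [4.0, 8.0, 4.0, 8.0]  # same shape as A, four times the amplitude

closest_by_euclidean = min(("B", gene_b), ("C", gene_c),
                           key=lambda item: euclidean(gene_a, item[1]))[0]
closest_by_pearson = max(("B", gene_b), ("C", gene_c),
                         key=lambda item: pearson(gene_a, item[1]))[0]
```

Here `closest_by_euclidean` picks B and `closest_by_pearson` picks C, because the correlation ignores the difference in amplitude between A and C.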

26
Hierarchical Clustering
  1. Calculate the distance between all genes. Find
    the smallest distance. If several pairs share the
    same similarity, use a predetermined rule to
    decide between alternatives.
  2. Fuse the two selected clusters to produce a new
    cluster that now contains at least two objects.
    Calculate the distance between the new cluster
    and all other clusters.
  3. Repeat steps 1 and 2 until only a single cluster
    remains.
  4. Draw a tree representing the results.
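
The steps above can be sketched in plain Python. The slide does not say how the distance to a newly fused cluster is calculated, so this sketch assumes average linkage with Euclidean distance; the function names and the sample profiles are likewise invented for illustration.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_cluster(vectors):
    """Return the merges in order, each a pair of clusters (tuples of row indices)."""
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        # Steps 1 and 2: find the closest pair. min() keeps the first of any
        # tied pairs, which serves as the predetermined tie-break rule here.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], vectors),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)  # fuse the two selected clusters
        merges.append((a, b))
    return merges  # step 4 would draw the tree from this merge order


profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
merges = hierarchical_cluster(profiles)
```

Recomputing every linkage on each pass is slow for thousands of genes, but it keeps the correspondence with the numbered steps easy to follow.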

27
Clustering: Optimizing node order
  • When joining a gene vector to another, it is
    important to think about the order in which the
    nodes are joined.
  • In this example, ASH1 is allegedly most similar
    to PIR1, so their patterns are displayed adjacent
    to one another.

28
And we finally get a cluster
29
Clustering: Two-way clustering
  • Just as gene patterns are clustered, array
    patterns can be clustered.
  • All the data points for an array can be used to
    construct a vector for that array and the vectors
    of multiple arrays can be compared.

30
Clustering: Two-way Clustering
Two-way clustering can help show which samples
are most similar, as well as which genes.
31
So is clustering the solution?
  • Advantages
  • Simple
  • Easy to implement
  • Easy to visualize
  • Disadvantages
  • Can lead to incorrect/incomplete conclusions
  • Discarding of subtleties in 2-way clustering
  • May be driven by strong sub-clusters

32
Clustering: Partitioning Methods
  • Split data up into smaller, more homogeneous sets
  • Should avoid artifacts associated with
    incorrectly joining dissimilar vectors
  • Can cluster each partition independently of
    others
  • Self-Organizing Maps is one partitioning method

33
Clustering: Self-Organizing Maps
  • SOMs result in genes being assigned to partitions
    of most similar genes.
  • Neighboring partitions are more similar to each
    other than they are to distant partitions.

34
The $64,000 question
  • How many partitions do I use?
  • Ask a statistician
  • Tibshirani R, et al. (2000) Estimating the number
    of clusters in a dataset via the Gap statistic
  • http://www-stat.stanford.edu/~tibs/ftp/gap.pdf
  • Ask us, and we'll say "trial and error" :-)
  • The ideal outcome is a single expression pattern
    in each partition, and each partition distinct
    from the others.

35
PUMAdb Data Analysis
  • Data Analysis Background
  • Data normalization
  • Clustering algorithms
  • Data centering
  • Using the Database's Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation

36
Data Centering
  • Centering sets the average value of a vector to
    zero.
  • This results in a loss of information, but may
    reveal important patterns.

37
Data Centering
  • Gene centering is useful when the actual value of
    the ratio is not important or is not meaningful
    (e.g., common reference).
  • Centering is generally not appropriate when using
    a biologically meaningful control sample, such as
    a matched, untreated sample, or a zero timepoint.

38
Data Transformation: Centering
  • To illustrate how centering affects data, a small
    sample of data was duplicated. A constant was
    added to the second copy of each row.
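
The duplicate-and-shift illustration can be tried on a single made-up row. In this sketch the function name `center`, the use of `None` for missing values and the constant 3.0 are all assumptions for the example.

```python
def center(values):
    """Subtract the mean of the non-missing values; None marks missing data."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    mean = sum(present) / len(present)
    return [None if v is None else v - mean for v in values]


row = [1.0, -1.0, 2.0, None, 0.0]
# The second copy of the row, with a constant added to every value.
shifted = [None if v is None else v + 3.0 for v in row]

centered_row = center(row)
centered_shifted = center(shifted)
```

After centering, the two copies are identical: the added constant, and with it any information carried by the absolute level of the ratios, is gone.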

39
Data Centering: Effects of Different Centering Strategies
  • Uncentered Data, No Centering Metric During Clustering
  • Uncentered Data, Centering Metric During Clustering
  • Centered Data, No Centering Metric During Clustering
  • Centered Data, Centering Metric During Clustering
40
PUMAdb Data Analysis
  • Data Analysis Background
  • Data normalization
  • Clustering algorithms
  • Data centering
  • Using the Database's Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation

41
Data Retrieval and Analysis
  • Experiment names will be listed with feature
    extraction software indicated.

42
Gene Selection and Annotation
  • Specify genes or clones
  • Collapse data by SUID or LUID
  • Determine UID column
  • Choose biological annotation
  • Label result set

43
Gene Selection: Specify Genes or Clones
  • Use all genes or clones on an array
  • Select a genelist from your loader account
  • Enter a list of genes to select. The names
    should be separated by two colons (::).

44
Gene Selection: All genes
  • Ten arrays
  • All genes
  • No control or empty spots, Spot flag = 0
  • 8690 SUIDs used in cluster
  • Using all genes results in a very long cluster!

45
Gene Selection: Genelists
  • Ten arrays
  • 500-gene genelist
  • No control or empty spots, Spot flag = 0
  • 380 SUIDs used for cluster
  • Using a genelist reduces the length of the cluster

46
Gene Selection: Specify Genes or Clones
  • Using all genes or clones on an array will give
    you a very long list of genes. This is the best
    option when you have no pre-existing expectations
    about your data and simply want to see what is
    happening.
  • Selecting a genelist from your loader account
    will give you a more select group of genes. This
    can be appropriate for testing hypotheses.

47
Gene Selection: Retrieving and Collapsing Data
  • Collapse or averaging occurs within a single
    array. Multiple instances of the same entity
    will be combined as specified.
  • Duplicated entities can be defined in three ways:
  • Sequence Unique ID (the identifier for a
    reporter). A SUID refers to the sequence itself.
  • Laboratory Unique ID (the identifier for the
    source of the sample in the lab). An LUID refers
    to a specific microtiter well. Multiple LUIDs
    may correspond to one SUID.
  • SPOT (the number corresponding to a feature on a
    print). This option only appears for retrieval
    from a single print (array design). Multiple
    spots/features on an array may contain a single
    LUID or SUID.

48
Gene Selection: Collapse by SUID
  • Ten arrays
  • 500-gene genelist
  • No control or empty spots, Spot flag = 0
  • 380 SUIDs used for cluster

49
Gene Selection: Collapse by LUID
  • Ten arrays
  • 500-gene genelist
  • No control or empty spots
  • Retrieve by LUID
  • 397 LUIDs used for cluster
  • Retrieving via LUIDs may increase the number of
    gene vectors generated

50
Gene Selection: Collapse Data
  • Retrieving by SUID (database's identifier for
    sequence) yields 380 genes -- samples that came
    from different microtiter wells will be collapsed
    if they are called the same sequence
  • Retrieving by LUID (the identifier for the
    original microtiter well location of the sample)
    yields 397 genes -- even if samples are the
    same sequence, they will not be collapsed if they
    come from different microtiter wells
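
The difference between the two retrieval modes can be sketched with a toy array. The record layout, the identifiers and the choice of a simple mean as the way duplicates are combined are all assumptions for this example; the database lets the user specify how instances are combined.

```python
from collections import defaultdict


def collapse(spots, key):
    """Average the log ratios of all spots on one array that share the same ID."""
    groups = defaultdict(list)
    for spot in spots:
        groups[spot[key]].append(spot["log_ratio"])
    return {uid: sum(vals) / len(vals) for uid, vals in groups.items()}


# Hypothetical spots: two wells (L1, L2) hold the same sequence (S1).
spots = [
    {"spot": 1, "luid": "L1", "suid": "S1", "log_ratio": 1.0},
    {"spot": 2, "luid": "L2", "suid": "S1", "log_ratio": 3.0},
    {"spot": 3, "luid": "L3", "suid": "S2", "log_ratio": -1.0},
]

by_suid = collapse(spots, "suid")  # wells L1 and L2 are merged into S1
by_luid = collapse(spots, "luid")  # every well keeps its own row
```

Collapsing by SUID yields two rows here and collapsing by LUID yields three, which mirrors the 380 versus 397 counts reported above.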

51
Gene Annotation: UID column
  • Rows of data can be labeled with one of four
    options:
  • Systematic name / clone ID (the default)
  • SUID gives the database's unique ID
  • LUID gives the lab's unique ID (we don't always
    have data for this; defaults to SUID)
  • SPOT gives the spot number

52
Gene Annotation: Biological Annotation
  • The list includes all information stored within
    the database for any gene from the organism in
    question. Not all genes will have all
    annotations.
  • Annotations from a genelist (selected earlier)
    can be used to describe the genes

53
Array Annotation: Name Choices
  • Arrays (hybridizations) are identified in the
    database by slide name (e.g., serial number) and
    experiment name, both unique.
  • Agilent and Affymetrix data sets are further
    identified by a result set name: possibly more
    than one per hybridization, and not guaranteed to
    be unique.

54
Gene Selection and Annotation: Summary
  • Specify genes or clones
  • Collapse data by SUID or LUID
  • Determine UID column
  • Choose biological annotation
  • Label arrays/hybridizations

55
PUMAdb Data Analysis
  • Data Analysis Background
  • Data normalization
  • Clustering algorithms
  • Data centering
  • Using the Database's Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation

56
Data Filtering
  • Choose data column to retrieve
  • Elect to invert reverse dye replicates
  • Elect to filter by spot flag
  • Select spot criteria for filtering
  • Define image presentation options

57
Data Filtering: Choose Data to Retrieve
  • You can retrieve and cluster any numerical
    measurement from your data.
  • Clustering doesn't necessarily make sense for all
    fields.
  • Default (and most appropriate) fields for
    clustering are log ratio (two-channel data) and
    signal/intensity (single-channel data).

58
Data Filtering: Spot Flags, Reverse Replicates
  • Unreliable spots (identified by software or
    visual inspection) can be flagged. Spots that
    are not flagged are given a flag value of 0.
  • Autoflags (GenePix 5.0) are included in this
    option.
  • If your experiments are identified as reverse
    replicates, clicking on the reverse option will
    properly invert the ratio and log ratio data.
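
Inverting a reverse-dye replicate amounts to taking the reciprocal of the ratio, which flips the sign of its log. The sketch below assumes base-2 logs and a hypothetical function name; it does not show how the database performs the operation internally.

```python
import math


def invert_reverse_replicate(ratio):
    """Return the inverted ratio and its log2 for a dye-swapped hybridization."""
    inverted = 1.0 / ratio
    return inverted, math.log2(inverted)


# A four-fold ratio measured on a dye-swapped array.
inverted_ratio, inverted_log = invert_reverse_replicate(4.0)
```

After inversion the replicate points in the same direction as its forward-labeled partner, so the two can be compared or averaged directly.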

59
Data Filtering: Selecting Filtering Criteria
  • Each spot will be individually assessed as
    specified, prior to any averaging or collapse.
  • Each filter can be made active and customized as
    desired.
  • Filters can be combined using logical operators
    (filter string), defaulting to a logical AND.
  • Filters available will be appropriate to the
    feature extraction software used. The exception
    is ScanAlyze and older versions of GenePix, which
    get (but can't use) all options for GenePix.

60
Data Filtering: Default Spot Filters
  • Regression correlation measures pixel-by-pixel
    agreement between the two channels.
  • Foreground/Background intensities are a simple
    measure of signal to noise.
  • Absolute intensity cutoffs impose a minimum net
    signal.
  • "Failed" and "Is Contaminated" refer to the
    quality of the spotted material.
  • Equivalent defaults are presented for Agilent
    data.
  • Affymetrix data can be filtered on detection,
    detection p-value, etc.
  • Any data, including biological annotations, can
    be used for customized filters.

61
Data Filtering: Filter selection
  • Data filters should be customized for the data
    retrieved.
  • Uniform filter values will be applied to each
    array retrieved.
  • The database makes available some basic tools for
    examining data and choosing appropriate filter
    values.

62
Data Filtering: Filter Selection
  • Any numerical field can be plotted against any
    other (or none), in a scatter plot or histogram.
  • This is useful for quality assessment, and for
    selecting filters.

63
Data Filtering: Regression Correlation
  • Plot filter field (here regression correlation)
    against test field (log ratio).
  • Log ratios should center around 0.
  • Here, the log ratios appear to diverge below a
    regression correlation of about 0.4 - 0.6.

64
Spots with low regression correlation
65
Data Filtering
  • Ten arrays
  • 500-gene genelist
  • Spot flag = 0
  • No other filters
  • 380 SUIDs used for cluster

66
Data Filtering: Regression Correlation
  • Ten arrays
  • 500-gene genelist
  • Spot flag = 0
  • Regression correlation > 0.6
  • 380 SUIDs used for filtering
  • Filtering away spots with low regression
    correlation removes many spots

67
Data Filtering: Regression Correlation
  • Ten arrays
  • 500-gene genelist
  • Spot flag = 0
  • Regression correlation > 0.8
  • 364 SUIDs used for clustering
  • A more stringent filter reduces the data quite a
    bit and even removes some genes entirely

68
Data Filtering: Foreground to Background Intensity Ratios
  • FG/BG (log scale) versus log ratio
  • Data center around 0
  • Impose cutoff at 2.5 (linear) to eliminate
    flare at low relative intensity.

69
Data Filtering: Intensity to Background Ratios
  • Ten arrays used
  • 500-gene genelist
  • Spot flag = 0
  • Normalized Channel 2 (red) mean intensity divided
    by Normalized Channel 2 median background greater
    than 2.5
  • 371 SUIDs used for clustering
  • Some arrays show very high background and some
    genes show such high background that they did not
    pass this filter in any array

70
Data Filtering: Intensity to Background Ratios
  • Ten arrays used
  • 500-gene genelist
  • Spot flag = 0
  • Channel 1 (green) mean intensity divided by
    Channel 1 median background greater than 2.5
  • 377 SUIDs used for clustering
  • Often, background can be higher in one channel --
    note that fewer data are removed here than when
    we used the same filter on Channel 2 (red)

71
Data Filtering: Intensity to Background Ratios
72
Data Filtering: Intensity Cutoff
  • More than one way to look at a fish.

73
Data Filtering: Combinations of Filters
  • Ten arrays
  • 500-gene genelist
  • Spot flag = 0
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • 374 SUIDs selected for clustering
  • This data set was formed by selecting spots that
    are good quality (via the regression correlation)
    and good intensity in at least one channel
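
The combined filter described above can be written as a single predicate. The thresholds come from this slide; the field names (`regression_correlation`, `ch1_net`, `ch2_net`, `flag`) and the sample spots are hypothetical, chosen only for the sketch.

```python
def passes_filters(spot, min_regression=0.6, min_net=350):
    """Unflagged AND good regression correlation AND (either channel bright enough)."""
    good_quality = spot["regression_correlation"] > min_regression
    good_intensity = spot["ch1_net"] > min_net or spot["ch2_net"] > min_net
    return spot["flag"] == 0 and good_quality and good_intensity


spots = [
    {"flag": 0, "regression_correlation": 0.9, "ch1_net": 200, "ch2_net": 900, "log_ratio": 2.1},
    {"flag": 0, "regression_correlation": 0.4, "ch1_net": 800, "ch2_net": 900, "log_ratio": 0.3},
    {"flag": 0, "regression_correlation": 0.8, "ch1_net": 100, "ch2_net": 120, "log_ratio": -1.0},
    {"flag": 1, "regression_correlation": 0.9, "ch1_net": 800, "ch2_net": 900, "log_ratio": 0.0},
]

# Spots that fail become missing values (None) rather than being dropped.
filtered = [s["log_ratio"] if passes_filters(s) else None for s in spots]
```

Only the first spot survives: the second fails on regression correlation, the third on intensity in both channels, and the fourth because it is flagged.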

74
Data Filtering
  • No filters: 380 SUIDs
  • Regression correlation > 0.8: 364 SUIDs
  • Ratio of intensity to background in both channels
    > 2.5: 370 SUIDs
  • Net intensity in either channel > 350: 377 SUIDs
  • 70% of pixels within one standard deviation of
    background: 345 SUIDs
  • Regression correlation > 0.6 AND Net intensity in
    either channel > 350: 374 SUIDs

75
Data Filtering: Image Presentation Options
  • "Retrieve spot coordinates" will allow you to see
    an assembled image of each array after
    clustering. (However, multiple spots with the
    same contents interact poorly with use of
    systematic names as IDs - only one spot image
    will be shown).
  • "Show all spots" allows you to view the spots you
    filtered out (in addition to the ones that passed
    filtering) after clustering. This slows down
    retrieval.

76
Data Filtering Summary
  • Choose data column to retrieve
  • Elect to invert reverse-dye replicates
  • Elect to filter by spot flag
  • Select spot criteria for filtering (spot filters
    don't remove genes, but just gray data that
    don't pass, unless all spots are removed)
  • Define image presentation options

77
PUMAdb Data Analysis
  • Data Analysis Background
  • Data normalization
  • Clustering algorithms
  • Data centering
  • Using the Database's Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation

78
Data Retrieval
  • General results and progress
  • PreClustering (.pcl) file
  • Data retrieval summary report
  • Option to deposit data in repository

79
Data Retrieval Summary
80
Data Processing and Clustering
  • Experiment Selection
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation

81
Gene Filtering
  • Transform single-channel data
  • Filter genes based on data distribution
  • Data centering
  • Filter genes based on data values
  • Filter genes and arrays based on spot filter
    criteria

82
Gene Filtering: Transformation
  • Single-channel (e.g., Affymetrix) data only.
  • Adjust arrays for simple cross-array
    normalization.
  • Log-transform data for clustering.
  • May add a constant for variance stabilization
  • May replace non-positive values with very small
    values
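
The last three bullets can be sketched as one small function. The log base (2), the parameter names and the default floor value are assumptions for the example, not the database's documented defaults.

```python
import math


def log_transform(values, constant=0.0, floor=1e-6):
    """Add a constant, replace non-positive results with a small floor, take log2."""
    transformed = []
    for value in values:
        shifted = value + constant
        if shifted <= 0:
            shifted = floor  # the log of zero or a negative number is undefined
        transformed.append(math.log2(shifted))
    return transformed


# Hypothetical single-channel signal values, including a zero.
signals = [8.0, 0.0, 2.0]
logged = log_transform(signals, floor=1.0)
stabilized = log_transform([6.0, 14.0], constant=2.0)
```

A larger added constant compresses the spread of the low-intensity values, which is the variance-stabilizing effect the slide refers to.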

83
Gene Filtering: Data Distribution
  • Rank will select genes whose retrieved value is
    in the top Nth percentile for M or more arrays.
  • Deviations selects those genes whose retrieved
    value has a value significantly above or below
    the mean (N standard deviations), for M or more
    arrays.
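
A sketch of the Deviations filter follows. The slide does not say which mean is meant, so this example assumes the mean and (population) standard deviation are computed per array, across all retrieved genes; the function name, parameters and data are hypothetical.

```python
from statistics import mean, pstdev


def deviation_filter(matrix, n_sd=1.0, min_arrays=1):
    """Keep genes lying more than n_sd deviations from the array mean in >= min_arrays arrays.

    matrix maps a gene name to its list of values, one per array (None = missing).
    """
    n_arrays = len(next(iter(matrix.values())))
    stats = []
    for j in range(n_arrays):
        column = [row[j] for row in matrix.values() if row[j] is not None]
        stats.append((mean(column), pstdev(column)))
    kept = []
    for gene, row in matrix.items():
        hits = sum(
            1 for j, value in enumerate(row)
            if value is not None and abs(value - stats[j][0]) > n_sd * stats[j][1]
        )
        if hits >= min_arrays:
            kept.append(gene)
    return kept


data = {
    "g1": [0.0, 0.1],
    "g2": [0.1, 0.0],
    "g3": [3.0, -0.1],
    "g4": [-0.1, 0.0],
}
one_array = deviation_filter(data, n_sd=1.0, min_arrays=1)
two_arrays = deviation_filter(data, n_sd=1.0, min_arrays=2)
```

Raising M (`min_arrays`) from 1 to 2 drops g1, which stands out in only one of the two arrays.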

84
Gene Filtering: Percentile Rank
  • Ten arrays
  • 500-gene genelist
  • Spot flag = 0
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Rank > 95 in at least one array
  • 66 SUIDs are used for clustering
  • Many spots are removed, since only the spots that
    were very intense in the red channel were
    included

85
Gene Filtering: Deviation from Mean Value
  • Ten arrays, 500-gene genelist
  • Spot flag = 0
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Genes whose Log(Normalized Red/Green) is more
    than one standard deviation from mean in at least
    one array
  • 70 SUIDs selected for clustering
  • This filter removes spots that do not show
    significant variance from the mean -- a good way
    to identify genes with potentially interesting
    behavior

86
Gene Filtering: Centering Data
  • Data can be centered at this stage. This
    transforms the data so that the mean value is
    equal to zero. Images and downloaded files will
    reflect this transformation.
  • During clustering, data can be treated as if they
    were centered, but the values of the data are not
    affected.
  • Data centering and centering during clustering
    can be combined in all four possible ways.
  • Gene centering is useful for common references.
  • Array centering amounts to renormalizing each
    array, using the spots that pass the spot filter
    criteria.
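
The distinction between centering the data and centering only during clustering can be illustrated with a correlation that has a `centered` switch. Treating the "centering metric" as the choice between centered and uncentered correlation is an assumption of this sketch, as are the function name and sample values.

```python
import math


def correlation(u, v, centered=True):
    """Pearson correlation (centered) or its uncentered form (cosine of the angle)."""
    if centered:
        mean_u = sum(u) / len(u)
        mean_v = sum(v) / len(v)
        # Centering happens on local copies only; the stored data are untouched.
        u = [a - mean_u for a in u]
        v = [b - mean_v for b in v]
    numerator = sum(a * b for a, b in zip(u, v))
    denominator = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return numerator / denominator


row = [1.0, 2.0, 3.0]
offset_row = [-3.0, -2.0, -1.0]  # the same pattern shifted down by 4

with_centering = correlation(row, offset_row, centered=True)
without_centering = correlation(row, offset_row, centered=False)
```

With the centering metric the two rows are perfectly correlated; without it the offset dominates and the similarity turns negative, even though `row` and `offset_row` themselves are never modified.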

87
Data Centering: Effects of Different Centering Strategies
  • Uncentered Data, No Centering Metric During Clustering
  • Uncentered Data, Centering Metric During Clustering
  • Centered Data, No Centering Metric During Clustering
  • Centered Data, Centering Metric During Clustering
88
Gene Filtering: Center Genes
[Figure panels: Centered; Uncentered]
  • Ten arrays, 500-gene genelist, Spot flag = 0
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Genes centered -- no effect on number of SUIDs
    clustered, but distribution of signal is changed
    (centered data is displayed on left)

89
Gene Filtering: Data Values
  • Cutoff requires data to exceed a user-defined
    value in at least M arrays. This is perhaps our
    least useful filter. Especially when data are
    centered, you could be losing important
    information.
  • Distance requires that the length of the gene's
    expression vector, across all arrays, be greater
    than a user-defined value. This is a general
    measure of response to experimental conditions.
  • Only available for log ratio data.
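
The Distance filter can be sketched as a vector-length test. The function names, the handling of missing values (skipped) and the threshold are assumptions made for this example.

```python
import math


def vector_length(row):
    """Euclidean length of a gene's log-ratio vector, ignoring missing values."""
    return math.sqrt(sum(v * v for v in row if v is not None))


def distance_filter(matrix, min_length):
    """Keep genes whose expression vector is longer than min_length."""
    return [gene for gene, row in matrix.items() if vector_length(row) > min_length]


# Hypothetical log ratios across three arrays.
data = {
    "responsive": [3.0, None, 4.0],
    "flat": [0.1, -0.1, 0.0],
}
kept = distance_filter(data, min_length=2.0)
```

A gene that stays near a log ratio of zero in every array has a short vector and is removed, while one that moves strongly in even a few arrays is kept.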

90
Gene Filtering: Values of Log(Red/Green)
  • Ten arrays, 500-gene genelist, Spot flag = 0
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Absolute value of Log of Red/Green Normalized
    Ratio (Mean) is > 2 for at least 1 array
  • 57 SUIDs selected for clustering
  • Since this is a filter based on values, caution
    should be exercised -- values often change during
    normalization and centering.

91
Gene Filtering: Spot Filter Criteria
  • Genes can be screened out if they do not meet the
    spot criteria a given percentage of the time, as
    specified by the user.
  • Arrays can be similarly filtered out if they do
    not meet the spot filter criteria.
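
Filtering genes by the share of spots that passed can be sketched as below. The 80% threshold is the one used in the examples that follow in this tutorial; the function names and the use of `None` for spots that failed the spot filters are assumptions.

```python
def fraction_passing(row):
    """Fraction of arrays in which this gene has a value that passed the spot filters."""
    return sum(value is not None for value in row) / len(row)


def filter_genes(matrix, min_fraction=0.8):
    """Keep only genes with enough good data points across the arrays."""
    return {gene: row for gene, row in matrix.items()
            if fraction_passing(row) >= min_fraction}


# Hypothetical data for ten arrays; None marks a spot that failed filtering.
data = {
    "mostly_good": [0.5] * 8 + [None] * 2,    # 8 of 10 pass
    "too_sparse": [0.5] * 7 + [None] * 3,     # 7 of 10 pass
}
kept = filter_genes(data, min_fraction=0.8)
```

The same test applied to columns instead of rows would screen out arrays whose spots fail too often.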

92
Gene Filtering: Amount of Data Passing Filters
  • Ten arrays, 500-gene genelist, Spot flag = 0
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Centered genes and arrays
  • Genes must have 80% of spots pass filters
  • 285 SUIDs are used for the cluster
  • This reduces the number of genes with missing data
    and permits the clustering to be performed on
    genes with more data points.

93
Gene Filtering: Amount of Data Passing Filters
  • Ten arrays, 500-gene genelist, Spot flag = 0
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Centered genes and arrays
  • Genes and Arrays must have 80% of spots pass
    filters
  • 285 SUIDs are used for the cluster
  • Filtering away arrays whose spots fail the
    filters at a high frequency is a good way to
    remove pathologically bad arrays

94
Spot Filtering vs. Gene Filtering
Gene filters remove the genes that do not meet
the filter criteria often enough. This reduces
the number of genes.
Spot filters remove individual data points. That
means there will be more missing (gray) data.
95
Gene Filtering Summary
  • Correct selection of filters will retain
    interesting data and remove those that are
    unreliable or uninteresting.
  • A good understanding of your experiment is
    REQUIRED before you can decide which filters make
    biological sense.
  • Not all filtering criteria are useful for all
    experiments.

96
Gene Filtering: Results
  • The numbers of genes and arrays are shown
  • PreClustering files (.pcl) can be downloaded
  • Summary report is available
  • May deposit to repository at this stage.
  • Proceed to clustering

97
Gene Filtering: Data Retrieval Summary Report
98
Gene Filtering Summary
  • Transform single-channel data
  • Filter genes based on data distribution
  • Center data
  • Filter genes based on data values
  • Filter genes and arrays based on spot filter
    criteria

99
PUMAdb Data Analysis
  • Data Analysis Background
  • Data normalization
  • Clustering algorithms
  • Data centering
  • Using the Database's Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation

100
Clustering and Image Generation
  • Partitioning options
  • Clustering metric selections
  • Correlated genes
  • Image generation options

101
Clustering Metric Selections
  • Genes and arrays can be clustered.
  • Pearson correlation treats vectors as if they
    were the same (unit) length.
  • Euclidean distance measures the absolute distance
    between two points in space. Therefore Euclidean
    distance will be affected by both the direction
    and the amplitude of the vectors.

102
Clustering: Gene Clustering
  • Ten arrays, 500-gene genelist, Spot flag = 0
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Centered genes
  • Genes must have 80% of spots pass filters
  • 274 SUIDs are used for the cluster
  • No centering during clustering
  • Pearson correlation, genes clustered

103
Clustering: Tree Displays
  • Clustered gene arrays are displayed adjacent to
    most similar arrays.
  • The nodes of the trees indicate the members of an
    array and the degree of similarity to its
    neighbor.

104
Clustering: Array clustering
  • Ten arrays, 500-gene genelist, Spot flag = 0
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Centered genes, 80% must pass filters
  • 274 SUIDs are used for the cluster
  • No centering during clustering
  • Pearson correlation, clustering genes and arrays
  • Clustering of arrays will change the order of the
    arrays in your display

105
Clustering: Tree Displays
  • Clustering arrays will give a tree for the arrays
    that is very similar to that for the genes

106
Clustering: Array Clustering
[Figure panels: No Array Clustering; With Array Clustering]
107
Clustering: Partitioning Data
  • Data can be partitioned into a Self Organizing
    Map (SOM)
  • If partitioned, dimensions of the SOM must be
    specified

108
Clustering: Self-Organizing Maps
  • SOMs result in genes being assigned to partitions
    of most similar genes
  • Neighboring partitions are more similar to each
    other than they are to distant partitions

109
Clustering: Correlated Genes
  • A file listing the best-correlated genes, for
    each gene retrieved, can be produced.

110
Clustering: Image Generation Options
  • Contrast can be modified
  • Missing data can be assigned different colors of
    gray
  • Both red/green and blue/yellow schemes can be
    used
  • You can elect to view spot images

111
Clustering: Visualization
  • Click on the image to get a dynamic display.
  • Click on one of the other options to see static
    displays with or without the spot images.
  • Downloadable files (.cdt, .atr, .gtr, report) for
    use with other tools (e.g., TreeView).

112
Clustering: Cluster Image
  • Scale is indicated on the color bar
  • Gene names are at the right
  • Tree generated by hierarchical clustering is at
    the left

113
Clustering Display: Clustered Spot Images
114
Clustering Display: Adjacent Cluster and Clustered Spot Images
115
Clustering Display: Hierarchical Cluster View
116
Clustering and Image Generation Summary
  • Partitioning data
  • Clustering metric selections
  • Correlated genes
  • Image generation options

117
PUMAdb Data Analysis
  • Data Analysis Background
  • Data normalization
  • Clustering algorithms
  • Data centering
  • Using the Database's Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation

118
User Help: Help, Tutorials and Workshops
  • Help and FAQ
  • http://puma.princeton.edu/help/
  • http://puma.princeton.edu/help/FAQ.shtml
  • Tutorials regularly scheduled
  • Welcome tutorial
  • Data analysis, Normalization and Clustering
  • Interested? Email array@genomics.princeton.edu
  • Hybridization and Scanning: Individual Instruction
  • Email dstorton@molbio.princeton.edu

119
PUMAdb Office Hours
  • CIL 135
  • Thursday 9-11 am
  • Friday 2-4 pm