Title: GIS mAdb Intermediate
1. GIS mAdb Intermediate Informatics Training
John Greene, Ph.D.
August 29, 2002
2. Default Definitions
- Signal refers to the (Target Intensity - Background Intensity). More precisely, it is the MEAN Target Intensity - MEDIAN Background Intensity. MEAN - MEDIAN was used based on a publication by Mike Eisen at Stanford. You can now also choose MEDIAN - MEDIAN.
- Normalization: by default, we use the overall ratio (Signal Cy5 / Signal Cy3). Normalization is calculated so that the median(Ratio) is 1.0. Outliers with an extremely low signal are excluded from the calculation.
- Spot size: for GenePix, Spot Size is the percentage of feature pixels 1 S.D. above background.
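As a rough illustration of the definitions above, here is a Python sketch (hypothetical function names and an assumed low-signal cutoff, not mAdb's actual code) of computing a spot signal and rescaling Cy5/Cy3 ratios so that the median ratio becomes 1.0:

```python
from statistics import mean, median

def spot_signal(target_pixels, background_pixels):
    """Signal = MEAN target intensity - MEDIAN background intensity."""
    return mean(target_pixels) - median(background_pixels)

def normalize_ratios(cy5_signals, cy3_signals, low_signal_cutoff=100):
    """Scale Cy5/Cy3 ratios so that median(ratio) == 1.0.

    Spots whose signal falls below the (assumed) low-signal cutoff are
    excluded from the normalization-factor calculation, mirroring the
    exclusion of extreme low-signal outliers described above.
    """
    ratios = [r / g for r, g in zip(cy5_signals, cy3_signals)]
    kept = [ratio for ratio, r, g in zip(ratios, cy5_signals, cy3_signals)
            if r >= low_signal_cutoff and g >= low_signal_cutoff]
    factor = median(kept)            # overall normalization factor
    return [ratio / factor for ratio in ratios]
```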
3. Recenter
The histogram uses spots that have been extracted and filtered.
4. Whenever possible, use ratios converted to log base 2
- Why? Because it makes variation in ratios of intensities more independent of absolute magnitude.
- Easier interpretation: negative numbers are downregulated genes, positive numbers are upregulated genes.
- Evens out highly skewed distributions.
- Gives a more realistic sense of variation.
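The symmetry argument can be seen in two lines of Python: a 2-fold change in either direction becomes +1 or -1 instead of 2.0 versus 0.5.

```python
import math

# Log base-2 transform of expression ratios
ratios = [0.25, 0.5, 1.0, 2.0, 4.0]
log_ratios = [math.log2(r) for r in ratios]
# negative = downregulated, zero = unchanged, positive = upregulated
```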
5. Simple Group Retrieval Tool (ArraySuite data)
Applies spot filtering options to selected arrays and creates a new working dataset.
6. Extended Dataset Extraction Tool (GenePix Arrays Only)
7. Spotfilters and Dataset Properties
8. Dataset Properties
- Checkbox to Activate
- Rows ordered by
- Dataset Location:
  - Transient (24 hours after creation)
  - Temporary (30 days after last access)
  - Permanent
- Dataset Label (highly recommended)
9. Data Display - Example
10. Additional Filtering and Analysis Options
11. Additional Data/Array Filtering Options
Applies selected filtering options to the dataset based on values in the data and creates a new subset. Can be repeated without changing the set name, for trial-and-error filtering.
12. Open/Expand datasets
13. Filtering hierarchy / tree structure: why dataset management is a necessity
Original spot filtering -> Original Dataset -> Additional filtering -> Data subsets
14. Refreshing Gene Info (Dataset Management)
Not yet available on GIS mAdb.
15. Dataset Management: delete/move datasets
16. Dataset History
A log is maintained for each dataset tracing the
analysis history. When the history is displayed,
links are provided to allow the user to recall
any dataset in the analysis chain.
18. Boolean Comparison Summary
Clicking on the Logical Subset links creates a
new working dataset reflecting the Boolean
results.
19. Array Analysis Methods
- Gene Discovery
  - Outlier detection: simple and group logic retrieval tools, multiple array viewers
  - Scatter plots
- Pattern Prediction
  - t-tests, Wilcoxon tests, ANOVA, Kruskal-Wallis
  - Stanford PAM (imminent)
- Pattern Discovery
  - Clustering: Hierarchical, K-means, SOMs
  - Multidimensional Scaling, PCA
  - Future: Gene Shaving, Tree Harvesting
20. Designating groups
21. Two Group Statistical Comparison Options
22. T-test
- The t-test assesses whether the means of two groups are statistically different from each other.
- Once you compute the t-value, you have to look it up in a table of significance to test whether the ratio is large enough to say that the difference between the groups is not likely to have been a chance finding. To test the significance, you need to set a risk level (called the alpha level). In most research, the "rule of thumb" is to set the alpha level at .05. This means that five times out of a hundred you would find a statistically significant difference between the means even if there was none (i.e., by "chance").
- More than two groups: ANOVA (parametric) or Kruskal-Wallis (non-parametric).
23. Independent T-test Variance
- Equal (pooled) or unequal (separate) variance.
- For independent (non-paired) samples, you must choose an option for the variance of the data.
- Checking the equal-variance option bases the calculation of the variance of a difference between two proportions, or between two means, on the assumption that the variance in the populations from which the two groups were selected is the same. Note that the default choice, two populations with different variances, would be preferred by many researchers. You should have some evidence, logical or observed, that the variances are the same before selecting the equal-variance option.
- The pooled variance under the equal-variance assumption will usually be larger than under the unequal-variance assumption. However, the number of degrees of freedom will also be larger, at df = (n1 + n2) - 2. This results in a slightly more powerful test of statistical significance.
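A minimal sketch of the two t-statistic variants, written from the standard textbook formulas rather than taken from mAdb's code:

```python
from statistics import mean, variance

def t_statistic(a, b, equal_variance=False):
    """Two-sample t statistic, returned together with its degrees of freedom.

    equal_variance=True  -> pooled-variance t-test, df = (n1 + n2) - 2
    equal_variance=False -> Welch (separate-variance) t-test, the
                            default choice noted above
    """
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a), variance(b)   # sample variances (n - 1 denominator)
    if equal_variance:
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = (pooled * (1 / n1 + 1 / n2)) ** 0.5
        df = n1 + n2 - 2
    else:
        se = (v1 / n1 + v2 / n2) ** 0.5
        # Welch-Satterthwaite approximation for the degrees of freedom
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return (mean(a) - mean(b)) / se, df
```

With equal group sizes and equal sample variances the two statistics coincide; they diverge as the variances differ.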
24. Wilcoxon tests
- The t-tests are widely used, but they depend on certain assumptions:
  1. The data are from a normal distribution (i.e. parametric).
  2. All observations are independent.
- When these assumptions are acceptable, the t-tests provide the most sensitive and powerful approach to the analysis of the data.
- However, in many cases observations arise from populations which are clearly non-normal. In these cases simpler tests are available, based on signs or on the rank order of the data. These are known as non-parametric tests.
- Independent samples: use the Wilcoxon rank-sum test (the Mann-Whitney and Wilcoxon rank-sum tests use different methods of calculation, but are equivalent in result).
- Paired (dependent) samples: use the Wilcoxon matched-pairs signed-rank test.
- http://www-jime.open.ac.uk/98/12/demos/stats/stats.html
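A small illustration of the rank-sum idea (not mAdb's implementation): compute W, the sum of the pooled-sample ranks of the first group, using mid-ranks for ties. The equivalent Mann-Whitney U is then W - n1(n1 + 1)/2.

```python
def rank_sum(a, b):
    """Wilcoxon rank-sum statistic W for sample `a` (mid-ranks for ties)."""
    pooled = sorted(a + b)
    # mid-rank for each distinct value (average rank across ties)
    ranks = {}
    for v in set(pooled):
        positions = [i + 1 for i, x in enumerate(pooled) if x == v]
        ranks[v] = sum(positions) / len(positions)
    return sum(ranks[v] for v in a)
```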
29. Ad Hoc Query Tool
30. Ad Hoc Query Output
31. Graphics tools: Scatter Plot
32. Correlation Summary Report: pairwise scatter plots
33. Multiple Array Viewer
34. Multidimensional Scaling
- Mapping of data points from a high-dimensional space into a lower-dimensional space.
- Example: represent a tumor's 5,000-dimensional gene profile as a point in 3-dimensional space.
- Typically uses nonlinear optimization methods that select lower-dimensional coordinates to best match pairwise distances in the higher-dimensional space.
- Depends only on pairwise distances (Euclidean, 1 - correlation, ...) between points.
- All distances in the lower-dimensional space must be viewed in a relative sense.
- Allows missing values in input data.
36. PCA
Principal components analysis (PCA) explores the
variability in gene expression patterns and finds
a small number of themes. These themes can be
combined to make all the different gene
expression patterns in a data set. The first
principal component is obtained by finding the
linear combination of expression patterns
explaining the greatest amount of variability in
the data. The second principal component is
obtained by finding another linear
combination of expression patterns
that is at right angles to (i.e. orthogonal and
uncorrelated with) the first principal component.
The second principal component must explain the
greatest amount of the remaining variability in
the data after accounting for the first principal
component. Each succeeding principal component
is similarly obtained. There will never be more
principal components than there are variables
(experimental points) in the data. Any individual
gene expression pattern can be recreated as a
linear combination of the principal component
expression patterns.
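For intuition, a minimal two-variable PCA can be written by hand, because the 2x2 covariance matrix has a closed-form eigendecomposition. This is only a sketch of the idea described above, not the tool's implementation:

```python
import math

def pca_2d(points):
    """First principal component of 2-variable data.

    Returns the unit direction of the first component and the fraction
    of total variance it explains.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # sample covariance matrix entries (n - 1 denominator)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # largest eigenvalue of [[sxx, sxy], [sxy, syy]]
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    l1 = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # corresponding eigenvector = first principal component direction
    vx, vy = l1 - syy, sxy
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), l1 / tr
```

For points lying exactly on the line y = 2x, the first component points along that line and explains all of the variance.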
37. Principal Components Analysis
- Principal Components Analysis (PCA) is an exploratory multivariate statistical technique for simplifying complex data sets. Given m observations on n variables, the goal of PCA is to reduce the dimensionality of the data matrix by finding r new variables, where r is less than n. Termed principal components, these r new variables together account for as much of the variance in the original n variables as possible while remaining mutually uncorrelated and orthogonal. Each principal component is a linear combination of the original variables, and so it is often possible to ascribe meaning to what the components represent. Principal components analysis has been used in a wide range of biomedical problems, including the analysis of microarray data in search of outlier genes (Hilsenbeck et al. 1999) as well as the analysis of other types of expression data (Vohradsky et al. 1997, Craig et al. 1997).
- Use PCA to focus on specific expression patterns and their changes, identify discriminating genes, separate contributing profiles, and find trends, e.g. in time series or dose-response curves.
- For the dispersion matrix, use the correlation option when the data are scaled to fit within boundaries, or when variables are measured in different units or have different variances. Most often, covariance is the correct choice (when variables are measured in the same units and have similar variance).
- N.B. PCA does not allow missing values in input data; these are filtered out.
- http://www.statsoftinc.com/textbook/stfacan.html
- http://www.okstate.edu/artsci/botany/ordinate/PCA.htm
38. PCA Details
(First three components)
39. PCA Details
40. MDS/PCA comparison
- PCA
  - Linear projection
  - Does not allow (filters out) missing values
  - Preserves large dissimilarities better
  - Meaningful variables: information content known
  - Computationally efficient for large numbers of samples
  - Meaningful orientation
  - Performed on covariance or correlation similarities
- MDS
  - Nonlinear projection
  - Allows missing values
  - Preserves small dissimilarities better
  - Meaningless variables: information content not known
  - Computationally inefficient for large numbers of samples
  - Arbitrary orientation
  - Performed on any type of (dis)similarities
Adapted from Partek Quick Start for Microarray Analysis.
41. Clustering
- Clustering programs make clusters even if the data are completely random; you must examine your clusters to see if they make biological sense.
- If clustered by genes: are the genes in certain clusters biologically related in function? In a pathway?
- If clustered by array: do the clusters group related samples/tissues/diseases/treatments together logically?
42. Common clustering methods
Hierarchical Clustering allows you to visualize a
set of samples or genes by organizing them into a
mock-phylogenetic tree, often referred to as a
dendrogram. In these trees, samples or genes
having similar effects on the gene expression
patterns are clustered together.
K-means clustering divides genes into distinct
groups based on their expression patterns. Genes
are initially divided into a number (k) of
user-defined and equally-sized groups. Centroids
are calculated for each group corresponding to
the average of the expression profiles.
Individual genes are then reassigned to the group
in which the centroid is the most similar to the
gene. The process is iterated until the group
compositions converge.
Self-Organizing Maps (SOMs) are similar to
k-means clustering, but add a feature
whereby the resulting groups of genes can
be displayed in a rectangular pattern, with
adjacent groups being more similar than groups
further away. Self-Organizing Maps were invented
by Teuvo Kohonen and are used to analyze many
kinds of data.
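The k-means procedure described above (assign each gene to the nearest centroid, recompute centroids, iterate until the groups stabilize) can be sketched in one dimension; the initialization shown here is one simple illustrative choice, not the tool's:

```python
from statistics import mean

def kmeans(values, k, iterations=20):
    """1-D k-means sketch: assign, recompute centroids, repeat."""
    # initial centroids: spread across the sorted data
    data = sorted(values)
    centroids = [data[i * len(data) // k] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # an empty group keeps its previous centroid
        centroids = [mean(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups
```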
43. Example of Hierarchical Clustering (Alizadeh et al., Nature, Feb. 2000)
44. Dendrogram Construction for Hierarchical Agglomerative Clustering
- Merge the two closest (least distant) objects (genes or arrays).
- Subsequent merges require specification of a linkage to define the distance between clusters:
  - Average linkage
  - Complete linkage
  - Single linkage
45. Euclidean distance
Generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ...)
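In code, the Euclidean metric is a one-liner:

```python
import math

def euclidean(p, q):
    """Distance between points p = (p1, p2, ...) and q = (q1, q2, ...)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
```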
46. Linkage Methods
- Average Linkage
  - Merge clusters whose average distance between all pairs of items (one item from each cluster) is minimized
  - Particularly sensitive to the distance metric
- Complete Linkage
  - Merge clusters to minimize the maximum distance within any resulting cluster
  - Tends to produce compact clusters
- Single Linkage
  - Merge clusters at minimum distance from one another
  - Prone to chaining and sensitive to noise
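The three linkage rules, and the agglomerative merge loop they plug into, can be sketched as follows (a naive O(n^3) illustration of the definitions above, not the server-side implementation):

```python
def cluster_distance(c1, c2, dist, linkage="average"):
    """Distance between two clusters (lists of items) under a linkage rule."""
    pairs = [dist(a, b) for a in c1 for b in c2]
    if linkage == "single":       # minimum pairwise distance
        return min(pairs)
    if linkage == "complete":     # maximum pairwise distance
        return max(pairs)
    return sum(pairs) / len(pairs)  # average linkage

def agglomerate(items, dist, linkage="average"):
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [[x] for x in items]
    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(
                       clusters[ij[0]], clusters[ij[1]], dist, linkage))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```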
47. (Data from Bittner et al., Nature, 2000)
48. Common Distance Metrics for Hierarchical Clustering
- Euclidean distance
  - Measures absolute distance (square root of sum of squared differences)
- 1 - Correlation
  - Values reflect the amount of linear association (pattern dissimilarity); the smaller the value, the more similar the gene expression pattern
49. Server-side Hierarchical Clustering
50. Hierarchical Clustering Output
51. Expanded Heatmap Thumbnail Image
52. Tree View for PostScript output (for too-large files)
http://rana.lbl.gov/EisenSoftware.htm
53. K-means
54. Self-organizing/Kohonen maps
55. Summary Remarks
- Data quality assessment and pre-processing are important.
- Different study objectives will require different statistical analysis approaches.
- Different analysis methods may produce different results. Thoughtful application of multiple analysis methods may be required.
- Chances for spurious findings are enormous, and validation of any findings on larger independent collections of specimens will be essential.
- Analysis tools are not an adequate substitute for collaboration with professional statisticians and data analysts.
56. Acknowledgments
The Single ArrayViewer and Multi-ArrayViewer were derived from the NHGRI uAP Toolset developed in the NHGRI/Cancer Genetics Branch under Dr. Jeffrey Trent. The Scatterplot and Multi-dimensional Scaling tools were derived from work done in the NCI/Biometric Research Branch under Dr. Richard Simon. The server-side Cluster uses a derivative of the Xcluster program developed at Stanford University by Gavin Sherlock, Head of Microarray Informatics.
57. Acknowledgments
- CIT NCI mAdb
- John Powell, Chief, BIMAS
- Liming Yang, Ph.D.
- Jim Tomlin
- Carla Bock
- Esther Asaki, SRA
- Robin Martell, SRA
- Kathy Meyer, SRA
- Agara Sudhindra, SRA
- Tammy Qiu, SRA
- Biometric Research Branch/NCI
- Richard Simon, Ph.D.
- Lisa McShane, Ph.D.
- Michael Radmacher, Ph.D.
- Joanna Shih, Ph.D.
- Yingdong Zhao, Ph.D.
- MSB Section
- NHGRI Java viewers
- Mike Bittner
- Yidong Chen
- Jeff Trent
59. Averaging Arrays
Names/Descriptions for averaged arrays: this tool creates a new dataset consisting of one array per group. Each array is the average of all arrays within a group. Averaging is done on the log base 2 ratio values. The new averaged arrays will not have an array name or description. You may enter appropriate Names/Descriptions to be associated with the new arrays. If you choose not to enter values, the name defaults to the Group designation and the description defaults to NULL.
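A sketch of the group-averaging step (the data structures here are illustrative, not the tool's internals): each output array is the per-gene mean of the log base-2 ratios of the arrays in its group.

```python
from statistics import mean

def average_by_group(log2_ratios, groups):
    """Average log base-2 ratio values per group, one output array per group.

    log2_ratios: dict of array name -> list of per-gene log2 ratios
    groups:      dict of group name -> list of array names in that group
    """
    averaged = {}
    for group, arrays in groups.items():
        profiles = [log2_ratios[a] for a in arrays]
        # zip(*profiles) walks the gene positions across all arrays
        averaged[group] = [mean(vals) for vals in zip(*profiles)]
    return averaged
```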
60. Gene Ontology/KEGG Pathway Summary Report