Title: Introduction
1. mAdb Basic Informatics Training
Esther Asaki, John Greene, Ph.D.
- Introduction and Overview
- Uploading Arrays
- Tools Demonstration and Discussion
May 30, 2002
2. mAdb BioInformatics Project
- Goal
- Provide an integrated set of web-based analysis
tools and a data management system for storing
and mining cDNA/oligo gene expression data using
open systems design.
- System only supports arrays produced by the NCI,
NIAID, and FDA Microarray Centers
- Currently supports Axon GenePix, GSI
Lumonics/Packard/Perkin-Elmer QuantArray, and
Arraysuite image analysis software (Yidong Chen,
NHGRI); two-color only
- Imagene possible; Affymetrix under study
3. Other microarray training
- Advanced statistics: BRB Array Tools class,
offered bimonthly; the next available classes are
already full, but you will be placed on the list
for the following classes
- MA Explorer: periodic
- Intermediate class under development: hands-on
4. mAdb
- http://madb.nci.nih.gov
- madb-support_at_bimas.cit.nih.gov (PREFERRED)
- mAdb Team
5. Architecture for µArray Informatics
[Diagram. Workflow labels: DNA Samples (Control,
Experiment); Hybridize probe to µArray; Wash; Scan;
Image; Analyze Image; Format and Upload Image and
Data; Central Expression Database; Web/Application
Server; Web-based Data Analysis Tools. Database
labels: Internal Databases; External Databases;
MGD; KEGG; TIGR; GenBank (via Entrez); GeneCards;
dbEST; UniGene.]
6. Architecture for NCI µArray Informatics
[Diagram. Workflow labels: RNA Samples (Control,
Experiment); Hybridize probe to µArray; Wash; Scan;
Image; Analyze Image; Upload Composite Image and
Data. System labels: Sun Enterprise 3500 Server,
Sybase ASE; 250 Gbytes Fiber channel; SunBlade 1000
Workstation, Apache Web Server; PC/Mac/Unix,
Netscape, Internet Explorer. Database labels:
Internal Databases; External Databases; MGD; KEGG;
TIGR; GenBank (via Entrez).]
7. System design allows
- Sharing of CGI programs between multiple Web
servers
- Customized appearance
- Independent database connections
[Diagram labels: Sun Enterprise 3500, Sybase ASE
(shown twice); NCI Web Server; FDA Web Server;
LLMPP Web Server; NIAID Web Server; 1 Terabyte
Network Storage]
8. Links to external data sources
- dbEST
- GeneCards
- LocusLink
- MGD
- GenBank
- Stanford GO (Gene Ontology)
- PubMed
- UniGene
- Automatic updates of external data sources
9. mAdb Statistics as of May 29, 2002
- 14,135 arrays from ATC uploaded since Feb. 2000
- > 100 million cDNA expression points
- 629 NIH users
- Among the largest collections of microarray
data in the world
10. mAdb Database Design: Feature Tracking
[Diagram labels: Inventory Stock; Print Plates;
Print Order; Arrays]
11. Software Downloads
12. Data Upload
- 1. Log in; change password if first-time user
- 2. Create project
- 3. Grant project access to others
- 4. Select project
- 5. Fill in experimental info
- 6. Upload image and data files (be careful not to
reverse the order of the files!)
- 7. Look at status page
- 8. Close browser when finished, for security
- 9. Once a species has been chosen for a project,
you can only see the Array Print Sets for
that species
- 10. We suggest including the slide number
scratched on the slide as part of the Experiment
Name, which will act as a unique identifier
- 11. Adjust JPEG for desired contrast/brightness
13. Create New Project
14. Adding Arrays
16. Project Access
17. Upload Status
18. Common Errors
- Common Upload Errors
- Using wrong GAL file
- Loading GAL file or Set Up file in place of
GenePix data (.gpr) file
- Loading multi-image TIFF file instead of
composite JPEG file
- Loading Image File in place of data file and vice
versa
- Common GenePix Errors
- Setting incorrect option for Analyze Absent
Feature (box should be checked) results in
truncated blocks
- Deleted blocks
- Improper gridding
19. Array Analysis Methods
- Gene Discovery
- Outlier detection: simple and group logic
retrieval tools; single and multiple array
viewers
- Scatter plots
- Pattern Discovery
- Clustering: Hierarchical, K-means, SOMs
- Multidimensional Scaling
- Gene Shaving, Tree Harvesting, PCA, etc.
- Pattern Prediction
- 2 Group t-test; others imminent
20. Default Definitions
- Signal: refers to the (Target Intensity -
Background Intensity). More precisely, it is the
MEAN Target Intensity - MEDIAN Background
Intensity. MEAN-MEDIAN was used based on a
publication by Mike Eisen at Stanford.
- Normalization: By default, we use the ratio
(Signal Cy5/Signal Cy3). Normalization is
calculated so that the median(Ratio) is 1.0.
Those outliers with an extremely low signal are
excluded from the calculation.
- Spot size: for GenePix, Spot Size is the
percentage of feature pixels 1 S.D. above
background.
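The signal and median-normalization definitions above can be sketched in a few lines of Python. This is only an illustration, not the mAdb implementation: the function names, the sample intensities, and the `min_signal` cutoff (standing in for "extremely low signal") are all assumptions.

```python
from statistics import median


def signal(mean_target, median_background):
    """Signal as defined above: MEAN target intensity minus MEDIAN background."""
    return mean_target - median_background


def normalize_ratios(cy5_signals, cy3_signals, min_signal=100.0):
    """Scale Cy5/Cy3 ratios so that the median ratio of usable spots is 1.0.

    min_signal is a hypothetical cutoff standing in for "extremely low signal".
    Spots with a non-positive Cy3 signal get None instead of a ratio.
    """
    raw = [s5 / s3 if s3 > 0 else None
           for s5, s3 in zip(cy5_signals, cy3_signals)]
    # Only spots with adequate signal in both channels set the scaling factor.
    usable = [r for r, s5, s3 in zip(raw, cy5_signals, cy3_signals)
              if r is not None and s5 >= min_signal and s3 >= min_signal]
    if not usable:
        raise ValueError("no spots pass the low-signal cutoff")
    factor = median(usable)
    return [None if r is None else r / factor for r in raw]


# Hypothetical (mean target, median background) pairs for four spots.
cy5 = [signal(m, b) for m, b in [(250.0, 50.0), (450.0, 50.0), (850.0, 50.0), (100.0, 50.0)]]
cy3 = [signal(m, b) for m, b in [(150.0, 50.0), (450.0, 50.0), (250.0, 50.0), (550.0, 50.0)]]
print(normalize_ratios(cy5, cy3))
```

In this sketch the fourth spot has a low Cy5 signal, so it does not influence the scaling factor, but it is still rescaled along with the others.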
21. Need for Normalization of Ratios
- Unequal incorporation of labels (green better
than red)
- Unequal amounts of sample
- Unequal PMT voltage
23. Whenever possible, use log spot intensities and
ratios
- Why? Because it makes variation of intensities
and ratios of intensities more independent of
absolute magnitude.
- Easier interpretation: negative numbers are
downregulated genes; positive numbers are
upregulated genes
- Evens out highly skewed distributions
- Gives a more realistic sense of variation
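A short Python illustration of the points above, using made-up ratios and base-2 logs (the base is an assumption; the slide does not specify one). A 4-fold increase and a 4-fold decrease sit at very different distances from 1.0 on the raw scale but are symmetric around 0 on the log scale.

```python
import math

# Hypothetical normalized ratios: 4x up, 2x up, unchanged, 2x down, 4x down.
ratios = [4.0, 2.0, 1.0, 0.5, 0.25]

# Base-2 log ratios: up-regulated genes are positive, down-regulated negative.
log_ratios = [math.log2(r) for r in ratios]

for r, lr in zip(ratios, log_ratios):
    print(f"ratio {r:5.2f} -> log2 ratio {lr:+.1f}")
```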
24. mAdb Analysis Paradigm
- Extract, Spot Filter, Normalize, and Align a Dataset
- Apply Data/Gene criteria Filters
- Apply appropriate Analysis/Visualization Tools
- Retrieve Datasets/Results
25. Live Demo
28. New Correlation Summary Report
29. Simple Group Retrieval Tool
Applies spot filtering options to selected arrays
and creates a new working dataset.
30. Extended Dataset Extraction Tool (GenePix Arrays
Only)
31. Extended Tool Signal, Normalization, and Ratio
Options
- Signal Calculation
- Mean Intensity - Median Background
- Median Intensity - Median Background
- Normalization
- None
- 50th Percentile (Median)
- Applied to extracted spots (spot filtered)
- All spots or only Housekeeping spots (if
designated)
- Default Ratio
- Chan B/Chan A (CY5/CY3)
- Chan A/Chan B (CY3/CY5)
33. Spot Filter Options (Checkbox to Activate)
- Exclude any Spots Flagged as
- Target diameter is between
- Target Pixels 1 Standard Deviation above
background >
- Signal/Background Ratio >
- Signal >
- Override if Chan B and/or A Signal >
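As a rough sketch of how spot filters like those listed above combine, the Python below keeps a spot only if it passes every active criterion. The `Spot` fields and all threshold values are hypothetical (the slide does not give the values entered in the form), and the override option is left out because its exact behavior is not described here.

```python
from dataclasses import dataclass


@dataclass
class Spot:
    flagged: bool               # spot carries an excluded flag
    diameter: float             # target diameter
    pct_pixels_above_bg: float  # % of target pixels > 1 S.D. above background
    signal: float
    background: float


def passes_filters(spot, min_diameter=50.0, max_diameter=200.0,
                   min_pct_pixels=55.0, min_sb_ratio=2.0, min_signal=100.0):
    """Return True if the spot passes every filter (thresholds are made up)."""
    if spot.flagged:
        return False
    if not (min_diameter <= spot.diameter <= max_diameter):
        return False
    if spot.pct_pixels_above_bg <= min_pct_pixels:
        return False
    # Signal/background ratio test, written without division so that a
    # zero background does not raise an error.
    if spot.signal <= min_sb_ratio * spot.background:
        return False
    return spot.signal > min_signal


spots = [
    Spot(False, 100.0, 80.0, 900.0, 100.0),  # passes everything
    Spot(True, 100.0, 80.0, 900.0, 100.0),   # flagged
    Spot(False, 100.0, 80.0, 150.0, 100.0),  # signal/background too low
]
kept = [s for s in spots if passes_filters(s)]
print(len(kept), "of", len(spots), "spots kept")
```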
36. Dataset Properties (Checkbox to Activate)
- Rows ordered by
- Dataset Location
- Transient (24 Hours after creation)
- Temporary (30 Days after last access)
- Permanent
- Dataset Label (highly recommended)
37. Waiting for Data Extraction
An intermediate screen monitors the data
extraction process. When the creation of the
working dataset is complete, the user can
continue to the Data Display page.
38. Data Display - Example
40. GeneCards™ Mirror Site
41. Interacting with data sets
42. Additional Data Filtering/Adjustment/Analysis
- Additional Filtering Options (Data values)
- Array Order Designation/Filtering
- Array Group Assignment/Filtering
- Two or more Group Comparison - statistics
- Boolean Comparison with another Set
- Clustering (Hierarchical, K-means, SOM)
- Correlation Summary Report: pairwise scatter
plots
- Scatter Plot
- Multi-Dimensional Scaling
- Save as a New Dataset
43. Additional Data Filtering Options
Applies selected filtering options to the dataset
and creates a new subset.
45. Dataset History
A log is maintained for each dataset tracing the
analysis history. When the history is displayed,
links are provided to allow the user to recall
any dataset in the analysis chain.
46. Filtering hierarchy/tree structure
[Diagram labels: Original spot filtering; Original
Dataset; Additional filtering; Data subsets]
47. Accessing data sets
48. Boolean Comparison Summary
Clicking on the Logical Subset links creates a
new working dataset reflecting the Boolean
results.
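A Boolean comparison of two datasets can be thought of as set operations on their gene lists. The Python below is only a sketch with hypothetical gene identifiers; it does not reflect how mAdb stores or labels its logical subsets.

```python
# Hypothetical gene identifiers remaining in two filtered datasets.
dataset_a = {"GENE1", "GENE2", "GENE3", "GENE4"}
dataset_b = {"GENE3", "GENE4", "GENE5"}

in_both = dataset_a & dataset_b    # A AND B
in_either = dataset_a | dataset_b  # A OR B
only_in_a = dataset_a - dataset_b  # A AND NOT B
only_in_b = dataset_b - dataset_a  # B AND NOT A

for label, subset in [("A AND B", in_both), ("A OR B", in_either),
                      ("A NOT B", only_in_a), ("B NOT A", only_in_b)]:
    print(f"{label}: {sorted(subset)}")
```

Each resulting subset corresponds to one candidate working dataset.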
49. Multiple Array Viewer
50. Designating groups
51. Two Group Statistical Comparison Options
52. T-test
- The t-test assesses whether the means of two
groups are statistically different from each
other.
- Once you compute the t-value, you have to look it
up in a table of significance to test whether the
t-value (the ratio of the difference between the
group means to its standard error) is large enough
to say that the difference between the groups is
not likely to have been a chance finding. To test
the significance, you need to set a risk level
(called the alpha level). In most research, the
"rule of thumb" is to set the alpha level at .05.
This means that five times out of a hundred you
would find a statistically significant difference
between the means even if there was none (i.e., by
"chance").
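To make the t-value concrete, here is a minimal Python sketch of a two-group t statistic. It assumes the classic pooled-variance (Student) form; the slides do not say which variant mAdb uses, and the log-ratio values are made up.

```python
from math import sqrt
from statistics import mean, variance


def two_group_t(group1, group2):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / std_err, df


# Hypothetical log ratios for one gene in two groups of arrays.
t_value, df = two_group_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(f"t = {t_value:.3f}, df = {df}")
```

For 4 degrees of freedom, the two-tailed critical value at alpha = .05 in a standard t table is 2.776, so a t-value whose absolute value exceeds that would be called significant at that level.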
58. Clustering
- Clustering programs make clusters even if the
data are completely random; you must examine your
clusters to see if they make biological sense
- If clustered by genes, are the genes in certain
clusters biologically related in function? In a
pathway?
- If clustered by array, do the clusters group
related samples/tissues/diseases/treatments
together logically?
59. What is Clustering?
- Discovery algorithms function by using a
bottom-up approach to explore new phenomena in the
data. Rather than relying on previous knowledge
to help sort the data sets, the data are allowed
to determine their own sorting parameters, running
unsupervised (i.e., without human intervention).
- Clustering is an example of a discovery
algorithm. Given a collection of data points,
clustering techniques find structure in the data
and attempt to organize that data into meaningful
groups.
- A cluster is a set of data points (such as gene
expression data) which are alike. The goal in
sorting is for each cluster to be distinct from
other clusters.
- Clustering is an unsupervised machine learning
technique
- Cluster analysis is the formal study of
algorithms and methods for clustering of data
60. Common clustering methods
Hierarchical Clustering allows you to visualize a
set of samples or genes by organizing them into a
mock-phylogenetic tree, often referred to as a
dendrogram. In these trees, samples or genes
with similar expression patterns are clustered
together.
K-means clustering divides genes into distinct
groups based on their expression patterns. Genes
are initially divided into a number (k) of
user-defined and equally-sized groups. Centroids
are calculated for each group corresponding to
the average of the expression profiles.
Individual genes are then reassigned to the group
in which the centroid is the most similar to the
gene. The process is iterated until the group
compositions converge.
Self-Organizing Maps (SOMs) are similar to
k-means clustering, but add an additional
feature: the resulting groups of genes can
be displayed in a rectangular pattern, with
adjacent groups being more similar than groups
further away. Self-Organizing Maps were invented
by Teuvo Kohonen and are used to analyze many
kinds of data.
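The k-means procedure described above (initial equal-sized groups, centroid calculation, reassignment, iteration until the groups stop changing) can be sketched in plain Python. This is an illustrative toy, not the mAdb clustering code: the round-robin initial split, squared Euclidean distance as the similarity measure, and the sample profiles are all assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(profiles, k, max_iter=100):
    """Return a group label (0..k-1) for each profile."""
    # Initial assignment: deal profiles out round-robin into k
    # (nearly) equally sized groups.
    labels = [i % k for i in range(len(profiles))]
    for _ in range(max_iter):
        # Centroid of each group = average of its members' profiles.
        centroids = {}
        for g in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == g]
            if members:  # an emptied group simply has no centroid
                centroids[g] = [sum(col) / len(members) for col in zip(*members)]
        # Reassign each profile to the group with the most similar centroid.
        new_labels = [min(centroids, key=lambda g: squared_distance(p, centroids[g]))
                      for p in profiles]
        if new_labels == labels:  # group compositions have converged
            break
        labels = new_labels
    return labels


# Four hypothetical two-array expression profiles forming two obvious groups.
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels = kmeans(profiles, k=2)
print(labels)
```

The result depends on the initial split, so in practice k-means is often run several times from different starting assignments.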
61. Example of Hierarchical Clustering (Alizadeh et
al., Nature, Feb. 2000)
62. Dendrogram Construction for Hierarchical
Agglomerative Clustering
- Merge two closest (least distant) objects (genes
or arrays)
- Subsequent merges require specification of
linkage to define distance between clusters
- Average linkage
- Complete linkage
- Single linkage
63. Linkage Methods
- Average Linkage
- Merge clusters whose average distance between all
pairs of items (one item from each cluster) is
minimized
- Particularly sensitive to distance metric
- Complete Linkage
- Merge clusters to minimize the maximum distance
within any resulting cluster
- Tends to produce compact clusters
- Single Linkage
- Merge clusters at minimum distance from one
another
- Prone to chaining and sensitive to noise
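The three linkage rules above differ only in how they turn the item-to-item distances between two clusters into one cluster-to-cluster distance. A minimal Python sketch, assuming Euclidean distance between items and made-up one-dimensional data (the function name is hypothetical):

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def cluster_distance(cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters under the named linkage rule."""
    # All pairwise distances, one item taken from each cluster.
    pair_distances = [dist(x, y) for x, y in product(cluster_a, cluster_b)]
    if linkage == "single":    # closest pair
        return min(pair_distances)
    if linkage == "complete":  # farthest pair
        return max(pair_distances)
    if linkage == "average":   # mean over all pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


# Two hypothetical clusters of one-dimensional points.
a = [(0.0,), (1.0,)]
b = [(3.0,), (5.0,)]
for rule in ("single", "average", "complete"):
    print(rule, cluster_distance(a, b, rule))
```

At each step, agglomerative clustering merges the pair of clusters for which this distance is smallest.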
64. (Data from Bittner et al., Nature, 2000)
65. Common Distance Metrics for Hierarchical
Clustering
- Euclidean distance
- Measures absolute distance (square root of sum of
squared differences)
- 1-Correlation
- Large values reflect lack of linear association
(pattern dissimilarity)
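Both metrics above are easy to compute directly. The Python sketch below uses Pearson correlation for the 1-correlation metric (an assumption; the slide does not name the correlation type) and made-up expression profiles.

```python
from math import sqrt


def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def one_minus_correlation(x, y):
    """1 minus the Pearson correlation of two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (norm_x * norm_y)


# Hypothetical profiles: same pattern at twice the magnitude, and a reversed one.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]
gene_c = [3.0, 2.0, 1.0]

print(euclidean(gene_a, gene_b))              # nonzero: magnitudes differ
print(one_minus_correlation(gene_a, gene_b))  # near 0: same pattern
print(one_minus_correlation(gene_a, gene_c))  # near 2: opposite pattern
```

The first two genes are far apart in Euclidean terms yet have essentially zero 1-correlation distance, which is why the choice of metric changes which genes cluster together.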
66. Server-side Hierarchical Clustering
67. Clustering In Progress Page
68. Hierarchical Clustering Output
69. Expanded Thumbnail Image
70. Tree View for PostScript output or very large
files
71. Multidimensional Scaling
- Represents data points from a high-dimensional
space in a lower-dimensional space
- Example: Represent a tumor's 5,000-dimensional
gene profile as a point in 3-dimensional space
- Typically based on principal components, or uses
optimization methods that select
lower-dimensional coordinates to best match
pairwise distances in higher-dimensional space
- Depends only on pairwise distances (Euclidean,
1-correlation, ...) between points
- All distances in lower-dimensional space must be
viewed in a relative sense
73. Tips for array analysis
- Do you really want to upload that ugly array?
- Look at Project Summaries: the normalization
factor for a good array should be between 0.5 and
2.0.
- If you have replicate arrays, do a scatter plot
to determine the correlation between the arrays
(i.e., how close the slope is to 1; for reverse
fluors, how close to -1), just for QC purposes.
- Filters: former ATC director Lance Miller sets
the signal/background threshold to 2.0 and sets
Signal to zero for both Channel A and B.
- Select Results Format as HTML preview first to
see what results look like. If you Limit
Preview, results are not limited if you choose to
export to another format.
- Turning Show Spot Images off displays results
faster.
- In clustering, you can cluster genes and/or
arrays.
74. Coming in the NEAR Future
- Excel/Clustering support for additional row
parameters
- Additional Statistics Analysis (ANOVA, ...)
- Additional Filtering based on Gene annotations
- Extensive Ad Hoc Query Applied to Datasets
- Gene Set Creation/Filtering from Gene Ontology
(GO)
- Graphical Viewers (for both Macintosh and PC)
- Full support for Arraysuite II
- Ability to average repeats/RF repeats
75. Coming in the Future
- Support for Affymetrix Data
- Shared Analysis/Dataset Areas
- Partek Datafile package retrieval
- MIAME/GEML compliance/support
76. Going in the NEAR Future (from the Gateway Tools
level)
- 1 or 2 Group Logic Retrieval Tool
- To be moved down to dataset level
- Scatter Plot Tool
- Ad Hoc Query Tool
77. Older Tools
80. Ad Hoc Query Tool
81. Ad Hoc Query Output
82. http://madb.nci.nih.gov
For assistance, remember
madb-support_at_bimas.cit.nih.gov