Introduction - PowerPoint PPT Presentation

About This Presentation
Title:

Introduction

Description:

John Powell, Chief, BIMAS, CIT. 5. Architecture for Array Informatics. Central. Expression ... The process is iterated until the group compositions converge. ... – PowerPoint PPT presentation

Slides: 83
Provided by: jimt158

Transcript and Presenter's Notes

Title: Introduction


1
mAdb Basic Informatics Training
Esther Asaki and John Greene, Ph.D.
  • Introduction and Overview
  • Uploading Arrays
  • Tools Demonstration and Discussion

May 30, 2002
2
  • mAdb BioInformatics Project
  • Goal
  • Provide an integrated set of web-based analysis
    tools and a data management system for storing
    and mining cDNA/oligo Gene Expression data using
    open systems design.
  • System only supports arrays produced by the NCI,
    NIAID, and FDA Microarray Centers
  • Currently supports Axon GenePix, GSI
    Lumonics/Packard/Perkin-Elmer QuantArray, and
    Arraysuite image analysis software (Yidong Chen,
    NHGRI); two-color only
  • ImaGene possible; Affymetrix under study

3
  • Other microarray training
  • Advanced statistics: BRB Array Tools class,
    offered bimonthly; the next available classes are
    already full, but you will be placed on the list
    for the following classes
  • MA Explorer: periodic
  • Intermediate class under development (hands-on)

4
  • mAdb
  • http://madb.nci.nih.gov
  • madb-support@bimas.cit.nih.gov
    (PREFERRED)
  • mAdb Team

5
Architecture for μArray Informatics
Image
Format and Upload Image and Data
Analyze Image
Central Expression Database
Scan
Web/ Application Server
Wash
Web-based Data Analysis Tools
Hybridize probe to μArray
MGD
KEGG
TIGR
GenBank (via Entrez)
GeneCards
dbEST
UniGene
Control
Experiment
DNA Samples
Internal Databases
External Databases
6
Architecture for NCI μArray Informatics
Image
Analyze Image
Upload Composite Image and Data
Sun Enterprise 3500 Server Sybase ASE
Scan
SunBlade 1000 Workstation Apache Web Server
Wash
PC/Mac/Unix Netscape Internet Explorer
Hybridize probe to μArray
MGD
KEGG
TIGR
GenBank (via Entrez)
Control
Experiment
250Gbytes Fiber channel
RNA Samples
Internal Databases
External Databases
7
  • System design allows
  • Sharing of CGI programs between multiple Web
    servers
  • Customized appearance
  • Independent database connections

Sun Enterprise 3500 Sybase ASE
Sun Enterprise 3500 Sybase ASE
NCI Web Server
FDA Web Server
LLMPP Web Server
NIAID Web Server
1TeraByte Network Storage
8
  • Links to external data sources
  • dbEST
  • GeneCards
  • LocusLink
  • MGD
  • GenBank
  • Stanford GO (Gene Ontology)
  • PubMed
  • UniGene
  • Automatic updates of external data sources

9
  • mAdb Statistics as of May 29, 2002
  • 14,135 Arrays from ATC uploaded since Feb. 2000
  • > 100 million cDNA expression points
  • 629 NIH users
  • Among the largest collections of microarray
    data in the world

10
mAdb Database Design Feature Tracking
Inventory Stock
Print Plates
Print Order
Arrays
11
Software Downloads
12
Data Upload
  • 1. Log in; change password if first-time user
  • 2. Create project
  • 3. Grant project access to others
  • 4. Select project
  • 5. Fill in experimental info
  • 6. Upload image and data files (be careful not to
    reverse the order of files!!)
  • 7. Look at status page
  • 8. Close browser when finished, for security
  • 9. Once a species has been chosen for a project,
    you can only see the Array Print Sets for
    that species
  • 10. We suggest including the slide number
    scratched on the slide as part of the Experiment
    Name, which will act as a unique identifier
  • 11. Adjust JPEG for desired contrast/brightness

13
Create New Project
14
Adding Arrays
15
(No Transcript)
16
Project Access
17
Upload Status
18
Common Errors
  • Common Upload Errors
  • Using wrong GAL file
  • Loading GAL file or Set Up file in place of
    GenePix data (.gpr) file
  • Loading multi-image TIFF file instead of
    composite JPEG file
  • Loading Image File in place of data file and vice
    versa
  • Common GenePix Errors
  • Setting incorrect option for Analyze Absent
    Feature (box should be checked) results in
    truncated blocks
  • Deleted blocks
  • Improper gridding

19
Array Analysis Methods
  • Gene Discovery
  • Outlier detection: simple and group logic
    retrieval tools; single and multiple array
    viewers
  • Scatter plots
  • Pattern Discovery
  • Clustering: Hierarchical, K-means, SOMs
  • Multidimensional Scaling
  • Gene Shaving, Tree Harvesting, PCA, etc.
  • Pattern Prediction
  • 2-group t-test; others imminent

20
Default Definitions
  • Signal: refers to (Target Intensity -
    Background Intensity). More precisely, it is the
    MEAN Target Intensity - MEDIAN Background
    Intensity. MEAN-MEDIAN was chosen based on a
    publication by Mike Eisen at Stanford.
  • Normalization: By default, we use the ratio
    (Signal Cy5/Signal Cy3). Normalization is
    calculated so that median(Ratio) is 1.0.
    Spots with an extremely low signal are
    excluded from the calculation.
  • Spot size: for GenePix, Spot Size is the
    percentage of feature pixels more than 1 S.D.
    above background.
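The default definitions above can be sketched in a few lines of Python. This is only an illustration, not the mAdb implementation: the function names, the sample intensities, and the `min_signal` cutoff standing in for "extremely low signal" are all assumptions.

```python
from statistics import median

def signal(mean_target, median_background):
    """Signal = MEAN target intensity - MEDIAN background intensity."""
    return mean_target - median_background

def normalized_ratios(cy5_signals, cy3_signals, min_signal=100.0):
    """Scale Cy5/Cy3 ratios so that the median ratio is 1.0.

    min_signal is a hypothetical cutoff: spots below it in either
    channel are left out of the median, but are still rescaled.
    """
    raw = [r / g if g > 0 else None for r, g in zip(cy5_signals, cy3_signals)]
    usable = [
        ratio
        for ratio, r, g in zip(raw, cy5_signals, cy3_signals)
        if ratio is not None and r >= min_signal and g >= min_signal
    ]
    factor = median(usable)
    return [None if ratio is None else ratio / factor for ratio in raw]

# Made-up (mean target, median background) pairs for four spots.
cy5 = [signal(m, b) for m, b in [(1200, 200), (2300, 300), (4400, 400), (150, 120)]]
cy3 = [signal(m, b) for m, b in [(700, 200), (1300, 300), (1400, 400), (620, 120)]]
ratios = normalized_ratios(cy5, cy3)
print(ratios)
```

The last spot has a very low Cy5 signal, so it does not influence the median, yet it is still divided by the same factor as the other spots.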

21
Need for Normalization of Ratios
  • Unequal incorporation of labels (green better
    than red)
  • Unequal amounts of sample
  • Unequal PMT voltage

22
(No Transcript)
23
Whenever possible, use log spot intensities and
ratios
  • Why? Because it makes variation of intensities
    and ratios of intensities more independent of
    absolute magnitude.
  • Easier interpretation: negative numbers are
    downregulated genes; positive numbers are
    upregulated genes
  • Evens out highly skewed distributions
  • Gives a more realistic sense of variation
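A small sketch of why log ratios are easier to read. The base is an assumption (the slide does not name one; base 2 is used here so that one unit equals a two-fold change), and the ratio values are made up.

```python
import math

# Hypothetical normalized Cy5/Cy3 ratios for five genes.
ratios = [4.0, 2.0, 1.0, 0.5, 0.25]

# On the raw scale, 4-fold up (4.0) and 4-fold down (0.25) are not
# symmetric around "no change" (1.0); on the log2 scale they are.
log_ratios = [math.log2(r) for r in ratios]
print(log_ratios)
```

After the transform, upregulated genes are positive, downregulated genes are negative, and an unchanged gene sits at zero.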

24
  • mAdb Analysis Paradigm
  • Extract, Spot Filter, Normalize, and Align a Dataset
  • Apply Data/Gene criteria Filters
  • Apply appropriate Analysis/Visualization Tools
  • Retrieve Datasets/Results

25
Live Demo
26
(No Transcript)
27
(No Transcript)
28
New Correlation Summary Report
29
Simple Group Retrieval Tool
Applies spot filtering options to selected arrays
and creates a new working dataset.
show
30
Extended Dataset Extraction Tool (GenePix Arrays
Only)
do
31
  • Extended Tool: Signal, Normalization, and Ratio
    Options
  • Signal Calculation
  • Mean Intensity - Median Background
  • Median Intensity - Median Background
  • Normalization
  • None
  • 50th Percentile (Median)
  • Applied to extracted spots (spot filtered)
  • All spots or only Housekeeping spots (if
    designated)
  • Default Ratio
  • Chan B/Chan A (Cy5/Cy3)
  • Chan A/Chan B (Cy3/Cy5)

32
(No Transcript)
33
  • Spot Filter Options (Checkbox to Activate)
  • Exclude any Spots Flagged as
  • Target diameter is between
  • Target Pixels 1 Standard Deviation above
    background >
  • Signal/Background Ratio >
  • Signal >
  • Override if Chan B and/or A Signal >
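One way to picture how these spot filter options combine is the sketch below. Everything in it is an assumption made for illustration: the field names, the threshold values, the single-channel simplification, and in particular the reading that the override rescues a bright spot from the numeric thresholds but not from an exclusion flag. The slide does not spell out these details.

```python
from dataclasses import dataclass

@dataclass
class Spot:
    flagged: bool        # spot carries an exclusion flag
    diameter: float      # target diameter (units assumed)
    pct_above_bg: float  # % of target pixels > 1 S.D. above background
    sig_to_bg: float     # signal/background ratio
    signal: float        # background-subtracted signal

def keep_spot(spot, min_diam=50.0, max_diam=200.0, min_pct=55.0,
              min_sig_to_bg=2.0, min_signal=100.0, override_signal=5000.0):
    """Return True if the spot survives the (hypothetical) filter settings."""
    if spot.flagged:
        return False
    # Assumed override behaviour: a very bright spot skips the other checks.
    if spot.signal > override_signal:
        return True
    return (min_diam <= spot.diameter <= max_diam
            and spot.pct_above_bg > min_pct
            and spot.sig_to_bg > min_sig_to_bg
            and spot.signal > min_signal)

spots = [
    Spot(False, 100.0, 80.0, 3.5, 900.0),   # passes every threshold
    Spot(False, 100.0, 80.0, 1.2, 900.0),   # signal/background too low
    Spot(True, 100.0, 80.0, 3.5, 900.0),    # flagged
    Spot(False, 100.0, 40.0, 1.2, 8000.0),  # rescued by the override
]
kept = [keep_spot(s) for s in spots]
print(kept)
```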

34
  • Spot Filter Options (Checkbox to Activate)
  • Exclude any Spots Flagged as
  • Target diameter is between
  • Target Pixels 1 Standard Deviation above
    background >
  • Signal/Background Ratio >
  • Signal >
  • Override if Chan B and/or A Signal >

35
(No Transcript)
36
  • Dataset Properties (Checkbox to Activate)
  • Rows ordered by
  • Dataset Location
  • Transient (24 Hours after creation)
  • Temporary (30 Days after last access)
  • Permanent
  • Dataset Label (highly recommended)

37
Waiting for Data Extraction
Intermediate screen which monitors the data
extraction process. When the creation of the
working dataset is complete, the user can
continue to the Data Display page.
38
Data Display - Example
39
(No Transcript)
40
GeneCards™ Mirror Site
41
Interacting with data sets
42
  • Additional Data Filtering/Adjustment/Analysis
  • Additional Filtering Options (Data values)
  • Array Order Designation/Filtering
  • Array Group Assignment/Filtering
  • Two or more Group Comparison - statistics
  • Boolean Comparison with another Set
  • Clustering (Hierarchical, K-means, SOM)
  • Correlation Summary Report: pairwise scatter
    plots
  • Scatter Plot
  • Multi-Dimensional Scaling
  • Save as a New Dataset

43
Additional Data Filtering Options
Applies selected filtering options to the dataset
and creates a new subset.
44
(No Transcript)
45
Dataset History
A log is maintained for each dataset tracing the
analysis history. When the history is displayed,
links are provided to allow the user to recall
any dataset in the analysis chain.
46
Filtering hierarchy/tree structure
Original spot filtering
Original Dataset
Additional filtering
Data subsets
47
Accessing data sets
48
Boolean Comparison Summary
Clicking on the Logical Subset links creates a
new working dataset reflecting the Boolean
results.
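The logical subsets of a Boolean comparison can be illustrated with ordinary set operations. The gene identifiers and subset labels below are made up; the actual labels on the mAdb summary page may differ.

```python
# Two hypothetical datasets, each reduced to its set of gene identifiers.
dataset_a = {"TP53", "MYC", "EGFR", "BRCA1"}
dataset_b = {"MYC", "BRCA1", "KRAS"}

# Each entry corresponds to one logical subset that could become
# a new working dataset.
logical_subsets = {
    "A AND B": dataset_a & dataset_b,
    "A OR B": dataset_a | dataset_b,
    "A NOT B": dataset_a - dataset_b,
    "B NOT A": dataset_b - dataset_a,
}

for label, genes in logical_subsets.items():
    print(label, sorted(genes))
```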
49
Multiple Array Viewer
50
Designating groups
51
Two Group Statistical Comparison Options
52
T-test
  • The t-test assesses whether the means of two
    groups are statistically different from each
    other.
  • Once you compute the t-value you have to look it
    up in a table of significance to test whether the
    ratio is large enough to say that the difference
    between the groups is not likely to have been a
    chance finding. To test the significance, you
    need to set a risk level (called the alpha
    level). In most research, the "rule of thumb" is
    to set the alpha level at .05. This means that
    five times out of a hundred you would find a
    statistically significant difference between the
    means even if there was none (i.e., by "chance").
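As a worked illustration of the t-value, the sketch below computes a two-group t statistic for one gene. It assumes the classic pooled-variance (Student) form; the slides do not say which variant mAdb uses, and the log ratios are invented.

```python
from math import sqrt
from statistics import mean, variance

def two_group_t(group1, group2):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    t = (mean(group1) - mean(group2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Hypothetical log ratios for one gene in two groups of three arrays.
t_value, df = two_group_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])

# Two-sided critical value from a standard t table for alpha = .05, df = 4.
critical = 2.776
print(round(t_value, 3), df, abs(t_value) > critical)
```

Here the absolute t-value exceeds the table value, so the difference between the two group means would be called significant at the .05 level.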

53
(No Transcript)
54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
Clustering
  • Clustering programs make clusters even if the
    data are completely random; you must examine your
    clusters to see if they make biological sense
  • If clustered by genes, are the genes in certain
    clusters biologically related in function? In a
    pathway?
  • If clustered by array, do the clusters group
    related samples/tissues/diseases/treatments
    together logically?

59
What is Clustering?
  • Discovery algorithms function by using a
    bottom-up approach to explore new phenomena in the
    data. Rather than relying on previous knowledge
    to help sort the data sets, the data are allowed
    to determine their own sorting parameters running
    unsupervised (i.e., without human intervention).
  • Clustering is an example of a discovery
    algorithm. Given a collection of data points,
    clustering techniques find structure in the data
    and attempt to organize that data into meaningful
    groups.
  • A cluster is a set of data points (such as gene
    expression data) which are alike. The goal in
    sorting is for each cluster to be distinct from
    other clusters.
  • Clustering is an unsupervised machine learning
    technique
  • Cluster analysis is the formal study of
    algorithms and methods for clustering of data

60
Common clustering methods
Hierarchical Clustering allows you to visualize a
set of samples or genes by organizing them into a
mock-phylogenetic tree, often referred to as a
dendrogram. In these trees, samples or genes
with similar gene expression patterns are
clustered together.
K-means clustering divides genes into distinct
groups based on their expression patterns. Genes
are initially divided into a number (k) of
user-defined and equally-sized groups. Centroids
are calculated for each group corresponding to
the average of the expression profiles.
Individual genes are then reassigned to the group
in which the centroid is the most similar to the
gene. The process is iterated until the group
compositions converge.
Self-Organizing Maps (SOMs) are similar to
k-means clustering, but add an additional
feature: the resulting groups of genes can
be displayed in a rectangular pattern, with
adjacent groups being more similar than groups
further away. Self-Organizing Maps were invented
by Teuvo Kohonen and are used to analyze many
kinds of data.
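The k-means procedure described above (equal-sized starting groups, centroids, reassignment, repeat until the groups stop changing) can be sketched as follows. Squared Euclidean distance as the similarity measure, the round-robin starting assignment, and the toy expression profiles are assumptions; this is not the mAdb code.

```python
def centroid(profiles):
    """Average expression profile of a group."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, max_iter=100):
    # Initial assignment: deal genes into k roughly equal-sized groups.
    labels = [i % k for i in range(len(profiles))]
    centroids = [None] * k
    for _ in range(max_iter):
        for g in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == g]
            if members:  # an emptied group keeps its previous centroid
                centroids[g] = centroid(members)
        # Reassign every gene to the group with the most similar centroid.
        new_labels = [
            min(range(k), key=lambda g: sq_dist(p, centroids[g]))
            for p in profiles
        ]
        if new_labels == labels:  # group compositions have converged
            break
        labels = new_labels
    return labels, centroids

# Six made-up genes measured on two arrays: three low, three high.
profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1],
            [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

The starting groups mix low and high genes; after one round of reassignment the low genes and the high genes end up in separate groups and the next pass changes nothing.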
61
Example of Hierarchical Clustering (Alizadeh et
al., Nature, Feb. 2000)
62
Dendrogram Construction for Hierarchical
Agglomerative Clustering
  • Merge two closest (least distant) objects (genes
    or arrays)
  • Subsequent merges require specification of
    linkage to define distance between clusters
  • Average linkage
  • Complete linkage
  • Single linkage

63
Linkage Methods
  • Average Linkage
  • Merge clusters whose average distance between all
    pairs of items (one item from each cluster) is
    minimized
  • Particularly sensitive to distance metric
  • Complete Linkage
  • Merge clusters to minimize the maximum distance
    within any resulting cluster
  • Tends to produce compact clusters
  • Single Linkage
  • Merge clusters at minimum distance from one
    another
  • Prone to chaining and sensitive to noise
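The three linkage rules can be compared with a small agglomerative sketch. It assumes Euclidean distance between items and uses made-up one-dimensional points; the function names are illustrative only.

```python
from itertools import combinations
from math import dist

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under the chosen linkage."""
    pair_dists = [dist(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":
        return min(pair_dists)
    if method == "complete":
        return max(pair_dists)
    if method == "average":
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {method}")

def merge_heights(points, method):
    """Repeatedly merge the two closest clusters; return each merge distance."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        heights.append(linkage_distance(clusters[i], clusters[j], method))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return heights

points = [(0.0,), (1.0,), (5.0,), (7.0,)]
for method in ("single", "complete", "average"):
    print(method, merge_heights(points, method))
```

All three rules agree on the first two merges here; they differ in the height of the final merge, which is what changes the shape of the dendrogram.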

64
(Data from Bittner et al., Nature, 2000)
65
Common Distance Metrics for Hierarchical
Clustering
  • Euclidean distance
  • Measures absolute distance (square root of sum of
    squared differences)
  • 1-Correlation
  • Large values reflect lack of linear association
    (pattern dissimilarity)
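A short sketch of how the two metrics differ, using invented profiles: two genes with the same pattern but very different magnitudes are far apart in Euclidean terms yet have a 1-correlation distance of essentially zero. Pearson correlation is assumed for the correlation term.

```python
from math import sqrt

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def one_minus_correlation(x, y):
    """1 - Pearson correlation: 0 for identical patterns, 2 for opposite ones."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / sqrt(sxx * syy)

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same pattern, ten times the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # opposite pattern, same magnitude

print(euclidean(gene_a, gene_b), one_minus_correlation(gene_a, gene_b))
print(euclidean(gene_a, gene_c), one_minus_correlation(gene_a, gene_c))
```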

66
Server-side Hierarchical Clustering
67
Clustering In Progress Page
68
Hierarchical Clustering Output
69
Expanded Thumbnail Image
70
Tree View for PostScript output or very large
files
71
Multidimensional Scaling
  • Represents data points from a high-dimensional
    space in a lower-dimensional space
  • Example: Represent a tumor's 5,000-dimensional
    gene profile as a point in 3-dimensional space
  • Typically based on principal components or uses
    optimization methods that select
    lower-dimensional coordinates to best match
    pairwise distances in higher-dimensional space
  • Depends only on pairwise distances (Euclidean,
    1-correlation, . . .) between points
  • All distances in lower dimensional space must be
    viewed in a relative sense
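A full MDS solver is beyond a short example, but the quantity an optimization-based method tries to make small can be sketched: a "stress" value that measures how badly the low-dimensional pairwise distances match the high-dimensional ones. The normalization used below is one common choice and is an assumption, as are the toy points and the function name.

```python
from itertools import combinations
from math import dist, sqrt

def stress(high_dim, low_dim):
    """Normalized mismatch between pairwise distances in the two spaces."""
    pairs = list(combinations(range(len(high_dim)), 2))
    mismatch = sum(
        (dist(high_dim[i], high_dim[j]) - dist(low_dim[i], low_dim[j])) ** 2
        for i, j in pairs
    )
    scale = sum(dist(high_dim[i], high_dim[j]) ** 2 for i, j in pairs)
    return sqrt(mismatch / scale)

# Three made-up samples in 3 dimensions that happen to lie on a line.
profiles = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]

good_layout = [(0.0,), (sqrt(3),), (2 * sqrt(3),)]  # preserves all distances
bad_layout = [(0.0,), (1.0,), (5.0,)]               # distorts them

print(stress(profiles, good_layout), stress(profiles, bad_layout))
```

Only pairwise distances enter the calculation, which is why the resulting coordinates can be shifted or rotated freely and should be read in a relative sense.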

72
(No Transcript)
73
Tips for array analysis
  • Do you really want to upload that ugly array?
  • Look at Project Summaries: the normalization
    factor for a good array should be between 0.5
    and 2.0.
  • If you have replicate arrays, do a scatter plot
    to determine the correlation between the arrays
    (i.e., how close the slope is to 1; for reverse
    fluors, how close to 1), just for QC purposes.
  • Filters: former ATC director Lance Miller sets
    Signal/Background to 2.0 and sets Signal to zero
    for both Channel A and B.
  • Select HTML as the Results Format first to
    preview what the results look like. Limit Preview
    applies only to the preview; results are not
    limited if you choose to export to another format.
  • Turning Show Spot Images off displays results
    faster.
  • In clustering, you can cluster genes and/or
    arrays.
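The two QC checks in the tips above (normalization factor in range, replicate arrays agreeing) can be sketched like this. The 0.5 to 2.0 range comes from the slide; the function names, the least-squares slope, and the replicate log ratios are assumptions for illustration.

```python
from math import sqrt

def normalization_ok(factor, low=0.5, high=2.0):
    """True if the array's normalization factor is in the suggested range."""
    return low <= factor <= high

def slope_and_correlation(x, y):
    """Least-squares slope of y on x, and the Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / sqrt(sxx * syy)

# Made-up log ratios for the same four genes on two replicate arrays.
replicate_1 = [-1.0, 0.0, 1.0, 2.0]
replicate_2 = [-0.9, 0.1, 1.1, 1.9]

slope, r = slope_and_correlation(replicate_1, replicate_2)
print(normalization_ok(1.3), normalization_ok(3.0))
print(round(slope, 3), round(r, 4))
```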

74
  • Coming in the NEAR Future
  • Excel/Clustering support for additional row
    parameters
  • Additional Statistical Analysis (ANOVA, ...)
  • Additional Filtering based on Gene annotations
  • Extensive Ad Hoc Query Applied to Datasets
  • Gene Set Creation/Filtering from Gene Ontology
    (GO)
  • Graphical Viewers (for both Macintosh and PC)
  • Full support for Arraysuite II
  • Ability to average repeats/RF repeats

75
  • Coming in the Future
  • Support for Affymetrix Data
  • Shared Analysis/Dataset Areas
  • Partek Datafile package retrieval
  • MIAME/GEML compliance/support

76
  • Going in the NEAR Future (from the Gateway Tools
    level)
  • 1 or 2 Group Logic Retrieval Tool
  • To be moved down to dataset level
  • Scatter Plot Tool
  • Ad Hoc Query Tool

77
Older Tools
78
(No Transcript)
79
(No Transcript)
80
Ad Hoc Query Tool
81
Ad Hoc Query Output
82
http://madb.nci.nih.gov
For assistance, remember madb-support@bimas.cit.nih.gov