Title: Persistent Systems Pvt. Ltd. http://www.persistent.co.in
1Gene Expression Analysis Using Microarrays
- Dr Mushtaq Ahmed
- Technology Incubation Division
- Persistent Systems Private Ltd
- Pune
2Topics
- Introduction
- Data Storage and Exchange Standards
- Analysis (Clustering)
- Conclusion and References
31. Introduction
- Structure Activity Relationship
- Structural vs. Functional Genomics
- Principles of a Microarray Experiment
- Applications
4Structure Activity Relationship
GENES (finite)
EXPERIMENTAL SETUP
Functional Genomics OR Confirmation Work
Structural Genomics OR Prediction Work
FUNCTIONS (infinite)
PROTEINS
5Source: Yale Bioinformatics
6Principles of a Microarray Experiment: Hybridization
- Environment → Functions → Proteins → mRNA → cDNA
- Different incubations of cells result in up- or down-regulation of different sets of genes.
- A microarray provides a medium for matching known and unknown DNA samples based on base-pairing rules, automating the process of identifying the unknowns.
- The set of expressed genes (at the mRNA stage) is isolated and identified using hybridization on a microarray chip.
7HTS Using Hybridization
Microarray Chip
Target cDNA (variables to be detected)
Probe oligos/cDNA (gene templates)
Samples
Hybridization
Analysis of outcome
Pathways
Functional Annotation
Targets/Leads
Disease Class.
Physiological states
8Timeline for drug discovery
- Discovery (5 yrs): 5000 (gene expression study)
- Pre-Clinical (1 yr): 50
- Clinical (6 yrs): 5
- Review (2 yrs): 1 marketed
92. Data Storage and Exchange Standards
- Raw and Processed Data
- Conceptual View of Database
- Example of ArrayExpress
- Issues
- Standardization for Exchange
10Raw data images
- Red (Cy5) dot
- overexpressed or up-regulated
- Green (Cy3) dot
- underexpressed or down-regulated
- Yellow dot
- equally expressed
- Intensity: absolute level
- red/green: ratio of expression
- 2: 2x overexpressed
- 0.5: 2x underexpressed
- log2(red/green): log ratio
- 1: 2x overexpressed
- -1: 2x underexpressed
cDNA plotted microarray
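The ratio and log-ratio conventions on this slide can be checked with a short sketch. The function name `log_ratio` and the spot intensities are illustrative assumptions, not values from a real scan.

```python
import math

def log_ratio(red, green):
    """Return log2(red/green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical (Cy5, Cy3) intensities for three spots.
spots = {
    "gene_a": (2000.0, 1000.0),  # red dot: 2x overexpressed
    "gene_b": (500.0, 1000.0),   # green dot: 2x underexpressed
    "gene_c": (800.0, 800.0),    # yellow dot: equally expressed
}

for name, (red, green) in spots.items():
    print(name, red / green, log_ratio(red, green))
# gene_a 2.0 1.0
# gene_b 0.5 -1.0
# gene_c 1.0 0.0
```

The log ratio places over- and under-expression symmetrically around 0, whereas the plain ratio squeezes all under-expression into the interval between 0 and 1.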
11 Microarray Expression Value Representation
expression value types
composite spots
primary measurements
derived values
primary spots
composite images e.g., green/red ratios
primary images
Source: MGED
12Gene expression database a conceptual view
Samples
Gene expression matrix
Genes
Gene expression levels
14DAG Representation of Biomaterials
Source: MGED
15ArrayExpress (MGED) Design
Source: MGED
16ArrayExpress (MGED) Architecture
application server
Web server
MAML data
ArrayExpress
data warehouse
data submission Curation database
image server?
Curation pipeline
Source: MGED
17Issues in Storage
- Size of Data
- Experiments
- 100 000 genes, 320 cell types
- 2000 compounds, 3 time points, 2 concentrations, 2 replicates
- Data
- 8 x 10^11 data-points
- 1 x 10^15 bytes = 1 petabyte of data
- Others
- Raw data are images
- Lack of standard measurement units for gene expression
- Lack of standards for sample annotation
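The data-point estimate above is the product of the experiment dimensions, which a few lines of arithmetic confirm. The per-data-point figure at the end is only what the slide's two totals imply; the slide does not say how the petabyte is made up, so treat that number as an inference.

```python
# Experiment dimensions quoted on the slide.
genes, cell_types = 100_000, 320
compounds, time_points, concentrations, replicates = 2000, 3, 2, 2

data_points = (genes * cell_types * compounds
               * time_points * concentrations * replicates)
print(f"{data_points:.2e}")  # 7.68e+11, i.e. roughly 8 x 10^11

# Storage per data point implied by the slide's 1 x 10^15 byte total
# (an inference from the two quoted totals, not a stated figure).
total_bytes = 1e15
bytes_per_point = total_bytes / data_points
print(round(bytes_per_point))  # 1302
```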
18Standardization
- MIAME (Minimum Information About a Microarray Experiment)
- Experimental design, Array design
- Samples, Hybridisations
- Measurements, Controls
- OMG-LSR-DTF
- Life Sciences Research Domain Task Force, Gene Expression RFP
- EBI (MAML), Rosetta (GEML), NetGenics submitters
- Proposed MAGE-ML (MAML + GEML)
- Annotations and data: data stored as a set of external 2D matrices
- Data format independent of particular scanner or image analysis software
- Sample and treatment can be represented as Directed Acyclic Graphs
- Concept of composite images and composite spots
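As a rough illustration of the Directed Acyclic Graph idea for samples and treatments, the sketch below stores each biomaterial with the materials it was derived from. The node names and the `ancestors` helper are hypothetical and are not taken from MAGE-ML or ArrayExpress; only `graphlib.TopologicalSorter` is a standard-library class (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical biomaterial DAG: node -> set of nodes it is derived from.
derived_from = {
    "cell_culture": set(),
    "treated_cells": {"cell_culture"},
    "control_cells": {"cell_culture"},
    "treated_extract": {"treated_cells"},
    "control_extract": {"control_cells"},
    "hybridization": {"treated_extract", "control_extract"},
}

def ancestors(node, graph):
    """Return every material that `node` was directly or indirectly derived from."""
    seen = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen

# static_order() raises graphlib.CycleError if the graph is not acyclic,
# so it doubles as a validity check for the sample history.
order = list(TopologicalSorter(derived_from).static_order())
print(order[0], "...", order[-1])
print(sorted(ancestors("hybridization", derived_from)))
```

A graph rather than a tree is needed because one hybridization draws on two extracts that share a common source.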
193. Data Analysis (Clustering)
- Normalization
- Hierarchical Clustering
- Divisive Clustering
- Other Methods
- Visual Tools
20Normalization
- Assumptions
- Average expression ratio = 1
- Amount of mRNA from both samples is the same
- Total Intensity
- Calculate a factor to rescale the intensities of all the genes so that total Cy3 = total Cy5
- Regression Techniques
- Adjust the intensities so that the slope of the scatter plot of Cy3 vs Cy5 = 1
- Using ratio statistics
- Based on the expression of housekeeping genes, a probability density of the ratio is developed, which is used for normalization
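The first two normalization approaches can be sketched in a few lines. The function names and intensities are illustrative assumptions, and the regression here fits a line through the origin, which is one simple choice rather than the method any particular package uses. The ratio-statistics approach is not shown.

```python
def total_intensity_factor(cy3, cy5):
    """Factor that rescales Cy5 so that total Cy5 equals total Cy3."""
    return sum(cy3) / sum(cy5)

def regression_slope(cy3, cy5):
    """Least-squares slope of Cy5 against Cy3 for a line through the origin."""
    return sum(x * y for x, y in zip(cy3, cy5)) / sum(x * x for x in cy3)

# Hypothetical intensities: the Cy5 channel is systematically 1.5x brighter.
cy3 = [100.0, 200.0, 400.0, 800.0]
cy5 = [150.0, 300.0, 600.0, 1200.0]

# Total intensity normalization: make the channel totals equal.
factor = total_intensity_factor(cy3, cy5)
cy5_total = [v * factor for v in cy5]

# Regression normalization: divide by the fitted slope so the slope becomes 1.
slope = regression_slope(cy3, cy5)
cy5_regr = [v / slope for v in cy5]

print(sum(cy3), sum(cy5_total))
print(slope, regression_slope(cy3, cy5_regr))
```

With a purely multiplicative bias, as in this toy data, both methods recover the same rescaling; they differ once the bias varies with intensity.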
22Clustering
- Hierarchical
- Single, Complete and Average Linkage
- Divisive
- K-means
- Self Organizing Maps (SOM)
- Others
- Principal Component Analysis (PCA)
- Supervised Methods
23Hierarchical clustering
- Distance metrics or similarity measures
- Euclidean, Pearson, distance of slopes, etc.
- Cost functions
- Single Linkage
- Min distance between any two members (one from each of the two clusters)
- Complete Linkage
- Max distance between any two members (one from each of the two clusters)
- Average Linkage
- UPGMA
- WPGMA
- Within Groups
- Ward's Method
- Join that produces the smallest possible increase in the sum of squared errors
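The distance measures and the single, complete and average linkage rules above can be combined into a naive agglomerative procedure. This is a minimal sketch under stated assumptions: the function names and expression profiles are hypothetical, the average rule is the unweighted mean over all cross-cluster pairs (UPGMA), and Ward's method is not implemented.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shapes, 2 for opposite ones."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (sd_a * sd_b)

# Linkage rule applied to all distances between members of two clusters.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}

def cluster_distance(c1, c2, profiles, dist, linkage):
    ds = [dist(profiles[i], profiles[j]) for i in c1 for j in c2]
    return LINKAGE[linkage](ds)

def agglomerate(profiles, n_clusters, dist=euclidean, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(
                clusters[p[0]], clusters[p[1]], profiles, dist, linkage),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters

# Hypothetical profiles: genes 0 and 1 rise over time, genes 2 and 3 fall.
profiles = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.2], [3.0, 2.0, 1.0], [3.2, 2.1, 0.9]]
clusters = agglomerate(profiles, 2)
print(clusters)
```

Recording the order and height of each merge instead of stopping at `n_clusters` would give the usual dendrogram; the search over all pairs at every step makes this version suitable only for small examples.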
25Divisive clustering
- K-means
- k random (or specified) points are used to create clusters; the average vectors of the clusters are then used iteratively
- Knowledge of the probable number of clusters (k) is needed
- Used in combination with PCA and hierarchical clustering
- Self-Organizing Maps
- User-defined geometric configurations as partitions
- Random vectors are generated for each partition and trained until convergence (ANN-based)
- Visualization Methods
- Help in cluster visualization
- Scatter plot, web plot, histogram
- May help in clustering itself
- E.g., SuperGrouper utility of maxdView
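The K-means loop described above (assign each point to the nearest centre, then replace each centre by the average vector of its cluster) can be sketched as follows. The function name, the starting centres and the data are illustrative assumptions; the starting centres are specified rather than random so that the result is reproducible.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain K-means starting from the given centres (k = len(centers))."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre for every point.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each centre becomes the average vector of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[k])  # keep an empty cluster's centre
        if new_centers == centers:  # converged: centres stopped moving
            break
        centers = new_centers
    return labels, centers

# Hypothetical two-dimensional expression vectors and k = 2 specified centres.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```

The need to pass in the starting centres reflects the point made on the slide: k has to be known or guessed in advance, for example from a PCA plot or a hierarchical clustering.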
27Other Clustering Methods
- PCA (Principal Component Analysis)
- Closely related to SVD (Singular Value Decomposition), which is commonly used to compute it
- Reduces dimensionality of gene expression space
- Finds the best view that helps separate data into groups
- Supervised Methods
- SVM (Support Vector Machine)
- Previous knowledge of which genes are expected to cluster is used for training
- Binary classifier uses a feature space and a kernel function to define an optimal hyperplane
- Also used for classification of samples: expression fingerprinting for disease classification
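To make the dimensionality reduction step concrete, the sketch below finds only the first principal component by power iteration on the covariance matrix. This is a simplification chosen to avoid external libraries, not the SVD route that numerical packages normally take, and the function name, start vector and data are assumptions. The SVM is not sketched here.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) for the first principal component.

    rows: one expression profile per row (observations x variables).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration from an arbitrary start vector (an assumption: it must
    # not be orthogonal to the leading component).
    v = [1.0 + j for j in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variance along the start vector")
        v = [x / norm for x in w]
    # Projection of every centred observation onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical data: 4 samples x 3 genes, forming two visibly different groups.
samples = [[1.0, 2.0, 0.5], [1.2, 2.1, 0.4], [5.0, 6.0, 0.6], [5.1, 6.2, 0.5]]
direction, scores = first_principal_component(samples)
print([round(x, 3) for x in direction])
print([round(s, 3) for s in scores])
```

The three-gene profiles collapse to a single score per sample, and the sign of that score already separates the two groups, which is the kind of view the slide refers to.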
294. Conclusion and References
- Microarrays make HTS with hybridization possible
- No single standard unit for measuring expression levels
- Handling and interpretation not yet exact
- Assumption: elements in a cluster must share some commonality
- Classification depends on the method used for clustering, normalization, and the distance function
- No single correct way of classification; biological understanding is the ultimate guide
- Provides extension to existing knowledge (e.g., classifying a novel gene into a known pathway)
30Software
- Databases
- Public repositories
- GEO (NCBI), GeneX (NCGR), ArrayExpress (EBI)
- In-house databases
- Stanford, MIT, University of Pennsylvania
- Organism-specific databases
- Mouse Genome Informatics Database
- Proprietary databases
- Gene Logic, NCI, Synergy (NetGenics), Genomics Knowledge Platform (Incyte)
- Analysis Tools
- Public Domain
- maxdView (University of Manchester)
- CyberT, RCluster interfaces of GeneX
- Proprietary
- Spotfire, Xpression NTI (InforMax, Inc.)
31References
- Microarray Gene Expression Database Group
- http://www.mged.org
- National Center for Genome Resources
- http://genex.ncgr.org
- University of Manchester, Bioinformatics Group
- http://bioinf.man.ac.uk/microarray/resources.html
- Nature Reviews Genetics
- http://www.nature.com/nrg/