Title: Introduction to Microarray Analysis
1Introduction to Microarray Analysis
- Uma Chandran PhD, MSIS
- Department of Biomedical Informatics
- chandranur_at_upmc.edu
- 412-623-7841
- 12/17/08
2My Background
- Bioinformatics Analysis Service
- UPCI
- Department of Biomedical Informatics
- Clinical Genomics Facility
- Runs expression, SNP and microRNA microarrays
- Bioinformatics tightly integrated with data analysis
- Expression, SNP, proteomic, integration of proteomic and genomic data
3Workshop Objectives
- Introduction to microarray analysis
- Understand general principles
- BRB Array Tools from NCI
- HSLS also offers ArrayAssist, GeneSpring GX
- Not an advanced analysis course; those are offered through DBMI and Biostatistics
- Not a statistics course
- Will discuss some statistical issues
- Should consult the literature and a statistician to understand methods in detail
4What is a microarray
- Probes on chips
- Detect target RNA in samples
- High throughput
- 10000s of specific probes
- Measure global gene expression
- Glass beads, chips, slides
6Bioinformatic approaches for analysis
- Measuring 10000s of data points simultaneously
- High dimensional data
- 10 experiments x 50K probes = 500K data points
- How to find real differences over the noise
- Statistical approaches
7Bioinformatic approaches for analysis
- Class Comparison
- Which genes are up or down in tumors v normal, untreated v treated
- Class Discovery
- Within the tumor samples, are there subgroups that have a specific expression profile?
- Class prediction, pathway analysis etc
8Challenges in microarray analysis
- Different platforms
- Illumina, Affymetrix, Agilent
- Many file types, many data formats
- Need to learn platform-dependent methods and the software required
- Analysis
- How to get started?
- Which methods? Which software? Many freely available tools, some commercial
- How to interpret results
9Public databases
- Many sources for public data: labs, consortia, government
- Publications require that data files, including raw files, be made public
- GEO: http://www.ncbi.nlm.nih.gov/geo/
- ArrayExpress: http://www.ebi.ac.uk/arrayexpress/#ae-main[0]
10What tools to use
- GeneSpring GX (HSLS)
- Today's exercise uses BRB ArrayTools from NCI
- Excel interface
- First install the R statistical package and Bioconductor packages
- Fairly easy to use if you don't have access to commercial tools
- Analysis is robust; display and graphics are minimal
- Learn concepts using BRB
11GEO data for Exercise
- Raetz et al
- Characterization of T-ALL, T-LL, B-ALL
- T-ALL and T-LL are morphologically indistinguishable
- Are there expression differences?
- Class comparison, class discovery, prediction
12Files
- C:\Desktops\cmcclass\HSLSclass\
- Treeview, Cluster
- Files may also be under C:\
13Hands on 1
- Google GEO
- Query Raetz et al
- Open cel files for class
- .cel files
- Affymetrix files
- Affymetrix output includes many file types, such as .dat, .cel and .chp
- Need Affy software to open these files
- Freely downloadable
14Microarray analysis: Data Preprocessing
- Objective
- Convert an image of thousands of signals to a signal value for each gene or probe set
- Multiple steps
- Image analysis
- Background and noise subtraction
- Normalization
- Expression value for a gene or probe set
- Image analysis and background/noise subtraction are usually done by proprietary software
Gene 1: 100
Gene 2: 150
Gene 3: 75
...
Gene 10000: 500
15Normalization
- Corrects for variation in hybridization etc
- Assumes no global change in gene expression
- Without normalization
- Intensity value for a gene will be lower on Chip B
- Many genes will appear to be downregulated when in reality they are not
Gene         Treated   Control
Gene 1       100       50
Gene 2       150       75
Gene 3       75        32
...
Gene 10000   500       250
16How to normalize?
- Many methods
- Affy MAS5.0
- Median scaling: median intensity for all chips should be the same
- Known genes: housekeeping, invariant genes
- Quantile (RMA)
- Normalization method may differ depending on platform
- Illumina: cubic spline
- Affymetrix
- Choose method
- .cel to .chp file
- Which method to choose?
- Know the biology
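To make median scaling concrete, here is a minimal Python sketch. It is only an illustration of the idea, not what MAS5.0 or BRB does internally. The chip names and the choice of the average chip median as the common target are assumptions; the intensities are the four example values from the normalization slide.

```python
from statistics import median


def median_scale(chips):
    """Rescale every chip so that all chips share the same median intensity.

    `chips` maps a chip name to its list of intensities. The common target
    is the average of the per-chip medians (an assumption for this sketch).
    """
    medians = {name: median(values) for name, values in chips.items()}
    target = sum(medians.values()) / len(medians)
    return {
        name: [v * target / medians[name] for v in values]
        for name, values in chips.items()
    }


# Hypothetical chip labels; values taken from the slide's example table.
chips = {
    "chip_a": [100.0, 150.0, 75.0, 500.0],
    "chip_b": [50.0, 75.0, 32.0, 250.0],
}
scaled = median_scale(chips)

for name, values in scaled.items():
    print(name, [round(v, 1) for v in values], "median:", round(median(values), 2))
```

After scaling, both chips have the same median, so a gene no longer looks downregulated simply because one chip hybridized less brightly overall.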
18BRB Array tools
- Website
- Excel plug-in that uses R and Fortran
- Import, choose correct format
- From .cel files
- Process using GCRMA or MAS5.0
- Or directly from processed files
- Attaches annotation
- Create experiment labels
19Illumina Format
20Hands on 2
- Open Excel
- Click on Array Tools
- Look at data import options
- Wizard
- General format: Affy, non-Affy
- Affy data
- Import .cel files or already normalized files
- Various normalization options
- Clicking OK will import data; don't do this now because it is time and memory intensive
- Already normalized files: look at MAS5.0.txt
- For the next step, we will work with data that has already been imported into BRB
21Hands on continued
- Look at a normalized file: MAS5.0.txt
- Open this file in Excel
- Absent/Present calls are unique to Affy
- This already normalized file can also be imported into BRB
22Quality control
- Will not go into detail here because QC is platform specific
- Read the literature for your platform
- Open the file labeled MAS5.0 report (folder: cel file to import)
- Scale factor
- P calls
- Should be at least 40%
- If RNA quality is poor, there are fewer present calls
- Control probes
- GAPDH has 3', middle and 5' probes
- Ratio of 3'/5'
- Other spike-in probes
- For cDNA to cRNA synthesis
- For hybridization
- Background/noise
- All platforms will have control probes and quality metrics
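As a small illustration of the present-call check, the following Python sketch computes the percentage of Present calls on one chip and compares it with the "at least 40" guideline from the slide, assuming that figure is a percentage. The list of detection calls is hypothetical; in practice the calls would come from the MAS5.0 output.

```python
def percent_present(calls):
    """Return the percentage of probe sets called Present ('P')."""
    if not calls:
        raise ValueError("no detection calls supplied")
    present = sum(1 for call in calls if call == "P")
    return 100.0 * present / len(calls)


# Hypothetical detection calls: P = present, M = marginal, A = absent.
calls = ["P", "P", "A", "P", "M", "A", "P", "P", "A", "P"]

pct = percent_present(calls)
passes_qc = pct >= 40.0  # guideline from the slide, read as 40 percent
print(f"{pct:.1f}% present, passes guideline: {passes_qc}")
```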
23BRB import
- Spot filters
- For Affy array, check off
- For other arrays, could exclude if negative
- Set to threshold
- Normalization
- If importing already normalized data, as in MAS5.0, check off
- If RMA, already normalized, check off
- For other methods that do not automatically normalize, choose a method here
- Gene Filters
- For now, leave 20
- Rest, check off
24Hands on
- Go to BRB analysis folder
- BRB analysis\Project\Raetz.xls file
- This file shows what an imported project looks like
- Experiment Descriptor is the class label for each experiment
25Data Analysis
- Part 2: Data analysis
- Class discovery
- Class comparison
- Class prediction
- Biological annotation
- Pathway analysis
26Class Discovery
- Objective?
- Can data tell us which classes are similar?
- Are there subgroups?
- Do T-ALL, T-LL, B-ALL fall into distinct groups?
- Methods
- Hierarchical clustering
- K-means, SOM etc
- These are Unsupervised Methods
- Class IDs are not known to the algorithm
- For example, it does not know which sample is cancer or non-cancer
- Do the expression values differentiate the classes? Does the method discover new classes?
27Hands on Class discovery
- Multidimensional scaling in BRB
- Raetz.xls
- Choose defaults in BRB
- Eisen's Cluster
- Filter
- Accept
- Adjust
- Cluster
- Different clustering metrics will give different results
- Not a very robust method but very popular
- Use as an exploratory tool
28Multidimensional scaling - MDS
29Hands on - Hierarchical Clustering
- Eisen Cluster and Treeview
- Import data
- Filter
- Filter or not to filter, P calls, SD etc
- Accept filter
- Adjust data
- Log transform (important), center, normalize
- Clustering
- Cluster array or genes
- Clustering genes is computationally intensive
- Choose distance metric
- .cdt file created
- Open with Treeview
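The "adjust data" and "choose distance metric" steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Cluster program's implementation: the expression values are made up, centering uses the per-gene median, and the distance is 1 minus the Pearson correlation between arrays.

```python
import math
from itertools import combinations
from statistics import mean, median


def log2_center(matrix):
    """Log2-transform a genes x arrays matrix and median-center each gene."""
    adjusted = []
    for row in matrix:
        logged = [math.log2(v) for v in row]
        mid = median(logged)
        adjusted.append([x - mid for x in logged])
    return adjusted


def pearson_distance(x, y):
    """Distance between two profiles, defined as 1 - Pearson correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


# Hypothetical intensities: 4 genes (rows) x 4 arrays (columns).
data = [
    [100.0, 120.0, 400.0, 380.0],
    [800.0, 760.0, 200.0, 220.0],
    [50.0, 55.0, 60.0, 52.0],
    [300.0, 310.0, 90.0, 100.0],
]
adjusted = log2_center(data)
arrays = [list(column) for column in zip(*adjusted)]  # one profile per array

distances = {
    (i, j): pearson_distance(arrays[i], arrays[j])
    for i, j in combinations(range(len(arrays)), 2)
}
closest_pair = min(distances, key=distances.get)
print("closest arrays:", closest_pair)
```

Hierarchical clustering starts by joining the closest pair and repeats on the merged groups. Swapping in a different distance function here is one way to see why different metrics can give different trees.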
30Class comparison: differential expression analysis
- What genes are up regulated between control and test or multiple test conditions
- Normal v tumor
- Treated v untreated
- Fold change
- Not sufficient, need statistics
- Statistics
- t test, non-parametric tests, FDR
31Class comparison
- Many analysis methods
- May produce different results
- Different underlying statistics and methods
- t test
- t test with permutations
- SAM
- Empirical Bayesian
- Depends on underlying assumptions about data
- High throughput data with many rows and few samples
- What is the distribution
- Variance from gene to gene
- Save raw data files to try different methods and compare results
32Fold change does not take variation into account
Modified from madB
http://nciarray.nci.nih.gov/
33Hypothesis Testing
(Figure: Normal and Tumor groups with means mean1 and mean2 and their difference d; null hypothesis v alternative hypotheses)
34Statistical power
- t test
- Test hypothesis that the two means are not statistically different
- Adding confidence to the fold change value
- Mean
- Standard deviation
- Sample size
- Calculates statistic
- You choose cutoff or threshold
- Give me gene list at a cutoff of p < 0.05
- 95% confidence that the means for that gene in control and treated are different
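The following Python sketch shows how mean, standard deviation and sample size combine into a t statistic, and why fold change alone is not enough. It uses the unequal-variance (Welch) form of the t statistic, which is one common choice and not necessarily the one BRB applies. The log2 expression values are invented for illustration, and converting t to a p value is left to a statistics package.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """t statistic for two groups without assuming equal variances."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_b) - mean(group_a)) / standard_error


# Hypothetical log2 expression values; both genes differ by 1.0 (2-fold).
tight_control, tight_treated = [7.0, 7.1, 6.9, 7.0], [8.0, 8.1, 7.9, 8.0]
noisy_control, noisy_treated = [5.0, 9.0, 6.0, 8.0], [6.0, 10.0, 7.0, 9.0]

t_tight = welch_t(tight_control, tight_treated)
t_noisy = welch_t(noisy_control, noisy_treated)
print(f"low-variance gene:  t = {t_tight:.2f}")
print(f"high-variance gene: t = {t_noisy:.2f}")
```

Both genes show the same fold change, but only the low-variance gene gives a large t statistic. More replicates shrink the standard error, which is why sample size matters.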
35Experimental Design: Very important!!!
- Sample size
- How many samples in test and control
- Will depend on many factors such as whether tissue culture or tissue sample
- Power analysis
- Replicates
- Technical v biological
- Biological replicates are more important for more heterogeneous samples
- Need replicates for statistical analysis
- To pool or not to pool
- Depends on objective
- Sample acquisition or extraction
- Laser captured or gross dissected
- All experimental steps from sample acquisition to hybridization
- Microarray experiments are very expensive, so plan experiments carefully
36t tests
- Results might look like
- At p < 0.05, there are 300 genes up and 200 genes downregulated
- 95% confidence that the means of these genes in the two groups are different
- At p < 0.05, x genes up and y genes down with a fold change of at least 3.0
37Multiple comparison
- Microarrays have a multiple comparison problem
- p < 0.05 says 95% confidence that the means are different, therefore 5% due to chance
- 5% of 10000 is 500
- 500 genes are picked up by chance
- Suppose t tests select 1000 genes at a p of 0.05
- 500/1000: approximately 50% of the genes will be false
- Very high false discovery rate; need more confidence
- How to correct?
- Correction for multiple comparison
- p value and a corrected p value
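The arithmetic on this slide can be written out as a short Python calculation. It follows the slide's simplification that every one of the 10000 genes could be picked up by chance at the chosen cutoff; the variable names are illustrative.

```python
n_genes = 10000      # genes tested on the array
p_cutoff = 0.05      # uncorrected p value cutoff
n_selected = 1000    # genes the t tests report at that cutoff

# Genes expected to pass the cutoff by chance alone (slide's simplification).
expected_by_chance = n_genes * p_cutoff

# Share of the reported gene list that may be false discoveries.
approx_false_fraction = expected_by_chance / n_selected

print(f"expected by chance: about {expected_by_chance:.0f} genes")
print(f"approximate false fraction: {approx_false_fraction:.0%}")
```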
38Corrections for multiple comparisons
- Involve corrections to the p value so that the adjusted p value is higher
- Bonferroni
- Benjamini-Hochberg
- Significance Analysis of Microarrays
- Tusher et al. at Stanford
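Two of these corrections are simple enough to sketch in Python. The code below is a minimal illustration of the standard Bonferroni and Benjamini-Hochberg adjustments, not BRB's or SAM's implementation, and the eight p values are made up for the example.

```python
def bonferroni(pvalues):
    """Multiply each p value by the number of tests, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p values for eight genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

n_raw = sum(p < 0.05 for p in raw)
n_bonferroni = sum(p < 0.05 for p in bonferroni(raw))
n_bh = sum(p < 0.05 for p in benjamini_hochberg(raw))
print("genes passing 0.05:", n_raw, "raw,", n_bonferroni, "Bonferroni,", n_bh, "BH")
```

Bonferroni is the most stringent of the two; Benjamini-Hochberg controls the false discovery rate and usually keeps more genes.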
39Hands on BRB
- Class comparison
- Choose comparison
- Which tests are available?
- P value cutoff
- How is multiple testing correction being done?
- Stringent p value, FDR
- How is the output reported?
- Can you figure out how many genes are regulated at different p values and different cutoffs?
- How to interpret results
- Look at gene lists generated by our analysis v those generated in the paper
40BRB Hands on
- Check Experiment desc file
- Set up Class Comparison
- T-ALL v T-LL
- Choose p value
- Random variance
- Options
- Save file
- Run
41BRB Class Comparison
- Output folder
- Check the .html file
- Look at results
- P value
- Fold change
- Annotation
- Click on annotation
- Cut and paste, then save into Excel
42Many studies, many methods
Dupuy and Simon, JNCI 2007
43How to manipulate Gene lists
- Create gene lists
- Venn Diagram
- Can be done even though study done on different platforms
- Compare MAS and RMA
- Venn Diagram
- Compare B-ALL v T-LL and T-LL v B-ALL
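The comparison behind a Venn diagram is a set operation, which the Python sketch below illustrates. The gene identifiers and the two list names are hypothetical placeholders; in practice the lists would be the gene lists saved from two class comparison runs, for example one on MAS5.0 data and one on RMA data.

```python
def venn_counts(list_a, list_b):
    """Split two gene lists into the three regions of a two-set Venn diagram."""
    a, b = set(list_a), set(list_b)
    return {
        "only_a": sorted(a - b),
        "both": sorted(a & b),
        "only_b": sorted(b - a),
    }


# Hypothetical gene lists from two analyses of the same samples.
mas5_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
rma_genes = ["GENE_C", "GENE_D", "GENE_E"]

regions = venn_counts(mas5_genes, rma_genes)
for region, genes in regions.items():
    print(region, len(genes), genes)
```

Because the comparison works on gene identifiers rather than probe-level data, the same approach applies to lists from different platforms, provided the identifiers have been mapped to a common annotation first.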
44Venn Diagram
http://www.pangloss.com/seidel/Protocols/venn.cgi
http://ncrr.pnl.gov/software/VennDiagramPlotter.stm
45Conclusion
- Other analysis
- Class prediction
- Gene list from class comparison can be used in pathway analysis
- HSLS pathway workshops on Ingenuity, DAVID, Pathway Architect
- Future
- Integrate expression data with other data such as SNP or microRNA
- GEO has some data analysis features