SMD Data Analysis Tutorial - PowerPoint PPT Presentation

1
SMD Data Analysis Tutorial
  • April 7, 2009
  • Catherine Ball
  • (ball_at_genome.stanford.edu)
  • Janos Demeter (jdemeter_at_genome.stanford.edu)

2
SMD Getting Help
  • Click on the Help menu
  • Tool-specific links will be listed at the top.
  • Use the SMD help index to look for specific
    subjects
  • Send e-mail to
  • array_at_genome.stanford.edu

3
You will learn
  • How to use SMD Data Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation
  • How to use SMD's data repository
  • How to use SMD's implementation of the
    GenePattern data analysis suite

4
Data Retrieval and Analysis
  • Experiment names will be listed with feature
    extraction software indicated.

5
Gene Selection and Annotation
  • Specify genes or clones
  • Collapse data by SUID or LUID
  • Determine UID column
  • Choose biological annotation
  • Label result set

6
Gene Selection: All Genes
  • Ten arrays
  • All genes
  • 8690 Biosequence IDs used in cluster
  • Using all genes results in a very long cluster!

7
Gene Selection: Specify Genes or Clones
  • Use all genes or clones on an array
  • Select a Genelist from your loader.stanford.edu
    account
  • Enter a list of genes to select. The names
    should be separated by two colons
  • Optionally include controls and empty spots.

8
Gene Selection: Genelists
  • Ten arrays
  • 500-gene genelist
  • 380 Biosequence IDs used for cluster
  • Using a genelist limits the data analyzed to a
    subset of genes

9
Gene Selection: Retrieving and Collapsing Data
  • Collapse or averaging occurs within each
    individual array. Multiple instances of the same
    entity will be combined as specified.
  • Duplicated entities can be defined in three ways:
  • Biosequence ID (SUID) is the identifier for the
    molecule in SMD.
  • Laboratory Unique ID (LUID) is the identifier for
    the source of the sample in the lab.
  • SPOT is an individual feature on a print.
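
The within-array collapse described above can be sketched as follows. This is a minimal illustration, assuming each spot is an (identifier, value) pair and that duplicates are combined by a plain average; the function name and data layout are hypothetical and are not SMD's own code.

```python
from collections import defaultdict

def collapse_by_id(spots):
    """Average all spot values on ONE array that share an identifier.

    spots: list of (identifier, value) pairs; value is None for a spot
    that has no usable measurement (for example, it failed a filter).
    """
    groups = defaultdict(list)
    for identifier, value in spots:
        if value is not None:
            groups[identifier].append(value)
    # One averaged value per identifier; identifiers with no data are dropped.
    return {ident: sum(vals) / len(vals) for ident, vals in groups.items()}

# Hypothetical spots from a single array, keyed by a made-up SUID.
array_spots = [("SUID:101", 1.0), ("SUID:101", 2.0),
               ("SUID:202", -0.5), ("SUID:303", None)]
collapsed = collapse_by_id(array_spots)
```

The same function works for any of the three identifier choices; only the key used for each spot changes.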

10
Gene Selection: Collapse by SUID
  • Ten arrays
  • 500-gene genelist
  • Data retrieved by Biosequence ID
  • 380 Biosequence IDs used for cluster
  • Duplicated spots will be averaged

11
Gene Annotation: Biological Annotation
  • The list includes all information stored within
    SMD for any gene from the organism in question.
    Not all genes will have all annotations.
  • Annotations from a genelist (if one was selected)
    can be used to describe the genes.

12
Array Annotation: Name Choices
  • Arrays (hybridizations) are identified in SMD by
    slide name (e.g., serial number) and experiment
    name, both unique.
  • Agilent and Affymetrix data sets are further
    identified by a result set name (possibly more
    than one per hybridization, and not guaranteed to
    be unique).

13
SMD Data Analysis Tutorial
  • How to use SMD Data Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation
  • How to use SMD's data repository
  • How to use SMD's implementation of the
    GenePattern data analysis suite

14
Data Filtering
  • Choose data column to retrieve
  • Elect to invert reverse dye replicates
  • Elect to filter by spot flag
  • Select spot criteria for filtering
  • Define image presentation options
  • Retrieve data in background (not shown) - goes to
    repository

15
Data Filtering: Choose Data to Retrieve
  • You can retrieve and cluster any numerical
    measurement from your data.
  • Clustering doesn't necessarily make sense for all
    fields.
  • Default (and most appropriate) fields for
    clustering are log ratio (two-channel data) and
    signal or intensity (single-channel data).

16
Data Filtering: Selecting Filtering Criteria
  • Each spot will be individually assessed as
    specified, prior to any averaging or collapse.
  • Each filter can be made active and customized as
    desired.
  • Filters can be combined using logical operators
    (filter string), defaulting to a logical AND.
  • Filters available will be appropriate to the
    feature extraction software used.

17
Data Filtering: Default Spot Filters
  • Regression correlation measures pixel-by-pixel
    agreement between the two channels.
  • Foreground/Background intensities are a simple
    measure of signal to noise.
  • Absolute intensity cutoffs impose a minimum net
    signal.
  • "Failed" and "Is Contaminated" refer to the
    quality of the spot material.
  • Equivalent defaults are presented for Agilent
    data.
  • Affymetrix data can be filtered on detection,
    detection p-value, etc.
  • Any data, including biological annotations, can
    be used for customized filters.

18
Spots with low regression correlation
19
Data Filtering: Regression Correlation
  • Ten arrays
  • 500-gene genelist
  • Spot flag = 0
  • Regression correlation > 0.6
  • 380 Biosequence IDs used for filtering
  • Filtering away spots with low regression
    correlation removes many spots

20
Data Filtering: Combinations of Filters
  • Ten arrays
  • 500-gene genelist
  • Regression correlation > 0.6
  • Net intensity in each channel > 350
  • 371 Biosequence IDs selected for clustering
  • This data set was formed by selecting spots that
    are good quality (via the regression correlation)
    and good intensity in both channels
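
A combined spot filter like the one above (regression correlation and net intensity in both channels, joined with a logical AND) might look like the sketch below. The field names and the dictionary layout are assumptions made for illustration; they are not SMD column names.

```python
def passes_spot_filters(spot, min_corr=0.6, min_net=350):
    """Return True when a spot passes every active filter (logical AND)."""
    return (spot["regression_correlation"] > min_corr
            and spot["ch1_net"] > min_net
            and spot["ch2_net"] > min_net)

# Hypothetical spots from one array.
spots = [
    {"id": "a", "regression_correlation": 0.9, "ch1_net": 1200, "ch2_net": 800},
    {"id": "b", "regression_correlation": 0.4, "ch1_net": 1500, "ch2_net": 900},
    {"id": "c", "regression_correlation": 0.8, "ch1_net": 200, "ch2_net": 900},
]

# Spot "b" fails on correlation, spot "c" fails on channel 1 intensity.
kept = [s["id"] for s in spots if passes_spot_filters(s)]
```

Because the filter is applied spot by spot before any averaging, a gene with several spots can keep some measurements and lose others.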

21
Data Filtering: Image Presentation Options
  • "Retrieve spot coordinates" will allow you to see
    an assembled image of each array after
    clustering.
  • "Show all spots" allows you to view the spots you
    filtered out (in addition to the ones that passed
    filtering) after clustering. This might slow
    down data retrieval.

22
SMD Data Analysis Tutorial
  • How to use SMD Data Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation
  • How to use SMD's data repository
  • How to use SMD's implementation of the
    GenePattern data analysis suite

23
Data Filtering: Retrieve Data in Background
  • Long-running data retrieval jobs can be submitted
    and you'll be e-mailed with a progress report.
  • Data sets will be saved to your data repository.

24
Data Retrieval
  • General results and progress
  • PreClustering (.pcl) file
  • Data retrieval summary report
  • Option to deposit data in repository

25
Data Retrieval Summary
26
SMD Data Analysis Tutorial
  • How to use SMD Data Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation
  • How to use SMD's data repository
  • How to use SMD's implementation of the
    GenePattern data analysis suite

27
Gene Filtering
  • Transform single-channel data
  • Filter genes based on data distribution
  • Data centering
  • Filter genes based on data values
  • Filter genes and arrays based on spot filter
    criteria
  • Zero-transform data

28
Gene Filtering: Transformation
  • Single-channel (e.g., Affymetrix) data only.
  • Adjust arrays for simple cross-array comparison.
  • Log-transform data for clustering.
  • May add a constant for variance stabilization
  • May replace non-positive values with very small
    values

29
Gene Filtering: Data Distribution
  • Rank will select genes whose retrieved value is
    in the top Nth percentile.
  • Deviations selects those genes whose retrieved
    value has a value significantly above or below
    the mean.

30
Gene Filtering: Percentile Rank
  • Ten arrays
  • 500-gene genelist
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Rank > 95 in at least one array
  • Many data are removed, since only those that were
    very intense in the yellow (red) channel are
    included.

31
Gene Filtering: Deviation from Mean Value
  • Ten arrays, 500-gene genelist
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Genes whose Log(Normalized Red/Green) is more
    than one standard deviation from mean in at least
    one array
  • This filter removes data that do not deviate
    significantly from the mean. It is a good way
    to identify genes with potentially interesting
    behavior.
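
A deviation filter of this kind can be sketched as below. The sketch assumes the mean and standard deviation are computed per array, across all genes retrieved for that array; the slide does not state this, so treat it as an assumption. The function and variable names are hypothetical.

```python
from statistics import mean, pstdev

def deviation_filter(data, n_sd=1.0, min_arrays=1):
    """Keep genes that lie more than n_sd standard deviations from the
    array mean in at least min_arrays arrays.

    data: {gene: [value per array]}, with None marking a missing value.
    """
    n_arrays = len(next(iter(data.values())))
    stats = []
    for j in range(n_arrays):
        column = [row[j] for row in data.values() if row[j] is not None]
        stats.append((mean(column), pstdev(column)))
    kept = []
    for gene, row in data.items():
        hits = sum(
            1 for j, v in enumerate(row)
            if v is not None and abs(v - stats[j][0]) > n_sd * stats[j][1]
        )
        if hits >= min_arrays:
            kept.append(gene)
    return kept

# Hypothetical log ratios for four genes on two arrays.
log_ratios = {
    "g1": [0.0, 0.0],
    "g2": [0.1, 0.1],
    "g3": [3.0, 0.0],    # stands out on the first array
    "g4": [-0.1, -2.0],  # stands out on the second array
}
selected = deviation_filter(log_ratios, n_sd=1.0, min_arrays=1)
```

Raising `min_arrays` makes the filter stricter, since a gene must then stand out on more than one array.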

32
Gene Filtering: Centering Data
  • Data can be centered at this stage. This
    transforms the data so that the mean value is
    equal to zero. Images and downloaded files will
    reflect this transformation.
  • During clustering, data can be treated as if they
    were centered, but the values of the data are not
    affected.
  • Gene centering is useful for common references.
  • Array centering amounts to renormalizing each
    array, using the spots that pass the spot filter
    criteria.

33
Data Centering
  • Centering sets the average value of a vector to
    zero.
  • This results in a loss of information, but may
    reveal important patterns.

34
Data Centering
  • Gene centering is useful when the actual value of
    the ratio is not important or is not meaningful
    (e.g., common reference).
  • Centering is generally not appropriate when using
    a biologically meaningful control sample, such as
    a matched, untreated sample, or a zero timepoint.

35
Data Transformation: Centering
  • To illustrate how centering affects data, a small
    sample of data was duplicated. A constant was
    added to the second copy of each row.
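
The effect of that experiment can be reproduced in a few lines. This is a minimal sketch with made-up numbers: centering subtracts the row mean, so a row and its constant-shifted copy become identical.

```python
def center(values):
    """Subtract the mean of the non-missing values so the row averages zero."""
    present = [v for v in values if v is not None]
    row_mean = sum(present) / len(present)
    return [None if v is None else v - row_mean for v in values]

row = [1.0, 2.0, 3.0]
shifted_copy = [v + 5.0 for v in row]   # same row plus a constant

centered_row = center(row)
centered_copy = center(shifted_copy)
```

After centering, the offset between the two copies is gone, which is the information loss mentioned earlier; only the shape of each profile remains.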

36
Uncentered Data, No Centering Metric During Clustering
Uncentered Data, Centering Metric During Clustering
Centered Data, No Centering Metric During Clustering
Centered Data, Centering Metric During Clustering
37
Gene Filtering: Center Genes
Centered
Uncentered
  • Ten arrays, 500-gene genelist
  • Regression correlation > 0.6
  • Net intensity in either channel > 350
  • Genes centered
  • No effect on number of Biosequence IDs clustered,
    but data values are changed (centered data are
    displayed on left)

38
Gene Filtering: Data Values
  • Cutoff requires data to exceed a user-defined
    value in at least A arrays. Think hard before
    using this filter. Especially when data are
    centered, you could be losing important
    information.

39
Gene Filtering: Spot Filter Criteria
  • Genes can be screened out if they do not meet the
    spot criteria a given percentage of the time, as
    specified by the user.
  • Arrays can be similarly filtered out if they do
    not meet the spot filter criteria.

40
Spot Filtering vs. Gene Filtering
Gene filters remove the genes that do not meet
the filter criteria often enough. This reduces
the number of genes.
Spot filters remove individual data points. That
means there will be more missing (gray) data.
41
Gene Filtering: Zero Time Point Transformation
  • Data can be transformed by subtracting one state
    of a series from all other data

42
Gene Filtering: Zero Time Point Transformation
Subtract the values from the first time point
from all the other time points
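
A minimal sketch of the zero time point transformation, assuming one gene's values are stored as a list ordered by time point. The function name and the `reference_index` parameter are hypothetical.

```python
def zero_transform(series, reference_index=0):
    """Subtract the reference time point from every value in the series."""
    reference = series[reference_index]
    if reference is None:
        # Without a reference measurement nothing can be transformed.
        return [None] * len(series)
    return [None if v is None else v - reference for v in series]

# Hypothetical log ratios for one gene over three time points.
time_course = [0.5, 1.5, -0.5]
transformed = zero_transform(time_course)
```

The reference point itself becomes zero, so every other value reads as a change relative to the start of the series.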
43
Gene Filtering Results
  • Download PreClustering files (.pcl)
  • Go to GenePattern
  • Summary report
  • Deposit to repository
  • Another round of filtering
  • Proceed to clustering

44
Gene Filtering: Data Retrieval Summary Report
45
SMD Data Analysis Tutorial
  • How to use SMD Data Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation
  • How to use SMD's data repository
  • How to use SMD's implementation of the
    GenePattern data analysis suite

46
Clustering and Image Generation
  • Partitioning options
  • Clustering metric selections
  • Correlated genes
  • Image generation options

47
Clustering Algorithms
In microarray studies, we often use clustering
algorithms to help us identify patterns in
complex data. For example, we can randomize the
data used to represent this painting and see if
clustering will help us visualize the pattern.
48
Clustering algorithms
The painting is sliced into rows which are then
randomized.
49
Clustering algorithms
Rows ordered by hierarchical clustering with
nodes flipped to optimize ordering
50
How do we compare expression profiles?
  • Treat expression data for a gene as a
    multidimensional vector.
  • Decide on a distance metric to compare the
    vectors.
  • Plenty to choose from
  • Pearson correlation, Euclidean Distance,
    Manhattan Distance etc.

51
Expression Vectors
  • Crucial concept for understanding clustering
  • Each gene is represented by a vector where
    coordinates are its values (log(ratio)) in each
    experiment
  • x = log(ratio) in experiment 1
  • y = log(ratio) in experiment 2
  • z = log(ratio) in experiment 3
  • etc.

52
Clustering Metric Selections
  • Genes and arrays can be clustered.
  • Pearson correlation treats vectors as if they
    were the same (unit) length.
  • Euclidean distance will be affected by both the
    direction and the amplitude of the vectors.

53
Distance Metrics
  • Distances are measured between expression
    vectors
  • Distance metrics define the way we measure
    distances
  • Many different ways to measure distance
  • Euclidean distance
  • Pearson correlation coefficient(s)
  • Manhattan distance
  • Mutual information
  • Kendall's tau
  • etc.
  • Each has different properties and can reveal
    different features of the data

54
Euclidean Distance
  • The Euclidean distance metric detects similar
    vectors by identifying those that are closest in
    space. In this example, A and C are closest to
    one another.

55
Pearson Correlation
  • The Pearson correlation disregards the magnitude
    of the vectors but instead compares their
    directions. In this example, Gene A and Gene B
    have the same slope, so would be most similar to
    each other.

56
Distance Metric: Pearson vs. Euclidean
[Figure: expression profiles of genes A, B, and C]
  • By Euclidean distance, A and C are most similar.
  • By Pearson correlation, A and B are most similar.

57
Clustering Tree Displays
  • Clustered gene arrays are displayed adjacent to
    most similar arrays.
  • The nodes of the trees indicate the members of an
    array and the degree of similarity to its
    neighbor.

58
Hierarchical Clustering
  • Calculate the distance between all genes. Find
    the smallest distance. If several pairs share the
    same similarity, use a predetermined rule to
    decide between alternatives.
  • Fuse the two selected clusters to produce a new
    cluster that now contains at least two objects.
    Calculate the distance between the new cluster
    and all other clusters.
  • Repeat steps 1 and 2 until only a single cluster
    remains.
  • Draw a tree representing the results.
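
The steps above can be sketched in a few lines. This is an illustration only: it assumes Euclidean distance and average linkage, neither of which the slide specifies, and it breaks ties by taking the first pair encountered. It returns the list of merges rather than drawing the tree.

```python
import math
from itertools import combinations

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_linkage(cluster_a, cluster_b, vectors):
    """Mean pairwise distance between the members of two clusters."""
    distances = [euclidean(vectors[a], vectors[b])
                 for a in cluster_a for b in cluster_b]
    return sum(distances) / len(distances)

def hierarchical_cluster(vectors):
    """vectors: {name: [values]}. Returns merges as (members, members, distance)."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        # Step 1: find the closest pair (min() keeps the first pair on ties).
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], vectors))
        merges.append((a, b, average_linkage(a, b, vectors)))
        # Step 2: fuse the pair into one new cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical two-array profiles for four genes.
profiles = {"g1": [0.0, 0.0], "g2": [0.0, 1.0],
            "g3": [5.0, 5.0], "g4": [5.0, 6.5]}
merges = hierarchical_cluster(profiles)
```

The recorded merge order and distances are exactly the information a tree display encodes: which members join at each node and how similar they were.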

59
Clustering: Array Clustering
No Array Clustering
With Array Clustering
60
Clustering: Self-Organizing Maps
  • Map of n partitions, that is modeled on the
    expression data, where each partition in the map
    has an associated vector
  • Genes are assigned to partitions of most similar
    genes
  • Neighboring partitions are more similar to each
    other than they are to distant partitions

61
Clustering: Correlated Genes
  • SMD can produce a file listing the
    best-correlated genes, for each gene retrieved.

62
Clustering: Visualization
  • Click on the image to get a dynamic display.
  • Click on the TreeView button for another dynamic
    option.
  • Click on one of the other options to see static
    displays with or without the spot images.
  • Download files (.cdt, .atr, .gtr, report) for use
    with other tools (e.g., TreeView).
  • Add cluster or pre-clustering file to your
    repository

63
Clustering Display: Adjacent Cluster and Clustered Spot Images
64
Clustering Display: Hierarchical Cluster View
  • Interactive view of cluster
  • Link to GO term analysis (green nodes) to
    evaluate sub-clusters.

65
SMD Data Analysis
  • Using SMD Data Analysis Pipeline
  • Repository Tools
  • SVD
  • Synthetic Gene Tool
  • kNNimpute
  • GenePattern tools

66
SMD Help: File Formats
67
File Formats: Pre-clustering (PCL) File
Names and orders of arrays (if arrays are not clustered)
68
File Formats: Clustered Data Table (CDT) File
69
File Formats: Gene Cluster Text (GCT) File
70
File Formats: Class (CLS) File
71
Using Your Repository: PCL Deposits
72
Using the Repository: CDT File Options
CDT files have a few other options
GeneXplorer
Clustering with Proxy and Spot images
TreeView
Clustering with Spotimages
Clustering with Proxy images
73
Viewing Repository Entries
  • Name
  • Organism
  • Number of genes
  • Number of arrays
  • Size of file
  • Date uploaded
  • Description
  • Data retrieval summary

74
Editing Entries -- How to Share!
  • Change repository entry name
  • Change description
  • Add access to repository entry to a GROUP
  • Add access to a repository entry to a SMD USER

75
SMD Data Analysis
  • Data Analysis Background
  • Clustering algorithms
  • Data centering
  • Using SMD Data Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation
  • Repository Tools
  • SVD
  • Synthetic Gene Tool
  • kNNimpute
  • GenePattern tools

76
SVD: Singular Value Decomposition
  • The goal of SVD is to find a set of patterns that
    describe the greatest amount of variance in a
    dataset
  • SVD determines unique orthogonal (or
    uncorrelated) gene and corresponding array
    expression patterns (i.e. "eigengenes" and
    "eigenarrays," respectively) in the data
  • Patterns might be correlated with biological
    processes OR might be correlated with technical
    artifacts
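
To make the idea of a dominant pattern concrete, the sketch below finds the first (largest) singular value and its pair of patterns by power iteration. This is an illustration with a tiny made-up matrix, not the method SMD uses; a real analysis would call a full SVD routine from a numerical library. In the usual convention for a genes-by-arrays matrix, the pattern across arrays corresponds to an "eigengene" and the pattern across genes to an "eigenarray".

```python
import math

def dominant_pattern(matrix, iterations=100):
    """Power iteration for the first singular triplet of a genes x arrays matrix.

    Returns (sigma, gene_pattern, array_pattern).
    """
    n_cols = len(matrix[0])
    # Arbitrary non-uniform start vector (an assumption of this sketch);
    # it must not be orthogonal to the dominant pattern.
    v = [float(j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        # u = A v
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        # w = A^T u
        w = [sum(row[j] * u[i] for i, row in enumerate(matrix))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v

# Three genes on two arrays; every row is a multiple of the same pattern,
# so a single pattern describes all of the variation.
expression = [[3.0, 4.0], [6.0, 8.0], [6.0, 8.0]]
sigma, gene_pattern, array_pattern = dominant_pattern(expression)
```

Here the array pattern comes out proportional to (3, 4), the profile shared by every row, and the gene pattern records how strongly each gene follows it.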

77
SVD method
78
SVD Display in SMD
79
SMD Data Analysis
  • Data Analysis Background
  • Clustering algorithms
  • Data centering
  • Using SMD Data Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation
  • Repository Tools
  • SVD
  • kNNimpute
  • Synthetic Gene Tool
  • GenePattern tools

80
KNNimpute: The Missing Values Problem
  • Microarrays can have systematic or random missing
    values
  • Some algorithms aren't robust to missing values
  • Large literature on parameter estimation exists
  • What's best to do for microarrays?

81
KNNimpute: Algorithm
  • Idea: use genes with similar expression profiles
    to estimate missing values
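
The idea can be sketched as follows. This is a simplified illustration: it uses Euclidean distance over the arrays both genes share and a plain (unweighted) average of the k nearest genes, which is an assumption of the sketch and not necessarily what the KNNimpute tool does. All names are hypothetical.

```python
import math

def knn_impute(data, k=2):
    """Fill missing values (None) from the k most similar genes.

    data: {gene: [value per array]}. Returns a filled copy.
    """
    def distance(row_a, row_b):
        shared = [(a, b) for a, b in zip(row_a, row_b)
                  if a is not None and b is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))

    filled = {gene: list(row) for gene, row in data.items()}
    for gene, row in data.items():
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors must have a value in the missing column.
            candidates = sorted(
                (distance(row, other), other[j])
                for name, other in data.items()
                if name != gene and other[j] is not None
            )
            neighbors = [val for dist, val in candidates[:k] if dist < math.inf]
            if neighbors:
                filled[gene][j] = sum(neighbors) / len(neighbors)
    return filled

# Hypothetical data: g1 is missing its third value.
data = {
    "g1": [1.0, 2.0, None],
    "g2": [1.0, 2.0, 3.0],
    "g3": [1.1, 2.1, 3.2],
    "g4": [-5.0, -5.0, -5.0],
}
imputed = knn_impute(data, k=2)
```

The missing value is estimated from g2 and g3, the two genes whose profiles most resemble g1; the dissimilar gene g4 does not contribute.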

82
SMD Data Analysis
  • Data Analysis Background
  • Clustering algorithms
  • Data centering
  • Using SMD Data Analysis Pipeline
  • Gene Selection and Annotation
  • Data Filtering
  • Data Retrieval
  • Gene Filtering
  • Clustering and Image Generation
  • Repository Tools
  • SVD
  • kNNimpute
  • Synthetic Gene Tool
  • GenePattern tools

83
Synthetic Genes
  • Purpose:
  • average data based on arbitrary groupings of
    genes/probes
  • - for biological reasons
  • - for technical reasons
  • Can average data using:
  • - common genelists
  • - your own genelists
  • - annotations in pcl file
  • After averaging:
  • - a new row for the synthetic gene data
  • - original data can be removed/included

84
Synthetic Genes
  • Common lists available (only mouse and human
    data)
  • Unigene (all clones/oligos that report on a given
    Unigene id will be averaged and shown as the
    Unigene id)
  • Entrez Geneid (same as above, but for Entrez
    Geneid)
  • These lists are useful to collapse data by gene,
    rather than biosequenceid/luid.
  • They allow comparison of experiments between
    different platforms - oligo print to cDNA print
    or spotted arrays to Agilent arrays where the
    arrays don't share common reporters. Also can be
    used to compare cDNA prints with h/meebo arrays
  • These synthetic gene lists are updated on a
    regular basis.

85
Synthetic Genes
  • Other common synthetic gene lists
  • chromosome arms
  • cytobands
  • 5 Mb tiles based on GoldenPath mappings
  • Tissue types
  • tumor types
  • processes
  • Additional lists: see
  • http://smd.stanford.edu/help/synthGenes.shtml

86
SMD Data Analysis
  • Using SMD Data Analysis Pipeline
  • Repository Tools
  • SVD
  • Synthetic Gene Tool
  • kNNimpute
  • GenePattern tools

87
What is GenePattern?
  • Software package developed at Broad Institute
    (Jill P. Mesirov's group)
  • http://www.broad.mit.edu/cancer/software/genepattern/
  • Reasons to choose this package:
  • Large number of microarray analysis tools (>90)
  • Ability to create pipelines (reproducible
    research)
  • Ease of adding new modules to existing ones

88
How to find GP in SMD?
From Data retrieval
From repository
89
Terms in GenePattern
  • Module (Analysis/Visualization/Utility): program
    that does analysis, displays or executes some
    other transformation of a file
  • Pipelines: chained modules - output from one ->
    input to next
  • Suites: groupings of modules/pipelines
  • Jobs
  • execution of module/pipeline
  • persistent
  • results are deleted after one week
  • Go to web-site for navigation

90
GenePattern comments
  • SMD uses pcl
  • GenePattern uses gct (among others)
  • Converters: gct -> pcl, pcl -> gct
  • Most tools in GenePattern need full dataset - Use
    ImputeMissingValuesKNN first
  • Most default values are designed for Affymetrix
    data - evaluate each option carefully
    (GeneCruiser)
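
A pcl -> gct conversion can be sketched as below. The layout assumptions are mine and should be checked against the SMD and GenePattern file format help pages: the PCL file is tab-delimited with three leading columns (identifier, NAME, GWEIGHT) followed by one column per array, plus an optional EWEIGHT row; the GCT file starts with `#1.2`, then a line with the row and column counts, then a header beginning with `Name` and `Description`. The function name is hypothetical and this is not the converter shipped with SMD or GenePattern.

```python
def pcl_to_gct(pcl_text):
    """Convert PCL-formatted text to GCT-formatted text (sketch)."""
    lines = [ln for ln in pcl_text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    samples = header[3:]                 # columns after ID, NAME, GWEIGHT
    rows = []
    for line in lines[1:]:
        fields = line.split("\t")
        if fields[0] == "EWEIGHT":       # array weights have no GCT equivalent
            continue
        # Keep ID and NAME, drop GWEIGHT, keep the data values.
        rows.append([fields[0], fields[1]] + fields[3:])
    out = ["#1.2",
           f"{len(rows)}\t{len(samples)}",
           "\t".join(["Name", "Description"] + samples)]
    out += ["\t".join(row) for row in rows]
    return "\n".join(out) + "\n"

# Hypothetical two-gene, two-array PCL file.
pcl = ("ID\tNAME\tGWEIGHT\tarr1\tarr2\n"
       "EWEIGHT\t\t\t1\t1\n"
       "g1\tgene one\t1\t0.5\t-0.2\n"
       "g2\tgene two\t1\t1.1\t0.0\n")
gct = pcl_to_gct(pcl)
```

The gene and array weights are discarded in this direction, so a round trip back to pcl would have to supply default weights.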

91
Input/output files in GenePattern
  • Called through specific pcl file
  • Files in your repository
  • Upload data from desktop
  • Any file that has a url
  • Module to get data directly from geo

92
GenePattern modules by category: Clustering
  • Clustering
  • Hierarchical clustering/HierarchicalClusteringPCL
  • Self-organizing maps (SOM)
  • K-means clustering
  • Non-negative matrix factorization (NMF) (Brunet
    et al., 2004) is an alternative method for class
    discovery. Rather than clustering genes, NMF
    detects context-dependent patterns of gene
    expression. Requires all positive values.
  • Consensus clustering (Monti et al., 2003) runs a
    selected clustering algorithm against
    perturbations of the original data set. The
    result is a consensus matrix that assesses the
    stability of discovered clusters. Supported
    clustering methods: hierarchical clustering,
    K-means clustering, self-organizing maps (SOM),
    and non-negative matrix factorization (NMF).
  • Clusters genes or samples, not both
  • SubMap (Hoshida Y, et al., PLoS ONE 2(11): e1195,
    2007) is an unsupervised method, which estimates
    the significance of an association between
    subclasses observed in two independent data sets.
    The subclass labels are predetermined as manually
    assigned phenotypes or by clustering prior to the
    application of the SubMap algorithm.
  • Corresponding visualizers
  • HierarchicalClusteringViewer
  • SOMClusterViewer
  • HeatMapViewer
  • Etc
  • Clustering result example

93
GenePattern modules by category: Marker Selection
  • ComparativeMarkerSelection (similar to SAM)
  • ComparativeMarkerSelectionViewer
  • Visualize and explore data produced by the method
  • ExtractComparativeMarkerResults: extract data
    based on the analysis, create genelist
  • Gene Set Enrichment Analysis (GSEA)
  • GSEALeadingEdgeViewer

94
GenePattern modules: Marker Selection
  • Goal: Given phenotypically distinct classes, find
    markers with distinct expression patterns (in
    different classes)

95
GenePattern modules: Marker Selection
  • Visualize result using
    ComparativeMarkerSelectionViewer

96
GenePattern modules: GSEA
  • A method to determine whether an a priori defined
    set of genes shows statistically significant,
    concordant differences between two biological
    states
  • Very similar to comparative marker selection, but
    applied to gene sets rather than individual genes

97
GenePattern modules: GSEA
  • Molecular Signatures DB
  • http://www.broad.mit.edu/gsea/msigdb/index.jsp
  • Gene sets: groups of gene symbols
  • Gene sets are versioned

98
GenePattern modules: GSEA
  • Requirements
  • Expression dataset
  • Class file
  • Chip file

Number of classes
Number of slides
99
GenePattern modules: GSEA
  • How it works:
  • Sorts rows based on how well a metric correlates
    with the class assignment (similar to marker
    selection tool)
  • Scores gene sets (using a scoring method) by
    walking down the ranked list of genes, increasing
    a running-sum statistic when a gene is in the
    gene set and decreasing it when it is not.
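
The running-sum walk can be sketched as below. This is a simplified, unweighted version: every hit raises the sum by the same amount and every miss lowers it by the same amount, which is an assumption of the sketch (the GSEA module can weight steps by each gene's correlation). The score is the largest deviation of the running sum from zero. The function name is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a member of the gene set, step down otherwise.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranked list, most correlated with the first class at the top.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
score_top = enrichment_score(ranked, {"g1", "g2"})
score_bottom = enrichment_score(ranked, {"g5", "g6"})
```

A set concentrated at the top of the ranking gets a large positive score, and a set concentrated at the bottom gets a large negative one. Significance would still have to be assessed, for example by permutation, which this sketch leaves out.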

100
GenePattern modules: GSEA
  • Output: results can be viewed in web-browser
  • Further analysis: GSEALeadingEdgeViewer

101
Create pipeline/suite
  • Pipeline can be created from a path
  • by concatenating individual modules
  • From zip file
  • Pipelines can be exported
  • Pipelines can be private or public

102
Creating new modules
  • SMD curators/programmers can create/upload new
    modules
  • If you have any programs you would like to share
    or use in SMD, please let us know

103
GenePattern modules: Class Prediction
  • Goal: Given phenotypically distinct classes, find
    a gene expression signature that accurately
    predicts class membership.
  • Computational methodology: divide data into
    training and test sets
  • Goal
  • achieve high predictive power
  • Avoid over-fitting

104
GenePattern modules: Class Prediction
  • Simple example
  • KNN classifier,
  • k = 5, 2 genes, 2 classes

107
GenePattern modules: Class Prediction
  • Evaluation on independent test set
  • Build the classifier on the train set.
  • Assess prediction performance on test set.
  • Maximize generalization/Avoid overfitting.
  • Performance measure
  • error rate

108
GenePattern modules: Class Prediction
  • K-nearest-neighbors (KNN) classifies an unknown
    sample by assigning it the phenotype label most
    frequently represented among the k nearest known
    samples (Golub and Slonim et al., 1999). In
    GenePattern, the user selects a weighting factor
    for the 'votes' of the nearest neighbors:
    unweighted (all votes are equal); weighted by the
    reciprocal of the rank of the neighbor's distance
    (the closest neighbor is given weight 1/1, the
    next closest neighbor is given weight 1/2, and so
    on); or weighted by the reciprocal of the
    distance.
  • Weighted Voting (Slonim et al., 2000) classifies
    an unknown sample using a simple weighted voting
    scheme. Each gene in the classifier 'votes' for
    the phenotype class of the unknown sample. A
    gene's vote is weighted by how closely its
    expression correlates with the differentiation
    between phenotype classes in the training data
    set.
  • Support Vector Machines (SVM) is designed for
    multiple class classification (Rifkin et al.,
    2003). The algorithm creates a binary SVM
    classifier for each class by computing a maximal
    margin hyperplane that separates the given class
    from all other classes; that is, the hyperplane
    with maximal distance to the nearest data point.
    The binary classifiers are then combined into a
    multiclass classifier. For an unknown sample, the
    assigned class is the one with the largest
    margin.
  • CART (Breiman et al., 1984) builds Classification
    And Regression Trees for predicting continuous
    dependent variables (regression) and categorical
    dependent variables (classification). It works by
    recursively splitting the feature space into a
    set of non-overlapping regions and then
    predicting the most likely value of the dependent
    variable within each region. A classification
    tree represents a set of nested if-then
    conditions that allows for the prediction of the
    value of the categorical dependent variable based
    on the observed values of the feature variables.
    A regression tree is similar but allows for the
    prediction of the value of a continuous dependent
    variable instead.
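
The three KNN voting schemes described in the first bullet above can be sketched as follows. This is an illustration only, assuming Euclidean distance and a tiny made-up training set with two genes and two classes; the function name and the `weighting` values are hypothetical and are not GenePattern parameters.

```python
import math
from collections import defaultdict

def knn_classify(train, sample, k=5, weighting="none"):
    """Classify a sample by a vote among its k nearest training samples.

    train: list of (vector, label) pairs.
    weighting: "none", "rank" (1/rank), or "distance" (1/distance).
    """
    neighbors = sorted(
        (math.dist(vector, sample), label) for vector, label in train
    )[:k]
    votes = defaultdict(float)
    for rank, (distance, label) in enumerate(neighbors, start=1):
        if weighting == "rank":
            weight = 1.0 / rank
        elif weighting == "distance":
            weight = 1.0 / distance if distance > 0 else math.inf
        else:
            weight = 1.0
        votes[label] += weight
    return max(votes, key=votes.get)

# Hypothetical training data: expression of two genes, classes "A" and "B".
train = [
    ([1.0, 1.0], "A"), ([1.2, 0.9], "A"),
    ([3.0, 3.0], "B"), ([3.1, 2.9], "B"), ([2.9, 3.2], "B"),
]
unknown = [1.1, 1.0]

unweighted = knn_classify(train, unknown, k=5)
rank_weighted = knn_classify(train, unknown, k=5, weighting="rank")
distance_weighted = knn_classify(train, unknown, k=5, weighting="distance")
```

With k equal to the size of this small training set, the unweighted vote simply returns the majority class, while either weighting lets the two close neighbors outvote the three distant ones. That difference is why the choice of k and of weighting should be evaluated on an independent test set.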

109
GenePattern modules: Pathway Analysis
  • ARACNE (Algorithm for the Reconstruction of
    Accurate Cellular Networks) (Margolin, A., et
    al., BMC Bioinformatics, 2006, 7(Suppl 1): S7)
    is an algorithm that reverse engineers a
    gene regulatory network from microarray gene
    expression data. It attempts to predict targets of
    select transcription factors from a microarray
    dataset.
  • MINDY (Modulator Inference by Network Dynamics)
    algorithm computationally infers genes that
    modulate the activity of a transcription factor
    at post-transcriptional levels (Wang et al.,
    2006). The algorithm uses mutual information
    (MI) to measure the mutual dependence of the
    transcription factor (TF) and its target gene to
    predict modulators of TF activity.

110
GenePattern modules: Survival Analysis
  • SurvivalCurve: draws survival curve based on cls
    file
  • SurvivalDifference: tests if there is a
    difference between two or more survival curves
    based on sample classes defined by genomic data.
    The log-rank test (Mantel-Haenszel test) and the
    generalized Wilcoxon test can be used.

111
GenePattern modules
  • Many other modules
  • Projection methods: PCA, NMF (Non-negative Matrix
    Factorization)
  • Tools for SNP analysis
  • Tools for proteomics data
  • Etc

112
SMD Office Hours
  • Grant S201
  • Mondays 3 - 5 pm
  • Wednesdays 2 - 4 pm

113
SMD Staff
Gavin Sherlock, Co-Investigator
Catherine Ball, Director
Patrick Brown, Co-Investigator
Farrell Wymore, Lead Programmer
Michael Nitzberg, Database Administrator
Zac Zachariah, Systems Administrator
Tatiparthy Reddy, Scientific Curator
Janos Demeter, Computational Biologist
Heng Jin, Scientific Programmer
Maria Mao, Software Engineer
Jeremy Hubble, Scientific Programmer