Introduction to R and Bioconductor, BMI 731, Winter 2005

Provided by: Biomedical94
Learn more at: http://bmi.osu.edu

1
Introduction to R and Bioconductor: BMI 731, Winter 2005
  • Catalin Barbacioru
  • Department of Biomedical Informatics
  • Ohio State University

2
References
  • R Project (www.r-project.org): open-source language and environment for
    statistical computing and graphics.
  • Comprehensive R Archive Network, CRAN (cran.r-project.org): source code
    and precompiled binary distributions for Linux, Windows, and MacOS; base
    and contributed packages.
  • Bioconductor Project (www.bioconductor.org): open-source software for the
    analysis of biomedical and genomic data, mainly R packages.

3
R Project
  • R is a language and environment for statistical
    computing and graphics. It is an open-source
    project similar to the S language and
    environment, which was developed at Bell
    Laboratories (formerly AT&T, now Lucent
    Technologies) by John Chambers and colleagues. R
    can be considered a different implementation
    of S.
  • R provides a wide variety of statistical (linear
    and nonlinear modeling, classical statistical
    tests, time-series analysis, classification,
    clustering, ...) and graphical techniques, and is
    highly extensible. The S language is often the
    vehicle of choice for research in statistical
    methodology, and R provides an Open Source route
    to participation in that activity.

4
R Project
  • R can be extended (easily) via packages.
  • An R package is a structured collection of code
    (R, C, or other), documentation, and/or data for
    performing specific types of analyses.
  • Packages only need to be installed once, but
    they must be loaded in each new R session.
  • To load a package, use library, e.g.,
    > library(Biobase)
  • Various functions are available to obtain
    information on a package.
  • For example, packageDescription returns the
    content of the DESCRIPTION file, and .find.package
    (find.package in current versions of R) returns
    the directory where the package was installed.
  • > packageDescription("hgu95av2")
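
The install-once, load-every-session pattern and the two information functions can be tried without any Bioconductor package installed. The following is a minimal sketch that uses the base package stats as a stand-in, on the assumption that a current R version is in use (so find.package rather than .find.package); the install.packages line is commented out because it needs network access.

```r
# Load an installed package for this session. "stats" ships with every R
# installation, so this runs without downloading anything.
library(stats)

# Parsed contents of the package's DESCRIPTION file.
desc <- packageDescription("stats")
desc$Package
desc$Version

# Directory where the package is installed.
pkg_dir <- find.package("stats")
pkg_dir

# Installation is a one-time step (shown for a CRAN contributed package,
# left commented out because it requires a network connection):
# install.packages("ellipse")
```

The same calls apply to a metadata package such as hgu95av2 once it has been installed.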

5
R Packages
  • Analysis packages: implementations of statistical
    and graphical methods. E.g., cluster, glm, graph,
    hexbin, lattice, rpart.
  • Data packages: biological metadata packages
    consisting of environment objects for mappings
    between different gene identifiers (e.g., Affymetrix
    ID, GO ID, LocusLink ID, PubMed ID), and CDF and
    probe sequence information for Affymetrix chips.
    E.g., GO, hgu95av2, humanLLMappings, KEGG.
  • Specialized/custom packages: code, data,
    documentation, and exercises for a particular
    project, article, or course. E.g., EMBO03
    (Bioconductor course package), golubEsets (Golub
    et al. (2000) ALL/AML dataset), yeastCC (Spellman
    et al. (1998) yeast cell cycle dataset).

6
R Packages
  • Base packages (CRAN).
  • E.g., base, graphics, methods, stats.
  • Contributed packages (CRAN).
  • E.g. ellipse, XML.
  • Bioconductor packages.
  • E.g., annotate, affy, marray, multtest, hgu95av2,
    ALL, EMBO03.

7
Bioconductor Project
  • Bioconductor is an open-source and
    open-development software project for the
    analysis of biomedical and genomic data.
  • The project was started in the Fall of 2001 and
    includes 25 core developers in the US, Europe,
    and Australia.
  • Provide access to powerful statistical and
    graphical methods for the analysis of biomedical
    and genomic data.
  • Facilitate the integration of biological metadata
    from the WWW in the analysis of experimental data.
  • E.g., GenBank, GO, LocusLink, PubMed.
  • Provide training in computational and statistical
    methods.

8
Bioconductor Packages
  • Statistical methods: cluster analysis, estimation
    and (multiple) testing for linear and non-linear
    models (with possibly censored continuous and
    polychotomous outcomes), resampling,
    visualization, etc.
  • Biological assays: cell-based assays, DNA
    microarrays (transcript levels, DNA copy number
    from CGH), proteomics, SAGE, SELDI-TOF, SNP, etc.
  • Biological metadata from the WWW: GenBank, GO, KEGG,
    PubMed, etc.
  • Interfaces with other languages: C, Java, Perl,
    Python, XML, etc. See the Omega Project
    (www.omegahat.org).
  • Interactions with other projects: BGL,
    GeneSpring, Graphviz, MAGE-ML, Resourcerer, etc.

9
Bioconductor Packages
  • Analysis packages: e.g., annotate, affy, marray,
    multtest.
  • Data packages:
  • Biological metadata: mappings between different
    gene identifiers (e.g., AffyID, GO ID, LocusID,
    PMID), CDF and probe sequence information for
    Affymetrix chips.
  • E.g., hgu95av2, GO, KEGG.
  • Experimental data: code, data, and
    documentation for specific experiments or
    projects.
  • ALL: Chiaretti et al. (2004) ALL dataset.
  • golubEsets: Golub et al. (2000) ALL/AML dataset.
  • yeastCC: Spellman et al. (1998) yeast cell cycle
    dataset.

10
Bioconductor Packages
  • General infrastructure: Biobase, Biostrings,
    DynDoc, reposTools, rhdf5, ruuid, tkWidgets,
    widgetTools.
  • Annotation: annotate, AnnBuilder, metadata
    packages.
  • Graphics: geneplotter, hexbin.
  • Pre-processing Affymetrix oligonucleotide chip
    data: affy, affycomp, affydata, affylmGUI,
    affyPLM, annaffy, gcrma, makecdfenv, vsn.
  • Pre-processing two-color spotted DNA microarray
    data: arrayMagic, arrayQuality, limma, limmaGUI,
    marray, vsn.
  • Other assays: aCGH, DNAcopy, prada, PROcess,
    RSNPer, SAGElyzer.
  • Differential gene expression: EBarrays, edd,
    factDesign, genefilter, limma, limmaGUI,
    multtest, ROC.
  • Graphs and networks: graph, RBGL, Rgraphviz.
  • Gene Ontology: GOstats, goTools.

12
Microarray data analysis
  • Pre-processing of spotted array data with the
    marray packages and of Affymetrix chip data with
    the affy packages.
  • List of differentially expressed genes from the
    genefilter, limma, or multtest packages.
  • Prediction of tumor class using the randomForest
    package.
  • Clustering of genes using the cluster or hopach
    packages.
  • Use of the annotate package to retrieve and
    search PubMed abstracts and to generate an HTML
    report with links to LocusLink and PubMed for
    each gene.

13
affy Package
  • To load the necessary packages,
  • > library(affy)
  • > library(affydata)
  • One of the main functions for reading in
    Affymetrix data is ReadAffy. It reads in data
    from CEL files and creates objects of class
    AffyBatch.
  • In this lab we will work mainly with the Dilution
    dataset, which is included in the affydata
    package. To load the dataset, type
  • > data(Dilution)
  • For a description of Dilution, type
  • > ?Dilution

14
affy classes and methods
  • One of the main classes in affy is the AffyBatch
    class.
  • > class(Dilution)
  • [1] "AffyBatch"
  • > slotNames(Dilution)
  • [1] "cdfName" "nrow" "ncol" "exprs" "se.exprs" "phenoData"
    [7] "description" "annotation" "notes"
  • > Dilution
  • AffyBatch object
  • size of arrays=640x640 features (12805 kb)
  • cdf=HG_U95Av2 (12625 affyids)
  • number of samples=4
  • number of genes=12625
  • annotation=hgu95av2

15
affy classes and methods
  • The exprs slot contains a matrix with columns
    corresponding to chips and rows to individual
    probes on the chip. To obtain the matrix of
    intensities for all four chips,
  • > e <- exprs(Dilution)
  • Probe-level PM and MM intensities can be accessed
    using the pm and mm methods.
  • > PM <- pm(Dilution)

16
affy classes and methods
  • > PM[1:5, ]
  •        20A   20B   10A   10B
  • [1,] 468.8 282.3 433.0 198.0
  • [2,] 430.0 265.0 308.5 192.8
  • [3,] 182.3 115.0 138.0  86.3
  • [4,] 930.0 588.0 752.8 392.5
  • [5,] 171.0 128.0 152.3  97.8

17
affy classes and methods
  • To get the probe-set names (Affy IDs),
  • > gnames <- geneNames(Dilution)
  • > length(gnames)
  • [1] 12625
  • > gnames[1:5]
  • [1] "1000_at" "1001_at" "1002_f_at" "1003_s_at"
    [5] "1004_at"

18
affy classes and methods
  • To produce boxplots of log base 2 probe
    intensities,
  • > boxplot(Dilution, col = c(2, 2, 3, 3))
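
What such a boxplot summarizes can be illustrated without the affy package. The sketch below rebuilds the five PM rows printed on slide 16 as an ordinary matrix (a tiny stand-in for the full probe-level data, used here only for illustration), takes log base 2, and draws one box per chip with the same colors.

```r
# First five PM probes for the four Dilution chips, copied from slide 16.
PM <- matrix(
  c(468.8, 282.3, 433.0, 198.0,
    430.0, 265.0, 308.5, 192.8,
    182.3, 115.0, 138.0,  86.3,
    930.0, 588.0, 752.8, 392.5,
    171.0, 128.0, 152.3,  97.8),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("20A", "20B", "10A", "10B"))
)

# Intensities are compared on the log base 2 scale.
log_pm <- log2(PM)

# One box per chip; colors 2 and 3 mark the two concentration groups.
boxplot(log_pm, col = c(2, 2, 3, 3), ylab = "log2 PM intensity")

# Per-chip medians, i.e. the center line of each box.
chip_medians <- apply(log_pm, 2, median)
chip_medians
```

With only five probes this is not a substitute for the real plot, but it shows the structure: columns are chips, and each box describes the distribution of log intensities on one chip.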
19
affy classes and methods
  • The boxplots show that the Dilution data needs
    normalization. As described in the dataset help
    file and in the phenoData slot (pData(Dilution)),
    two concentrations of mRNA were used and, for
    each concentration, two scanners were used. From
    the plots, we note that scanner effects seem
    stronger than concentration effects (different
    colors). In other words, chips that should be the
    same are different, and chips that should be
    different are similar.
  • Because different mRNA concentrations were used,
    we perform normalization within concentration
    groups. The default procedure implemented in the
    normalize method is probe-level quantile
    normalization.

20
affy classes and methods
  • > Dil20 <- normalize(Dilution[, 1:2])
  • > Dil10 <- normalize(Dilution[, 3:4])
  • > normDil <- merge(Dil20, Dil10)
  • > boxplot(normDil, col = c(2, 2, 3, 3))
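
To make the idea of quantile normalization within concentration groups concrete, here is a minimal base R sketch. It is not the affy implementation: quantile_normalize is a hypothetical helper written for this illustration, ties are broken arbitrarily by position, and the input is just the five PM rows from slide 16 rather than the full AffyBatch.

```r
# Toy input: the five PM rows printed on slide 16.
PM <- matrix(
  c(468.8, 282.3, 433.0, 198.0,
    430.0, 265.0, 308.5, 192.8,
    182.3, 115.0, 138.0,  86.3,
    930.0, 588.0, 752.8, 392.5,
    171.0, 128.0, 152.3,  97.8),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("20A", "20B", "10A", "10B"))
)

# Hypothetical helper: force every column to share the same empirical
# distribution, namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each chip
  sorted <- apply(x, 2, sort)                         # sorted values per chip
  target <- rowMeans(sorted)                          # reference distribution
  out <- apply(ranks, 2, function(r) target[r])       # map ranks to reference
  dimnames(out) <- dimnames(x)
  out
}

# Normalize each concentration group separately, then recombine.
norm20 <- quantile_normalize(PM[, 1:2])
norm10 <- quantile_normalize(PM[, 3:4])
normPM <- cbind(norm20, norm10)
normPM
```

After this step the two chips within a concentration group have identical sets of values, while the two groups can still differ from each other.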

21
affy classes and methods
  • We view the process of going from probe-level
    intensities to gene-level expression measures as
    a three-step procedure consisting of (i)
    background adjustment, (ii) normalization, and
    (iii) summarization. The affy package provides
    implementations of a number of methods for each
    of these steps: (i) background correction, e.g.,
    none, MAS 5.0, convolution; (ii) normalization,
    e.g., probe-level quantile, cyclic loess,
    contrast loess; (iii) summarization, e.g., MAS
    4.0, MAS 5.0, MBEI (Li & Wong, 2001), median
    polish for an additive linear model (Irizarry et
    al., 2003).
  • The Robust Multichip Average (RMA) method refers
    to the sequence of convolution background
    adjustment, probe-level quantile normalization,
    and median polish summarization for gene-specific
    additive models with probe and chip effects.
  • > rmaDil <- rma(Dilution)
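
The summarization step alone can be sketched with medpolish from the base stats package, which fits an additive model with row (probe) and column (chip) effects by Tukey's median polish. This is an illustration under two stated assumptions, not a reproduction of rma: the five PM rows from slide 16 are treated as if they formed a single probe set, and the background adjustment and normalization steps are skipped.

```r
# Toy probe set: the five PM rows printed on slide 16, treated here
# (as an assumption) as the probes of one gene across the four chips.
PM <- matrix(
  c(468.8, 282.3, 433.0, 198.0,
    430.0, 265.0, 308.5, 192.8,
    182.3, 115.0, 138.0,  86.3,
    930.0, 588.0, 752.8, 392.5,
    171.0, 128.0, 152.3,  97.8),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("20A", "20B", "10A", "10B"))
)

# Additive model on the log2 scale:
#   log2(PM[i, j]) = overall + probe_effect[i] + chip_effect[j] + residual
fit <- medpolish(log2(PM), trace.iter = FALSE)

# Chip-level expression summary for this probe set.
expr <- fit$overall + fit$col
expr

# Probe effects describe how bright each probe is relative to the others.
fit$row
```

The result is one log2-scale expression value per chip, which is the shape of output that a summarization method produces for each probe set.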

22
affy classes and methods
  • CDF data packages
  • Data packages providing CDF information can be
    downloaded from www.bioconductor.org. These
    packages contain environment objects which
    provide mappings between AffyIDs and matrices of
    probe locations, with rows corresponding to
    probe-pairs and columns to PM and MM cells. The
    CDF environment for the HGU95Av2 chip is already
    in the package. For information on the
    environment object, type
  • > ?hgu95av2cdf
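
The kind of mapping a CDF environment provides can be mimicked in base R. The sketch below builds a toy environment (toy_cdf is a hypothetical name, and the probe location indices are made up for illustration) keyed by two of the Affy IDs shown on slide 17, with one matrix per ID whose rows are probe-pairs and whose columns are the PM and MM cells.

```r
# A toy stand-in for a CDF environment. Location indices are invented.
toy_cdf <- new.env()

assign("1000_at",
       cbind(pm = c(101, 102, 103), mm = c(741, 742, 743)),
       envir = toy_cdf)
assign("1001_at",
       cbind(pm = c(201, 202), mm = c(841, 842)),
       envir = toy_cdf)

# List the probe-set IDs stored in the environment.
ls(toy_cdf)

# Look up the probe locations for one probe set.
locs <- get("1000_at", envir = toy_cdf)
nrow(locs)      # number of probe-pairs in this probe set
locs[, "pm"]    # locations of the PM cells
```

A real CDF environment such as hgu95av2cdf can be queried with the same ls and get calls once the corresponding package is loaded.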