Introduction to Bioinformatics - PowerPoint PPT Presentation

Provided by: compbi
Transcript and Presenter's Notes

1
Introduction to Bioinformatics
Larry Hunter
Center for Computational Pharmacology
Larry.Hunter_at_uchsc.edu
2817b SoM, 315-1094
2
Overview
  • Bioinformatics in the post-genomic era
  • Data sources: tapping public science
  • Some useful databases you should know
  • Metadata and integrated sources
  • Inference: transforming data into knowledge
  • Principles and foundations
  • Example applications
  • Advice for your own efforts

3
Bioinformatics now
  • Computational methods routine (although not quite
    as integrated as statistical analysis)
  • High throughput assays increase the stakes.
    Pharma has undergone radical change!
  • Competitive advantages
  • Getting more out of your data
  • Finding other relevant information faster.
  • Exploratory, hypothesis-generating analyses

4
Some challenges
  • Staying current: which techniques are accepted?
    Which are best?
  • Finding external data and algorithms. Which web
    sites? Which commercial offerings?
  • Managing your own data: beyond spreadsheets and
    eyeballing
  • Finding and managing collaborative relationships,
    especially with computer people.

5
Taking advantage of public data
  • An enormous amount of high quality data is
    available free.

6
How to find public data
  • Start with http://www.ncbi.nlm.nih.gov
  • Consult the Jan 2001 issue of Nucleic Acids
    Research: http://nar.oupjournals.org/content/vol28/issue1
  • Metadata sites (listings of databases):
      • The NAR issue has an associated site
      • http://www.genome.ad.jp/kegg/kegg4.html
      • http://bioinformatics.weizmann.ac.il/mb/molecular_biol_databases.html
  • Commercial portals, e.g. www.biolinks.com,
    www.doubletwist.com

7
NCBI: Ground Zero
  • The National Center for Biotechnology Information
    is the first place to go. Sequences, structures,
    PubMed, taxonomy, medical genetics, etc.
  • Spend some time learning all it has to offer.
    There are good online tutorials at
    http://www.ncbi.nlm.nih.gov/Education/. Look at
    the site map, not just the front page!
  • Check out PROW (Protein Reviews on the Web), a
    journal/reference source at NCBI.

8
An Abundance of Specialized Data
  • Gene sequences and protein structures are not all
    there is!
  • Metabolic, regulatory and signaling pathway data
    is growing rapidly
  • Carbohydrates, drugs, lipids, diseases,
    organisms, etc. all have their own public
    databases

9
(No Transcript)
10
(No Transcript)
11
Integrated data sources
  • Like the data, the sheer volume of databases can
    be overwhelming.
  • Integrated systems offer organized summaries of
    diverse datasets.
  • An excellent starting place for information about
    human genes is GeneCards:
    http://bioinformatics.weizmann.ac.il/cards/

12
(No Transcript)
13
Inference from biological data
  • Goal is to move from raw data to meaningful
    conclusions.
  • Examples: detecting remote homologues,
    identifying coregulated genes, predicting binding
    affinities
  • Broadly applicable computational techniques:
    clustering, discrimination/regression, density
    estimation

14
Clustering
  • Begin with a set of instances (e.g. gene
    sequences, protein structures) and a distance
    metric.
  • Create a collection of groups of the instances
    which are more similar to each other than they
    are to instances in other groups. Groups can
    themselves be clustered hierarchically.
  • Examples:
      • Building taxonomic trees from aligned sequences
      • Identifying coregulated genes from expression
        arrays
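
The recipe on this slide (instances plus a distance metric, merged into groups) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequences are made up, Hamming distance stands in for a real alignment-based metric, and single-linkage merging is one of several possible linkage rules. The names `hamming` and `agglomerate` are hypothetical, not from the presentation.

```python
from itertools import combinations


def hamming(a, b):
    """Distance metric: number of mismatched positions in two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def agglomerate(items, distance, n_groups):
    """Single-linkage agglomerative clustering.

    Start with one group per instance and repeatedly merge the two groups
    whose closest members are nearest, until n_groups remain.
    """
    groups = [[item] for item in items]
    while len(groups) > n_groups:
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda p: min(
                distance(a, b) for a in groups[p[0]] for b in groups[p[1]]
            ),
        )
        groups[i] = groups[i] + groups[j]  # merge group j into group i
        del groups[j]                      # j > i, so index i is unaffected
    return groups


# Toy aligned sequences, invented for illustration only.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TTGTCCGT", "TTGTCCGA"]
clusters = agglomerate(seqs, hamming, n_groups=2)
print(clusters)
```

Recording the order of the merges, rather than stopping at a fixed number of groups, would give the hierarchical tree mentioned in the bullet above.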

15
Discrimination/Regression
  • Induce a predictor of some aspect (the label) of
    an instance from other aspects. Numeric
    predictions are regression, class predictions
    discrimination.
  • Beginning with a training set of labeled
    instances, produce a model which accurately
    predicts the labels of other (unlabeled)
    instances.
  • Examples:
      • Protein secondary structure prediction
      • Prediction of drug response from gene expression
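
As a minimal sketch of discrimination, the following Python trains a nearest-centroid classifier on a labeled training set and predicts the label of a new instance. The two-feature "expression" vectors and the labels are invented for illustration, and nearest-centroid is just one simple choice of model, not a method the presentation names.

```python
from math import dist


def train_centroids(examples):
    """Average the feature vectors of each class in the labeled training set."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, value in enumerate(features):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Predict the label whose class centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical training set: (expression of two genes, drug response label).
training = [
    ([2.1, 0.3], "responder"),
    ([1.9, 0.5], "responder"),
    ([0.2, 1.8], "non-responder"),
    ([0.4, 2.2], "non-responder"),
]
model = train_centroids(training)
print(predict(model, [1.8, 0.6]))
```

Replacing the class labels with numeric targets and the centroid lookup with a fitted function would turn the same train-then-predict pattern into regression.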

16
Density estimation
  • Produces a method for assessing the probability
    of an observation, like a histogram.
  • Uses a set of observations (and, optionally, a
    distance function)
  • Examples:
      • Recognition of members of protein families
      • Evaluation of diversity of compound libraries
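
One common way to get a "smoothed histogram" is a Gaussian kernel density estimate. The sketch below assumes one-dimensional observations and a hand-picked bandwidth; both the data and the bandwidth value are illustrative assumptions.

```python
from math import exp, pi, sqrt


def gaussian_kde(observations, bandwidth):
    """Return a function that estimates probability density at a point.

    Each observation contributes a Gaussian bump of width `bandwidth`;
    the estimate is the average of those bumps.
    """
    n = len(observations)
    norm = 1.0 / (n * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(
            exp(-0.5 * ((x - o) / bandwidth) ** 2) for o in observations
        )

    return density


# Hypothetical scores observed for known members of some family.
scores = [4.8, 5.1, 5.0, 5.3, 4.9, 5.2]
density = gaussian_kde(scores, bandwidth=0.3)
print(density(5.0), density(8.0))  # high density near the data, low far away
```

A new observation that falls in a low-density region is unlikely under the model, which is the basis for using density estimates to recognize family members.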

17
Particular applications
  • Those broad computational approaches have many
    particular instantiations and applications
  • Two examples:
      • Hidden Markov models for multiple sequence
        alignment and homologous family discrimination
      • Analysis of gene expression array data:
          • Finding genes that vary significantly
          • Estimating the number of clusters
          • Finding high-order discriminators

18
Hidden Markov models
  • Technique from speech understanding is now widely
    used in sequence analysis
  • Good software and tutorials on the web:
    http://www.cse.ucsc.edu/research/compbio/ismb99.tutorial.html
  • HMMs infer unobserved states that influence the
    probability distribution of observed states
  • Most common use is to model sequence families.
    Unobserved states are indel events.
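
To make "unobserved states that influence the probability distribution of observed states" concrete, here is a sketch of the forward algorithm, which computes the probability of an observed sequence by summing over all hidden state paths. The two-state model and all of its probabilities are invented for illustration; real profile HMMs have match, insert, and delete states per alignment column.

```python
def forward_probability(obs, states, start, trans, emit):
    """Probability of the observed sequence, summed over all hidden paths."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical model: a "conserved" state that prefers A, and a
# "variable" state that emits all four nucleotides equally.
states = ("conserved", "variable")
start = {"conserved": 0.5, "variable": 0.5}
trans = {
    "conserved": {"conserved": 0.9, "variable": 0.1},
    "variable": {"conserved": 0.2, "variable": 0.8},
}
emit = {
    "conserved": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "variable": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

p_family_like = forward_probability("AAAA", states, start, trans, emit)
p_other = forward_probability("CGTC", states, start, trans, emit)
print(p_family_like, p_other)
```

Comparing such probabilities (usually as log-odds against a background model) is how a trained model scores whether a new sequence belongs to the family.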

19
HMM example
  • States are amino acid distributions, insertions
    and deletions. Train with sequences from a
    single protein family. Resulting model
    recognizes other members of the family, including
    distant homologs

20
Training an HMM: The EM Algorithm
  • Start with random state and transition
    probabilities
  • Use model to "parse" training data, assigning
    each to most likely path
  • Change the state and transition probabilities to
    reflect the assignments
  • Iterate until convergence
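
The four steps above, taken literally (assign each sequence to its single most likely path, then re-estimate), correspond to what is usually called Viterbi training, a "hard" variant of EM; full Baum-Welch EM would instead weight every path by its probability. The sketch below follows the hard variant. The pseudocounts, the fixed seed, the state names, and the toy sequences are all assumptions added for illustration.

```python
import random
from math import log


def viterbi(obs, states, start, trans, emit):
    """Most likely hidden-state path for obs, computed in log space."""
    score = {s: log(start[s]) + log(emit[s][obs[0]]) for s in states}
    back = []
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log(trans[r][s]))
            pointers[s] = best
            score[s] = prev[best] + log(trans[best][s]) + log(emit[s][symbol])
        back.append(pointers)
    path = [max(states, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def viterbi_train(seqs, states, alphabet, iterations=20, seed=0, pseudo=1.0):
    rng = random.Random(seed)

    def random_dist(keys):
        return normalize({k: rng.random() + 0.1 for k in keys})

    # Step 1: start with random probabilities.
    start = random_dist(states)
    trans = {s: random_dist(states) for s in states}
    emit = {s: random_dist(alphabet) for s in states}
    for _ in range(iterations):
        # Pseudocounts keep every probability nonzero so log() stays defined.
        c_start = {s: pseudo for s in states}
        c_trans = {s: {r: pseudo for r in states} for s in states}
        c_emit = {s: {a: pseudo for a in alphabet} for s in states}
        for obs in seqs:
            # Step 2: "parse" each sequence with its most likely path.
            path = viterbi(obs, states, start, trans, emit)
            c_start[path[0]] += 1
            for a, b in zip(path, path[1:]):
                c_trans[a][b] += 1
            for s, symbol in zip(path, obs):
                c_emit[s][symbol] += 1
        # Step 3: re-estimate probabilities from the assignments.
        new = (
            normalize(c_start),
            {s: normalize(c_trans[s]) for s in states},
            {s: normalize(c_emit[s]) for s in states},
        )
        # Step 4: stop once the parameters no longer change.
        if new == (start, trans, emit):
            break
        start, trans, emit = new
    return start, trans, emit


training_seqs = ["AAAACCCC", "AAACCCCC", "AACCCC"]
hmm_states = ("s1", "s2")
start, trans, emit = viterbi_train(training_seqs, hmm_states, "AC")
print(viterbi("AAAACCCC", hmm_states, start, trans, emit))
```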

21
Trained HMM is a very sensitive discriminator for
families
  • HMM trained with 5 proteins has 20 fewer errors
    than single-sequence probabilistic
    Smith-Waterman; using ten sequences gives 51
    fewer errors.
  • Weighting Hidden Markov Models for Maximum
    Discrimination. R. Karchin and R. Hughey,
    Bioinformatics, 14(9):772-782, 1998.
  • pfam.wustl.edu (2000 pretrained models)
  • www.cse.ucsc.edu/research/compbio/sam.html

22
Expression Array Analysis
  • Gene expression arrays are a popular new
    technique for assaying the expression level of
    tens of thousands of genes simultaneously
  • Many problems arise in analyzing this data
  • We are in collaboration with many groups for
    analysis, and developing our own tools

23
Expression Analysis Issues
  • Identifying genes that changed significantly over
    a set of observations. How much change is
    enough? What's wrong with 2-fold?
  • Estimating the number of expression clusters.
    How many groups of genes are there?
  • Finding discriminators based on expression levels
    (e.g. for response to drugs)

24
Variation filters for expression
  • Most basic question in array analysis still
    open: Which genes changed significantly?
  • Several challenges:
      • Unknown variances, so hard to compare small
        numbers of observations
      • Noise is inversely correlated with signal
      • Small (30%) changes known to be biologically
        important.

25
A novel approach to detecting variable genes
  • Assume most genes do not change, so use median
    variance as estimate of noise
  • Gene variance test: the ratio of a gene's (N-1)s²
    to the median s² is χ² distributed with N-1
    degrees of freedom
  • Problem is that overall variance may be low, but
    one difference can be large
  • Investigating statistics on distances.
  • Use database for estimates of s² for each gene
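
The gene variance test described above can be sketched with the Python standard library alone. The expression values are invented, and the chi-square tail probability is computed with a simple series expansion that is adequate for moderate statistics but would overflow for very large ones; a statistics library would normally be used instead. The names `chi2_sf` and `variable_genes` are hypothetical.

```python
from math import exp, lgamma, log
from statistics import median, variance


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the series for the regularized lower incomplete gamma function.
    Assumption: x is moderate (a few thousand at most), otherwise the
    series terms overflow.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * exp(a * log(z) - z - lgamma(a))
    return max(0.0, 1.0 - lower)


def variable_genes(expression):
    """Map each gene to the p-value of its variance against the median variance.

    expression: dict of gene name -> list of N measurements.
    """
    variances = {gene: variance(values) for gene, values in expression.items()}
    noise = median(list(variances.values()))  # most genes assumed unchanged
    p_values = {}
    for gene, values in expression.items():
        df = len(values) - 1
        statistic = df * variances[gene] / noise  # (N-1) s^2 / median s^2
        p_values[gene] = chi2_sf(statistic, df)
    return p_values


# Hypothetical expression levels: only geneD varies strongly.
expression = {
    "geneA": [1.0, 1.1, 0.9, 1.0],
    "geneB": [2.0, 2.1, 1.9, 2.0],
    "geneC": [0.5, 0.6, 0.4, 0.5],
    "geneD": [1.0, 3.0, 0.2, 2.5],
    "geneE": [1.5, 1.4, 1.6, 1.5],
}
p_values = variable_genes(expression)
for gene, p in sorted(p_values.items()):
    print(gene, p)
```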

26
Estimating the number of clusters
  • Required input for many clustering methods
    (e.g., k-means, SOM)
  • Hartigan test: ratio of sum squared error for K
    vs. K-1 clusters is F distributed (if K and K-1
    differ by a single split class)
  • Assume multivariate Normal for each cluster:
    -2 log Pr(K)/Pr(K-1) is χ² distributed.
  • Monte Carlo cross-validation estimates, too.
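
A rough sketch of the sum-squared-error comparison follows. It does not perform the F test from the slide; instead it computes Hartigan's index, (SSE(k)/SSE(k+1) - 1)(n - k - 1), and applies the commonly quoted rule of thumb that a value above 10 justifies adding a cluster. The one-dimensional data, the deterministic k-means initialization, and the function names are assumptions made to keep the example small and repeatable.

```python
def kmeans_sse(points, k, iterations=50):
    """Run 1-D k-means (Lloyd's algorithm); return within-cluster squared error."""
    data = sorted(points)
    # Deterministic start: centers spread evenly through the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for x in data:
            nearest = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            groups[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)
        ]
    return sum(min((x - c) ** 2 for c in centers) for x in data)


def hartigan_index(points, k):
    """Hartigan's index for going from k to k + 1 clusters."""
    n = len(points)
    return (kmeans_sse(points, k) / kmeans_sse(points, k + 1) - 1) * (n - k - 1)


# Toy data with three well-separated groups (invented for illustration).
points = [
    1.0, 1.2, 0.8, 1.1, 0.9,
    5.0, 5.2, 4.8, 5.1, 4.9,
    9.0, 9.2, 8.8, 9.1, 8.9,
]
for k in (1, 2, 3):
    print(k, hartigan_index(points, k))
```

On this toy data the index is large for k = 1 and k = 2 and small for k = 3, which points to three clusters.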

27
Estimates of K on artificial data
28
Estimates of K using real data: 3454 rat genes, 5
chips (alcohol sensitive vs. not)
29
Discrimination/Regression
  • Clusters of co-expressed genes are interesting,
    but just a first step. Really want predictive
    models, gene networks, etc.
  • Biology tells us that the predictors are likely
    to involve interactions more than linear effects.
  • Traditional statistics is not strong in
    non-linear models, high order interactions, large
    datasets.

30
Linear models vs. MMCs SLAM
31
Some Advice
  • A little bioinformatics is good for you!
  • Know how to use web data resources
  • Know the kinds of analyses that are possible
  • Sequence and structure computations are
    widespread and (fairly) easy. Finding exons,
    remote homologies, structural domains, fold
    families, etc. are routine.
  • Generic clustering, discrimination/regression and
    density estimation tools exist (neural
    networks...)
  • Collaboration with bioinformaticians is no worse
    than with statisticians... :-)