Title: Introduction to Bioinformatics
1Introduction to Bioinformatics
Larry Hunter
Center for Computational Pharmacology
Larry.Hunter_at_uchsc.edu
2817b SoM
315-1094
2Overview
- Bioinformatics in the post-genomic era
- Data sources: tapping public science
- Some useful databases you should know
- Metadata and integrated sources
- Inference: transforming data into knowledge
- Principles and foundations
- Example applications
- Advice for your own efforts
3Bioinformatics now
- Computational methods are routine (although not quite as integrated as statistical analysis)
- High-throughput assays increase the stakes. Pharma has undergone radical change!
- Competitive advantages:
- Getting more out of your data
- Finding other relevant information faster
- Exploratory, hypothesis-generating analyses
4Some challenges
- Staying current: which techniques are accepted? Which are best?
- Finding external data and algorithms. Which web sites? Which commercial offerings?
- Managing your own data: beyond spreadsheets and eyeballing
- Finding and managing collaborative relationships, especially with computer people
5Taking advantage of public data
- An enormous amount of high quality data is
available free.
6How to find public data
- Start with http://www.ncbi.nlm.nih.gov
- Consult the Jan 2001 issue of Nucleic Acids Research: http://nar.oupjournals.org/content/vol28/issue1
- Metadata sites (listings of databases):
- The NAR issue has an associated site
- http://www.genome.ad.jp/kegg/kegg4.html
- http://bioinformatics.weizmann.ac.il/mb/molecular_biol_databases.html
- Commercial portals, e.g. www.biolinks.com, www.doubletwist.com
7NCBI: Ground Zero
- The National Center for Biotechnology Information is the first place to go: sequences, structures, PubMed, taxonomy, medical genetics, etc.
- Spend some time learning all it has to offer. The online tutorials at http://www.ncbi.nlm.nih.gov/Education/ are good. Look at the site map, not just the front page!
- Check out PROW (Protein Reviews on the Web), a journal/reference source at NCBI.
8An Abundance of Specialized Data
- Gene sequences and protein structures are not all there is!
- Metabolic, regulatory and signaling pathway data is growing rapidly
- Carbohydrates, drugs, lipids, diseases, organisms, etc. all have their own public databases
11Integrated data sources
- Like the data, the sheer volume of databases can be overwhelming.
- Integrated systems offer organized summaries of diverse datasets.
- An excellent starting place for information about human genes is GeneCards: http://bioinformatics.weizmann.ac.il/cards/
13Inference from biological data
- Goal is to move from raw data to meaningful conclusions.
- Examples: detecting remote homologues, identifying coregulated genes, predicting binding affinities
- Broadly applicable computational techniques: clustering, discrimination/regression, density estimation
14Clustering
- Begin with a set of instances (e.g. gene sequences, protein structures) and a distance metric.
- Create a collection of groups of the instances which are more similar to each other than they are to instances in other groups. Groups can themselves be hierarchically clustered.
- Examples:
- Building taxonomic trees from aligned sequences
- Identifying coregulated genes from expression arrays
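The recipe above (instances plus a distance metric, merged into groups) can be sketched as a small agglomerative clustering routine. This is only an illustration: the single-linkage merge rule, the Hamming distance, the function names and the toy sequences are all assumptions, not something the slides specify.

```python
from itertools import combinations


def hamming(a, b):
    """Distance metric: number of mismatched positions in two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(instances, distance, n_groups):
    """Agglomerative clustering: repeatedly merge the two closest groups."""
    groups = [[item] for item in instances]
    while len(groups) > n_groups:
        # Closest pair of groups, judged by their nearest members (single linkage).
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda pair: min(
                distance(a, b) for a in groups[pair[0]] for b in groups[pair[1]]
            ),
        )
        groups[i].extend(groups[j])  # i < j, so deleting j leaves i in place
        del groups[j]
    return groups


# Hypothetical aligned sequences, invented for this sketch.
toy_sequences = ["ACGTACGT", "ACGTACGA", "ACGTTCGA", "TTGCATGC", "TTGCATGG"]
clusters = single_linkage(toy_sequences, hamming, n_groups=2)
print(clusters)
```

Recording the order of the merges, rather than stopping at a fixed number of groups, would give the hierarchical tree mentioned in the slide.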
15Discrimination/Regression
- Induce a predictor of some aspect (the label) of an instance from its other aspects. Numeric predictions are regression; class predictions are discrimination.
- Begin with a training set of labeled instances
- Produce a model which accurately predicts the labels of other (unlabeled) instances.
- Examples:
- Protein secondary structure prediction
- Prediction of drug response from gene expression
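As a minimal sketch of discrimination, a k-nearest-neighbour classifier predicts the label of a new instance from a labeled training set. The choice of k-NN, the two-gene expression values and the "responder" labels are all invented for illustration; the slides do not name a particular method.

```python
import math
from collections import Counter


def knn_predict(training_set, query, k=3):
    """Predict a label by majority vote among the k nearest labeled instances."""
    neighbours = sorted(training_set, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (expression of two genes, observed drug response).
training_set = [
    ((1.0, 0.2), "responder"),
    ((0.9, 0.1), "responder"),
    ((1.2, 0.3), "responder"),
    ((0.1, 1.1), "non-responder"),
    ((0.2, 0.9), "non-responder"),
    ((0.0, 1.0), "non-responder"),
]

prediction = knn_predict(training_set, (0.95, 0.25))
print(prediction)
```

Replacing the majority vote with the mean of the neighbours' numeric labels would turn the same sketch into a regression.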
16Density estimation
- Produces a method for assessing the probability of an observation, like a histogram
- Uses a set of observations (and, optionally, a distance function)
- Examples:
- Recognition of members of protein families
- Evaluation of diversity of compound libraries
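One common way to build such an estimator is a Gaussian kernel density estimate, which behaves like a smoothed histogram. The sketch below assumes one-dimensional observations and a hand-picked bandwidth; both, along with the sample values, are illustrative assumptions.

```python
import math


def gaussian_kde(observations, bandwidth):
    """Return a function that estimates the probability density at a point x."""
    n = len(observations)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each observation contributes a Gaussian bump centred on itself.
        return norm * sum(
            math.exp(-0.5 * ((x - o) / bandwidth) ** 2) for o in observations
        )

    return density


# Hypothetical measurements: four similar values and one outlier.
observations = [1.0, 1.2, 0.9, 1.1, 5.0]
density = gaussian_kde(observations, bandwidth=0.5)

print(density(1.0), density(3.0))  # high near the bulk of the data, low in the gap
```

A new observation that falls where the estimated density is low is unlike the training set, which is the basis for using such models to recognize family members or to measure diversity.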
17Particular applications
- Those broad computational approaches have many particular instantiations and applications
- Two examples:
- Hidden Markov models for multiple sequence alignment and homologous family discrimination
- Analysis of gene expression array data:
- Finding genes that vary significantly
- Estimating the number of clusters
- Finding high-order discriminators
18Hidden Markov models
- A technique from speech understanding is now widely used in sequence analysis
- Good software and tutorials on the web: http://www.cse.ucsc.edu/research/compbio/ismb99.tutorial.html
- HMMs infer unobserved states that influence the probability distribution of observed states
- Most common use is to model sequence families. Unobserved states are indel events.
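To make "inferring unobserved states" concrete, here is a minimal sketch of Viterbi decoding, the standard way to recover the most probable hidden-state path of an HMM. The two-state "core"/"loop" model, its probabilities and the H/P (hydrophobic/polar) alphabet are toy assumptions, far simpler than a real profile HMM with match, insert and delete states.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for an observed sequence."""
    # best[t][s] = (log-probability of the best path ending in s at t, previous state)
    first = observations[0]
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
             for s in states}]
    for symbol in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][symbol]), prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model (all numbers are assumptions for illustration).
states = ["core", "loop"]
start_p = {"core": 0.5, "loop": 0.5}
trans_p = {"core": {"core": 0.8, "loop": 0.2}, "loop": {"core": 0.2, "loop": 0.8}}
emit_p = {"core": {"H": 0.8, "P": 0.2}, "loop": {"H": 0.2, "P": 0.8}}

path = viterbi("HHHPPP", states, start_p, trans_p, emit_p)
print(path)
```

Working in log space avoids numerical underflow on long sequences.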
19HMM example
- States are amino acid distributions, insertions
and deletions. Train with sequences from a
single protein family. Resulting model
recognizes other members of the family, including
distant homologs
20Training an HMM: The E/M Algorithm
- Start with random state and transition probabilities
- Use the model to "parse" the training data, assigning each sequence to its most likely path
- Change the state and transition probabilities to reflect the assignments
- Iterate until convergence
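The assign, re-estimate, repeat loop above can be sketched on a much simpler hidden-variable model than an HMM. In this stand-in, each training run of coin flips came from one of two coins of unknown bias, and the hidden quantity is which coin produced which run. The model, the data and the fixed starting guesses (used instead of random ones so that the result is repeatable) are all assumptions for illustration; this is not HMM training code.

```python
import math


def log_likelihood(heads, flips, p):
    """Log-probability of seeing `heads` heads in `flips` tosses of a coin with bias p."""
    return heads * math.log(p) + (flips - heads) * math.log(1.0 - p)


def hard_em(head_counts, flips, initial_biases, max_iter=100):
    """Alternate hard assignment and re-estimation until the assignments stop changing."""
    biases = list(initial_biases)
    assignment = None
    for _ in range(max_iter):
        # "Parse" step: assign each run to the coin that explains it best.
        new_assignment = [
            max(range(len(biases)), key=lambda k: log_likelihood(h, flips, biases[k]))
            for h in head_counts
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: re-estimate each bias from the runs assigned to it.
        for k in range(len(biases)):
            members = [h for h, a in zip(head_counts, assignment) if a == k]
            if members:
                estimate = sum(members) / (flips * len(members))
                biases[k] = min(max(estimate, 1e-6), 1.0 - 1e-6)  # keep logs finite
    return biases, assignment


# Hypothetical data: number of heads in each of six runs of 10 flips.
head_counts = [9, 8, 2, 1, 9, 3]
biases, assignment = hard_em(head_counts, flips=10, initial_biases=[0.6, 0.4])
print(biases, assignment)
```

For an HMM the "parse" step would assign each sequence to its most likely state path, and the update step would recount emissions and transitions along those paths. The soft-assignment variant, which weights every path by its probability instead of keeping only the best one, is usually what is meant by full EM.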
21Trained HMM is a very sensitive discriminator for families
- An HMM trained with 5 proteins has 20 fewer errors than single-sequence probabilistic Smith-Waterman; using ten sequences gives 51 fewer errors.
- Weighting Hidden Markov Models for Maximum Discrimination. R. Karchin and R. Hughey, Bioinformatics, 14(9):772-782, 1998.
- pfam.wustl.edu (2000 pretrained models)
- www.cse.ucsc.edu/research/compbio/sam.html
22Expression Array Analysis
- Gene expression arrays are a popular new technique for assaying the expression level of tens of thousands of genes simultaneously
- Many problems arise in analyzing this data
- We are in collaboration with many groups for analysis, and developing our own tools
23Expression Analysis Issues
- Identifying genes that changed significantly over a set of observations. How much change is enough? What's wrong with 2-fold?
- Estimating the number of expression clusters. How many groups of genes are there?
- Finding discriminators based on expression levels (e.g. for response to drugs)
24Variation filters for expression
- Most basic question in array analysis is still open: which genes changed significantly?
- Several challenges:
- Unknown variances, so it is hard to compare small numbers of observations
- Noise is inversely correlated with signal
- Small (30%) changes are known to be biologically important.
25A novel approach to detecting variable genes
- Assume most genes do not change, so use the median variance as an estimate of noise
- Gene variance test: the ratio of a gene's (N-1)s² to the median s² is χ² distributed with N-1 degrees of freedom
- Problem is that overall variance may be low, but one difference can be large
- Investigating statistics on distances.
- Use database for estimates of s² for each gene
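A minimal sketch of the gene variance test described above: compute each gene's sample variance, take the median variance as the noise estimate, and compare (N-1)s² / median s² against a χ² distribution with N-1 degrees of freedom. The expression values and gene names are invented. The tail-probability helper uses a closed form that holds only for an even number of degrees of freedom, a shortcut taken here to avoid external libraries.

```python
import math
from statistics import median, variance


def gene_variance_statistics(expression):
    """Map each gene to (N-1) * s^2 / median(s^2), with s^2 the sample variance."""
    variances = {gene: variance(values) for gene, values in expression.items()}
    noise = median(variances.values())  # assumes most genes do not change
    return {
        gene: (len(expression[gene]) - 1) * s2 / noise
        for gene, s2 in variances.items()
    }


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this shortcut only handles an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Hypothetical expression levels: 5 genes measured on 5 arrays.
expression = {
    "geneA": [1.0, 1.1, 0.9, 1.0, 1.0],
    "geneB": [2.0, 2.1, 1.9, 2.1, 1.9],
    "geneC": [0.5, 0.6, 0.4, 0.5, 0.5],
    "geneD": [1.0, 3.0, 0.5, 2.5, 0.2],
    "geneE": [3.0, 3.1, 3.0, 2.9, 3.1],
}

stats = gene_variance_statistics(expression)
df = 5 - 1
flagged = sorted(g for g, stat in stats.items() if chi2_sf_even_df(stat, df) < 0.01)
print(flagged)
```

With thousands of genes tested at once, the 0.01 cutoff used here would need a multiple-testing correction.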
26Estimating the number of clusters
- Required input for many clustering methods (e.g., k-means, SOM)
- Hartigan test: the ratio of sum squared error for K vs. K-1 clusters is F distributed (if K & K-1 differ by a single split class)
- Assume multivariate Normal for each cluster: -2 log[Pr(K)/Pr(K-1)] is χ² distributed.
- Monte Carlo cross-validation estimates, too.
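The sum-squared-error comparison can be sketched as follows. The code fits k-means for successive values of k and computes Hartigan's statistic, (SSE(k)/SSE(k+1) - 1)(n - k - 1), stopping at the first k where adding another cluster no longer helps much. The threshold of 10 is Hartigan's rule of thumb rather than a value from the slides. The one-dimensional data, the deterministic starting centres and the function names are assumptions for illustration.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm in one dimension with evenly spaced starting centres."""
    data = sorted(values)
    centres = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centres[i]))
            groups[nearest].append(x)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups, centres


def sse(groups, centres):
    """Sum of squared distances from each point to its cluster centre."""
    return sum((x - c) ** 2 for g, c in zip(groups, centres) for x in g)


def hartigan(values, k):
    """Hartigan's statistic comparing k clusters against k + 1 clusters."""
    sse_k = sse(*kmeans_1d(values, k))
    sse_next = sse(*kmeans_1d(values, k + 1))
    return (sse_k / sse_next - 1.0) * (len(values) - k - 1)


def estimate_k(values, max_k=6, threshold=10.0):
    """Smallest k whose statistic falls to the threshold or below (else max_k)."""
    for k in range(1, max_k + 1):
        if hartigan(values, k) <= threshold:
            return k
    return max_k


# Hypothetical data with three well-separated groups.
values = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8, 9.0, 9.1, 8.9, 9.2, 8.8]
print(estimate_k(values))
```

Real expression data would be multivariate and much noisier, and k-means would normally be restarted from several random initializations before comparing errors.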
27Estimates of K on artificial data
28Estimates of k using real data: 3454 rat genes, 5 chips (alcohol sensitive vs. not)
29Discrimination & Regression
- Clusters of co-expressed genes are interesting, but just a first step. We really want predictive models, gene networks, etc.
- Biology tells us that the predictors are likely to involve interactions more than linear effects.
- Traditional statistics is not strong in non-linear models, high-order interactions, or large datasets.
30Linear models vs. MMCs SLAM
31Some Advice
- A little bioinformatics is good for you!
- Know how to use web data resources
- Know the kinds of analyses that are possible
- Sequence and structure computations are widespread and (fairly) easy. Finding exons, remote homologies, structural domains, fold families, etc. is routine.
- Generic clustering, discrimination/regression and density estimation tools exist (neural networks...)
- Collaboration with bioinformaticians is no worse than with statisticians.... :-)