Introduction to Bioinformatics

Provided by: compbi

1
Introduction to Bioinformatics
Larry Hunter
Center for Computational Pharmacology
Larry.Hunter_at_uchsc.edu
2817b SoM 315-1094
2
Overview

Bioinformatics in the post-genomic era
Data sources: tapping public science
Some useful databases you should know
Metadata and integrated sources
Inference: transforming data into knowledge
Principles and foundations
Example applications
Advice for your own efforts

3
Bioinformatics now

Computational methods routine (although not quite
as integrated as statistical analysis)
High throughput assays increase the stakes.
Pharma has undergone radical change!
Competitive advantages
Getting more out of your data
Finding other relevant information faster.
Exploratory, hypothesis-generating analyses

4
Some challenges

Staying current: which techniques are accepted?
Which are best?
Finding external data and algorithms. Which web
sites? Which commercial offerings?
Managing your own data: beyond spreadsheets and
eyeballing
Finding and managing collaborative relationships,
especially with computer people.

5
Taking advantage of public data

An enormous amount of high quality data is
available free.

6
How to find public data

Start with http://www.ncbi.nlm.nih.gov
Consult the Jan 2001 issue of Nucleic Acids
Research: http://nar.oupjournals.org/content/vol28/issue1
Metadata sites (listings of databases)
The NAR issue has an associated site
http://www.genome.ad.jp/kegg/kegg4.html
http://bioinformatics.weizmann.ac.il/mb/molecular_biol_databases.html
Commercial portals, e.g. www.biolinks.com,
www.doubletwist.com

7
NCBI: Ground Zero

The National Center for Biotechnology Information
is the first place to go. Sequences, structures,
PubMed, taxonomy, medical genetics, etc.
Spend some time learning all it has to offer.
There are good online tutorials at
http://www.ncbi.nlm.nih.gov/Education/
Look at the site map, not just the front page!
Check out PROW (Protein Reviews on the Web), a
journal/reference source at NCBI.

8
An Abundance of Specialized Data

Gene sequences and protein structures are not all
there is!
Metabolic, regulatory and signaling pathway data
is growing rapidly
Carbohydrates, drugs, lipids, diseases,
organisms, etc. all have their own public
databases

9
(No Transcript)
10
(No Transcript)
11
Integrated data sources

Like the data, the sheer volume of databases can
be overwhelming.
Integrated systems offer organized summaries of
diverse datasets.
An excellent starting place for information about
human genes is GeneCards:
http://bioinformatics.weizmann.ac.il/cards/

12
(No Transcript)
13
Inference from biological data

Goal is to move from raw data to meaningful
conclusions.
Examples: detecting remote homologues,
identifying coregulated genes, predicting binding
affinities
Broadly applicable computational techniques:
clustering, discrimination/regression, density
estimation

14
Clustering

Begin with a set of instances (e.g. gene
sequences, protein structures) and a distance
metric.
Create a collection of groups of the instances
which are more similar to each other than they
are to instances in other groups. The groups can
themselves be clustered hierarchically.
Examples:
Building taxonomic trees from aligned sequences
Identifying coregulated genes from expression
arrays
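
As a rough illustration of the recipe above (instances, a distance metric, and groups that can themselves be merged hierarchically), here is a minimal sketch of average-linkage agglomerative clustering. The toy aligned sequences, the Hamming distance, and the function names are assumptions chosen for the example and do not come from the slides.

```python
from itertools import combinations

def hamming(a, b):
    """Distance metric: number of mismatched positions in two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def average_linkage(c1, c2, seqs):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(hamming(seqs[i], seqs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(seqs, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [(name,) for name in seqs]
    merges = []  # record of merges, i.e. the hierarchy from the bottom up
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], seqs))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return clusters, merges

# Hypothetical aligned toy sequences (not real data).
toy = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TTGTCCGT",
    "s4": "TTGTCCGA",
}
clusters, merges = agglomerate(toy, 2)
groups = sorted(sorted(c) for c in clusters)
print(groups)
```

Running the merge loop down to a single cluster and reading `merges` in order gives the tree structure that taxonomic applications rely on.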

15
Discrimination/Regression

Induce a predictor of some aspect (the label) of
an instance from other aspects. Numeric
predictions are regression; class predictions are
discrimination.
Begin with a training set of labeled instances.
Produce a model which accurately predicts the
labels of other (unlabeled) instances.
Examples:
Protein secondary structure prediction
Prediction of drug response from gene expression
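
The train-then-predict pattern can be sketched with one of the simplest discriminators, a nearest-centroid classifier. This is only an illustration: the two-gene expression values, the label names, and the choice of classifier are assumptions, not something the slides prescribe.

```python
from collections import defaultdict
from math import dist

def train_centroids(examples):
    """examples: list of (feature_vector, label). Returns label -> mean vector."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical training set: expression of two genes per sample, with a label.
training = [
    ([5.1, 1.0], "responder"),
    ([4.8, 1.3], "responder"),
    ([1.2, 4.9], "non-responder"),
    ([0.9, 5.2], "non-responder"),
]
model = train_centroids(training)
prediction = predict(model, [4.5, 1.5])  # an unlabeled instance
print(prediction)
```

Replacing the class label with a number and the centroid lookup with a fitted function turns the same pattern into regression.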

16
Density estimation

Produces a method for assessing the probability
of an observation, much like a histogram.
Uses a set of observations (and, optionally, a
distance function)
Examples:
Recognition of members of protein families
Evaluation of diversity of compound libraries
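
One common way to get a smooth, histogram-like density estimate is a Gaussian kernel density estimate. The sketch below assumes one-dimensional observations and a hand-picked bandwidth. The observation values and the bandwidth are illustrative assumptions.

```python
import math

def gaussian_kde(observations, bandwidth):
    """Return a function that estimates the probability density at a point x."""
    n = len(observations)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each observation contributes a Gaussian bump centered on itself.
        return norm * sum(math.exp(-0.5 * ((x - o) / bandwidth) ** 2)
                          for o in observations)
    return density

# Hypothetical scores for known members of a family.
scores = [1.0, 1.2, 0.9, 1.1, 1.05]
density = gaussian_kde(scores, bandwidth=0.2)

typical = density(1.0)   # near the observations: high density
unusual = density(3.0)   # far from the observations: low density
area = sum(density(i * 0.01) * 0.01 for i in range(-500, 700))  # should be close to 1
```

A new observation with very low estimated density is unlikely to belong to the same population, which is the idea behind using density estimates to recognize family members.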

17
Particular applications

Those broad computational approaches have many
particular instantiations and applications
Two examples
Hidden Markov models for multiple sequence
alignment and homologous family discrimination
Analysis of gene expression array data
Finding genes that vary significantly
Estimating the number of clusters
Finding high-order discriminators

18
Hidden Markov models

Technique from speech understanding is now widely
used in sequence analysis
Good software and tutorials on the web:
http://www.cse.ucsc.edu/research/compbio/ismb99.tutorial.html
HMMs infer unobserved states that influence the
probability distribution of observed states
Most common use is to model sequence families.
Unobserved states are indel events.

19
HMM example

States are amino acid distributions, insertions
and deletions. Train with sequences from a
single protein family. Resulting model
recognizes other members of the family, including
distant homologs
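
To make "recognizes other members of the family" concrete, the sketch below scores a sequence by its log-odds under a small HMM versus a background model, using the forward algorithm. It is a simplified illustration: a real profile HMM has match, insert and delete states, while this toy model has two emitting states over a reduced H/P (hydrophobic/polar) alphabet. All probabilities and names are assumptions made up for the example.

```python
import math

def forward_log_likelihood(seq, start, trans, emit):
    """log P(seq | model), summed over all hidden state paths (forward algorithm)."""
    states = list(start)
    alpha = {s: start[s] * emit[s][seq[0]] for s in states}
    log_like = 0.0
    for symbol in seq[1:]:
        scale = sum(alpha.values())  # rescale each step to avoid underflow
        log_like += math.log(scale)
        alpha = {s: sum(alpha[r] / scale * trans[r][s] for r in states) * emit[s][symbol]
                 for s in states}
    return log_like + math.log(sum(alpha.values()))

# Hypothetical "family" model: a hydrophobic core state and a polar loop state.
family = dict(
    start={"core": 0.5, "loop": 0.5},
    trans={"core": {"core": 0.9, "loop": 0.1}, "loop": {"core": 0.1, "loop": 0.9}},
    emit={"core": {"H": 0.9, "P": 0.1}, "loop": {"H": 0.2, "P": 0.8}},
)
# Background model: one state emitting H and P with equal probability.
background = dict(
    start={"bg": 1.0},
    trans={"bg": {"bg": 1.0}},
    emit={"bg": {"H": 0.5, "P": 0.5}},
)

def log_odds(seq):
    """Positive scores mean the family model explains seq better than background."""
    return forward_log_likelihood(seq, **family) - forward_log_likelihood(seq, **background)

member_score = log_odds("HHHHPPPP")     # blocky pattern, fits the family model
nonmember_score = log_odds("HPHPHPHP")  # alternating pattern, fits poorly
```

Because the forward algorithm sums over every path rather than committing to one alignment, a sequence can score well even when no single alignment is convincing, which may help with distant homologs.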

20
Training an HMM: The E/M Algorithm

Start with random state and transition
probabilities
Use model to "parse" training data, assigning
each to most likely path
Change the state and transition probabilities to
reflect the assignments
Iterate until convergence
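
The four steps above can be sketched as follows. Note that assigning each sequence to its single most likely path is the hard-assignment variant usually called Viterbi training. Full E/M (Baum-Welch) would use expected counts over all paths instead. The state names, alphabet, pseudocounts and training strings below are assumptions for illustration.

```python
import math
import random

def viterbi(seq, states, start, trans, emit):
    """Most likely hidden-state path for seq (log-space dynamic programming)."""
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []
    for symbol in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][symbol])
        back.append(pointers)
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def normalize(counts, pseudocount=1.0):
    """Turn counts into probabilities; the pseudocount keeps every entry above zero."""
    total = sum(counts.values()) + pseudocount * len(counts)
    return {k: (v + pseudocount) / total for k, v in counts.items()}

def viterbi_training(seqs, states, alphabet, rng, max_iter=50):
    # Step 1: start with random state and transition probabilities.
    rand = lambda keys: normalize({k: rng.random() for k in keys}, pseudocount=0.1)
    start = rand(states)
    trans = {s: rand(states) for s in states}
    emit = {s: rand(alphabet) for s in states}
    old_paths = None
    for _ in range(max_iter):
        # Step 2: parse the training data with the current model.
        paths = [viterbi(seq, states, start, trans, emit) for seq in seqs]
        if paths == old_paths:  # Step 4: assignments stopped changing
            break
        old_paths = paths
        # Step 3: re-estimate probabilities from the assignments.
        start_c = {s: 0 for s in states}
        trans_c = {s: {t: 0 for t in states} for s in states}
        emit_c = {s: {a: 0 for a in alphabet} for s in states}
        for seq, path in zip(seqs, paths):
            start_c[path[0]] += 1
            for state, symbol in zip(path, seq):
                emit_c[state][symbol] += 1
            for a, b in zip(path, path[1:]):
                trans_c[a][b] += 1
        start = normalize(start_c)
        trans = {s: normalize(trans_c[s]) for s in states}
        emit = {s: normalize(emit_c[s]) for s in states}
    return start, trans, emit

# Hypothetical training sequences over a reduced H/P alphabet.
training = ["HHHHPPPP", "HHHPPPPP", "HHHHHPPP"]
start, trans, emit = viterbi_training(training, ["s0", "s1"], "HP", random.Random(0))
```

Like other iterative fitting schemes, this converges to a local optimum that depends on the random starting point, so several restarts are usually worth trying.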

21
Trained HMM is a very sensitive discriminator for
families

HMM trained with 5 proteins has 20 fewer errors
than single-sequence probabilistic
Smith-Waterman; using ten sequences gives 51
fewer errors.
Weighting Hidden Markov Models for Maximum
Discrimination. R. Karchin and R. Hughey,
Bioinformatics, 14(9):772-782, 1998.
pfam.wustl.edu (2000 pretrained models)
www.cse.ucsc.edu/research/compbio/sam.html

22
Expression Array Analysis

Gene expression arrays are a popular new
technique for assaying the expression level of
tens of thousands of genes simultaneously
Many problems arise in analyzing this data
We are in collaboration with many groups for
analysis, and developing our own tools

23
Expression Analysis Issues

Identifying genes that changed significantly over
a set of observations. How much change is
enough? What's wrong with 2-fold?
Estimating the number of expression clusters.
How many groups of genes are there?
Finding discriminators based on expression levels
(e.g. for response to drugs)

24
Variation filters for expression

Most basic question in array analysis is still
open: Which genes changed significantly?
Several challenges:
Unknown variances, so hard to compare small
numbers of observations
Noise is inversely correlated with signal
Small (30%) changes known to be biologically
important.

25
A novel approach to detecting variable genes

Assume most genes do not change, so use median
variance as estimate of noise
Ratio of gene $(N-1)s^2$ to median $s^2$ is $\chi^2$
distributed with $N-1$ degrees of freedom: the gene
variance test
Problem is that overall variance may be low, but
one difference can be large
Investigating statistics on distances.
Use database for estimates of $s^2$ for each gene
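
A minimal sketch of the gene variance test as described on this slide: take the median per-gene variance as the noise estimate, form the ratio, and read a p-value from the chi-square distribution with N-1 degrees of freedom. The expression values and gene names are invented, and the chi-square tail function is a simple series implementation written for this example rather than a library routine.

```python
import math
from statistics import median, variance

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom.

    Uses the series for the regularized lower incomplete gamma function.
    Adequate for a sketch; very large x (roughly above 1400) would overflow.
    """
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = total = 1.0
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    lower = math.exp(-half_x + a * math.log(half_x) - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)

def gene_variance_test(expression):
    """expression: gene -> list of N measurements. Returns gene -> (statistic, p_value)."""
    variances = {gene: variance(values) for gene, values in expression.items()}
    noise = median(variances.values())  # assumes most genes do not change
    results = {}
    for gene, values in expression.items():
        n = len(values)
        stat = (n - 1) * variances[gene] / noise
        results[gene] = (stat, chi2_sf(stat, n - 1))
    return results

# Hypothetical expression levels for four genes over five arrays.
toy = {
    "geneA": [1.0, 1.1, 0.9, 1.0, 1.05],
    "geneB": [2.0, 2.1, 1.9, 2.05, 1.95],
    "geneC": [0.5, 0.55, 0.45, 0.5, 0.52],
    "geneD": [1.0, 3.0, 0.2, 2.5, 0.4],
}
results = gene_variance_test(toy)
most_variable = min(results, key=lambda gene: results[gene][1])
```

With tens of thousands of genes tested at once, the raw p-values would still need a multiple-testing correction before any gene is called significant.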

26
Estimating the number of clusters

Required input for many clustering methods
(e.g., k-means, SOM)
Hartigan test: Ratio of sum squared error for K
vs. K-1 clusters is F distributed (if K and K-1
differ by a single split class)
Assume multivariate Normal for each cluster
$-2\log[\Pr(K)/\Pr(K-1)]$ is $\chi^2$ distributed.
Monte Carlo cross-validation estimates, too.
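
The sum-squared-error comparison can be sketched with a small one-dimensional k-means. The statistic below, (SSE_K / SSE_{K+1} - 1)(n - K - 1), is the form usually attributed to Hartigan, and the cutoff of 10 is a commonly quoted rule of thumb. Both the cutoff and the toy data are assumptions for illustration, not values taken from the slides.

```python
def kmeans_sse(points, k, n_iter=100):
    """Sum of squared error after 1-D k-means with quantile-based starting centers."""
    data = sorted(points)
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for x in data:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            groups[nearest].append(x)
        # Keep the old center if a group ends up empty.
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min((x - c) ** 2 for c in centers) for x in data)

def hartigan(points, k):
    """Large values suggest that going from k to k+1 clusters is worthwhile."""
    n = len(points)
    return (kmeans_sse(points, k) / kmeans_sse(points, k + 1) - 1.0) * (n - k - 1)

def estimate_k(points, max_k, threshold=10.0):
    """Smallest k for which adding one more cluster no longer pays off."""
    for k in range(1, max_k):
        if hartigan(points, k) <= threshold:
            return k
    return max_k

# Hypothetical data with three well-separated groups.
toy = [1.0, 1.1, 0.9, 1.2, 0.8,
       5.0, 5.1, 4.9, 5.2, 4.8,
       9.0, 9.1, 8.9, 9.2, 8.8]
k_hat = estimate_k(toy, max_k=6)
print(k_hat)
```

On real expression data the clusters are rarely this clean, so it seems sensible to compare the answer against the likelihood-ratio and cross-validation estimates rather than trust any single one.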

27
Estimates of K on artificial data
28
Estimates of k using real data: 3454 Rat genes, 5
chips (alcohol sensitive vs. not)
29
Discrimination/Regression

Clusters of co-expressed genes are interesting,
but just a first step. Really want predictive
models, gene networks, etc.
Biology tells us that the predictors are likely
to involve interactions more than linear effects.
Traditional statistics is not strong in
non-linear models, high order interactions, large
datasets.
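
A small constructed example of why interactions matter: if a response appears only when exactly one of two genes is high, neither gene on its own shows any linear association with the response, yet their product predicts it perfectly. The coded expression levels and the response pattern are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical coded expression levels: -1 = low, +1 = high.
gene_a = [-1, -1, 1, 1, -1, -1, 1, 1]
gene_b = [-1, 1, -1, 1, -1, 1, -1, 1]
# Hypothetical response: present only when exactly one gene is high.
response = [0, 1, 1, 0, 0, 1, 1, 0]

# The interaction term is the product of the two coded levels.
interaction = [a * b for a, b in zip(gene_a, gene_b)]

r_a = pearson(gene_a, response)        # linear effect of gene A alone
r_b = pearson(gene_b, response)        # linear effect of gene B alone
r_ab = pearson(interaction, response)  # effect of the A x B interaction
```

A purely additive linear model of the two genes would find nothing here, which is the motivation for methods that search over higher-order interactions.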

30
Linear models vs. MMCs SLAM
31
Some Advice

A little bioinformatics is good for you!
Know how to use web data resources
Know the kinds of analyses that are possible
Sequence and structure computations are
widespread and (fairly) easy. Finding exons,
remote homologies, structural domains, fold
families, etc. are routine.
Generic clustering, discrimination/regression and
density estimation tools exist (neural
networks...)
Collaboration with bioinformaticians is no worse
than with statisticians... :-)
