Motif finding in groups of related sequences

Description:

Alignment score defined differently in probabilistic/combinatorial cases ... Keep the Cl best alignments A1, ..., ACt. ACGGTTG , CGAACTT , GGGCTCT ...

Slides: 27

Provided by: root70

Learn more at: http://people.csail.mit.edu

Transcript and Presenter's Notes

1
Motif finding in groups of related sequences
6.096 Algorithms for Computational Biology
2
Challenges in Computational Biology
[Figure: overview diagram of topics. Labels: DNA, RNA transcript, Genome Assembly, Regulatory motif discovery, Gene Finding, Sequence alignment, Comparative Genomics, Database lookup, Evolutionary Theory, Gene expression analysis, Cluster discovery, Gibbs sampling, Protein network analysis, Regulatory network inference, Emerging network properties. Example sequences shown: TCATGCTAT TCGTGATAA TGAGGATAT TTATCATAT TTATGATTT]
3
Challenges in Computational Biology
Regulatory motif discovery
DNA
Group of co-regulated genes
Common subsequence
4
Overview

Introduction
  Bio review: Where do ambiguities come from?
  Computational formulation of the problem
Combinatorial solutions
  Exhaustive search
  Greedy motif clustering
  Wordlets and motif refinement
Probabilistic solutions
  Expectation maximization
  Gibbs sampling

6
Regulatory motif discovery
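[Figure: the GAL1 gene and its upstream region, with two Gal4 sites (each labeled CGG ... CCG) and a Mig1 site (labeled CCCCW). Sequence shown: ATGACTAAATCTCATTCAGAAGAAGTGA]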

Regulatory motifs
Genes are turned on / off in response to changing environments
No direct addressing: subroutines (genes) contain sequence tags (motifs)
Specialized proteins (transcription factors) recognize these tags
What makes motif discovery hard?
Motifs are short (6-8 bp), sometimes degenerate
Can contain any set of nucleotides (no ATG or other rules)
Act at variable distances upstream (or downstream) of target gene
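
To make "sometimes degenerate" concrete: a degenerate consensus such as CCCCW can be matched by expanding each ambiguity code into a character class. The following is a minimal Python sketch, assuming the standard IUPAC nucleotide codes (for example, W = A or T). The helper names `motif_to_regex` and `find_motif` and the test sequence are hypothetical and do not come from the slides.

```python
import re

# Standard IUPAC nucleotide codes: each code maps to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Expand a degenerate consensus into a regex, one character class per position."""
    return "".join("[" + IUPAC[c] + "]" for c in motif.upper())


def find_motif(motif, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A zero-width lookahead lets finditer report overlapping occurrences.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Hypothetical sequence with two occurrences of CCCCW (CCCCA and CCCCT).
print(find_motif("CCCCW", "TTCCCCATTCCCCTGG"))
```

This only scans one strand and reports every position, which reflects the point that sites act at variable distances from the target gene, so position alone cannot be used to filter candidates.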

7
Sticks and backbones
Atomic
Chemical
Fancy
Traditional
8
Where do ambiguous bases come from?

Protein-DNA interactions
Proteins read DNA by feeling the chemical properties of the bases
Without opening DNA (not by base complementarity)
Sequence specificity
Topology of 3D contact dictates sequence specificity of binding
Some positions are fully constrained; other positions are degenerate
Ambiguous / degenerate positions are loosely contacted by the transcription factor

9
Characteristics of Regulatory Motifs

Tiny
Highly Variable
Constant Size
Because a constant-size transcription factor binds
Often repeated
Low-complexity-ish

10
Sequence Logos
entropy (n.)
1. (communication theory) a numerical measure of the uncertainty of an outcome; "the signal contained thousands of bits of information". Synonyms: information, selective information
2. (thermodynamics) a thermodynamic quantity representing the amount of energy in a system that is no longer available for doing mechanical work; "entropy increases as matter and energy in the universe degrade to an ultimate state of inert uniformity". Synonym: randomness

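Entropy at position i: $H(i) = -\sum_{\text{letter } x} \mathrm{freq}(x, i) \log_2 \mathrm{freq}(x, i)$
Height of x at position i: $L(x, i) = \mathrm{freq}(x, i) \cdot (2 - H(i))$
Examples
$\mathrm{freq}(A, i) = 1$: $H(i) = 0$, $L(A, i) = 2$
$A: \tfrac{1}{2},\ C: \tfrac{1}{4},\ G: \tfrac{1}{4}$: $H(i) = 1.5$, $L(A, i) = \tfrac{1}{4}$, and C and G together give $L(C, i) + L(G, i) = \tfrac{1}{4}$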
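
The two formulas above can be checked numerically. This is a small Python sketch, assuming a column is given as a dict of letter frequencies that sum to 1 and that letters with frequency 0 contribute nothing to the entropy (the usual convention). The function names are illustrative.

```python
import math


def entropy(freqs):
    """H(i) in bits for one column; zero-frequency letters are skipped."""
    return sum(-p * math.log2(p) for p in freqs.values() if p > 0)


def logo_heights(freqs):
    """L(x, i) = freq(x, i) * (2 - H(i)) for every letter x in the column."""
    h = entropy(freqs)
    return {x: p * (2 - h) for x, p in freqs.items()}


# First example from the slide: a fully conserved position.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
print(entropy(conserved), logo_heights(conserved)["A"])  # 0.0 2.0

# Second example from the slide: A 1/2, C 1/4, G 1/4.
mixed = {"A": 0.5, "C": 0.25, "G": 0.25, "T": 0.0}
print(entropy(mixed), logo_heights(mixed)["A"])  # 1.5 0.25
```

The constant 2 is the maximum entropy of a four-letter alphabet in bits, so the total column height 2 - H(i) is the information content of the position.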

11
Problem Definition
Given a collection of promoter sequences $s_1, \ldots, s_N$ of genes with common expression

Combinatorial
Motif $M = m_1 \ldots m_W$
Some of the $m_i$'s blank
Find M that occurs in all $s_i$ with $\le k$ differences
Or, find M with smallest total Hamming distance
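
Both combinatorial objectives can be written down directly. The sketch below is a brute-force Python illustration, not necessarily the method the slides go on to develop. It assumes motifs without blank positions, sequences at least W long, and that "total Hamming distance" means the sum over sequences of the distance to the closest window. All function names and the example sequences are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_distance(motif, seq):
    """Smallest Hamming distance between motif and any window of seq."""
    w = len(motif)
    return min(hamming(motif, seq[i:i + w]) for i in range(len(seq) - w + 1))


def occurs_in_all(motif, seqs, k):
    """First objective: motif occurs in every sequence with at most k differences."""
    return all(best_distance(motif, s) <= k for s in seqs)


def total_distance(motif, seqs):
    """Second objective: sum over sequences of the distance to the closest window."""
    return sum(best_distance(motif, s) for s in seqs)


def exhaustive_search(seqs, w):
    """Try all 4**w words and return (motif, total distance) with the smallest total."""
    candidates = ("".join(p) for p in product("ACGT", repeat=w))
    return min(((m, total_distance(m, seqs)) for m in candidates), key=lambda t: t[1])


# Hypothetical promoter fragments that share the word ACGT.
seqs = ["TTACGTAA", "GGACGTCC", "ACGTTTTT"]
print(exhaustive_search(seqs, 4))  # ('ACGT', 0)
```

The search visits 4^W candidates and scans every window of every sequence for each one, so it is only practical for short motifs.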

Probabilistic
Motif $M_{ij}$, $1 \le i \le W$, $1 \le j \le 4$
$M_{ij}$ = Prob[letter j, pos i]
Find best M, and positions $p_1, \ldots, p_N$ in sequences
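
The probabilistic formulation has two halves: given M, pick the best position in each sequence; given the positions, re-estimate M. The Python sketch below shows one way each half might look. It assumes letters are ordered A, C, G, T for the index j, uses 0-based indices where the slide uses 1-based, and adds a pseudocount when estimating M, which is an assumption of this sketch and not something the slide states. The function names and example sequences are hypothetical.

```python
LETTERS = "ACGT"  # assumed order for the column index j


def window_prob(M, window):
    """Probability of a window under M: product of M[i][j] over positions i."""
    p = 1.0
    for i, x in enumerate(window):
        p *= M[i][LETTERS.index(x)]
    return p


def best_position(M, seq):
    """Start position in seq whose window is most probable under M."""
    w = len(M)
    return max(range(len(seq) - w + 1), key=lambda p: window_prob(M, seq[p:p + w]))


def estimate_matrix(seqs, positions, w, pseudocount=1.0):
    """Estimate M from the windows at the given positions (pseudocount is an assumption)."""
    M = []
    for i in range(w):
        counts = [pseudocount] * 4
        for s, p in zip(seqs, positions):
            counts[LETTERS.index(s[p + i])] += 1
        total = sum(counts)
        M.append([c / total for c in counts])
    return M


# Hypothetical sequences; the motif ACGT starts at positions 2, 2 and 0.
seqs = ["TTACGTAA", "GGACGTCC", "ACGTTTTT"]
M = estimate_matrix(seqs, [2, 2, 0], w=4)
print([best_position(M, s) for s in seqs])  # [2, 2, 0]
```

Alternating these two steps from an initial guess is the basic loop that iterative methods build on; this sketch shows only the two building blocks, not a complete algorithm.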
12
Finding Regulatory Motifs
. . .