Whole Genome Assembly Microarray analysis - PowerPoint PPT Presentation

Slides: 46
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu
Transcript and Presenter's Notes


1
Whole Genome Assembly / Microarray analysis
2
Mate Pairs
  • Mate-pairs allow you to merge islands (contigs)
    into super-contigs

3
Super-contigs are quite large
  • Make clones of truly predictable length. Ex.: 3
    sets can be used: 2 Kb, 10 Kb and 50 Kb. The
    variance in these lengths should be small.
  • Use the mate-pairs to order and orient the
    contigs, and make super-contigs.

4
Problem 3: Repeats
5
Repeats & Chimerisms
  • 40-50% of the human genome is made up of
    repetitive elements.
  • Repeats can cause great problems in the assembly!
  • Chimerism causes a clone to be from two different
    parts of the genome. This can again give a completely
    wrong assembly.

6
Repeat detection
  • Lander-Waterman strikes again!
  • The expected number of clones in a Repeat
    containing island is MUCH larger than in a
    non-repeat containing island (contig).
  • Thus, every contig can be marked as Unique, or
    non-unique. In the first step, throw away the
    non-unique islands.

7
Detecting Repeat Contigs 1: Read Density
  • Compute the log-odds ratio of two hypotheses:
  • H1: The contig is from a unique region of the
    genome.
  • H2: The contig is from a region that is repeated at
    least twice.
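
The read-density test can be sketched as follows. This is a minimal illustration, not the slide's own formula: it assumes read start positions follow a Poisson model, so a unique contig of length L is expected to contain lam = L * N / G reads (N reads, genome length G) and a two-copy repeat is expected to contain 2 * lam. The function name and the example numbers are hypothetical.

```python
import math

def repeat_log_odds(contig_length, reads_in_contig, total_reads, genome_length):
    """Natural-log odds of H1 (unique) versus H2 (repeated twice) for one contig.

    Assumption: the number of reads in the contig is Poisson with mean
    lam under H1 and 2 * lam under H2, where lam = L * N / G.
    """
    lam = contig_length * total_reads / genome_length
    # log[ Pois(k; lam) / Pois(k; 2 * lam) ] simplifies to lam - k * ln 2
    return lam - reads_in_contig * math.log(2)

# Hypothetical numbers: 1 Mb genome, 10,000 reads, 10 kb contig (lam = 100).
unique_like = repeat_log_odds(10_000, 100, 10_000, 1_000_000)  # near expected depth
repeat_like = repeat_log_odds(10_000, 220, 10_000, 1_000_000)  # about double depth
print(unique_like > 0, repeat_like < 0)  # True True
```

A positive score favors the unique hypothesis and a negative score favors the repeat hypothesis; any cutoff other than zero would be a tuning choice.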

8
Detecting Chimeric reads
  • Chimeric reads: reads that contain sequence from
    two genomic locations.
  • Good overlap: G(a,b) if a and b overlap with a high
    score.
  • Transitive overlap: T(a,c) if G(a,b) and G(b,c).
  • Find a point x across which only transitive
    overlaps occur. Such an x is a point of chimerism.

9
Contig assembly
  • Reads are merged into contigs up to repeat
    boundaries.
  • If (a,b) and (a,c) overlap, then (b,c) should overlap as
    well. Also,
  • shift(a,c) = shift(a,b) + shift(b,c)
  • Most of the contigs are unique pieces of the
    genome, and end at some Repeat boundary.
  • Some contigs might be entirely within repeats.
    These must be detected
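
The shift consistency condition can be checked with a few lines of code. This sketch assumes a shift is the offset, in bases, of the second read's start relative to the first read's start; the function name and the tolerance parameter are hypothetical additions.

```python
def shifts_consistent(shift_ab, shift_bc, shift_ac, tolerance=0):
    """True if shift(a,c) equals shift(a,b) + shift(b,c) within a tolerance.

    Assumption: each shift is the offset of the second read's start
    relative to the first read's start, measured in bases.
    """
    return abs(shift_ac - (shift_ab + shift_bc)) <= tolerance

print(shifts_consistent(120, 80, 200))               # True
print(shifts_consistent(120, 80, 450, tolerance=5))  # False
```

A triple of reads that fails this check is a hint that one of the overlaps crosses a repeat boundary.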

10
Creating Super Contigs
11
Supercontig assembly
  • Supercontigs are built incrementally
  • Initially, each contig is a supercontig.
  • In each round, a pair of super-contigs is merged
    until no more can be performed.
  • Create a Priority Queue with a score for every
    pair of mergeable supercontigs.
  • Score has two terms
  • A reward for multiple mate-pair links
  • A penalty for distance between the links.

12
Supercontig merging
  • Remove the top scoring pair (S1,S2) from the
    priority queue Q.
  • Merge (S1,S2) to form supercontig T.
  • Remove all pairs in Q containing S1 or S2.
  • Find all supercontigs W that share mate-pair
    links with T and insert (T,W) into the priority
    queue.
  • Detect repeated supercontigs and remove them.
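
The greedy merging loop might look roughly like the sketch below. Several things here are assumptions made for illustration: the score formula and its weights, the rule for combining links after a merge (link counts are summed, the smaller distance is kept), and the representation of a supercontig as a tuple of contig names. Orientation, ordering within a supercontig, and repeat detection are left out. Pairs containing S1 or S2 are removed lazily, by skipping stale queue entries when they are popped.

```python
import heapq

def score(link_count, distance, reward=1.0, penalty=0.001):
    """Hypothetical score: reward mate-pair links, penalize distance."""
    return reward * link_count - penalty * distance

def merge_supercontigs(contigs, links, min_score=0.0):
    """Greedily merge supercontigs and return the final sorted list.

    contigs: iterable of contig names.
    links:   dict mapping (a, b) -> (link_count, distance).
    """
    alive = {(c,) for c in contigs}          # each contig starts as a supercontig
    edges = {frozenset({(a,), (b,)}): v for (a, b), v in links.items()}
    heap = [(-score(*v), tuple(sorted(pair))) for pair, v in edges.items()]
    heapq.heapify(heap)                      # min-heap, so scores are negated

    while heap:
        neg, (s1, s2) = heapq.heappop(heap)
        if s1 not in alive or s2 not in alive:
            continue                         # stale pair: S1 or S2 already merged
        if -neg < min_score:
            break                            # best remaining pair is too weak
        t = s1 + s2                          # merged supercontig T
        alive -= {s1, s2}
        alive.add(t)
        for w in alive - {t}:                # supercontigs W linked to T
            parts = [edges[k] for k in (frozenset({s1, w}), frozenset({s2, w}))
                     if k in edges]
            if parts:
                combined = (sum(c for c, _ in parts), min(d for _, d in parts))
                edges[frozenset({t, w})] = combined
                heapq.heappush(heap, (-score(*combined), tuple(sorted((t, w)))))
    return sorted(alive)

example_links = {("c1", "c2"): (5, 1000), ("c2", "c3"): (2, 3000), ("c3", "c4"): (4, 500)}
print(merge_supercontigs(["c1", "c2", "c3", "c4"], example_links))
# [('c1', 'c2'), ('c3', 'c4')]
```

With these made-up weights the weakly supported c2-c3 link scores below zero, so the two supercontigs are not joined; lowering min_score would merge them.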

13
Repeat Supercontigs
  • If the distance between two super-contigs is not
    correct, they are marked as Repeated
  • If transitivity is not maintained, then there is
    a Repeat

14
Filling gaps in Supercontigs
15
Consensus Derivation / Assembly
  • Summary
  • Do an all pairs prefix-suffix alignment.
    (Speedup using k-mer hashing).
  • Construct a graph of overlapping alignments.
  • Break the graph into unique regions (Number of
    clones similar to prediction using LW), and
    repeat/chimeric regions. Each such unique
    region is called a contig.
  • Merge contigs into super-contigs using mate-pair
    links
  • For each contig, construct a multiple alignment,
    and consensus sequence.
  • Pad the consensus sequence using NNs.
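
The k-mer hashing speedup in the first step can be sketched as follows: instead of aligning every pair of reads, only pairs that share at least one k-mer are passed on to the expensive prefix-suffix alignment. The function name and the example reads are hypothetical, and reverse-complement matches are ignored to keep the sketch short.

```python
from collections import defaultdict

def candidate_overlaps(reads, k):
    """Return pairs (i, j), i < j, of read indices sharing at least one k-mer."""
    index = defaultdict(set)                 # k-mer -> indices of reads containing it
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    pairs = set()
    for ids in index.values():
        ordered = sorted(ids)
        for x in range(len(ordered)):
            for y in range(x + 1, len(ordered)):
                pairs.add((ordered[x], ordered[y]))
    return pairs

reads = ["ACGTACGTGG", "CGTGGTTAAC", "TTTTTTTTTT"]
print(candidate_overlaps(reads, k=5))  # {(0, 1)}
```

Only the candidate pairs would then be aligned; a k-mer that occurs in very many reads is itself a sign of a repeat.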

16
Summary
  • Once controversial, whole genome shotgun is now
    routine
  • Human, Mouse, Rat, Dog, Chimpanzee, ...
  • Many Prokaryotes (One can be sequenced in a day)
  • Plant genomes: Arabidopsis, Rice
  • Model organisms: Worm, Fly, Yeast
  • WGS must be followed up with a finishing effort.
  • A lot is not known about genome structure,
    organization and function.
  • Comparative genomics offers low hanging fruit.

17
Biol. Data analysis Review
Assembly
Protein Sequence Analysis
Sequence Analysis/ DNA signals
Gene Finding
18
Other static analysis is possible
Genomic Analysis/ Pop. Genetics
Assembly
Protein Sequence Analysis
Sequence Analysis
Gene Finding
ncRNA
19
A Static picture of the cell is insufficient
  • Each Cell is continuously active,
  • Genes are being transcribed into RNA
  • RNA is translated into proteins
  • Proteins are post-translationally (PT) modified and transported
  • Proteins perform various cellular functions
  • Can we probe the Cell dynamically?
  • Which transcripts are active?
  • Which proteins are active?
  • Which proteins interact?

Gene Regulation
Proteomic profiling
Transcript profiling
20
Micro-array analysis
21
The Biological Problem
  • Two conditions need to be differentiated
    (they have different treatments).
  • Ex.: ALL (Acute Lymphocytic Leukemia) vs. AML (Acute
    Myelogenous Leukemia)
  • Possibly, the sets of over-expressed genes are
    different in the two conditions.

22
Supplementary fig. 2. Expression levels of
predictive genes in independent dataset. The
expression levels of the 50 genes most highly
correlated with the ALL-AML distinction in the
initial dataset were determined in the
independent dataset. Each row corresponds to a
gene, with the columns corresponding to
expression levels in different samples. The
expression level of each gene in the independent
dataset is shown relative to the mean of
expression levels for that gene in the initial
dataset. Expression levels greater than the mean
are shaded in red, and those below the mean are
shaded in blue. The scale indicates standard
deviations above or below the mean. The top panel
shows genes highly expressed in ALL, the bottom
panel shows genes more highly expressed in AML.
23
Gene Expression Data
  • Gene expression data:
  • Each row corresponds to a gene.
  • Each column corresponds to an experiment (sample);
    each entry is an expression value.
  • Can we separate the experiments into two or more
    classes?
  • Given a training set of two classes, can we build
    a classifier that places a new experiment in one
    of the two classes?

[Figure: expression matrix; labels s1, s2, s, g]
24
Three types of analysis problems
  • Cluster analysis/unsupervised learning
  • Classification into known classes (Supervised)
  • Identification of marker genes that
    characterize different tumor classes

25
Supervised Classification Basics
  • Consider genes g1 and g2
  • g1 is up-regulated in class A, and down-regulated
    in class B.
  • g2 is up-regulated in class A, and down-regulated
    in class B.
  • Intuitively, g1 and g2 are effective in
    classifying the two samples. The samples are
    linearly separable.

26
Basics
  • With 3 genes, a plane is used to separate
    (linearly separable samples). In higher
    dimensions, a hyperplane is used.

27
Non-linear separability
  • Sometimes, the data is not linearly separable,
    but can be separated by some other function
  • In general, the linearly separable problem is
    computationally easier.

28
Formalizing the classification problem for
micro-arrays
  • Each experiment (sample) is a vector v of
    expression values.
  • By default, all vectors v are column vectors.
  • v^T is the transpose of the vector v.
  • The genes are the dimensions of the vector.
  • Classification problem: find a surface that will
    separate the classes.

[Figure labels: v, v^T]
29
Formalizing Classification
  • Classification problem: find a surface
    (hyperplane) that will separate the classes.
  • Given a new sample point, its class is then
    determined by which side of the surface it lies
    on.
  • How do we find the hyperplane? How do we find the
    side that a point lies on?

Sample   1    2    3    4    5    6
g1       1   .9   .8   .1   .2   .1
g2      .1    0   .2   .8   .7   .9
30
Basic geometry
  • What is ‖x‖²?
  • What is x/‖x‖?
  • Dot product?

[Figure labels: x = (x1, x2), y]
31
Dot Product
  • Let β be a unit vector:
  • ‖β‖ = 1
  • Recall that
  • β^T x = ‖x‖ cos θ, where θ is the angle between β and x
  • What is β^T x if x is orthogonal (perpendicular)
    to β?
  • How do we specify a hyperplane?

[Figure: vectors x and β with angle θ; β^T x = ‖x‖ cos θ]
32
Hyperplane
  • How can we define a hyperplane L?

33
Points on the hyperplane
  • Consider a hyperplane L defined by unit vector β
    and distance β0.
  • Notes:
  • For all x ∈ L, x^T β must be the same: x^T β = β0
  • For any two points x1, x2 ∈ L,
  • (x1 - x2)^T β = 0

[Figure labels: x1, x2]
34
Hyperplane properties
  • Given an arbitrary point x, what is the distance
    from x to the plane L?
  • D(x, L) = (β^T x - β0)
  • When are points x1 and x2 on different sides of
    the hyperplane?

[Figure labels: x, β0]
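
Both questions on this slide can be answered with the signed distance: two points are on different sides of L exactly when their signed distances have opposite signs. The sketch below assumes the hyperplane is given by a unit normal vector (called beta here) and an offset beta0; the function names and the example plane are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def signed_distance(x, beta, beta0):
    """Signed distance beta^T x - beta0 from x to the hyperplane.

    Assumption: beta is a unit vector; otherwise divide by its length.
    """
    return dot(beta, x) - beta0

def same_side(x1, x2, beta, beta0):
    """True if x1 and x2 lie strictly on the same side of the hyperplane."""
    return signed_distance(x1, beta, beta0) * signed_distance(x2, beta, beta0) > 0

beta = (1 / math.sqrt(2), 1 / math.sqrt(2))   # unit normal of the line x1 + x2 = sqrt(2)
beta0 = 1.0
print(signed_distance((2.0, 2.0), beta, beta0) > 0)      # True
print(same_side((2.0, 2.0), (0.0, 0.0), beta, beta0))    # False
```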
35
Separating by a hyperplane
  • Input: a training set of +ve and -ve examples.
  • Goal: find a hyperplane that separates the two
    classes.
  • Classification: a new point x is +ve if it lies
    on the +ve side of the hyperplane, -ve otherwise.
  • The hyperplane is represented by the line
  • {x : -β0 + β1 x1 + β2 x2 = 0}

36
Error in classification
  • An arbitrarily chosen hyperplane might not
    separate the two classes. We need to minimize a
    mis-classification error.
  • Error: sum of distances of the misclassified
    points.
  • Let y_i = 1 for +ve example i, y_i = -1 otherwise.
  • Other definitions are also possible.
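
The slide's error formula did not survive extraction. One common way to write this error, assuming M is the set of misclassified training points and the hyperplane is given by a unit vector β and offset β0, is the perceptron criterion below. Each term is the positive distance of a misclassified point from the hyperplane.

```latex
D(\beta, \beta_0) = -\sum_{i \in M} y_i \left( \beta^{T} x_i - \beta_0 \right),
\qquad
\frac{\partial D}{\partial \beta} = -\sum_{i \in M} y_i x_i,
\qquad
\frac{\partial D}{\partial \beta_0} = \sum_{i \in M} y_i
```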

37
Gradient Descent
  • The function D(β) defines the error.
  • We follow an iterative refinement. In each step,
    refine β so the error is reduced.
  • Gradient descent is an approach to such iterative
    refinement.

[Figure: error D(β) plotted against β]
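
A minimal sketch of gradient descent in one dimension is shown below. The error function used here is a made-up quadratic chosen so the minimum is known in advance; the step size and iteration count are arbitrary assumptions.

```python
def gradient_descent(grad, start, rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    beta = start
    for _ in range(steps):
        beta = beta - rate * grad(beta)
    return beta

# Hypothetical error D(beta) = (beta - 3)^2, whose gradient is 2 * (beta - 3).
best = gradient_descent(lambda b: 2 * (b - 3), start=0.0)
print(round(best, 4))  # 3.0
```

If the rate is too large the iterates can overshoot and diverge; if it is too small, convergence is slow.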
38
Rosenblatt's perceptron learning algorithm
39
Classification based on perceptron learning
  • Use Rosenblatt's algorithm to compute the
    hyperplane L(β, β0).
  • Assign x to class 1 if f(x) > 0, and to class 2
    otherwise.
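
A perceptron-style update loop can be sketched as below. This is an illustration under stated assumptions, not the slide's exact algorithm: f(x) is taken to be beta^T x - beta0, each misclassified point triggers one update step, and the learning rate and epoch limit are arbitrary. The six samples are the (g1, g2) values from the table on slide 29; treating samples 1-3 as class 1 and samples 4-6 as class 2 is an assumption.

```python
def f(x, beta, beta0):
    """Linear function whose sign decides the class."""
    return sum(b * xi for b, xi in zip(beta, x)) - beta0

def perceptron(samples, labels, rate=1.0, max_epochs=1000):
    """Return (beta, beta0) separating the samples; labels are +1 or -1."""
    beta = [0.0] * len(samples[0])
    beta0 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if y * f(x, beta, beta0) <= 0:          # misclassified or on the plane
                beta = [b + rate * y * xi for b, xi in zip(beta, x)]
                beta0 -= rate * y
                mistakes += 1
        if mistakes == 0:                           # a full pass with no errors
            return beta, beta0
    raise RuntimeError("no separating hyperplane found within max_epochs")

def classify(x, beta, beta0):
    """Class 1 if f(x) > 0, class 2 otherwise."""
    return 1 if f(x, beta, beta0) > 0 else 2

samples = [(1, .1), (.9, 0), (.8, .2), (.1, .8), (.2, .7), (.1, .9)]
labels = [1, 1, 1, -1, -1, -1]                      # assumed class assignment
beta, beta0 = perceptron(samples, labels)
print([classify(x, beta, beta0) for x in samples])  # [1, 1, 1, 2, 2, 2]
```

The epoch limit is a practical guard: on data that is not linearly separable the update loop would otherwise never stop.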

40
Perceptron learning
  • If many solutions are possible, it does not choose
    between solutions.
  • If data is not linearly separable, it does not
    terminate, and it is hard to detect.
  • Time of convergence is not well understood

41
Linear Discriminant analysis
  • Provides an alternative approach to
    classification with a linear function.
  • Project all points, including the means, onto
    a vector β.
  • We want to choose β such that
  • Difference of projected means is large.
  • Variance within group is small.

[Figure labels: β, x1, x2]
42
LDA Contd
Fisher Criterion
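
The slide's formula is not in the transcript. In its standard form, the Fisher criterion maximizes J(β) = (β^T m_a - β^T m_b)² / (β^T S_W β), where m_a and m_b are the class means and S_W is the within-class scatter matrix; the maximizing direction is proportional to S_W^{-1} (m_a - m_b). The sketch below computes that direction for 2-D points in plain Python. The function name is hypothetical, and the example reuses the (g1, g2) values from slide 29 with an assumed split into two classes.

```python
def fisher_direction(class_a, class_b):
    """Fisher discriminant direction S_W^{-1} (m_a - m_b) for 2-D points."""
    def mean(points):
        n = len(points)
        return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

    def scatter(points, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    ma, mb = mean(class_a), mean(class_b)
    a, b = scatter(class_a, ma), scatter(class_b, mb)
    sxx, sxy, syy = a[0] + b[0], a[1] + b[1], a[2] + b[2]   # within-class scatter S_W
    det = sxx * syy - sxy * sxy
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # Multiply the inverse of the 2x2 matrix S_W by the difference of means.
    return [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]

class_a = [(1, .1), (.9, 0), (.8, .2)]      # assumed class A samples
class_b = [(.1, .8), (.2, .7), (.1, .9)]    # assumed class B samples
w = fisher_direction(class_a, class_b)
project = lambda p: w[0] * p[0] + w[1] * p[1]
print(min(map(project, class_a)) > max(map(project, class_b)))  # True
```

After projection the two groups do not overlap, so any threshold between them classifies these six samples correctly.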
43
Maximum Likelihood discrimination
  • Suppose we knew the distribution of points in
    each class.
  • We can compute Pr(x | ω_i) for all classes i, and
    take the maximum.

44
ML discrimination
  • Suppose all the points were in 1 dimension, and
    all classes were normally distributed.

45
ML discrimination recipe
  • We know the distribution for each class, but not
    the parameters
  • Estimate the mean and variance for each class.
  • For a new point x, compute the discrimination
    function g_i(x) for each class i.
  • Choose argmax_i g_i(x) as the class for x.
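
The recipe can be sketched for the one-dimensional normal case as follows. It assumes equal prior probabilities for the classes and takes g_i(x) to be the log of the normal density with the estimated mean and variance; the function names are hypothetical. The training values are the g1 expression values from the slide 29 table, split into two classes by assumption.

```python
import math

def fit_class(values):
    """Estimate the mean and maximum-likelihood variance of a 1-D sample."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def g(x, mu, var):
    """Discrimination function: log of the normal density N(mu, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def classify(x, params):
    """params maps class name -> (mu, var); returns argmax_i g_i(x)."""
    return max(params, key=lambda c: g(x, *params[c]))

training = {"A": [1.0, 0.9, 0.8], "B": [0.1, 0.2, 0.1]}   # assumed class labels
params = {c: fit_class(v) for c, v in training.items()}
print(classify(0.85, params), classify(0.15, params))  # A B
```

With unequal priors, the log prior of each class would be added to g_i(x) before taking the maximum.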