Title: de novo Sequence Analysis
1 de novo Sequence Analysis
Lab work
- de novo cDNA analysis
- de novo genomic sequence analysis
- de novo protein analysis
Confirms/disagrees with in silico predictions
Development of programs
Predictions
Sequence analysis tools
Major portion of slides by Jane Loveland and Dustin Schones
2 de novo Sequence Analysis
- Assign function to a sequence
- Align to annotated sequences
- Similarity searching and pairwise alignment
  - BLAST: cDNA and genomic clones, proteins
  - BLAT: genome searching
  - Gene structure
  - PSI-BLAST: sensitive protein searching
- Multiple sequence alignment (proteins)
  - CLUSTALW: perform alignment
  - JalView, GeneDoc: edit and view alignment
- Find open reading frames
  - ORF Finder
- If translated, map protein domains
  - InterProScan
3 sequence alignment
- sequence analysis ? sequence alignment
- what
- why
- similar sequence
- infer homology
- infer function
sequence → structure → function
4 pairwise alignments; multiple sequence alignments
5 Global vs. Local
6 BLAST
Basic Local Alignment Search Tool
- idea: find high-scoring local alignments between query sequence and target database
- assumption: true match alignments are very likely to contain within them very high-scoring matches
- heuristics theme: search quickly for homologous regions and then do slow/exact alignments
7 BLAST family
8 BLAST family
9 BLAST Steps
- For each word of length W in the query, generate a list of all possible words (neighborhood) with a score of at least threshold T (determined by using the scoring matrix)
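The neighborhood step can be sketched in a few lines of Python. This is a minimal illustration, not the BLAST implementation: it assumes a DNA alphabet and a toy match/mismatch scoring scheme (the values 5 and -4 and all function names are hypothetical), whereas protein BLAST would use a substitution matrix such as BLOSUM62.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: DNA alphabet, chosen to keep the example small


def score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Toy scoring matrix (hypothetical values): match or mismatch only."""
    return match if a == b else mismatch


def neighborhood(word: str, threshold: int) -> list[str]:
    """All words of the same length that score at least `threshold` against `word`."""
    hits = []
    for cand in product(ALPHABET, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            hits.append("".join(cand))
    return hits


def query_neighborhoods(query: str, w: int, threshold: int) -> dict[int, list[str]]:
    """Map each query offset to the neighborhood of the W-mer starting there."""
    return {i: neighborhood(query[i:i + w], threshold)
            for i in range(len(query) - w + 1)}
```

Lowering T enlarges each neighborhood (more sensitivity, more word hits to process); raising it leaves only exact or near-exact words.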
10 Determine the locations of all common words between the query and the database (word hits).
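One way to picture the word-hit step is a lookup table from every database word to its positions. The sketch below assumes the database is a single string and that the neighborhood lists per query offset are already available; the function names and the example data are hypothetical.

```python
from collections import defaultdict


def index_words(db: str, w: int) -> dict[str, list[int]]:
    """Map every overlapping word of length w in the database to its start positions."""
    index = defaultdict(list)
    for j in range(len(db) - w + 1):
        index[db[j:j + w]].append(j)
    return index


def word_hits(neighborhoods: dict[int, list[str]],
              index: dict[str, list[int]]) -> list[tuple[int, int]]:
    """Return sorted (query_offset, db_offset) pairs where a neighborhood word occurs."""
    hits = []
    for q_pos, words in neighborhoods.items():
        for word in words:
            for d_pos in index.get(word, []):
                hits.append((q_pos, d_pos))
    return sorted(hits)


# Hypothetical input: neighborhoods of the query "ACGT" with W = 3, exact words only.
example_index = index_words("TTACGTACG", 3)
example_hits = word_hits({0: ["ACG"], 1: ["CGT"]}, example_index)
```

Each pair in `example_hits` is a seed that the extension step would then try to grow into a longer alignment.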
12 BLAST Steps
- use dynamic programming to extend hits until the score drops by a value of X below the best score seen so far (expensive: 90% of the running time)
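The X-drop stopping rule is easiest to see in a simplified, ungapped form. The sketch below is an assumption-laden illustration rather than BLAST's gapped dynamic-programming extension: it extends a hit to the right only, uses a hypothetical match/mismatch scheme (5 and -4), and stops once the running score has fallen more than X below the best score seen.

```python
def extend_right(query: str, db: str, q_start: int, d_start: int,
                 x_drop: int, match: int = 5, mismatch: int = -4) -> tuple[int, int]:
    """Ungapped rightward extension from (q_start, d_start).

    Returns (best_score, length_of_best_extension).
    """
    score = 0
    best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and d_start + i < len(db):
        score += match if query[q_start + i] == db[d_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # score has dropped more than X below the best: give up
    return best, best_len
```

With a generous X the extension can cross an isolated mismatch and keep growing; with a small X it stops at the first drop, which is faster but may truncate a good alignment.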
13 Evaluates the statistical significance of extended hits and reports only those above the determined threshold.
15 BLAST statistical evaluation
- for local, ungapped alignments
- m: size of query; n: size of database
- E: expected number of HSPs (high-scoring segment pairs) with scores of at least S
- p: probability of finding at least one HSP with a score of at least S
- good tutorial at
- http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
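The quantities on this slide are related by the Karlin-Altschul formulas for ungapped local alignments: E = K·m·n·exp(-λS) and p = 1 - exp(-E). A small sketch, assuming those standard forms; K and λ depend on the scoring system and background frequencies, and any values passed in here are hypothetical.

```python
import math


def expected_hsps(score: float, m: int, n: int, k: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * S): expected number of HSPs scoring at least S."""
    return k * m * n * math.exp(-lam * score)


def p_at_least_one(e_value: float) -> float:
    """p = 1 - exp(-E): probability of at least one HSP scoring at least S."""
    return -math.expm1(-e_value)  # numerically stable form of 1 - exp(-E)
```

For small E the two numbers are nearly equal (p ≈ E), which is why E-values below about 0.01 are often read directly as probabilities.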
16 BLAT
- BLAST-Like Alignment Tool (BLAT)
- Good for aligning mRNA, ESTs to genome
  - fast
  - aligns whole mRNA, not just exons
  - handles introns and splice sites
- Sequences need to be 95% identical or better
- Available at
  - UCSC Genome Browser
  - Ensembl
17 BLAT
- Steps for cDNA alignment
  - 1: break cDNA into n-base chunks
  - 2: use index to find regions in genome similar to each chunk of cDNA
  - 3: detailed alignment between genome region and cDNA chunk
  - 4: dynamic programming: stitch together detailed alignments of chunks into alignment of whole
18
- genome: cacaattatcacgaccgc (K = 8-13 for a real genome)
  K-mers:    cac aat tat cac gac cgc
  positions: 0   3   6   9   12  15
- cDNA: aattctcac
  3-mers:    aat att ttc tct ctc tca cac
  positions: 0   1   2   3   4   5   6
example from Jim Kent
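The 3-mer example can be reproduced with a short sketch: index the non-overlapping K-mers of the genome, then look up every overlapping K-mer of the cDNA. The function names are hypothetical and this covers only the index lookup (step 2), not the detailed alignment or stitching.

```python
from collections import defaultdict


def index_genome(genome: str, k: int) -> dict[str, list[int]]:
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def cdna_hits(cdna: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Look up every overlapping k-mer of the cDNA; return (cdna_pos, genome_pos) pairs."""
    hits = []
    for pos in range(len(cdna) - k + 1):
        for g_pos in index.get(cdna[pos:pos + k], []):
            hits.append((pos, g_pos))
    return hits


genome = "cacaattatcacgaccgc"
cdna = "aattctcac"
genome_index = index_genome(genome, 3)
hits = cdna_hits(cdna, genome_index, 3)
```

Here `hits` contains the cDNA 3-mer `aat` matching genome position 3 and `cac` matching genome positions 0 and 9. Indexing the genome without overlap keeps the index small, while scanning the cDNA with overlap ensures no alignment frame is missed.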
19 PSI-BLAST
Position-Specific Iterated BLAST
- database searches using position-specific scoring matrices are more powerful than simply using a single sequence
- STEPS
  - collect all DB sequences that align with E-value < T
  - align these to make a position-specific scoring matrix
  - use the scoring matrix to search for new hits
  - iterate
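The "make a position-specific scoring matrix" step can be illustrated with a simplified log-odds calculation. This sketch makes several assumptions that PSI-BLAST itself does not: the aligned sequences are ungapped and of equal length, background frequencies are uniform, all sequences are weighted equally, and a flat pseudocount is used. Function names and example sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned: list[str], alphabet: str = AMINO_ACIDS,
               pseudocount: float = 1.0) -> list[dict[str, float]]:
    """One dict per alignment column: residue -> log2(observed freq / background freq)."""
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    pssm = []
    for col in range(len(aligned[0])):
        counts = {a: pseudocount for a in alphabet}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({a: math.log2(counts[a] / total / background) for a in alphabet})
    return pssm


def score_with_pssm(pssm: list[dict[str, float]], word: str) -> float:
    """Score a word of the same length as the matrix, column by column."""
    return sum(col[ch] for col, ch in zip(pssm, word))


example_pssm = build_pssm(["MKV", "MKV", "MKL"])
```

Unlike a single substitution matrix, the score for a residue now depends on its column: V is rewarded in the third column of `example_pssm` but penalized in the first.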
20 PSI-BLAST
21 ORF-finder
- graphical analysis tool which finds all open reading frames in a sequence
- looks for start and stop codons
- assumes upstream start and downstream stop if ORF is at least 100 amino acids
- ORFs can be selected to view as DNA sequence or amino acid sequence
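The start/stop scan can be sketched as follows. This is not the ORF Finder implementation: it assumes the standard start codon ATG and stop codons TAA, TAG, TGA, searches only the three forward frames (a full search would also scan the reverse complement), and the function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 100) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for each ATG...stop ORF in the three forward frames.

    `end` is exclusive and includes the stop codon; `min_codons` counts codons
    from the ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None  # look for the next ORF in this frame
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGCC", min_codons=3)` reports one ORF in frame 2 spanning positions 2 to 14, while the default threshold of 100 codons reports none for such a short sequence.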
22 ClustalW: DNA and Protein alignments
Copy and paste sequences. Alignment may be viewed and edited in Jalview.
Available at EBI (http://www.ebi.ac.uk)
23 ClustalW Output
24 Standard ClustalW Output
25 JalView: Alignment Editor and Viewer
26 GeneDoc: Alignment Editor and Viewer
27
- Integrated documentation resource for protein domains, families and sites
- Integrated view of databases
- Intuitive interface for text and sequence searches
Available at EBI (http://www.ebi.ac.uk)
28 When To Use What
- Genomic sequence searches: BLAT
  - DNA vs. genome
  - cDNA vs. genome
  - protein vs. genome
- For cDNA sequences: BLAST
  - cDNA vs. nucleotide (nt)
  - cDNA vs. protein (nr)
- For protein sequences: BLAST and PSI-BLAST
  - Protein vs. protein (nr)
  - BLAST: with similar species
  - PSI-BLAST: high-sensitivity, distant species
Same species
Same or similar species