Title: Bioinformatics I, Sequence Analysis
1. Bioinformatics I, Sequence Analysis
Lecture 1: Introduction, Genome Edit Distance
UNIX, vi, Pseudocode.
2. Topics covered in this course
- Alignment, multiple alignment. Homology/similarity.
- Motif finding, footprint finding, gene finding.
- Sorting, clustering, tree building, molecular evolution.
- Database structure, database searching, statistical significance.
- Sequence models, Markov models.
- Protein/RNA secondary structure prediction.
- Genomics, proteomics.
- Ontologies, algorithms.
3. Part 1: a story of mice and men
4. Of Mice and Men
- Mouse genes and human genes are 80 to 95% identical, but their locations on the chromosome are largely scrambled.
- Mouse and human have a common ancestor about 80 MYA (million years ago). Since then, we have evolved independently.
- Evolution occurs by point mutations and rearrangements.
Order of genes in a Human Chromosome
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Order of genes in a Mouse Chromosome
ABVWCUTSRQPKLMNOJIHGFEDXYZ
5. Synteny
Synteny: conservation of gene order. A cluster of genes that occur in the same order in different genomes is a "syntenic group".
6. Syntenic group in bacteria
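8. Of Mice and Men
- Gene rearrangement occurs by "reversals".
[Figure: two diagrams of a chromosome segment carrying genes A, B, C, D, with the 5' and 3' ends marked; B and C trade places between the first diagram and the second.]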
How many reversals does it take to switch Mouse
to Man?
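One rough way to explore this question is to count the reversals used by a simple greedy procedure. The sketch below is only an upper bound under stated assumptions: it ignores gene orientation (5'/3'), treats each chromosome as a string of distinct gene labels, and the function names `reverse_segment` and `greedy_reversal_sort` are illustrative, not from the lecture. The two gene orders are the human and mouse strings given earlier.

```python
def reverse_segment(order, i, j):
    """Return a copy of `order` with the segment order[i..j] (inclusive) reversed."""
    return order[:i] + order[i:j + 1][::-1] + order[j + 1:]


def greedy_reversal_sort(start, target):
    """Turn `start` into `target` by reversals, fixing one position at a time.

    Returns the list of intermediate gene orders. The length of that list is
    an upper bound on the number of reversals needed, not necessarily the minimum.
    """
    if sorted(start) != sorted(target):
        raise ValueError("both orders must contain the same genes")
    current, steps = start, []
    for i, gene in enumerate(target):
        j = current.index(gene)          # where the wanted gene sits right now
        if j != i:
            current = reverse_segment(current, i, j)
            steps.append(current)
    return steps


human = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
mouse = "ABVWCUTSRQPKLMNOJIHGFEDXYZ"

steps = greedy_reversal_sort(mouse, human)
for n, order in enumerate(steps, start=1):
    print(n, order)
print("reversals used:", len(steps))
```

For these two strings the greedy procedure reaches the human order in three reversals. Whether that is the true minimum, and how strand orientation would change the count, is not settled by this sketch.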
9. A plausible theory
...for how reversals can occur is illegitimate synapsis during prophase I of meiosis.
[Figure: chromosome segment with genes A, B, C, D and the 5' and 3' ends marked.]
10. Chromosomal evolution is a series of flip-flops
ancestral mammal
123456789
432156789 436512789 435612789 435698721
123498765 123789465 123789645 873219645
human
mouse
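The two rows of gene orders on this slide can be checked mechanically. The sketch below assumes each row is read left to right as successive descendants of the ancestral order 123456789, and that every gene label appears exactly once; `is_single_reversal` is an illustrative helper name, not part of the lecture.

```python
def is_single_reversal(a, b):
    """True if `b` is `a` with exactly one contiguous segment reversed.

    Assumes every gene label appears once in each order.
    """
    if len(a) != len(b) or a == b:
        return False
    diff = [k for k in range(len(a)) if a[k] != b[k]]
    i, j = diff[0], diff[-1]             # outer edges of the changed region
    return a[i:j + 1][::-1] == b[i:j + 1]


ancestor = "123456789"
# The two rows of gene orders from the slide, read left to right.
lineages = [
    ["432156789", "436512789", "435612789", "435698721"],
    ["123498765", "123789465", "123789645", "873219645"],
]

results = []
for row in lineages:
    path = [ancestor] + row
    ok = all(is_single_reversal(x, y) for x, y in zip(path, path[1:]))
    results.append(ok)
    print(" -> ".join(path), "| every step is one reversal:", ok)
```

Under those assumptions, each step in both rows is a single reversal, so each lineage is four "flip-flops" away from the ancestor.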
11. The Sloppy Cook
The sloppy cook at a pancake diner makes pancakes
of all different sizes and stacks them
haphazardly.
The waiter likes the pancakes to be stacked with
the largest on the bottom and the smallest on
top. On the way to the table, using only one hand
with a spatula, he flips the pancakes until they
are arranged by size. How does he do it in the
minimum number of flips?
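A simple flipping procedure can be sketched as follows. It always works but generally uses more flips than the minimum the waiter is after: it brings the largest unsorted pancake to the top, then flips it down into place, and repeats. The representation (a list whose first element is the top pancake) and the names `flip` and `pancake_sort` are assumptions made for illustration.

```python
def flip(stack, k):
    """Slide the spatula under the top k pancakes and flip them (stack[0] is the top)."""
    return stack[:k][::-1] + stack[k:]


def pancake_sort(stack):
    """Sort so the smallest pancake is on top and the largest is on the bottom.

    Returns (sorted_stack, flips), where flips lists the spatula depth of each flip.
    Uses at most two flips per pancake, which is usually more than the minimum.
    """
    stack = list(stack)
    flips = []
    for size in range(len(stack), 1, -1):       # fill the bottom-most unsorted slot first
        biggest = stack.index(max(stack[:size]))
        if biggest == size - 1:
            continue                             # already where it belongs
        if biggest != 0:
            stack = flip(stack, biggest + 1)     # bring the biggest to the top
            flips.append(biggest + 1)
        stack = flip(stack, size)                # flip it down into position
        flips.append(size)
    return stack, flips


print(pancake_sort([3, 1, 4, 2]))   # ([1, 2, 3, 4], [3, 4, 2, 3])
```

Replaying the recorded flip depths on the starting stack reproduces the sorted result, which is a handy way to check a hand-written instruction set.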
12. In-class exercise
Given the arrangement below, flip the pancakes
until they are in order. How many flips? (You
can order the numbers instead of the pancakes.)
123456
642315
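To check a hand-found answer, a shortest flip sequence can be found by brute force, since six pancakes have only 720 arrangements. The sketch below assumes the digits are read left to right as top to bottom of the stack (so a flip reverses a prefix), with 642315 as the start and 123456 as the goal; `min_flips` is an illustrative name.

```python
from collections import deque


def min_flips(start, goal):
    """Breadth-first search over prefix reversals.

    Returns a shortest list of stacks leading from `start` to `goal`,
    or None if the goal cannot be reached.
    """
    queue = deque([start])
    parent = {start: None}
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:           # walk back to the start
                path.append(current)
                current = parent[current]
            return path[::-1]
        for k in range(2, len(current) + 1):     # flipping one pancake changes nothing
            nxt = current[:k][::-1] + current[k:]
            if nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    return None


path = min_flips("642315", "123456")
print(" -> ".join(path))
print("flips:", len(path) - 1)
```

Breadth-first search visits stacks in order of flip count, so the first time it reaches the goal the path is as short as possible. This approach does not scale to long chromosomes, which is why smarter algorithms are needed.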
13. In-class exercise
- Write detailed instructions on how to stack six pancakes by flipping. The instructions should not depend on the starting order.
- Give your instructions to your partner.
- Follow your partner's written instructions to stack the following six "pancakes" in order: 125436. (Order them smallest to largest. The plate is on the right side.)
- Run the "instruction set". Describe what happens.
- Fix bugs. Repeat as time permits.
14. Part 2: the basics
15. Biological Macromolecules are Conveniently Represented as Linear Strings.
DNA: nucleotides. 4-character alphabet.
Protein: amino acids. 20-character alphabet.
Lipids, carbohydrates, other stuff: not linear heteropolymers. Not easily represented as a sequence.
Your task: MEMORIZE THE ALPHABET of AMINO ACIDS
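Because each molecule type has its own alphabet, a string can be roughly classified just by the characters it uses. The following is a minimal sketch, assuming the 4 DNA letters and the 20 standard one-letter amino acid codes; the names `DNA_ALPHABET`, `PROTEIN_ALPHABET`, and `guess_molecule` are illustrative.

```python
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")   # the 20 standard one-letter codes


def guess_molecule(sequence):
    """Classify a string as 'DNA', 'protein', or 'unknown' from the letters it uses."""
    letters = set(sequence.upper())
    if not letters:
        return "unknown"
    if letters <= DNA_ALPHABET:          # every letter is one of A, C, G, T
        return "DNA"
    if letters <= PROTEIN_ALPHABET:
        return "protein"
    return "unknown"


print(guess_molecule("gtcgggaaga"))   # DNA
print(guess_molecule("MVGSLNCIVA"))   # protein
```

The DNA test comes first because A, C, G, and T are also amino acid codes, so a short peptide made only of those four letters would be reported as DNA. This is a heuristic, not a definitive test.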
16. A DNA Sequence
   1 gtcgggaaga tggcgctacg tctgctgcgg agggcggcgc gcggagctgc ggcggcggcg
  61 ctgctgaggc tgaaagcgtc tctagcagct gatatcccca gacttggata tagttcctca
 121 tcccatcaca agtacatccc ccggagggca gtgctttatg tacctggaaa tgatgaaaag
 181 aaaataaaga agattccatc cctgaatgta gattgtgcag tgctcgactg tgaggatgga
 241 gtggctgcaa acaaaaagaa tgaagctcga ctgagaattg taaaaactct tgaagacatt
 301 gatctgggcc ctactgaaaa atgtgtgaga gtcaactcag tttccagtgg tctggcggaa
 361 gaagacctag agaccctttt gcaatcccgg gtccttcctt ccagcctgat gctaccaaag
 421 gtggaaagtc ctgaagaaat ccagtggttt gcagacaaat tttcattcca cttaaaaggc
 481 cgaaaacttg aacaaccaat gaatttaatc ccttttgtgg aaactgcaat gggtttgctc
 541 aattttaagg cagtgtgtga agaaaccctg aaggtcgggc ctcaagtagg tctctttcta
 601 gatgcagtcg tttttggagg agaagacttt cgagccagca taggtgcaac aagtagtaaa
 661 gaaaccctgg atattctcta cgcccggcaa aagattgttg tcatagcgaa agcctttggt
 721 ctccaagccg tagatctggt gtacattgac tttcgagatg gagctgggct gcttagacag
 781 tcacgagaag gagccgccat gggcttcact ggtaagcagg tgattcaccc taaccaaatt
 841 gccgtggtcc aggagcagtt ttctccttcc cctgaaaaaa ttaagtgggc tgaagaactg
 901 attgctgcct ttaaagaaca tcaacaatta ggaaaggggg cctttacttt ccaagggagt
 961 atgatcgaca tgccattact gaagcaggcc cagaacactg ttacgcttgc cacctccatc
1021 aaggaaaaat gatctgttaa atgaagctgt catcggggaa tgctgagctg caatgaccat
1081 tactgtagag ttacaacaag agggtaaagt tcatacatgg cgacctgtgt caaatccgtc
1141 cattgatctg ccctccagca cacatttact gagcttctgt tacgtgcctg tggttcttgg
1201 aaagagcttt ttccttctct acaaggagga atctgatgca actgacatcc tcaatagcta
1261 cagagaactt gcaaaggagt agagagaatg tttgaggtcc agccttggtg tagagaagcg
1321 gcagaaacag aaatcccaaa aggtgtcatg cttggctcca gctctgtgct ctcaggactc
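The position numbers and spaces in a listing like this are for human readers; most analysis starts by stripping them to get one continuous string. The following is a minimal sketch, with `clean_sequence` as an illustrative name and only the first 120 bases pasted in to keep it short.

```python
from collections import Counter


def clean_sequence(listing):
    """Keep only letters, dropping position numbers, spaces, and line breaks."""
    return "".join(ch for ch in listing if ch.isalpha()).lower()


# First 120 bases of the sequence above, including its position numbers.
listing = """
  1 gtcgggaaga tggcgctacg tctgctgcgg agggcggcgc gcggagctgc ggcggcggcg
 61 ctgctgaggc tgaaagcgtc tctagcagct gatatcccca gacttggata tagttcctca
"""

seq = clean_sequence(listing)
print(len(seq))        # 120
print(Counter(seq))    # how often each base occurs
```

Filtering on letters rather than on a fixed layout means the same function works whether the numbers sit at the start of a line or are scattered through the text.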
17. RNA
[Figure: the four RNA bases A, G, C, and U.]
The 2' OH is missing in DNA.
Note: U has no methyl here.
18. A Protein Sequence
MVGSLNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNL
VIMGKKTWFSIPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKL
TEQPELANKVDMVWIVGGSSVYKEAMNHPGHLKLFVTRIMQDFESDTFFP
EIDLEKYKLLPEYPGVLSDVQEEKGIKYKFEVYEKND
19. Study this page: http://www.johnkyrk.com/aminoacid.html
20. Chemical classification of the amino acids using the Venn diagram.
Taylor, W. R. (1986). Classification of amino
acid conservation. J. Theor. Biol. 119, 205-218.
21. Genetic Code
[Figure: the genetic code, with the wobble base labeled.]
Coding regions of DNA have special constraints on
mutation.
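One way to see this constraint is to ask how often a single-base change leaves the encoded amino acid unchanged, separately for each codon position. The sketch below assumes the standard genetic code and counts a stop-to-stop change as unchanged; the names `CODON_TABLE`, `translate`, and `synonymous_fraction` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA coding sequence codon by codon, starting at the first base."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1, or 2) that keep the amino acid."""
    same = total = 0
    for codon, aa in CODON_TABLE.items():
        for base in BASES:
            if base == codon[position]:
                continue                         # not a change
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TABLE[mutant] == aa
    return same / total


print(translate("atggcgctacgt"))                 # MALR
for pos in range(3):
    print("codon position", pos + 1, round(synonymous_fraction(pos), 3))
```

The third (wobble) position tolerates far more silent changes than the first two, which is one reason mutations in coding regions are not equally likely to survive at every site.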
22. Learn UNIX basics
If you don't know UNIX, sit next to someone who
does.
Use the handouts or unixetc.pdf from the course
web page
23. In-class exercises: learning UNIX
- List all files starting with lowercase L. (ls)
- Make a course directory. Call it bioinfo. (mkdir)
- Change directories to your new directory. (cd)
- Copy the file "lotsofjunk" from my directory to your directory. (see whiteboard)
- Count the number of lines in "lotsofjunk" that have the string "product". (grep) answer___________
24. In-class exercises: learning UNIX
- Find out how to sort by field (man sort). Sort the file lotsofjunk using field 2, numerically.
- Same thing. Pipe the output to more. (|)
- Same thing. Instead of piping, redirect the output to a file called sortedjunk. (>)
- Edit the file lotsofjunk using vi. (vi, see next page)
25. In-class exercises: vi lotsofjunk
- Try each of the move commands and write what it does on a separate page.
- Try each of the delete commands and write what it does on a separate page. Hit Undo ("u") after each delete, so the file is unchanged.
- Try each of the modify commands and describe what it does on a separate page. Don't forget to hit escape!
- Search for the string "protein_id" (/). Copy those lines (yy) and put them at the top of the file (1Gp).
- Delete the remaining lines (:.,$d), and save the file as "junk" (:w junk).