Title: TGAC Electronic Sequences
1TGAC Electronic Sequences
Insulin, sequenced in 1955
2This is a sequence
- MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHLVEALYLVCGERGFFY
TPKARREVEGPQVGALELAGGPGAGGLEGPPQKRGIVEQCCASVCSLYQL
ENYCN
This is also a sequence
ACCATGATTACGCCAAGCTTGCATGCCTGCAGGTCGGCTGCATTCGAGGC
TGCCAGCAAGCAGGTCCTCGCAGCCCCGCCATGGCCCTGTGGACACGCCT
GCGGCCCCTGCTGGCCCTGCTGGCGCTCTGGCCCCCCCCCCCGGCCCGCG
CCTTCGTCAACCAGCATCTGTGTGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGAGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGCCGGGA
GGTGGAGGGCCCGCAGGTGGGGGCGCTGGAGCTGGCCGGAGGCCCGGGCG
CGGGCGGCCTGGAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGC
TGTGCCAGCGTCTGCTCGCTCTACCAGCTGGAGAACTACTGTAACTAGGC
CTGCCCCGACAAATAAACCCTTACGAGCAAG
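The two sequences above are related: the nucleotide sequence contains the coding region for the protein. As a minimal sketch in Python (the function name `translate` and the use of the standard genetic code are assumptions for illustration, not something the slides specify), the first codons of the coding region can be translated and compared with the start of the protein:

```python
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (first, second, and third base each cycle through T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate DNA from its first base, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        # "X" marks a codon that is not in the table (e.g. contains N).
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# The first 30 bases of the coding region, copied from the nucleotide
# sequence above, starting at the ATG that begins the protein.
coding_start = "ATGGCCCTGTGGACACGCCTGCGGCCCCTG"
print(translate(coding_start))  # MALWTRLRPL
```

The output matches the first ten residues of the protein sequence shown on the previous slide.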
3How to work with sequences
- Cut and paste sequences
- Save files as text from sequence repositories
- Unix vs. Windows format
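The Unix vs. Windows difference is mainly the line ending: Windows text files end lines with carriage return plus line feed (CRLF), Unix files with line feed (LF) alone. A minimal Python sketch of converting between the two follows; the function names are hypothetical and chosen only for this illustration.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and bare CR line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_windows_newlines(data: bytes) -> bytes:
    """Convert any line endings to Windows (CRLF)."""
    # Normalize to LF first so existing CRLF pairs are not doubled.
    return to_unix_newlines(data).replace(b"\n", b"\r\n")


windows_text = b">seq1\r\nACGT\r\nTTGA\r\n"
print(to_unix_newlines(windows_text))
```

Working on bytes rather than text avoids any automatic newline translation by the file layer when the data is read or written.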
4How to Use Notepad
- Start, then Accessories, then click Notepad
- Or: Start, then Run, type notepad, and click OK
5How to Use Cut and Paste on PCs
- Select sequence
- Press Ctrl+C to copy
- Press Ctrl+V to paste
- These options are also available through the Edit menu of most programs
6Whence come sequences?
- Individual researchers
- Genome sequencing projects
- Patent applications
7Whither do the sequences go?
9How can sequences be obtained?
10What is available via Entrez?
- PubMed
- Protein
- Nucleotide
- Structure
- Genome
- PopSet
- OMIM
- Taxonomy
- Books
- ProbeSet
- 3D Domains
- UniSTS
- SNP
- CDD
11ENTREZ XRefs Then
12ENTREZ XRefs Now
13Sequence Formats
- Raw
- Fasta
- ASN.1
- GenBank/GenPept
- DDBJ
- Ensembl
- Graphics
- XML
- convertible via ReadSeq
14Fasta Format
- The description line begins with a greater-than sign (>), with no whitespace between ">" and the identifier
- The description line ends with a line feed
- The sequence follows in single-letter code
>gi|23200275|pdb|1KP5|B Chain B, Cyclic Green Fluorescent Protein
TGSRHHHHHHSRKGEELFTGVVPILVELDG
DVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFXVQCFS
RYPDHMKRHDFFKSAMPEGYVQERTISFKDDGNYKTRAEVKFEGDTLVNR
IELKGIDFKEDGNILGHKLEYNYNSHNVYITADKQKNGIKANFKIRHNIE
DGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLL
EFVTAAGLVPRGTGLYK
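Because the layout is this simple, a FASTA file is easy to read programmatically. The following Python sketch is one possible reader, not a reference implementation; the function name `parse_fasta` and the two demo records are made up for illustration.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (description, sequence) pairs from the lines of a FASTA file."""
    records: List[Tuple[str, List[str]]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line starts a new record.
            records.append((line[1:], []))
        elif records:
            # Sequence lines belong to the most recent description line.
            records[-1][1].append(line)
    return [(header, "".join(chunks)) for header, chunks in records]


# Hypothetical two-record example, not real database entries.
example = """>demo1 hypothetical example record
ACGTACGT
TTGA
>demo2 second hypothetical record
MALW
"""

for header, sequence in parse_fasta(example.splitlines()):
    print(header, len(sequence))
```

Sequence lines that appear before any description line are ignored in this sketch; a stricter reader might report them as an error instead.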
15GenBank Format
16Graphics Format
17RefSeqs
- Reference sequence standards
- Available for chromosomes, mRNAs, proteins
- Non-redundant
- Curated
- Status and history are available
- Avoids redundancy in GenBank
18Nomenclature
- NW_ Whole genome shotgun assembly
- NT_ BAC-based contig
- NM_ Reference transcript
- XM_ Predicted transcript
- NP_ Reference protein
- XP_ Predicted protein
- NC_ Reference chromosome (including mitochondrial and chloroplast genomes)
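The prefixes above can be used to tell what kind of record an accession number refers to. A small Python sketch of that lookup follows; the function name `describe_refseq` and the accession strings in the usage lines are placeholders invented for illustration, and the table covers only the prefixes listed on this slide.

```python
# Prefix meanings as listed on the slide above.
REFSEQ_PREFIXES = {
    "NW_": "whole genome shotgun assembly",
    "NT_": "BAC-based contig",
    "NM_": "reference transcript",
    "XM_": "predicted transcript",
    "NP_": "reference protein",
    "XP_": "predicted protein",
    "NC_": "reference chromosome",
}


def describe_refseq(accession: str) -> str:
    """Describe an accession by its three-character RefSeq prefix."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "prefix not in this table")


# Placeholder accession strings, not real records.
print(describe_refseq("NM_000000"))
print(describe_refseq("XP_000000"))
print(describe_refseq("AB000000"))
```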
19Other Sequence Formats
- GCG
- DNA Strider
- Intelligenetics
- NBRF
- convertible in ReadSeq
20Multiple Sequence Formats
- MSF
- Phylip
- PAUP
- Fitch
- Pretty
- convertible in ReadSeq
21Converting Sequence Formats
- READSEQ
- SEQIO
- GCG programs (e.g., FROMEMBL, TOFASTA)
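The tools above handle many formats. To show what the simplest conversion involves, here is a Python sketch that turns a raw sequence (for example, one pasted from a web page with spaces, digits, and line breaks) into FASTA. It is an illustration only and does not reproduce how READSEQ, SEQIO, or GCG work; the function name `raw_to_fasta` and the default line width of 60 are assumptions.

```python
def raw_to_fasta(raw: str, name: str, width: int = 60) -> str:
    """Format a raw sequence as a single FASTA record."""
    # Keep letters only: this drops spaces, digits, and line breaks.
    # Note that it also drops symbols such as "*" or "-" if present.
    sequence = "".join(ch for ch in raw if ch.isalpha()).upper()
    # Break the sequence into lines of at most `width` characters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + name] + lines) + "\n"


pasted = "1 acgtacgtac gtacgtacgt\n21 ttga"
print(raw_to_fasta(pasted, "demo_sequence", width=10), end="")
```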
22Batch ENTREZ
- A method for obtaining large numbers of
sequences by supplying a file containing a list
of GI or accession numbers.
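Batch Entrez itself is a web form. The same idea (a list of identifiers in, many sequences out) can also be sketched programmatically with NCBI's E-utilities `efetch` service, which is a separate interface and is swapped in here only for illustration. In the Python sketch below the function names are hypothetical and the identifiers in the usage lines are placeholders, not real accessions; `fetch_fasta` is defined but not called, since it needs network access.

```python
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_id_list(text: str) -> List[str]:
    """Read one identifier per line, dropping blanks and duplicates."""
    ids: List[str] = []
    seen = set()
    for line in text.splitlines():
        identifier = line.strip()
        if identifier and identifier not in seen:
            seen.add(identifier)
            ids.append(identifier)
    return ids


def build_efetch_url(ids: List[str], db: str = "nucleotide") -> str:
    """Build an efetch URL requesting the given records as FASTA text."""
    params = {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(ids: List[str], db: str = "nucleotide") -> str:
    """Download the records as FASTA text (requires network access)."""
    with urlopen(build_efetch_url(ids, db)) as response:
        return response.read().decode("utf-8")


# Placeholder identifiers standing in for the contents of an ID file.
id_file_text = "ACC00001\nACC00002\n\nACC00001\n"
print(build_efetch_url(read_id_list(id_file_text)))
```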
23Charles Babbage