Sequence Analysis - PowerPoint PPT Presentation

Provided by: kyo1
1
Sequence Analysis
2
Prerequisites to Sequence Analysis
  • Basic biology, so you can understand the language
    of the databases: Central Dogma (transcription,
    translation), eukaryotes, coding sequence (CDS),
    3' UTR, 5' UTR, introns, exons, promoters, codons,
    start codons, stop codons, secondary structure,
    tertiary structure.
  • Before you can analyze sequences, you have to
    understand their structure and know about basic
    biological database searching.
  • Collecting data is the most important and
    difficult part in bioinformatics.

3
Central Dogma
DNA -> RNA -> protein - for all living things
  • Why would the cell want to have an intermediate
    between DNA and the proteins it encodes?
  • Gene information can be amplified by having many
    copies of an RNA made from one copy of DNA.
  • Regulation of gene expression can be effected by
    having specific controls at each element of the
    pathway between DNA and proteins. The more
    elements there are in the pathway, the more
    opportunities there are to control it in
    different circumstances.
  • In Eukaryotes, the DNA can then stay pristine and
    protected, away from the caustic chemistry of the
    cytoplasm.

4
Eukaryotic Central Dogma
Donor
Acceptor
5
Transcription
Information strand ( + )
Template strand ( - )
6
Translation
  • http://wsrv.clas.virginia.edu/rjh9u/gif/protein.mov

7
Convention for nucleotides in database
  • Because the mRNA is read off the minus (template)
    strand of the DNA, its sequence matches the plus
    strand. Nucleotide sequences are therefore always
    quoted on the plus strand.
  • In bioinformatics the sequence format does NOT
    make a difference between Uracil and Thymine.
    There is no symbol for Uracil. It is always
    represented by a T.
  • Even genomic sequence follows that convention. A
    gene on the plus strand is quoted so that it is
    in the same strand as its product mRNA.
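
The convention above can be sketched in a few lines of Java. This is only an illustration under simple assumptions: the class and method names (`SequenceConvention`, `toDatabaseNotation`, `reverseComplement`) are made up for this example, the input sequence is invented, and any base other than A, C, G, T is written as N.

```java
class SequenceConvention {
    /** Rewrites an RNA sequence in database notation: every U becomes a T. */
    public static String toDatabaseNotation(String rna) {
        return rna.toUpperCase().replace('U', 'T');
    }

    /** Returns the reverse complement of a DNA sequence (A<->T, C<->G). */
    public static String reverseComplement(String dna) {
        StringBuilder out = new StringBuilder(dna.length());
        // Walk the sequence backwards and append the complement of each base.
        for (int i = dna.length() - 1; i >= 0; i--) {
            char base = Character.toUpperCase(dna.charAt(i));
            switch (base) {
                case 'A': out.append('T'); break;
                case 'T': out.append('A'); break;
                case 'C': out.append('G'); break;
                case 'G': out.append('C'); break;
                default:  out.append('N'); break; // unknown or ambiguous base
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String mrna = "AUGGCUUAA";                      // invented example mRNA
        String plus = toDatabaseNotation(mrna);         // how a database would quote it
        String template = reverseComplement(plus);      // the strand the mRNA was read from
        System.out.println("quoted (plus) strand: " + plus);
        System.out.println("template (minus) strand: " + template);
    }
}
```

The quoted sequence is the mRNA with T in place of U; the template strand is recovered by taking its reverse complement.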

8
Biological Databases
  • Nucleotide databases
  • GenBank: international collaboration
  • NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia)
  • A "bank" means no curation. Submission to these
    databases is required for publication in a
    journal.
  • Organism-specific databases (Exercise: find URLs
    using search engines)
  • FlyBase
  • ChickGBASE
  • pigbase
  • wormpep
  • YPD (Yeast Protein Database)
  • SGD (Saccharomyces Genome Database)

9
  • Protein Databases
  • NCBI
  • Swiss-Prot (free for academic use, otherwise
    commercial. Licensing restrictions on discoveries
    made using the DB. The 1998 version is free of any
    licensing.)
  • http://us.expasy.org/sprot/
  • NCBI has the latest free version.
  • Translated proteins from GenBank submissions
  • EMBL
  • TrEMBL is a computer-annotated supplement of
    SWISS-PROT that contains all the translations of
    EMBL nucleotide sequence entries not yet
    integrated in SWISS-PROT
  • PIR

10
  • Structure databases
  • PDB: Protein structure database.
  • http://www.rcsb.org/pdb/
  • MMDB: NCBI's version of PDB with Entrez links.
  • http://www.ncbi.nlm.nih.gov
  • Genome Mapping Information
  • http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
  • NCBI (Human)
  • Genome Centers
  • Stanford, Washington University
  • Research Centers and Universities

11
  • Literature databases
  • NCBI PubMed: all biomedical literature.
  • www.ncbi.nlm.nih.gov
  • Abstracts and links to publisher sites for
    full-text retrieval/ordering and journal browsing.
  • Publisher web sites.
  • BioMedNet: commercial site for literature search.
  • Pathways Database
  • KEGG: Kyoto Encyclopedia of Genes and Genomes
    http://www.genome.jp/kegg/

12
Primary Databases
  • A primary Database is a repository of data
    derived from experiments or from research
    knowledge.
  • GenBank (nucleotide repository)
  • Protein DB, Swiss-Prot
  • PDB (MMDB) are primary databases.
  • PubMed (literature)
  • Genome mapping databases.
  • KEGG database (pathways)

13
Secondary Databases
  • A secondary database contains information derived
    from other sources.
  • RefSeq (curated collection of GenBank at NCBI)
  • UniGene (clustering of ESTs at NCBI)
  • Organism-specific databases are often a mix
    between primary and secondary.
  • An Expressed Sequence Tag (EST) is a short
    sub-sequence of a protein-coding DNA sequence. It
    was originally intended as a way to identify gene
    transcripts.

14
FASTA Format
MOST important data format!!!
  • >identifier descriptive text
  • nucleotide or amino-acid
  • sequence on multiple lines if needed.
  • Example
  • >gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for
    alpha-1-anti-trypsin
  • GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
  • CATCACGCGGGGCCTTCTGCTGCTGGC .
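
As a rough illustration of the format just described, the sketch below reads FASTA text in plain Java: a line starting with `>` opens a new record, and every following line is appended to that record's sequence. The class name `FastaReader`, the method `parse`, and the two short records in `main` are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FastaReader {
    /** Parses FASTA text into a map from header (without '>') to the joined sequence. */
    public static Map<String, String> parse(String text) {
        Map<String, String> records = new LinkedHashMap<>();
        String header = null;
        StringBuilder sequence = new StringBuilder();
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (header != null) {
                    records.put(header, sequence.toString());
                }
                header = line.substring(1).trim();
                sequence.setLength(0);
            } else if (header != null) {
                sequence.append(line); // sequence may span several lines
            }
        }
        if (header != null) {
            records.put(header, sequence.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Two invented records, the first one wrapped over two lines.
        String text = ">seq1 example record\nGACCAG\nCCCTGA\n>seq2\nATGC\n";
        for (Map.Entry<String, String> e : parse(text).entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue().length() + " residues");
        }
    }
}
```

A real reader would stream from a file instead of holding the text in memory; the string-based version keeps the example short.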

15
  • gi|41|emb|X63129.1|BTA1AT is a header of fasta
    format.
  • -> Problem: there are many different headers
  • GenBank gi|gi-number|gb|accession|locus
  • EMBL Data Library gi|gi-number|emb|accession|locus
  • DDBJ, DNA Database of Japan gi|gi-number|dbj|accession|locus
  • NBRF PIR pir||entry
  • Protein Research Foundation prf||name
  • SWISS-PROT sp|accession|entry name
  • Brookhaven Protein Data Bank pdb|entry|chain
  • Patents pat|country|number
  • GenInfo Backbone Id bbs|number
  • General database identifier gnl|database|identifier
  • NCBI Reference Sequence ref|accession|locus
  • Local Sequence identifier lcl|identifier
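
Because these NCBI-style headers separate their fields with the `|` character, one way to cope with the variety is to split on it and look at the leading tag. The sketch below is a simplified assumption, not a complete parser: it only gives names to the fields of the `gi|gi-number|db|accession|locus` form and keeps everything else as a database tag plus a raw identifier. The class `FastaHeader`, its field names, and the `lcl|my_seq` header are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class FastaHeader {
    /** Splits a FASTA header into named identifier fields and a free-text description. */
    public static Map<String, String> parse(String header) {
        Map<String, String> fields = new LinkedHashMap<>();
        String line = header.startsWith(">") ? header.substring(1) : header;

        // The identifier runs up to the first space; the rest is descriptive text.
        int space = line.indexOf(' ');
        String id = space < 0 ? line : line.substring(0, space);
        String description = space < 0 ? "" : line.substring(space + 1).trim();

        String[] parts = id.split("\\|", -1); // -1 keeps empty fields such as in pir||entry
        if (parts.length >= 5 && parts[0].equals("gi")) {
            fields.put("gi", parts[1]);
            fields.put("database", parts[2]);
            fields.put("accession", parts[3]);
            fields.put("locus", parts[4]);
        } else if (parts.length >= 2) {
            fields.put("database", parts[0]);
            fields.put("identifier",
                    String.join("|", Arrays.copyOfRange(parts, 1, parts.length)));
        } else {
            fields.put("identifier", id); // no '|' at all: treat as a bare identifier
        }
        fields.put("description", description);
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse(
                ">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin"));
        System.out.println(parse("lcl|my_seq")); // hypothetical local identifier
    }
}
```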

16
Feature table (NCBI/EMBL/DDBJ)
  • http://www.ncbi.nlm.nih.gov/collab/FT/index.html

17
Entrez
  • Index-based search system. Each field in the
    database is searchable.
  • All primary databases are interlinked as one big
    relational database.
  • (e.g. PubMed links in GenBank records)
  • http://www.ncbi.nlm.nih.gov/Entrez/
  • Tutorials
  • http://www.ncbi.nlm.nih.gov/Education/index.html

18
SWISSPROT
http://us.expasy.org/sprot/
  • Core data: protein sequence data, the citation
    information, and the taxonomic data
  • Annotation
  • Function(s) of the protein
  • Domains and sites. For example: calcium binding
    regions, ATP-binding sites, zinc fingers,
    homeobox, kringle, etc.
  • Post-translational modification(s). For example:
    carbohydrates, phosphorylation, acetylation,
    GPI-anchor, etc.
  • Secondary structure
  • Quaternary structure. For example: homodimer,
    heterotrimer, etc.
  • Similarities to other proteins
  • Disease(s) associated with deficiencies in the
    protein
  • Sequence conflicts, variants, etc.

19
Break
20
Weka
  • http://www.cs.waikato.ac.nz/ml/weka/
  • GUI
  • With the provided source code, a user can easily
    modify the program.

21
  • Input file format (ARFF)
  • % Comments start with this
  • Header
  • @relation name of the learning concept
  • @attribute name real (for numeric attributes)
  • @attribute name {list of attribute values
    separated by commas} (for nominal attributes)
  • Ex. @attribute fuel_type {diesel, gas}
  • @data This tells that the data starts from the
    next line.
  • Ex. @data
  • 6, 148, 72, 35, 0, 33, 6, diesel, 34, no
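
To make the layout concrete, here is a small plain-Java sketch that assembles a complete input file of this kind as text. Everything in it is illustrative: the class `ArffExample`, its helper methods, the relation name `cars`, and the attribute names and data rows are invented, and only the `fuel_type` attribute comes from the slide.

```java
class ArffExample {
    /** Declares a numeric attribute. */
    public static String numericAttribute(String name) {
        return "@attribute " + name + " real\n";
    }

    /** Declares a nominal attribute with its allowed values in braces. */
    public static String nominalAttribute(String name, String... values) {
        return "@attribute " + name + " {" + String.join(", ", values) + "}\n";
    }

    /** Builds a tiny, made-up data set in the input file format. */
    public static String build() {
        StringBuilder arff = new StringBuilder();
        arff.append("% Hypothetical data set, for illustration only\n");
        arff.append("@relation cars\n\n");
        arff.append(numericAttribute("horsepower"));
        arff.append(nominalAttribute("fuel_type", "diesel", "gas"));
        arff.append(nominalAttribute("economical", "yes", "no"));
        arff.append("\n@data\n");
        // One instance per line, values in the same order as the attributes.
        arff.append("148, diesel, no\n");
        arff.append("72, gas, yes\n");
        return arff.toString();
    }

    public static void main(String[] args) {
        System.out.print(build());
    }
}
```

Each data row lists one value per declared attribute, in declaration order, and the last attribute is the natural choice for the class to be learned.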

22
  • Examples for using the source code in Java
  • // Needs: java.io.BufferedReader, java.io.FileReader, java.util.Random,
    // weka.core.Instances, weka.classifiers.functions.MultilayerPerceptron
  • Read in an input file
  • BufferedReader reader = new BufferedReader(new
    FileReader(filename));
  • Instances data = new Instances(reader);
  • Preprocess the instances read in
  • data.setClassIndex(data.numAttributes() - 1); // last attribute is the class
  • int seed = 1234342;
  • data.randomize(new Random(seed));
  • Training a classifier
  • MultilayerPerceptron nn = new MultilayerPerceptron();
  • // -L learning rate, -M momentum, -N number of training epochs
  • String[] args = new String[] {"-L", "0.3",
    "-M", "0.2", "-N", "500"};
  • nn.setOptions(args);
  • nn.buildClassifier(data);
  • Test instances
  • N-fold cross validation
  • Using a separate test file

23
  • Test instances
  • N-fold cross validation: need to hold out some of
    the instances from the input file for the test
  • This can be done easily with Weka's functions.
  • Ex. Instances data = new Instances(reader);
  • // This is an important step for n-fold cross
    // validation
  • data.stratify(numFolds);
  • // fold runs from 0 to numFolds - 1
  • Instances train = data.trainCV(numFolds,
    fold);
  • Instances test = data.testCV(numFolds,
    fold);
  • // Randomize each data set
  • Using a separate test file: just use the same
    method to read in an input file, and test each
    individual instance on the trained classifier
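
The idea behind splitting one input file into training and test folds can be sketched without Weka. The code below is a simplified stand-in, not Weka's actual implementation: it assigns instance `i` to fold `i % numFolds`, whereas `trainCV` and `testCV` may partition the data differently. The class `FoldSplit` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class FoldSplit {
    /** Indices of the instances held out for testing in the given fold (0-based). */
    public static List<Integer> testIndices(int numInstances, int numFolds, int fold) {
        List<Integer> test = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            if (i % numFolds == fold) {
                test.add(i);
            }
        }
        return test;
    }

    /** Indices of the instances used for training in the given fold (0-based). */
    public static List<Integer> trainIndices(int numInstances, int numFolds, int fold) {
        List<Integer> train = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            if (i % numFolds != fold) {
                train.add(i);
            }
        }
        return train;
    }

    public static void main(String[] args) {
        int numInstances = 10;
        int numFolds = 3;
        // Every instance is tested exactly once across the folds.
        for (int fold = 0; fold < numFolds; fold++) {
            System.out.println("fold " + fold
                    + " train=" + trainIndices(numInstances, numFolds, fold)
                    + " test=" + testIndices(numInstances, numFolds, fold));
        }
    }
}
```

In each round a classifier would be built on the training indices and scored on the test indices, and the scores averaged over all folds.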
