1
Outline for Today
  • 1:00-2:30 Perl for Bioinformatics structured
    lab (cont'd)
  • 2:30-3:00 Break
  • 3:00-5:00 Perl and the Web structured lab and
    lectures
  • 5:00-??? Practice

2
Perl and the Web
  • David Wishart
  • david.wishart_at_ualberta.ca

3
Some Web Definitions
  • WWW: World Wide Web
  • HTML: HyperText Markup Language
  • <HTML><HEAD> <TITLE> My first form </TITLE>
    </HEAD><BODY>
  • URL: Uniform Resource Locator
  • http://gchelpdesk.ualberta.ca
  • HTTP: HyperText Transfer Protocol

4
More Web Definitions
  • CGI: Common Gateway Interface
  • A standard for external gateway programs to
    interface with information servers such as HTTP
    or web servers
  • API: Application Programming Interface
  • A list of commands which a script can use to
    access an HTTP port to get information from a
    database
  • A list of functions and parameters which one uses
    to access code in a software library

5
Making Interactive Web Pages
[Diagram: Front End (HTML) linked through CGI to Back End (C, Perl or Java)]
6
Perl and the Web
  • Perl can be used to make web pages interactive
    (dynamic) through the use of CGI
  • CGI allows preparation of web forms (submitting
    data, sending results)
  • Perl can also be used to extract information from
    web servers or databases
  • This allows automated data retrieval, or screen
    scraping

7
Web Forms and Perl
  • Install an Apache web server (use default
    configuration)
  • Write a program in Perl (rename from program.pl
    to program.cgi)
  • Place the program in /cgi-bin
  • Build an HTML web form with the appropriate FORM
    ACTION and POST method
  • Place the web form in /usr/public_html

8
Sample Web Form
  • <HTML><HEAD> <TITLE> My first form </TITLE>
    </HEAD><BODY> <FORM ACTION="/cgi-bin/test2.cgi"
    METHOD="POST">
  • First Name: <INPUT NAME="first" TYPE=TEXT
    SIZE=25><BR> Last Name: <INPUT NAME="last"
    TYPE=TEXT SIZE=25><BR> E-mail: <INPUT
    NAME="email" TYPE=TEXT SIZE=30><BR> <INPUT
    TYPE=SUBMIT VALUE="Test it">
  • </FORM>
  • </BODY></HTML>
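
A minimal sketch of what a script such as /cgi-bin/test2.cgi might do with this form, assuming the browser sends the three fields as a URL-encoded POST body. The helper name parse_form and the reply page are hypothetical; the slides do not show the real test2.cgi. Only core Perl is used here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode a URL-encoded form body ("first=Ada&last=Lovelace") into a hash.
# parse_form is a hypothetical helper, not part of any library.
sub parse_form {
    my ($body) = @_;
    my %field;
    for my $pair (split /&/, $body) {
        next unless length $pair;
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                              # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX encodes one byte
        }
        $field{$name} = $value;
    }
    return %field;
}

# When run by the web server, the POST body arrives on STDIN and its
# length is given in the CONTENT_LENGTH environment variable.
if ($ENV{CONTENT_LENGTH}) {
    read(STDIN, my $body, $ENV{CONTENT_LENGTH});
    my %form = parse_form($body);
    # Escape HTML special characters before echoing user input.
    for my $v (values %form) {
        $v =~ s/&/&amp;/g;
        $v =~ s/</&lt;/g;
        $v =~ s/>/&gt;/g;
    }
    print "Content-type: text/html\n\n";
    print "<HTML><BODY>Hello $form{first} $form{last} ($form{email})",
          "</BODY></HTML>\n";
}
```

The field names first, last and email match the NAME attributes in the form above; run from the command line without CONTENT_LENGTH set, the script prints nothing.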

9
Getting Information from the Web (Screen Scraping)
  • Different from making interactive web pages
  • Idea is to access publicly available web forms
    and web programs (CGI scripts) so that you don't
    have to run/compile/install the program on your
    own.
  • Perl is very good for this kind of web data
    mining using the LWP module

10
Perl WWW Modules
  • LWP stands for the Library for the WWW in Perl
  • LWP allows the programmer to automate calls from
    a Perl script to a web page on the Internet.
  • LWP::UserAgent allows one to make a request to a
    web site and create a response object from that
    request.
  • use LWP::UserAgent;

11
Perl WWW Modules
  • To POST or send requests to a web server we use
    another Perl module called HTTP::Request::Common
  • The following is used to make a request to the
    NCBI ORF finder site
  • my $response = $agent->request(POST
    "http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi",
    [SEQUENCE => $sequence, gcode => '1']);
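
A hedged sketch of how the request on this slide could be built and inspected before anything is sent. It assumes the libwww-perl modules are installed and that the form fields go in an array reference, which is the usual HTTP::Request::Common calling style. The sequence and the --send switch are placeholders for illustration, and the NCBI URL is the one from the slide, which may no longer respond.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

# Placeholder sequence; the exercise reads the real one from a file.
my $sequence = 'ATGACAGATTACAGATTACAGATTACAGGATAG';

# POST only builds an HTTP::Request object; nothing is sent yet.
my $request = POST 'http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi',
    [ SEQUENCE => $sequence, gcode => '1' ];

print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";    # the URL-encoded form fields

# Sending needs network access, so it only happens when asked for.
# The --send switch is a hypothetical convenience for this sketch.
if (@ARGV && $ARGV[0] eq '--send') {
    my $agent    = LWP::UserAgent->new;
    my $response = $agent->request($request);
    if ($response->is_success) {
        print $response->decoded_content;
    }
    else {
        print 'Request failed: ', $response->status_line, "\n";
    }
}
```

Printing the request first is a cheap way to check the field names against the target form before contacting the server.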

12
Exercise 8/9 - orf_finder.pl
13
Prokaryotic Gene Structure
[Diagram: a prokaryotic gene, labelled with the TATA box, start codon, ORF (open reading frame) and stop codon]
ATGACAGATTACAGATTACAGATTACAGGATAG
[Reading frames marked: Frame 1, Frame 2, Frame 3]
14
Gene Finding In Prokaryotes
  • Scan the forward strand until a start codon is found
  • Staying in the same frame, scan in groups of three
    until a stop codon is found
  • If the number of codons between start and end is
    greater than 17, identify as a gene, go to the last
    start codon and proceed with step 1
  • If the number of codons between start and end is
    less than 18, go back to the last start codon and go
    to step 1
  • At the end of the chromosome, repeat the process for
    the reverse complement
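
The steps above can be sketched in plain Perl as follows. This is an illustration rather than the code of orf_finder.pl (which sends the sequence to a web server instead). The sub names, the ATG-only start codon, the three standard stop codons and the 0-based coordinates are assumptions, and the length threshold is lowered from 17 so that the short example sequence from the previous slide qualifies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);

# Return [start, end, strand] for each ORF with more than $min_codons
# codons between the start and stop codons. Positions are 0-based and
# relative to the strand that was scanned.
sub find_orfs {
    my ($dna, $min_codons, $strand) = @_;
    my @orfs;
    my $from = 0;
    # Step 1: look for the next start codon.
    while ((my $start = index($dna, 'ATG', $from)) >= 0) {
        # Step 2: stay in frame and read codons until a stop codon.
        my $codons = 0;
        my $pos    = $start + 3;
        my $stop;
        while ($pos + 3 <= length($dna)) {
            if ($is_stop{ substr($dna, $pos, 3) }) {
                $stop = $pos;
                last;
            }
            $codons++;
            $pos += 3;
        }
        # Step 3: keep the ORF only if it is long enough.
        if (defined $stop && $codons > $min_codons) {
            push @orfs, [ $start, $stop + 2, $strand ];
        }
        # Step 4: go back to just past this start codon and repeat.
        $from = $start + 1;
    }
    return @orfs;
}

sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

my $dna        = 'ATGACAGATTACAGATTACAGATTACAGGATAG';
my $min_codons = 8;    # the slide uses 17; lowered for this short example
my @orfs = (
    find_orfs($dna, $min_codons, '+'),
    find_orfs(reverse_complement($dna), $min_codons, '-'),   # last step
);
printf "strand %s: %d-%d\n", $_->[2], $_->[0], $_->[1] for @orfs;
```

Because the scan restarts just past each start codon, an ATG nested inside a longer ORF is reported as its own candidate, which is one literal reading of "go back to the last start codon".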

15
ORF Finding Tools
  • http://www.ncbi.nlm.nih.gov/gorf/gorf.html
  • http://alfa.ist.utl.pt/pedromc/SMS/orf_find.html
  • http://www.cbc.umn.edu/diogenes/diogenes.html
  • http://www.nih.go.jp/jun/cgi-bin/frameplot.pl

16
Algorithm for Exercise 8
  • Call up the LWP library
  • Call up the POST request library
  • Print request to screen
  • Open and read file provided by user (notice the
    different approach to file reading used here;
    both are valid)
  • Create a new user agent or a virtual web
    browser for the task of interest
  • Post or send the sequence and program parameter
    information to the desired web site
  • Save results as a character string
  • Open a file
  • Print results to the file
  • Close the file
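
The file-handling steps in this list can be sketched with core Perl alone. The slide does not say which two reading approaches the exercises use, so the two below (line by line, and reading the whole file at once) are assumptions, as are the sub names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Approach 1: read line by line and join the pieces.
sub read_by_line {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my $text = '';
    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;    # drop the line ending
        $text .= $line;
    }
    close $in;
    return $text;
}

# Approach 2: read the whole file in one go, then drop the line endings.
sub read_at_once {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    local $/;                    # no record separator: read everything
    my $text = <$in>;
    close $in;
    $text = '' unless defined $text;
    $text =~ s/\r?\n//g;
    return $text;
}

# Open a file, print the results to it, and close it.
sub save_results {
    my ($path, $results) = @_;
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $results;
    close $out or die "Cannot close $path: $!";
}

# Usage: perl script.pl input.txt output.txt
if (@ARGV == 2) {
    my ($in_file, $out_file) = @ARGV;
    my $sequence = read_by_line($in_file);
    save_results($out_file, 'Read ' . length($sequence) . " characters\n");
}
```

Both readers return the sequence as one string with line breaks removed, which is the form a web form field would normally expect.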

17
Let's Give it a Try
  • Go to Exercise 8
  • Open a text editor
  • Type in the program orf_finder.pl
  • Don't type the comments, just read them
  • Save the program
  • Get back to the Unix prompt and type chmod +x
    orf_finder.pl
  • Type orf_finder.pl and see what happens

18
File Input
  • To run this program you will need to have the
    sequence file called sars.txt
  • This file can be retrieved from
    http://gchelpdesk.ualberta.ca/sequences
  • Click on the link sars.txt, copy the file and
    paste it into a text editor (your choice).
  • Save the file as sars.txt

19
Exercise 10 - genscan.pl
20
Gene Finding in Eukaryotes

21
Eukaryotes
  • Complex gene structure
  • Large genomes (0.1 to 10 billion bp)
  • Exons and Introns (interrupted)
  • Low coding density (<30%)
  • 3% in humans, 25% in Fugu, 60% in yeast
  • Alternate splicing (40-60% of all genes)
  • High abundance of repeat sequence (50% in humans)
    and pseudogenes
  • Nested genes overlapping on same or opposite
    strand or inside an intron

22
Eukaryotic Gene Structure
[Diagram labels: Upstream Intergenic Region; Transcribed Region (5' UTR, start codon, exon 1, intron 1, exon 2, intron 2, exon 3, stop codon, 3' UTR); Downstream Intergenic Region]
23
Eukaryotic Gene Structure
[Diagram labels: exon 1, intron 1, exon 2, intron 2; 5' site, branchpoint site, 3' site; splice-site sequences CAG/NT and AG/GT]
24
RNA Splicing
25
Exon/Intron Structure (Detail)
Before splicing (Exon 1, Intron 1, Exon 2):
ATGCTGTTAGGTGG...GCAGATCGATTGAC
After splicing (Exon 1, Exon 2):
ATGCTGTTAGATCGATTGAC
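
A small sketch of the splice shown on this slide, assuming the exon boundaries inferred from the two sequences (exon 1 = ATGCTGTTAG, exon 2 = ATCGATTGAC, with the intron starting GT and ending AG). The slide elides the middle of the intron, so a run of N's stands in for the unknown bases; the sub name splice_exons is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join the exons of a gene, given 0-based [start, length] pairs.
sub splice_exons {
    my ($genomic, @exons) = @_;
    return join '', map { substr($genomic, $_->[0], $_->[1]) } @exons;
}

# The N's are a stand-in for the bases the slide leaves out ("..."),
# not real sequence.
my $exon1   = 'ATGCTGTTAG';
my $intron1 = 'GTGG' . ('N' x 6) . 'GCAG';
my $exon2   = 'ATCGATTGAC';
my $genomic = $exon1 . $intron1 . $exon2;

my $mrna = splice_exons(
    $genomic,
    [ 0, length($exon1) ],
    [ length($exon1) + length($intron1), length($exon2) ],
);
print "$mrna\n";
```

In practice the exon coordinates would come from a gene prediction or annotation rather than being known in advance.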
26
HMM for Gene Finding
27
Gene Prediction Methods and Websites
  • GRAIL (http://compbio.ornl.gov/Grail-1.3/)
  • FGENEH (http://genomic.sanger.ac.uk/gf/gf.shtml)
  • HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)
  • GENSCAN (http://genes.mit.edu/GENSCAN.html)
  • Gene Parser
    (http://beagle.colorado.edu/eesnyder/GeneParser.html)
  • GRPL (GeneTool/BioTools)

28
Genscan
29
How Does it Work?
  • GENSCAN
  • 5th order Hidden Markov Model
  • Hexamer composition statistics of exons vs.
    introns
  • Exon/intron length distributions
  • Scan of promoter and polyA signals
  • Weight matrices of 5' splice signals and start
    codon region (12 bp)
  • Uses dynamic programming to optimize gene model
    using above data
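
To make "hexamer composition statistics" concrete, the sketch below counts every overlapping six-base word in a sequence. It shows only the raw counting step; it is not GENSCAN's implementation, and the sub name and example sequence are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every overlapping 6-base word (hexamer) in a sequence.
sub hexamer_counts {
    my ($dna) = @_;
    my %count;
    for my $i (0 .. length($dna) - 6) {
        $count{ substr($dna, $i, 6) }++;
    }
    return %count;
}

my %count = hexamer_counts('ATGATGATG');
for my $hexamer (sort keys %count) {
    print "$hexamer $count{$hexamer}\n";
}
```

Tables like this, built separately from known exons and known introns, are what allow a gene finder to ask which class a new stretch of sequence resembles more.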

30
How Well Do They Do?
"Evaluation of gene finding programs". S. Rogic,
A. K. Mackworth and B. F. F. Ouellette. Genome
Research 11: 817-832 (2001).
31
Gene Prediction (Evaluation)
[Diagram: actual vs. predicted coding regions, with stretches labelled TP, FP, TN and FN]
  • Sensitivity: measure of the % of false negative
    results (Sn = 0.996 means 0.4% false negatives)
  • Specificity: measure of the % of false positive
    results
  • Correlation: combined measure of sensitivity and
    specificity
32
Gene Prediction (Evaluation)
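[Diagram: actual vs. predicted coding regions, with stretches labelled TP, FP, TN and FN]
  • Sensitivity: $Sn = TP / (TP + FN)$
  • Specificity: $Sp = TN / (TN + FP)$
  • Correlation: $CC = (TP \cdot TN - FP \cdot FN) / [(TP + FP)(TN + FN)(TP + FN)(TN + FP)]^{0.5}$
  • This is a better way of evaluating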
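
The three measures on this slide can be computed directly from the four counts. The sketch below follows the slide's definitions; the sub name and the counts are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sensitivity, specificity and correlation coefficient from the counts
# of true/false positives and negatives. Assumes no denominator is
# zero (i.e. every class is represented).
sub accuracy {
    my ($tp, $fp, $tn, $fn) = @_;
    my $sn    = $tp / ($tp + $fn);
    my $sp    = $tn / ($tn + $fp);
    my $denom = sqrt(($tp + $fp) * ($tn + $fn) * ($tp + $fn) * ($tn + $fp));
    my $cc    = ($tp * $tn - $fp * $fn) / $denom;
    return ($sn, $sp, $cc);
}

# Made-up counts: TP = 90, FP = 5, TN = 95, FN = 10
my ($sn, $sp, $cc) = accuracy(90, 5, 95, 10);
printf "Sn=%.3f Sp=%.3f CC=%.3f\n", $sn, $sp, $cc;
```

A perfect prediction (FP = FN = 0) gives CC = 1.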
33
Gene Prediction Accuracy at the Exon Level
[Diagram: actual vs. predicted exons, labelled WRONG EXON, CORRECT EXON and MISSING EXON]
Sensitivity (Sn)
34
Algorithm for Exercise 10
  • Call up the LWP library
  • Call up the POST request library
  • Print request to screen
  • Open and read file provided by user
  • Create a new user agent or a virtual web
    browser for the task of interest
  • Post or send the sequence and program parameter
    information to the desired web site
  • Save results as a character string
  • Open a file
  • Print results to the file
  • Close the file

35
Let's Give it a Try
  • Go to Exercise 10
  • Open a text editor
  • Type in the program genscan.pl
  • Don't type the comments, just read them
  • Save the program
  • Get back to the Unix prompt and type chmod +x
    genscan.pl
  • Type genscan.pl and see what happens

36
File Input
  • To run this program you will need to have the
    sequence file called genomic.txt
  • This file can be retrieved from
    http://gchelpdesk.ualberta.ca/sequences
  • Click on the link genomic.txt, copy the file
    and paste it into a text editor (your choice).
  • Save the file as genomic.txt

37
Exercise 11 - blast.pl
Select Database
38
Annotation by Homology: An Example
  • 76 residue protein from Methanobacter
    thermoautotrophicum (newly sequenced)
  • What does it do?
  • MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA
    LPGLAVDGELKIMGRVASKEEIKKILS

39
Running NCBI BLAST
Select Database
40
BLAST Output
41
BLAST Output
42
PSI-BLAST
43
PSI-BLAST
44
PSI-BLAST
45
Conclusions
  • Protein is a thioredoxin or glutaredoxin
    (function, family)
  • Protein has thioredoxin fold (2° and 3D
    structure)
  • Active site is from residues 11-14 (active site
    location)
  • Protein is soluble, cytoplasmic (cellular
    location)

46
Algorithm for Exercise 11
  • Call up the LWP library
  • Call up the POST request library
  • Print request to screen
  • Open and read file provided by user
  • Create a new user agent or a virtual web
    browser for the task of interest
  • Post or send the sequence and program parameter
    information to the desired web site
  • Save results as a character string
  • Open a file
  • Print results to the file
  • Close the file

47
Let's Give it a Try
  • Go to Exercise 11
  • Open a text editor
  • Type in the program blast.pl
  • Don't type the comments, just read them
  • Save the program
  • Get back to the Unix prompt and type chmod +x
    blast.pl
  • Type blast.pl and see what happens

48
File Input
  • To run this program you will need to have the
    sequence file called proteins.txt
  • This file can be retrieved from
    http://gchelpdesk.ualberta.ca/sequences
  • Click on the link proteins.txt, copy the file
    and paste it into a text editor (your choice).
  • Save the file as proteins.txt

49
Exercise 12 - MW.pl
50
Molecular Weight
  • Useful for SDS PAGE and 2D gel analysis
  • Useful for deciding on SEC matrix
  • Useful for deciding on MWC for dialysis
  • Essential in synthetic peptide analysis
  • Essential in peptide sequencing (classical or
    mass-spectrometry based)
  • Essential in proteomics and high throughput
    protein characterization

51
Molecular Weight
  • Crude MW calculation: MW = 110 × Numres
  • Exact MW calculation: MW = Σ AAi × MWi
  • Remember to subtract water (18.01 amu)
  • Note isotopic weights
  • Corrections for CHO, PO4, Acetyl, CONH2

52
Amino Acid Residue Masses
Residue          Monoisotopic Mass
Glycine          57.02147
Alanine          71.03712
Serine           87.03203
Proline          97.05277
Valine           99.06842
Threonine        101.04768
Cysteine         103.00919
Isoleucine       113.08407
Leucine          113.08407
Asparagine       114.04293
Aspartic acid    115.02695
Glutamine        128.05858
Lysine           128.09497
Glutamic acid    129.0426
Methionine       131.04049
Histidine        137.05891
Phenylalanine    147.06842
Arginine         156.10112
Tyrosine         163.06333
Tryptophan       186.07932
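
A minimal sketch of both molecular weight calculations, using the residue masses from this slide. It is not the code of MW.pl. One assumption to note: these are residue masses, with the water lost in peptide-bond formation already removed, so the sketch adds one water back for the free ends of the chain. Starting from free amino acid masses, one would instead subtract a water per peptide bond. The sub names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Monoisotopic residue masses, copied from this slide.
my %residue_mass = (
    G => 57.02147,  A => 71.03712,  S => 87.03203,  P => 97.05277,
    V => 99.06842,  T => 101.04768, C => 103.00919, I => 113.08407,
    L => 113.08407, N => 114.04293, D => 115.02695, Q => 128.05858,
    K => 128.09497, E => 129.0426,  M => 131.04049, H => 137.05891,
    F => 147.06842, R => 156.10112, Y => 163.06333, W => 186.07932,
);
my $WATER = 18.01056;    # monoisotopic H2O (about 18.01 amu)

# Crude estimate: 110 per residue.
sub crude_mw {
    my ($seq) = @_;
    return 110 * length($seq);
}

# Exact calculation: sum of residue masses plus one water.
sub exact_mw {
    my ($seq) = @_;
    my $mass = 0;
    for my $aa (split //, uc $seq) {
        die "Unknown residue '$aa'\n" unless exists $residue_mass{$aa};
        $mass += $residue_mass{$aa};
    }
    return $mass + $WATER;
}

my $peptide = 'GASP';    # made-up example peptide
printf "%s crude %d exact %.5f\n",
    $peptide, crude_mw($peptide), exact_mw($peptide);
```

Corrections for modifications such as acetyl or phosphate groups would be added to the total in the same way.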
53
Protein Identification via MW
  • MOWSE
  • http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
  • PeptideSearch
  • http://www.mann.embl-heidelberg.de/Services/PeptideSearch
  • Mascot
  • www.matrixscience.com
  • AACompSim/AACompIdent
  • http://www.expasy.ch/tools

54
Molecular Weight and Proteomics
[Images: 2-D Gel; QTOF Mass Spectrometry]
55
Let's Give it a Try
  • Go to Exercise 12
  • Open a text editor
  • Type in the program MW.pl
  • Don't type the comments, just read them
  • Save the program
  • Get back to the Unix prompt and type chmod +x
    MW.pl
  • Type MW.pl and see what happens

56
File Input
  • To run this program you will need to have the
    sequence file called proteins.txt
  • This file can be retrieved from
    http://gchelpdesk.ualberta.ca/sequences
  • Click on the link proteins.txt, copy the file
    and paste it into a text editor (your choice).
  • Save the file as proteins.txt

57
EST Annotation with Perl
  • It is now possible to combine many of the tools
    you've learned about yesterday and today to build
    a sophisticated piece of software for doing
    automated EST annotation
  • See Exercise X for the source code to perform
    this kind of EST annotation
  • See if you can modify it to suit your special
    interests

58
More Help? Check out the Canadian Bioinformatics
Help Desk
  • A Service for Genome Canada Researchers

www.gchelpdesk.ualberta.ca