Title: Outline for Today
1. Outline for Today
- 1:00-2:30 Perl for Bioinformatics structured lab (cont'd)
- 2:30-3:00 Break
- 3:00-5:00 Perl and the Web structured lab and lectures
- 5:00-??? Practice
2. Perl and the Web
- David Wishart
- david.wishart_at_ualberta.ca
3. Some Web Definitions
- WWW: World Wide Web
- HTML: HyperText Markup Language
  - <HTML><HEAD><TITLE>My first form</TITLE></HEAD><BODY>
- URL: Uniform Resource Locator
  - http://gchelpdesk.ualberta.ca
- HTTP: HyperText Transfer Protocol
4. More Web Definitions
- CGI: Common Gateway Interface
  - A standard for external gateway programs to interface with information servers such as HTTP or web servers
- API: Application Programming Interface
  - A list of commands which a script can use to access an HTTP port to get information from a database
  - A list of functions and parameters which one uses to access code in a software library
5. Making Interactive Web Pages
- Front End (HTML)
- CGI
- Back End (C, Perl or Java)
6. Perl and the Web
- Perl can be used to make web pages interactive (dynamic) through the use of CGI
- CGI allows preparation of web forms (submitting data, sending results)
- Perl can also be used to extract information from web servers or databases
- This allows automated data retrieval, or "screen scraping"
7. Web Forms and Perl
- Install an Apache web server (use the default configuration)
- Write a program in Perl (rename it from program.pl to program.cgi)
- Place the program in /cgi-bin
- Build an HTML web form with the appropriate FORM ACTION and POST method
- Place the web form in /usr/public_html
8. Sample Web Form
<HTML><HEAD><TITLE>My first form</TITLE></HEAD>
<BODY>
<FORM ACTION="/cgi-bin/test2.cgi" METHOD="POST">
First Name: <INPUT NAME="first" TYPE=TEXT SIZE=25><BR>
Last Name: <INPUT NAME="last" TYPE=TEXT SIZE=25><BR>
E-mail: <INPUT NAME="email" TYPE=TEXT SIZE=30><BR>
<INPUT TYPE=SUBMIT VALUE="Test it">
</FORM>
</BODY></HTML>
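A minimal sketch of what a script such as /cgi-bin/test2.cgi might do with the fields posted by the sample form. The slides do not show test2.cgi itself, so the helper names (url_decode, parse_form, html_escape, render_page) and the page layout are assumptions made for illustration. It uses only core Perl; in practice the CGI module from CPAN is the more usual way to read form data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode one URL-encoded value ("+" means space, %XX is a hex-coded byte).
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Split "first=Ada&last=Lovelace" into a hash of decoded field values.
sub parse_form {
    my ($body) = @_;
    my %field;
    for my $pair (split /&/, $body) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $field{ url_decode($name) } = url_decode($value // '');
    }
    return %field;
}

# Escape characters that are special in HTML before echoing user input.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Build the full CGI response: header line, blank line, then the HTML page.
sub render_page {
    my (%field) = @_;
    my $page = "Content-type: text/html\n\n";
    $page .= "<HTML><HEAD><TITLE>Form results</TITLE></HEAD><BODY>\n";
    for my $name (qw(first last email)) {    # field names from the sample form
        $page .= "$name: " . html_escape($field{$name} // '') . "<BR>\n";
    }
    $page .= "</BODY></HTML>\n";
    return $page;
}

# For a POST, the web server passes the form data on STDIN and its
# length in the CONTENT_LENGTH environment variable.
if (($ENV{REQUEST_METHOD} // '') eq 'POST') {
    my $body = '';
    read(STDIN, $body, $ENV{CONTENT_LENGTH} // 0);
    print render_page(parse_form($body));
}
```

Run from the command line without a web server, the script prints nothing, because REQUEST_METHOD is not set; the subroutines can still be called directly to check the parsing.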
9. Getting Information from the Web (Screen Scraping)
- Different from making interactive web pages
- The idea is to access publicly available web forms and web programs (CGI scripts) so that you don't have to compile, install, or run the program on your own machine
- Perl is very good for this kind of web data mining, using the LWP module
10. Perl WWW Modules
- LWP stands for the Library for the WWW in Perl
- LWP allows the programmer to automate calls from a Perl script to a web page on the Internet
- LWP::UserAgent allows one to make a request to a web site and create a response object from that request
  - use LWP::UserAgent;
11. Perl WWW Modules
- To POST (send) requests to a web server we use another Perl module called HTTP::Request::Common
- The following is used to make a request to the NCBI ORF finder site:
  - my $response = $agent->request(POST "http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi", [SEQUENCE => $sequence, gcode => '1']);
12. Exercise 8/9 - orf_finder.pl
13. Prokaryotic Gene Structure
[Figure: an ORF (open reading frame) running from a start codon to a stop codon, with a TATA box marked, drawn on the sequence ATGACAGATTACAGATTACAGATTACAGGATAG with reading frames 1, 2 and 3 indicated]
14. Gene Finding in Prokaryotes
- Scan the forward strand until a start codon is found
- Staying in the same frame, scan in groups of three until a stop codon is found
- If the number of codons between start and end is greater than 17, identify it as a gene, then go to the last start codon and proceed with step 1
- If the number of codons between start and end is less than 18, go back to the last start codon and go to step 1
- At the end of the chromosome, repeat the process for the reverse complement
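The scan described on this slide can be sketched in a few lines of Perl. This is an illustration under stated assumptions, not the course's orf_finder.pl (which sends the sequence to a web server instead): ATG is taken as the only start codon, TAA/TAG/TGA as the stop codons, and after each stop the scan simply continues in the same frame. The names find_orfs, scan_strand and reverse_complement are made up for this example.

```perl
use strict;
use warnings;

my %IS_STOP = map { $_ => 1 } qw(TAA TAG TGA);

# Reverse-complement a DNA string: reverse it, then swap A<->T and C<->G.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse uc $dna;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# Scan one strand in all three frames. Positions are 1-based and are
# relative to the strand that was passed in.
sub scan_strand {
    my ($dna, $strand, $min_codons) = @_;
    my @orfs;
    for my $frame (0 .. 2) {
        my $start;    # index of the currently open start codon, if any
        for (my $i = $frame; $i + 3 <= length($dna); $i += 3) {
            my $codon = substr($dna, $i, 3);
            if (!defined $start) {
                $start = $i if $codon eq 'ATG';
            }
            elsif ($IS_STOP{$codon}) {
                # codons strictly between the start and the stop codon
                my $between = ($i - $start) / 3 - 1;
                push @orfs, {
                    strand => $strand,    frame => $frame + 1,
                    start  => $start + 1, end   => $i + 3,
                    codons => $between,
                } if $between >= $min_codons;
                undef $start;
            }
        }
    }
    return @orfs;
}

# Forward strand first, then the reverse complement, as on the slide.
# The default of 18 codons matches "greater than 17".
sub find_orfs {
    my ($dna, $min_codons) = @_;
    $min_codons //= 18;
    $dna = uc $dna;
    my @forward = scan_strand($dna, '+', $min_codons);
    my @reverse = scan_strand(reverse_complement($dna), '-', $min_codons);
    return (@forward, @reverse);
}

# The short sequence from the Prokaryotic Gene Structure slide, with the
# threshold lowered to 5 codons so that its single ORF is reported.
for my $orf (find_orfs('ATGACAGATTACAGATTACAGATTACAGGATAG', 5)) {
    printf "strand %s frame %d: %d-%d (%d codons)\n",
        @{$orf}{qw(strand frame start end codons)};
}
# prints: strand + frame 1: 1-33 (9 codons)
```

With the default threshold of 18 codons the same sequence yields no ORFs, since only 9 codons lie between its start and stop codons.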
15. ORF Finding Tools
- http://www.ncbi.nlm.nih.gov/gorf/gorf.html
- http://alfa.ist.utl.pt/pedromc/SMS/orf_find.html
- http://www.cbc.umn.edu/diogenes/diogenes.html
- http://www.nih.go.jp/jun/cgi-bin/frameplot.pl
16. Algorithm for Exercise 8
- Call up the LWP library
- Call up the POST request library
- Print request to screen
- Open and read the file provided by the user (notice the different approach to file reading used here; both are valid)
- Create a new user agent, or "virtual web browser", for the task of interest
- Post (send) the sequence and program parameter information to the desired web site
- Save the results as a character string
- Open a file
- Print the results to the file
- Close the file
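The steps of this algorithm might look roughly as follows in Perl. This is a sketch, not the actual orf_finder.pl from the exercise: the URL and the form fields SEQUENCE and gcode are the ones shown on the Perl WWW Modules slide and may no longer match the live NCBI service, while the subroutine names, the FASTA-header handling and the default output file name orf_results.html are assumptions. LWP::UserAgent and HTTP::Request::Common come from the libwww-perl distribution, which must be installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;                    # the LWP library
use HTTP::Request::Common qw(POST);    # the POST request library

# URL taken from the slides; it may have changed since they were written.
my $ORF_URL = 'http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi';

# Read a sequence file into one string, skipping FASTA header lines
# and removing all whitespace.
sub read_sequence {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!\n";
    my $sequence = '';
    while (my $line = <$in>) {
        next if $line =~ /^>/;
        $line =~ s/\s+//g;
        $sequence .= $line;
    }
    close $in;
    return $sequence;
}

# Build (but do not send) the POST request for the ORF finder form.
sub build_request {
    my ($sequence) = @_;
    return POST($ORF_URL, [ SEQUENCE => $sequence, gcode => '1' ]);
}

# Send the request and write the returned page to a file.
sub run {
    my ($in_path, $out_path) = @_;
    my $request = build_request(read_sequence($in_path));
    print $request->as_string, "\n";             # print request to screen

    my $agent    = LWP::UserAgent->new;          # the "virtual web browser"
    my $response = $agent->request($request);    # post the sequence
    die 'Request failed: ', $response->status_line, "\n"
        unless $response->is_success;

    my $results = $response->decoded_content;    # results as a string
    open my $out, '>', $out_path or die "Cannot write $out_path: $!\n";
    print {$out} $results;
    close $out;
}

# Contact the server only when a sequence file is named on the command line.
run($ARGV[0], $ARGV[1] // 'orf_results.html') if @ARGV;
```

Keeping the request construction in its own subroutine means the form fields can be inspected with `build_request(...)->as_string` without touching the network.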
17. Let's Give it a Try
- Go to Exercise 8
- Open a text editor
- Type in the program orf_finder.pl
- Don't type the comments, just read them
- Save the program
- Get back to the Unix prompt and type chmod +x orf_finder.pl
- Type orf_finder.pl and see what happens
18. File Input
- To run this program you will need to have the sequence file called sars.txt
- This file can be retrieved from http://gchelpdesk.ualberta.ca/sequences
- Click on the link sars.txt, copy the file and paste it into a text editor (your choice)
- Save the file as sars.txt
19. Exercise 10 - genscan.pl
20. Gene Finding in Eukaryotes
21. Eukaryotes
- Complex gene structure
- Large genomes (0.1 to 10 billion bp)
- Exons and introns (interrupted)
- Low coding density (<30%)
  - 3% in humans, 25% in Fugu, 60% in yeast
- Alternate splicing (40-60% of all genes)
- High abundance of repeat sequences (50% in humans) and pseudogenes
- Nested genes: overlapping on the same or opposite strand, or inside an intron
22. Eukaryotic Gene Structure
[Figure: upstream intergenic region, 5' UTR, start codon, transcribed region (exon 1, intron 1, exon 2, intron 2, exon 3), stop codon, 3' UTR, downstream intergenic region]
23. Eukaryotic Gene Structure
[Figure: exon 1, intron 1, exon 2, intron 2, with the 5' site, branchpoint site and 3' site marked; sequence labels AG/GT and CAG/NT]
24. RNA Splicing
25. Exon/Intron Structure (Detail)
ATGCTGTTAGGTGG...GCAGATCGATTGAC
Exon 1 Intron 1 Exon 2
SPLICE
ATGCTGTTAGATCGATTGAC
26. HMM for Gene Finding
27. Gene Prediction Methods and Websites
- GRAIL (http://compbio.ornl.gov/Grail-1.3/)
- FGENEH (http://genomic.sanger.ac.uk/gf/gf.shtml)
- HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)
- GENSCAN (http://genes.mit.edu/GENSCAN.html)
- Gene Parser (http://beagle.colorado.edu/eesnyder/GeneParser.html)
- GRPL (GeneTool/BioTools)
28. Genscan
29. How Does it Work?
- GENSCAN
  - 5th order Hidden Markov Model
  - Hexamer composition statistics of exons vs. introns
  - Exon/intron length distributions
  - Scan of promoter and polyA signals
  - Weight matrices of 5' splice signals and start codon region (12 bp)
  - Uses dynamic programming to optimize the gene model using the above data
30. How Well Do They Do?
"Evaluation of gene finding programs", S. Rogic, A. K. Mackworth and B. F. F. Ouellette. Genome Research 11: 817-832 (2001).
31. Gene Prediction (Evaluation)
[Figure: actual vs. predicted regions, with stretches labelled TP, FP, TN and FN]
- Sensitivity: measure of the % of false negative results (Sn = 0.996 means 0.4% false negatives)
- Specificity: measure of the % of false positive results
- Correlation: combined measure of sensitivity and specificity
32. Gene Prediction (Evaluation)
[Figure: actual vs. predicted regions, with stretches labelled TP, FP, TN and FN]
- Sensitivity: Sn = TP / (TP + FN)
- Specificity: Sp = TN / (TN + FP)
- Correlation: CC = (TP·TN - FP·FN) / [(TP + FP)(TN + FN)(TP + FN)(TN + FP)]^0.5
- This is a better way of evaluating
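As a worked example, the three measures on this slide can be computed from the four counts as below. The subroutine name evaluate_prediction and the counts are made up for illustration; the formulas are the ones given on the slide, with undef returned for any measure whose denominator is zero.

```perl
use strict;
use warnings;

# Sensitivity, specificity and correlation coefficient from the counts
# of true/false positives and negatives.
sub evaluate_prediction {
    my (%n) = @_;    # expects the keys TP, FP, TN and FN
    my ($tp, $fp, $tn, $fn) = @n{qw(TP FP TN FN)};

    my $sn = ($tp + $fn) ? $tp / ($tp + $fn) : undef;    # Sn = TP/(TP+FN)
    my $sp = ($tn + $fp) ? $tn / ($tn + $fp) : undef;    # Sp = TN/(TN+FP)

    my $product = ($tp + $fp) * ($tn + $fn) * ($tp + $fn) * ($tn + $fp);
    my $cc = $product
        ? ($tp * $tn - $fp * $fn) / sqrt($product)
        : undef;

    return (Sn => $sn, Sp => $sp, CC => $cc);
}

# Made-up counts, for illustration only
my %score = evaluate_prediction(TP => 40, FP => 5, TN => 45, FN => 10);
printf "Sn = %.3f  Sp = %.3f  CC = %.3f\n", @score{qw(Sn Sp CC)};
# prints: Sn = 0.800  Sp = 0.900  CC = 0.704
```

A perfect prediction (FP = FN = 0) gives Sn = Sp = CC = 1.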
33. Gene Prediction Accuracy at the Exon Level
[Figure: actual vs. predicted exons, showing a wrong exon, a correct exon and a missing exon, with the exon-level sensitivity Sn]
34. Algorithm for Exercise 10
- Call up the LWP library
- Call up the POST request library
- Print request to screen
- Open and read the file provided by the user
- Create a new user agent, or "virtual web browser", for the task of interest
- Post (send) the sequence and program parameter information to the desired web site
- Save the results as a character string
- Open a file
- Print the results to the file
- Close the file
35. Let's Give it a Try
- Go to Exercise 10
- Open a text editor
- Type in the program genscan.pl
- Don't type the comments, just read them
- Save the program
- Get back to the Unix prompt and type chmod +x genscan.pl
- Type genscan.pl and see what happens
36. File Input
- To run this program you will need to have the sequence file called genomic.txt
- This file can be retrieved from http://gchelpdesk.ualberta.ca/sequences
- Click on the link genomic.txt, copy the file and paste it into a text editor (your choice)
- Save the file as genomic.txt
37. Exercise 11 - blast.pl
Select Database
38. Annotation by Homology: An Example
- 76 residue protein from Methanobacterium thermoautotrophicum (newly sequenced)
- What does it do?
- MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIMGRVASKEEIKKILS
39. Running NCBI BLAST
Select Database
40. BLAST Output
41. BLAST Output
42. PSI-BLAST
43. PSI-BLAST
44. PSI-BLAST
45. Conclusions
- Protein is a thioredoxin or glutaredoxin (function, family)
- Protein has a thioredoxin fold (2° and 3D structure)
- Active site is from residues 11-14 (active site location)
- Protein is soluble, cytoplasmic (cellular location)
46. Algorithm for Exercise 11
- Call up the LWP library
- Call up the POST request library
- Print request to screen
- Open and read the file provided by the user
- Create a new user agent, or "virtual web browser", for the task of interest
- Post (send) the sequence and program parameter information to the desired web site
- Save the results as a character string
- Open a file
- Print the results to the file
- Close the file
47. Let's Give it a Try
- Go to Exercise 11
- Open a text editor
- Type in the program blast.pl
- Don't type the comments, just read them
- Save the program
- Get back to the Unix prompt and type chmod +x blast.pl
- Type blast.pl and see what happens
48. File Input
- To run this program you will need to have the sequence file called proteins.txt
- This file can be retrieved from http://gchelpdesk.ualberta.ca/sequences
- Click on the link proteins.txt, copy the file and paste it into a text editor (your choice)
- Save the file as proteins.txt
49. Exercise 12 - MW.pl
50. Molecular Weight
- Useful for SDS PAGE and 2D gel analysis
- Useful for deciding on SEC matrix
- Useful for deciding on MWC for dialysis
- Essential in synthetic peptide analysis
- Essential in peptide sequencing (classical or mass-spectrometry based)
- Essential in proteomics and high throughput protein characterization
51. Molecular Weight
- Crude MW calculation: MW = 110 × Numres
- Exact MW calculation: MW = Σ AAi × MWi
- Remember to subtract water (18.01 amu)
- Note isotopic weights
- Corrections for CHO, PO4, Acetyl, CONH2
52. Amino Acid Residue Masses
Residue          Monoisotopic Mass
Glycine          57.02147
Alanine          71.03712
Serine           87.03203
Proline          97.05277
Valine           99.06842
Threonine        101.04768
Cysteine         103.00919
Isoleucine       113.08407
Leucine          113.08407
Asparagine       114.04293
Aspartic acid    115.02695
Glutamine        128.05858
Lysine           128.09497
Glutamic acid    129.0426
Methionine       131.04049
Histidine        137.05891
Phenylalanine    147.06842
Arginine         156.10112
Tyrosine         163.06333
Tryptophan       186.07932
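A short sketch of both molecular weight calculations, using the residue masses listed on this slide. It is not the course's MW.pl; the names crude_mw and monoisotopic_mw, the one-letter keys and the test peptide are assumptions. One point worth hedging: residue masses already exclude the water lost at each peptide bond, so this sketch adds a single water (18.01056, monoisotopic) for the free chain termini. Subtracting water, as the previous slide reminds, would apply when summing the masses of free amino acids instead. No corrections for modifications are attempted.

```perl
use strict;
use warnings;

# Monoisotopic residue masses from the slide, keyed by one-letter code
my %RESIDUE_MASS = (
    G => 57.02147,  A => 71.03712,  S => 87.03203,  P => 97.05277,
    V => 99.06842,  T => 101.04768, C => 103.00919, I => 113.08407,
    L => 113.08407, N => 114.04293, D => 115.02695, Q => 128.05858,
    K => 128.09497, E => 129.0426,  M => 131.04049, H => 137.05891,
    F => 147.06842, R => 156.10112, Y => 163.06333, W => 186.07932,
);
my $WATER = 18.01056;    # monoisotopic H2O, added once per chain

# Crude estimate: 110 per residue.
sub crude_mw {
    my ($protein) = @_;
    return 110 * length $protein;
}

# Sum of residue masses plus one water; dies on an unknown letter.
sub monoisotopic_mw {
    my ($protein) = @_;
    my $total = $WATER;
    for my $aa (split //, uc $protein) {
        die "Unknown residue '$aa'\n" unless exists $RESIDUE_MASS{$aa};
        $total += $RESIDUE_MASS{$aa};
    }
    return $total;
}

my $peptide = 'GAS';    # made-up tripeptide (Gly-Ala-Ser)
printf "%s: crude %d, monoisotopic %.5f\n",
    $peptide, crude_mw($peptide), monoisotopic_mw($peptide);
# prints: GAS: crude 330, monoisotopic 233.10118
```

For very short peptides the crude estimate is far off, as the example shows; it is meant for whole proteins, where residue composition averages out.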
53. Protein Identification via MW
- MOWSE
  - http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
- PeptideSearch
  - http://www.mann.embl-heidelberg.de/Services/PeptideSearch
- Mascot
  - www.matrixscience.com
- AACompSim/AACompIdent
  - http://www.expasy.ch/tools
54. Molecular Weight and Proteomics
[Figure: 2-D gel; QTOF mass spectrometry]
55. Let's Give it a Try
- Go to Exercise 12
- Open a text editor
- Type in the program MW.pl
- Don't type the comments, just read them
- Save the program
- Get back to the Unix prompt and type chmod +x MW.pl
- Type MW.pl and see what happens
56. File Input
- To run this program you will need to have the sequence file called proteins.txt
- This file can be retrieved from http://gchelpdesk.ualberta.ca/sequences
- Click on the link proteins.txt, copy the file and paste it into a text editor (your choice)
- Save the file as proteins.txt
57. EST Annotation with Perl
- It is now possible to combine many of the tools you've learned about yesterday and today to build a sophisticated piece of software for doing automated EST annotation
- See Exercise X for the source code to perform this kind of EST annotation
- See if you can modify it to suit your special interests
58. More Help? Check out the Canadian Bioinformatics Help Desk
- A Service for Genome Canada Researchers
- www.gchelpdesk.ualberta.ca