Title: Keith Satterley, Bioinformatics Division, WEHI
1Bioinformatics Seminar 13/11/07
- Keith Satterley, Bioinformatics Division, WEHI
2Summary
- GABOS: Get A Bit Of Sequence.
- GAFEP: Get A Few Exon Primers.
- Functions and Facilities
- WEB interface.
- Command Line Interface.
- Data Management
- Genome data.
- Result data.
- Tools Used
- Perl
- HTML
- PHP
- Javascript
- Availability.
- Future Work.
3- GABOS version 1 is at http://unix28.alpha.wehi.edu.au/bioinformatics/gabos
- WEB Page version 1 limitations
- Exons, DNA, Transcripts available.
- Genomes are a hard coded list of latest version data only.
- Annotation File is a hard coded list covering all genomes.
- Chromosome selection was a list of the common chromosome filenames.
- Data Files Availability
- All data has been downloaded from UCSC's download site. It is described at
- http://hgdownload.cse.ucsc.edu/downloads.html and can be ftp downloaded from
- ftp://hgdownload.cse.ucsc.edu/goldenPath/
- Genome data is stored on the WEHI Disk Server, accessible from
- WEHI Unix computers
- /home/users/lab0605/Bioinformatics/databases/genomes/UCSC
- WEHI Windows computers map a network drive to
- \\unix33\bioinformatics
- WEHI Macintoshes Connect to Server at
- smb://unix33/Bioinformatics
4- Genomes at WEHI
- Jul 24 01:05 canFam -> canFam2
- Jul 23 15:16 canFam1
- Jul 23 15:16 canFam2
- Jul 22 01:17 danRer -> danRer4
- Jul 23 10:33 danRer3
- Jul 23 15:20 danRer4
- Nov 6 01:10 dm -> dm3
- Nov 5 16:37 dm3
- Jul 22 01:17 galGal -> galGal3
- Jul 20 17:27 galGal2
- Jul 23 10:11 galGal3
- Jul 22 01:17 hg -> hg18
- Jul 23 10:29 hg17
- Jul 23 10:29 hg18
- Aug 24 01:10 mm -> mm9
- Jul 23 10:30 mm7
- Aug 23 14:50 mm8
- Aug 23 18:12 mm9
5- Chromosome data Files
- Aug 23 14:09 chr9_random.fa
- Aug 23 14:09 chrM.fa
- Aug 23 14:09 chrUn_random.fa
- Aug 23 14:14 chrX.fa
- Aug 23 14:14 chrX_random.fa
- Aug 23 14:14 chrY.fa
- Aug 23 14:16 chrY_random.fa
- Jul 23 16:11 chr9.fa
- Jul 23 16:11 chrM.fa
- Jul 23 16:13 chrNA_random.fa
- Jul 23 16:14 chrUn_random.fa
- Jul 23 16:14 md5sum.txt
- Jul 23 16:14 README.txt
- Jul 23 16:16 scaffoldNA_random.fa
- Jul 23 16:16 scaffoldUn_random.fa
- Jun 22 04:05 chr2L.fa
6- Data Management
- Amount of data
- How many genomes local? Currently 10 (96 GB).
- 19 Vertebrates available, 9 sequence only.
- 15 Insects, 5 Nematodes, 4 others available.
- How many versions of each? mm7, mm8, mm9?
- 2 or 3 of each?
- Chromosome data: 10-50 per genome.
- Annotation data: 5-10 per genome version
- RefSeq, genscan, mgc, xenoRef, uniGene, refFlat, ESTs, mRNAs
- Up to date data!
- Tool currently being written to nightly check UCSC
- Download, unpack and sort annotation files.
7- GABOS Sequence Retrieval Features
- Specify Search Criteria as either
- Gene Name List
- as in Annotation Files
- NM_001037759, NM_145692, NM_027033, NM_013715 as in RefSeq.txt
- Sgk3, 4930418G15Rik, Cops5, Sulf1 as in RefFlat.txt
- Chromosome Sequence Range specification.
- Chr10:13,500,000 - 14,550,000
- This will select all genes in this region that are defined in the annotation file(s) specified.
- Exons (incl. EST exons), Transcripts of Genes or straight DNA sequence can be retrieved.
- Specify either strand or both strands.
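The region-based selection described above can be sketched as follows. This is only an illustration, not GABOS's own code: the region string format, the function names and the sample gene names are hypothetical, and the rule "any overlap with the region counts" is an assumption. The column order (geneName, name, chrom, strand, txStart, txEnd, ...) and the 0-based, half-open coordinates follow UCSC's refFlat table.

```python
from typing import Iterable, List, Tuple


def parse_region(spec: str) -> Tuple[str, int, int]:
    """Parse a string such as 'Chr10:13,500,000 - 14,550,000' (assumed format)."""
    chrom, _, span = spec.replace(" ", "").partition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if not chrom or start > end:
        raise ValueError(f"bad region: {spec!r}")
    return chrom, start, end


def genes_in_region(ref_flat_lines: Iterable[str], spec: str) -> List[str]:
    """Return gene names whose transcript overlaps the 1-based, inclusive region."""
    chrom, start, end = parse_region(spec)
    hits: List[str] = []
    for line in ref_flat_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip malformed or truncated lines
        gene, _accession, tx_chrom, _strand, tx_start, tx_end = fields[:6]
        if tx_chrom.lower() != chrom.lower():
            continue
        # refFlat txStart is 0-based and txEnd is exclusive, so the transcript
        # covers 1-based positions txStart+1 .. txEnd.
        if int(tx_start) < end and int(tx_end) >= start and gene not in hits:
            hits.append(gene)
    return hits
```

A gene with several transcripts is reported once. Whether GABOS requires genes to lie wholly inside the region is not stated on the slide.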
8- Extra Sequence Parameters
- Range of bases in data object (e.g. bps in an Exon)
- 1-e: all, base 1 to the end base (the default)
- 1-10: bases 1 to 10
- 10-e: base 10 to end base in object.
- Range of objects requested (e.g. a range of Exons)
- 1-e: all exons (the default)
- 1-3: exons 1 to 3.
- 1: first exon only
- e: last exon only
- Possible Extensions
- (e-3)-e: last three objects (or bases)
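The range notation above (1-based, inclusive, with "e" standing for the end) could be interpreted along these lines. This is a minimal sketch under that reading; the function name is hypothetical and the "(e-3)-e" form listed under Possible Extensions is not handled.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_range(items: Sequence[T], spec: str) -> List[T]:
    """Apply a spec such as '1-e', '1-3', '10-e', '1' or 'e' to a sequence."""

    def position(token: str) -> int:
        token = token.strip().lower()
        if token == "e":
            return len(items)  # 'e' means the last element
        value = int(token)
        if value < 1:
            raise ValueError(f"positions are 1-based, got {value}")
        return value

    first_text, dash, last_text = spec.partition("-")
    first = position(first_text)
    # A spec with no dash ('1' or 'e') selects a single element.
    last = position(last_text) if dash else first
    if first > last:
        raise ValueError(f"range runs backwards: {spec!r}")
    return list(items[first - 1:last])
```

The same function serves both levels: pass a list of exons to pick objects, or a sequence string to pick bases (join the result to get a string back).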
9- GABOS Extras
- Specify the line length of the FASTA output file.
- Output Sequence Lines ONLY.
- Output Fasta Description Lines ONLY.
- Concatenate ALL Sequences.
- Concatenate ONLY Sequence from a DNA object (each gene's exons concatenated, for example).
- String of characters to be inserted BEFORE each DNA object.
- String of characters to be inserted AFTER each DNA object.
- Specify flanking bases.
- Show co-ordinates relative to Chromosome, Exon, Transcript
- Uses either RefSeq or Browser gene names in refFlat.txt
- GAFEP (Get a Few Exon Primers)
- Use output of GABOS to find primers around each exon.
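Two of the options above, flanking bases and FASTA line length, are easy to picture in code. The sketch below is illustrative only; the function names, the default line length of 60 and the clipping at chromosome ends are assumptions, not GABOS behaviour.

```python
def extract_with_flanks(chrom_seq: str, start: int, end: int,
                        pre: int = 0, post: int = 0) -> str:
    """Return bases start..end (1-based, inclusive) plus `pre` bases before
    and `post` bases after, clipped to the ends of the chromosome."""
    lo = max(start - 1 - pre, 0)
    hi = min(end + post, len(chrom_seq))
    return chrom_seq[lo:hi]


def format_fasta(description: str, sequence: str, line_length: int = 60) -> str:
    """Format one FASTA record, wrapping the sequence at `line_length`."""
    if line_length < 1:
        raise ValueError("line_length must be positive")
    lines = [">" + description]
    for i in range(0, len(sequence), line_length):
        lines.append(sequence[i:i + line_length])
    return "\n".join(lines) + "\n"
```

For example, `format_fasta("exon1", extract_with_flanks(seq, 5, 8, pre=2, post=2), 3)` would write the exon with two flanking bases on each side, three bases per line.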
10- GABOS Command Line Version (CLI).
- Same code. Program detects environment and adjusts accordingly.
- CLI use of GABOS caters for programmatic use of the tool as part of other tasks.
- E.g. collecting 5000 bases before a transcript and 5000 into the transcript, to be used for promoter/regulation searching for thousands of genes.
CLI e.g.:
gabos -afile refFlat.txt -genome mm9 -seqrange 4,482,560-4,483,185 -chr 1 -pre 420 -post 420 -fastaonly > my_results.fa
Options can be in any order. Output can be redirected to a file as shown. A file of gene names could be used as input instead of a chromosome sequence range. gabos -help lists all options.
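As an illustration of the programmatic use mentioned above, a wrapper script might compute a promoter window for each transcript and build one gabos command per gene. This is a sketch only: the helper names are hypothetical, the window arithmetic is one possible reading of "5000 bases before and 5000 into the transcript", and only the -afile, -genome, -chr and -seqrange options shown in the example above are used.

```python
from typing import List, Tuple


def promoter_window(tx_start: int, tx_end: int, strand: str,
                    before: int = 5000, into: int = 5000) -> Tuple[int, int]:
    """Return a 1-based, inclusive window around the transcript start.
    On the minus strand the transcript starts at tx_end."""
    if strand == "+":
        lo, hi = tx_start - before, tx_start + into - 1
    elif strand == "-":
        lo, hi = tx_end - into + 1, tx_end + before
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return max(lo, 1), hi


def gabos_command(genome: str, chrom: str, start: int, end: int,
                  afile: str = "refFlat.txt") -> List[str]:
    """Build an argument list in the style of the CLI example (not executed here)."""
    return ["gabos", "-afile", afile, "-genome", genome,
            "-chr", chrom, "-seqrange", f"{start:,}-{end:,}"]
```

The returned list can be handed to `subprocess.run` with `stdout` pointed at a results file, which mirrors the shell redirection in the example.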
11- CLI additional abilities.
- Gene lists read from a file or piped in.
- Debugging options available.
- Specification of alternate locations for (enables use of program at other sites without modification):
- Annotation files.
- Genome data files.
- Checks if data files are latest version and updates if not (to be replaced with upgraded procedure).
12GABOS Command Line options
All GAFEP programs can also be run at the command line. In particular Combine_overlapping_exons, Create_primers1, Create_primers2, Makep3i, P3out2tab.
- -addend=s
- -addstart=s
- -dna=s
- -basedir=s
- -genome=s
- -afile=s
- -adir=s
- -gdir=s
- -check!
- -name=s
- -namep=s
- -namef=s
- -chr=s
- -seqrange=s
- -strand=s
- -dataobject=s
- -objectrange=s
- -baserange
- -seqonly
- -fastaonly
- -linelength=i
- -relative=s
- -pre=i
- -post=i
- -v!
- -debug1=i
- -debug2=i
- -debug3=i
- -debug4=i
- -debug5=i
- -debug6=i
- -debugall=i
- -h|help|?
- -version
13- Demo of GABOS version 2.
- http://unix28.alpha.wehi.edu.au/bioinformatics/gabos/testing_index.php
- Improvements
- Automatically reads genomes available
- Automatically shows chromosome data for genome selected.
- Automatically shows Annotation data files for genome selected.
- Includes ability to read EST data files.
- Uses alternate gene name in refFlat.txt.
- Faster processing of large data files using/making presorted versions.
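One way presorting can speed up lookups in large annotation files is binary search. The slides do not say how GABOS uses its sorted files, so the following is only a sketch of the general idea, with hypothetical record layout and function names.

```python
import bisect
from typing import List, Tuple

Record = Tuple[str, int, int, str]  # (chrom, tx_start, tx_end, name)


def build_index(records: List[Record]) -> List[Record]:
    """Sort once by chromosome, then start position."""
    return sorted(records)


def names_starting_in(sorted_records: List[Record], chrom: str,
                      start: int, end: int) -> List[str]:
    """Find records on `chrom` whose start lies in start..end, without
    scanning the whole list."""
    lo = bisect.bisect_left(sorted_records, (chrom, start))
    hi = bisect.bisect_right(sorted_records, (chrom, end, float("inf")))
    return [record[3] for record in sorted_records[lo:hi]]
```

Each lookup then costs a logarithmic number of comparisons rather than a full pass over the file. Records that start before the window but extend into it would need an extra check.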
14- GAFEP: Get A Few Exon Primers.
- This is a suite of programs.
- Combines overlapping exons into one CExon.
- Displays Primer3 options and collects choices.
- Creates input files for Primer3 in the required format.
- Runs Primer3, displays output on the web page and reformats the output suitable for pasting into Excel.
- The same code runs from the web interface or from a Command Line Interface.
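Primer3 reads its input as "TAG=value" lines, with each record closed by a line holding a single "=". The sketch below writes one such record for a target exon. It is not the GAFEP code: the function name and default product size range are hypothetical, and the tag names are those of the current Primer3 2.x releases, which differ from the names used by older releases.

```python
def primer3_record(seq_id: str, template: str, target_start: int,
                   target_length: int,
                   product_size_range: str = "100-300") -> str:
    """Build one Primer3 input record asking for primers around a target."""
    tags = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        # Target given as <start>,<length>; positions are 1-based because
        # PRIMER_FIRST_BASE_INDEX is set to 1 below.
        ("SEQUENCE_TARGET", f"{target_start},{target_length}"),
        ("PRIMER_FIRST_BASE_INDEX", "1"),
        ("PRIMER_PRODUCT_SIZE_RANGE", product_size_range),
    ]
    body = "".join(f"{tag}={value}\n" for tag, value in tags)
    return body + "=\n"  # a lone '=' terminates the record
```

Several records can be concatenated into one input file, one per exon, which suits the exon-by-exon primer search described above.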
15Combining Exons to reduce number of primers needed.
[Figure: overlapping exons combined into 1 CExon]
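Combining overlapping exons is an interval-merging problem. A minimal sketch, assuming exons are given as 1-based, inclusive (start, end) pairs on one chromosome and that only exons sharing at least one base are combined (the function name is hypothetical and this is not the Combine_overlapping_exons source):

```python
from typing import Iterable, List, Tuple


def combine_overlapping_exons(exons: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping (start, end) exons into combined exons (CExons)."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous CExon: extend it if this exon reaches further.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Exons from alternative transcripts of the same gene often overlap, so merging them first means one primer pair can cover what would otherwise need several.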
19GAFEP Output
21- An example application
- Ben Kile's lab is using GABOS/GAFEP to create primers to search for sequence variations caused by ENU mutations in mice.
22Random chemical mutagenesis in the mouse
N-ethyl-N-nitrosourea (ENU)
- Alkylating agent
- Point mutagen
- Efficiently mutates mouse spermatogonial stem cells
- Male mice treated with ENU produce offspring heterozygous for ENU-induced mutations at the rate of 1 mutation per 1.5 megabases
23Phenotyping screen measuring platelet number
[Figure: blood test of mutant offspring; platelet counts, axis "Platelet count x10^3/uL"]
24Mapping strategy for dominant mutations
[Figure: breeding scheme. Labels: Affected, Wild-type, C57BL/6, Balb/c, m (mutation); 1st Outcross; F1 Generation; 2nd Outcross; F2 Generation (Affected, Unaffected)]
25Mapping strategy for dominant mutations
- Genome-wide scan with 80-100 microsatellites
- 20 affected and 20 unaffected animals
- Result: mutation assigned to a chromosome
- 2. Fine mapping
- 200-1,000 informative meioses, genotyped with SSLPs at increasing density
- Result: candidate interval refined to 1-3 Mb
- Issues
- Recombination cold spots
- Polymorphism deserts
[Figure: SNP density map of mouse chromosome 1 (C57BL/6 v 129Sv)]
26Candidate intervals
[Figure: two candidate intervals, labelled "Heaven" and "Hell": Chromosome 2 20-21 Mb; Chromosome 11 70-71 Mb]
27Candidate gene sequencing
- Prioritize candidates for sequencing on the basis of
- Known function
- Homology to other genes of known function
- Tissue expression pattern
- Domain structure
- Exhaustive literature searches...
28Candidate gene sequencing
1. Automated PCR primer design
Robotic liquid handling
2. Genomic PCR
In-well template clean-up
3. Direct amplicon sequencing
4. Capillary electrophoresis
5. Sequence analysis
29- Tools used to develop GABOS/GAFEP
- Perl programming language for all programs.
- Web interface
- HTML coding
- PHP inserted into HTML and processed by the webserver before the resulting HTML is sent to the client.
- Javascript processed by the client's web browser (Mozilla Firefox or Safari for example)
30WEHI Computing Layout
[Diagram]
- Unix Server unix28: Webserver apache; php processed here; html produced here.
- Unix28 disk: GABOS/GAFEP.
- unix33: Genome DATA, reached via nfs; data fetched from UCSC via ftp.
- Client (Mac, Windows): Browser (Firefox, IE), connected via wan/lan; html processed here; Javascript acts here in response to user; display of GABOS/GAFEP here.
31- Web Interface Debugging tools
- Firefox Error Console
- Firebug add-in for Firefox
32- Future Work
- Short term
- Finalize GABOS version 2
- Transcript, DNA working
- Complete data download maintenance program
- Automate sorting of annotation files and modify GABOS to be aware of sorted/non-sorted data and act accordingly.
- Include ability to retrieve RNA data
- Will run on any unix server, not just unix28.
- Web Interface available on WEHI's public server.
- Source code will be made freely available.
- Longer Term
- Retrieve data for UTRs, others?
- Provide web interface access to annotation files.
- Remove need for BioPerl to be installed.
33- Acknowledgements
- Bioinformatics Division
- Terry Speed and Gordon Smyth for the opportunity to pursue this project in an excellent environment.
- All others in Bioinformatics for many and varied help.
- WEHI ITS
- Nick Tan, Jakub Szarlat for Unix help.
- Dung Tran, Scott Wood for network help.
- Tri Le and John Nguyen for MS Windows support.
- Tony Kyne and others in ITS for many questions answered.
- Molecular Medicine
- Doug Hilton, Ben Kile for explaining their needs.
- Users for their feedback.
- Kylie Greig, Adrienne Hilton, Greg Hather, Carolyn de Graaf