Title: Supported by the NSF Plant Genome Research and REU Programs
1Tutorial of bioinformatics and tree generation at
the Cell Wall Genomics website
Bryan Penning
Supported by the NSF Plant Genome Research and
REU Programs
2Bioinformatics Goals
- We currently have a wealth of Arabidopsis thaliana cell wall gene information on the website. We wanted to:
  - Add family information about rice and maize Type II cell walls to compare with A. thaliana Type I cell walls
  - Add links to outside information on rice genes, as we have for A. thaliana
  - Include annotated composite trees of A. thaliana, rice and maize gene families
  - Add links to the sites used to generate the data
  - Add the source protein sequences used for our family trees so other researchers can make their own trees, adding their genes of interest
  - Generate a tutorial on how researchers can make use of the bioinformatics data on our site
3Diagram of our bioinformatics approach
[Flowchart boxes: Blast TIGR; Choose genes; Make tree. Decision branches: Too few genes, Blast other sites (N); Too many genes, tighten criteria (N); otherwise Y.]
Diagram of the process used to find the genes and draw family trees for cell wall related rice genes. The same approach is used for maize.
4Diagram of our bioinformatics approach
[Flowchart boxes: A. thaliana genes; Rice genes; Maize genes; Draw tree with all family members; Annotate; Publish to web.]
Diagram of the process used to integrate cell wall related genes from all three family trees into a composite tree.
5BLASTing genes
- To be considerate of the bioinformatics community, given the number of BLASTs to be performed, and to speed the process, we downloaded the text (flat) file of the TIGR rice protein sequences (available at http://www.tigr.org/tdb/e2k1/osa1/data_download.shtml) and performed local BLASTs using blastall from NCBI (available at http://www.ncbi.nlm.nih.gov/BLAST/download.shtml)
- Directions for using these tools are available at the above sites and are beyond the scope of this tutorial
- For a small number of BLASTs, you can use web-based methods and common programs such as Word and Excel, plus any of a number of downloadable tree-drawing programs, to make these kinds of trees on your own if you are not familiar with programming languages such as Perl that can automate the process. Although web searches can be more time-consuming, they work just as well for a few sequences
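The local BLAST step described above can be sketched as a pair of command lines assembled in Python. This is only an illustration, not the script used for the website: the file names are hypothetical, the E-value cutoff is an assumed placeholder, and the flags are those of the legacy NCBI `formatdb` and `blastall` programs (newer BLAST+ releases use different program names and options).

```python
def formatdb_command(protein_fasta):
    """Build the legacy formatdb command that indexes a protein FASTA file."""
    # -p T tells formatdb that the input is protein sequence.
    return ["formatdb", "-i", protein_fasta, "-p", "T"]


def blastall_command(query_fasta, database, output_path, evalue=1e-10):
    """Build a legacy blastall command for a protein-vs-protein search.

    The evalue default is an assumed placeholder, not a value from the tutorial.
    """
    return [
        "blastall",
        "-p", "blastp",      # protein query against protein database
        "-d", database,      # database previously indexed with formatdb
        "-i", query_fasta,   # one or more query sequences in FASTA format
        "-o", output_path,   # where the hits are written
        "-e", str(evalue),   # expectation value cutoff
        "-m", "8",           # tab-delimited hit table
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the commands rather than running them.
    print(" ".join(formatdb_command("rice_proteins.fasta")))
    print(" ".join(blastall_command("arabidopsis_family.fasta",
                                    "rice_proteins.fasta",
                                    "family_hits.tsv")))
```

Printing the commands first makes it easy to check them before passing the same lists to `subprocess.run` on a machine where the legacy tools are installed.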
6Web BLASTing
- For smaller numbers of BLASTs to the rice genome, TIGR provides an excellent Web BLAST at http://tigrblast.tigr.org/euk-blast/index.cgi?project=osa1
- You can also use the new BLAST tool at Gramene (http://www.gramene.org/multi/blastview) for most cereal sequences
- Note: gene model versions sometimes differ between Gramene and TIGR, as one site may update to the latest model before the other
7Web BLASTing
- After downloading the protein sequence for Arabidopsis SUD1 (At3g46440) from TIGR, you can BLAST it against the TIGR Rice Pseudomolecules Protein database using BLASTp
8Web BLASTing
- You get a series of hits to the gene of interest
- A higher score and a smaller probability indicate a better match to the original gene
- This procedure is followed for all of the genes in a family to gather the best possible hits; the hits are then sorted to remove duplicates, and the best rice matches to the Arabidopsis families are chosen
- You can use NCBI's blastall tool for multiple simultaneous BLASTs, as we do for this step
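The "gather hits, remove duplicates, keep the best matches" step can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the procedure used for the website: it assumes a tab-delimited hit table with the query in column 1, the subject in column 2, the E-value in column 11 and the bit score in column 12, and the cutoff value is a placeholder.

```python
def best_hits(tabular_text, max_evalue=1e-10):
    """Keep one entry per subject gene: the hit with the highest bit score.

    Returns a list of (subject, (query, bit_score, evalue)) sorted from the
    best score to the worst. max_evalue is an assumed placeholder cutoff.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bit_score = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # a larger probability means a weaker match
        if subject not in best or bit_score > best[subject][1]:
            best[subject] = (query, bit_score, evalue)
    return sorted(best.items(), key=lambda item: -item[1][1])
```

A rice gene hit by several Arabidopsis family members then appears once, listed with the query that matched it best.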
9Organizing BLASTs
- This is a Word document generated by BLASTing SUD1 and SUD2 of Arabidopsis against the TIGR Rice Protein database
- The hits were copied into Word, set to the font Courier New, 9 pt, and saved as a text-only document (to remove the HTML code)
- The file was reloaded in Word and converted to a table (Table menu) using Other and the character | (Shift + \) to separate the columns
10Organizing BLASTs
- The Word file is copied into Excel and the Data Sort menu is used to sort by the first column
- This brings all of the same-named genes together (the two highlighted lines, for example)
- Duplicate genes are removed from the spreadsheet, and only the tags in the far right column (LOC_Osxxgxxxxxx) are copied back to Word
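For readers comfortable with a little scripting, the Word and Excel steps above (sort, remove duplicates, keep only the locus tags) can be approximated directly on the pasted hit text. This is a minimal sketch: the regular expression is an assumption about what the LOC_Os tags look like (two digits, a `g`, then more digits) and may need adjusting for other naming schemes.

```python
import re

# Assumed shape of a TIGR rice locus tag, e.g. LOC_Os05g29990.
LOCUS_PATTERN = re.compile(r"LOC_Os\d{2}g\d+")


def unique_loci(hit_text):
    """Return the sorted, de-duplicated locus tags found in pasted BLAST hits."""
    return sorted(set(LOCUS_PATTERN.findall(hit_text)))


if __name__ == "__main__":
    # Made-up hit lines, used only to show the call.
    pasted = "hit LOC_Os05g29990 score 520\nhit LOC_Os05g29990 score 480\n"
    for locus in unique_loci(pasted):
        print(locus)
```

The result is the same one-gene-per-line list that the table-to-text conversion produces.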
11Organizing BLASTs
- You can use the Table menu to convert the table to text (Paragraph Marks) to generate a list of genes
- These genes can be retrieved from a downloaded database using the NCBI fastacmd tool (included in the BLAST download tools), or you can search for them one at a time using a web-based database such as the locus name search on TIGR (http://www.tigr.org/tdb/e2k1/osa1/LocusNameSearch.shtml)
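As an alternative to fastacmd or one-at-a-time web searches, the sequences for a list of genes can be pulled out of a downloaded FASTA file with a short script. This is a sketch under an assumption about the header layout: the locus name is taken to be the start of each header line, ending at the first `.`, `|` or space. Check a few headers in your own download before relying on it.

```python
import re


def read_fasta(fasta_text):
    """Parse FASTA text into a dict of {header line: sequence}."""
    parts = {}
    header = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def select_by_locus(fasta_text, wanted_loci):
    """Return only the records whose header starts with a wanted locus name."""
    wanted = set(wanted_loci)
    chosen = {}
    for header, sequence in read_fasta(fasta_text).items():
        # Assumed layout: locus name, then ".model", "|" or a space.
        locus = re.split(r"[.|\s]", header, maxsplit=1)[0]
        if locus in wanted:
            chosen[header] = sequence
    return chosen
```

Reading the whole file into memory is fine for a protein set of this size; a very large database would be better handled by an indexed tool such as fastacmd.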
12Generating a tree
- Once you have found all of your sequences, check that each sequence name starts with a > (denoting a new sequence name) and that the sequence starts on a new line
- Copy and paste all of your sequences into an alignment program like ClustalW (we use http://align.genome.jp/ from the Kyoto University Bioinformatics Center, but any ClustalW program will work)
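The format check described above is easy to automate before pasting into an alignment program. The sketch below is one possible check, not an official validator: it reports sequence text that appears before any `>` name line, names with no sequence under them, and repeated names (the duplicate-name check is an added assumption, since alignment programs generally expect unique names).

```python
def fasta_problems(fasta_text):
    """Return a list of human-readable problems found in FASTA text."""
    problems = []
    seen = set()
    current = None        # name of the record being read
    has_sequence = True   # whether the current record has any sequence lines
    for number, raw in enumerate(fasta_text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None and not has_sequence:
                problems.append(f"{current}: no sequence lines")
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            if not current:
                problems.append(f"line {number}: empty sequence name")
            elif current in seen:
                problems.append(f"{current}: duplicate name")
            seen.add(current)
            has_sequence = False
        elif current is None:
            problems.append(f"line {number}: sequence text before any '>' name line")
        else:
            has_sequence = True
    if current is not None and not has_sequence:
        problems.append(f"{current}: no sequence lines")
    return problems
```

An empty list means the text passed these particular checks; it does not guarantee that the sequences themselves are correct.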
13Generating a tree
- For our trees we use Slow/Accurate pair-wise comparisons and Gonnet for our Weight Matrix (two spots on the website)
- Click Execute Alignment to get your sequence alignment
- At the end of the alignment page will be the information needed for tree drawing programs
- You can click on clustal.dnd for a quick tree, or take the information after it (a Newick format tree) and copy it into a new Word file, saving it as a text file (include all parentheses)
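Because a missing parenthesis is the most common way to damage a copied Newick tree, a quick check after saving the text file can save time. The sketch below is a simple sanity check, not a full Newick parser: it confirms that parentheses balance and that the text ends with a semicolon, and it lists the leaf (sequence) names so you can confirm none were lost.

```python
import re


def newick_looks_complete(newick):
    """True if parentheses balance and the tree text ends with ';'."""
    depth = 0
    for char in newick:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False  # a ")" appeared before its "("
    return depth == 0 and newick.strip().endswith(";")


def leaf_names(newick):
    """Return leaf names: labels that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)
```

Comparing `len(leaf_names(tree_text))` with the number of sequences you pasted into ClustalW is a quick way to confirm the whole tree was copied.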
14Creating a tree
- We use the program TreeDyn to generate our trees (available at http://www.treedyn.org/)
- This is an example of the Arabidopsis and rice 1.1 family
- The tree text file was loaded into TreeDyn and the frame enlarged
- The red text for Arabidopsis sequences was done by changing the font color to red and using the find panel to find all At sequences (which turn red)
- The scale at the bottom was added by right-clicking on that space and choosing the tree name, annotation, and scale sub-menus
- This square tree is useful for seeing associations of genes from different species
15Square tree example
- This is part of the family 1.1 square dendrogram of Arabidopsis, rice and maize from our website
- The red names are Arabidopsis sequences, the black names are rice, and the green names are maize
- Regions alternate between grey-shaded and white backgrounds (added with Photoshop) to indicate clades of genes with similar sequences, which may relate to function (such as AUD/SUD or GME, etc.)
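The species color scheme used on these trees can be expressed as a small lookup, which is handy if you want to prepare a color list outside TreeDyn. This is a hypothetical helper, not part of TreeDyn: the `At` and `Os` prefixes follow the gene names shown in this tutorial, while the maize prefix is a caller-supplied assumption because maize naming is not described here.

```python
def species_color(name, maize_prefixes=("Zm",)):
    """Map a sequence name to the color used for its species on the trees.

    maize_prefixes is a hypothetical default; pass the prefixes that match
    your own maize sequence names.
    """
    if name.startswith("At"):
        return "red"      # Arabidopsis
    if name.startswith(("Os", "LOC_Os")):
        return "black"    # rice
    if name.startswith(tuple(maize_prefixes)):
        return "green"    # maize
    return "unassigned"   # anything else, e.g. a sequence you added yourself
```

Names that return `unassigned` are the ones to color by hand, such as a sequence from another species added for comparison.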
16Radial dendrograms
- TreeDyn can also draw radial dendrograms such as the one shown for rice family 1.1
- This can be done by right-clicking on the tree area to bring up the grey box in TreeDyn, choosing your tree, then Conformation > Radial
- TreeDyn allows you to resize, rotate, and flip clades around (see http://www.treedyn.org/ for detailed tutorials on these processes)
- For our site, we export the radial trees as JPEG images
17Finishing a radial dendrogram
The TreeDyn tree JPEG is finished as a Flash file, where the ovals and family names are added (rice family 1.1 shown)
Each individual clade of a family tree is also prepared in TreeDyn, and link buttons are added later in Flash (AUD/SUD-like shown)
18Viewing your gene of interest
- We provide protein sequence information that you can download and combine with your own sequence of interest for comparison with these three species
- Under each tree (family 1.1 shown) is the link "View the protein sequence file"
- Right-click and choose "Save Target As" to download the sequence with a filename and location you will remember
- You can do this for each Arabidopsis, rice, and maize family
19Viewing your gene of interest
- You may have a sequence you think is related to a particular family, such as the nucleotide interconversion pathway (family 1.1)
- For example, the wheat EST CV523101 from GenBank (http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CV523101) might be related to the TIGR rice gene Os05g29990 in the AUD/SUD clade of family 1.1, according to information from Gramene
20Viewing your gene of interest
- You can take the nucleotide sequence and convert it to protein sequence using a program such as GeneMark (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi)
- Protein sequence returned:
>CV523101_wheat
IARIFNTYGPRMCIDDGRVVSNFVAQALRKEPLTVYGDGKQTRSFQYVS
DLVEGLMRLMEGDHIGPFNLGNPGEFTMLELAKVVQDTIDPNARIEFREN
TQDDPHKRKPDITKAKEQLGWEPKIALRDGLPLMVTDFRKRIFGDQDSAA
TATEG
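To make the nucleotide-to-protein conversion concrete, here is a minimal sketch of translating one reading frame with the standard genetic code. It is a much simpler stand-in for GeneMark, which predicts the coding region for you; with this sketch you must already know (or try each of) the reading frames, and it only handles the forward strand.

```python
from itertools import product

# Standard genetic code, with codons ordered by bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate one forward reading frame, stopping at the first stop codon.

    Codons containing letters other than A, C, G, T are written as 'X'.
    """
    dna = "".join(dna.split()).upper().replace("U", "T")
    protein = []
    for start in range(frame, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")
        if amino_acid == "*":
            break  # stop codon ends the protein
        protein.append(amino_acid)
    return "".join(protein)
```

For an EST, trying `frame=0`, `1` and `2` and keeping the longest translation is a common first pass, though a gene-prediction program remains the more reliable choice.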
21Viewing your gene of interest
- Paste all of the sequences for family 1.1 (Arabidopsis, rice, and maize) plus the wheat EST converted to protein, CV523101_wheat, into a ClustalW program such as http://align.genome.jp/ from the Kyoto University Bioinformatics Center
- Perform the multiple alignment, copy the Newick tree data generated into a new Word file, and save it as a text file as previously shown
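Combining a downloaded family file with your own sequence can also be done with a few lines of code instead of copy and paste. This is a small sketch, with the 60-character line width chosen as a common convention rather than a requirement; the family text is whatever you saved from the "View the protein sequence file" link.

```python
import textwrap


def fasta_record(name, sequence, width=60):
    """Format one sequence as a FASTA record with wrapped sequence lines."""
    cleaned = "".join(sequence.split()).upper()
    return ">" + name + "\n" + "\n".join(textwrap.wrap(cleaned, width)) + "\n"


def add_to_family(family_fasta, name, sequence):
    """Append a new record to existing FASTA text, keeping records separate."""
    if family_fasta and not family_fasta.endswith("\n"):
        family_fasta += "\n"  # make sure the new '>' starts on its own line
    return family_fasta + fasta_record(name, sequence)
```

The combined text can then be pasted into ClustalW in one step.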
22Viewing your gene of interest
- Loading the Newick tree from ClustalW into TreeDyn as previously shown will allow you to visualize the tree
- The AUD/SUD clade of the tree generated by TreeDyn shows that the wheat EST (in blue) is most closely related to the rice gene Os05g29990 in the AUD clade
The AUD/SUD clade of the family 1.1 tree for Arabidopsis (red), rice (black), maize (green), and a wheat EST (blue), added to demonstrate how you can visualize the relatedness of your own genes using our protein sequences
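The visual judgment of "most closely related" can be cross-checked numerically from the same Newick text. The sketch below is an illustrative stand-in for inspecting the tree in TreeDyn, not the method used for the website: it sums branch lengths along the path between two leaves and reports the leaf with the smallest total. It assumes a well-formed tree, unique leaf names, and treats missing branch lengths as zero.

```python
def parse_newick(text):
    """Parse Newick text into nested (name, branch_length, children) tuples."""
    text = "".join(text.split()).rstrip(";")
    position = 0

    def node():
        nonlocal position
        children = []
        if text[position] == "(":
            position += 1
            children.append(node())
            while text[position] == ",":
                position += 1
                children.append(node())
            position += 1  # skip the closing ")"
        start = position
        while position < len(text) and text[position] not in ",():":
            position += 1
        name = text[start:position]
        length = 0.0
        if position < len(text) and text[position] == ":":
            position += 1
            start = position
            while position < len(text) and text[position] not in ",()":
                position += 1
            length = float(text[start:position])
        return (name, length, children)

    return node()


def leaf_paths(tree):
    """Map each leaf to (child indices from the root, cumulative depths)."""
    paths = {}

    def walk(current, indices, depths):
        name, length, children = current
        previous = depths[-1] if depths else 0.0
        depths = depths + [previous + length]
        if not children:
            paths[name] = (indices, depths)
        for index, child in enumerate(children):
            walk(child, indices + (index,), depths)

    walk(tree, (), [])
    return paths


def closest_leaf(newick, target):
    """Return (leaf name, path length) for the leaf nearest to target."""
    paths = leaf_paths(parse_newick(newick))
    target_indices, target_depths = paths[target]
    best = None
    for name, (indices, depths) in paths.items():
        if name == target:
            continue
        shared = 0  # number of branching steps the two leaves have in common
        while (shared < min(len(indices), len(target_indices))
               and indices[shared] == target_indices[shared]):
            shared += 1
        distance = target_depths[-1] + depths[-1] - 2 * target_depths[shared]
        if best is None or distance < best[1]:
            best = (name, distance)
    return best
```

Calling `closest_leaf(tree_text, "CV523101_wheat")` on the saved tree text returns the nearest named sequence and the summed branch length, which you can compare with what the drawn tree shows.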
23Bioinformatics sites used
- General
  - Multiple alignment for trees, ClustalW (http://align.genome.jp/)
  - Making trees, TreeDyn (http://www.treedyn.org/)
  - BLASTing, NCBI (http://www.ncbi.nlm.nih.gov/BLAST/)
  - Proteins translated by GeneMark (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi)
- Rice
  - Sequence BLAST using TIGR (http://www.tigr.org/tdb/e2k1/osa1/)
  - Downloading rice protein sequences from TIGR (http://www.tigr.org/tdb/e2k1/osa1/LocusNameSearch.shtml)
- Maize
  - Sequence BLAST using TIGR ZmGI (http://www.tigr.org/tigr-scripts/tgi/T_index.cgi?species=maize)
  - Sequence BLAST using Gramene (http://www.gramene.org/multi/blastview)