Title: Decoding the genome with perl, XML and SQL
1Decoding the genome with perl, XML and SQL
- Chris Mungall
- Howard Hughes Medical Institute
- UC Berkeley
- Lawrence Berkeley Lab
2Outline
- hacking the genome
- DataStag
- nested tag-value data and XML
- DBIxDBStag
- Relational to XML mapping
3(No Transcript)
4(No Transcript)
5CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT
TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAG
T GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCG
GA TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAAT
TCA TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGT
AGTG CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATAT
TACCG CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGA
GATATT TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCC
AACATAT TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT
TACCGCAA ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATA
TTTAGATTG CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGAT
CGCGTATGCG AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGC
AACCCAAAATG GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCC
GCTAATCAGAAA TAAATTCATTGCAACGTTAAATACAGCACAATATATG
ATCGCGTATGCGA GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCT
CGTTCTCTGTCTTA TATTACCGCAAACCCAAAAAGACAATACACGACAG
AGAGAGAGAGCAGCG GAGATATTTAGATTGCCTATTAAATATGATCGCG
TATGCGAGAGTAGTGC CAACATATTGTGCTCTCTATATAATGACTGCCT
CTCATTCTGTCTTATTT TACCGCAAACCCAAATCGACAATGCACGACAG
AGGAAGCAGAACAGATAT TTAGATTGCCTCTCATTTTCTCTCCCATATT
ATAGGGAGAAATATGATCG CGTATGCGAGAGTAGTGCCAACATATTGTG
CTCTTTGATTTTTTGGCAAC CCAAAATGGTGGCGGATGAACGAGATGAT
AATATATTCAAGTTGCCGCTA ATCAGAAATAAATTCATTGCAACGTTAA
ATACAGCACAATATATGATCGC GTATGCGAGAGTAGTGCCAACATATTG
TGCTAATGAGTGCCTCTCGTTCT CTGTCTTATATTACCGCAAACCCAAA
AAGACAATACACGACAGAGAGAGA GAGCAGCGGAGATATTTAGATTGCC
TATTAAATATGATCGCGTATGCGAG AGTAGTGCCAACATATTGTGCTCT
CTATATAATGACTGCCTCTCATTCTG TCTTATTTTACCGCAAACCCAAA
TCGACAATGCACGACAGAGGAAGCAGA ACAGATATTTAGATTGCCTCTC
ATTTTCTCTCCCATATTATAGGGAGAAA TATGATCGCGTATGCGAGAGT
AGTGCCAACATATTGTGCTCTTTGATTTT TTGGCAACCCAAAATGGTGG
CGGATGAACGAGATGATAATATATTCAAGT TGCCGCTAATCAGAAATAA
ATTCATTGCAACGTTAAATACAGCACAATAT ATGATCGCGTATGCGAGA
GTAGTGCCAACATATTGTGCTAATGAGTGCCT CTCGTTCTCTGTCTTAT
ATTACCGCAAACCCAAAAAGACAATACACGACA
MVADERDDNIFKLPLIRNKFIATLNTAQYM IAYARVVPTYCANECLSFS
VLYYRKPKKTI HDREREQRRYLDCLLNMIAYARVVPTYCAL YIMTASH
SVLFYRKPKSTMHDRGSRTDI..
6Stein LD, Mungall C, Shu S, Caudy M, Mangone M,
Day A, Nickerson E, Stajich JE, Harris TW, Arva
A, Lewis S The generic genome browser a building
block for a model organism system
database. Genome Research 2002 121599-1610.
7(No Transcript)
8Use case a simple object model for genes
9A simple object model for genes
10package BioGene use base qw(BioEntity) hea
d2 symbol Usage gene-gtsymbol(HGNC)
symbol gene-gtsymbol cut sub symbol
my self shift self-gt_symbol shift if
_at__ return self-gt_symbol
11package BioGene use base qw(BioEntity) use
ClassMethodMaker scalar gt qw/symbol
taxon/, array gt qw/phenotypes functions
positions/
package BioTaxon use base qw(BioEntity) use
ClassMethodMaker scalar gt qw/id
common_name genus species
parent_node/,
12Perl meta-object systems
- eg ClassMethodMaker
- fun!
- Fine for hermetic perl
- but how do we interoperate with
- other languages
- databases
- XML
- No way of querying objects
13DataStag
- Objects and XML
- perl objects are D(A)G-ish
- XML is tree like
- Why not merge the two?
- YAML
- alternate exchange format with perl-ish semantics
- Stag
- nested tag-value pairs
- perl objects with XMLish semantics
- XML-centric
14ltdatasetgt ltgene_setgt ltgenegt
ltsymbolgtHGNClt/symbolgt lttax_idgt9606lt/tax_idgt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
ltmol_functiongtiron homeostasislt/mol_functiongt
ltpositiongt ltstartgt10000lt/startgt
ltendgt5000lt/endgt ltchromosomegt6lt/chromosomegt
lt/positiongt lt/genegt
15my dataset DataStag-gtparse(genes.xml) my
gene_set dataset-gtget_gene_set foreach my
gene (gene_set-gtget_gene) next unless
gene-gtsget_tax_id 9606 printf s\n,
gene-gtsget_symbol printf s\n, foreach
gene-gtget_phenotype
ltdatasetgt ltgene_setgt ltgenegt
ltsymbolgtHGNClt/symbolgt lttax_idgt9606lt/tax_idgt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
ltmol_functiongtiron homeostasislt/mol_functiongt
ltpositiongt ltstartgt10000lt/startgt
ltendgt5000lt/endgt ltchromosomegt6lt/chromosomegt
lt/positiongt lt/genegt
16Stag trees
- Recursive data structure
- node-labeled tree
- Implemented as nested lists
- node name,value
- value DATA or list of nodes
- Equivalent to element-only XML
17Stag accessors
- Tag-value pairs
- fooval obj-gtget_FOO wantarrayfalse
- fooval obj-gtsget_FOO
- _at_foovals obj-gtget_FOO wantarraytrue
- obj-gtset_FOO(fooval)
- No distinction between single and multi valued
tags - ordered (no hashes)
18my dataset DataStag-gtparse(genes.xml) my
_at_symbols dataset-gtfind_symbol
ltdatasetgt ltgene_setgt ltgenegt
ltsymbolgtHGNClt/symbolgt lttax_idgt9606lt/tax_idgt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
ltmol_functiongtiron homeostasislt/mol_functiongt
ltpositiongt ltstartgt10000lt/startgt
ltendgt5000lt/endgt ltchromosomegt6lt/chromosomegt
lt/positiongt lt/genegt
19my dataset DataStag-gtparse(genes.xml) my
_at_symbols dataset-gtget(gene_set/gene/symbol)
ltdatasetgt ltgene_setgt ltgenegt
ltsymbolgtHGNClt/symbolgt lttax_idgt9606lt/tax_idgt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
ltmol_functiongtiron homeostasislt/mol_functiongt
ltpositiongt ltstartgt10000lt/startgt
ltendgt5000lt/endgt ltchromosomegt6lt/chromosomegt
lt/positiongt lt/genegt
20my dataset DataStag-gtparse(genes.xml) my
_at_genes dataset-gtwhere(gene, sub
shift-gtsget(position/start) lt 5000)
ltdatasetgt ltgene_setgt ltgenegt
ltsymbolgtHGNClt/symbolgt lttax_idgt9606lt/tax_idgt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
ltmol_functiongtiron homeostasislt/mol_functiongt
ltpositiongt ltstartgt10000lt/startgt
ltendgt5000lt/endgt ltchromosomegt6lt/chromosomegt
lt/positiongt lt/genegt
21Attributes and mixed elements
ltgene idHGNCgt ltdescgt implicated in
ltbgtHemochromatosislt/bgt lt/descgt lt/genegt
ltgenegt lt_at_gt ltidgtHGNClt/idgt lt/_at_gt ltdescgt
lt.gtImplicated inlt/.gt ltbgtHemochromatosislt/bgt
lt/descgt lt/genegt
my id gene-gtget(_at_/id)
22Stag alternate syntaxes
dataset gene_set gene symbol
HGNC tax_id 9606 phenotype
Hemochromatosis phenotype Porphyria
variegata mol_function iron homeostasis
position start 10000 end
5000 chromosome 6
(dataset (gene_set (gene (symbol
"HGNC") (tax_id "9606") (phenotype
"Hemochromatosis") (phenotype "Porphyria
variegata") (mol_function "iron
homeostasis") (position (start
"10000") (end "5000") (chromosome
"6")))))
23Stag event model
- Like SAX, but more perlish
- DataStagBaseGenerator
- DataStagBaseHandler
- Good for implementing parsers
- compatible with SAX
- slow!
24DataStag features
- Stag can be used with or without XML
- TODO strict mode
- Speed issues
- Stag queries vs XPATH?
25SQL Databases
- Relational databases are indispensable in
bioinformatics - scalable with large datasets
- expressive modeling
- queries
- Impedance mismatch problem
- relational model and imperative programming
languages are not well-matched
26Object Relational Mapping
- Tangram, Alzabo, ClassDBI,
- Object-centric
- DBStag is SQL-centric
27Why perl programmers dont like SQL
SELECT FROM gene NATURAL JOIN gene_phenotype AS
gp NATURAL JOIN taxon WHERE
28SELECT FROM gene NATURAL JOIN
gene_phenotype NATURAL JOIN taxon USE NESTING
(gene_set(gene(taxon)(gene_phenotype)))
ltgene_setgt ltgenegt ltsymbolgtHGNClt/symbol
gt lttax_idgt9606lt/tax_idgt lttaxongt
ltnamegthumanlt/namegt ltgenusgtHomolt/genusgt
ltspeciesgtsapienslt/genusgt lt/taxongt
ltgene_phenotypegt ltphenotypegtHemochromat
osislt/phenotypegt lt/gene_phenotypegt
ltgene_phenotypegt ltphenotypegtPorphyria
variegatalt/phenotypegt lt/gene_phenotypegt
lt/genegt
29relational-to-XML mappings
- Tables mapped to non-terminal nodes
- Columns mapped to terminal node (PCDATA)
- Table nodes are nested according to
- order in SQL query
- USE NESTING clause
- Tables aliases add extra level of nesting
30Can produce unusual XML
ltgene_setgt ltgenegt ltsymbolgtHGNClt/symbol
gt lttax_idgt9606lt/tax_idgt lttaxongt
ltnamegthumanlt/namegt ltgenusgtHomolt/genusgt
ltspeciesgtsapienslt/genusgt lt/taxongt
ltgene_phenotypegt ltphenotypegtHemochromat
osislt/phenotypegt lt/gene_phenotypegt
ltgene_phenotypegt ltphenotypegtPorphyria
variegatalt/phenotypegt lt/gene_phenotypegt
lt/genegt
31Desired XML?
ltgene_setgt ltgenegt ltsymbolgtHGNClt/symbol
gt lttax_idgt9606lt/tax_idgt lttaxongt
ltnamegthumanlt/namegt ltgenusgtHomolt/genusgt
ltspeciesgtsapienslt/genusgt lt/taxongt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
lt/genegt
32use DBIxDBStag my dbh DBIxDBStag-gtconnect(
mygenedb) my gene_set dbh-gtselectall_stag(q
SELECT FROM gene NATURAL JOIN
gene_phenotype NATURAL JOIN taxon USE NESTING
(gene_set(gene(taxon)(gene_phenotype))) ) foreac
h my gene (_at_gene_set) printf Gene s in
s\n, gene-gtsget_symbol,
gene-gtsget(taxon/name) foreach my gp
(gene-gtget_gene_phenotype) printf
s\n, gp-gtsget_phenotype
Gene HGNC in human Hemochromatosis
Poryphria varietagata Gene amontillado in
fruitfly Abnormal hatching hypoactive
33DBStag Templates
gene_query.stg
SELECT FROM gene NATURAL JOIN gene_phenotype
NATURAL JOIN taxon NATURAL JOIN
gene_function WHERE gene.symbol gt
gene_symbol gene.gene_id IN (SELECT
gene_id FROM gene_phenotype WHERE phenotype
gt phenotype) gene.gene_id IN (SELECT
gene_id FROM gene_function WHERE desc gt
function) USE NESTING (gene_set(gene(taxon)
(gene_phenotype)
(gene_function)))
34Using templates
use DBIxDBStag my gene_set
DBIxDBStag-gtselectall_stag(
-templategtgene_query, -bindgtfunctiongtir
on, phenotypegthemo)
unixgt selectall_xml.pl -d mygenedb /gene_query
\ functioniron phenotypehemo gt
\ genes.xml
35Storing data
gene_set DataStag-gtparse(genes.xml) dbh-gt
storenode(gene_set)
unixgt stag-storenode.pl -d mygenedb genes.xml
additional metadata required for - many-to-many
relations - foreign keys to tables with
non-matching names
36XORT mode
ltgene_setgt lttaxon oplookup idhsagt
ltgenusgtHomolt/genusgt ltspeciesgtsapienslt/genu
sgt lt/taxongt ltgenegt ltsymbol
opupdategtHGNClt/symbolgt
lttaxon_idgthsalt/taxon_idgt ltgene_phenotypegt
ltphenotypegtHemochromatosislt/phenotypegt
lt/gene_phenotypegt ltgene_phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
lt/gene_phenotypegt lt/genegt
37Automatic schema building from XML dataset
- Steps
- pre-process XML
- run stag-autoddl.pl on XML
- run stag-storenode.pl
- see DBIxDBStagCookbook
- When to use
- rapid development
- ad-hoc analyses
38TODO
- SOAP bindings
- Combinable templates
- Address speed issues
39Should you use it?
- DataStag
- when you dont have an existing object model
- you prefer to separate behaviour from the data
model - if you like element-y XML
- DBIxDBStag
- your data naturally nests
- you need access to complex queries in SQL
- little or no inheritance
40DBStag and web development
- SQL Templates separate query logic from display
logic - Rendering HTML
- XSLT
- HTML Templating toolkit and Stag objects
41DBStag in use
- Analysis of gene structure in 12 insects
- Building of a data warehouse for the Gene
Ontology - Database of human genetic variation and disease
- Library of templates for many common bio-schemas
42Learning more
- http//stag.sourceforget.net