Decoding the genome with perl, XML and SQL - PowerPoint PPT Presentation

1 / 42
About This Presentation
Title:

Decoding the genome with perl, XML and SQL

Description:

Stag queries vs XPATH? SQL Databases. Relational databases are ... unix stag-storenode.pl -d mygenedb genes.xml. additional metadata required for ... – PowerPoint PPT presentation

Number of Views:65
Avg rating:3.0/5.0
Slides: 43
Provided by: chrism9
Category:
Tags: sql | xml | decoding | genome | perl | stag

less

Transcript and Presenter's Notes

Title: Decoding the genome with perl, XML and SQL


1
Decoding the genome with perl, XML and SQL
  • Chris Mungall
  • Howard Hughes Medical Institute
  • UC Berkeley
  • Lawrence Berkeley Lab

2
Outline
  • hacking the genome
  • DataStag
  • nested tag-value data and XML
  • DBIxDBStag
  • Relational to XML mapping

3
(No Transcript)
4
(No Transcript)
5
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT
TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAG
T GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCG
GA TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAAT
TCA TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGT
AGTG CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATAT
TACCG CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGA
GATATT TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCC
AACATAT TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT
TACCGCAA ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATA
TTTAGATTG CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGAT
CGCGTATGCG AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGC
AACCCAAAATG GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCC
GCTAATCAGAAA TAAATTCATTGCAACGTTAAATACAGCACAATATATG
ATCGCGTATGCGA GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCT
CGTTCTCTGTCTTA TATTACCGCAAACCCAAAAAGACAATACACGACAG
AGAGAGAGAGCAGCG GAGATATTTAGATTGCCTATTAAATATGATCGCG
TATGCGAGAGTAGTGC CAACATATTGTGCTCTCTATATAATGACTGCCT
CTCATTCTGTCTTATTT TACCGCAAACCCAAATCGACAATGCACGACAG
AGGAAGCAGAACAGATAT TTAGATTGCCTCTCATTTTCTCTCCCATATT
ATAGGGAGAAATATGATCG CGTATGCGAGAGTAGTGCCAACATATTGTG
CTCTTTGATTTTTTGGCAAC CCAAAATGGTGGCGGATGAACGAGATGAT
AATATATTCAAGTTGCCGCTA ATCAGAAATAAATTCATTGCAACGTTAA
ATACAGCACAATATATGATCGC GTATGCGAGAGTAGTGCCAACATATTG
TGCTAATGAGTGCCTCTCGTTCT CTGTCTTATATTACCGCAAACCCAAA
AAGACAATACACGACAGAGAGAGA GAGCAGCGGAGATATTTAGATTGCC
TATTAAATATGATCGCGTATGCGAG AGTAGTGCCAACATATTGTGCTCT
CTATATAATGACTGCCTCTCATTCTG TCTTATTTTACCGCAAACCCAAA
TCGACAATGCACGACAGAGGAAGCAGA ACAGATATTTAGATTGCCTCTC
ATTTTCTCTCCCATATTATAGGGAGAAA TATGATCGCGTATGCGAGAGT
AGTGCCAACATATTGTGCTCTTTGATTTT TTGGCAACCCAAAATGGTGG
CGGATGAACGAGATGATAATATATTCAAGT TGCCGCTAATCAGAAATAA
ATTCATTGCAACGTTAAATACAGCACAATAT ATGATCGCGTATGCGAGA
GTAGTGCCAACATATTGTGCTAATGAGTGCCT CTCGTTCTCTGTCTTAT
ATTACCGCAAACCCAAAAAGACAATACACGACA
MVADERDDNIFKLPLIRNKFIATLNTAQYM IAYARVVPTYCANECLSFS
VLYYRKPKKTI HDREREQRRYLDCLLNMIAYARVVPTYCAL YIMTASH
SVLFYRKPKSTMHDRGSRTDI..
6
Stein LD, Mungall C, Shu S, Caudy M, Mangone M,
Day A, Nickerson E, Stajich JE, Harris TW, Arva
A, Lewis S The generic genome browser a building
block for a model organism system
database. Genome Research 2002 121599-1610.
7
(No Transcript)
8
Use case a simple object model for genes
9
A simple object model for genes
10
package BioGene use base qw(BioEntity) hea
d2 symbol Usage gene-gtsymbol(HGNC)
symbol gene-gtsymbol cut sub symbol
my self shift self-gt_symbol shift if
_at__ return self-gt_symbol
11
package BioGene use base qw(BioEntity) use
ClassMethodMaker scalar gt qw/symbol
taxon/, array gt qw/phenotypes functions
positions/
package BioTaxon use base qw(BioEntity) use
ClassMethodMaker scalar gt qw/id
common_name genus species
parent_node/,
12
Perl meta-object systems
  • eg ClassMethodMaker
  • fun!
  • Fine for hermetic perl
  • but how do we interoperate with
  • other languages
  • databases
  • XML
  • No way of querying objects

13
DataStag
  • Objects and XML
  • perl objects are D(A)G-ish
  • XML is tree like
  • Why not merge the two?
  • YAML
  • alternate exchange format with perl-ish semantics
  • Stag
  • nested tag-value pairs
  • perl objects with XMLish semantics
  • XML-centric

14
ltdatasetgt ltgene_setgt ltgenegt
ltsymbolgtHGNClt/symbolgt lttax_idgt9606lt/tax_idgt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
ltmol_functiongtiron homeostasislt/mol_functiongt
ltpositiongt ltstartgt10000lt/startgt
ltendgt5000lt/endgt ltchromosomegt6lt/chromosomegt
lt/positiongt lt/genegt
15
my dataset DataStag-gtparse(genes.xml) my
gene_set dataset-gtget_gene_set foreach my
gene (gene_set-gtget_gene) next unless
gene-gtsget_tax_id 9606 printf s\n,
gene-gtsget_symbol printf s\n, foreach
gene-gtget_phenotype
ltdatasetgt ltgene_setgt ltgenegt
ltsymbolgtHGNClt/symbolgt lttax_idgt9606lt/tax_idgt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
ltmol_functiongtiron homeostasislt/mol_functiongt
ltpositiongt ltstartgt10000lt/startgt
ltendgt5000lt/endgt ltchromosomegt6lt/chromosomegt
lt/positiongt lt/genegt
16
Stag trees
  • Recursive data structure
  • node-labeled tree
  • Implemented as nested lists
  • node name,value
  • value DATA or list of nodes
  • Equivalent to element-only XML

17
Stag accessors
  • Tag-value pairs
  • fooval obj-gtget_FOO wantarrayfalse
  • fooval obj-gtsget_FOO
  • _at_foovals obj-gtget_FOO wantarraytrue
  • obj-gtset_FOO(fooval)
  • No distinction between single and multi valued
    tags
  • ordered (no hashes)

18
my dataset DataStag-gtparse(genes.xml) my
_at_symbols dataset-gtfind_symbol
ltdatasetgt ltgene_setgt ltgenegt
ltsymbolgtHGNClt/symbolgt lttax_idgt9606lt/tax_idgt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
ltmol_functiongtiron homeostasislt/mol_functiongt
ltpositiongt ltstartgt10000lt/startgt
ltendgt5000lt/endgt ltchromosomegt6lt/chromosomegt
lt/positiongt lt/genegt
19
my dataset DataStag-gtparse(genes.xml) my
_at_symbols dataset-gtget(gene_set/gene/symbol)

ltdatasetgt ltgene_setgt ltgenegt
ltsymbolgtHGNClt/symbolgt lttax_idgt9606lt/tax_idgt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
ltmol_functiongtiron homeostasislt/mol_functiongt
ltpositiongt ltstartgt10000lt/startgt
ltendgt5000lt/endgt ltchromosomegt6lt/chromosomegt
lt/positiongt lt/genegt
20
my dataset DataStag-gtparse(genes.xml) my
_at_genes dataset-gtwhere(gene, sub
shift-gtsget(position/start) lt 5000)
ltdatasetgt ltgene_setgt ltgenegt
ltsymbolgtHGNClt/symbolgt lttax_idgt9606lt/tax_idgt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
ltmol_functiongtiron homeostasislt/mol_functiongt
ltpositiongt ltstartgt10000lt/startgt
ltendgt5000lt/endgt ltchromosomegt6lt/chromosomegt
lt/positiongt lt/genegt
21
Attributes and mixed elements
ltgene idHGNCgt ltdescgt implicated in
ltbgtHemochromatosislt/bgt lt/descgt lt/genegt
ltgenegt lt_at_gt ltidgtHGNClt/idgt lt/_at_gt ltdescgt
lt.gtImplicated inlt/.gt ltbgtHemochromatosislt/bgt
lt/descgt lt/genegt
my id gene-gtget(_at_/id)
22
Stag alternate syntaxes
dataset gene_set gene symbol
HGNC tax_id 9606 phenotype
Hemochromatosis phenotype Porphyria
variegata mol_function iron homeostasis
position start 10000 end
5000 chromosome 6
(dataset (gene_set (gene (symbol
"HGNC") (tax_id "9606") (phenotype
"Hemochromatosis") (phenotype "Porphyria
variegata") (mol_function "iron
homeostasis") (position (start
"10000") (end "5000") (chromosome
"6")))))
23
Stag event model
  • Like SAX, but more perlish
  • DataStagBaseGenerator
  • DataStagBaseHandler
  • Good for implementing parsers
  • compatible with SAX
  • slow!

24
DataStag features
  • Stag can be used with or without XML
  • TODO strict mode
  • Speed issues
  • Stag queries vs XPATH?

25
SQL Databases
  • Relational databases are indispensable in
    bioinformatics
  • scalable with large datasets
  • expressive modeling
  • queries
  • Impedance mismatch problem
  • relational model and imperative programming
    languages are not well-matched

26
Object Relational Mapping
  • Tangram, Alzabo, ClassDBI,
  • Object-centric
  • DBStag is SQL-centric

27
Why perl programmers dont like SQL
SELECT FROM gene NATURAL JOIN gene_phenotype AS
gp NATURAL JOIN taxon WHERE
28
SELECT FROM gene NATURAL JOIN
gene_phenotype NATURAL JOIN taxon USE NESTING
(gene_set(gene(taxon)(gene_phenotype)))
ltgene_setgt ltgenegt ltsymbolgtHGNClt/symbol
gt lttax_idgt9606lt/tax_idgt lttaxongt
ltnamegthumanlt/namegt ltgenusgtHomolt/genusgt
ltspeciesgtsapienslt/genusgt lt/taxongt
ltgene_phenotypegt ltphenotypegtHemochromat
osislt/phenotypegt lt/gene_phenotypegt
ltgene_phenotypegt ltphenotypegtPorphyria
variegatalt/phenotypegt lt/gene_phenotypegt
lt/genegt
29
relational-to-XML mappings
  • Tables mapped to non-terminal nodes
  • Columns mapped to terminal node (PCDATA)
  • Table nodes are nested according to
  • order in SQL query
  • USE NESTING clause
  • Tables aliases add extra level of nesting

30
Can produce unusual XML
ltgene_setgt ltgenegt ltsymbolgtHGNClt/symbol
gt lttax_idgt9606lt/tax_idgt lttaxongt
ltnamegthumanlt/namegt ltgenusgtHomolt/genusgt
ltspeciesgtsapienslt/genusgt lt/taxongt
ltgene_phenotypegt ltphenotypegtHemochromat
osislt/phenotypegt lt/gene_phenotypegt
ltgene_phenotypegt ltphenotypegtPorphyria
variegatalt/phenotypegt lt/gene_phenotypegt
lt/genegt
31
Desired XML?
ltgene_setgt ltgenegt ltsymbolgtHGNClt/symbol
gt lttax_idgt9606lt/tax_idgt lttaxongt
ltnamegthumanlt/namegt ltgenusgtHomolt/genusgt
ltspeciesgtsapienslt/genusgt lt/taxongt
ltphenotypegtHemochromatosislt/phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
lt/genegt
32
use DBIxDBStag my dbh DBIxDBStag-gtconnect(
mygenedb) my gene_set dbh-gtselectall_stag(q
SELECT FROM gene NATURAL JOIN
gene_phenotype NATURAL JOIN taxon USE NESTING
(gene_set(gene(taxon)(gene_phenotype))) ) foreac
h my gene (_at_gene_set) printf Gene s in
s\n, gene-gtsget_symbol,
gene-gtsget(taxon/name) foreach my gp
(gene-gtget_gene_phenotype) printf
s\n, gp-gtsget_phenotype
Gene HGNC in human Hemochromatosis
Poryphria varietagata Gene amontillado in
fruitfly Abnormal hatching hypoactive
33
DBStag Templates
gene_query.stg
SELECT FROM gene NATURAL JOIN gene_phenotype
NATURAL JOIN taxon NATURAL JOIN
gene_function WHERE gene.symbol gt
gene_symbol gene.gene_id IN (SELECT
gene_id FROM gene_phenotype WHERE phenotype
gt phenotype) gene.gene_id IN (SELECT
gene_id FROM gene_function WHERE desc gt
function) USE NESTING (gene_set(gene(taxon)
(gene_phenotype)
(gene_function)))
34
Using templates
use DBIxDBStag my gene_set
DBIxDBStag-gtselectall_stag(
-templategtgene_query, -bindgtfunctiongtir
on, phenotypegthemo)
unixgt selectall_xml.pl -d mygenedb /gene_query
\ functioniron phenotypehemo gt
\ genes.xml
35
Storing data
gene_set DataStag-gtparse(genes.xml) dbh-gt
storenode(gene_set)
unixgt stag-storenode.pl -d mygenedb genes.xml
additional metadata required for - many-to-many
relations - foreign keys to tables with
non-matching names
36
XORT mode
ltgene_setgt lttaxon oplookup idhsagt
ltgenusgtHomolt/genusgt ltspeciesgtsapienslt/genu
sgt lt/taxongt ltgenegt ltsymbol
opupdategtHGNClt/symbolgt
lttaxon_idgthsalt/taxon_idgt ltgene_phenotypegt
ltphenotypegtHemochromatosislt/phenotypegt
lt/gene_phenotypegt ltgene_phenotypegt
ltphenotypegtPorphyria variegatalt/phenotypegt
lt/gene_phenotypegt lt/genegt
37
Automatic schema building from XML dataset
  • Steps
  • pre-process XML
  • run stag-autoddl.pl on XML
  • run stag-storenode.pl
  • see DBIxDBStagCookbook
  • When to use
  • rapid development
  • ad-hoc analyses

38
TODO
  • SOAP bindings
  • Combinable templates
  • Address speed issues

39
Should you use it?
  • DataStag
  • when you dont have an existing object model
  • you prefer to separate behaviour from the data
    model
  • if you like element-y XML
  • DBIxDBStag
  • your data naturally nests
  • you need access to complex queries in SQL
  • little or no inheritance

40
DBStag and web development
  • SQL Templates separate query logic from display
    logic
  • Rendering HTML
  • XSLT
  • HTML Templating toolkit and Stag objects

41
DBStag in use
  • Analysis of gene structure in 12 insects
  • Building of a data warehouse for the Gene
    Ontology
  • Database of human genetic variation and disease
  • Library of templates for many common bio-schemas

42
Learning more
  • http//stag.sourceforget.net
Write a Comment
User Comments (0)
About PowerShow.com