all.joiner - PowerPoint PPT Presentation

Slides: 18
Provided by: jimk88


1
all.joiner
  • A file that describes joinable fields in the UCSC
    Genome Databases

2
basic example of an identifier
identifier softberryGeneName
"Link together Fshgene gene structure, peptide, and homolog"
    $gbd.softberryGene.name
    $gbd.softberryPep.name
    $gbd.softberryHom.name
  • The central concept of all.joiner is the
    identifier, which appears in fields of multiple
    tables, sometimes even in multiple databases.
  • $gbd is a variable that contains a
    comma-separated list of genome databases.
  • An identifier consists of an identifier line, a
    required comment in quotes, and a list of
    database.table.field entries where the identifier
    is used. The first field listed is the master key;
    it contains every value of the identifier. Later
    fields need not contain all of them.

3
Variables
  • Variables are defined by the set keyword.
  • In practice they are mostly used for
    comma-separated lists of databases.
  • set fish tetraodon,fugu,zebrafish
  • set worms elegans,briggsae
  • After these two sets, typing $fish,$worms is
    equivalent to typing
    tetraodon,fugu,zebrafish,elegans,briggsae
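
The substitution described on this slide can be sketched in a few lines of Python. This is only an illustration, not the parser used by the UCSC tools. It assumes variables are referenced with a leading $ (as in $fish or ${fish}), and the function names expand and read_sets are made up for the example.

```python
import re

def expand(value, variables):
    """Replace each $name or ${name} in value with its defined value.

    Assumption for this sketch: variables are referenced with a leading '$'.
    """
    def lookup(match):
        name = match.group(1) or match.group(2)
        return variables[name]
    return re.sub(r"\$\{(\w+)\}|\$(\w+)", lookup, value)

def read_sets(lines):
    """Collect 'set name value' lines, expanding earlier variables as we go."""
    variables = {}
    for line in lines:
        words = line.split(None, 2)
        if len(words) == 3 and words[0] == "set":
            variables[words[1]] = expand(words[2].strip(), variables)
    return variables

variables = read_sets([
    "set fish tetraodon,fugu,zebrafish",
    "set worms elegans,briggsae",
])
combined = expand("$fish,$worms", variables)
print(combined)
```

Because read_sets expands each value as it is stored, a later set line can be built from earlier variables, which is how a list of all genome databases can be assembled from per-organism lists.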

4
Databases by organism
# Define databases used for various organisms
set hg hg15,hg16,hg17,hg18
set mm mm3,mm4,mm5,mm6,mm7,mm8
set rn rn2,rn3,rn4
set fr fr1,fr2
set ce ce1,ce2,ce4
set cb cb1,cb3
set dm dm1,dm2,dm3
set dp dp2,dp3
set sc sc1
set sacCer sacCer1
set panTro panTro1,panTro2
set galGal galGal2,galGal3
5
# Define all genome databases.
set gbd $hg,$mm,$rn,$fr,$ce,$cb,$dm,$dp,$sc,$sacCer,$panTro,$galGal

# Only consider one member of gbd at a time.
exclusiveSet $gbd

# Define other databases that we check
set otherDb visiGene,uniProt,go,proteome,hgFixed

# Set up list of databases we ignore and those we
# check. Program will complain about other
# databases.
databasesChecked $gbd,$otherDb
databasesIgnored mysql,lostfound,proteinDb,zooDb,hgcentraltest,hgcentralbeta
6
# Define databases that support known genes
set kgDb $hg,$mm,$rn

# Define databases that support Gene Sorter
# (which once was the gene family browser)
set familyDb $hg,$mm,$ce,$sacCer,$dm
7
Chains and nets are more complex than other
identifiers
# Magic for tables split between chromosomes
set split splitPrefix=chr%_

# Stuff to link together self chains and nets
identifier chainSelf
"Link together self chain info"
    $gbd.chainSelf.id $split
    $gbd.chainSelfLink.chainId $split
    $gbd.netSelf.chainId exclude=0
The splitPrefix allows logical tables to be
split. The % acts as a wildcard (SQL
style). The exclude=0 says that the master key
need not include 0.
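
To make the wildcard idea concrete, here is a small Python sketch of how one logical table could be matched against per-chromosome physical tables. It is an illustration only, not the joinerCheck implementation. It assumes the prefix is written chr%_, that only % is treated as a wildcard, and that the underscore is literal. The function names are hypothetical.

```python
import re

def split_table_pattern(split_prefix, table):
    """Build a regex for physical tables named <prefix><table>.

    Assumption for this sketch: '%' in the prefix matches any run of
    characters, and every other character (including '_') is literal.
    """
    parts = [re.escape(part) for part in split_prefix.split("%")]
    return re.compile("^" + ".*".join(parts) + re.escape(table) + "$")

def find_split_tables(split_prefix, table, physical_tables):
    """Return the physical tables that make up one logical split table."""
    pattern = split_table_pattern(split_prefix, table)
    return [name for name in physical_tables if pattern.match(name)]

tables = ["chr1_chainSelf", "chrX_chainSelf", "chr1_chainSelfLink", "netSelf"]
print(find_split_tables("chr%_", "chainSelf", tables))
```

With this reading, chr1_chainSelf and chrX_chainSelf both count as pieces of the logical table chainSelf, while chr1_chainSelfLink belongs to a different logical table.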
8
Other chains and nets use a macro expansion of
sorts so we don't need to define a separate
identifier for each one.
set chainDest Hg15,Hg16,Hg17,Mm4,Mm5,Mm6
identifier chain[${chainDest}]Id
"Link together chain info"
    $gbd.chain[].id $split
    $gbd.chain[]Link.chainId $split
    $gbd.net[].chainId
    $gbd.allChain.id
    $gbd.netRxBest[].chainId exclude=0
    $gbd.netNonGap[].chainId exclude=0
    $gbd.netSynteny[].chainId exclude=0
9
# Genbank/trEMBL Accessions and meaningful
# subsets thereof
identifier genbankAccession external=genbank
"Generic Genbank Accession. More specific Genbank accessions follow"
    $gbd.seq.acc

identifier stsAccession external=genbank typeOf=genbankAccession
"Genbank accession of a Sequence Tag Site (STS) sequence."
    $gbd.stsInfo2.genbank dupeOk

identifier bacEndAccession typeOf=genbankAccession
"Genbank accession of a BAC end read."
    $gbd.all_bacends.qName dupeOk
    $gbd.bacEndPairs.lfNames comma
    $hg.fishClones.beNames comma minCheck=0.70
The typeOf attribute allows joins between parent
and child, but not between siblings.
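
The parent/child rule can be written down as a tiny Python check. This is a sketch under stated assumptions rather than how the Table Browser decides joins: the TYPE_OF dictionary is a hypothetical in-memory form of the typeOf relationships on this slide, and only a direct parent is considered.

```python
# Hypothetical in-memory form of the typeOf relationships on this slide:
# child identifier -> parent identifier.
TYPE_OF = {
    "stsAccession": "genbankAccession",
    "bacEndAccession": "genbankAccession",
}

def can_join(a, b, type_of=TYPE_OF):
    """True if a and b are the same identifier, or one is the direct
    typeOf parent of the other. Siblings do not qualify.

    Assumption for this sketch: only direct parents count.
    """
    return a == b or type_of.get(a) == b or type_of.get(b) == a

print(can_join("stsAccession", "genbankAccession"))
print(can_join("stsAccession", "bacEndAccession"))
```

An STS accession and a BAC end accession are both Genbank accessions, but a value from one is not expected to appear in the other, which is why the sibling join is refused.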
10
identifier hugoName external=HUGO fuzzy
"International Human Gene Identifier"
    $hg.refLink.name
    $hg.atlasOncoGene.locusSymbol
    $hg.kgAlias.alias
    $hg.kgXref.geneSymbol
    $hg.refFlat.geneName
    $hg.jaxOrtholog.humanSymbol
    hg13,hg15.geneBands.name
Biological names for human genes are so messy
that no validation is done (note the fuzzy keyword).
11
identifier ensemblTranscriptId external=Ensembl dependency
"Ensembl Transcript ID"
    $gbd.ensGene.name chopAfter=.
    $gbd.superfamily.name
    $gbd.ensGeneXref.transcript_name chopAfter=. minCheck=0.20
    mm3,hg13.ensemblXref.transcript_name chopAfter=. minCheck=0.20
    mm3.ensemblXref2.transcript_name chopAfter=. minCheck=0.20
    $gbd.ensGtp.transcript chopAfter=. minCheck=0.98
    $gbd.ensPep.name chopAfter=. minCheck=0.98
    $gbd.ensTranscript.transcript_name chopAfter=. minCheck=0.20
    $kgDb.knownToEnsembl.value chopAfter=.
    $gbd.sfDescription.name chopAfter=.
    mm3.superfamily.name chopAfter=.
Ensembl isn't fuzzy but requires a relaxed
minCheck.
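
A rough Python sketch of what a relaxed key check might look like follows. It is not the joinerCheck code. It reads chopAfter as trimming each value at the first occurrence of the given character, and minCheck as the minimum fraction of a field's values that must be found in the master key; both readings are assumptions here. The accession strings are invented for the example.

```python
def chop_after(value, marker):
    """Keep only the part of value before the first occurrence of marker."""
    return value.split(marker, 1)[0]

def hit_fraction(master_keys, values):
    """Fraction of values that are present in the master key set."""
    if not values:
        return 1.0
    hits = sum(1 for value in values if value in master_keys)
    return hits / len(values)

def check_field(master_keys, values, min_check=1.0):
    """True if enough values hit the master key.

    Assumption for this sketch: min_check is a minimum hit fraction,
    and 1.0 means every value must be found.
    """
    return hit_fraction(master_keys, values) >= min_check

# Made-up accessions with version suffixes, chopped before comparison.
master = {chop_after(k, ".") for k in ["ENST0001.1", "ENST0002.3"]}
field = [chop_after(v, ".") for v in
         ["ENST0001.2", "ENST0002", "ENST0009.1", "ENST0001"]]

print(hit_fraction(master, field))
print(check_field(master, field, min_check=0.20))
print(check_field(master, field, min_check=0.98))
```

Under this reading, a field with a low minCheck is still validated, just against a looser threshold, which is different from a fuzzy identifier where no validation is done.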
12
# Table types - describe tables sharing a common
# format.
type genePred
    $hg.acembly
    $gbd.ECgene
    $gbd.geneid
    $gbd.genscan
    $gbd.sgpGene
    $gbd.softberryGene
    $gbd.twinscan
    $gbd.ensGene
    $gbd.vegaGene
    $gbd.refGene
    $gbd.jgiFilteredModels
The Table Browser looks for the genePred.as file
based on this, and fills in descriptions in
"describe schema".
13
# Dependencies not already captured in
# identifiers. The joinerCheck program can
# quickly check times and dependencies, sort of
# like make.
dependency $mm.affyGnfU74ADistance $mm.knownToU74 hgFixed.gnfMouseU74aMedianRatio
dependency $mm.affyGnfU74BDistance $mm.knownToU74 hgFixed.gnfMouseU74bMedianRatio
dependency $mm.affyGnfU74CDistance $mm.knownToU74 hgFixed.gnfMouseU74cMedianRatio
dependency $hg.gnfU95Distance $hg.knownToU95 hgFixed.gnfHumanU95MedianRatio
dependency $ce.kimExpDistance hgFixed.kimWormLifeMedianRatio
dependency $dm.arbExpDistance $dm.bdgpToCanonical hgFixed.arbFlyLifeMedianRatio
dependency $sacCer.choExpDistance hgFixed.yeastChoCellCycle
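
The make-like idea behind a times check can be sketched in Python as below. This is an illustration, not what joinerCheck actually runs. It assumes the first table on a dependency line is the one built from the others, and the update_times dictionary and its timestamps are invented for the example.

```python
from datetime import datetime

def stale_dependencies(target, depends_on, update_times):
    """Return the tables in depends_on that were updated after target.

    Assumption for this sketch: a non-empty result means target was
    built before one of its inputs changed, so it may be out of date.
    """
    target_time = update_times[target]
    return [dep for dep in depends_on if update_times[dep] > target_time]

# Invented update times for one dependency line from this slide.
update_times = {
    "mm.affyGnfU74ADistance": datetime(2006, 3, 1),
    "mm.knownToU74": datetime(2006, 2, 15),
    "hgFixed.gnfMouseU74aMedianRatio": datetime(2006, 3, 10),
}

stale = stale_dependencies(
    "mm.affyGnfU74ADistance",
    ["mm.knownToU74", "hgFixed.gnfMouseU74aMedianRatio"],
    update_times,
)
print(stale)
```

A check like this only compares timestamps, so it can flag tables that were rebuilt out of order even when their contents are fine.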
14
# Ignored tables - no linkage here that we check
# at least.
tablesIgnored go
    instance_data
    source_audit

tablesIgnored $gbd
    ancientRepeat
    axtInfo
    chromInfo
    cpgIsland
    trackDb
    chr%_mrna
joinerCheck squawks about any table (or database)
that is not mentioned.
15
joinerCheck
  • Checks database vs. all.joiner in various ways.
  • Very handy for QA, but:
  • Full joinerCheck takes a long, long time to run
  • Output is verbose because it complains about
    missing stuff
  • The -times check is fast, but sometimes we make
    tables out of order without it being a true
    error.

16
joinerCheck - the tool you'll love to hate!
joinerCheck - Parse and check joiner file
usage:
   joinerCheck file.joiner
options:
   -identifier=name - Just validate given identifier.
   -database=name - Just validate given database.
   -fields - Check fields in joiner file exist, faster with -fieldListIn
   -fieldListOut=file - List all fields in all databases to file.
   -fieldListIn=file - Get list of fields from file rather than mysql.
   -keys - Validate (foreign) keys. Takes at least an hour.
   -tableCoverage - Check that all tables are mentioned in joiner file
   -dbCoverage - Check that all databases are mentioned in joiner file
   -times - Check update times of tables are after tables they depend on
   -all - Do all tests: -fields -keys -tableCoverage -dbCoverage -times
17
all.joiner in summary
  • Together with the .as files, it describes our
    large, messy, useful database.
  • Missing info in all.joiner results in missing
    functionality in table browser.
  • QA can automatically catch many problems with
    joinerCheck
  • Full path - src/hg/makeDb/schema/all.joiner
  • See also src/hg/makeDb/schema/joiner.doc