Title: Bert Overduin
1. Ensembl Developers Workshop: Core API
- Bert Overduin
- Edinburgh, 24 February 2009
2. Outline
- The Ensembl Core databases and Perl API
- Documentation and Help
- Data Objects, Object Adaptors, Database Adaptors and The Registry
- Coordinate Systems and Slices
- Features
- Genes, Transcripts, Exons and Translations
- External References
- Coordinate Mappings
3. The Ensembl Core databases
- The Ensembl Core databases store:
  - genomic sequence
  - assembly information
  - gene, transcript and protein models
  - cDNA and protein alignments
  - cytogenetic bands, markers, repeats, CpG islands etc.
  - external references
- Database naming, e.g. homo_sapiens_core_52_36n:
  - species: homo_sapiens
  - group: core
  - software version: 52
  - assembly version: 36
  - data version: n
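The naming scheme can be unpacked with a regular expression. The following is a minimal sketch that assumes every name follows the pattern of the single example above (species, group, software version, assembly version, one-letter data version); the helper name `parse_core_db_name` is hypothetical and not part of the Ensembl API.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): split a database name
# such as "homo_sapiens_core_52_36n" into its components.
sub parse_core_db_name {
    my ($name) = @_;

    # Assumed pattern: species (lowercase words joined by "_"), group,
    # software version, assembly version (digits), data version (one letter).
    if ( $name =~ /^([a-z]+(?:_[a-z]+)+)_([a-z]+)_(\d+)_(\d+)([a-z])$/ ) {
        return {
            species          => $1,
            group            => $2,
            software_version => $3,
            assembly_version => $4,
            data_version     => $5,
        };
    }
    return;    # name does not follow the assumed pattern
}

my $parts = parse_core_db_name('homo_sapiens_core_52_36n');
printf "%s / %s / software %s / assembly %s / data %s\n",
    @{$parts}{qw(species group software_version assembly_version data_version)};
```

Names that do not match the assumed pattern yield an empty return value, so callers should check the result before using it.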
4. The Ensembl Core Perl API
- Used to retrieve data from and store data in the Ensembl Core databases
- Written in object-oriented Perl
- Partly based on and compatible with BioPerl objects (http://www.bioperl.org)
- Used by the Ensembl analysis and annotation pipeline and the Ensembl web code
- Robust and well-supported
- Forms the basis for the other Ensembl APIs
5. Documentation and Help
- Installation instructions, web-browsable version of the POD (Perldoc) and tutorial
  - http://www.ensembl.org/info/docs/api/core/index.html
- Inline Perl POD (Plain Old Documentation)
- ensembl-dev mailing list
  - http://www.ensembl.org/info/about/contact/mailing.html
- Ensembl helpdesk
  - helpdesk@ensembl.org
6. Data Objects
- Data Objects model biological entities, e.g. Genes, Transcripts and Translations
- Each Data Object encapsulates information from one or a few specific MySQL tables
- Data Objects are retrieved from and stored in the database using Object Adaptors
7. Object Adaptors
- Object Adaptors are Data Object factories
- Each Object Adaptor is responsible for creating Data Objects of only one particular type
8. Database Adaptors
- Database Adaptors are Object Adaptor factories
- Database Adaptors are used to connect to a single database
9. The Registry
- The Registry is a container for all Database Adaptors
- The Registry handles all database connections
- The Registry is an Object Adaptor factory
- The Registry can be initialised via a configuration file or by automatically discovering databases on an RDBMS instance
10. System Architecture
[Diagram: the Ensembl Registry holds the Core DBAdaptor and the Variation DBAdaptor; each DBAdaptor provides ObjectAdaptors, e.g. GeneAdaptor and MarkerAdaptor (Core), VariationAdaptor and GenotypeAdaptor (Variation).]
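The chain of factories described on the preceding slides (Registry, Database Adaptor, Object Adaptor, Data Object) can be imitated in a few lines of plain Perl. This is a toy model for illustration only: the `Toy::` packages and the in-memory gene list are assumptions made up for this sketch and are not the real Bio::EnsEMBL classes, which talk to MySQL.

```perl
use strict;
use warnings;

# Toy model only: these packages imitate the roles described above and are
# NOT the real Bio::EnsEMBL classes.

package Toy::Gene;    # a Data Object
sub new       { my ( $class, %args ) = @_; return bless {%args}, $class }
sub stable_id { return $_[0]{stable_id} }

package Toy::GeneAdaptor;    # an Object Adaptor: creates Gene objects only
sub new { my ( $class, $db ) = @_; return bless { db => $db }, $class }

sub fetch_all {
    my ($self) = @_;
    # A real adaptor would run SQL here; the toy reads an in-memory list.
    return [ map { Toy::Gene->new( stable_id => $_ ) } @{ $self->{db}{genes} } ];
}

package Toy::DBAdaptor;    # a Database Adaptor: one database, many adaptors
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }

sub get_adaptor {
    my ( $self, $type ) = @_;
    my $adaptor_class = "Toy::${type}Adaptor";
    return $adaptor_class->new($self);
}

package Toy::Registry;    # the Registry: holds all Database Adaptors
my %db_adaptors;

sub add_db {
    my ( $class, $species, $group, $dba ) = @_;
    $db_adaptors{ lc $species }{ lc $group } = $dba;
    return;
}

sub get_adaptor {
    my ( $class, $species, $group, $type ) = @_;
    my $dba = $db_adaptors{ lc $species }{ lc $group } or return;
    return $dba->get_adaptor($type);
}

package main;

Toy::Registry->add_db( 'Human', 'Core',
    Toy::DBAdaptor->new( genes => [ 'ENSG00000139618', 'ENSG00000155657' ] ) );

my $gene_adaptor = Toy::Registry->get_adaptor( 'Human', 'Core', 'Gene' );
print $_->stable_id, "\n" for @{ $gene_adaptor->fetch_all };
```

The point of the layering is that user code only ever asks the Registry for an adaptor by species, group and type, and never handles connection details itself.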
11. Code Example
- Obtain the Ensembl Gene IDs for all human genes

    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

    my $genes = $gene_adaptor->fetch_all;

    while ( my $gene = shift @{$genes} ) {
        print $gene->stable_id, "\n";
    }
12. Code Example
- OUTPUT
- ENSG00000208234
- ENSG00000199674
- ENSG00000221622
- ENSG00000207604
- ENSG00000207431
- ENSG00000221312
- ENSG00000223135
- ENSG00000223136
- ENSG00000200159
- ENSG00000200131
- ENSG00000206672
- ENSG00000212552
- ENSG00000201452
- ENSG00000202016
- ENSG00000200455
- ENSG00000201916
- ENSG00000212228
- ENSG00000202261
- ENSG00000207742
- ENSG00000223137
- ENSG00000212550
- ENSG00000223138
- ENSG00000200827
- ENSG00000221638
- ENSG00000201937
- ENSG00000212205
- ENSG00000221428
- ENSG00000202470
- ENSG00000200236
- ENSG00000223139
- ENSG00000207932
- ENSG00000223140
- ENSG00000221791
- ENSG00000199102
- ENSG00000199960
- ENSG00000208013
- ENSG00000223141
- ENSG00000223142
- ENSG00000221363
- ENSG00000213177
- ENSG00000216774
- ENSG00000213194
- ENSG00000207492
- ENSG00000219252
- ENSG00000222962
- ENSG00000206963
- ENSG00000207934
- ENSG00000199814
- ENSG00000199796
- ENSG00000212450
- ENSG00000207603
- ENSG00000202474
- ENSG00000206831
- ENSG00000207331
- ENSG00000221740
- ENSG00000222963
- ENSG00000201923
- ENSG00000198928
- ENSG00000208225
- ENSG00000208227
- ENSG00000208243
- ENSG00000177117
- ENSG00000208245
- ENSG00000209219
- ENSG00000213184
- ENSG00000211401
- ENSG00000208228
- ENSG00000208229
- ENSG00000208233
- ENSG00000208235
13. Exercise 1
- Load all databases and print their names.
- What is the name of the human core database?
- There are several solutions possible!
- Use Perldoc! (http://www.ensembl.org/info/docs/api/core/index.html)
14. Coordinate Systems
- Sequences stored in Ensembl are associated with Coordinate Systems
- Coordinate Systems vary from species to species
  - human: chromosome, supercontig, clone, contig
  - zebrafish: chromosome, scaffold, contig
- Sequence information is directly stored in the database for the sequence level Coordinate System
- The Coordinate System of the highest level in a given region is the top level Coordinate System
- Features are stored in a single Coordinate System
15. Coordinate Systems
[Diagram: coordinate systems from top level to sequence level: Chromosome, Clones (tiling path), Contigs.]
16. Slices
- A Slice Data Object represents an arbitrary region of a genome
- Slices are not directly stored in the database
- Slices are used to obtain sequences or features from a specific region in a specific coordinate system
17. Code Example
- Obtain a slice covering the entire human Y chromosome

    my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

    my $slice = $slice_adaptor->fetch_by_region( 'chromosome', 'Y' );

    printf( "Slice %s %s %s-%s (%s)\n",
        $slice->coord_system_name,
        $slice->seq_region_name,
        $slice->start,
        $slice->end,
        $slice->strand );

- OUTPUT
- Slice chromosome Y 1-57772954 (1)
18. Exercise 2
- Obtain the names of the coordinate systems for rat.
- Obtain a slice covering the first 10 Mb of chromosome 20 of human and print its sequence.
- Obtain a slice covering the human gene with Ensembl Gene ID ENSG00000101266 with 2 kb of flanking sequence and print its sequence.
- Print the name, start, end and strand of the obtained slices as well as their coordinate system.
- If you want to output your sequences to a file, have a look at Bio::SeqIO at http://doc.bioperl.org/releases/bioperl-1.2.3/
19. Features
- Features are Data Objects with a defined location on the genome
- All Features have a start, end, strand and slice
- The start coordinate of a Feature is always less than or equal to its end coordinate, irrespective of the strand on which it is located (exception: insertion features)
20. Features
- Some examples of Features
- Gene, Transcript and Exon
- ProteinFeature
- PredictionTranscript and PredictionExon
- DNAAlignFeature and ProteinAlignFeature
- RepeatFeature
- MarkerFeature
- OligoFeature
- SimpleFeature
- MiscFeature
21. Code Example
- Obtain all markers on human chromosome 1

    my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

    my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '1' );

    my $markers = $slice->get_all_MarkerFeatures;

    while ( my $marker = shift @{$markers} ) {
        printf( "%s\t%s\n",
            $marker->slice->name, $marker->feature_Slice->name );
    }

- OUTPUT
- chromosome:NCBI36:1:1:247249719:1  chromosome:NCBI36:1:1237:1488:1
- chromosome:NCBI36:1:1:247249719:1  chromosome:NCBI36:1:2585:2812:1
- chromosome:NCBI36:1:1:247249719:1  chromosome:NCBI36:1:4284:5085:1
22. Exercise 3
- Obtain all the CpG islands on the first 5 Mb of dog chromosome 20. Print the total number of CpG islands and the position and sequence of each CpG island.
- Obtain all the protein alignment features on the first 5 Mb of dog chromosome 20. Print for each alignment the name of the aligned protein, the start and end coordinates of the matching region on the protein and on the genome, and the name of the analysis resulting in the alignment.
- Hint: CpG islands are stored as SimpleFeatures with logic_name cpg.
23. Genes, Transcripts and Exons
- Genes, Transcripts and Exons are Feature Data Objects
- A Gene is a grouping of Transcripts which share any (partially) overlapping Exons
- A Transcript is a set of Exons
- Introns are not explicitly defined in the database
24. Translations
- Translations are not Feature Data Objects
- Translations define the Untranslated Region (UTR) and Coding Sequence (CDS) composition of Transcripts
- Protein sequences are not stored in the database, but computed on the fly using Transcript(!) objects
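Because the protein sequence is derived from the Transcript rather than the Translation, the calls go through the transcript object. A minimal sketch, assuming a transcript object that behaves like Bio::EnsEMBL::Transcript (its `translate`, `translateable_seq` and `stable_id` methods); the helper name `coding_summary` and the placeholder `$some_transcript_id` are hypothetical.

```perl
use strict;
use warnings;

# Sketch: summarise the coding and protein sequence of a transcript.
# $transcript is expected to behave like a Bio::EnsEMBL::Transcript.
sub coding_summary {
    my ($transcript) = @_;

    # translate() returns undef for transcripts without a Translation
    # (e.g. non-coding RNAs); otherwise a sequence object with a seq() method.
    my $protein = $transcript->translate;
    return unless defined $protein;

    return {
        stable_id => $transcript->stable_id,
        cds       => $transcript->translateable_seq,    # spliced CDS, no UTRs
        protein   => $protein->seq,
    };
}

# Possible use with a loaded registry (needs the Ensembl API and a database
# connection, so it is left commented out here):
#
#   my $transcript_adaptor =
#       $registry->get_adaptor( 'Human', 'Core', 'Transcript' );
#   my $transcript = $transcript_adaptor->fetch_by_stable_id($some_transcript_id);
#   my $summary    = coding_summary($transcript);
#   print $summary->{protein}, "\n" if $summary;
```

Returning nothing for non-coding transcripts keeps the caller from calling `seq` on an undefined value.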
25. Exercise 4
- Obtain the gene with Ensembl Gene ID ENSG00000101266 and its transcripts. Print the total number of exons in the gene and the number of exons in each individual transcript. Why do the found numbers disagree with each other?
- Print for each transcript of the above gene the coding sequence and the protein sequence.
26. External References
- External References (Xrefs) are cross references of Ensembl Genes, Transcripts or Translations with identifiers from other databases, e.g. HGNC, WikiGenes, UniProtKB/Swiss-Prot, RefSeq, MIM etc.
27. Code Example
- Obtain external references for Ensembl gene ENSG00000139618

    my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000139618' );

    my $gene_xrefs = $gene->get_all_DBEntries;

    print "Xrefs on the gene\n\n";

    while ( my $gene_xref = shift @{$gene_xrefs} ) {
        printf( "%s %s\n",
            $gene_xref->dbname, $gene_xref->display_id );
    }

    my $all_xrefs = $gene->get_all_DBLinks;

    print "\nXrefs on the gene, transcript and protein\n\n";

    while ( my $all_xref = shift @{$all_xrefs} ) {
        printf( "%s %s\n",
            $all_xref->dbname, $all_xref->display_id );
    }
28. Code Example
- Output
- Xrefs on the gene
- HGNC BRCA2
- DBASS3 BRCA2
- UCSC uc001uub.1
- HGNC_curated_gene BRCA2
- Xrefs on the gene, transcript and protein
- shares_CDS_with_OTTT OTTHUMT00000046000
- AFFY_HC_G110 1503_at
- AFFY_HG_U95A 1503_at
- AFFY_HG_U95Av2 1503_at
- AFFY_HC_G110 1990_g_at
- AFFY_HG_U95A 1990_g_at
- AFFY_HG_U95Av2 1990_g_at
- AFFY_HuGeneFL X95152_rna1_at
- AFFY_HC_G110 1989_at
- AFFY_HG_U95A 1989_at
- AFFY_HG_U95Av2 1989_at
- RefSeq_dna NM_000059
- HGNC BRCA2
- UniGene Hs.34012
- AgilentCGH A_14_P131744
- AgilentCGH A_14_P109686
- AgilentProbe A_23_P99452
- Codelink GE60169
- Illumina_V1 GI_4502450-S
- Illumina_V2 ILMN_139227
- HGNC_curated_transcript BRCA2-001
- CCDS CCDS9344.1
- EntrezGene BRCA2
- MIM_MORBID 114480
- MIM_MORBID 155720
- MIM_MORBID 227650
- MIM_GENE 600185
- MIM_MORBID 600185
- MIM_MORBID 605724
- RefSeq_peptide NP_000050.2
- Uniprot/SPTREMBL A1YBP1_HUMAN
- EMBL DQ897648
- protein_id ABI74674.1
- Uniprot/SPTREMBL B2ZAH0_HUMAN
- EMBL EU625579
- protein_id ACD01217.1
- EMBL AL445212
- Uniprot/SPTREMBL Q5TBJ7_HUMAN
- EMBL AL137247
- protein_id CAI13195.1
- protein_id CAI40479.1
- Uniprot/SPTREMBL Q8IU64_HUMAN
- EMBL AY151039
- protein_id AAN28944.1
- EMBL AF489725
- protein_id AAN61409.1
- EMBL AF489726
- protein_id AAN61410.1
- EMBL AF489727
- protein_id AAN61411.1
- EMBL AF489728
- protein_id AAN61412.1
- EMBL AF489729
- protein_id AAN61413.1
- EMBL AF489730
- protein_id AAN61414.1
- EMBL AF489731
- protein_id AAN61415.1
- EMBL AF489732
- protein_id AAN61416.1
- EMBL AF489733
- protein_id AAN61417.1
- EMBL AF489734
- protein_id AAN61418.1
- EMBL AF489735
- protein_id AAN61419.1
- EMBL AF489736
- protein_id AAN61420.1
29. Exercise 5
- Obtain the Ensembl gene(s) that correspond(s) to UniProtKB/Swiss-Prot entry BRCA2_HUMAN. Print its Ensembl Gene ID, name and description.
- Obtain all external references for the above gene. Print their names and databases.
30. Coordinate Mappings
- The API provides the means to convert between any related coordinate systems in the database
- The Feature methods transfer, transform and project and the Slice method project are used to map features between coordinate systems
31. Transfer
- Transfer moves a feature on a slice in a given coordinate system to another slice in the same or another coordinate system
- Transfer needs the feature to be defined in the requested coordinate system, i.e. it cannot overlap an undefined region
32. Transfer
[Diagram: transfer of a feature between slices; labels: Chr 1, Chr 1, Chr 1, Chr Y.]
33. Transform
- Like transfer, but transform places the feature on a slice that spans the entire sequence that the feature is on in the requested coordinate system
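In code, transform is called on the feature with the name of the target coordinate system. A minimal sketch, assuming a feature object that behaves like Bio::EnsEMBL::Feature (a Gene or Exon, for instance); the helper name `location_in` is hypothetical.

```perl
use strict;
use warnings;

# Sketch: express a feature's location in another coordinate system.
# $feature is expected to behave like a Bio::EnsEMBL::Feature (Gene, Exon, ...).
sub location_in {
    my ( $feature, $cs_name ) = @_;

    # transform() returns undef when the feature cannot be placed on a single
    # sequence region of the requested coordinate system.
    my $moved = $feature->transform($cs_name);
    return unless defined $moved;

    # The new slice spans the whole sequence region, so start and end are
    # coordinates on that region.
    my $slice = $moved->slice;
    return sprintf( "%s %s:%s-%s(%s)",
        $slice->coord_system_name, $slice->seq_region_name,
        $moved->start, $moved->end, $moved->strand );
}
```

A caller might write `my $text = location_in( $gene, 'clone' );` and treat an undefined result as "spans more than one clone", in which case project is the method to use instead.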
34. Transform
35. Project
- Project doesn't move a feature, but it provides a definition of where a feature or slice lies in another coordinate system
36. Project
37. Code Example
- Project gene ENSG00000155657 to the clone coordinate system

    my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000155657' );

    my $projection = $gene->project( 'clone' );

    foreach my $segment ( @{$projection} ) {
        my $to_slice = $segment->to_Slice;

        printf( "%s %s-%s projects to %s %s:%s-%s(%s)\n",
            $gene->stable_id,
            $segment->from_start,
            $segment->from_end,
            $to_slice->coord_system_name,
            $to_slice->seq_region_name,
            $to_slice->start,
            $to_slice->end,
            $to_slice->strand );
    }
38. Code Example
- Output
- ENSG00000155657 1-65908 projects to clone AC023270.7:1-65908(-1)
- ENSG00000155657 65909-241384 projects to clone AC010680.10:1-175476(-1)
- ENSG00000155657 241385-281434 projects to clone AC009948.3:132579-172628(-1)
39. Exercise 6
- Obtain a gene located on clone AL049761.11 and print out its coordinate system and gene coordinates. Then transform the gene to toplevel and again print out the coordinate system and gene coordinates.
40. Other Ensembl Core APIs
- Ruby (by Jan Aerts)
  - http://bioruby-annex.rubyforge.org/
- Python (by Jenny Qing Qian)
  - http://code.google.com/p/pygr/wiki/PygrOnEnsembl
41. Acknowledgements
The Ensembl Core Team