Bert Overduin - PowerPoint PPT Presentation

Description:

Used to retrieve data from and store data in the Ensembl ... Acknowledgements. The Ensembl Core Team. Glenn Proctor. Andreas Kahari. Daniel Rios. Ian Longden ...

1
Ensembl Developers Workshop: Core API
  • Bert Overduin
  • Edinburgh, 24 February 2009

2
Outline
  • The Ensembl Core databases and Perl API
  • Documentation & Help
  • Data Objects, Object Adaptors, Database Adaptors
    & The Registry
  • Coordinate Systems & Slices
  • Features
  • Genes, Transcripts, Exons & Translations
  • External References
  • Coordinate Mappings

3
The Ensembl Core databases
  • The Ensembl Core databases store:
    • genomic sequence
    • assembly information
    • gene, transcript and protein models
    • cDNA and protein alignments
    • cytogenetic bands, markers, repeats, CpG islands
      etc.
    • external references
  • homo_sapiens_core_52_36n
    • homo_sapiens: species
    • core: group
    • 52: software version
    • 36: assembly version
    • n: data version
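
The naming scheme can be illustrated with a few lines of Perl. This is only a sketch: the helper name parse_db_name and the regular expression are assumptions made for illustration, and the pattern presumes a two-word species name followed by group, software version, assembly version and an optional data version letter, as in the example name above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a database name of the assumed form
# <species>_<group>_<software version>_<assembly version><data version>
sub parse_db_name {
    my ($name) = @_;
    if ( $name =~ /^([a-z]+_[a-z]+)_([a-z]+)_(\d+)_(\d+)([a-z]*)$/ ) {
        return {
            species          => $1,
            group            => $2,
            software_version => $3,
            assembly_version => $4,
            data_version     => $5,
        };
    }
    return;    # the name does not follow the assumed pattern
}

my $parts = parse_db_name('homo_sapiens_core_52_36n');
printf "%s = %s\n", $_, $parts->{$_} for sort keys %{$parts};
```

Names that do not fit the assumed pattern (for example multi-species databases) simply return nothing here; the API itself does its own name handling when it loads the Registry.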

4
The Ensembl Core Perl API
  • Used to retrieve data from and store data in the
    Ensembl Core databases
  • Written in Object-Oriented Perl
  • Partly based on and compatible with BioPerl
    objects (http://www.bioperl.org)
  • Used by the Ensembl analysis and annotation
    pipeline and the Ensembl web code
  • Robust and well-supported
  • Forms the basis for the other Ensembl APIs

5
Documentation & Help
  • Installation instructions, web-browsable version
    of the POD (Perldoc) and tutorial
  • http://www.ensembl.org/info/docs/api/core/index.html
  • Inline Perl POD (Plain Old Documentation)
  • ensembl-dev mailing list
  • http://www.ensembl.org/info/about/contact/mailing.html
  • Ensembl helpdesk
  • helpdesk@ensembl.org

6
Data Objects
  • Data Objects model biological entities, e.g.
    Genes, Transcripts and Translations
  • Each Data Object encapsulates information from
    one or a few specific MySQL tables
  • Data Objects are retrieved from and stored in the
    database using Object Adaptors

7
Object Adaptors
  • Object Adaptors are Data Object factories
  • Each Object Adaptor is responsible for creating
    Data Objects of only one particular type

8
Database Adaptors
  • Database Adaptors are Object Adaptor factories
  • Database Adaptors are used to connect to a single
    database

9
The Registry
  • The Registry is a container for all Database
    Adaptors
  • The Registry handles all database connections
  • The Registry is an Object Adaptor factory
  • The Registry can be initialised via a
    configuration file or by automatically
    discovering databases on an RDBMS instance
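
The layering of Registry, Database Adaptors, Object Adaptors and Data Objects can be sketched with a toy model in plain Perl. All Toy:: packages below are hypothetical and exist only to show the factory pattern; they are not the Ensembl classes, and the in-memory hash stands in for a MySQL database.

```perl
use strict;
use warnings;

# Toy model of the factory layering. All package names are hypothetical;
# this is not the Ensembl implementation.

package Toy::Gene;    # Data Object
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
sub stable_id { return $_[0]{stable_id} }

package Toy::GeneAdaptor;    # Object Adaptor: creates only Toy::Gene objects
sub new { my ( $class, $db ) = @_; return bless { db => $db }, $class }

sub fetch_by_stable_id {
    my ( $self, $id ) = @_;

    # A real adaptor would query the database; the toy reads a hash.
    return unless exists $self->{db}{genes}{$id};
    return Toy::Gene->new( stable_id => $id );
}

package Toy::DBAdaptor;    # Database Adaptor: one database, many Object Adaptors
sub new { my ( $class, %tables ) = @_; return bless {%tables}, $class }

sub get_adaptor {
    my ( $self, $type ) = @_;
    my $adaptor_class = "Toy::${type}Adaptor";
    return $adaptor_class->new($self);
}

package Toy::Registry;     # Registry: container for all Database Adaptors
my %dbs;

sub add_db {
    my ( $class, $species, $group, $db ) = @_;
    $dbs{ lc "$species/$group" } = $db;
    return;
}

sub get_adaptor {
    my ( $class, $species, $group, $type ) = @_;
    my $db = $dbs{ lc "$species/$group" } or die "no such database\n";
    return $db->get_adaptor($type);
}

package main;

Toy::Registry->add_db( 'Human', 'Core',
    Toy::DBAdaptor->new( genes => { ENSG00000139618 => 1 } ) );

my $adaptor = Toy::Registry->get_adaptor( 'Human', 'Core', 'Gene' );
my $gene    = $adaptor->fetch_by_stable_id('ENSG00000139618');
print $gene->stable_id, "\n";
```

The point of the sketch is the call chain: user code asks the Registry for an adaptor by species, group and type, and never constructs Data Objects directly.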

10
System Architecture
[Diagram, top to bottom: Object Adaptors (GeneAdaptor, MarkerAdaptor, ObjectAdaptor, VariationAdaptor, GenotypeAdaptor); Database Adaptors (Core DBAdaptor, Variation DBAdaptor); Ensembl Registry]

11
Code Example
  • Obtain the Ensembl Gene IDs for all human genes

    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    # Connect to the public Ensembl MySQL server
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

    # fetch_all returns a reference to a list of Gene objects
    my $genes = $gene_adaptor->fetch_all;

    while ( my $gene = shift @{$genes} ) {
        print $gene->stable_id, "\n";
    }

12
Code Example
  • OUTPUT
  • ENSG00000208234
  • ENSG00000199674
  • ENSG00000221622
  • ENSG00000207604
  • ENSG00000207431
  • ENSG00000221312
  • ENSG00000223135
  • ENSG00000223136
  • ENSG00000200159
  • ENSG00000200131
  • ENSG00000206672
  • ENSG00000212552
  • ENSG00000201452
  • ENSG00000202016
  • ENSG00000200455
  • ENSG00000201916

  • ENSG00000212228
  • ENSG00000202261
  • ENSG00000207742
  • ENSG00000223137
  • ENSG00000212550
  • ENSG00000223138
  • ENSG00000200827
  • ENSG00000221638
  • ENSG00000201937
  • ENSG00000212205
  • ENSG00000221428
  • ENSG00000202470
  • ENSG00000200236
  • ENSG00000223139
  • ENSG00000207932
  • ENSG00000223140
  • ENSG00000221791
  • ENSG00000199102
  • ENSG00000199960
  • ENSG00000208013
  • ENSG00000223141
  • ENSG00000223142
  • ENSG00000221363
  • ENSG00000213177
  • ENSG00000216774
  • ENSG00000213194
  • ENSG00000207492
  • ENSG00000219252
  • ENSG00000222962
  • ENSG00000206963
  • ENSG00000207934
  • ENSG00000199814
  • ENSG00000199796
  • ENSG00000212450
  • ENSG00000207603
  • ENSG00000202474
  • ENSG00000206831
  • ENSG00000207331
  • ENSG00000221740
  • ENSG00000222963
  • ENSG00000201923
  • ENSG00000198928
  • ENSG00000208225
  • ENSG00000208227
  • ENSG00000208243
  • ENSG00000177117
  • ENSG00000208245
  • ENSG00000209219
  • ENSG00000213184
  • ENSG00000211401
  • ENSG00000208228
  • ENSG00000208229
  • ENSG00000208233
  • ENSG00000208235

13
Exercise 1
  • Load all databases and print their names.
  • What is the name of the human core database?
  • There are several solutions possible!
  • Use Perldoc!
  • (http://www.ensembl.org/info/docs/api/core/index.html)

14
Coordinate Systems
  • Sequences stored in Ensembl are associated with
    Coordinate Systems
  • Coordinate Systems vary from species to species
  • human: chromosome, supercontig, clone, contig
  • zebrafish: chromosome, scaffold, contig
  • Sequence information is directly stored in the
    database for the sequence level Coordinate
    System
  • The Coordinate System of the highest level in a
    given region is the top level Coordinate System
  • Features are stored in a single Coordinate System

15
Coordinate Systems
[Figure: coordinate system hierarchy, labelled Top level, Chromosome, Contigs, Clones (Tiling path) and Sequence level]

16
Slices
  • A Slice Data Object represents an arbitrary
    region of a genome
  • Slices are not directly stored in the database
  • Slices are used to obtain sequences or features
    from a specific region in a specific coordinate
    system

17
Code Example
  • Obtain a slice covering the entire human Y
    chromosome

    my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

    # Fetch the whole chromosome; start and end are optional arguments
    my $slice = $slice_adaptor->fetch_by_region( 'chromosome', 'Y' );

    printf( "Slice %s %s %s-%s (%s)\n",
        $slice->coord_system_name,
        $slice->seq_region_name,
        $slice->start,
        $slice->end,
        $slice->strand );

  • OUTPUT
  • Slice chromosome Y 1-57772954 (1)

18
Exercise 2
  • Obtain the names of the coordinate systems for
    rat.
  • Obtain a slice covering the first 10 Mb of
    chromosome 20 of human and print its sequence.
  • Obtain a slice covering the human gene with
    Ensembl Gene ID ENSG00000101266 with 2 kb of
    flanking sequence and print its sequence.
  • Print the name, start, end and strand of the
    obtained slices as well as their coordinate
    system.
  • If you want to output your sequences to a file,
    have a look at Bio::SeqIO at
    http://doc.bioperl.org/releases/bioperl-1.2.3/

19
Features
  • Features are Data Objects with a defined location
    on the genome
  • All Features have a start, end, strand and slice
  • The start coordinate of a Feature is always less
    than its end coordinate, irrespective of the
    strand on which it is located (exception:
    insertion features)
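
A small plain-Perl sketch may help to show what the start/end convention means in practice. The features here are hypothetical hashes with made-up coordinates, not Ensembl objects, and the helper names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical plain-hash features; real Ensembl features are objects.
my @features = (
    { name => 'forward_feature', start => 100, end => 250, strand => 1 },
    { name => 'reverse_feature', start => 400, end => 520, strand => -1 },
);

# Length never depends on strand, because start <= end always holds.
sub feature_length {
    my ($feature) = @_;
    return $feature->{end} - $feature->{start} + 1;
}

# The 5' end is the start on the forward strand and the end on the
# reverse strand.
sub five_prime_pos {
    my ($feature) = @_;
    return $feature->{strand} == 1 ? $feature->{start} : $feature->{end};
}

for my $feature (@features) {
    printf "%s length=%d five_prime=%d\n",
        $feature->{name}, feature_length($feature), five_prime_pos($feature);
}
```

Because start and end are stored in genomic order, strand only matters when a biological direction is needed, such as finding the 5' end.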

20
Features
  • Some examples of Features
  • Gene, Transcript and Exon
  • ProteinFeature
  • PredictionTranscript and PredictionExon
  • DNAAlignFeature and ProteinAlignFeature
  • RepeatFeature
  • MarkerFeature
  • OligoFeature
  • SimpleFeature
  • MiscFeature

21
Code Example
  • Obtain all markers on human chromosome 1

    my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

    my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '1' );

    my $markers = $slice->get_all_MarkerFeatures;

    # Print the name of the slice the marker is on and of the marker's own slice
    while ( my $marker = shift @{$markers} ) {
        printf( "%s\t%s\n",
            $marker->slice->name, $marker->feature_Slice->name );
    }

  • OUTPUT
  • chromosome:NCBI36:1:1:247249719:1
    chromosome:NCBI36:1:1237:1488:1
  • chromosome:NCBI36:1:1:247249719:1
    chromosome:NCBI36:1:2585:2812:1
  • chromosome:NCBI36:1:1:247249719:1
    chromosome:NCBI36:1:4284:5085:1

22
Exercise 3
  • Obtain all the CpG islands on the first 5 Mb of
    dog chromosome 20. Print the total number of CpG
    islands and the position and sequence of each CpG
    island.
  • Obtain all the protein alignment features on the
    first 5 Mb of dog chromosome 20. Print for each
    alignment the name of the aligned protein, the
    start and end coordinates of the matching region
    on the protein and on the genome and the name of
    the analysis resulting in the alignment.
  • Hint: CpG islands are stored as SimpleFeatures
    with logic_name cpg.

23
Genes, Transcripts & Exons
  • Genes, Transcripts and Exons are Feature Data
    Objects
  • A Gene is a grouping of Transcripts which share
    any (partially) overlapping Exons
  • A Transcript is a set of Exons
  • Introns are not explicitly defined in the database
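
Since introns are not stored, they have to be derived from the exons of a transcript. The following plain-Perl sketch shows the idea on made-up exon coordinates; introns_from_exons is a hypothetical helper and does not use the Ensembl API.

```perl
use strict;
use warnings;

# Derive intron coordinates from the gaps between consecutive exons.
# Each exon is a [start, end] pair in genomic coordinates (start <= end).
sub introns_from_exons {
    my (@exons) = @_;
    my @sorted = sort { $a->[0] <=> $b->[0] } @exons;
    my @introns;
    for my $i ( 0 .. $#sorted - 1 ) {
        my $intron_start = $sorted[$i][1] + 1;
        my $intron_end   = $sorted[ $i + 1 ][0] - 1;

        # Skip adjacent or overlapping exons, which leave no gap.
        push @introns, [ $intron_start, $intron_end ]
            if $intron_start <= $intron_end;
    }
    return @introns;
}

# Hypothetical three-exon transcript
my @exons = ( [ 100, 200 ], [ 301, 400 ], [ 550, 600 ] );
printf "intron %d-%d\n", @{$_} for introns_from_exons(@exons);
```

Sorting by start position makes the helper independent of strand, which fits the rule that a feature's start is always less than its end.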

24
Translations
  • Translations are not Feature Data Objects
  • Translations define the Untranslated Region (UTR)
    and Coding Sequence (CDS) composition of
    Transcripts
  • Protein sequences are not stored in the database,
    but computed on the fly using Transcript(!)
    objects
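
To make the idea of computing protein sequences on the fly concrete, here is a plain-Perl sketch. It assumes a hypothetical spliced transcript sequence and hypothetical CDS boundaries, and it uses the standard genetic code; translate_cds is an illustrative helper, not an Ensembl method.

```perl
use strict;
use warnings;

# Standard genetic code, bases ordered T, C, A, G at each codon position.
my @bases       = qw(T C A G);
my $amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

my %codon_table;
my $index = 0;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_table{"$first$second$third"} = substr( $amino_acids, $index++, 1 );
        }
    }
}

# Translate a coding sequence codon by codon; unknown codons become 'X'
# and a single trailing stop codon is dropped.
sub translate_cds {
    my ($cds) = @_;
    my $protein = '';
    for ( my $pos = 0 ; $pos + 3 <= length($cds) ; $pos += 3 ) {
        my $codon = uc substr( $cds, $pos, 3 );
        $protein .= exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
    }
    $protein =~ s/\*$//;
    return $protein;
}

# Hypothetical spliced transcript: 3 bases of 5' UTR, a 12-base CDS,
# 3 bases of 3' UTR.
my $cdna = 'GGC' . 'ATGGCCAAATAA' . 'TTT';
my ( $cds_start, $cds_end ) = ( 4, 15 );    # 1-based positions in the cDNA

my $cds = substr( $cdna, $cds_start - 1, $cds_end - $cds_start + 1 );
print "CDS:     $cds\n";
print "Protein: ", translate_cds($cds), "\n";    # prints "Protein: MAK"
```

The CDS boundaries play the role of the Translation here: they say which part of the transcript is coding, and everything outside them is UTR.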

25
Exercise 4
  • Obtain the gene with Ensembl Gene ID
    ENSG00000101266 and its transcripts. Print the
    total number of exons in the gene and the number
    of exons in each individual transcript. Why do
    the found numbers disagree with each other?
  • Print for each transcript of the above gene the
    coding sequence and the protein sequence.

26
External References
  • External References (Xrefs) are cross references
    of Ensembl Genes, Transcripts or Translations
    with identifiers from other databases, e.g. HGNC,
    WikiGenes, UniProtKB/Swiss-Prot, RefSeq, MIM etc.

27
Code Example
  • Obtain external references for Ensembl gene
    ENSG00000139618

    my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000139618');

    # Xrefs attached directly to the gene
    my $gene_xrefs = $gene->get_all_DBEntries;

    print "Xrefs on the gene\n\n";

    while ( my $gene_xref = shift @{$gene_xrefs} ) {
        printf( "%s %s\n",
            $gene_xref->dbname, $gene_xref->display_id );
    }

    # Xrefs on the gene plus its transcripts and translations
    my $all_xrefs = $gene->get_all_DBLinks;

    print "\nXrefs on the gene, transcript and protein\n\n";

    while ( my $all_xref = shift @{$all_xrefs} ) {
        printf( "%s %s\n",
            $all_xref->dbname, $all_xref->display_id );
    }

28
Code Example
  • Output
  • Xrefs on the gene
  • HGNC BRCA2
  • DBASS3 BRCA2
  • UCSC uc001uub.1
  • HGNC_curated_gene BRCA2
  • Xrefs on the gene, transcript and protein
  • shares_CDS_with_OTTT OTTHUMT00000046000
  • AFFY_HC_G110 1503_at
  • AFFY_HG_U95A 1503_at
  • AFFY_HG_U95Av2 1503_at
  • AFFY_HC_G110 1990_g_at
  • AFFY_HG_U95A 1990_g_at
  • AFFY_HG_U95Av2 1990_g_at
  • AFFY_HuGeneFL X95152_rna1_at

  • AFFY_HC_G110 1989_at
  • AFFY_HG_U95A 1989_at
  • AFFY_HG_U95Av2 1989_at
  • RefSeq_dna NM_000059
  • HGNC BRCA2
  • UniGene Hs.34012
  • AgilentCGH A_14_P131744
  • AgilentCGH A_14_P109686
  • AgilentProbe A_23_P99452
  • Codelink GE60169
  • Illumina_V1 GI_4502450-S
  • Illumina_V2 ILMN_139227
  • HGNC_curated_transcript BRCA2-001
  • CCDS CCDS9344.1
  • EntrezGene BRCA2
  • MIM_MORBID 114480
  • MIM_MORBID 155720
  • MIM_MORBID 227650
  • MIM_GENE 600185
  • MIM_MORBID 600185
  • MIM_MORBID 605724
  • RefSeq_peptide NP_000050.2
  • Uniprot/SPTREMBL A1YBP1_HUMAN
  • EMBL DQ897648
  • protein_id ABI74674.1
  • Uniprot/SPTREMBL B2ZAH0_HUMAN
  • EMBL EU625579
  • protein_id ACD01217.1
  • EMBL AL445212
  • Uniprot/SPTREMBL Q5TBJ7_HUMAN
  • EMBL AL137247
  • protein_id CAI13195.1
  • protein_id CAI40479.1
  • Uniprot/SPTREMBL Q8IU64_HUMAN
  • EMBL AY151039
  • protein_id AAN28944.1
  • EMBL AF489725
  • protein_id AAN61409.1
  • EMBL AF489726
  • protein_id AAN61410.1
  • EMBL AF489727
  • protein_id AAN61411.1
  • EMBL AF489728
  • protein_id AAN61412.1
  • EMBL AF489729
  • protein_id AAN61413.1
  • EMBL AF489730
  • protein_id AAN61414.1
  • EMBL AF489731
  • protein_id AAN61415.1
  • EMBL AF489732
  • protein_id AAN61416.1
  • EMBL AF489733
  • protein_id AAN61417.1
  • EMBL AF489734
  • protein_id AAN61418.1
  • EMBL AF489735
  • protein_id AAN61419.1
  • EMBL AF489736
  • protein_id AAN61420.1

29
Exercise 5
  • Obtain the Ensembl gene(s) that correspond(s) to
    UniProtKB/Swiss-Prot entry BRCA2_HUMAN. Print its
    Ensembl Gene ID, name and description.
  • Obtain all external references for the above
    gene. Print their names and databases.

30
Coordinate Mappings
  • The API provides the means to convert between any
    related coordinate systems in the database
  • The Feature methods transfer, transform and
    project and the Slice method project are used to
    map features between coordinate systems

31
Transfer
  • Transfer moves a feature on a slice in a given
    coordinate system to another slice in the same or
    another coordinate system
  • Transfer needs the feature to be defined in the
    requested coordinate system, i.e. it cannot
    overlap an undefined region

32
Transfer
[Diagram: transfer of a feature between slices, with panels labelled Chr 1, Chr 1, Chr 1 and Chr Y]

33
Transform
  • Like transfer, but transform places the feature
    on a slice that spans the entire sequence that
    the feature is on in the requested coordinate
    system

34
Transform
35
Project
  • Project doesn't move a feature, but it provides a
    definition of where a feature or slice lies in
    another coordinate system
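
The idea behind projection can be sketched in plain Perl with a toy assembly table. Everything here is hypothetical: the clone names, the coordinates and the helper project_range are made up, and only forward-orientation pieces are handled. The real API returns projection segment objects rather than hashes.

```perl
use strict;
use warnings;

# Hypothetical assembly table: each piece maps a chromosome range onto a
# clone range of the same length (forward orientation only, for brevity).
my @assembly = (
    { chr_start => 1,    chr_end => 1000, clone => 'cloneA', clone_start => 1 },
    { chr_start => 1001, chr_end => 1800, clone => 'cloneB', clone_start => 201 },
);

# Return one segment per assembly piece that the range overlaps.
# from_start/from_end are relative to the start of the projected range.
sub project_range {
    my ( $start, $end ) = @_;
    my @segments;
    for my $piece (@assembly) {
        my $overlap_start = $start > $piece->{chr_start} ? $start : $piece->{chr_start};
        my $overlap_end   = $end < $piece->{chr_end}     ? $end   : $piece->{chr_end};
        next if $overlap_start > $overlap_end;    # no overlap with this piece

        my $offset = $piece->{clone_start} - $piece->{chr_start};
        push @segments, {
            from_start => $overlap_start - $start + 1,
            from_end   => $overlap_end - $start + 1,
            clone      => $piece->{clone},
            to_start   => $overlap_start + $offset,
            to_end     => $overlap_end + $offset,
        };
    }
    return @segments;
}

for my $segment ( project_range( 901, 1100 ) ) {
    printf "%d-%d projects to clone %s:%d-%d\n",
        @{$segment}{qw(from_start from_end clone to_start to_end)};
}
```

A range that crosses a clone boundary comes back as several segments, which is why a projection is a list rather than a single moved feature.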

36
Project
37
Code Example
  • Project gene ENSG00000155657 to the clone
    coordinate system

    my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000155657');

    # project returns a reference to a list of projection segments
    my $projection = $gene->project('clone');

    foreach my $segment ( @{$projection} ) {
        my $to_slice = $segment->to_Slice;
        printf( "%s %s-%s projects to %s %s:%s-%s(%s)\n",
            $gene->stable_id,
            $segment->from_start,
            $segment->from_end,
            $to_slice->coord_system_name,
            $to_slice->seq_region_name,
            $to_slice->start,
            $to_slice->end,
            $to_slice->strand );
    }

38
Code Example
  • Output
  • ENSG00000155657 1-65908 projects to clone
    AC023270.7:1-65908(-1)
  • ENSG00000155657 65909-241384 projects to clone
    AC010680.10:1-175476(-1)
  • ENSG00000155657 241385-281434 projects to clone
    AC009948.3:132579-172628(-1)

39
Exercise 6
  • Obtain a gene located on clone AL049761.11 and
    print out its coordinate system and gene
    coordinates. Then transform the gene to
    toplevel and again print out the coordinate
    system and gene coordinates.

40
Other Ensembl Core APIs
  • Ruby (by Jan Aerts)
  • http://bioruby-annex.rubyforge.org/
  • Python (by Jenny Qing Qian)
  • http://code.google.com/p/pygr/wiki/PygrOnEnsembl

41
Acknowledgements
The Ensembl Core Team