Title: Data Mining in Ensembl with BioMart
1. Data Mining in Ensembl with BioMart
www.ensembl.org/biomart/martview
www.biomart.org/biomart/martview
Nov, 2009
2. BioMart: Data Mining
- BioMart is a search engine that can find multiple terms and put them into a table format.
- For example: mouse gene IDs, chromosome and base pair position.
- No programming required!
3. General or Specific Data Tables
- All the genes for one species
- Or only genes on one specific region of a chromosome
- Or genes on one region of a chromosome associated with an InterPro domain
4. The First Step: Choose the Dataset
Dataset: Current Ensembl, Human genes
5. The Second Step: Filters
Filters: Define a gene set
6. Attributes: Attach Information
Attributes: Determine output columns
7. Results
Tables or sequences
10. Query
- For the human CFTR gene, can I export the EntrezGene ID, and also probes with this gene sequence from the Affy HG U133 Plus 2 microarray platform?
- In the query:
  - Filters: what we know
  - Attributes: what we want to know (columns in the result table)
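The web interface needs no programming, but the same question can also be written down as a BioMart XML query, which makes the split between filters and attributes explicit. The following is a minimal sketch, not an official client: the helper `build_biomart_query` is hypothetical, and the dataset, filter and attribute names (`hsapiens_gene_ensembl`, `hgnc_symbol`, `ensembl_gene_id`, `entrezgene`, `affy_hg_u133_plus_2`) are assumed internal names that may differ between releases. The sketch only builds the XML text; it does not contact any server.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart-style XML query as a string (hypothetical helper).

    filters:    dict of filter name -> value (what we know)
    attributes: list of attribute names (what we want to know)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated result table
        "header": "1",        # include column headers
        "uniqueRows": "1",    # drop duplicate rows
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All internal names below are assumptions; check them against the mart
# you are querying before relying on them.
cftr_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"hgnc_symbol": "CFTR"},
    attributes=["ensembl_gene_id", "entrezgene", "affy_hg_u133_plus_2"],
)
print(cftr_query)
```

The one filter corresponds to what we know (the gene symbol), and each attribute becomes one column of the result table.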
11. A Brief Example
Use the current Ensembl (archives are also available).
Select Homo sapiens.
12. Select the Genes with Filters
Click Filters.
Expand the REGION panel.
Expand the GENE panel to enter the gene ID(s).
13. Filters
Change the ID type to HGNC symbol. Enter CFTR in the box.
Click Count to see how many genes passed through your filters.
14. Attributes (Output Options)
Click on Attributes.
Expand the GENE section.
15. Attributes (Output Options)
Select Description and Associated Gene Name.
Expand the EXTERNAL panel for non-Ensembl IDs.
16. Attributes (Output)
External IDs include EntrezGene IDs and also microarray probe IDs.
17. The Results Table: Preview
For the full result table, click Go or View ALL rows.
Results show Description, Name, EntrezGene and probe matches from the Affy HG U133 Plus 2 platform.
18. Full Result Table
Affy HG probe
Gene Name
EntrezGene ID
Ensembl Gene and Transcript IDs
Description
19. Other Export Options (Attributes)
- Sequences: UTRs, flanking sequences, cDNA and peptides, etc.
- Gene IDs from Ensembl and external sources (MGI, Entrez, etc.)
- Microarray data
- Protein functions/descriptions (InterPro, GO)
- Orthologous gene sets
- SNP/variation data
20. BioMart Data Sets
- Ensembl genes
- Vega genes
- Variations
21. BioMart Around the World
BioMart started at Ensembl. Where has it travelled to?
22. Central Portal
www.biomart.org
23. WormBase
24. HapMap
26. GRAMENE
www.gramene.org
27. The Potato Center
28. How to Get There
- http://www.biomart.org/biomart/martview
- http://www.ensembl.org/biomart/martview
- Or click on BioMart from Ensembl
29. The Flow
- Choose Dataset (All genes for a species)
- Choose Filters (narrows the gene set)
- Choose Attributes (output options)
- Now Try the Worked Example on Page 23!
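The three-step flow can be mimicked on a small in-memory table to show what each step does: the dataset is the full set of rows, filters keep only the matching rows, and attributes pick the output columns. This is a toy sketch under stated assumptions. The function `run_query` and the records in `toy_dataset` are invented placeholders, not BioMart code or real annotation.

```python
def run_query(dataset, filters, attributes):
    """Filters select rows; attributes select columns (hypothetical helper)."""
    rows = [row for row in dataset
            if all(row.get(key) == value for key, value in filters.items())]
    return [{name: row.get(name) for name in attributes} for row in rows]


# Placeholder records for illustration only, not real gene annotation.
toy_dataset = [
    {"gene_id": "GENE_A", "chromosome": "1", "biotype": "protein_coding"},
    {"gene_id": "GENE_B", "chromosome": "1", "biotype": "pseudogene"},
    {"gene_id": "GENE_C", "chromosome": "2", "biotype": "protein_coding"},
]

result = run_query(
    toy_dataset,
    filters={"chromosome": "1", "biotype": "protein_coding"},  # narrow the gene set
    attributes=["gene_id", "biotype"],                         # output columns
)
print(result)  # [{'gene_id': 'GENE_A', 'biotype': 'protein_coding'}]
```

With an empty filter set, every row of the dataset is returned, which corresponds to "all genes for a species".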
30. Ensembl Core Databases
- Relational database
- Normalised
  - Each data point stored only once
- Therefore:
  - Quick updates
  - Minimal storage requirements
- But:
  - Many tables
  - Many joins for complicated queries
  - Slow for data mining applications
31. Normalised Schema
32. BioMart Database
- Data warehouse
- De-normalised
- Query-optimised
- Therefore:
  - Fast and flexible
  - Ideal for data mining
- But:
  - Tables with apparent redundancy
  - Needs rebuilding from scratch for every release from normalised core databases
33. De-Normalised Schema
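The trade-off between the two schema styles can be illustrated with a small SQLite sketch. The table and column names (`gene`, `transcript`, `xref`, `gene_mart`) and the rows are hypothetical and greatly simplified; they are not the real Ensembl core or BioMart schemas. The normalised tables need two joins to answer the question, while the de-normalised table is built once from those joins and then read with a plain single-table query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised layout: each fact is stored once and linked by keys.
cur.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY,
                         gene_id INTEGER, name TEXT);
CREATE TABLE xref (transcript_id INTEGER, external_id TEXT);
INSERT INTO gene VALUES (1, 'geneA');
INSERT INTO transcript VALUES (10, 1, 'geneA-001'), (11, 1, 'geneA-002');
INSERT INTO xref VALUES (10, 'EXT1'), (11, 'EXT2');
""")

# Answering "gene, transcript, external ID" needs two joins.
joined_sql = """
SELECT g.name AS gene_name, t.name AS transcript_name, x.external_id
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
JOIN xref x ON x.transcript_id = t.transcript_id
"""
normalised_rows = cur.execute(
    joined_sql + " ORDER BY transcript_name").fetchall()

# De-normalised layout: materialise the joined result once as a wide table.
# The gene name is now repeated on every row (apparent redundancy), and the
# table must be rebuilt whenever the normalised source tables change.
cur.execute("CREATE TABLE gene_mart AS " + joined_sql)
mart_rows = cur.execute(
    "SELECT gene_name, transcript_name, external_id "
    "FROM gene_mart ORDER BY transcript_name").fetchall()

print(mart_rows)
```

Both queries return the same rows; the difference is where the join work is paid, at query time for the normalised tables or at build time for the mart table.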
34. Information Flow
DATASET
FILTER
ATTRIBUTES