Title: Standards and gene expression data
1. Standards and gene expression data: from data
archiving to extracting biological knowledge
Helen Parkinson, PhD, Production Coordinator,
European Bioinformatics Institute
2. Talk content
3. Data sharing
4. Standards Landscape
Nature Reviews Genetics, Vol 7, p.593-605 (August 2006)
5. MIAME: Minimal Information About a Microarray
Experiment
6. So, has MIAME been successful?
7. ArrayExpress?
8. ArrayExpress history
2001: 4
2002: 12 Expts
2003: 100 Expts
2004: TIGR Export, 420 Expts
2005: Re-funded, SMD Export, 1200 Expts
2006: New UI, 1600 Expts
2007: GEO Affy Data Import Phase 1, >2607 Expts
2008: 6631 Expts
9. ArrayExpress Overview
13. Getting to a summary level data Atlas
14. Our Use Cases
- Query support (e.g., query for 'cancer' and also
get 'leukemia')
- Over-representation analysis in groups of samples
(analogous to the use of GO terms in
over-representation analysis in groups of genes)
- Ontology visualisation, e.g., presenting an
ontology tree to the user of what is in the
database
- Data integration by ontology terms, e.g., we
assume that 'kidney' in independent studies
roughly means the same, so we can count how many
kidney samples we have in the database
- Intelligent template generation for different
experiment types in submission or data
presentation
- Summary level data
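The query support and data integration use cases can be sketched in a few lines. This is a minimal illustration only, not the ArrayExpress implementation: the is_a hierarchy, the sample annotations and the function names (build_children, expand_term, count_samples) are all hypothetical and made up for the example.

```python
from collections import defaultdict

# Hypothetical is_a edges (child -> parent); not taken from any real ontology.
IS_A = {
    "leukemia": "cancer",
    "lymphoma": "cancer",
    "acute myeloid leukemia": "leukemia",
    "cancer": "disease",
    "kidney": "organ",
}


def build_children(is_a):
    """Invert child -> parent edges into parent -> set of children."""
    children = defaultdict(set)
    for child, parent in is_a.items():
        children[parent].add(child)
    return children


def expand_term(term, children):
    """Return the term together with all of its descendants."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def count_samples(term, annotations, children):
    """Count samples annotated with the term or any of its descendants."""
    wanted = expand_term(term, children)
    return sum(1 for label in annotations.values() if label in wanted)


# Hypothetical sample annotations: sample id -> ontology term.
SAMPLES = {
    "s1": "leukemia",
    "s2": "kidney",
    "s3": "acute myeloid leukemia",
    "s4": "kidney",
    "s5": "lymphoma",
}

CHILDREN = build_children(IS_A)
print(sorted(expand_term("cancer", CHILDREN)))
print(count_samples("kidney", SAMPLES, CHILDREN))
print(count_samples("cancer", SAMPLES, CHILDREN))
```

A query for 'cancer' is expanded to its descendant terms before matching, so samples annotated only as 'leukemia' are still found; counting 'kidney' samples across studies works the same way once annotations share ontology terms.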
15. Oh the complexity!
Publication
External links
Normalisation
16. Application Ontology Status Quo
- Text mining at data acquisition
- Tuned for queries, structured for use in
ArrayExpress GUI
- Multi-species aspect
(Diagram labels: DW, AE)
06.04.2014
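Text mining at data acquisition can be illustrated with a simple dictionary lookup that maps free-text sample descriptions to preferred ontology labels. This is a sketch under stated assumptions, not the text mining actually used in ArrayExpress: the synonym table and the annotate function are hypothetical.

```python
import re

# Hypothetical synonym dictionary: surface form -> preferred ontology label.
SYNONYMS = {
    "kidney": "kidney",
    "renal": "kidney",
    "leukemia": "leukemia",
    "leukaemia": "leukemia",
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
}

# One alternation pattern; longer phrases come first so multi-word forms win.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(s) for s in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def annotate(text):
    """Return the sorted preferred labels whose synonyms occur in the text."""
    hits = {SYNONYMS[m.group(1).lower()] for m in _PATTERN.finditer(text)}
    return sorted(hits)


print(annotate("Renal biopsy from human patient with acute leukaemia"))
```

Mapping spelling variants and synonyms to one label at acquisition time is what lets later queries and counts treat 'renal' and 'kidney' samples as the same thing.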
17. Semantic Roadmap
- Position of the ArrayExpress Experimental Factor
Ontology in the bigger picture
- Key is orthogonal coverage, reuse of existing
resources and shared frameworks
Chemical Entities of Biological Interest (ChEBI)
Relation Ontology
Cell Type Ontology
Various Species Anatomy Ontologies
Anatomy Reference Ontology
Disease Ontology
AE Ontology
18. What lies beneath?
19. Where does the data come from?
20. What is curation?
21. 2007 Affymetrix Data landscape
22. Data exchange or the failure to federate
- We need all the data in house to re-process it
- We do not have a data exchange agreement with GEO
- SOFT vs. MAGE-ML/MAGE-TAB
- No ontology usage
- Some free text annotation, little process
annotation
- Mass data acquisition
- 80% solution (or less)
- Employing text mining
- Data reprocessing
- Cost effective, eliminates user support
- Using spreadsheets (not XML)
- We could almost eliminate the database if we can
index the files
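The idea of indexing the spreadsheet files instead of loading them into a database can be sketched as a small inverted index over tab-delimited text. This is an illustration only: the file names and contents are hypothetical, the files are held in memory for the example, and build_index and search are made-up helper names.

```python
import csv
import io
import re
from collections import defaultdict

# Hypothetical tab-delimited annotation files, held in memory for the example.
FILES = {
    "E-EXAMPLE-1.txt": (
        "Sample\tOrganism part\tDisease\n"
        "s1\tkidney\tnormal\n"
        "s2\tkidney\tcarcinoma\n"
    ),
    "E-EXAMPLE-2.txt": (
        "Sample\tOrganism part\tDisease\n"
        "s1\tliver\tcarcinoma\n"
    ),
}


def tokenize(cell):
    """Split a cell into lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", cell.lower())


def build_index(files):
    """Map each token to the set of (file name, data row number) locations."""
    index = defaultdict(set)
    for name, content in files.items():
        reader = csv.reader(io.StringIO(content), delimiter="\t")
        next(reader, None)  # skip the header row
        for row_number, row in enumerate(reader, start=1):
            for cell in row:
                for token in tokenize(cell):
                    index[token].add((name, row_number))
    return index


def search(index, query):
    """Return locations that contain every token of the query (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


INDEX = build_index(FILES)
print(sorted(search(INDEX, "kidney carcinoma")))
print(sorted(search(INDEX, "carcinoma")))
```

The files stay the primary store and the index is rebuilt from them, which fits the point that text-based queries are the common case.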
23. 2008 Data Landscape
24. Flexible Data Access Models
- GUIs: biologists
- Hyperlinks
- FTP: bioinformaticians
- Web services: workflows
- XML data dumps
- Spreadsheets
- Direct SQL access (not for ArrayExpress)
- Schema and code if you want it
- Geek for a week
25. Lessons learned
- Complex architecture means a lot of software
engineering
- Biologists like Excel, bioinformaticians like
tab-delimited files
- Spreadsheets scale, easy to check, harder to
parse
- Generic systems will be future proof
- Legacy format converters are needed
- You don't need to keep everything
- Text based queries most common
- Text mining very useful
- Scaling problems are hard to fix
- Bleeding edge technologies should be used
sparingly
- Federation doesn't really work for the goals we
have
- Archiving alone does not add value
- Training is important and expensive
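As an illustration of "easy to check", a tab-delimited sheet can be validated with a few lines of code. This is a minimal sketch, not the checker used by the ArrayExpress team: the required column names, the example sheet and the check_sheet function are assumptions chosen for the example.

```python
import csv
import io

# Illustrative required columns; a real submission format defines its own.
REQUIRED_COLUMNS = ["Source Name", "Organism", "Array Data File"]


def check_sheet(text, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in a tab-delimited sheet."""
    problems = []
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["sheet is empty"]
    header = [h.strip() for h in rows[0]]
    for column in required:
        if column not in header:
            problems.append(f"missing column: {column}")
    # Data lines are numbered as in the file, so the first one is line 2.
    for line_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} cells, found {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            if name in required and not value.strip():
                problems.append(f"line {line_number}: empty value in '{name}'")
    return problems


# Hypothetical sheet with one empty cell and one short row.
SHEET = (
    "Source Name\tOrganism\tArray Data File\n"
    "sample 1\tHomo sapiens\ts1.CEL\n"
    "sample 2\t\ts2.CEL\n"
    "sample 3\tMus musculus\n"
)

for problem in check_sheet(SHEET):
    print(problem)
```

Reporting problems by line number keeps the feedback usable for submitters who edit the sheet in Excel.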
26. Useful tools for life sciences data management
- Excel
- Whatizit: text mining software from EBI
- Our spreadsheet builder, checkers and format
parsers: tab2mage.sf.net
- OBO Foundry ontologies, esp. OBI, CTO, Disease
Ontology
- Taverna for building workflows
- BASE: open source microarray data management
tool
- BioMart: data warehouse for biological data,
www.biomart.org
27. Acknowledgements
- ArrayExpress Production Team
- Tomasz Adamusiak, Tony Burdett, Anna Farne, Ele
Holloway, James Malone, Margus Lukk, Helen
Parkinson, Tim Rayner, Eleanor Williams, Holly
Zheng
- Ugis Sarkans: ArrayExpress Development Team Leader
- Misha Kapushesky: Main Atlas Developer
- Gabriella Rustici: Training officer
- Alvis Brazma: Group Leader
- UniProt and Ensembl teams
- Funders: EC (FELICS, EMERALD, DIAMONDS, GEN2PHEN,
MUGEN), NIH-NHGRI, EMBL
- The submitters and microarray collaborators
- GEO, especially Tanya Barrett