1
Data integration via XML
  • Ela Hunt
  • John Wilson
  • Vangelis Pafilis
  • Inga Tulloch

http://xtect.cis.strath.ac.uk/
2
Overview
  • Four biological scenarios of data integration
  • Data integration - problem definition
  • XTECT indexing approach
  • Literature review
  • Current status and further work

3
Scenario 1: Cardiovascular Functional Genomics
  • AIM: discover genes causing hypertension
  • Rat animal models of hypertension (rat strains
    which suffer from stroke)
  • Microarrays are used to compare gene expression
    in sick and healthy rats; typically 100-400 genes
    are differentially expressed
  • Microarray results are visualised on maps and
    data are interpreted using public web databases
    (browsing and querying)

4
SyntenyVista
5
Scenario 2: Mouse mammary gland development as a
model of cancer proliferation
  • AIM: find genes active in cancer growth
  • Take mouse samples and apply to a microarray
    slide
  • Measure trends in gene expression, identify 400
    genes of interest
  • Use public web databases to interpret information
    on 400 genes (interpreting 100 genes took 6
    months, now the information is out of date)

6
Scenario 3: Rat model of schizophrenia
  • AIM: understand which genes are expressed during
    schizophrenia
  • Rats have symptoms of schizophrenia after a
    chemical treatment (2 models are used)
  • Measure gene expression in two models
  • Interpret data on 250 genes: find if microarray
    probes correspond to genes by using BLAST (DNA
    sequence comparison) and PubMed (bibliographic
    database)
  • Gather DNA sequences for real genes from Ensembl
    (BLAST hits) and design probes

7
Scenario 4: Proteomics
  • AIM: understand and record protein functions
  • Case 1: study the proteome of Trypanosoma brucei.
    For all proteins identified, find information on
    the web which might shed light on their function
  • Case 2: interpret data on human proteins
    differentially expressed in human cells invaded
    by Toxoplasma gondii.
  • Compare protein and gene expression
  • Use SwissProt, PubMed, GeneOntology and any
    other web resources

8
Problem definition
  • Given a large microarray or proteomics experiment
    (a list of gene names or peptide masses)
  • Find all known information about those genes or
    proteins on the web
  • Make this information accessible

9
What we expect to achieve
Query: table of names
Result 1: table of integrated information
Result 2: map of probes and synteny
Result 3: clusters based on the number of
relevant query terms found
10
  • Use item matching - XML leaves - to start
  • Match starting from leaves and extend towards the
    schemas expressed as paths
  • Use database techniques - indexing
  • Use data mining techniques: get statistics on
    data
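
A minimal sketch of the leaf-first idea, assuming each database record is available as a small XML document. The element names (`entry`, `GeneName`, `PubMedID`) and the helper `leaf_paths` are hypothetical and only illustrate how leaf values can be collected together with the path that leads to them.

```python
import xml.etree.ElementTree as ET

def leaf_paths(xml_text, db_name):
    """Yield (path, value) for every leaf element that carries text."""
    root = ET.fromstring(xml_text)

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        children = list(node)
        if not children:
            value = (node.text or "").strip()
            if value:
                yield path, value
        for child in children:
            yield from walk(child, path)

    # Prefix every path with the database name so paths from
    # different sources stay distinguishable.
    yield from walk(root, db_name)

# Hypothetical record, not taken from any real database dump.
SAMPLE = "<entry><GeneName>agene1</GeneName><PubMedID>12345</PubMedID></entry>"
pairs = list(leaf_paths(SAMPLE, "Swiss-Prot"))
for path, value in pairs:
    print(path, value)
```

Starting from the leaves means two sources can be compared on data values first, and the paths are only consulted once a value match has been found.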

11
More detail
  • Index all paths and leaves in XML trees for a
    representative set of biological databases
  • Relational technology
  • Warehouse
  • Match leaves (data values)
  • Find path overlaps -> remove redundancies in data

12
First problem solved: query expansion
  • 30K human, 30K rat, and 30K mouse genes, some of
    which have synonyms
  • Query expansion to include the synonyms
  • Prototype in Java, 300 ms for synonym lookup
  • Same idea as in GeneCards which focuses on human
    data

13
Second: indexing XML
  • Medline (40 GB) in XML (bibliographic)
  • SwissProt and TrEMBL, 1 GB in XML (proteins)
  • OMIM and HUGO: databases of genes, small (human
    diseases and human genes)
  • Affymetrix microarray files for the mouse, small,
    XML
  • Ensembl: no XML files, access via MySQL (human,
    mouse, rat genomes and predicted genes)
  • Mouse Genome MGD: direct access to Sybase, no
    XML
  • Rat database RGD: stores little data!
  • Gene Ontology: around 1 GB in XML

14
  • Paths and tags indexed using integer encoding,
    preserving XML order
  • Indexing of Medline and OMIM needs to be resolved
    (text XML)
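
One possible reading of "integer encoding, preserving XML order" is sketched below. This is an assumption, not the XTECT scheme itself: each distinct tag gets an integer in first-seen order, each path becomes a tuple of tag integers, and each node gets a position from a preorder walk so that document order is kept. The function name `encode` is hypothetical.

```python
import xml.etree.ElementTree as ET

def encode(xml_text):
    """Return (tag_ids, records); records are (position, encoded_path)."""
    tag_ids = {}   # tag name -> integer, assigned in first-seen order
    records = []   # one entry per element, in document (preorder) order

    def walk(node, prefix):
        tag_id = tag_ids.setdefault(node.tag, len(tag_ids) + 1)
        path = prefix + (tag_id,)
        records.append((len(records) + 1, path))
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return tag_ids, records

tag_ids, records = encode("<a><b/><c><b/></c></a>")
print(tag_ids)
for position, path in records:
    print(position, path)
```

Integer tuples are compact and compare quickly, which is one reason such an encoding suits a relational index.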

15
How the index will work
  • PubMed: accession = 12345, abstract = "...
    interactions of agene1 with agene2 ..."
  • Swiss-Prot: PubMedID = 12345, GeneName = agene1
  • Matched paths: Swiss-Prot/PubMedID and
    PubMed/accession; Swiss-Prot/GeneName and
    PubMed/abstract
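
The PubMed and Swiss-Prot example can be sketched with a relational leaf table. SQLite is used here only as a stand-in for the relational store, and the table name `leaf` and its columns are assumptions. The self-join finds exact value matches across databases; linking GeneName to a word inside an abstract would additionally need text tokenisation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leaf (db TEXT, path TEXT, value TEXT)")
conn.execute("CREATE INDEX leaf_value ON leaf (value)")

rows = [
    ("PubMed", "PubMed/accession", "12345"),
    ("PubMed", "PubMed/abstract", "... interactions of agene1 with agene2 ..."),
    ("Swiss-Prot", "Swiss-Prot/PubMedID", "12345"),
    ("Swiss-Prot", "Swiss-Prot/GeneName", "agene1"),
]
conn.executemany("INSERT INTO leaf VALUES (?, ?, ?)", rows)

# Pair up paths from different databases that hold the same leaf value.
matches = conn.execute(
    """
    SELECT a.path, b.path, a.value
    FROM leaf AS a JOIN leaf AS b ON a.value = b.value
    WHERE a.db < b.db
    ORDER BY a.path, b.path
    """
).fetchall()
print(matches)
```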
16
Matching
  • Db1/path1/socs3 and Db2/path2/socs3 -> synonymous
    paths
  • Get statistics for full and partial path matches
    and postulate schema matches
  • Manually inspect the matched paths, and examine
    support for each path match
  • Automate the procedure
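
A rough sketch of how support for a path match might be counted, assuming leaves are available as (database, path, value) triples. The tokenisation rule and the function `path_match_support` are hypothetical; the slides do not specify the statistics used. Here, support for a pair of paths is the number of distinct tokens the two paths share.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

def path_match_support(leaves):
    """Count shared tokens for each pair of paths from different databases."""
    paths_by_token = defaultdict(set)
    for db, path, value in leaves:
        for token in set(re.findall(r"\w+", value.lower())):
            paths_by_token[token].add((db, path))

    support = Counter()
    for paths in paths_by_token.values():
        for (db1, p1), (db2, p2) in combinations(sorted(paths), 2):
            if db1 != db2:  # only cross-database matches are of interest
                support[(p1, p2)] += 1
    return support

# Invented leaves, following the Db1/path1/socs3 pattern from the slide.
LEAVES = [
    ("Db1", "Db1/path1", "socs3"),
    ("Db2", "Db2/path2", "socs3"),
    ("Db1", "Db1/path1", "agene1"),
    ("Db2", "Db2/path2", "agene1"),
    ("Db2", "Db2/path3", "agene1 interacts with agene2"),
]
support = path_match_support(LEAVES)
for pair, count in support.most_common():
    print(pair, count)
```

Path pairs with high support would be the first candidates for manual inspection before a schema match is postulated.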

17
Architecture
[Architecture diagram]
Layers: INTERACTION, PROCESSING LAYER, INDEX, WAREHOUSE
Components: microarray experiment, proteomics experiment,
visualisation, list of names, synonym expander, XML tree
merger, XML tree finder, gene trees (XML), mapping
generation and lookup
18
Status
  • Mirroring external XML data
  • Query expansion is implemented
  • Software to XMLise OMIM and some of the MGD
  • Testing indexing software for loading into Oracle
  • Designing an algorithm for data mining
  • Developing ideas on adding sequence comparison
    and text retrieval, and connecting to
    visualisation tools (collaboration with e-Science
    project BRIDGES)

19
THE VISION
To tabular summaries
To multiple alignment
To sequence
20
Other work
  • Schema-based approaches: look at the schemas to
    find mappings between them
      • use constraints, tree shape, some data
      • involve the user/programmer: YATL, Clio, REVERE
  • Data-based approaches: look at data values in
    order to find mappings between attributes
      • ML approaches are inefficient, all-against-all
  • Problems
      • Expensive in terms of labour (programmer or user)
      • Only very similar schemas can be matched
      • Not scalable

21
Recent papers
  • Kurgan et al., 2002, machine learning for schema
    matching (2 very similar schemas)
  • Doan et al., VLDBJ03, machine learning, 2
    semi-structured schemas (ontologies), schemas and
    some data
  • Chua et al., VLDBJ03, (RDBMS): given entity
    matches (table names), match attributes (values),
    based on a variety of statistical tests
  • Halevy et al., CIDR-2003, user-driven schema
    matching by example, and mapping by transitivity
    (no algorithm has been given)

22
Summary
  • Aim - to overcome the problems associated with
    manual or schema-based mapping approaches which
    are expensive
  • Scale up, take into account data values
  • Provide a digest of information for a list of
    gene/protein names of interest
  • Using XML and relational indexes

23
Collaborators at Glasgow
  • Barry Gusterson
  • Andy Jones
  • Torsten Stein
  • Inga Tulloch
  • Catherine Winchester
  • Anna F. Dominiczak
  • Neil Hanlon
  • BRIDGES project (uses DB2)
  • Vangelis Pafilis
  • John Wilson
FUNDING
  • Carnegie Trust for the Universities of Scotland
  • Medical Research Council (UK)
  • Royal Society
  • Synergy