Query SIG Recommendations - PowerPoint PPT Presentation

Slides: 30
Provided by: scott136

1
caBIG Architecture Workspace: Common Query Language SIG
Summary and Initial Recommendations
Face-to-Face Meeting, Seattle, WA, March 16th and 17th, 2005
2
Agenda
  • Overview of SIG
  • Use Case Review
  • Requirements Review
  • Language Candidates
  • Language Evaluation
  • Recommendations
    • Approach
    • Language
    • Implementation
  • What's next
  • Discussion

3
Query SIG Mission
  • To come to a consensus on the requirements,
    properties, and details of the language which
    will be used to query caBIG grid resources, and
    also to define the requirements of a query engine
    capable of performing the caBIG query use cases,
    and create/identify a query language which meets
    these criteria.

4
Data Sharing Vision
Courtesy of http://ccr.cancer.gov/news/frontiers/Sept_2004.pdf
5
Example Queries - caTIES
  • All use cases will extend a basic functionality
    that includes the ability to search for
    particular text strings or concepts and
    demographic information.
  • For textual data, three general kinds of queries
    will be supported
  • Query by text - users will be able to enter
    strings and the system will search for documents
    that exactly match this string
  • Query by concept - users will be able to enter
    strings that will be mapped to candidate EVS
    concepts, and users will be able to select one or
    more EVS concepts to be included in the query.
    During this process, users will be able to
    interact with either the Metathesaurus
    broader-than/narrower-than tree, or the NCI
    Thesaurus to further browse and refine their
    concepts.
  • Query by semantic type - users will be able to
    enter strings that will map to a subset of
    candidate EVS concepts based on predefined
    semantic types including Diagnosis, Procedure,
    and Organ. Users will be able to select one or
    more EVS concepts to be included in the query.
    During this process, users will be able to
    interact with either the Metathesaurus
    broader-than/narrower-than tree, or the NCI
    Thesaurus to further browse and refine their
    concepts.
  • For non-textual data, users will be able to
    constrain queries by age, gender, values for
    extracted quantitative data such as tumor size,
    grade and stage, and temporal relationships
    between multiple reports for a single patient.
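
The combination of a text query with demographic constraints described above can be sketched in Python. This is only an illustration, not the caTIES implementation: the `Report` record, its fields, and the sample reports are hypothetical, and "exactly match" is assumed here to mean a case-sensitive substring match.

```python
from dataclasses import dataclass


@dataclass
class Report:
    # Hypothetical record layout, invented for this sketch.
    patient_id: str
    age: int
    gender: str
    text: str


def query_by_text(reports, phrase, min_age=None, max_age=None, gender=None):
    """Return reports containing the phrase, optionally constrained
    by an age range and a gender value."""
    hits = []
    for report in reports:
        if phrase not in report.text:  # assumed: case-sensitive substring match
            continue
        if min_age is not None and report.age < min_age:
            continue
        if max_age is not None and report.age > max_age:
            continue
        if gender is not None and report.gender != gender:
            continue
        hits.append(report)
    return hits


REPORTS = [
    Report("p1", 62, "F", "Invasive ductal carcinoma, tumor size 2.1 cm."),
    Report("p2", 47, "M", "No evidence of carcinoma."),
    Report("p3", 71, "F", "Benign colonic epithelium."),
]

matches = query_by_text(REPORTS, "carcinoma", min_age=50, gender="F")
matched_ids = [r.patient_id for r in matches]
print(matched_ids)
```

Query by concept and query by semantic type would need a terminology service in front of this step; that part is not sketched here.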

6
Example Queries - University of Pennsylvania
  • I want to collect all microarray data (Affy only)
    available from all cancer centers from patients
    with bladder or ovarian cancer that were part of
    any clinical trial protocol using cisplatin
    within the past five years. In addition, I want
    to know all available tissue samples, cancerous
    and non-cancerous (normal) tissue localized
    within 10mm of tumor site from this patient group
    such that I can perform Affy gene expression
    studies to include with previously performed
    studies that were identified by the query.
    Finally, I need all severe adverse events for the
    group of patients identified that had a severity
    rating of 3-4 and are likely linked to cisplatin
    administration.
  • I want all solid tumors, specifically for lung
    cancer, that have a diagnosis based on tumor
    pathology. Each diagnosis must have an image of
    the tumor that allows for independent
    verification of diagnoses. Each record retrieved
    must also have either proteomics marker data or
    microarray data (Affy or two-color) included so
    that different molecular techniques can be
    correlated to the tumor pathology. In addition, I
    want all protein annotations for markers and
    genes associated with the proteomics and
    microarray data so I can perform meta-analyses.
  • I want to retrieve a dataset for all patients
    that have been in at least two clinical trials at
    any cancer center throughout the US that had two
    separate cancer diagnoses (not the same cancer
    diagnosed twice, but two different cancers). I
    also need a comprehensive treatment history for
    each patient. If treatment history is not
    complete, do not want patient included in dataset.

7
More Use Cases
  • I want all images of grade I epithelial ovarian
    cancers in .tiff format at X100 resolution with
    file size less than 20Mb where resection occurred
    between 1997 and 2002.
  • I want to make a tissue array which has benign
    colonic epithelium, adenomatous tissue, local
    adenocarcinoma, and metastases from as many colon
    cancer patients as possible. I need all four
    histologic classes of tissue on each patient, but
    I don't care what kind of adenoma is used nor
    where the metastatic tissue is from, as long as
    it's metastatic colon cancer. All samples must be
    available for use/consumption in a study I'm
    proposing. (Arizona)

8
Requirements Matrix
9
Functional Requirements
  • Expression: must allow the client to express
    complex queries easily and consistently.
  • Structured data query: must minimally provide
    users and services the ability to query
    structured data correctly and completely.
  • Full text search: should provide a full text
    query capability correctly and completely.
  • Semantic query: must provide a capability to
    query semantic information correctly and
    completely.
  • Semantic reasoning: should provide a capability
    to perform semantic reasoning, and express
    queries using semantic inferences.
  • Data creation / update: should support the
    creation and modification of datasets.
  • Core operations: must provide a moderate set of
    core operations.
  • Operation extension: should support extensions
    to the core set of operations and
    transformations.
  • Persistence: should support the concept of
    persistent queries.

10
Non-Functional Requirements
  • XML expression: must be expressible in XML.
  • XML results: must be consistent with result sets
    being in XML.
  • Result set expression: should describe how
    result sets will be structured.
  • Metadata consistency: must be straightforward
    for users to construct queries over data sets
    from examining an information model
    representation of the dataset.
  • Implementation independent: must be data
    resource agnostic, and not have any dependencies
    on a particular underlying storage technology.
  • Security: may need to have security-related
    constructs in it.
  • Security of results: the language may need to
    formalize the security requirements of result
    sets.
  • Standards-based: the language should follow a
    community accepted standard.
  • Open implementation: the language should have a
    freely available query engine implementation.

11
Potential Languages
  • SQL
  • OQL
  • XQuery
  • caGRID Phase 1 Query

12
SQL: Structured Query Language
  • SQL is an ANSI (American National Standards
    Institute) standard computer language for
    accessing and manipulating database systems. SQL
    statements are used to retrieve and update data
    in a database.
  • SQL Data Manipulation Language (DML)
    • SELECT - extracts data from a database table
    • UPDATE - updates data in a database table
    • DELETE - deletes data from a database table
    • INSERT INTO - inserts new data into a database
      table
  • SQL Data Definition Language (DDL)
    • CREATE TABLE - creates a new database table
    • ALTER TABLE - alters (changes) a database table
    • DROP TABLE - deletes a database table
    • CREATE INDEX - creates an index (search key)
    • DROP INDEX - deletes an index
  • Basic Query Example
    • SELECT <columns> FROM <tables> WHERE <column>
      <operator> <value>
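
As a concrete instance of the DML and DDL statements listed above, here is a minimal sketch using Python's built-in `sqlite3` module. The `protocol` table, its columns, and its rows are hypothetical and chosen only for illustration. SQLite is used because it needs no server; it is not something the slides propose.

```python
import sqlite3

# In-memory database; table name, columns, and rows are made up for this sketch.
conn = sqlite3.connect(":memory:")

# DDL: CREATE TABLE and CREATE INDEX
conn.execute("CREATE TABLE protocol (identifier TEXT PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX idx_protocol_title ON protocol (title)")

# DML: INSERT INTO, UPDATE, DELETE
conn.executemany(
    "INSERT INTO protocol (identifier, title) VALUES (?, ?)",
    [("P-1", "Hybridization"), ("P-2", "Labeling"), ("P-3", "Scanning")],
)
conn.execute(
    "UPDATE protocol SET title = ? WHERE identifier = ?",
    ("Array scanning", "P-3"),
)
conn.execute("DELETE FROM protocol WHERE identifier = ?", ("P-2",))

# SELECT <columns> FROM <tables> WHERE <column> <operator> <value>
rows = conn.execute(
    "SELECT identifier, title FROM protocol WHERE identifier = ?", ("P-3",)
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM protocol").fetchone()[0]
conn.close()

print(rows, remaining)
```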

13
OQL: Object Query Language
  • OQL is an object-oriented SQL-like query language
    and is the query language of the ODMG-93 standard
  • OQL is a superset of the standard SQL part which
    allows you to query a database. Thus, any select
    SQL sentence which runs on relational tables,
    works with the same syntax and semantics on
    collections of ODMG objects. Extensions concern
    Object Oriented notions, like complex objects,
    object identity, path expression, polymorphism,
    operation invocation, late binding etc...
  • Example Basic Query
  • define jones as select distinct x from Students x
    where x.name = "Jones"
  • select distinct student_id from jones

14
XQuery: XML Query Language
  • XQuery is to XML what SQL is to databases; it is
    defined by the W3C
  • Built on XPath expressions
  • XPath is a syntax for defining parts of an XML
    document: path expressions, axes and node tests,
    and predicates
  • XPath defines a library of standard functions:
    node set functions, string functions, number
    functions, Boolean functions
  • XPath supports numerical, equality, relational,
    and Boolean expressions.
  • FLWR (pronounced "flower") expression is the
    analogue of the SELECT-FROM-WHERE construction in
    SQL
  • FOR-clause binds one or more variables to a
    sequence of values returned by another expression
    (usually a path expression) and iterates over the
    values.
  • LET-clause also binds one or more variables but
    without iterating.
  • WHERE-clause contains one or more predicates
    that filter or limit the set of nodes
    generated by the FOR/LET-clauses.
  • RETURN-clause generates the output of the FLWR
    expression. The RETURN-clause usually contains
    one or more element constructors and/or
    references to variables and is executed once for
    each node-reference that is returned by the
    FOR/LET/WHERE-clauses.
  • Example Basic Query
  • for $b in collection("bib")/bib/book
  • where $b/publisher/text() = "Addison-Wesley" and
    $b/@year = "1994"
  • return $b/title
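
To make the roles of the four clauses concrete, the following Python sketch emulates the basic query above with the standard library's `xml.etree.ElementTree`. It is an analogy only, since Python's standard library has no XQuery engine. The `BIB` document and the function name `flwr_titles` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document standing in for collection("bib").
BIB = """<bib>
  <book year="1994"><title>Book A</title><publisher>Addison-Wesley</publisher></book>
  <book year="1992"><title>Book B</title><publisher>Addison-Wesley</publisher></book>
  <book year="1994"><title>Book C</title><publisher>Other Press</publisher></book>
</bib>"""


def flwr_titles(bib_xml, publisher, year):
    """Emulate FOR / LET / WHERE / RETURN over /bib/book."""
    root = ET.fromstring(bib_xml)  # root is the <bib> element
    titles = []
    for book in root.findall("book"):              # FOR: iterate over each book
        book_publisher = book.findtext("publisher")  # LET: bind without iterating
        if book_publisher == publisher and book.get("year") == year:  # WHERE
            titles.append(book.findtext("title"))  # RETURN: build the output
    return titles


titles = flwr_titles(BIB, "Addison-Wesley", "1994")
print(titles)
```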

15
caGRID Phase 1 Query Language
  • Extensible query language, derived from data
    sources, which can be expressed in XML and
    associated with an XML schema.
  • Data sources need only implement the "core" set
    of required tags; optional tags can be
    implemented as appropriate.
  • Example
  • <caGridXMLQuery name="caArrayQuery">
  •   <criteria name="gov.nih.nci.mageom.domain.Protocol.Protocol">
  •     <criterion name="identifier" condition="EQUAL_TO"
          value="P-MEXP-1963"/>
  •   </criteria>
  • </caGridXMLQuery>
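
One way a data service might interpret such a query is sketched below in Python. This is not the caGRID implementation. The object store is hypothetical, only the `EQUAL_TO` condition shown on the slide is handled, and combining multiple criteria with a logical AND is an assumption.

```python
import xml.etree.ElementTree as ET

QUERY = """<caGridXMLQuery name="caArrayQuery">
  <criteria name="gov.nih.nci.mageom.domain.Protocol.Protocol">
    <criterion name="identifier" condition="EQUAL_TO" value="P-MEXP-1963"/>
  </criteria>
</caGridXMLQuery>"""

# Only EQUAL_TO appears in the example; other conditions are not guessed at.
CONDITIONS = {"EQUAL_TO": lambda actual, expected: actual == expected}


def parse_query(xml_text):
    """Return the target class name and a list of (attribute, condition, value)."""
    root = ET.fromstring(xml_text)
    criteria = root.find("criteria")
    tests = [
        (c.get("name"), c.get("condition"), c.get("value"))
        for c in criteria.findall("criterion")
    ]
    return criteria.get("name"), tests


def run_query(xml_text, objects_by_class):
    """Filter the objects of the target class; criteria are ANDed (assumption)."""
    target, tests = parse_query(xml_text)
    return [
        obj
        for obj in objects_by_class.get(target, [])
        if all(CONDITIONS[cond](obj.get(attr), value) for attr, cond, value in tests)
    ]


# Hypothetical in-memory store keyed by class name.
STORE = {
    "gov.nih.nci.mageom.domain.Protocol.Protocol": [
        {"identifier": "P-MEXP-1963", "title": "example protocol"},
        {"identifier": "P-OTHER-1", "title": "another protocol"},
    ]
}

results = run_query(QUERY, STORE)
print(results)
```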

16
Languages vs. Requirements
NOTE: In this phase, distinctions between
versions of languages are not considered.
Marked cells indicate non-consensus or indecision.
Language families compared: family of SQL-based
languages; OQL-like languages (OQL, HQL, etc.);
family of XML query languages (XQuery, XPath,
XUpdate).
17
Approach Recommendations
  • The common query language will be the minimum
    entry point for caBIG data services.
  • Distributed Query Services and other higher-level
    services which interact with data services should
    do so through this language
  • This does not preclude data services from
    implementing other, more specialized, languages
  • They just must also provide an implementation of
    the common query language
  • Data Services should be classified in terms of
    their support of the common query language
  • e.g. Gold, Silver, Bronze
  • Clients will interact with the distributed query
    service which will implement the Gold Common
    Query Language
  • It will have the responsibility of translating
    Gold queries into multiple non-Gold queries to
    data services as appropriate
  • Current approach should be to focus on structural
    queries, then layer semantics
  • Structural queries can still leverage semantic
    information in the model (e.g. model built using
    semantic connector), but cannot express semantic
    questions (e.g. is X a subclass of Y?)
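
The idea of a distributed query service translating one higher-level query into several simpler queries can be sketched as follows. Everything here is hypothetical: the `DataService` and `DistributedQueryService` classes, the `select` and `join` methods, and the sample records are illustrations of the approach, not a proposed interface.

```python
class DataService:
    """A data service that answers only simple requests: one equality
    predicate on one field (a stand-in for a lower support level)."""

    def __init__(self, name, level, records):
        self.name = name
        self.level = level  # e.g. "Bronze", "Silver", "Gold"
        self.records = records

    def select(self, field, value):
        return [r for r in self.records if r.get(field) == value]


class DistributedQueryService:
    """Accepts a join request and decomposes it into simple selects."""

    def __init__(self, services):
        self.services = {s.name: s for s in services}

    def join(self, left, left_filter, left_key, right, right_key):
        field, value = left_filter
        pairs = []
        # One simple query to the left service ...
        for left_row in self.services[left].select(field, value):
            # ... then one simple query to the right service per left row.
            for right_row in self.services[right].select(right_key, left_row[left_key]):
                pairs.append((left_row, right_row))
        return pairs


go = DataService("go", "Bronze", [
    {"acc": "GO:1", "term": "vacuole"},
    {"acc": "GO:2", "term": "nucleus"},
])
genes = DataService("genes", "Bronze", [
    {"gbAcc": "GB1", "goAcc": "GO:1"},
    {"gbAcc": "GB2", "goAcc": "GO:2"},
    {"gbAcc": "GB3", "goAcc": "GO:1"},
])

dqs = DistributedQueryService([go, genes])
pairs = dqs.join("go", ("term", "vacuole"), "acc", "genes", "goAcc")
gene_ids = [gene["gbAcc"] for _, gene in pairs]
print(gene_ids)
```

The data services never see the join; they only answer single-predicate selects, which is the division of labor the bullets above describe.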

18
Language Recommendations
  • W3C XML Query
  • The mission of the XML Query project is to
    provide flexible query facilities to extract data
    from real and virtual documents on the World Wide
    Web, thereby finally providing the needed
    interaction between the Web world and the
    database world. Ultimately, collections of XML
    files will be accessed like databases. The
    ambitious task of the XML Query (XQuery) Working
    Group is to develop the first world standard for
    querying Web documents...
  • XQuery: http://www.w3.org/XML/Query
  • Use Cases: http://www.w3.org/TR/xquery-use-cases/
  • Requirements: http://www.w3.org/TR/xquery-requirements/
  • XPath: http://www.w3.org/TR/xpath20/
  • Full Text: http://www.w3.org/TR/xquery-full-text/
  • XUpdate (not W3C): http://xmldb-org.sourceforge.net/xupdate/

19
Preliminary Language Levels Recommendations
  • Gold Level Support
    • All of Silver
    • Full XQuery
  • Silver Level Support
    • All of Bronze
    • Full XPath, Limited XQuery (basic FLWR support)
  • Bronze Level Support
    • Limited XPath support with simple predicates
  • Optional Add-ons
    • Full-Text
    • XUpdate
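
To give a feel for what "limited XPath support with simple predicates" could look like, the sketch below uses the XPath subset built into Python's `xml.etree.ElementTree`. The slides do not define the exact Bronze subset, so this library's subset is only a stand-in, and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; gene symbols and locations are made up.
DATA = ET.fromstring("""<genes>
  <gene symbol="geneA"><location>vacuole</location></gene>
  <gene symbol="geneB"><location>nucleus</location></gene>
  <gene symbol="geneC"><location>vacuole</location></gene>
</genes>""")

# Simple attribute predicate: gene[@symbol='geneB']
by_attribute = DATA.findall("gene[@symbol='geneB']")

# Simple child-value predicate: gene[location='vacuole']
by_child = DATA.findall("gene[location='vacuole']")

attribute_hits = [g.get("symbol") for g in by_attribute]
vacuole_symbols = [g.get("symbol") for g in by_child]
print(attribute_hits, vacuole_symbols)
```

Anything beyond this kind of path plus single-predicate filtering (joins, aggregation, result construction) would fall to the higher levels.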

20
XQuery Example Description
  • Give me all the expression data where there are
    at least 50 conditions for genes found in the
    vacuole.

Slide callouts: vacuole; count > 50
21
XQuery Example
  • for $gene in
    service("http://cabio.osu.edu/GeneService.wsdl")/Gene,
    $go in
    service("http://cabio.osu.edu/GeneOntologyService.wsdl")/GeneOntology,
    $microarray in
    service("http://caarray.duke.edu/caArrayService.wsdl")/Microarray
  • let $subject := $microarray/experiment/subject
  • where $go/term = "vacuole" and
    $gene/goAcc = $go/acc and
    $gene/gbAcc = $microarray/data/geneId and
    count($microarray/data[geneId = $gene/gbAcc]/condition) > 50
  • return <subject>
    <subjectId>{ $subject/lsid }</subjectId>
    <species>{ $subject/species }</species>
    <microarrayData>{ $microarray/data }</microarrayData>
    </subject>

Slide callouts: join across objects; simple XPath;
complex XPath; result composition
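
The same join, filter, and result composition can be written out in plain Python over in-memory data, which may help readers unfamiliar with XQuery follow the logic. The three lists stand in for the three services; all records, field layouts, and the function name are hypothetical.

```python
# Hypothetical stand-ins for the Gene, GeneOntology, and Microarray services.
GENES = [
    {"gbAcc": "GB1", "goAcc": "GO:1"},
    {"gbAcc": "GB2", "goAcc": "GO:2"},
]
GO_TERMS = [
    {"acc": "GO:1", "term": "vacuole"},
    {"acc": "GO:2", "term": "nucleus"},
]
MICROARRAYS = [
    {
        "subject": {"lsid": "lsid:1", "species": "species-x"},
        "data": [
            {"geneId": "GB1", "conditions": list(range(60))},
            {"geneId": "GB2", "conditions": list(range(80))},
        ],
    },
    {
        "subject": {"lsid": "lsid:2", "species": "species-y"},
        "data": [{"geneId": "GB1", "conditions": list(range(10))}],
    },
]


def expression_data(genes, go_terms, microarrays, term="vacuole", min_conditions=50):
    results = []
    for gene in genes:                      # for $gene ...
        for go in go_terms:                 # $go ...
            # Join across objects: term filter plus gene/GO accession match.
            if go["term"] != term or gene["goAcc"] != go["acc"]:
                continue
            for microarray in microarrays:  # $microarray ...
                subject = microarray["subject"]          # let $subject := ...
                matching = [d for d in microarray["data"]
                            if d["geneId"] == gene["gbAcc"]]
                # count(...data[geneId = gbAcc]/condition) > 50
                if sum(len(d["conditions"]) for d in matching) > min_conditions:
                    results.append({        # return <subject>...</subject>
                        "subjectId": subject["lsid"],
                        "species": subject["species"],
                        "microarrayData": microarray["data"],
                    })
    return results


results = expression_data(GENES, GO_TERMS, MICROARRAYS)
subject_ids = [r["subjectId"] for r in results]
print(subject_ids)
```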
22
Implementation Recommendations
  • Focus on getting data services at Bronze and
    Silver Level
  • Leverage some tools for initial implementation
    (the implementation can change; the language and
    interfaces should not)
  • JXPath (http://jakarta.apache.org/commons/jxpath/)
    applies XPath expressions to graphs of objects
    of all kinds: JavaBeans, Maps, Servlet contexts,
    DOM, etc., including mixtures thereof.
  • A multitude of freely available parsers
  • Design Distributed Query Processor to support
    Gold Level queries against one or more Bronze and
    Silver Level data services
  • Develop short-term clients and applications on
    custom languages and Bronze Level
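
The idea behind applying path expressions to object graphs, as JXPath does for Java, can be illustrated with a very small Python analog. This is not JXPath's API and supports only plain child steps separated by "/"; the `evaluate` function, the `Subject` class, and the sample data are hypothetical.

```python
def evaluate(root, path):
    """Follow a slash-separated path through dicts, objects, and lists,
    returning every value reached."""
    nodes = [root]
    for step in path.strip("/").split("/"):
        found = []
        for node in nodes:
            if isinstance(node, dict):
                value = node.get(step)           # map-like lookup
            else:
                value = getattr(node, step, None)  # bean-like property lookup
            if value is None:
                continue
            if isinstance(value, (list, tuple)):
                found.extend(value)              # collections are flattened
            else:
                found.append(value)
        nodes = found
    return nodes


class Subject:
    def __init__(self, lsid, species):
        self.lsid = lsid
        self.species = species


# A mixed graph: a dict holding a list of plain objects.
experiment = {
    "name": "exp-1",
    "subjects": [Subject("lsid:1", "mouse"), Subject("lsid:2", "human")],
}

species = evaluate(experiment, "subjects/species")
print(species)
```

A data service built this way could expose its existing domain objects to path queries without first serializing them to XML.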

23
System Overview
24
What's next
  • We have currently focused on the query language
    itself
  • A high-level system architecture should be
    recommended
  • Incorporation of other SIG recommendations
    (Identifiers, Security, Workflow?)
  • Further evaluation of
  • Existing query engines
  • Tools to support data service implementation of
    the query language
  • Distributed Query Frameworks
  • A high-level system architecture for distributed
    query

25
Discussion
  • Open for questions and discussion.

26
  • Additional Information

27
Towards a Semantic Grid
  • XML provides syntax transport layer
  • RDF(S) provides basic relational language and
    simple ontological primitives
  • the meaning of Grid services, resources and
    entities by assertions in a common data model
  • OWL DL provides powerful but still decidable
    ontology language
  • publish and share consensually agreed upon
    ontologies
  • Further layers may (will) extend OWL
  • Query, filter, integrate and aggregate the
    metadata
  • Reason over metadata

28
Query Processing Models
29
OGSA-DAI Distributed Query Processor (DQP)
Architecture
  • Currently works with OQL and targets only
    relational grid services
  • May ultimately support XML data services
  • May (probably will) still use OQL as client
    language, but will use XPath to query data
    services
  • Timeline for XML may not work for us, but it is a
    useful reference implementation.
  • Code may not work for XML, but architecture
    should