Title: Query SIG Recommendations
1. caBIG Architecture Workspace
Common Query Language SIG: Summary and Initial Recommendations
Face to Face Meeting, Seattle, WA, March 16th and 17th, 2005
2. Agenda
- Overview of SIG
- Use Case Review
- Requirements Review
- Language Candidates
- Language Evaluation
- Recommendations
  - Approach
  - Language
  - Implementation
- What's next
- Discussion
3. Query SIG Mission
- To come to a consensus on the requirements,
properties, and details of the language which
will be used to query caBIG grid resources, and
also to define the requirements of a query engine
capable of performing the caBIG query use cases,
and create/identify a query language which meets
these criteria.
4. Data Sharing Vision
Courtesy of http://ccr.cancer.gov/news/frontiers/Sept_2004.pdf
5. Example Queries - caTIES
- All use cases will extend a basic functionality
that includes the ability to search for
particular text strings or concepts and
demographic information.
- For textual data, three general kinds of queries
will be supported:
  - Query by text - users will be able to enter
    strings and the system will search for documents
    that exactly match this string.
  - Query by concept - users will be able to enter
    strings that will be mapped to candidate EVS
    concepts, and users will be able to select one or
    more EVS concepts to be included in the query.
    During this process, users will be able to
    interact with either the Metathesaurus
    broader-than/narrower-than tree, or the NCI
    Thesaurus to further browse and refine their
    concepts.
  - Query by semantic type - users will be able to
    enter strings that will map to a subset of
    candidate EVS concepts based on predefined
    semantic types including Diagnosis, Procedure,
    and Organ. Users will be able to select one or
    more EVS concepts to be included in the query.
    During this process, users will be able to
    interact with either the Metathesaurus
    broader-than/narrower-than tree, or the NCI
    Thesaurus to further browse and refine their
    concepts.
- For non-textual data, users will be able to
constrain queries by age, gender, values for
extracted quantitative data such as tumor size,
grade and stage, and temporal relationships
between multiple reports for a single patient.
6. Example Queries - University of Pennsylvania
- I want to collect all microarray data (Affy only)
available from all cancer centers from patients
with bladder or ovarian cancer that were part of
any clinical trial protocol using cisplatin
within the past five years. In addition, I want
to know all available tissue samples, cancerous
and non-cancerous (normal) tissue localized
within 10mm of tumor site from this patient group
such that I can perform Affy gene expression
studies to include with previously performed
studies that were identified by the query.
Finally, I need all severe adverse events for the
group of patients identified that had a severity
rating of 3-4 and are likely linked to cisplatin
administration.
- I want all solid tumors, specifically for lung
cancer, that have a diagnosis based on tumor
pathology. Each diagnosis must have an image of
the tumor that allows for independent
verification of diagnoses. Each record retrieved
must also have either proteomics marker data or
microarray data (Affy or two-color) included so
that different molecular techniques can be
correlated to the tumor pathology. In addition, I
want all protein annotations for markers and
genes associated with the proteomics and
microarray data so I can perform meta-analyses.
- I want to retrieve a dataset for all patients
that have been in at least two clinical trials at
any cancer center throughout the US that had two
separate cancer diagnoses (not the same cancer
diagnosed twice, but two different cancers). I
also need a comprehensive treatment history for
each patient. If the treatment history is not
complete, I do not want the patient included in
the dataset.
7. More Use Cases
- I want all images of grade I epithelial ovarian
cancers in .tiff format at X100 resolution with
file size less than 20Mb where resection occurred
between 1997 and 2002.
- I want to make a tissue array which has benign
colonic epithelium, adenomatous tissue, local
adenocarcinoma, and metastases from as many colon
cancer patients as possible. I need all four
histologic classes of tissue on each patient, but
I don't care what kind of adenoma is used nor
where the metastatic tissue is from, as long as
it's metastatic colon cancer. All samples must be
available for use/consumption in a study I'm
proposing. (Arizona)
8. Requirements Matrix
9. Functional Requirements
- Expression
  - must allow the client to express complex queries
    easily and consistently.
- Structured data query
  - must minimally provide users and services the
    ability to query structured data correctly and
    completely.
- Full text search
  - should provide a full text query capability
    correctly and completely.
- Semantic query
  - must provide a capability to query semantic
    information correctly and completely.
- Semantic reasoning
  - should provide a capability to perform semantic
    reasoning, and express queries using semantic
    inferences.
- Data creation / update
  - should support the creation and modification of
    datasets.
- Core operations
  - must provide a moderate set of core operations.
- Operation extension
  - should support extensions to the core set of
    operations and transformations.
- Persistence
  - should support the concept of persistent queries.
10. Non-Functional Requirements
- XML expression
  - must be expressible in XML.
- XML results
  - must be consistent with result sets being in
    XML.
- Result set expression
  - should describe how result sets will be
    structured.
- Metadata consistency
  - must be straightforward for users to construct
    queries over data sets from examining an
    information model representation of the dataset.
- Implementation independent
  - must be data resource agnostic, and not have any
    dependencies on a particular underlying storage
    technology.
- Security
  - may need to have security-related constructs in
    it.
- Security of results
  - the language may need to formalize the security
    requirements of result sets.
- Standards-based
  - the language should follow a community accepted
    standard.
- Open implementation
  - the language should have a freely available query
    engine implementation.
11. Potential Languages
- SQL
- OQL
- XQuery
- caGRID Phase 1 Query
12. SQL: Structured Query Language
- SQL is an ANSI (American National Standards
Institute) standard computer language for
accessing and manipulating database systems. SQL
statements are used to retrieve and update data
in a database.
- SQL Data Manipulation Language (DML)
  - SELECT - extracts data from a database table
  - UPDATE - updates data in a database table
  - DELETE - deletes data from a database table
  - INSERT INTO - inserts new data into a database
    table
- SQL Data Definition Language (DDL)
  - CREATE TABLE - creates a new database table
  - ALTER TABLE - alters (changes) a database table
  - DROP TABLE - deletes a database table
  - CREATE INDEX - creates an index (search key)
  - DROP INDEX - deletes an index
- Basic Query Example
    SELECT <columns> FROM <tables> WHERE <column> <operator> <value>
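The basic SELECT pattern can be tried against an in-memory database. The following is a minimal sketch using Python's standard sqlite3 module; the specimen table, its columns, and its rows are made up for illustration and do not come from any caBIG schema.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimen (id INTEGER, site TEXT, tumor_size_mm REAL)")
conn.executemany(
    "INSERT INTO specimen VALUES (?, ?, ?)",
    [(1, "ovary", 12.5), (2, "bladder", 8.0), (3, "ovary", 30.0)],
)


def specimens_larger_than(min_size_mm):
    """Instance of SELECT <columns> FROM <tables> WHERE <column> <operator> <value>."""
    cursor = conn.execute(
        "SELECT id, site FROM specimen WHERE tumor_size_mm > ? ORDER BY id",
        (min_size_mm,),
    )
    return cursor.fetchall()


print(specimens_larger_than(10))
```

The value is passed as a bound parameter rather than pasted into the statement text, which is the usual way to keep a query template and its values separate.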
13. OQL: Object Query Language
- OQL is an object-oriented SQL-like query language
and is the query language of the ODMG-93 standard.
- OQL is a superset of the part of standard SQL
that allows you to query a database. Thus, any
SQL SELECT statement which runs on relational
tables works with the same syntax and semantics on
collections of ODMG objects. Extensions concern
object-oriented notions, like complex objects,
object identity, path expressions, polymorphism,
operation invocation, late binding, etc.
- Example Basic Query
    define jones as select distinct x from Students x
      where x.name = "Jones"
    select distinct student_id from jones
14. XQuery: XML Query Language
- XQuery is to XML what SQL is to databases;
defined by the W3C
- Built on XPath expressions
  - XPath is a syntax for defining parts of an XML
    document: path expressions, axes and node tests,
    and predicates
  - XPath defines a library of standard functions:
    node set functions, string functions, number
    functions, Boolean functions
  - XPath supports numerical, equality, relational,
    and Boolean expressions.
- FLWR (pronounced "flower") expression is the
analogue of the SELECT-FROM-WHERE construction in
SQL
  - FOR-clause binds one or more variables to a
    sequence of values returned by another expression
    (usually a path expression) and iterates over the
    values.
  - LET-clause also binds one or more variables but
    without iterating.
  - WHERE-clause contains one or more predicates
    that filter or limit the set of nodes
    generated by the FOR/LET-clauses.
  - RETURN-clause generates the output of the FLWR
    expression. The RETURN-clause usually contains
    one or more element constructors and/or
    references to variables and is executed once for
    each node-reference that is returned by the
    FOR/LET/WHERE-clauses.
- Example Basic Query
    FOR $b IN collection("bib")/bib/book
    WHERE $b/publisher/text() = "Addison-Wesley" AND $b/@year = "1994"
    RETURN $b/title
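To make the roles of the FOR, WHERE and RETURN clauses concrete, here is a rough Python analogue of the basic query. It is only a sketch: Python's standard library does not run XQuery, so a list comprehension over xml.etree.ElementTree stands in for the FLWR expression, and the small bibliography document is sample data assumed for the example.

```python
import xml.etree.ElementTree as ET

# Sample bibliography, assumed for illustration only.
BIB = """
<bib>
  <book year="1994"><title>TCP/IP Illustrated</title><publisher>Addison-Wesley</publisher></book>
  <book year="1992"><title>Advanced Programming in the Unix Environment</title><publisher>Addison-Wesley</publisher></book>
  <book year="2000"><title>Data on the Web</title><publisher>Morgan Kaufmann Publishers</publisher></book>
</bib>
"""


def titles(xml_text, publisher, year):
    """Mimic FOR ... WHERE ... RETURN over the <book> children of <bib>."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")                      # RETURN: build the output
        for book in root.findall("book")            # FOR: iterate over /bib/book
        if book.findtext("publisher") == publisher  # WHERE: filter the bindings
        and book.get("year") == year
    ]


print(titles(BIB, "Addison-Wesley", "1994"))
```

Unlike the XQuery version, this returns title strings rather than title elements; an XQuery engine would return the nodes themselves.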
15. caGRID Phase 1 Query Language
- Extensible query language, derived from data
sources, which can be expressed in XML and
associated with an XML schema.
- Data sources need only implement the "core" set
of required tags; optional tags can be
implemented as appropriate.
- Example
    <caGridXMLQuery name="caArrayQuery">
      <criteria name="gov.nih.nci.mageom.domain.Protocol.Protocol">
        <criterion name="identifier" condition="EQUAL_TO" value="P-MEXP-1963"/>
      </criteria>
    </caGridXMLQuery>
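As an illustration of how a data source might consume such a query, the sketch below parses the example document with Python's standard xml.etree.ElementTree and applies it to an in-memory record. The functions read_query and matches are hypothetical helpers, not part of caGRID, and only the EQUAL_TO condition from the example is handled; any other condition is treated as an unimplemented optional feature.

```python
import xml.etree.ElementTree as ET

QUERY = """
<caGridXMLQuery name="caArrayQuery">
  <criteria name="gov.nih.nci.mageom.domain.Protocol.Protocol">
    <criterion name="identifier" condition="EQUAL_TO" value="P-MEXP-1963"/>
  </criteria>
</caGridXMLQuery>
"""


def read_query(xml_text):
    """Hypothetical helper: turn the XML query into a plain dictionary."""
    root = ET.fromstring(xml_text)
    criteria = root.find("criteria")
    return {
        "query": root.get("name"),
        "target": criteria.get("name"),
        "criteria": [
            (c.get("name"), c.get("condition"), c.get("value"))
            for c in criteria.findall("criterion")
        ],
    }


def matches(record, parsed):
    """Hypothetical helper: check a dict-like record against every criterion."""
    for name, condition, value in parsed["criteria"]:
        if condition != "EQUAL_TO":
            # Conditions beyond the one in the example are not modelled here.
            raise NotImplementedError(condition)
        if record.get(name) != value:
            return False
    return True


parsed = read_query(QUERY)
print(parsed["target"], matches({"identifier": "P-MEXP-1963"}, parsed))
```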
16. Languages vs. Requirements
NOTE: In this phase, distinctions between
versions of languages are not considered.
Indicates non-consensus or indecision
- Family of SQL-based languages
- OQL-like languages (OQL, HQL, etc.)
- Family of XML Query languages (XQuery, XPath, XUpdate)
17. Approach Recommendations
- The common query language will be the minimum
entry point for caBIG data services.
- Distributed Query Services and other higher-level
services which interact with data services should
do so through this language.
- This does not preclude data services from
implementing other, more specialized, languages.
  - They just must also provide an implementation of
    the common query language.
- Data Services should be classified in terms of
their support of the common query language.
  - e.g. Gold, Silver, Bronze
- Clients will interact with the distributed query
service, which will implement the Gold Common
Query Language.
  - It will have the responsibility of translating
    Gold queries into multiple non-Gold queries to
    data services as appropriate.
- Current approach should be to focus on structural
queries, then layer semantics.
  - Structural queries can still leverage semantic
    information in the model (e.g. model built using
    semantic connector), but can't express semantic
    questions (e.g. is X a subclass of Y?)
18. Language Recommendations
- W3C XML Query
  - The mission of the XML Query project is to
    provide flexible query facilities to extract data
    from real and virtual documents on the World Wide
    Web, thereby finally providing the needed
    interaction between the Web world and the
    database world. Ultimately, collections of XML
    files will be accessed like databases. The
    ambitious task of the XML Query (XQuery) Working
    Group is to develop the first world standard for
    querying Web documents...
- XQuery: http://www.w3.org/XML/Query
- Use Cases: http://www.w3.org/TR/xquery-use-cases/
- Requirements: http://www.w3.org/TR/xquery-requirements/
- XPath: http://www.w3.org/TR/xpath20/
- Full Text: http://www.w3.org/TR/xquery-full-text/
- XUpdate (not W3C): http://xmldb-org.sourceforge.net/xupdate/
19. Preliminary Language Levels Recommendations
- Gold Level Support
  - All of Silver
  - Full XQuery
- Silver Level Support
  - All of Bronze
  - Full XPath, Limited XQuery (basic FLWR support)
- Bronze Level Support
  - Limited XPath support with simple predicates
- Optional Add-ons
  - Full-Text
  - XUpdate
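Because each level includes everything in the level below it, the levels form a simple ordering. The sketch below shows one way that ordering could be represented so that a query processor can tell which services are able to take a query as-is. The names Level, can_answer_directly and services_needing_translation, as well as the service names in the usage example, are assumptions made for illustration and do not come from the SIG recommendations.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical encoding of the preliminary support levels."""
    BRONZE = 1  # limited XPath with simple predicates
    SILVER = 2  # all of Bronze, full XPath, limited XQuery (basic FLWR)
    GOLD = 3    # all of Silver, full XQuery


def can_answer_directly(service_level, query_level):
    """A service can take a query as-is if its level is at least the query's."""
    return service_level >= query_level


def services_needing_translation(services, query_level):
    """Names of services for which the query must first be rewritten."""
    return sorted(
        name for name, level in services.items() if level < query_level
    )


# Made-up registry of data services and their declared levels.
services = {"arrays": Level.BRONZE, "genes": Level.SILVER, "dqp": Level.GOLD}
print(services_needing_translation(services, Level.GOLD))
```

The optional add-ons (Full-Text, XUpdate) are independent of this ordering and would need to be tracked separately, for example as a set of capability flags.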
20. XQuery Example Description
- Give me all the expression data where there are
at least 50 conditions for genes found in the
vacuole.
Callouts: vacuole; count > 50
21. XQuery Example
    FOR $gene IN service("http://cabio.osu.edu/GeneService.wsdl")/Gene,
        $go IN service("http://cabio.osu.edu/GeneOntologyService.wsdl")/GeneOntology,
        $microarray IN service("http://caarray.duke.edu/caArrayService.wsdl")/Microarray
    LET $subject := $microarray/experiment/subject
    WHERE $go/term = "vacuole"
      AND $gene/goAcc = $go/acc
      AND $gene/gbAcc = $microarray/data/geneId
      AND count($microarray/data[geneId = $gene/gbAcc]/condition) > 50
    RETURN
      <subject>
        <subjectId> { $subject/lsid } </subjectId>
        <species> { $subject/species } </species>
        <microarrayData> { $microarray/data } </microarrayData>
      </subject>
Callouts: Join across objects; Simple XPath; Complex XPath; Result composition
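The join that this example expresses across the gene, gene ontology and microarray services can be spelled out procedurally. The following sketch runs the same logic over in-memory Python data; the dictionary layout, the function name vacuole_subjects, and all sample values (accessions, identifiers, species) are assumptions made for illustration, and no grid service is contacted.

```python
def vacuole_subjects(genes, go_terms, microarrays, min_conditions=50):
    """Nested-loop version of the FOR / LET / WHERE / RETURN query."""
    results = []
    for gene in genes:                       # FOR $gene
        for go in go_terms:                  # FOR $go
            # WHERE $go/term = "vacuole" AND $gene/goAcc = $go/acc
            if go["term"] != "vacuole" or gene["goAcc"] != go["acc"]:
                continue
            for array in microarrays:        # FOR $microarray
                subject = array["experiment"]["subject"]   # LET $subject
                # $microarray/data[geneId = $gene/gbAcc]
                rows = [d for d in array["data"] if d["geneId"] == gene["gbAcc"]]
                # A count above the threshold also implies the gene occurs
                # in this microarray's data, so no separate check is needed.
                if sum(len(d["conditions"]) for d in rows) > min_conditions:
                    results.append({        # RETURN <subject>...</subject>
                        "subjectId": subject["lsid"],
                        "species": subject["species"],
                        "microarrayData": array["data"],
                    })
    return results


# Made-up sample data.
GENES = [{"gbAcc": "G1", "goAcc": "acc-1"}, {"gbAcc": "G2", "goAcc": "acc-2"}]
GO_TERMS = [{"acc": "acc-1", "term": "vacuole"}, {"acc": "acc-2", "term": "nucleus"}]
MICROARRAYS = [
    {"experiment": {"subject": {"lsid": "subject-1", "species": "yeast"}},
     "data": [{"geneId": "G1", "conditions": list(range(60))}]},
    {"experiment": {"subject": {"lsid": "subject-2", "species": "yeast"}},
     "data": [{"geneId": "G1", "conditions": list(range(10))},
              {"geneId": "G2", "conditions": list(range(80))}]},
]

print([r["subjectId"] for r in vacuole_subjects(GENES, GO_TERMS, MICROARRAYS)])
```

A real distributed query processor would presumably push the filters down to each service instead of fetching everything and looping, but the result set it must produce is the same.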
22. Implementation Recommendations
- Focus on getting data services at Bronze and
Silver Level
- Leverage some tools for initial implementation
(the implementation can change; the language and
interfaces should not)
  - JXPath (http://jakarta.apache.org/commons/jxpath/)
    applies XPath expressions to graphs of objects
    of all kinds: JavaBeans, Maps, Servlet contexts,
    DOM etc., including mixtures thereof.
  - Multitude of freely available parsers
- Design Distributed Query Processor to support
Gold Level queries against one or more Bronze and
Silver Level data services
- Develop short-term clients and applications on
custom languages and Bronze Level
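The idea of applying path expressions to a graph of ordinary objects can be shown with a toy evaluator. The sketch below is not JXPath and does not implement XPath: it supports only "/"-separated child steps over dictionaries, object attributes and lists. The select function and the Protocol and Experiment classes are hypothetical, introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Protocol:
    identifier: str


@dataclass
class Experiment:
    protocols: list = field(default_factory=list)


def select(root, path):
    """Follow '/'-separated child steps through dicts, objects and lists."""
    current = [root]
    for step in path.strip("/").split("/"):
        next_nodes = []
        for node in current:
            if isinstance(node, dict):
                value = node.get(step)
            else:
                value = getattr(node, step, None)
            if value is None:
                continue  # this branch has no such child
            # Collections fan out into several nodes, like an XPath node set.
            if isinstance(value, (list, tuple)):
                next_nodes.extend(value)
            else:
                next_nodes.append(value)
        current = next_nodes
    return current


# A mixed graph: a dict holding an object that holds a list of objects.
graph = {"experiment": Experiment([Protocol("P-MEXP-1963"), Protocol("P-EXAMPLE-1")])}
print(select(graph, "experiment/protocols/identifier"))
```

Predicates, axes and functions are what separate this toy from even the Bronze level; a data service would rely on an existing XPath library for those.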
23. System Overview
24. What's next
- We have currently focused on the query language
itself
- A high-level system architecture should be
recommended
- Incorporation of other SIG recommendations
(Identifiers, Security, Workflow?)
- Further evaluation of
  - Existing query engines
  - Tools to support data service implementation of
    the query language
  - Distributed Query Frameworks
- A high-level system architecture for distributed
query
25. Discussion
- Open for questions and discussion.
27. Towards a Semantic Grid
- XML provides syntax transport layer
- RDF(S) provides basic relational language and
simple ontological primitives
  - the meaning of Grid services, resources and
    entities by assertions in a common data model
- OWL DL provides powerful but still decidable
ontology language
  - publish and share consensually agreed upon
    ontologies
- Further layers may (will) extend OWL
- Query, filter, integrate and aggregate the
metadata
- Reason over metadata
28. Query Processing Models
29. OGSA-DAI Distributed Query Processor (DQP)
Architecture
- Currently works with OQL and targets only
relational grid services
- May ultimately support XML data services
- May (probably will) still use OQL as client
language, but will use XPath to query data
services
- Timeline for XML may not work for us, but it is a
useful reference implementation.
- Code may not work for XML, but architecture
should