Query SIG Recommendations

Slides: 30

Provided by: scott136


1
caBIG Architecture Workspace Common Query
Language SIG: Summary and Initial Recommendations
Face to Face Meeting, Seattle, WA, March 16th and
17th, 2005
2
Agenda

Overview of SIG
Use Case Review
Requirements Review
Language Candidates
Language Evaluation
Recommendations
Approach
Language
Implementation
What's next
Discussion

3
Query SIG Mission

To come to a consensus on the requirements,
properties, and details of the language which
will be used to query caBIG grid resources, and
also to define the requirements of a query engine
capable of performing the caBIG query use cases,
and create/identify a query language which meets
these criteria.

4
Data Sharing Vision
Courtesy of http://ccr.cancer.gov/news/frontiers/Sept_2004.pdf
5
Example Queries - caTIES

All use cases will extend a basic functionality
that includes the ability to search for
particular text strings or concepts and
demographic information.
For textual data, three general kinds of queries
will be supported
Query by text - users will be able to enter
strings and the system will search for documents
that exactly match this string
Query by concept - users will be able to enter
strings that will be mapped to candidate EVS
concepts, and users will be able to select one or
more EVS concepts to be included in the query.
During this process, users will be able to
interact with either the Metathesaurus
broader-than/narrower-than tree, or the NCI
Thesaurus to further browse and refine their
concepts.
Query by semantic type - users will be able to
enter strings that will map to a subset of
candidate EVS concepts based on predefined
semantic types including Diagnosis, Procedure,
and Organ. Users will be able to select one or
more EVS concepts to be included in the query.
During this process, users will be able to
interact with either the Metathesaurus
broader-than/narrower-than tree, or the NCI
Thesaurus to further browse and refine their
concepts.
For non-textual data, users will be able to
constrain queries by age, gender, values for
extracted quantitative data such as tumor size,
grade and stage, and temporal relationships
between multiple reports for a single patient.

6
Example Queries - University of Pennsylvania

I want to collect all microarray data (Affy only)
available from all cancer centers from patients
with bladder or ovarian cancer that were part of
any clinical trial protocol using cisplatin
within the past five years. In addition, I want
to know all available tissue samples, cancerous
and non-cancerous (normal) tissue localized
within 10mm of tumor site from this patient group
such that I can perform Affy gene expression
studies to include with previously performed
studies that were identified by the query.
Finally, I need all severe adverse events for the
group of patients identified that had a severity
rating of 3-4 and are likely linked to cisplatin
administration.
I want all solid tumors, specifically for lung
cancer, that have a diagnosis based on tumor
pathology. Each diagnosis must have an image of
the tumor that allows for independent
verification of diagnoses. Each record retrieved
must also have either proteomics marker data or
microarray data (Affy or two-color) included so
that different molecular techniques can be
correlated to the tumor pathology. In addition, I
want all protein annotations for markers and
genes associated with the proteomics and
microarray data so I can perform meta-analyses.
I want to retrieve a dataset for all patients
that have been in at least two clinical trials at
any cancer center throughout the US that had two
separate cancer diagnoses (not the same cancer
diagnosed twice, but two different cancers). I
also need a comprehensive treatment history for
each patient. If the treatment history is not
complete, I do not want the patient included in the dataset.

7
More Use Cases

I want all images of grade I epithelial ovarian
cancers in .tiff format at X100 resolution with
file size less than 20Mb where resection occurred
between 1997 and 2002.

I want to make a tissue array which has benign
colonic epithelium, adenomatous tissue, local
adenocarcinoma, and metastases from as many colon
cancer patients as possible. I need all four
histologic classes of tissue on each patient, but
I don't care what kind of adenoma is used nor
where the metastatic tissue is from, as long as
it's metastatic colon cancer. All samples must be
available for use/consumption in a study I'm
proposing. (Arizona)

8
Requirements Matrix
9
Functional Requirements

Expression
must allow the client to express complex queries
easily and consistently
Structured data query
must minimally provide users and services the
ability to query structured data correctly and
completely.
Full text search
should provide a full text query capability
correctly and completely.
Semantic query
must provide a capability to query semantic
information correctly and completely.
Semantic reasoning
should provide a capability to perform semantic
reasoning, and express queries using semantic
inferences.
Data creation / update
should support the creation and modification of
datasets.
Core operations
must provide a moderate set of core operations.
Operation extension
should support extensions to the core set of
operations and transformations.
Persistence
should support the concept of persistent queries.

10
Non-Functional Requirements

XML expression
must be expressible in XML
XML results
must be consistent with results sets being in
XML.
Result set expression
should describe how result sets will be
structured.
Metadata consistency
must be straightforward for users to construct
queries over data sets from examining an
information model representation of the dataset.
Implementation independent
must be data resource agnostic, and not have any
dependencies on a particular underlying storage
technology.
Security
may need to have security-related constructs in
it.
Security of results
the language may need to formalize the security
requirements of result sets.
Standards-based
the language should follow a community accepted
standard.
Open implementation
the language should have a freely available query
engine implementation.

11
Potential Languages

SQL
OQL
XQuery
caGRID Phase 1 Query

12
SQL Structured Query Language

SQL is an ANSI (American National Standards
Institute) standard computer language for
accessing and manipulating database systems. SQL
statements are used to retrieve and update data
in a database.
SQL Data Manipulation Language (DML)
SELECT - extracts data from a database table
UPDATE - updates data in a database table
DELETE - deletes data from a database table
INSERT INTO - inserts new data into a database
table
SQL Data Definition Language (DDL)
CREATE TABLE - creates a new database table
ALTER TABLE - alters (changes) a database table
DROP TABLE - deletes a database table
CREATE INDEX - creates an index (search key)
DROP INDEX - deletes an index
Basic Query Example
SELECT <columns> FROM <tables> WHERE <column>
<operator> <value>
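
As a minimal sketch of the DDL and DML statements listed on this slide, the following uses Python's built-in sqlite3 module. The specimen table, its columns, and its rows are hypothetical and exist only for illustration; they are not part of any caBIG data model.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")

# DDL: CREATE TABLE creates a new database table.
conn.execute(
    "CREATE TABLE specimen (id INTEGER PRIMARY KEY, site TEXT, size_mm REAL)"
)

# DML: INSERT INTO inserts new data into a database table.
conn.executemany(
    "INSERT INTO specimen (site, size_mm) VALUES (?, ?)",
    [("bladder", 12.5), ("ovary", 8.0), ("lung", 21.0)],
)


def select_sites(min_size_mm):
    """Basic query shape: SELECT <columns> FROM <tables> WHERE <column> <operator> <value>."""
    cursor = conn.execute(
        "SELECT site FROM specimen WHERE size_mm > ? ORDER BY site",
        (min_size_mm,),
    )
    return [row[0] for row in cursor]


print(select_sites(10))
```

The value in the WHERE clause is passed as a bound parameter rather than concatenated into the statement, which is the usual way to keep such queries safe.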

13
OQL Object Query Language

OQL is an object-oriented SQL-like query language
and is the query language of the ODMG-93 standard
OQL is a superset of the standard SQL part which
allows you to query a database. Thus, any select
SQL sentence which runs on relational tables,
works with the same syntax and semantics on
collections of ODMG objects. Extensions concern
Object Oriented notions, like complex objects,
object identity, path expression, polymorphism,
operation invocation, late binding etc...
Example Basic Query
define jones as select distinct x from Students x
where x.name = "Jones"
select distinct student_id from jones

14
XQuery XML Query Language

XQuery is to XML what SQL is to databases
Defined by the W3C
Built on XPath expressions
XPath is a syntax for defining parts of an XML
document: path expressions, axes and node tests,
and predicates
XPath defines a library of standard functions:
node set functions, string functions, number
functions, Boolean functions
XPath supports numerical, equality, relational,
and Boolean expressions.
FLWR (pronounced "flower") expression is the
analogue of the SELECT-FROM-WHERE construction in
SQL
FOR-clause binds one or more variables to a
sequence of values returned by another expression
(usually a path expression) and iterates over the
values.
LET-clause also binds one or more variables but
without iterating.
WHERE-clause contains one or more predicates
that filters or limits the set of nodes as
generated by the FOR/LET-clauses.
RETURN-clause generates the output of the FLWR
expression. The RETURN-clause usually contains
one or more element constructors and/or
references to variables and is executed once for
each node-reference that is returned by the
FOR/LET/WHERE-clauses.
Example Basic Query
for $b in collection("bib")/bib/book
where $b/publisher/text() = "Addison-Wesley" and
$b/@year = "1994"
return $b/title
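
To make the roles of the four clauses concrete, here is a rough Python analogue of the basic query above, using the standard library's xml.etree.ElementTree. It is a sketch of the evaluation order only, not an XQuery engine; the sample bibliography document and the titles function are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document standing in for collection("bib"); contents are illustrative.
BIB = ET.fromstring("""
<bib>
  <book year="1994"><title>Book A</title><publisher>Addison-Wesley</publisher></book>
  <book year="1992"><title>Book B</title><publisher>Addison-Wesley</publisher></book>
  <book year="1994"><title>Book C</title><publisher>Other Press</publisher></book>
</bib>
""")


def titles(publisher, year):
    result = []
    # FOR: bind a variable to each item of a path expression and iterate.
    for book in BIB.findall("book"):
        # LET: bind a variable once per iteration, without iterating itself.
        book_publisher = book.findtext("publisher")
        # WHERE: keep only the bindings that satisfy the predicates.
        if book_publisher == publisher and book.get("year") == year:
            # RETURN: produce one output item per surviving binding.
            result.append(book.findtext("title"))
    return result


print(titles("Addison-Wesley", "1994"))
```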

15
caGRID Phase 1 Query Language

Extensible query language, derived from data
sources, which can be expressed in XML and
associated with an XML schema.
Data sources can only implement the "core" set of
required tags and optional tags can be
implemented as appropriate.
Example
<caGridXMLQuery name="caArrayQuery">
<criteria name="gov.nih.nci.mageom.domain.Protocol.Protocol">
<criterion name="identifier" condition="EQUAL_TO"
value="P-MEXP-1963"/>
</criteria>
</caGridXMLQuery>
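
One way a data source might consume such a query is sketched below in Python. This is an assumption about how the XML could be interpreted, not the caGRID implementation: parse_query and matches are hypothetical helpers, and only the EQUAL_TO condition from the example is handled because no other conditions are given here.

```python
import xml.etree.ElementTree as ET

# The example query from the slide, with its XML markup restored.
QUERY = """
<caGridXMLQuery name="caArrayQuery">
  <criteria name="gov.nih.nci.mageom.domain.Protocol.Protocol">
    <criterion name="identifier" condition="EQUAL_TO" value="P-MEXP-1963"/>
  </criteria>
</caGridXMLQuery>
"""


def parse_query(xml_text):
    """Extract the target class name and the list of criterion triples."""
    root = ET.fromstring(xml_text)
    criteria = root.find("criteria")
    return {
        "query": root.get("name"),
        "target": criteria.get("name"),
        "criteria": [
            (c.get("name"), c.get("condition"), c.get("value"))
            for c in criteria.findall("criterion")
        ],
    }


def matches(obj, parsed):
    """Hypothetical evaluator: check a dict-like object against every criterion."""
    for name, condition, value in parsed["criteria"]:
        if condition != "EQUAL_TO":
            # Other conditions are not described in the example, so they are rejected.
            raise ValueError(f"unsupported condition: {condition}")
        if obj.get(name) != value:
            return False
    return True


parsed = parse_query(QUERY)
print(parsed["target"], matches({"identifier": "P-MEXP-1963"}, parsed))
```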

16
Languages vs. Requirements
NOTE: In this phase, distinctions between
versions of languages are not considered.
Legend: a marker indicates non-consensus or indecision
Languages compared: family of SQL-based languages;
OQL-like languages (OQL, HQL, etc.); family of XML
query languages (XQuery, XPath, XUpdate)
17
Approach Recommendations

The common query language will be the minimum
entry point for caBIG data services.
Distributed Query Services and other higher-level
services which interact with data services should
do so through this language
This does not preclude data services from
implementing other, more specialized, languages
They just must also provide an implementation of
the common query language
Data Services should be classified in terms of
their support of the common query language
e.g. Gold, Silver, Bronze
Clients will interact with the distributed query
service which will implement the Gold Common
Query Language
It will have the responsibility of translating
Gold queries into multiple non-Gold queries to
data services as appropriate
Current approach should be to focus on structural
queries, then layer semantics
Structural queries can still leverage semantic
information in the model (e.g. model built using
semantic connector), but cannot express semantic
questions (e.g. is X a subclass of Y?)

18
Language Recommendations

W3C XML Query
The mission of the XML Query project is to
provide flexible query facilities to extract data
from real and virtual documents on the World Wide
Web, thereby finally providing the needed
interaction between the Web world and the
database world. Ultimately, collections of XML
files will be accessed like databases. The
ambitious task of the XML Query (XQuery) Working
Group is to develop the first world standard for
querying Web documents...
XQuery: http://www.w3.org/XML/Query
Use Cases: http://www.w3.org/TR/xquery-use-cases/
Requirements: http://www.w3.org/TR/xquery-requirements/
XPath: http://www.w3.org/TR/xpath20/
Full Text: http://www.w3.org/TR/xquery-full-text/
XUpdate (not W3C): http://xmldb-org.sourceforge.net/xupdate/

19
Preliminary Language Levels Recommendations

Gold Level Support
All of Silver
Full XQuery
Silver Level Support
All of Bronze
Full XPath, Limited XQuery (basic FLWR support)
Bronze Level Support
Limited XPath support with simple predicates
Optional Add-ons
Full-Text
XUpdate
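
The slides do not define exactly which XPath subset counts as Bronze. As a rough illustration of what "limited XPath support with simple predicates" could look like, the sketch below uses the XPath subset built into Python's xml.etree.ElementTree, which supports child paths plus attribute and child-text equality predicates. The gene document and the bronze_query helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data held by a data service; contents are illustrative only.
DATA = ET.fromstring("""
<genes>
  <gene symbol="A1"><organism>human</organism></gene>
  <gene symbol="B2"><organism>mouse</organism></gene>
  <gene symbol="C3"><organism>human</organism></gene>
</genes>
""")


def bronze_query(path):
    """Evaluate a limited XPath expression and return the matching gene symbols."""
    return [gene.get("symbol") for gene in DATA.findall(path)]


# Simple predicate on a child element's text.
print(bronze_query("gene[organism='human']"))
# Simple predicate on an attribute value.
print(bronze_query("gene[@symbol='B2']"))
```

Under this reading, functions, arbitrary axes, and FLWR expressions would be left to the Silver and Gold levels.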

20
XQuery Example Description

Give me all the expression data where there are
at least 50 conditions for genes found in the
vacuole.

vacuole
count > 50
21
XQuery Example

for $gene in service("http://cabio.osu.edu/GeneService.wsdl")/Gene,
    $go in service("http://cabio.osu.edu/GeneOntologyService.wsdl")/GeneOntology,
    $microarray in service("http://caarray.duke.edu/caArrayService.wsdl")/Microarray
let $subject := $microarray/experiment/subject
where $go/term = "vacuole"
  and $gene/goAcc = $go/acc
  and $gene/gbAcc = $microarray/data/geneId
  and count($microarray/data[geneId = $gene/gbAcc]/condition) > 50
return
  <subject>
    <subjectId>{ $subject/lsid }</subjectId>
    <species>{ $subject/species }</species>
    <microarrayData>{ $microarray/data }</microarrayData>
  </subject>

Join across objects
Simple XPath
Complex XPath
Result composition
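
The four callouts (join across objects, simple XPath, complex XPath, result composition) can be traced in a plain Python sketch of the same query. The three in-memory lists stand in for the gene, gene ontology, and microarray services; all records, identifiers, and the vacuole_subjects function are hypothetical, and a real distributed query processor would push the predicates down to the services rather than loop over everything.

```python
# Hypothetical records standing in for the three data services.
GENES = [
    {"gbAcc": "G1", "goAcc": "GO:1"},
    {"gbAcc": "G2", "goAcc": "GO:2"},
]
GO_TERMS = [
    {"acc": "GO:1", "term": "vacuole"},
    {"acc": "GO:2", "term": "nucleus"},
]
MICROARRAYS = [
    {
        "subject": {"lsid": "S1", "species": "yeast"},
        "data": [
            {"geneId": "G1", "conditions": list(range(60))},
            {"geneId": "G2", "conditions": list(range(80))},
        ],
    },
    {
        "subject": {"lsid": "S2", "species": "yeast"},
        "data": [{"geneId": "G1", "conditions": list(range(10))}],
    },
]


def vacuole_subjects(term="vacuole", min_conditions=50):
    results = []
    # Join across objects: one loop variable per FOR binding.
    for gene in GENES:
        for go in GO_TERMS:
            for microarray in MICROARRAYS:
                subject = microarray["subject"]  # LET binding
                # Complex XPath: data[geneId = gene/gbAcc]/condition
                matching = [
                    d for d in microarray["data"] if d["geneId"] == gene["gbAcc"]
                ]
                condition_count = sum(len(d["conditions"]) for d in matching)
                if (
                    go["term"] == term  # simple XPath comparison
                    and gene["goAcc"] == go["acc"]
                    and matching
                    and condition_count > min_conditions
                ):
                    # Result composition: build the output structure.
                    results.append(
                        {
                            "subjectId": subject["lsid"],
                            "species": subject["species"],
                            "microarrayData": microarray["data"],
                        }
                    )
    return results


print([r["subjectId"] for r in vacuole_subjects()])
```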
22
Implementation Recommendations

Focus on getting data services at Bronze and
Silver Level
Leverage some tools for initial implementation
(the implementation can change; the language and
interfaces should not)
JXPath (http://jakarta.apache.org/commons/jxpath/)
applies XPath expressions to graphs of objects
of all kinds: JavaBeans, Maps, Servlet contexts,
DOM etc., including mixtures thereof.
Multitude of freely available parsers
Design Distributed Query Processor to support
Gold Level queries against one or more Bronze and
Silver Level data services
Develop short-term clients and applications on
custom languages and Bronze Level

23
System Overview
24
What's next

We have currently focused on the query language
itself
A high-level system architecture should be
recommended
Incorporation of other SIG recommendations
(Identifiers, Security, Workflow?)
Further evaluation of
Existing query engines
Tools to support data service implementation of
the query language
Distributed Query Frameworks
A high-level system architecture for distributed
query

25
Discussion

Open for questions and discussion.

Additional Information

27
Towards a Semantic Grid

XML provides syntax transport layer
RDF(S) provides basic relational language and
simple ontological primitives
the meaning of Grid services, resources and
entities by assertions in a common data model
OWL DL provides powerful but still decidable
ontology language
publish and share consensually agreed upon
ontologies
Further layers may (will) extend OWL
Query, filter, integrate and aggregate the
metadata
Reason over metadata

28
Query Processing Models
29
OGSA-DAI Distributed Query Processor (DQP)
Architecture

Currently works with OQL and targets only
relational grid services
May ultimately support XML data services
May (probably will) still use OQL as client
language, but will use XPath to query data
services
Timeline for XML may not work for us, but it is a
useful reference implementation.
Code may not work for XML, but architecture
should