Title: An Architecture for Real-Time Warehousing of Scientific Data
1 An Architecture for Real-Time Warehousing of Scientific Data
Ramon Lawrence and Anton Kruger
IIHR, University of Iowa
ramon-lawrence_at_uiowa.edu
http://www.cs.uiowa.edu/rlawrenc/
http://www.iihr.uiowa.edu/hml/projects/nexrad-itr
2 Overview
- Our goal is to build a general archival architecture for storing and querying massive amounts of scientific data.
- This presentation will discuss our current architecture and how it is being used in a national project to archive weather radar data in the United States.
- The architecture achieves four basic design goals:
  - 1) scalable - can handle terabyte-scale data sets
  - 2) extensible - types of data and metadata stored can change
  - 3) inexpensive - uses cheap hardware and open-source software
  - 4) usable - researchers can interact with the system in a variety of intuitive ways
3 Motivation
- The size of scientific data sets in many domains is increasing dramatically. This is placing a burden on IT infrastructure for storing, processing, and querying the data effectively.
- As sensor networks are deployed, this will get even worse.
- Although data warehousing techniques are well known, managing data sets of this scale is still an impediment to research.
- One of the most basic challenges is finding data relevant to the research (the data finding problem). To avoid browsing a large data set, suitable metadata describing the data must be generated, stored, and made queryable by the researcher.
4 Desirable Architecture Properties
- Our architecture is designed with four key properties:
  - 1) scalable - The system can accommodate more data simply by adding low-cost PCs. Data files are transparently allocated and replicated across nodes without custom hardware/software.
  - 2) extensible - The types of metadata generated and stored may change over time as the research evolves.
  - 3) inexpensive - Low-cost hardware and open-source software are used.
  - 4) usable - Researchers can interact with the data archive in a variety of ways, including directly through C code, web forms, or web services.
5 Archive Architecture Overview
6 Architecture Components
- The components:
  - Extractor - the only component specific to the data set. It is the code module that computes the desired metadata statistics on the data. Its output is XML conforming to a standard schema defined by the Loader.
  - Loader - the module responsible for storing metadata in the database and using rules to place data files on retrieval servers. This component is not data set specific. Different and evolving metadata is supported by a general database schema.
  - Metadata archive - a relational database that stores the metadata and pointers to the data. SQL queries are built using the various front-end tools (C code, web interface, etc.) to query the metadata and find data with specific properties, along with its file locations.
  - Retrieval server - any machine capable of running an HTTP server and acting as a data file store.
7 Case Study: Archiving NEXRAD Data
Our goal is to provide the community with access to the vast archives and real-time data collected by the NEXRAD system.
- There are over 150 NEXt generation RADars (NEXRAD) that collect real-time precipitation data across the United States.
- The system has been operational for about 10 years, and the amount of collected data is continually expanding.
- How a radar works:
  - A radar emits a coherent train of microwave pulses and processes reflected pulses.
  - Each processed pulse corresponds to a bin. There are multiple bins in a ray (beam). Rotating the radar 360° is a sweep. After a sweep the radar elevation angle is increased, and another sweep is performed. All sweeps together form a volume.
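The bin / ray / sweep / volume hierarchy described above can be pictured as nested containers. The following is a minimal sketch only: the class names, field names, and the toy sizes are assumptions made for illustration and do not reflect the actual NEXRAD Level II file layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ray:
    """One beam at a fixed azimuth; each bin holds one processed pulse."""
    azimuth_deg: float
    bins: List[float]


@dataclass
class Sweep:
    """One full 360-degree rotation at a fixed elevation angle."""
    elevation_deg: float
    rays: List[Ray] = field(default_factory=list)


@dataclass
class Volume:
    """All sweeps of one scan, from the lowest to the highest elevation."""
    sweeps: List[Sweep] = field(default_factory=list)

    def bin_count(self) -> int:
        return sum(len(ray.bins) for sweep in self.sweeps for ray in sweep.rays)


def make_volume(elevations, rays_per_sweep, bins_per_ray):
    """Build an empty toy volume (hypothetical helper, sizes are made up)."""
    volume = Volume()
    for elevation in elevations:
        sweep = Sweep(elevation_deg=elevation)
        for i in range(rays_per_sweep):
            azimuth = 360.0 * i / rays_per_sweep
            sweep.rays.append(Ray(azimuth_deg=azimuth, bins=[0.0] * bins_per_ray))
        volume.sweeps.append(sweep)
    return volume


# Two sweeps, four rays per sweep, three bins per ray (toy numbers).
toy_volume = make_volume([0.5, 1.5], rays_per_sweep=4, bins_per_ray=3)
print(toy_volume.bin_count())
```

A real volume has far more rays and bins than this toy; the point is only the nesting.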
8 Usefulness of NEXRAD Data
- Although the NEXRAD system was designed for severe weather forecasting, the data collected has been used in many areas including:
  - flood prediction
  - bird and insect migration
  - rainfall estimation
- The value of this data has been noted by an NRC report, which labeled it a critical resource.
  - Enhancing Access to NEXRAD Data: A Critical National Resource. National Academy Press, Washington D.C. ISBN 0-309-06636-0, 1999
9 Archiving NEXRAD Data
- Despite its value, the archival system for NEXRAD data is unsatisfactory. The National Climatic Data Center (NCDC) maintains a tape archive of the raw data, but provides few tools for finding relevant data and processing it for research.
- Some real-time data is distributed by the University Corporation for Atmospheric Research (UCAR) using their Unidata Internet Data Distribution (IDD) system. However, this still requires that users be able to:
  - extract and process a raw data stream in real time
  - archive it appropriately
  - generate metadata and indexes for retrieving it when required
  - filter the data set to reduce the amount of space required
  - develop custom tools for analysis and processing
10 Data Size Challenges
- Individual NEXRAD Level II scans are not large (300-1000 KB). However, archiving 150 radars that produce 10 scans per hour results in an archive rate of 36,000 scans/day (17 GB/day).
- Although the cost of storage has decreased dramatically (1 TB for under $10,000), this still requires a hardware investment.
- A major challenge: how do you find the data files of interest?
  - Answer: Queryable metadata that allows you to ask for files with certain properties without browsing the entire collection.
  - One problem: The metadata can be huge as well, making it inefficient to search. Even worse, scientific metadata tends to change as research evolves.
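The archive-rate figures above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: it uses decimal units (1 GB = 1,000,000 KB), and the mean scan size is derived here from the slide's 17 GB/day figure rather than given by the source.

```python
RADARS = 150
SCANS_PER_HOUR = 10
HOURS_PER_DAY = 24
SCAN_SIZE_KB_MIN = 300
SCAN_SIZE_KB_MAX = 1000
KB_PER_GB = 1_000_000  # assumption: decimal units
QUOTED_GB_PER_DAY = 17  # figure quoted on the slide

scans_per_day = RADARS * SCANS_PER_HOUR * HOURS_PER_DAY

# Bounds on daily volume if every scan were at the smallest or largest size.
low_gb = scans_per_day * SCAN_SIZE_KB_MIN / KB_PER_GB
high_gb = scans_per_day * SCAN_SIZE_KB_MAX / KB_PER_GB

# Mean scan size implied by the quoted 17 GB/day (derived, not from the source).
implied_mean_kb = QUOTED_GB_PER_DAY * KB_PER_GB / scans_per_day

print(f"{scans_per_day} scans/day")
print(f"between {low_gb:.1f} and {high_gb:.1f} GB/day")
print(f"implied mean scan size: {implied_mean_kb:.0f} KB")
```

The quoted 17 GB/day falls inside the computed bounds and corresponds to a mean scan size within the stated 300-1000 KB range.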
11 User/Client View
Find all the 2002 storms over the Ralston Creek watershed with mean areal precipitation greater than X mm, and with a spatial extent of more than Z km², with a duration of less than N hours. I want the data in GeoTIFF.
Metadata Archive
12 Current Status and Future Work
- We have implemented a prototype version of the architecture that is currently archiving 30 radars in real time. Some basic statistics are being generated and can be used to retrieve data files of interest. Accessible at:
  - http://nexrad.cs.uiowa.edu
- Immediate plans:
  - Generate standardized metadata for use by hydrologists.
  - Link NEXRAD data to basin information so that rainfall estimation and flood prediction can be performed.
- This research is supported by NSF ITR Grant ATM 0427422, "A Comprehensive Framework for Use of NEXRAD Data in Hydrometeorology and Hydrology".
13 NEXRAD Project Participants
- The University of Iowa (Lead)
  - W.F. Krajewski (PI)
  - A.A. Bradley, A. Kruger, R. Lawrence
- Princeton University
  - J.A. Smith (PI)
  - M. Steiner, M.L. Baeck
- National Climatic Data Center
  - S.A. Delgreco (PI)
  - S. Ansari
- UCAR/Unidata Program Center
  - M.K. Ramamurthy (PI)
  - W.J. Weber
14 An Architecture for Real-Time Warehousing of Scientific Data
Thank You!
15 Extra Slides...
16 NEXRAD Data Management Challenges
- Storing NEXRAD Level II data results in many interesting database challenges:
  - Data size - A historical archive of NEXRAD data consumes many terabytes of space.
  - Flexibility/Variability - Unlike commercial warehouses, the types of data and metadata that should be stored in the warehouse are not well understood and evolve over time.
  - Real-time response - The data should be loaded and queryable in real time as it is received from the radars.
  - Scientific workflow - It is desirable to capture and share sequences of calculations on the raw data (scientific workflows) and develop tools that seamlessly interact with the archive.
17 Flexibility Challenges
- Ideally, the system should allow arbitrary metadata to be associated with NEXRAD files, and that metadata should be easy to add, update, and query.
- Unfortunately, relational databases do not handle variable information well. Although there are some known schema designs that can handle variability, they are inefficient for large data sets.
- Good news: This is not unique to hydrology. Researchers in other domains are building grids to share data/metadata and face the same challenges (e.g. GriPhyN - physics grid).
- Bad news: Representing and querying variable data (especially within a relational database) is an active research problem.
18 Flexibility Example
- One way to represent variable metadata on a data file in a relational database is to have a single table:
  - metadata(dataFileId, attributeName, attributeValue)
- Example:
  - Data file 1 has three attributes: ArealCoverage, MaximumReflectivity, MinimumReflectivity. Data file 2 has two attributes, and file 3 has only one.
  - Note that this schema allows any (variable) number of attributes per file.
- A challenge: How would you return all files that have ArealCoverage > 5 and MaximumReflectivity > 20?
Answer: Join two copies of table metadata together.
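The self-join answer can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not the project's actual schema or query: the table and column names come from the slide, while the column types, the primary key, and all attribute values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE metadata (
        dataFileId     INTEGER NOT NULL,
        attributeName  TEXT    NOT NULL,
        attributeValue REAL    NOT NULL,
        PRIMARY KEY (dataFileId, attributeName)
    )
    """
)

# Toy rows (values are invented): file 1 has three attributes,
# file 2 has two, and file 3 has only one.
rows = [
    (1, "ArealCoverage", 7.5),
    (1, "MaximumReflectivity", 35.0),
    (1, "MinimumReflectivity", 2.0),
    (2, "ArealCoverage", 9.0),
    (2, "MaximumReflectivity", 15.0),
    (3, "ArealCoverage", 3.0),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)

# Join two copies of the table: copy "a" tests ArealCoverage,
# copy "b" tests MaximumReflectivity for the same file.
query = """
    SELECT a.dataFileId
    FROM metadata AS a
    JOIN metadata AS b ON a.dataFileId = b.dataFileId
    WHERE a.attributeName = 'ArealCoverage' AND a.attributeValue > 5
      AND b.attributeName = 'MaximumReflectivity' AND b.attributeValue > 20
    ORDER BY a.dataFileId
"""
matching_files = [row[0] for row in conn.execute(query)]
print(matching_files)
```

Each additional attribute condition adds one more copy of the table to the join, which is one reason this design becomes inefficient on large data sets.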
19 Scientific Workflow
- A workflow is a sequence of steps that is performed on data.
- Workflows have received considerable attention where documents must be routed between individuals.
- Think of a funding proposal being internally routed through your university.
- A scientific workflow is a sequence of steps performed on scientific data. Each step uses as input the output of the previous step. An example workflow in hydrology:
  - retrieve the raw data files of interest
  - remove ground clutter and Anomalous Propagation (AP)
  - calculate estimated rainfall
  - map calculations to a basin
- Our goal is to support such workflows.
- How to represent and store intermediary products?
- How to make the tools/algorithms interoperable?
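The idea that each step consumes the previous step's output can be sketched as a chain of functions. Everything below is hypothetical: the step bodies are placeholders (a fixed threshold and a divide-by-ten scaling), not real clutter-removal or rainfall-estimation algorithms, and keeping intermediate products in an in-memory list is only one possible answer to the storage question above.

```python
def remove_clutter(values):
    """Placeholder: drop values at or above a made-up threshold."""
    return [v for v in values if v < 70.0]


def estimate_rainfall(values):
    """Placeholder: made-up scaling, not a real reflectivity-rainfall relation."""
    return [v / 10.0 for v in values]


def map_to_basin(values):
    """Placeholder: reduce the values to a single basin-wide mean."""
    return sum(values) / len(values)


def run_workflow(initial, steps):
    """Run steps in order, feeding each output to the next step.

    Returns (step name, product) pairs so intermediate products are kept.
    """
    products = [("input", initial)]
    current = initial
    for step in steps:
        current = step(current)
        products.append((step.__name__, current))
    return products


# Toy "raw data" standing in for the retrieved files of interest.
raw_values = [10.0, 80.0, 25.0, 40.0]
products = run_workflow(raw_values, [remove_clutter, estimate_rainfall, map_to_basin])
for name, product in products:
    print(name, product)
```

Because every step has the same shape (one input, one output), steps can be reordered or swapped without changing `run_workflow`, which is the kind of interoperability the slide asks about.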
20 A Watershed or Basin
A watershed is an area of land that drains water,
sediment and dissolved materials to a common
receiving body or outlet.
21 NRC Quote on NEXRAD Data Archiving
the limited use of ground-based radar rainfall
data outside of the operational environment is
partially attributed to the lack of
research-quality data products and partially to
poor archiving practices.
NRC Report, 2002
22 Metadata
Basic
Find all the 2002 storms over the Ralston Creek watershed with mean areal precipitation greater than X mm, and with a spatial extent of more than Z km², with a duration of less than N hours. I want the data in GeoTIFF.
Derived/Complex
Find all the 2002 storms over the Ralston Creek watershed with mean areal precipitation greater than X mm, and with a spatial extent of more than Z km², with a duration of less than N hours. I want the data in GeoTIFF.
23 CUAHSI
Consortium of Universities for the Advancement of
Hydrologic Sciences (CUAHSI)