Title: An Architecture for Real-Time Warehousing of Scientific Data


1
An Architecture for Real-Time Warehousing of
Scientific Data
Ramon Lawrence and Anton Kruger
IIHR, University of Iowa
ramon-lawrence_at_uiowa.edu
http://www.cs.uiowa.edu/rlawrenc/
http://www.iihr.uiowa.edu/hml/projects/nexrad-itr
2
Overview
  • Our goal is to build a general archival
    architecture for storing and querying massive
    amounts of scientific data.
  • This presentation will discuss our current
    architecture and how it is being used in a
    national project to archive weather radar data in
    the United States.
  • The architecture achieves four basic design
    goals:
  • 1) scalable - can handle terabyte-scale data sets
  • 2) extensible - types of data and metadata stored
    can change
  • 3) inexpensive - uses cheap hardware and
    open-source software
  • 4) usable - researchers can interact with the
    system in a variety of intuitive ways

3
Motivation
  • The size of scientific data sets in many domains
    is increasing dramatically. This is placing a
    burden on IT infrastructure for storing,
    processing, and querying the data effectively.
  • As sensor networks are deployed, this burden
    will grow further.
  • Although data warehousing techniques are well
    known, managing data sets of this scale remains
    an impediment to research.
  • One of the most basic challenges is finding data
    relevant to the research (the data-finding
    problem). To avoid browsing a large data set,
    suitable metadata describing the data must be
    generated, stored, and made queryable by the
    researcher.

4
Desirable Architecture Properties
  • Our architecture is designed with four key
    properties:
  • 1) scalable - The system can accommodate more
    data simply by adding low-cost PCs. Data files
    are transparently allocated and replicated across
    nodes without custom hardware/software.
  • 2) extensible - The types of metadata generated
    and stored may change over time as the research
    evolves.
  • 3) inexpensive - Low-cost hardware and
    open-source software are used.
  • 4) usable - Researchers can interact with the
    data archive in a variety of ways, including
    directly through C code, web forms, or web
    services.

5
Archive Architecture Overview
6
Architecture Components
  • The components:
  • Extractor - is the only component specific to the
    data set. It is the code module for computing
    desired metadata statistics on the data. The
    output is XML that conforms to a standard schema
    defined by the Loader.
  • Loader - is the module responsible for storing
    metadata in the database and using rules to place
    data files on retrieval servers. This component
    is not data set specific. Different and evolving
    metadata is supported by a general database
    schema.
  • Metadata archive - is a relational database that
    stores the metadata and pointers to the data.
    SQL queries are built using the various front-end
    tools (C code, web interface, etc.) to query
    metadata to find data with specific properties
    and file locations.
  • Retrieval server - is any machine capable of
    running an HTTP server and acting as a data file
    store.
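
As a rough illustration of the Extractor's role, the sketch below computes a few summary statistics for one data file and emits them as XML. The element and attribute names (`datafile`, `attribute`, `name`), the file name, and the chosen statistics are hypothetical; the actual schema defined by the Loader is not shown in these slides.

```python
import xml.etree.ElementTree as ET


def extract_metadata(file_name, reflectivities):
    """Compute simple statistics for one data file and return them as an XML string.

    The element names used here are illustrative assumptions, not the Loader's real schema.
    """
    stats = {
        "MaximumReflectivity": max(reflectivities),
        "MinimumReflectivity": min(reflectivities),
        "MeanReflectivity": sum(reflectivities) / len(reflectivities),
    }
    root = ET.Element("datafile", {"name": file_name})
    for name, value in stats.items():
        # One <attribute> element per statistic, so new statistics need no schema change.
        attr = ET.SubElement(root, "attribute", {"name": name})
        attr.text = f"{value:.1f}"
    return ET.tostring(root, encoding="unicode")


# Made-up sample values standing in for data read from a real scan.
xml_doc = extract_metadata("scan_0001.raw", [5.0, 12.5, 20.0])
print(xml_doc)
```

Because each statistic is a generic name/value pair, a new Extractor for a different data set could emit different attributes without changes to the component that consumes the XML, which is the separation the slide describes.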

7
Case Study: Archiving NEXRAD Data
Our goal is to provide the community with access
to the vast archives and real-time data collected
by the NEXRAD system.
  • There are over 150 NEXt generation RADars
    (NEXRAD) that collect real-time precipitation
    data across the United States.
  • The system has been operational for about 10
    years, and the amount of collected data is
    continually expanding.
  • How a radar works:
  • A radar emits a coherent train of microwave
    pulses and processes reflected pulses.
  • Each processed pulse corresponds to a bin. There
    are multiple bins in a ray (beam). Rotating the
    radar 360° is a sweep. After each sweep, the
    radar elevation angle is increased and another
    sweep is performed. All sweeps together form a
    volume.
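
The bin, ray, sweep, and volume hierarchy described above can be sketched as nested data structures. This is only an illustration of the terminology: the class and field names are hypothetical, and the toy sizes are far smaller than real radar data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ray:
    azimuth_deg: float   # direction the antenna pointed for this ray
    bins: List[float]    # one value per processed pulse (bin)


@dataclass
class Sweep:
    elevation_deg: float  # fixed elevation angle for one full rotation
    rays: List[Ray] = field(default_factory=list)


@dataclass
class Volume:
    sweeps: List[Sweep] = field(default_factory=list)

    def bin_count(self) -> int:
        """Total number of bins across every ray of every sweep."""
        return sum(len(ray.bins) for sweep in self.sweeps for ray in sweep.rays)


# Toy volume: 2 sweeps, 4 rays per sweep, 3 bins per ray.
volume = Volume([
    Sweep(elev, [Ray(az, [0.0, 0.0, 0.0]) for az in (0.0, 90.0, 180.0, 270.0)])
    for elev in (0.5, 1.5)
])
print(volume.bin_count())
```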

8
Usefulness of NEXRAD Data
  • Although the NEXRAD system was designed for
    severe weather forecasting, data collected has
    been used in many areas, including:
  • flood prediction
  • bird and insect migration
  • rainfall estimation
  • The value of this data has been noted by an NRC
    report, which labeled it a critical resource.
  • Enhancing Access to NEXRAD Data: A Critical
    National Resource. National Academy Press,
    Washington D.C. ISBN 0-309-06636-0, 1999

9
Archiving NEXRAD Data
  • Despite its value, the archival system for NEXRAD
    data is unsatisfactory. The National Climatic
    Data Center (NCDC) maintains a tape archive of
    the RAW data, but provides few tools for finding
    relevant data and processing it for research.
  • Some real-time data is distributed by the
    University Corporation for Atmospheric Research
    (UCAR) using their Unidata Internet Data
    Distribution (IDD) system. However, this still
    requires that users be able to:
  • extract and process a RAW data stream in
    real-time
  • archive it appropriately
  • generate metadata and indexes for retrieving it
    when required
  • filter the data set to reduce the amount of space
    required
  • develop custom tools for analysis and processing

10
Data Size Challenges
  • Individual NEXRAD Level II scans are not large
    (300-1000 KB). However, archiving 150 radars
    that produce 10 scans per hour results in an
    archive rate of 36,000 scans/day, or about
    17 GB/day.
  • Although the cost of storage has decreased
    dramatically (1 TB for under $10,000), this still
    requires a hardware investment.
  • A major challenge: how do you find the data
    files of interest?
  • Answer: Queryable metadata that allows you to ask
    for files with certain properties without
    browsing the entire collection.
  • One problem: The metadata can be huge as well,
    making it inefficient to search. Even worse,
    scientific metadata tends to change as research
    evolves.
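
The archive-rate figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 1,000,000 KB) and a 24-hour collection day; neither assumption is stated on the slide.

```python
RADARS = 150
SCANS_PER_HOUR = 10
HOURS_PER_DAY = 24

scans_per_day = RADARS * SCANS_PER_HOUR * HOURS_PER_DAY


def gb_per_day(scan_kb):
    """Daily archive volume in GB for a given scan size in KB (decimal units)."""
    return scans_per_day * scan_kb / 1_000_000


# Bounds from the quoted scan size range of 300-1000 KB.
low, high = gb_per_day(300), gb_per_day(1000)

# Average scan size implied by the quoted 17 GB/day figure.
implied_avg_kb = 17 * 1_000_000 / scans_per_day

print(scans_per_day)
print(low, high)
print(round(implied_avg_kb))
```

Under these assumptions the daily volume falls between roughly 10.8 and 36 GB, and the 17 GB/day figure corresponds to an average scan of about 472 KB, which sits inside the quoted size range.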

11
User/Client's View
Find all the 2002 storms over the Ralston Creek
watershed with mean areal precipitation greater
than X mm, and with a spatial extent of more than
Z km², with a duration of less than N hours. I
want the data in GeoTIFF.
Metadata Archive
12
Current Status and Future Work
  • We have implemented a prototype version of the
    architecture that is currently archiving 30
    radars in real-time. Some basic statistics are
    being generated and can be used to retrieve data
    files of interest. Accessible at:
  • http://nexrad.cs.uiowa.edu
  • Immediate plans:
  • Generate standardized metadata for use by
    hydrologists.
  • Link NEXRAD data to basin information so that
    rainfall estimation and flood prediction can be
    performed.
  • This research is supported by NSF ITR Grant ATM
    0427422, "A Comprehensive Framework for Use of
    NEXRAD Data in Hydrometeorology and Hydrology."

13
NEXRAD Project Participants
  • The University of Iowa (Lead)
  • W.F. Krajewski (PI)
  • A.A. Bradley, A. Kruger, R. Lawrence
  • Princeton University
  • J.A. Smith (PI)
  • M. Steiner, M.L. Baeck
  • National Climatic Data Center
  • S.A. Delgreco (PI)
  • S. Ansari
  • UCAR/Unidata Program Center
  • M. K. Ramamurthy (PI)
  • W.J. Weber

14
Thank You!
15
Extra Slides...
16
NEXRAD Data Management Challenges
  • Storing NEXRAD Level II data results in many
    interesting database challenges:
  • Data size - A historical archive of NEXRAD data
    consumes many terabytes of space.
  • Flexibility/Variability - Unlike commercial
    warehouses, the types of data and metadata that
    should be stored in the warehouse are not well
    understood and evolve over time.
  • Real-Time response - The data should be loaded
    and queryable in real-time as it is received from
    the radars.
  • Scientific Workflow - It is desirable to capture
    and share sequences of calculations on the raw
    data (scientific workflows) and develop tools
    that seamlessly interact with the archive.

17
Flexibility Challenges
  • Ideally, the system should allow arbitrary
    metadata to be associated with NEXRAD files that
    can easily be added, updated, and queried.
  • Unfortunately, relational databases do not nicely
    handle variable information. Although there are
    some known schema designs that can handle
    variability, they are inefficient for large data
    sets.
  • Good news: This is not unique to hydrology.
    Researchers in other domains are building grids
    to share data/metadata and face the same
    challenges (e.g. GriPhyN - physics grid).
  • Bad news: Representing and querying variable data
    (especially within a relational database) is an
    active research problem.

18
Flexibility Example
  • One way to represent variable metadata on a
    data file in a relational database is to have a
    single table:
  • metadata(dataFileId, attributeName,
    attributeValue)
  • Example:
  • Data file 1 has three attributes: ArealCoverage,
    MaximumReflectivity, MinimumReflectivity. Data
    file 2 has two attributes, and file 3 has only one.
  • Note that this schema allows any (variable)
    number of attributes per file.
  • A challenge: How would you return all files that
    have ArealCoverage > 5 and MaximumReflectivity >
    20?

Answer: Join two copies of table metadata
together.
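
A minimal sketch of that self-join, using Python's built-in sqlite3 module and an in-memory database. The table and column names come from the slide; the attribute values are invented sample data, and the slide does not say which attributes files 2 and 3 carry, so those are assumptions too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (dataFileId INTEGER, attributeName TEXT, attributeValue REAL)"
)
# Invented sample rows: file 1 has three attributes, file 2 has two, file 3 has one.
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        (1, "ArealCoverage", 8.0),
        (1, "MaximumReflectivity", 35.0),
        (1, "MinimumReflectivity", 2.0),
        (2, "ArealCoverage", 3.0),
        (2, "MaximumReflectivity", 40.0),
        (3, "MaximumReflectivity", 25.0),
    ],
)

# Self-join: one copy of the table per attribute being tested.
query = """
    SELECT a.dataFileId
    FROM metadata AS a
    JOIN metadata AS b ON a.dataFileId = b.dataFileId
    WHERE a.attributeName = 'ArealCoverage' AND a.attributeValue > 5
      AND b.attributeName = 'MaximumReflectivity' AND b.attributeValue > 20
"""
matching_files = [row[0] for row in conn.execute(query)]
print(matching_files)
```

With this sample data only file 1 satisfies both conditions. Each further attribute condition adds another copy of the table to the join, which is one reason this kind of schema can become inefficient on large data sets.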
19
Scientific Workflow
  • A workflow is a sequence of steps that is
    performed on data.
  • Workflows have received considerable attention
    where documents must be routed between
    individuals.
  • Think of a funding proposal being internally
    routed through your university.
  • A scientific workflow is a sequence of steps
    performed on scientific data. Each step uses as
    input the output of the previous step. An
    example workflow in hydrology:
  • retrieve the raw data files of interest
  • remove ground clutter and Anomalous Propagation
    (AP)
  • calculate estimated rainfall
  • map calculations to a basin
  • Our goal is to support such workflows.
  • How to represent and store intermediary products?
  • How to make the tools/algorithms interoperable?
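
The chaining of steps in the example workflow can be sketched as a list of functions, each consuming the previous step's output. Every function body below is a hypothetical placeholder with made-up numbers; none of them implements a real clutter filter or rainfall estimation algorithm.

```python
from functools import reduce


def retrieve_raw(file_ids):
    # Placeholder: pretend each file holds a few reflectivity readings.
    return {fid: [10.0, 55.0, 30.0] for fid in file_ids}


def remove_clutter(scans, threshold=50.0):
    # Placeholder filter standing in for ground clutter / AP removal.
    return {fid: [v for v in vals if v < threshold] for fid, vals in scans.items()}


def estimate_rainfall(scans):
    # Placeholder arithmetic, not a real reflectivity-to-rainfall relation.
    return {fid: sum(vals) / 10.0 for fid, vals in scans.items()}


def map_to_basin(rainfall, basin="example-basin"):
    # Placeholder: sum the per-file estimates for a single named basin.
    return {"basin": basin, "total": sum(rainfall.values())}


def run_workflow(steps, initial_input):
    """Feed the output of each step into the next one."""
    return reduce(lambda data, step: step(data), steps, initial_input)


result = run_workflow(
    [retrieve_raw, remove_clutter, estimate_rainfall, map_to_basin], [1, 2]
)
print(result)
```

Keeping the workflow as an explicit list of steps is one way to make it something that can be stored and shared; the intermediate dictionaries stand in for the intermediary products the slide asks about.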

20
A Watershed or Basin
A watershed is an area of land that drains water,
sediment and dissolved materials to a common
receiving body or outlet.
21
NRC Quote on NEXRAD Data Archiving
the limited use of ground-based radar rainfall
data outside of the operational environment is
partially attributed to the lack of
research-quality data products and partially to
poor archiving practices.
NRC Report, 2002
22
Metadata
Basic
Find all the 2002 storms over the Ralston Creek
watershed with mean areal precipitation greater
than X mm, and with a spatial extent of more than
Z km², with a duration of less than N hours. I
want the data in GeoTIFF
Derived/Complex
Find all the 2002 storms over the Ralston Creek
watershed with mean areal precipitation greater
than X mm, and with a spatial extent of more than
Z km², with a duration of less than N hours. I
want the data in GeoTIFF
23
CUAHSI
Consortium of Universities for the Advancement of
Hydrologic Sciences (CUAHSI)