Query Processing, Resource Management and Approximate in a Data Stream Management System - PowerPoint PPT Presentation


Transcript and Presenter's Notes

1
NDSU CS 783: Parallel and Vertical High
Performance Software Systems
Data Structuring, Lecture 8
By Dr. William Perrizo and Dr. Gregory Wettstein
2
Horizontal Data
CENTRALITY OF DATA: Data are central to every
computer program. If a program has no data, there
is no input, no output, no constants, no
variables... It is hard to imagine a program with
no data. Therefore, virtually all programs are
data management programs, and virtually all
computing involves data management. However, not
all data in computer programs are RESIDUALIZED.
RESIDUALIZED data are data stored and managed
after the termination of the program that
generated them (for reuse later). Database
Management Systems (DBMSs) store and manage
residualized data.
HUGE VOLUME (EVERYONE HAS LOTS OF DATA
AVAILABLE TO THEM TODAY!): Data are collected much
faster than they are processed or managed. NASA's
Earth Observation System (EOS), alone, has
already collected over 15 petabytes of data
(15,000,000,000,000,000 bytes). Most of it will
never be used! Most of it will never be seen!
Why not? There is so much volume that the
usefulness of much of it will never be discovered.
SOLUTION: Reduce the volume and raise the
information density through structuring, querying,
filtering, mining, summarizing, aggregating...
That is the main task of data and database workers
today! Claude Shannon's information theory
principle comes into play here: more volume
means less information.
3
Shannon's Law of Information
  • The more volume you have, the less information
    you have. (AKA "Shannon's Canon")
  • A simple illustration: which phone book has
    more useful information?
  • (Both have the same 4 data granules: Smith,
    Jones, 234-9816, 231-7237.)

    BOOK-1               BOOK-2
    Name   Number        Name   Number
    Smith  234-9816      Smith  234-9816
    Jones  231-7237      Smith  231-7237
                         Jones  234-9816
                         Jones  231-7237

  • BOOK-2 (the red book) has no useful phone
    number information!

Data analysis, querying and mining reduce volume
and raise the information level.
4
STRUCTURING and RESIDUALIZING DATA
Proper structuring of data may be the second most
important task in data and database system work
today! At the highest level is the decision as to
whether a data set should be structured as
horizontal or vertical data (or some combination).
Another important task to be addressed in data
systems work today is the RESIDUALIZATION OF DATA:
MUCH WELL-STRUCTURED DATA IS DISCARDED
PREMATURELY. Databases are about storing data
persistently, for later use. RESIDUALIZING DATA
may be the third most important task in data and
database system work today!
5
WHAT IS A DATABASE?
There are many definitions in the literature. Here
is the one we will use: an integrated, shared
repository of operational data of interest to an
enterprise.
INTEGRATED: it must be the unification of several
distinct files.
SHARED: the same data can be used by more than one
user (concurrently).
REPOSITORY: implies "persistence".
OPERATIONAL DATA: data on accounts, parts,
patients, students, employees, genes, stock,
pixels, ... By contrast, nonoperational data
includes I/O data, transient data in buffers,
queues, ...
ENTERPRISE: a bank, warehouse, hospital, school,
corporation, government agency, person, ...
WHAT IS A DATABASE MANAGEMENT SYSTEM (DBMS)? A
program which organizes and manages access to
residualized data. Databases also contain METADATA
(data about the data). Metadata is non-user data
which contains descriptive information about the
data and the database organization (i.e., catalog
data).
6
WHY USE A DATABASE?
COMPACTNESS (saves space; no paper files
necessary). EASE OF USE (less drudgery; more of
the organizational and search work is done by the
system; the user specifies what, not how).
CENTRALIZED CONTROL (by the DataBase Administrator
(DBA) and by the CEO). REDUCED REDUNDANCY (1 copy
is enough, but concurrent use must be controlled).
NO INCONSISTENCIES (again, since only 1 copy is
necessary). ENFORCED STANDARDS (corporate, dept,
industry, national, international). INTEGRITY
CONSTRAINTS (automatically maintained) (e.g.,
GENDER=male => MAIDEN_NAME=null). BALANCED
REQUIREMENTS (even conflicting requirements; the
DBA can optimize for the whole company). DATA
INDEPENDENCE (occurs because applications are
immune to storage structure and access strategy
changes; the storage structure can be changed
without changing the access programs, and vice
versa).
7
HORIZONTAL DATA
8
Stored? (versus logical)
STORED FILE: a named collection of all occurrences
of 1 type of stored record.
9
The employee file type IS the common employee
record type (plus, possibly, some other type
characteristics, e.g., max-records). In today's
storage device world, there is only linear storage
space, so the 2-D picture of a stored file is,
strictly speaking, not possible in physical
storage media today. Some day there may be truly
2-D storage (e.g., holographic storage) and even
3-D. A more accurate depiction of the stored
Employee file (as stored on linear storage):
So we also have:
LOGICAL FIELD: the smallest unit of logical data.
LOGICAL RECORD: a named collection of related
logical fields.
LOGICAL FILE: a named collection of occurrences of
1 type of logical record, which may or may not
correspond to the physical entities.
10
Unfortunately there's a lot of variation in
terminology. It will suffice to equate terms as
follows:

    COMMON USAGE    RELATIONAL MODEL    TABULAR USAGE
    File            Relation            Table
    Record          Tuple               Row
    Field           Attribute           Column

When we need to be more careful, we will say: a
relation is a "set" of tuples, whereas a table is
a "sequence" of rows or records (has order); a
tuple is a "set" of fields, whereas a row or
record is a "sequence" of fields (has order).
DATA MODELS: for conceptualizing (logical) and
storing (physical) data, we have horizontal and
vertical models.
Here are some of the HORIZONTAL MODELS for files
of horizontal records (processing is typically
done through vertical scans, e.g., get and process
the 1st record, get and process the next record,
...):
  RELATIONAL (simple flat unordered files, or
    relations, of tuples of unordered field values)
  TABULAR (ordered files of ordered fields)
  INVERTED LIST (tabular, with an access path
    (index) on every field)
  HIERARCHICAL (files with hierarchical links)
  NETWORK (files with record chains)
  OBJECT-RELATIONAL (relational with "Large OBject"
    (LOB) attributes pointing to or containing
    complex objects)
Here are some of the VERTICAL MODELS (for vertical
vectors or trees of attribute values; processing
is typically through logical horizontal AND/OR
programs):
  BINARY STORAGE MODEL (Copeland 1986)
  BIT TRANSPOSE FILES (Wang 1988)
  VIPER STRUCTURES (1998) (used vertical bit
    vectors for data mining)
  PREDICATE-Trees, or Ptrees (1997) (patented by
    NDSU; uses vertical bit trees)
(Only the last one is described in detail later in
these notes.)
11
REVIEW OF HORIZONTAL DATA MODELS
RELATIONAL DATA MODEL: the only construct allowed
is a simple, flat relation, used for both entity
description and relationship definition, e.g.:
The STUDENT and COURSE relations represent
entities.
The LOCATION relation represents a relationship
between the LCODE and STATUS attributes
(1-to-many).
The ENROLL relation represents a relationship
between the STUDENT and COURSE entities (a
many-to-many relationship).
12
REVIEW OF HORIZONTAL DATA MODELS
HIERARCHICAL DATA MODEL: entities = records;
relationships = links of records forming trees.
EX: the root type is STUDENT (with attributes S#,
NAME, LOCATION),
the dependent type is COURSE (with attributes C#,
CNAME),
and the 2nd-level dependent type is ENROLLMENT
(with attributes GRADE, LOC).
If the typical workload involves producing class
lists for students, this organization is very
good. Why? If the typical workload is producing
course enrollment lists for professors, it is very
poor. Why? The problem with the hierarchical data
model is that it almost always favors a particular
workload category (at the expense of the others).
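The workload asymmetry above can be sketched in a few lines of Python (a minimal sketch; the student and course records and the helper names are hypothetical, not from the slides):

```python
# Hedged sketch: a hierarchical database as nested dicts (hypothetical data).
# Each STUDENT root owns its COURSE dependents, mirroring the tree above.
db = {
    "S1": {"name": "Ann", "courses": {"C1": "DB", "C2": "OS"}},
    "S2": {"name": "Bob", "courses": {"C1": "DB"}},
}

# Class list for a student: one root lookup -- the favored workload.
def courses_of(student_id):
    return list(db[student_id]["courses"].values())

# Enrollment list for a course: must scan EVERY root -- the penalized workload.
def students_in(course_id):
    return [rec["name"] for rec in db.values() if course_id in rec["courses"]]

print(courses_of("S1"))   # ['DB', 'OS']
print(students_in("C1"))  # ['Ann', 'Bob']
```

The second query touches every root record, which is exactly why this model favors one workload category over the others.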
13
REVIEW OF HORIZONTAL DATA MODELS
NETWORK DATA MODEL: entities = records;
relationships = owner-member chains (sets);
many-to-many relationships are easily
accommodated.
EX: 3 entities (STUDENT, ENROLLMENT, COURSE) and
2 owner-member chains:
STUDENT-ENROLLMENT and COURSE-ENROLLMENT.
Easy to insert (create a new record and reset
pointers), delete (reset pointers), and update
(always just 1 copy to worry about: ZERO
REDUNDANCY!). The network approach gives fast
processing but a complicated structure (usually
requires a data processing shop). Again, it favors
one workload type over others.
14
REVIEW OF HORIZONTAL DATA MODELS
INVERTED LIST MODEL (TABULAR)
Flat ordered files (like relational, except there
is intrinsic order, visible to user programs, on
both tuples and attributes). Order is usually
"arrival order", meaning each record is given a
unique "Relative Record Number" or RRN when it is
inserted. RRNs never change (unless there is a
reorganization). Programs can access records by
RRN. Physical placement of records on pages is in
RRN order ("clustered on RRN"), so that
application programs can efficiently retrieve in
RRN order. Indexes, etc., can be provided for
other access paths (and orderings).
The Object-Relational model (OR model) is like the
relational model except that repeating groups are
allowed (many levels of repeating groups, even
nested repeating groups) and pointers to complex
structures are allowed (LOBs for Large OBjects,
BLOBs for Binary Large OBjects, etc., for storing,
e.g., pictures, movies, and other binary large
objects).
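RRN-based access can be sketched in Python (a minimal sketch; the records, the `insert` helper, and the secondary index on Name are hypothetical illustrations):

```python
# Hedged sketch: an inverted-list (tabular) file where each record keeps the
# Relative Record Number (RRN) it was assigned at insertion (hypothetical data).
records = []        # physical placement in RRN (arrival) order
name_index = {}     # a secondary access path on the Name field

def insert(name, number):
    rrn = len(records)              # RRNs are assigned once and never change
    records.append((name, number))
    name_index.setdefault(name, []).append(rrn)
    return rrn

insert("Smith", "234-9816")
insert("Jones", "231-7237")

print(records[0])                                  # direct access by RRN
print([records[r] for r in name_index["Jones"]])   # access via the index
```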
15
Vertical Data
In data processing, you run up against two curses
immediately. The curse of cardinality: solutions
don't scale well with respect to record volume
("files are too deep!"). The curse of
dimensionality: solutions don't scale with respect
to attribute dimension ("files are too wide!").
The curse of cardinality is a problem in both the
horizontal and vertical data worlds! In the
horizontal data world it is disguised as the curse
of slow joins. In the horizontal world we
decompose relations to get good design (e.g., 3rd
normal form), but then we pay for that by
requiring many slow joins to get the answers we
need.
16
Techniques to address these curses.
  • Horizontal Processing of Vertical Data, or
    HPVD, instead of the ubiquitous Vertical
    Processing of Horizontal (record-oriented)
    Data, or VPHD.
  • Parallelizing the processing engine:
  • Parallelize the software engine on clusters of
    computers.
  • Parallelize the greyware engine on clusters of
    people (i.e., enable visualization and use the
    web...).

Again, we need better techniques for data
analysis, querying and mining because of:
Parkinson's Law: data volume expands to fill the
available data storage. Moore's law: available
storage doubles every 9 months!
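The HPVD-versus-VPHD contrast can be sketched on a toy table (a minimal Python sketch; the two-attribute table is an assumption for illustration only):

```python
# Hedged sketch: VPHD scans record by record; HPVD projects each attribute
# into a vertical column once and then processes the columns horizontally.
rows = [(2, 7), (6, 7), (7, 0), (7, 0)]   # hypothetical horizontal records

# VPHD: get and process the 1st record, then the next, ...
vphd_count = sum(1 for r in rows if r == (7, 0))

# HPVD: vertically project the attributes, then combine columns horizontally.
col_a = [r[0] for r in rows]
col_b = [r[1] for r in rows]
hpvd_count = sum(1 for a, b in zip(col_a, col_b) if a == 7 and b == 0)

print(vphd_count, hpvd_count)   # both find 2 occurrences of (7, 0)
```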
17
A few HPVD successes:
1. Precision Agriculture: yield prediction using
Remotely Sensed Imagery (RSI). The data consist of
an aerial photograph (RGB TIFF image taken in
July) and a synchronized crop yield map taken at
harvest; thus, 4 feature attributes (B,G,R,Y) and
100,000 pixels.
Producers are able to analyze the color intensity
patterns from aerial and satellite photos taken in
mid season to predict yield (find associations
between electromagnetic reflection and yield).
E.g., hi_green & low_red => hi_yield. That is very
intuitive.
A stronger association, hi_NIR & low_red =>
hi_yield, found through HPVD data mining, allows
producers to take and query mid-season aerial
photographs for low_NIR & high_red grid cells and,
where low yield is anticipated, apply (top dress)
additional nitrogen. Can producers use Landsat
images of China to predict wheat prices before
planting?
2. Infestation Detection (e.g., grasshopper
infestation prediction, again involving RSI).
Pixel classification on remotely sensed imagery
holds much promise for achieving early detection.
Pixel classification (signaturing) has many, many
applications: pest detection, flood monitoring,
fire detection, wetlands monitoring, ...
18
Sensor Network Data HPVD
  • Micro- and nano-scale sensor blocks are being
    developed for sensing:
  • Biological agents
  • Chemical agents
  • Motion detection
  • Coatings deterioration
  • RF-tagging of inventory (RFID tags for Supply
    Chain Mgmt)
  • Structural materials fatigue
  • There will be trillions of individual sensors
    creating mountains of data which can be data
    mined using HPVD (maybe it shouldn't be called
    a success yet?).

19
A Sensor Network Application
CubE for Active Situation Replication (CEASR).
Nano-sensors are dropped into the situation space.
Wherever a threshold level is sensed (chemical,
biological, thermal...), a ping is registered in a
compressed structure (a P-tree; detailed
definition coming up) for that location.
Using Alien Technology's Fluidic Self-Assembly
(FSA) technology, clear plexiglass laminates are
joined into a cube, with an embedded nano-LED at
each voxel.
The single compressed structure (P-tree)
containing all the information is transmitted to
the cube, where the pattern is reconstructed
(uncompressed and displayed).
Each energized nano-sensor transmits a ping (its
location is triangulated from the ping). These
locations are then translated to 3-dimensional
coordinates at the display, and the corresponding
voxel on the display lights up. This is the
expendable, one-time, cheap sensor version. A more
sophisticated CEASR device could sense and
transmit intensity levels, lighting up the display
voxel with the same intensity.
The soldier sees a replica of the sensed situation
prior to entering the space.
20
Anthropology Application: Digital Archive Network
for Anthropology (DANA) (analyze, query and mine
anthropological artifacts: shape, color, discovery
location, ...)
21
What has spawned these successes? (i.e., What is
Data Mining?)
Querying is asking specific questions and
expecting specific answers. Data mining is finding
the patterns that exist in data (going into
MOUNTAINS of raw data for the information gems
hidden in that mountain of data).
22
Data Mining versus Querying
There is a whole spectrum of techniques to get
information from data:
  • Even on the query end, much work is yet to be
    done (D. DeWitt, ACM SIGMOD Record, '02).
  • On the data mining end, the surface has barely
    been scratched.
  • But even those scratches have had a great
    impact. For example, one of the early
    scratchers became the biggest corporation in
    the world. A non-scratcher had to file for
    bankruptcy protection.
  • HPVD approach: vertical, compressed data
    structures, Predicate-trees or Peano-trees
    (Ptrees in either case)1, processed
    horizontally (most DBMSs process horizontal
    data vertically).
  • Ptrees are data-mining-ready, compressed data
    structures which attempt to address the curses
    of cardinality and dimensionality.

1 Ptree technology is patented by North Dakota
State University.
23
Predicate trees (Ptrees): given a table structured
into horizontal records (which are traditionally
processed vertically - VPHD), vertically project
each attribute, then vertically project each bit
position of each attribute, then compress each bit
slice into a basic 1-dimensional Ptree. E.g., the
compression of R11 into P11 goes as follows.

R(A1 A2 A3 A4), base 10 (each value is also stored
in 3-bit base 2, giving the bit slices R11, R12,
R13, ..., R43):

    2 7 6 1
    3 7 6 0
    2 7 5 1
    2 7 5 7
    3 2 1 4
    2 2 1 5
    7 0 1 4
    7 0 1 4

VPHD to find the number of occurrences of
(7 0 1 4): scan the records one at a time; the
count is 2. HPVD to find the number of occurrences
of (7 0 1 4)?

R11 (the high-order bit slice of A1):
0 0 0 0 0 0 1 1

Top-down construction of the 1-dimensional Ptree
of R11, denoted P11: record the truth of the
universal predicate "pure 1" in a tree,
recursively on halves (1/2^1 subsets), until
purity is achieved. The left half (0 0 0 0) is
pure (pure0), so that branch ends.
To find the number of occurrences of (7 0 1 4),
AND these basic Ptrees (next slide).
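The projection and top-down compression steps can be sketched in Python (a minimal sketch; the nested-list tree representation and the `bit_slice`/`ptree` helper names are illustrative assumptions, using an A1 column consistent with the slide's R11):

```python
# Hedged sketch: vertically project one bit position of an attribute into a
# bit slice, then compress it top-down into a basic 1-D Ptree.
# A leaf 1 means pure-1, a leaf 0 means pure-0; mixed spans recurse on halves.
def bit_slice(column, bit, width=3):
    """Project one bit position (0 = high-order) from an attribute column."""
    return [(v >> (width - 1 - bit)) & 1 for v in column]

def ptree(bits):
    """Recursively halve until a subset is pure, as on the slide."""
    if all(b == 1 for b in bits):
        return 1                     # pure-1: this branch ends
    if all(b == 0 for b in bits):
        return 0                     # pure-0: this branch ends
    half = len(bits) // 2
    return [ptree(bits[:half]), ptree(bits[half:])]

a1 = [2, 3, 2, 2, 3, 2, 7, 7]        # hypothetical A1 column
r11 = bit_slice(a1, 0)               # high-order bit slice of A1
print(r11)         # [0, 0, 0, 0, 0, 0, 1, 1]
print(ptree(r11))  # [0, [0, 1]] -- the pure-0 left half collapses to a leaf
```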
24
R(A1 A2 A3 A4), base 10 (the A1 value in row 5 has
changed from 3 to 5, so R11 is now
0 0 0 0 1 0 1 1):

    2 7 6 1
    3 7 6 0
    2 7 5 1
    2 7 5 7
    5 2 1 4
    2 2 1 5
    7 0 1 4
    7 0 1 4

To count occurrences of (7,0,1,4), note that its
bit pattern is 111 000 001 100, and AND the twelve
basic Ptrees, complementing each Ptree whose
target bit is 0:

P11 & P12 & P13 & P21' & P22' & P23' & P31' &
P32' & P33 & P41 & P42' & P43'

In the resulting Ptree, the 2^1 level holds the
only 1-bit, so the 1-count = 1 * 2^1 = 2.
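The AND-based count can be sketched in Python (a minimal sketch; for clarity the complementing and ANDing are done bit-by-bit on the uncompressed bit slices rather than on compressed Ptrees, and the table is the slide's 8-row example):

```python
# Hedged sketch: count occurrences of (7,0,1,4) by ANDing the twelve bit
# slices, complementing each slice whose target bit is 0
# (the slide's pattern 111 000 001 100).
rows = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 7, 5, 1), (2, 7, 5, 7),
        (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]

target, width = (7, 0, 1, 4), 3
mask = [1] * len(rows)                  # running AND across all bit slices
for col in range(4):
    for bit in range(width):
        want = (target[col] >> (width - 1 - bit)) & 1
        for i, r in enumerate(rows):
            have = (r[col] >> (width - 1 - bit)) & 1
            # have == want is the slice bit itself when want is 1,
            # and the complemented slice bit when want is 0
            mask[i] &= int(have == want)

print(sum(mask))   # 1-count = 2: the last two rows match (7,0,1,4)
```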
25
R11: 0 0 0 0 1 0 1 1
Top-down construction of basic P-trees is best for
understanding, but bottom-up is much faster (one
pass across the data). Bottom-up construction of
the 1-dimensional P11 is done using an in-order
traversal of the bit slice R11, collapsing pure
siblings as we go: whenever two completed siblings
are both pure0 (or both pure1), collapse them into
a single pure leaf! The same pass over the bit
slices R12, R13, ..., R43 builds the other eleven
basic Ptrees, P12 through P43.
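The one-pass, bottom-up idea can be sketched in Python (a minimal sketch; the stack-based helper and the nested-list tree representation are illustrative assumptions, not the patented implementation):

```python
# Hedged sketch: bottom-up construction of a basic 1-D Ptree in a single
# left-to-right pass over the bit slice, collapsing pure siblings as we go.
def ptree_bottom_up(bits):
    stack = []                          # (completed subtree, height) pairs
    for b in bits:
        node, h = b, 0
        # merge with a completed sibling of the same height; equal pure
        # leaves (both 0 or both 1) collapse into a single leaf
        while stack and stack[-1][1] == h:
            left, _ = stack.pop()
            if left == node and node in (0, 1):
                node = left             # pure siblings: collapse!
            else:
                node = [left, node]     # mixed: keep an internal node
            h += 1
        stack.append((node, h))
    return stack[0][0]

print(ptree_bottom_up([0, 0, 0, 0, 1, 0, 1, 1]))   # [0, [[1, 0], 1]]
```

This yields the same tree as the top-down halving construction on the slide's R11, but touches each bit exactly once.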
26
Thank you.