Partitioning A Uniform Model for Data Mining - PowerPoint PPT Presentation

About This Presentation

Title:

Partitioning A Uniform Model for Data Mining

Description:

Duality. Element-based representation: ... Duality: ... Duality allows moving to space of non-key attributes (Meta P-trees) Can we do that? ... – PowerPoint PPT presentation

Slides: 28

Provided by: DATA86

Learn more at: http://www.cs.ndsu.nodak.edu

Transcript and Presenter's Notes

1
Partitioning A Uniform Model for Data Mining

Anne Denton, Qin Ding, William Jockheck, Qiang
Ding and William Perrizo

2
Motivation

Databases and data warehouses are currently
separate systems
Why?
Standard answer: details, details, details
Our answer: fundamental issue of representation

3
Relations Revisited

R(A1, A2, ..., AN)
Set of tuples
Any choices at a fundamental level?
Yes!
Duality between
Element-based representation
Space-based representation

4
Duality

Element-based representation
Standard representation of tuples with all their
attributes

Space-based representation
The existence (count?) of a tuple is represented
in its attribute space

5
Similar Dualities in Physics

Particles can be represented by the coordinates
of their position
More fundamental level
Particle

Particles can be 1 values in a grid of locations
Field

6
Space-Based Representation

Consider standard tuples as vectors in the space
of attribute domains
Represent all possible attribute combinations as
one bit
1 if data item is present
0 if it isn't
Allowing counts could be useful for projections
(?)
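
A minimal sketch of the space-based representation described on this slide. The relation, its two integer domains, and all names below are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy relation R(A1, A2) with small integer domains (assumed).
domains = [range(4), range(3)]        # domain of A1, domain of A2
tuples = {(0, 1), (2, 2), (3, 0)}     # element-based representation

# Space-based representation: one bit per point of the attribute space,
# 1 if the data item is present and 0 if it isn't.
space = {point: int(point in tuples) for point in product(*domains)}

# Reading the 1 bits back recovers the original set of tuples.
recovered = {point for point, bit in space.items() if bit == 1}
```

The dictionary holds 4 x 3 = 12 bits for only 3 tuples, which hints at why compression becomes important later in the talk.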

7
Space-Based Representation as a Partition

Partitions are mutually exclusive and
collectively exhaustive sets of elements
The Space-Based Representation partitions
attribute space into two sets
Data item present in database (1)
Data item not present (0)

8
Usefulness of Space-Based Representation

No indexes needed: instant value-based access
Index locking becomes dimensional locking
Aggregation very easy due to value-based ordering
Selections become ANDs
What experience do we have with space-based
representations?
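
One way to picture "selections become ANDs" is sketched below, using Python integers as bit vectors over a fixed ordering of the attribute space. The toy relation, the predicates, and the helper `mask` are assumptions for illustration, not part of the presentation.

```python
from itertools import product

# Hypothetical attribute space and relation (assumed for illustration).
domains = [range(4), range(3)]
points = list(product(*domains))      # fixed, value-based ordering of the space
tuples = {(0, 1), (2, 2), (3, 0), (2, 0)}

def mask(pred):
    """Bit vector (a Python int) with bit i set where pred(points[i]) holds."""
    m = 0
    for i, p in enumerate(points):
        if pred(p):
            m |= 1 << i
    return m

present = mask(lambda p: p in tuples)   # the space-based relation itself
a1_is_2 = mask(lambda p: p[0] == 2)     # predicate A1 = 2
a2_lt_2 = mask(lambda p: p[1] < 2)      # predicate A2 < 2

# The selection is a bitwise AND; counting set bits gives the aggregate COUNT.
selected = present & a1_is_2 & a2_lt_2
result = [points[i] for i in range(len(points)) if (selected >> i) & 1]
count = bin(selected).count("1")
```

Here `result` contains only `(2, 0)`, found without consulting any separate index.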

9
Data Cube Representation

One value (e.g., sales) given in the space of the
key attributes
Space-based with respect to key attributes
Element-based with respect to non-key attributes

10
Properties of the Domain Space

Ideally space should have distance, norm, etc.
Especially important for data mining
Does that make sense for all domains?
Can any domain be mapped to integer?

11
Can all Domains be Mapped to Integer?

Simplistic answer: yes!
All information in a computer is saved as bits
Any sequence of bits can be interpreted as an
integer
Problems
Order may be irrelevant, e.g., hair-color
Order may be wrong, e.g., sign bit for int
Spacing may vary, e.g., float (solution in paper:
intervalization)
Domains may be very large, e.g., movies

12
Categorical attributes (irrelevant order)

We need more than one attribute for an
appropriate representation
Data mining solution
1 attribute per domain value
Our solution
1 attribute per bit slice
Values are corners of a Hypercube in
log(Domain Size) dimensions
Distances are given through the MAX metric
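
A small sketch of the bit-slice encoding for a categorical attribute. The hair-color domain and the assignment of codes to values are arbitrary assumptions. With the MAX (Chebyshev) metric, any two distinct corners of the hypercube are at distance 1, so the arbitrary code order does not introduce a spurious ordering.

```python
from math import ceil, log2
from itertools import combinations

# Hypothetical categorical domain; the code assignment is arbitrary (assumed).
hair_colors = ["black", "blond", "brown", "red", "grey"]
n_bits = max(1, ceil(log2(len(hair_colors))))   # log(domain size) dimensions

def encode(value):
    """Map a value to a corner of the n_bits-dimensional hypercube."""
    code = hair_colors.index(value)
    return tuple((code >> (n_bits - 1 - b)) & 1 for b in range(n_bits))

def max_distance(u, v):
    """MAX (Chebyshev) metric: the largest per-dimension difference."""
    return max(abs(a - b) for a, b in zip(u, v))

# Every pair of distinct values ends up at the same distance.
distances = {max_distance(encode(a), encode(b))
             for a, b in combinations(hair_colors, 2)}
```

Five values need 3 bit-slice attributes here, compared with 5 attributes under the one-attribute-per-domain-value encoding.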

13
Fundamental Partition (Space-Based Representation)

d-dimensional representation
d = number of attributes
Number of represented points =
product of all d domain sizes
Exponential in number of dimensions!
We badly need compression!
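
The growth can be illustrated with a few lines of arithmetic. The domain sizes below are made-up values, and `space_size` is a hypothetical helper name.

```python
from math import prod

def space_size(domain_sizes):
    """Number of points in the attribute space: product of all domain sizes."""
    return prod(domain_sizes)

# Hypothetical relation with four attributes (sizes assumed for illustration).
sizes = [256, 256, 1000, 12]
total = space_size(sizes)

# With d attributes of 256 values each, the space grows as 256**d.
growth = [space_size([256] * d) for d in range(1, 5)]
```

With these assumed sizes, `total` is 786,432,000 points, and four 8-bit attributes alone already give more than four billion points.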

14
How Do We Handle Exponential Growth with d?

How can we reduce the number of attributes, d?
Review normalization
We can decompose a relation into a set of
relations each of which contains the entire key
and one other attribute
This decomposition is
lossless
dependency preserving (BCNF relations only)
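
A sketch of the decomposition into one relation per non-key attribute, each carrying the entire key, followed by a rejoin on the key to show that no information is lost. The student/course relation and the function names are hypothetical.

```python
# Hypothetical relation: key = (student, course); non-key = grade, credits.
key = ("student", "course")
non_key = ("grade", "credits")
rows = [
    {"student": 1, "course": "db", "grade": "A", "credits": 3},
    {"student": 1, "course": "os", "grade": "B", "credits": 4},
    {"student": 2, "course": "db", "grade": "C", "credits": 3},
]

def decompose(rows):
    """One mapping per non-key attribute: entire key -> attribute value."""
    return {attr: {tuple(r[k] for k in key): r[attr] for r in rows}
            for attr in non_key}

def rejoin(parts):
    """Natural join of the parts on the shared key."""
    keys = next(iter(parts.values())).keys()
    return [{**dict(zip(key, k)), **{attr: parts[attr][k] for attr in non_key}}
            for k in keys]

parts = decompose(rows)
restored = rejoin(parts)
```

After this step each part has exactly one non-key dimension, which is the situation the compression on the next slide exploits.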

15
Compression for Non-Key Attributes

Fundamental partition contains only one non-zero
data-point in any non-key dimension
Represent number by bit-slices
Note
This works for numerical and categorical
attributes
Original values can be regained by ANDing
Example: 5 (binary 101)
bit 0 AND (NOT bit 1) AND bit 2
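
A sketch of bit slicing and of recovering a value by ANDing. It assumes that bit 0 is the highest-order bit and that a 0 in the target pattern means ANDing with the complemented slice. The attribute values and function name are hypothetical.

```python
# Hypothetical 3-bit attribute values, one per key position (assumed).
values = [5, 3, 5, 7, 0, 4]
n_bits = 3

# slices[b][i] is bit b of values[i]; bit 0 is taken as the highest-order bit.
slices = [[(v >> (n_bits - 1 - b)) & 1 for v in values] for b in range(n_bits)]

def rows_with_value(target):
    """AND the slices, complementing each slice whose target bit is 0."""
    hits = []
    for i in range(len(values)):
        hit = 1
        for b in range(n_bits):
            want = (target >> (n_bits - 1 - b)) & 1
            hit &= slices[b][i] if want else 1 - slices[b][i]
        hits.append(hit)
    return hits

fives = rows_with_value(5)   # 101: bit 0 AND (NOT bit 1) AND bit 2
```

The same loop works for categorical attributes once their values have been given integer codes.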

16
Concept Hierarchies

Bit-sliced representations have significant
benefits beyond compression
Bit slices can be combined into concept
hierarchies
Highest level: bit 0
Next level: bit 0, bit 1
Next level: bit 0, bit 1, bit 2
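
A sketch of how bit-slice prefixes act as a concept hierarchy: keeping only the first k slices groups values into progressively finer intervals. It assumes a 3-bit attribute with bit 0 as the highest-order bit; the function names are hypothetical.

```python
from collections import defaultdict

n_bits = 3  # assumed width of the attribute

def level(value, k):
    """Concept at hierarchy level k: the k highest-order bits of the value."""
    return value >> (n_bits - k)

def groups(values, k):
    """Group values by their level-k concept."""
    out = defaultdict(list)
    for v in values:
        out[level(v, k)].append(v)
    return dict(out)

coarse = groups(range(8), 1)   # bit 0 only: two groups of four values
finer = groups(range(8), 2)    # bit 0, bit 1: four groups of two values
```

At level 1 the values 0 to 3 and 4 to 7 fall into two groups; adding bit 1 splits each group in half.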

17
Compression for Key Attributes

Database state-independent compression could lead
to information loss (counts > 1)
Database state-dependent compression
Tree structure that eliminates pure subtrees ->
P-trees
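
A simplified sketch of the idea of a tree that eliminates pure subtrees, using a quadrant tree over a small two-dimensional bit grid. This is an illustration under assumptions (square grid, side a power of two, nested-list node format) and is not the authors' actual P-tree layout.

```python
def build(grid, r=0, c=0, size=None):
    """Return 0 or 1 for a pure quadrant, else a list of four subtrees."""
    if size is None:
        size = len(grid)
    cells = [grid[i][j] for i in range(r, r + size) for j in range(c, c + size)]
    if all(cells):
        return 1            # pure-1 quadrant: no subtree stored
    if not any(cells):
        return 0            # pure-0 quadrant: no subtree stored
    h = size // 2
    return [build(grid, r, c, h), build(grid, r, c + h, h),
            build(grid, r + h, c, h), build(grid, r + h, c + h, h)]

def count(tree, size):
    """Number of 1 bits represented by a tree covering a size x size area."""
    if tree == 1:
        return size * size
    if tree == 0:
        return 0
    return sum(count(t, size // 2) for t in tree)

# Hypothetical 4 x 4 presence grid over two key attributes.
grid = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
tree = build(grid)
ones = count(tree, len(grid))
```

Three of the four top-level quadrants are pure and collapse to a single value. The structure depends on the current database state, since which quadrants are pure changes as tuples are inserted or deleted.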

18
Other Ideas

Compression is better if attribute values are
dense within their domain
We could use extent domain
Compression good
Problems with insertion
Reorganization of storage
Index locking has to be reintroduced

19
How Good is Compression?

If all domains are dense, i.e. all values occur
Size can easily be smaller than original relation
If non-key attributes are sparse
Not usually a problem: good compression
Problems only in extreme cases
E.g., movies as attribute values!
If key-attributes are sparse
Larger potential for problems, but also large
potential for benefit (see data cubes)

20
Are Key-Attributes Usually Sparse?

Many key attributes are dense (structure
attributes as keys)
Automatically generated IDs are usually
sequential
x and y in spatial data mining
Time in data streams
Keys in tables that represent relationships tend
to be sparse (feature attributes as keys)
Student / course offering / grade
Data cubes!

21
What Have We Gained? (Database Aspects)

Data simultaneously acts as index
No separate index locking
(unless extent domain is used)
All information saved as bit patterns
Easy select
Other database operations discussed in class

22
Data Mining Benefits (Feature Attribute Keys)

Direct mining possible on relations with feature
attribute keys
E.g., student / course offering / grade
Rollup can be defined, etc.
Clustering, classification, ARM can make use of
proximity inherent in representation
Bit-wise representation provides concept
hierarchy for non-key attribute
Tree structure provides concept hierarchy for key
attributes

23
Data Mining Benefits (Structure Attribute Keys)

For relations with structure attribute keys, data
mining requires ANDing
ANDing produces counts for feature attributes
Bit-wise representation provides concept
hierarchy for non-key attribute
Duality
Concept hierarchies in this representation map
exactly to tree structure when the attribute is a
key

24
Mapping Concept Hierarchies: Bit Slices <-> Tree

P-tree
Take key attributes, e.g. x and y, and bit
interleave them
x = 1 0 0 1
y = 1 1 0 1
Interleaved: 1 1 0 1 0 0 1 1
Two consecutive digits form a level in the P-tree
or a level in a concept hierarchy
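
The interleaving example on this slide can be reproduced with a short sketch. The function name `interleave` is hypothetical, and bits are listed from highest order to lowest, as on the slide.

```python
def interleave(x_bits, y_bits):
    """Alternate the bits of x and y, starting with the highest-order x bit."""
    out = []
    for xb, yb in zip(x_bits, y_bits):
        out += [xb, yb]
    return out

x = [1, 0, 0, 1]
y = [1, 1, 0, 1]
z = interleave(x, y)

# Each consecutive pair of digits selects one of four children at a tree level.
levels = [tuple(z[i:i + 2]) for i in range(0, len(z), 2)]
```

Here `z` is `[1, 1, 0, 1, 0, 0, 1, 1]`, and `levels` lists the pairs `(1, 1)`, `(0, 1)`, `(0, 0)`, `(1, 1)`, one per level of the tree.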

25
How Could We Use That Duality?