Title: Partitioning: A Uniform Model for Data Mining
1Partitioning: A Uniform Model for Data Mining
- Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo
2Motivation
- Databases and data warehouses are currently separate systems
- Why?
- Standard answer
- Details, details, details
- Our answer
- Fundamental issue of representation
3Relations Revisited
- R(A1, A2, ..., AN)
- Set of tuples
- Any choices at a fundamental level?
- Yes!
- Duality between
- Element-based representation
- Space-based representation
4Duality
- Element-based representation
- Standard representation of tuples with all their
attributes
- Space-based representation
- The existence (count?) of a tuple is represented
in its attribute space
5Similar Dualities in Physics
- Particles can be represented by the coordinates of their position
- More fundamental level
- Particle
- Particles can be 1 values in a grid of locations
- Field
6Space-Based Representation
- Consider standard tuples as vectors in the space of attribute domains
- Represent each possible attribute combination as one bit
- 1 if data item is present
- 0 if it isn't
- Allowing counts could be useful for projections
(?)
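
A minimal sketch of the space-based representation, assuming a toy relation over two small, hypothetical attribute domains (`color` and `size` are illustrative names, not from the slides). Every point of attribute space gets one bit, which is 1 only if the corresponding tuple is present.

```python
from itertools import product

# Hypothetical attribute domains for a toy relation R(color, size).
DOMAINS = {"color": ["red", "green", "blue"], "size": [0, 1, 2, 3]}

def to_space_based(tuples, domains=DOMAINS):
    """Map every point of attribute space to 1 (tuple present) or 0 (absent)."""
    present = set(tuples)
    return {point: int(point in present) for point in product(*domains.values())}

# Element-based form: the tuples themselves.
relation = [("red", 2), ("blue", 0)]

# Space-based form: one bit per possible attribute combination.
grid = to_space_based(relation)
```

The grid has 3 x 4 = 12 entries regardless of how many tuples are stored, which is why the later slides turn to compression.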
7Space-Based Representation as a Partition
- Partitions are mutually exclusive and collectively exhaustive sets of elements
- The Space-Based Representation partitions attribute space into two sets
- Data item present in database (1)
- Data item not present (0)
8Usefulness of Space-Based Representation
- No indexes needed: instant value-based access
- Index locking becomes dimensional locking
- Aggregation very easy due to value-based ordering
- Selections become ands
- What experience do we have with space-based
representations?
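
One way to picture "selections become ands" is sketched below, assuming a two-attribute space with toy domain sizes and a row-major bit layout (both are assumptions made for illustration). The data and each selection predicate are bitmaps over the same attribute space, so a conjunctive selection is a bitwise AND.

```python
X_SIZE, Y_SIZE = 4, 4  # assumed toy domain sizes

def position(x, y):
    """Bit position of point (x, y) in row-major order."""
    return x * Y_SIZE + y

def bitmap(points):
    """Integer whose set bits mark the given points of attribute space."""
    bits = 0
    for x, y in points:
        bits |= 1 << position(x, y)
    return bits

data = bitmap([(0, 1), (2, 0), (2, 3), (3, 3)])

# Predicate bitmaps: all points with x == 2, and all points with y == 3.
x_is_2 = bitmap((2, y) for y in range(Y_SIZE))
y_is_3 = bitmap((x, 3) for x in range(X_SIZE))

# Selection "x = 2 AND y = 3" is a bitwise AND of data and predicates.
selected = data & x_is_2 & y_is_3

# Aggregation (a count) is a population count of the selected bits.
count_x2 = bin(data & x_is_2).count("1")

# Value-based access: test one bit directly, no index lookup.
has_2_0 = (data >> position(2, 0)) & 1
```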
9Data Cube Representation
- One value (e.g., sales) given in the space of the key attributes
- Space-based with respect to key attributes
- Element-based with respect to non-key attributes
10Properties of the Domain Space
- Ideally space should have distance, norm, etc.
- Especially important for data mining
- Does that make sense for all domains?
- Can any domain be mapped to integer?
11Can all Domains be Mapped to Integer?
- Simplistic answer: yes!
- All information in a computer is saved as bits
- Any sequence of bits can be interpreted as an integer
- Problems
- Order may be irrelevant, e.g., hair-color
- Order may be wrong, e.g., sign bit for int
- Spacing may vary, e.g., float (solution in paper: intervalization)
- Domains may be very large, e.g., movies
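
The "order may be wrong" problem can be illustrated with 8-bit two's-complement integers. The sketch below first shows the raw bit pattern ordering -1 above 1, then applies one common remedy, flipping the sign bit. The remedy is an assumption added for illustration and is not taken from the slides.

```python
def raw_bits(value, width=8):
    """Two's-complement bit pattern of a signed int, read as an unsigned int."""
    return value & ((1 << width) - 1)

def order_preserving(value, width=8):
    """Flip the sign bit so that unsigned order matches signed order."""
    return raw_bits(value, width) ^ (1 << (width - 1))

# Read naively, the pattern for -1 (all ones) sorts above the pattern for 1.
naive_order_is_wrong = raw_bits(-1) > raw_bits(1)

# After flipping the sign bit, -1 sorts below 1 again.
fixed_order_is_right = order_preserving(-1) < order_preserving(1)
```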
12Categorical attributes (irrelevant order)
- We need more than one attribute for an appropriate representation
- Data mining solution
- 1 attribute per domain value
- Our solution
- 1 attribute per bit slice
- Values are corners of a hypercube in log(Domain Size) dimensions
- Distances are given through MAX metric
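
The two encodings can be compared in a short sketch, assuming a hypothetical five-value hair-color domain and arbitrary integer codes. With one attribute per bit slice, each value is a corner of a hypercube, and the MAX metric gives distance 1 between any two distinct values, so no spurious order is introduced.

```python
from math import ceil, log2

HAIR_COLORS = ["black", "brown", "blond", "red", "gray"]  # hypothetical domain

def one_hot(value, domain=HAIR_COLORS):
    """Common data mining encoding: one attribute per domain value."""
    return [int(v == value) for v in domain]

def bit_slices(value, domain=HAIR_COLORS):
    """One attribute per bit of the value's code (bit 0 = most significant)."""
    width = max(1, ceil(log2(len(domain))))
    code = domain.index(value)
    return [(code >> (width - 1 - i)) & 1 for i in range(width)]

def max_metric(a, b):
    """MAX metric: largest coordinate-wise difference."""
    return max(abs(p - q) for p, q in zip(a, b))

# All pairwise distances between distinct values in the bit-slice encoding.
distances = {
    max_metric(bit_slices(u), bit_slices(v))
    for u in HAIR_COLORS for v in HAIR_COLORS if u != v
}
```

Here five values need five one-hot attributes but only three bit-slice attributes.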
13Fundamental Partition (Space-Based Representation)
- d-dimensional representation
- d = number of attributes
- Number of represented points = product of all d domain sizes
- Exponential in number of dimensions!
- We badly need compression!
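
A quick arithmetic check of the growth, assuming every attribute has a hypothetical domain of 256 values:

```python
from math import prod

def space_size(domain_sizes):
    """Number of points in attribute space: product of all domain sizes."""
    return prod(domain_sizes)

three_attributes = space_size([256] * 3)  # 256 ** 3 points
four_attributes = space_size([256] * 4)   # 256 ** 4 points
```

Adding one more 256-value attribute multiplies the number of bits to store by 256.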
14How Do We Handle Exponential Growth with d?
- How can we reduce the number of attributes, d?
- Review normalization
- We can decompose a relation into a set of relations each of which contains the entire key and one other attribute
- This decomposition is
- lossless
- dependency preserving (BCNF relations only)
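
The decomposition can be sketched as follows, assuming a toy relation with key (x, y) and two non-key attributes. The attribute names `band1` and `band2` are hypothetical. Each part holds the entire key plus one other attribute, and joining the parts on the key gives back the original rows.

```python
def decompose(rows, key, attributes):
    """Split rows (dicts) into one {key_tuple: value} mapping per non-key attribute."""
    return {
        attr: {tuple(row[k] for k in key): row[attr] for row in rows}
        for attr in attributes
    }

def rejoin(parts, key):
    """Natural join of all parts on the shared key."""
    keys = set.intersection(*(set(part) for part in parts.values()))
    return [
        {**dict(zip(key, kt)), **{attr: part[kt] for attr, part in parts.items()}}
        for kt in sorted(keys)
    ]

rows = [
    {"x": 0, "y": 1, "band1": 5, "band2": 7},
    {"x": 1, "y": 0, "band1": 3, "band2": 2},
]
parts = decompose(rows, key=("x", "y"), attributes=("band1", "band2"))
restored = rejoin(parts, key=("x", "y"))
```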
15Compression for Non-Key Attributes
- Fundamental partition contains only one non-zero data-point in any non-key dimension
- Represent number by bit-slices
- Note
- This works for numerical and categorical attributes
- Original values can be regained by anding
- Example: 5 (binary 101)
- bit 0 AND (NOT bit 1) AND bit 2
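
A minimal sketch of bit slicing and of regaining a value by anding. It assumes 3-bit values, takes bit 0 to be the most significant bit, and complements a slice wherever the target value has a 0 bit. These conventions are assumptions chosen to match the example value 5 (binary 101).

```python
WIDTH = 3  # assumed number of bits per value

def slices(values, width=WIDTH):
    """Bit slice i holds bit i (bit 0 = most significant) of every value."""
    return [[(v >> (width - 1 - i)) & 1 for v in values] for i in range(width)]

def rows_equal_to(target, bit_slices, width=WIDTH):
    """AND the slices, using the complement where the target bit is 0."""
    result = [1] * len(bit_slices[0])
    for i, bit_slice in enumerate(bit_slices):
        want = (target >> (width - 1 - i)) & 1
        result = [r & (b if want else 1 - b) for r, b in zip(result, bit_slice)]
    return result

values = [5, 2, 7, 5, 0]           # one non-key attribute, one value per key
s = slices(values)                 # three bit vectors instead of one column
matches_5 = rows_equal_to(5, s)    # bit 0 AND (NOT bit 1) AND bit 2
```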
16Concept Hierarchies
- Bit-sliced representations have significant benefits beyond compression
- Bit slices can be combined into concept hierarchies
- Highest level: bit 0
- Next level: bit 0, bit 1
- Next level: bit 0, bit 1, bit 2
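
The hierarchy can be read as "keep only the highest-order bits". The sketch below assumes 3-bit values with bit 0 as the most significant bit. The function name `concept` and the sample values are illustrative.

```python
from collections import Counter

def concept(value, level, width=3):
    """Keep the `level` highest-order bits of value (level 1 uses bit 0 only)."""
    return value >> (width - level)

values = [5, 2, 7, 5, 0]

# Level 1 (bit 0 only): two groups, values 0-3 and values 4-7.
level1_counts = Counter(concept(v, 1) for v in values)

# Level 2 (bit 0 and bit 1): four groups of two values each.
level2_counts = Counter(concept(v, 2) for v in values)
```

Rolling up from level 2 to level 1 only requires dropping the last bit used.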
17Compression for Key Attributes
- Database state-independent compression could lead to information loss (counts > 1)
- Database state-dependent compression
- Tree structure that eliminates pure subtrees -> P-trees
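
The idea of eliminating pure subtrees can be sketched with a simplified binary tree over a bit vector. This is an illustration under simplifying assumptions (binary splits, no stored counts) and not the authors' P-tree format: a subtree that is all 0s or all 1s is stored as a single leaf.

```python
def build(bits):
    """Return 0 or 1 for a pure run, else a (left, right) pair of subtrees."""
    if all(b == bits[0] for b in bits):
        return bits[0]
    mid = len(bits) // 2
    return (build(bits[:mid]), build(bits[mid:]))

def count_ones(node, size):
    """Count 1-bits without expanding pure subtrees."""
    if isinstance(node, int):
        return node * size
    half = size // 2
    return count_ones(node[0], half) + count_ones(node[1], size - half)

bits = [1, 1, 1, 1, 0, 0, 1, 0]
tree = build(bits)
ones = count_ones(tree, len(bits))
```

The compression depends on the database state: the same tree shape only stays small while long pure runs exist in the data.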
18Other Ideas
- Compression is better if attribute values are dense within their domain
- We could use extent domain
- Compression good
- Problems with insertion
- Reorganization of storage
- Index locking has to be reintroduced
19How Good is Compression?
- If all domains are dense, i.e. all values occur
- Size can easily be smaller than original relation
- If non-key attributes are sparse
- Not usually a problem: good compression
- Problems only in extreme cases
- E.g., movies as attribute values!
- If key-attributes are sparse
- Larger potential for problems, but also large
potential for benefit (see data cubes)
20Are Key-Attributes Usually Sparse?
- Many key attributes are dense (structure attributes as keys)
- Automatically generated IDs are usually sequential
- x and y in spatial data mining
- Time in data streams
- Keys in tables that represent relationships tend to be sparse (feature attributes as keys)
- Student / course offering / grade
- Data cubes!
21What Have We Gained? (Database Aspects)
- Data simultaneously acts as index
- No separate index locking
- (unless extent domain is used)
- All information saved as bit patterns
- Easy select
- Other database operations discussed in class
22Data Mining Benefits (Feature Attribute Keys)
- Direct mining possible on relations with feature attribute keys
- E.g., student / course offering / grade
- Rollup can be defined, etc.
- Clustering, classification, ARM can make use of proximity inherent in representation
- Bit-wise representation provides concept hierarchy for non-key attributes
- Tree structure provides concept hierarchy for key attributes
23Data Mining Benefits (Structure Attribute Keys)
- For relations with structure attribute keys data mining requires anding
- Anding produces counts for feature attributes
- Bit-wise representation provides concept hierarchy for non-key attributes
- Duality
- Concept hierarchies in this representation map exactly to tree structure when the attribute is a key
24Mapping Concept Hierarchies: Bit Slices <-> Tree
- P-tree
- Take key attributes, e.g. x and y, and bit interleave them
- x = 1 0 0 1
- y = 1 1 0 1
- Interleaved: 1 1 0 1 0 0 1 1
- Two consecutive digits form a level in the P-tree or a level in a concept hierarchy
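
The bit interleaving on this slide can be reproduced with a short sketch. The function names are illustrative; the input bits are the x and y values from the slide.

```python
def interleave(x_bits, y_bits):
    """Alternate the bits of x and y, highest-order bits first."""
    out = []
    for xb, yb in zip(x_bits, y_bits):
        out += [xb, yb]
    return out

def levels(interleaved):
    """Group consecutive digit pairs; each pair is one level of the tree."""
    return [tuple(interleaved[i:i + 2]) for i in range(0, len(interleaved), 2)]

x = [1, 0, 0, 1]
y = [1, 1, 0, 1]
key = interleave(x, y)
tree_path = levels(key)
```

Each (x bit, y bit) pair picks one of four quadrants, so the list of pairs is a path from the root of the tree down to a single point.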
25How Could We Use That Duality?
- Join with other relations and project off key attributes
- Duality allows moving to space of non-key attributes (Meta P-trees)
- Can we do that?
- We lose uniqueness
- We can use 1 to represent 1 or more tuples (equivalent to relational algebra)
- Or we can introduce counts
- Can be useful for data mining
- Need for non-duplicate eliminating counts exists also in other applications
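
The two options after projecting off the key can be contrasted in a small sketch, assuming a hypothetical (student_id, grade) relation. A set keeps a single 1 per value, as relational algebra does, while a counter keeps how many tuples mapped to each value.

```python
from collections import Counter

# Hypothetical relation: (student_id, grade), with student_id as the key.
rows = [(1, "A"), (2, "A"), (3, "B")]

# Option 1: duplicate-eliminating projection; 1 stands for "one or more tuples".
as_set = {grade for _, grade in rows}

# Option 2: projection with counts; duplicates are kept as multiplicities.
as_counts = Counter(grade for _, grade in rows)
```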
26How Do Hierarchies Benefit us in Databases?
- Multi-granularity Locking
- Subtrees form suitable units for storage in a block
- Fast value-based access!
- (Data represented as multilevel index)
- Access speed proportional to
- Number of levels in tree
- Number of bits for bit slices
27Summary
- Space-based representation has many benefits
- Value-based access and storage
- No separate index needed
- Rollups easy
- P-Trees
- Follow from systematic compression
- Benefits from concept hierarchies