Partitioning: A Uniform Model for Data Mining

1
Partitioning: A Uniform Model for Data Mining
  • Anne Denton, Qin Ding, William Jockheck, Qiang
    Ding and William Perrizo

2
Motivation
  • Databases and data warehouses are currently
    separate systems
  • Why?
  • Standard answer: details, details, details
  • Our answer: a fundamental issue of representation

3
Relations Revisited
  • R(A1, A2, ..., AN)
  • Set of tuples
  • Any choices at a fundamental level?
  • Yes!
  • Duality between element-based representation and
    space-based representation

4
Duality
  • Element-based representation: the standard
    representation of tuples with all their attributes
  • Space-based representation: the existence (count?)
    of a tuple is represented in its attribute space

5
Similar Dualities in Physics
  • Particle: particles can be represented by the
    coordinates of their position
  • Field (more fundamental level): particles can be
    1 values in a grid of locations

6
Space-Based Representation
  • Consider standard tuples as vectors in the space
    of attribute domains
  • Represent each possible attribute combination as
    one bit
  • 1 if the data item is present
  • 0 if it isn't
  • Allowing counts could be useful for projections
    (?)
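
As a minimal sketch of the idea on this slide, the following builds a bit grid for a toy relation. The relation R(A1, A2), its domain sizes, and its tuples are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy relation R(A1, A2) over small integer domains.
domain_a1 = range(4)   # A1 takes values 0..3
domain_a2 = range(3)   # A2 takes values 0..2

# Element-based representation: a set of tuples.
tuples = {(0, 1), (2, 2), (3, 0)}

# Space-based representation: one bit per point of the attribute space,
# 1 if the tuple is present and 0 if it is not.
grid = [[1 if (a1, a2) in tuples else 0 for a2 in domain_a2]
        for a1 in domain_a1]

# The tuples can be recovered by collecting the coordinates of all 1 bits.
recovered = {(a1, a2) for a1, a2 in product(domain_a1, domain_a2)
             if grid[a1][a2] == 1}
```

Both representations carry the same information; the grid needs 4 x 3 = 12 bits regardless of how many tuples are present.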

7
Space-Based Representation as a Partition
  • Partitions are mutually exclusive and
    collectively exhaustive sets of elements
  • The Space-Based Representation partitions
    attribute space into two sets
  • Data item present in database (1)
  • Data item not present (0)

8
Usefulness of Space-Based Representation
  • No indexes needed: instant value-based access
  • Index locking becomes dimensional locking
  • Aggregation is very easy due to value-based ordering
  • Selections become ANDs
  • What experience do we have with space-based
    representations?
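
A rough sketch of "selections become ANDs", assuming the attribute space of a hypothetical relation R(A1, A2) is flattened into one Python integer used as a bit vector. The helper `bit` and the sample data are illustrative assumptions, not part of the presentation.

```python
# Hypothetical flattening: point (a1, a2) maps to bit position a1 * N_A2 + a2.
N_A2 = 3  # assumed size of the A2 domain


def bit(a1, a2):
    """Return an integer with only the bit for point (a1, a2) set."""
    return 1 << (a1 * N_A2 + a2)


# Existence bits for four tuples of the toy relation.
existence = bit(0, 1) | bit(2, 2) | bit(3, 0) | bit(2, 0)

# Mask covering every point of the space where A1 == 2.
mask_a1_is_2 = sum(bit(2, a2) for a2 in range(N_A2))

# The selection "A1 = 2" is a single AND; counting the 1 bits aggregates it.
selected = existence & mask_a1_is_2
count = bin(selected).count("1")
```

No index is consulted: the position of a bit already encodes the attribute values.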

9
Data Cube Representation
  • One value (e.g., sales) given in the space of the
    key attributes
  • Space-based with respect to key attributes
  • Element-based with respect to non-key attributes

10
Properties of the Domain Space
  • Ideally space should have distance, norm, etc.
  • Especially important for data mining
  • Does that make sense for all domains?
  • Can any domain be mapped to integer?

11
Can all Domains be Mapped to Integer?
  • Simplistic answer: yes!
  • All information in a computer is saved as bits
  • Any sequence of bits can be interpreted as an
    integer
  • Problems
  • Order may be irrelevant, e.g., hair-color
  • Order may be wrong, e.g., sign bit for int
  • Spacing may vary, e.g., float (solution in paper:
    intervalization)
  • Domains may be very large, e.g., movies
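
The "order may be wrong" problem can be illustrated with a short sketch. It reinterprets 32-bit two's-complement integers as unsigned bit patterns; the 32-bit width and the sign-bit flip shown as a remedy are assumptions for illustration (the flip is a common trick, not something stated in the slides).

```python
import struct


def bits_as_unsigned(value):
    """Reinterpret the 32-bit two's-complement pattern of value as unsigned."""
    return struct.unpack("<I", struct.pack("<i", value))[0]


values = [-2, -1, 0, 1, 2]

# Ordering by raw bit pattern puts negatives after positives,
# because the sign bit is the highest-order bit.
by_bits = sorted(values, key=bits_as_unsigned)

# Flipping the sign bit restores the numeric order.
by_flipped = sorted(values, key=lambda v: bits_as_unsigned(v) ^ 0x80000000)
```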

12
Categorical attributes (irrelevant order)
  • We need more than one attribute for an
    appropriate representation
  • Data mining solution: 1 attribute per domain value
  • Our solution: 1 attribute per bit slice
  • Values are corners of a hypercube in
    log(Domain Size) dimensions
  • Distances are given through the MAX metric
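
A small sketch of the bit-slice encoding for a categorical attribute. The hair-color domain, the base-2 logarithm rounded up, and the code assignment are assumptions made for illustration. Under the MAX metric every pair of distinct corners is at distance 1, so the arbitrary order of the codes does not matter.

```python
from itertools import combinations
from math import ceil, log2

# Hypothetical categorical domain with no meaningful order.
hair_colors = ["black", "brown", "blond", "red", "gray"]

# Number of bit slices needed, assuming a base-2 logarithm rounded up.
n_bits = ceil(log2(len(hair_colors)))


def to_corner(index):
    """Map a code to a hypercube corner, most significant bit first."""
    return tuple((index >> (n_bits - 1 - i)) & 1 for i in range(n_bits))


corners = {color: to_corner(i) for i, color in enumerate(hair_colors)}


def max_distance(p, q):
    """MAX metric: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(p, q))


distances = {max_distance(corners[a], corners[b])
             for a, b in combinations(hair_colors, 2)}
```

Here three bit-slice attributes replace the five attributes that one attribute per domain value would require.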

13
Fundamental Partition (Space-Based Representation)
  • d-dimensional representation
  • d = number of attributes
  • # of represented points = product of all d domain
    sizes
  • Exponential in number of dimensions!
  • We badly need compression!
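
To make the growth concrete, here is a short calculation with hypothetical domain sizes (the numbers are invented for illustration only).

```python
from math import prod

# Hypothetical domain sizes for d = 4 attributes.
domain_sizes = [256, 256, 100, 12]

# Number of points in the attribute space = product of all domain sizes.
points = prod(domain_sizes)

# Adding one more attribute with 256 values multiplies the space by 256.
points_with_extra = points * 256
```

Four modest attributes already give 78,643,200 points, one bit each, before any compression.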

14
How Do We Handle Exponential Growth with d?
  • How can we reduce the number of attributes, d?
  • Review normalization
  • We can decompose a relation into a set of
    relations each of which contains the entire key
    and one other attribute
  • This decomposition is
  • lossless
  • dependency preserving (BCNF relations only)
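
The decomposition can be sketched on a toy relation with a single key attribute. The relation R(key, a, b) and its rows are hypothetical; the point is only that joining the pieces on the key gives back the original rows.

```python
# Hypothetical relation R(key, a, b) with a unique key.
rows = {(1, "x", 10), (2, "y", 20), (3, "x", 30)}

# Decompose into relations that each hold the key and one other attribute.
r_a = {(k, a) for k, a, _ in rows}
r_b = {(k, b) for k, _, b in rows}

# Natural join on the key: every key appears exactly once in each piece.
b_by_key = dict(r_b)
rejoined = {(k, a, b_by_key[k]) for k, a in r_a}
```

Each piece now lives in a space of only (key dimensions + 1) dimensions instead of d.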

15
Compression for Non-Key Attributes
  • Fundamental partition contains only one non-zero
    data-point in any non-key dimension
  • Represent number by bit-slices
  • Note: this works for numerical and categorical
    attributes
  • Original values can be regained by ANDing
  • Example: 5 (binary 101) = bit 0 AND (NOT bit 1) AND
    bit 2
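
A minimal sketch of bit slicing and of regaining a value by ANDing. It assumes that bit 0 is the most significant bit, as the "Concept Hierarchies" slide suggests, and that a 0 bit in the target value means the complement of that slice is used. The sample values are hypothetical.

```python
N_BITS = 3
values = [5, 2, 5, 7, 0]  # hypothetical non-key attribute, one value per tuple

# slices[i][t] is bit i of tuple t's value; bit 0 is the most significant.
slices = [[(v >> (N_BITS - 1 - i)) & 1 for v in values]
          for i in range(N_BITS)]


def mask_for(value):
    """AND each slice, or its complement, according to the bits of value."""
    mask = [1] * len(values)
    for i in range(N_BITS):
        want = (value >> (N_BITS - 1 - i)) & 1
        mask = [m & (s if want else 1 - s) for m, s in zip(mask, slices[i])]
    return mask


# 5 is binary 101: bit 0 AND (NOT bit 1) AND bit 2.
mask_5 = mask_for(5)
```

The resulting mask marks exactly the tuples whose value is 5, and summing it gives their count.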

16
Concept Hierarchies
  • Bit-sliced representations have significant
    benefits beyond compression
  • Bit slices can be combined into concept
    hierarchies
  • Highest level: bit 0
  • Next level: bit 0, bit 1
  • Next level: bit 0, bit 1, bit 2
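
One way to read this hierarchy: using only the leading bits of a value groups it with its neighbours at a coarser level. The sketch below assumes 3-bit values with bit 0 as the most significant bit; the data and the helper `level` are hypothetical.

```python
from collections import Counter

N_BITS = 3
values = [5, 2, 4, 7, 0, 3]  # hypothetical attribute values


def level(value, n_bits_used):
    """Keep only the n_bits_used highest-order bits of value."""
    return value >> (N_BITS - n_bits_used)


# Highest level (bit 0 only): two groups, low half and high half.
coarse = Counter(level(v, 1) for v in values)

# Next level (bit 0 and bit 1): four groups.
finer = Counter(level(v, 2) for v in values)
```

Rolling up from the finer level to the coarser one only requires dropping a bit slice.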

17
Compression for Key Attributes
  • Database state-independent compression could lead
    to information loss (counts > 1)
  • Database state-dependent compression
  • Tree structure that eliminates pure subtrees =>
    P-trees
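
The idea of eliminating pure subtrees can be sketched with a simplified one-dimensional binary tree over a bit vector. This is an illustrative assumption, not the authors' P-tree definition: the node layout, the fan-out of 2, and the function names are invented here, and the input length is assumed to be a power of two.

```python
def build(bits):
    """Return 0 or 1 for a pure run, else a (left, right) pair of subtrees."""
    if all(b == bits[0] for b in bits):
        return bits[0]          # pure subtree: stored as a single value
    half = len(bits) // 2
    return (build(bits[:half]), build(bits[half:]))


def count_ones(node, size):
    """Count the 1 bits represented by a node covering size positions."""
    if isinstance(node, int):
        return node * size      # a pure-1 node stands for size ones
    half = size // 2
    return count_ones(node[0], half) + count_ones(node[1], half)


bits = [1, 1, 1, 1, 0, 0, 1, 0]
tree = build(bits)
ones = count_ones(tree, len(bits))
```

The first four bits collapse into a single pure-1 node, so the compression achieved depends on the current database state.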

18
Other Ideas
  • Compression is better if attribute values are
    dense within their domain
  • We could use the extent domain
  • Compression is good
  • Problems with insertion
  • Reorganization of storage
  • Index locking has to be reintroduced

19
How Good is Compression?
  • If all domains are dense, i.e., all values occur:
  • Size can easily be smaller than the original
    relation
  • If non-key attributes are sparse:
  • Not usually a problem: good compression
  • Problems only in extreme cases
  • E.g., movies as attribute values!
  • If key attributes are sparse:
  • Larger potential for problems, but also large
    potential for benefit (see data cubes)

20
Are Key-Attributes Usually Sparse?
  • Many key attributes are dense (structure
    attributes as keys)
  • Automatically generated IDs are usually
    sequential
  • x and y in spatial data mining
  • Time in data streams
  • Keys in tables that represent relationships tend
    to be sparse (feature attributes as keys)
  • Student / course offering / grade
  • Data cubes!

21
What Have We Gained? (Database Aspects)
  • Data simultaneously acts as index
  • No separate index locking
  • (unless extent domain is used)
  • All information saved as bit patterns
  • Easy select
  • Other database operations discussed in class

22
Data Mining Benefits (Feature Attribute Keys)
  • Direct mining possible on relations with feature
    attribute keys
  • E.g., student / course offering / grade
  • Rollup can be defined, etc.
  • Clustering, classification, ARM can make use of
    proximity inherent in representation
  • Bit-wise representation provides concept
    hierarchy for non-key attribute
  • Tree structure provides concept hierarchy for key
    attributes

23
Data Mining Benefits (Structure Attribute Keys)
  • For relations with structure attribute keys, data
    mining requires ANDing
  • ANDing produces counts for feature attributes
  • Bit-wise representation provides concept
    hierarchy for non-key attribute
  • Duality
  • Concept hierarchies in this representation map
    exactly to tree structure when the attribute is a
    key

24
Mapping Concept Hierarchies: Bit Slices <-> Tree
  • P-tree
  • Take key attributes, e.g. x and y, and bit
    interleave them
  • x = 1 0 0 1
  • y = 1 1 0 1
  • Interleaved: 1 1 0 1 0 0 1 1
  • Two consecutive digits form a level in the P-tree
    or a level in a concept hierarchy
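
The interleaving on this slide can be reproduced with a few lines. The function name `interleave` and the use of bit strings are assumptions for illustration; the x and y values are the ones from the slide.

```python
def interleave(x_bits, y_bits):
    """Alternate the bits of x and y, starting with the highest-order x bit."""
    return "".join(xb + yb for xb, yb in zip(x_bits, y_bits))


z = interleave("1001", "1101")

# Each pair of consecutive digits corresponds to one tree level.
levels = [z[i:i + 2] for i in range(0, len(z), 2)]
```

Here `z` is "11010011" and `levels` is ["11", "01", "00", "11"], one entry per level from coarsest to finest.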

25
How Could We Use That Duality?
  • Join with other relations and project off key
    attributes
  • Duality allows moving to space of non-key
    attributes (Meta P-trees)
  • Can we do that?
  • We lose uniqueness
  • We can use 1 to represent 1 or more tuples
    (equivalent to relational algebra)
  • Or we can introduce counts
  • Can be useful for data mining
  • Need for non-duplicate eliminating counts exists
    also in other applications
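
The two options after projecting off the key can be contrasted in a short sketch. The (student_id, grade) rows are hypothetical.

```python
from collections import Counter

# Hypothetical relation (student_id, grade); student_id is the key.
rows = [(1, "A"), (2, "B"), (3, "A"), (4, "A")]

# Project off the key attribute.
grades = [grade for _, grade in rows]

# Option 1: a 1 stands for "one or more tuples" (duplicate-eliminating,
# as in relational algebra).
presence = set(grades)

# Option 2: keep counts, which preserves how often each value occurred.
counts = Counter(grades)
```

The presence set cannot distinguish one "A" from three, which is why counts can be useful for data mining.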

26
How Do Hierarchies Benefit us in Databases?
  • Multi-granularity Locking
  • Subtrees form suitable units for storage in a
    block
  • Fast value-based access!
  • (Data represented as multilevel index)
  • Access speed proportional to
  • # of levels in tree
  • # of bits for bit slices

27
Summary
  • Space-based representation has many benefits
  • Value-based access and storage
  • No separate index needed
  • Rollups easy
  • P-Trees
  • Follow from systematic compression
  • Benefits from concept hierarchies