Title: Aggregate features for relational data (Claudia Perlich, Foster Provost)
Pat Tressel 16-May-2005
Overview
- Perlich and Provost provide...
- Hierarchy of aggregation methods
- Survey of existing aggregation methods
- New aggregation methods
- Concerned w/ supervised learning only
- But much seems applicable to clustering
The issues
- Most classifiers use feature vectors
- Individual features have fixed arity
- No links to other objects
- How do we get feature vectors from relational data?
- Flatten it
- Joins
- Aggregation
- (Are feature vectors all there are?)
Joins
- Why consider them?
- Yield flat feature vectors
- Preserve all the data
- Why not use them?
- They emphasize data with many references
- OK if that's what we want
- Not ok if sampling was skewed
- Cascaded or transitive joins blow up
Joins
- They emphasize data with many references
- Lots more Joes than there were before...
Joins
- Why not use them?
- What if we don't know the references?
- Try out everything with everything else
- Cross product yields all combinations
- Adds fictitious relationships
- Combinatorial blowup
Joins
- What if we don't know the references?
Aggregates
- Why use them?
- Yield flat feature vectors
- No blowup in number of tuples
- Can group tuples in all related tables
- Can keep as detailed stats as desired
- Not just max, mean, etc.
- Parametric dists from sufficient stats
- Can apply tests for grouping
- Choice of aggregates can be model-based
- Better generalization
- Include domain knowledge in model choice
Aggregates
- Anything wrong with them?
- Data is lost
- Relational structure is lost
- Influential individuals are lumped in
- Doesn't discover critical individuals
- Dominates other data
- Any choice of aggregates assumes a model
- What if it's wrong?
- Adding new data can require recalculation
- But can avoid the issue by keeping sufficient statistics
Taxonomy of aggregates
- Why is this useful?
- Promote deliberate use of aggregates
- Point out gaps in current use of aggregates
- Find appropriate techniques for each class
- Based on complexity due to
- Relational structure
- Cardinality of the relations (1-1, 1-n, m-n)
- Feature extraction
- Computing the aggregates
- Class prediction
Taxonomy of aggregates
- Formal statement of the task
- Notation (here and on following slides)
- Caution! Simplified from what's in the paper!
- t, tuple (from target table T, with main features)
- y, class (known per t if training)
- ?, aggregation function
- F, classification function
- s, select operation (where joins preserve t)
- O, all tables; B, any other table; b, a tuple in B
- u, fields to be added to t from joined tables
- f, a field in u
- More that doesn't fit on this slide
Aggregation complexity
- Simple
- One field from one object type
Aggregation complexity
- Multi-dimensional
- Multiple fields, one object type
Aggregation complexity
- Multi-type
- Multiple object types
Relational concept complexity
- Propositional
- No aggregation
- Single tuple, 1-1 or n-1 joins
- n-1 is just a shared object
- Not relational per se; already flat
Relational concept complexity
- Independent fields
- Separate aggregation per field
- Separate 1-n joins with T
Relational concept complexity
- Dependent fields in same table
- Multi-dimensional aggregation
- Separate 1-n joins with T
Relational concept complexity
- Dependent fields over multiple tables
- Multi-type aggregation
- Separate 1-n joins, still only with T
Relational concept complexity
- Global
- Any joins or combinations of fields
- Multi-type aggregation
- Multi-way joins
- Joins among tables other than T
Current relational aggregation
- First-order logic
- Find clauses that directly predict the class
- The aggregation function is OR
- Form binary features from tests
- Logical and arithmetic tests
- These go in the feature vector
- The classification function F is any ordinary classifier
Current relational aggregation
- The usual database aggregates (see the sketch below)
- For numerical values
- mean, min, max, count, sum, etc.
- For categorical values
- Most common value
- Count per value
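A minimal sketch of these standard aggregates in pandas, assuming a hypothetical one-to-many detail table `trans` keyed by `customer_id` (table and column names are made up, not from the paper):

```python
import pandas as pd

# Hypothetical one-to-many detail table keyed by customer_id.
trans = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [10.0, 25.0, 5.0, 40.0, 40.0],
    "category":    ["food", "gas", "food", "gas", "gas"],
})

# Numerical field: mean, min, max, count, sum per target tuple.
num_aggs = trans.groupby("customer_id")["amount"].agg(
    ["mean", "min", "max", "count", "sum"])

# Categorical field: most common value, plus count per value.
mode = trans.groupby("customer_id")["category"].agg(lambda s: s.mode().iloc[0])
counts = pd.crosstab(trans["customer_id"], trans["category"])

# The aggregates become columns of the flat feature vector per target tuple.
flat = num_aggs.join(mode.rename("top_category")).join(counts)
print(flat)
```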
Current relational aggregation
- Set distance
- Two tuples, each with a set of related tuples
- Distance metric between related fields
- Euclidean for numerical data
- Edit distance for categorical
- Distance between sets is the distance of the closest pair (see the sketch below)
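A small sketch of the set-distance idea, simplified to purely numeric related tuples with Euclidean distance between pairs; the mixed version above would swap in an edit distance for categorical fields:

```python
import numpy as np

def set_distance(bag_a: np.ndarray, bag_b: np.ndarray) -> float:
    """Distance between two sets of related tuples:
    the distance of the closest pair."""
    # Pairwise Euclidean distances between every row of bag_a and bag_b.
    diffs = bag_a[:, None, :] - bag_b[None, :, :]
    pair_dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return pair_dists.min()

# Two target tuples, each with a bag of related (numeric) tuples.
bag_1 = np.array([[1.0, 2.0], [3.0, 4.0]])
bag_2 = np.array([[3.5, 4.5], [10.0, 0.0]])
print(set_distance(bag_1, bag_2))  # closest pair is (3, 4) vs (3.5, 4.5)
```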
Proposed relational aggregation
- Recall the point of this work
- Tuple t from table T is part of a feature vector
- Want to augment w/ info from other tables
- Info added to t must be consistent w/ values in t
- Need to flatten the added info to yield one vector per tuple t
- Use that to
- Train classifier given class y for t
- Predict class y for t
Proposed relational aggregation
- Outline of steps
- Do query to get more info u from other tables
- Partition the results based on
- Main features t
- Class y
- Predicates on t
- Extract distributions over results for fields in u
- Get a distribution for each partition
- For now, limit to categorical fields
- Suggest extension to numerical fields
- Derive features from distributions
Do query to get info from other tables
- Select
- Based on the target table T
- If training, known class y is included in T
- Joins must preserve distinct values from T
- Join on as much of T's key as is present in the other table
- Maybe need to constrain other fields?
- Not a problem for correctly normalized tables
- Project
- Include all of t
- Append additional fields u from joined tables
- Anything up to all fields from joins (see the query sketch below)
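A sketch of the select-and-project step using pandas merges, with a hypothetical target table `T` and one related table `B` joined on T's key; a left join is used here so every distinct t is preserved even when it has no related tuples:

```python
import pandas as pd

# Target table T: one row per case, with known class y (training).
T = pd.DataFrame({"t_id": [1, 2], "age": [30, 45], "y": [1, 0]})

# Another table B with a 1-n relationship back to T.
B = pd.DataFrame({"t_id": [1, 1, 2, 2, 2],
                  "product": ["a", "b", "a", "a", "c"]})

# Join on as much of T's key as is present in B.
joined = T.merge(B, on="t_id", how="left")

# Project: keep all of t plus the additional fields u from the join.
u_fields = ["product"]
print(joined[["t_id", "y"] + u_fields])
```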
Extract distributions
- Partition query results various ways, e.g.
- Into cases per each t
- For training, include the (known) class y in t
- Also (if training) split per each class
- Want this for class priors
- Split per some (unspecified) predicate c(t)
- For each partition
- There is a bag of associated u tuples
- Ignore the t part; it's already a flat vector
- Split vertically to get bags of individual values per each field f in u
- Note this breaks the association between fields!
Distributions for categorical fields
- Let the categorical field be f with values f_i
- Form a histogram for each partition (see the sketch below)
- Count instances of each value f_i of f in a bag
- These are sufficient statistics for
- Distribution over f_i values
- Probability of each bag in the partition
- Start with one per each tuple t and field f
- C^f_t, (per-)case vector
- Component C^f_{t,i}, count for f_i
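A minimal sketch of the per-case histograms, continuing the hypothetical example: one count vector per target tuple t over the values of a categorical field f (what the slides call the case vector C^f_t):

```python
import pandas as pd

joined = pd.DataFrame({
    "t_id":    [1, 1, 2, 2, 2],
    "y":       [1, 1, 0, 0, 0],
    "product": ["a", "b", "a", "a", "c"],
})

# Case vectors C^f_t: for each tuple t, counts of each value f_i of field f.
case_vectors = pd.crosstab(joined["t_id"], joined["product"])
print(case_vectors)
#          a  b  c
# t_id
# 1        1  1  0
# 2        2  0  1
```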
Distributions for categorical fields
- Distribution of histograms per predicate c(t) and field f
- Treat histogram counts as random variables
- Regard the c(t)-true partition as a collection of histogram samples
- Regard histograms as vectors of random variables, one per field value f_i
- Extract moments of these histogram count distributions
- mean (sort of): reference vector
- variance (sort of): variance vector
Distributions for categorical fields
- Net histogram per predicate c(t), field f
- c(t) partitions tuples t into two groups
- Only histogram the c(t) true group
- Could include not-c as a predicate if we want
- Don't re-count!
- Already have histograms for each t and f: the case vectors
- Sum the case vectors columnwise
- Call this a reference vector, R^f_c
- Proportional to the average histogram over t for c(t) true (weighted by samples per t)
Distributions for categorical fields
- Variance of case histograms per predicate c(t) and field f
- Define a variance vector, V^f_c
- Columnwise sum of squares of the case vectors / number of samples with c(t) true
- Not an actual variance
- Squared means not subtracted
- Don't care
- It's indicative of the variance...
- Throw in means-based features as well to give the classifier full variance info (see the sketch below)
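A sketch of the reference and variance vectors for the class-membership predicates, continuing the hypothetical case vectors above; R^f_c is the columnwise sum of case vectors over the c(t)-true group, and V^f_c is the columnwise mean of their squares (second moment, means not subtracted), as described on these slides:

```python
import pandas as pd

case_vectors = pd.DataFrame(
    {"a": [1, 2], "b": [1, 0], "c": [0, 1]}, index=[1, 2])
classes = pd.Series({1: 1, 2: 0})  # class y per tuple t

def reference_and_variance(cv: pd.DataFrame, mask: pd.Series):
    """Reference vector: columnwise sum of case vectors with c(t) true.
    'Variance' vector: columnwise mean of squared case vectors."""
    group = cv[mask]
    R = group.sum(axis=0)
    V = (group ** 2).sum(axis=0) / len(group)
    return R, V

R_pos, V_pos = reference_and_variance(case_vectors, classes == 1)
R_neg, V_neg = reference_and_variance(case_vectors, classes == 0)
R_all, _ = reference_and_variance(case_vectors, classes.notna())  # unconditional
print(R_pos, V_pos, sep="\n")
```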
Distributions for categorical fields
- What predicates might we use?
- Unconditionally true: c(t) = true
- Result is a net distribution independent of t
- Unconditional reference vector, R
- Per class k: c_k(t) is (t.y = k)
- Class priors
- Recall for training data, y is a field in t
- Per-class reference vector, R^f_{t.y=k}
Distributions for categorical fields
- Summary of notation
- c(t), a predicate based on values in a tuple t
- f, a categorical field from a join with T
- f_i, values of f
- R^f_c, reference vector
- histogram over f_i values in the bag for c(t) true
- C^f_t, case vector
- histogram over f_i values for t's bag
- R, unconditional reference vector
- V^f_c, variance vector
- columnwise average of the squared case vectors
- X_i, the i-th value in some reference vector X
Distributions for numerical data
- Same general idea: representative distributions per various partitions
- Can use the categorical techniques if we
- Bin the numerical values
- Treat each bin as a categorical value (see the binning sketch below)
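A sketch of the binning step, assuming equal-width bins via pandas.cut (the slides do not commit to a particular binning scheme); each bin label is then histogrammed per case exactly like a categorical value:

```python
import pandas as pd

joined = pd.DataFrame({"t_id": [1, 1, 2, 2, 2],
                       "amount": [10.0, 25.0, 5.0, 40.0, 40.0]})

# Discretize the numerical field into bins, then treat each bin
# as a categorical value and histogram it per case as before.
joined["amount_bin"] = pd.cut(joined["amount"], bins=3,
                              labels=["low", "mid", "high"])
case_vectors = pd.crosstab(joined["t_id"], joined["amount_bin"])
print(case_vectors)
```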
Feature extraction
- Base features on ref. and variance vectors
- Two kinds
- Interesting values
- one value from case reference vector per t
- same column in vector for all t
- assorted options for choosing column
- choices depend on predicate ref. vectors
- Vector distances
- distance between a case vector and a predicate reference vector
- various distance metrics
- More notation: an acronym for each feature type
Feature extraction: interesting values
- For a given c, f, select that f_i which is... (see the sketch below)
- MOC: most common overall
- argmax_i R_i
- Most common in each class
- For binary class y
- Positive is y = 1, negative is y = 0
- MOP: argmax_i R^f_{t.y=1,i}
- MON: argmax_i R^f_{t.y=0,i}
- Most distinctive per class
- Common in one class but not in the other(s)
- MOD: argmax_i (R^f_{t.y=1,i} - R^f_{t.y=0,i})
- MOM: argmax_i (R^f_{t.y=1,i} - R^f_{t.y=0,i}) / (V^f_{t.y=1,i} - V^f_{t.y=0,i})
- Normalizes for variance (sort of)
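A sketch of the interesting-value features using the formulas as reconstructed above (whether MOD/MOM use signed or absolute differences is not settled here); the reference and variance vectors are small made-up examples:

```python
import numpy as np
import pandas as pd

# Hypothetical per-class reference and variance vectors over values f_i.
R_pos = pd.Series({"a": 1, "b": 1, "c": 0})
R_neg = pd.Series({"a": 2, "b": 0, "c": 1})
V_pos = pd.Series({"a": 1.0, "b": 1.0, "c": 0.0})
V_neg = pd.Series({"a": 4.0, "b": 0.0, "c": 1.0})
R_all = R_pos + R_neg

MOC = R_all.idxmax()            # most common overall
MOP = R_pos.idxmax()            # most common among positives
MON = R_neg.idxmax()            # most common among negatives
MOD = (R_pos - R_neg).idxmax()  # most distinctive for the positive class
MOM = ((R_pos - R_neg) / (V_pos - V_neg).replace(0, np.nan)).idxmax()

# The per-case feature is the count in the chosen column,
# the same column for every case vector C^f_t.
case_vectors = pd.DataFrame({"a": [1, 2], "b": [1, 0], "c": [0, 1]}, index=[1, 2])
moc_feature = case_vectors[MOC]
print(MOC, MOP, MON, MOD, MOM)
```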
Feature extraction: vector distance
- Distance between a given reference vector and each case vector (see the sketch below)
- Distance metrics
- ED: edit; not defined
- Sum of absolute differences, a.k.a. Manhattan distance?
- Σ_i |C_i - R_i|
- EU: Euclidean
- √((C - R)^T (C - R)); omit the √ for speed
- MA: Mahalanobis
- √((C - R)^T S^-1 (C - R)); omit the √ for speed
- S should be a covariance... of what?
- CO: cosine, 1 - cos(angle between vectors)
- 1 - C^T R / √((C^T C)(R^T R))
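A sketch of the four distance features between a case vector C and a reference vector R, following the reconstructed formulas above; the Mahalanobis covariance is estimated from the case vectors and regularized here, which is only a guess at what the paper intends:

```python
import numpy as np

def distance_features(C, R, cov):
    d = C - R
    ed = np.abs(d).sum()                      # "edit" / Manhattan
    eu = d @ d                                # squared Euclidean (sqrt omitted)
    ma = d @ np.linalg.inv(cov) @ d           # squared Mahalanobis (sqrt omitted)
    co = 1.0 - (C @ R) / np.sqrt((C @ C) * (R @ R))  # cosine distance
    return {"ED": ed, "EU": eu, "MA": ma, "CO": co}

case_vectors = np.array([[1.0, 1.0, 0.0], [2.0, 0.0, 1.0]])
R = case_vectors.sum(axis=0)                  # unconditional reference vector
cov = np.cov(case_vectors, rowvar=False) + 1e-6 * np.eye(3)  # regularized
print(distance_features(case_vectors[0], R, cov))
```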
Feature extraction: vector distance
- Apply each metric w/ various ref. vectors
- Acronym is metric w/ suffix for ref. vector
- (No suffix) Unconditional ref. vector
- P: per-class positive ref. vector, R^f_{t.y=1}
- N: per-class negative ref. vector, R^f_{t.y=0}
- D: difference between the P and N distances
- Alphabet soup, e.g. EUP, MAD,...
Feature extraction
- Other features added for tests
- Not part of their aggregation proposal
- AH abstraction hierarchy (?)
- Pull into T all fields that are just shared records via n-1 references
- AC: autocorrelation aggregation
- For joins back into T, get other cases linked to each t
- Fraction of positive cases among the others
Learning
- Find linked tables
- Starting from T, do a breadth-first walk of the schema graph (see the sketch below)
- Up to some max depth
- Cap number of paths followed
- For each path, know T is linked to the last table in the path
- Extract aggregate fields
- Pull in all fields of last table in path
- Aggregate them (using new aggregates) per t
- Append aggregates to t
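A sketch of the breadth-first enumeration of join paths out of T, over a made-up schema graph given as an adjacency list, with the depth and path caps mentioned above:

```python
from collections import deque

# Hypothetical schema graph: which tables can be joined to which.
schema = {
    "T": ["Orders", "Accounts"],
    "Orders": ["T", "Products"],
    "Accounts": ["T"],
    "Products": ["Orders"],
}

def join_paths(start, max_depth=2, max_paths=10):
    """Breadth-first walk from the target table; every path links
    the target table to the path's last table."""
    paths, queue = [], deque([[start]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)          # T is linked to path[-1]
        if len(path) - 1 < max_depth:
            for nxt in schema.get(path[-1], []):
                queue.append(path + [nxt])
    return paths

for p in join_paths("T"):
    print(" -> ".join(p))
```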
Learning
- Classifier
- Pick 10 subsets each w/ 10 features
- Random choice, weighted by performance
- But there's no classifier yet... so how do features predict the class?
- Build a decision tree for each feature set (see the sketch below)
- Have class frequencies at leaves
- Features might not completely distinguish classes
- Class prediction
- Select class with higher frequency
- Class probability estimation
- Average frequencies over trees
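A sketch of this learning step with scikit-learn decision trees: several random 10-feature subsets, one shallow tree per subset, and class probabilities averaged over the trees; the performance-weighted subset selection is omitted for brevity, and the data are synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))            # hypothetical feature matrix
y = (X[:, 0] + X[:, 5] > 0).astype(int)   # hypothetical class

trees, subsets = [], []
for _ in range(10):
    cols = rng.choice(X.shape[1], size=10, replace=False)
    # Shallow tree: leaves keep class frequencies rather than pure classes.
    tree = DecisionTreeClassifier(max_depth=3).fit(X[:, cols], y)
    trees.append(tree)
    subsets.append(cols)

# Class probability estimate: average leaf frequencies over the trees;
# class prediction: the class with the higher averaged frequency.
probs = np.mean([t.predict_proba(X[:, c]) for t, c in zip(trees, subsets)], axis=0)
pred = probs.argmax(axis=1)
print("train accuracy:", (pred == y).mean())
```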
Tests
- IPO data
- 5 tables w/ small, simple schema
- Majority of fields were in the main table, i.e. T
- The only numeric fields were in the main table, so no aggregation of numeric features was needed
- Other tables had a key plus one data field
- Max path length of 2 to reach all tables, no recursion
- Predicate on one field in T used as the class
- Tested against
- First-order logic aggregation
- Extract clauses using an ILP system
- Append evaluated clauses to each t
- Various ILP systems
- Using just data in T (or T and AH features?)
Test results
- See paper for numbers
- Accuracy with aggregate features
- Up to 10% increase over using only features from T
- Depends on which and how many extra features used
- Most predictive feature was in a separate table
- Expect accuracy increase as more info available
- Shows info was not destroyed by aggregation
- Vector distance features better
- Generalization
Interesting ideas (I), benefits (B)
- Taxonomy
- I Division into stages of aggregation
- Slot in any procedure per stage
- Estimate complexity per stage
- B Might get the discussion going
- Aggregate features
- I Identifying a main table
- Others get aggregated
- I Forming partitions to aggregate over
- Using queries with joins to pull in other tables
- Abstract partitioning based on predicate
- I Comparing case against reference histograms
- I Separate comparison method and reference
Interesting ideas (I), benefits (B)
- Learning
- I Decision tree tricks
- Cut DT induction off short to get class freqs
- Starve DT of features to improve generalization
Issues
- Some worrying lapses...
- Lacked standard terms for common concepts
- "position i of vector has the number of instances of ith value..." -> histogram
- "abstraction hierarchy" -> schema
- "value order" -> enumeration
- Defined (and emphasized) terms for trivial and commonly used things
- Imprecise use of terms
- variance for (something like) second moment
- I'm not confident they know what Mahalanobis distance is
- They say left outer join but show the inner join symbol
Issues
- Some worrying lapses...
- Did not connect the reference vector and variance vector to underlying statistics
- Should relate to the bag prior and field-value conditional probability, not just weighted
- Did not acknowledge loss of correlation info from splitting up joined u tuples in their features
- Assumes fields are independent
- Dependency was mentioned in the taxonomy
- The Fig. 1 schema cannot support the second example query
- Missing a necessary foreign key reference
Issues
- Some worrying lapses...
- Their formal statement of the task did not show aggregation as dependent on t
- Needed for c(t) partitioning
- Did not clearly distinguish when t did or did not contain the class
- No need to put it in there at all
- No, the higher Gaussian moments are not all zero!
- Only the odd ones are. Yeesh.
- The correct reason we don't need them is that all can be computed from the mean and variance
- Uuugly notation
Issues
- Some worrying lapses...
- Did not cite other uses of histograms or distributions extracted as features
- Spike-triggered average / covariance / etc.
- Used by all of neurobiology, neurocomputation
- E.g. de Ruyter van Steveninck & Bialek
- Response-conditional ensemble
- Used by our own Adrienne Fairhall & colleagues
- E.g. Agüera y Arcas, Fairhall, Bialek
- Event-triggered distribution
- Used by me ?
- E.g. CSE528 project
Issues
- Some worrying lapses...
- Did not cite other uses of histograms or distributions extracted as features...
- So, did not use standard tricks
- Dimension reduction
- Treat histogram as a vector
- Do PCA, keep the top few eigenmodes; the new features are the projections (see the sketch below)
- Nor special tricks
- Subtract prior covariance before PCA
- Likewise competing the classes is not new
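A sketch of the dimension-reduction trick with scikit-learn PCA: treat each case histogram as a vector, keep the top few eigenmodes, and use the projections as features (the subtract-prior-covariance refinement is not shown); the histograms are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical case histograms: one row per target tuple, one column per value f_i.
histograms = rng.poisson(lam=3.0, size=(100, 20)).astype(float)

# Keep the top few eigenmodes; the projections become the new features.
pca = PCA(n_components=3)
features = pca.fit_transform(histograms)
print(features.shape)            # (100, 3)
print(pca.explained_variance_ratio_)
```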
Issues
- Non-goof issues
- Would need bookkeeping to maintain the variance vector for online learning
- Don't have sufficient statistics
- Histograms are actual samples
- Adding new data doesn't add new samples; it changes existing ones
- Could subtract the old contribution, add the new one
- Use a triggered query
- Don't bin those nice numerical variables!
- Binning makes vectors out of scalars
- Scalar fields can be ganged into a vector across fields!
- Do (e.g.) clustering on the bag of vectors
- That's enough of that