Title: Aggregate features for relational data (Claudia Perlich, Foster Provost)
Pat Tressel 16-May-2005
Overview
- Perlich and Provost provide...
- Hierarchy of aggregation methods
- Survey of existing aggregation methods
- New aggregation methods
- Concerned w/ supervised learning only
- But much seems applicable to clustering
The issues
- Most classifiers use feature vectors
- Individual features have fixed arity
- No links to other objects
- How do we get feature vectors from relational data?
- Flatten it
- Joins
- Aggregation
- (Are feature vectors all there are?)
Joins
- Why consider them?
- Yield flat feature vectors
- Preserve all the data
- Why not use them?
- They emphasize data with many references
- OK if that's what we want
- Not ok if sampling was skewed
- Cascaded or transitive joins blow up
Joins
- They emphasize data with many references
- Lots more Joes than there were before...
Joins
- Why not use them?
- What if we don't know the references?
- Try out everything with everything else
- Cross product yields all combinations
- Adds fictitious relationships
- Combinatorial blowup
Joins
- What if we don't know the references?
Aggregates
- Why use them?
- Yield flat feature vectors
- No blowup in number of tuples
- Can group tuples in all related tables
- Can keep as detailed stats as desired
- Not just max, mean, etc.
- Parametric dists from sufficient stats
- Can apply tests for grouping
- Choice of aggregates can be model-based
- Better generalization
- Include domain knowledge in model choice
Aggregates
- Anything wrong with them?
- Data is lost
- Relational structure is lost
- Influential individuals are lumped in
- Doesn't discover critical individuals
- Dominates other data
- Any choice of aggregates assumes a model
- What if it's wrong?
- Adding new data can require recalculation
- But can avoid the issue by keeping sufficient statistics
Taxonomy of aggregates
- Why is this useful?
- Promote deliberate use of aggregates
- Point out gaps in current use of aggregates
- Find appropriate techniques for each class
- Based on complexity due to
- Relational structure
- Cardinality of the relations (1-1, 1-n, m-n)
- Feature extraction
- Computing the aggregates
- Class prediction
Taxonomy of aggregates
- Formal statement of the task
- Notation (here and on following slides)
- Caution! Simplified from what's in the paper!
- t, tuple (from target table T, with main features)
- y, class (known per t if training)
- ?, aggregation function
- F, classification function
- s, select operation (where joins preserve t)
- O, all tables; B, any other table; b, a tuple in B
- u, fields to be added to t from joined tables
- f, a field in u
- More that doesn't fit on this slide
Aggregation complexity
- Simple
- One field from one object type
Aggregation complexity
- Multi-dimensional
- Multiple fields, one object type
Aggregation complexity
- Multi-type
- Multiple object types
Relational concept complexity
- Propositional
- No aggregation
- Single tuple, 1-1 or n-1 joins
- n-1 is just a shared object
- Not relational per se; already flat
Relational concept complexity
- Independent fields
- Separate aggregation per field
- Separate 1-n joins with T
Relational concept complexity
- Dependent fields in same table
- Multi-dimensional aggregation
- Separate 1-n joins with T
Relational concept complexity
- Dependent fields over multiple tables
- Multi-type aggregation
- Separate 1-n joins, still only with T
Relational concept complexity
- Global
- Any joins or combinations of fields
- Multi-type aggregation
- Multi-way joins
- Joins among tables other than T
Current relational aggregation
- First-order logic
- Find clauses that directly predict the class
- The aggregation function is OR
- Form binary features from tests
- Logical and arithmetic tests
- These go in the feature vector
- The classification function F is any ordinary classifier
Current relational aggregation
- The usual database aggregates (see the sketch below)
- For numerical values
- mean, min, max, count, sum, etc.
- For categorical values
- Most common value
- Count per value
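A minimal sketch of these standard aggregates in pandas, assuming a hypothetical one-to-many detail table `trans` keyed by `customer_id` (table and column names are made up, not from the paper):

```python
import pandas as pd

# Hypothetical one-to-many detail table keyed by customer_id.
trans = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [10.0, 25.0, 5.0, 40.0, 40.0],
    "category":    ["food", "gas", "food", "gas", "gas"],
})

# Numerical field: mean, min, max, count, sum per target tuple.
num_aggs = trans.groupby("customer_id")["amount"].agg(
    ["mean", "min", "max", "count", "sum"])

# Categorical field: most common value, plus count per value.
mode = trans.groupby("customer_id")["category"].agg(lambda s: s.mode().iloc[0])
counts = pd.crosstab(trans["customer_id"], trans["category"])

# The aggregates become columns of the flat feature vector per target tuple.
flat = num_aggs.join(mode.rename("top_category")).join(counts)
print(flat)
```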
Current relational aggregation
- Set distance
- Two tuples, each with a set of related tuples
- Distance metric between related fields
- Euclidean for numerical data
- Edit distance for categorical
- Distance between sets is the distance of the closest pair (see the sketch below)
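A small sketch of the set-distance idea, simplified to purely numeric related tuples with Euclidean distance between pairs; the mixed version above would swap in an edit distance for categorical fields:

```python
import numpy as np

def set_distance(bag_a: np.ndarray, bag_b: np.ndarray) -> float:
    """Distance between two sets of related tuples:
    the distance of the closest pair."""
    # Pairwise Euclidean distances between every row of bag_a and bag_b.
    diffs = bag_a[:, None, :] - bag_b[None, :, :]
    pair_dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return pair_dists.min()

# Two target tuples, each with a bag of related (numeric) tuples.
bag_1 = np.array([[1.0, 2.0], [3.0, 4.0]])
bag_2 = np.array([[3.5, 4.5], [10.0, 0.0]])
print(set_distance(bag_1, bag_2))  # closest pair is (3, 4) vs (3.5, 4.5)
```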
Proposed relational aggregation
- Recall the point of this work
- Tuple t from table T is part of a feature vector
- Want to augment w/ info from other tables
- Info added to t must be consistent w/ values in t
- Need to flatten the added info to yield one vector per tuple t
- Use that to
- Train classifier given class y for t
- Predict class y for t
Proposed relational aggregation
- Outline of steps
- Do query to get more info u from other tables
- Partition the results based on
- Main features t
- Class y
- Predicates on t
- Extract distributions over results for fields in u
- Get a distribution for each partition
- For now, limit to categorical fields
- Suggest extension to numerical fields
- Derive features from distributions
Do query to get info from other tables
- Select
- Based on the target table T
- If training, known class y is included in T
- Joins must preserve distinct values from T
- Join on as much of T's key as is present in the other table
- Maybe need to constrain other fields?
- Not a problem for correctly normalized tables
- Project
- Include all of t
- Append additional fields u from joined tables
- Anything up to all fields from joins (see the query sketch below)
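A sketch of the select-and-project step using pandas merges, with a hypothetical target table `T` and one related table `B` joined on T's key; a left join is used here so every distinct t is preserved even when it has no related tuples:

```python
import pandas as pd

# Target table T: one row per case, with known class y (training).
T = pd.DataFrame({"t_id": [1, 2], "age": [30, 45], "y": [1, 0]})

# Another table B with a 1-n relationship back to T.
B = pd.DataFrame({"t_id": [1, 1, 2, 2, 2],
                  "product": ["a", "b", "a", "a", "c"]})

# Join on as much of T's key as is present in B.
joined = T.merge(B, on="t_id", how="left")

# Project: keep all of t plus the additional fields u from the join.
u_fields = ["product"]
print(joined[["t_id", "y"] + u_fields])
```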
Extract distributions
- Partition query results various ways, e.g.
- Into cases per each t
- For training, include the (known) class y in t
- Also (if training) split per each class
- Want this for class priors
- Split per some (unspecified) predicate c(t)
- For each partition
- There is a bag of associated u tuples
- Ignore the t part; it's already a flat vector
- Split vertically to get bags of individual values per each field f in u
- Note this breaks the association between fields!
Distributions for categorical fields
- Let the categorical field be f with values f_i
- Form a histogram for each partition (see the sketch below)
- Count instances of each value f_i of f in a bag
- These are sufficient statistics for
- Distribution over f_i values
- Probability of each bag in the partition
- Start with one per each tuple t and field f
- C^f_t, (per-)case vector
- Component C^f_{t,i}, count for f_i
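A minimal sketch of the per-case histograms, continuing the hypothetical example: one count vector per target tuple t over the values of a categorical field f (what the slides call the case vector C^f_t):

```python
import pandas as pd

joined = pd.DataFrame({
    "t_id":    [1, 1, 2, 2, 2],
    "y":       [1, 1, 0, 0, 0],
    "product": ["a", "b", "a", "a", "c"],
})

# Case vectors C^f_t: for each tuple t, counts of each value f_i of field f.
case_vectors = pd.crosstab(joined["t_id"], joined["product"])
print(case_vectors)
#          a  b  c
# t_id
# 1        1  1  0
# 2        2  0  1
```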
Distributions for categorical fields
- Distribution of histograms per predicate c(t) and field f
- Treat histogram counts as random variables
- Regard the c(t)-true partition as a collection of histogram samples
- Regard histograms as vectors of random variables, one per field value f_i
- Extract moments of these histogram count distributions
- mean (sort of): reference vector
- variance (sort of): variance vector
Distributions for categorical fields
- Net histogram per predicate c(t), field f
- c(t) partitions tuples t into two groups
- Only histogram the c(t) true group
- Could include not-c as a predicate if we want
- Don't re-count!
- Already have histograms for each t and f: the case vectors
- Sum the case vectors columnwise
- Call this a reference vector, R^f_c
- Proportional to the average histogram over t for c(t) true (weighted by samples per t)
Distributions for categorical fields
- Variance of case histograms per predicate c(t) and field f
- Define a variance vector, V^f_c
- Columnwise sum of squares of the case vectors / number of samples with c(t) true
- Not an actual variance
- Squared means not subtracted
- Don't care
- It's indicative of the variance...
- Throw in means-based features as well to give the classifier full variance info (see the sketch below)
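A sketch of the reference and variance vectors for the class-membership predicates, continuing the hypothetical case vectors above; R^f_c is the columnwise sum of case vectors over the c(t)-true group, and V^f_c is the columnwise mean of their squares (second moment, means not subtracted), as described on these slides:

```python
import pandas as pd

case_vectors = pd.DataFrame(
    {"a": [1, 2], "b": [1, 0], "c": [0, 1]}, index=[1, 2])
classes = pd.Series({1: 1, 2: 0})  # class y per tuple t

def reference_and_variance(cv: pd.DataFrame, mask: pd.Series):
    """Reference vector: columnwise sum of case vectors with c(t) true.
    'Variance' vector: columnwise mean of squared case vectors."""
    group = cv[mask]
    R = group.sum(axis=0)
    V = (group ** 2).sum(axis=0) / len(group)
    return R, V

R_pos, V_pos = reference_and_variance(case_vectors, classes == 1)
R_neg, V_neg = reference_and_variance(case_vectors, classes == 0)
R_all, _ = reference_and_variance(case_vectors, classes.notna())  # unconditional
print(R_pos, V_pos, sep="\n")
```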
Distributions for categorical fields
- What predicates might we use?
- Unconditionally true: c(t) = true
- Result is a net distribution independent of t
- Unconditional reference vector, R
- Per class k: c_k(t) is (t.y = k)
- Class priors
- Recall for training data, y is a field in t
- Per-class reference vector, R^f_{t.y=k}
Distributions for categorical fields
- Summary of notation
- c(t), a predicate based on values in a tuple t
- f, a categorical field from a join with T
- f_i, values of f
- R^f_c, reference vector
- histogram over f_i values in the bag for c(t) true
- C^f_t, case vector
- histogram over f_i values for t's bag
- R, unconditional reference vector
- V^f_c, variance vector
- columnwise average of the squared case vectors
- X_i, the i-th value in some reference vector X
Distributions for numerical data
- Same general idea: representative distributions per various partitions
- Can use the categorical techniques if we
- Bin the numerical values
- Treat each bin as a categorical value (see the binning sketch below)
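A sketch of the binning step, assuming equal-width bins via pandas.cut (the slides do not commit to a particular binning scheme); each bin label is then histogrammed per case exactly like a categorical value:

```python
import pandas as pd

joined = pd.DataFrame({"t_id": [1, 1, 2, 2, 2],
                       "amount": [10.0, 25.0, 5.0, 40.0, 40.0]})

# Discretize the numerical field into bins, then treat each bin
# as a categorical value and histogram it per case as before.
joined["amount_bin"] = pd.cut(joined["amount"], bins=3,
                              labels=["low", "mid", "high"])
case_vectors = pd.crosstab(joined["t_id"], joined["amount_bin"])
print(case_vectors)
```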
Feature extraction
- Base features on ref. and variance vectors
- Two kinds
- Interesting values
- one value from case reference vector per t
- same column in vector for all t
- assorted options for choosing column
- choices depend on predicate ref. vectors
- Vector distances
- distance between a case vector and a predicate reference vector
- various distance metrics
- More notation: an acronym for each feature type
Feature extraction: interesting values
- For a given c, f, select that f_i which is... (see the sketch below)
- MOC: most common overall
- argmax_i R_i
- Most common in each class
- For binary class y
- Positive is y = 1, negative is y = 0
- MOP: argmax_i R^f_{t.y=1,i}
- MON: argmax_i R^f_{t.y=0,i}
- Most distinctive per class
- Common in one class but not in the other(s)
- MOD: argmax_i (R^f_{t.y=1,i} - R^f_{t.y=0,i})
- MOM: argmax_i (R^f_{t.y=1,i} - R^f_{t.y=0,i}) / (V^f_{t.y=1,i} - V^f_{t.y=0,i})
- Normalizes for variance (sort of)
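A sketch of the interesting-value features using the formulas as reconstructed above (whether MOD/MOM use signed or absolute differences is not settled here); the reference and variance vectors are small made-up examples:

```python
import numpy as np
import pandas as pd

# Hypothetical per-class reference and variance vectors over values f_i.
R_pos = pd.Series({"a": 1, "b": 1, "c": 0})
R_neg = pd.Series({"a": 2, "b": 0, "c": 1})
V_pos = pd.Series({"a": 1.0, "b": 1.0, "c": 0.0})
V_neg = pd.Series({"a": 4.0, "b": 0.0, "c": 1.0})
R_all = R_pos + R_neg

MOC = R_all.idxmax()            # most common overall
MOP = R_pos.idxmax()            # most common among positives
MON = R_neg.idxmax()            # most common among negatives
MOD = (R_pos - R_neg).idxmax()  # most distinctive for the positive class
MOM = ((R_pos - R_neg) / (V_pos - V_neg).replace(0, np.nan)).idxmax()

# The per-case feature is the count in the chosen column,
# the same column for every case vector C^f_t.
case_vectors = pd.DataFrame({"a": [1, 2], "b": [1, 0], "c": [0, 1]}, index=[1, 2])
moc_feature = case_vectors[MOC]
print(MOC, MOP, MON, MOD, MOM)
```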
Feature extraction: vector distance
- Distance between a given reference vector and each case vector (see the sketch below)
- Distance metrics
- ED: edit; not defined
- Sum of absolute differences, a.k.a. Manhattan distance?
- Σ_i |C_i - R_i|
- EU: Euclidean
- √((C - R)^T (C - R)); omit the √ for speed
- MA: Mahalanobis
- √((C - R)^T S^-1 (C - R)); omit the √ for speed
- S should be a covariance... of what?
- CO: cosine, 1 - cos(angle between vectors)
- 1 - C^T R / √((C^T C)(R^T R))
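A sketch of the four distance features between a case vector C and a reference vector R, following the reconstructed formulas above; the Mahalanobis covariance is estimated from the case vectors and regularized here, which is only a guess at what the paper intends:

```python
import numpy as np

def distance_features(C, R, cov):
    d = C - R
    ed = np.abs(d).sum()                      # "edit" / Manhattan
    eu = d @ d                                # squared Euclidean (sqrt omitted)
    ma = d @ np.linalg.inv(cov) @ d           # squared Mahalanobis (sqrt omitted)
    co = 1.0 - (C @ R) / np.sqrt((C @ C) * (R @ R))  # cosine distance
    return {"ED": ed, "EU": eu, "MA": ma, "CO": co}

case_vectors = np.array([[1.0, 1.0, 0.0], [2.0, 0.0, 1.0]])
R = case_vectors.sum(axis=0)                  # unconditional reference vector
cov = np.cov(case_vectors, rowvar=False) + 1e-6 * np.eye(3)  # regularized
print(distance_features(case_vectors[0], R, cov))
```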
Feature extraction: vector distance
- Apply each metric w/ various ref. vectors
- Acronym is metric w/ suffix for ref. vector
- (No suffix) Unconditional ref. vector
- P: per-class positive ref. vector, R^f_{t.y=1}
- N: per-class negative ref. vector, R^f_{t.y=0}
- D: difference between the P and N distances
- Alphabet soup, e.g. EUP, MAD,...
Feature extraction
- Other features added for tests
- Not part of their aggregation proposal
- AH abstraction hierarchy (?)
- Pull into T all fields that are just shared records via n-1 references
- AC: autocorrelation aggregation
- For joins back into T, get other cases linked to each t
- Fraction of positive cases among the others
Learning
- Find linked tables
- Starting from T, do a breadth-first walk of the schema graph (see the sketch below)
- Up to some max depth
- Cap number of paths followed
- For each path, know T is linked to the last table in the path
- Extract aggregate fields
- Pull in all fields of last table in path
- Aggregate them (using new aggregates) per t
- Append aggregates to t
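A sketch of the breadth-first enumeration of join paths out of T, over a made-up schema graph given as an adjacency list, with the depth and path caps mentioned above:

```python
from collections import deque

# Hypothetical schema graph: which tables can be joined to which.
schema = {
    "T": ["Orders", "Accounts"],
    "Orders": ["T", "Products"],
    "Accounts": ["T"],
    "Products": ["Orders"],
}

def join_paths(start, max_depth=2, max_paths=10):
    """Breadth-first walk from the target table; every path links
    the target table to the path's last table."""
    paths, queue = [], deque([[start]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)          # T is linked to path[-1]
        if len(path) - 1 < max_depth:
            for nxt in schema.get(path[-1], []):
                queue.append(path + [nxt])
    return paths

for p in join_paths("T"):
    print(" -> ".join(p))
```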
Learning
- Classifier
- Pick 10 subsets each w/ 10 features
- Random choice, weighted by performance
- But there's no classifier yet... so how do features predict the class?
- Build a decision tree for each feature set (see the sketch below)
- Have class frequencies at leaves
- Features might not completely distinguish classes
- Class prediction
- Select class with higher frequency
- Class probability estimation
- Average frequencies over trees
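A sketch of this learning step with scikit-learn decision trees: several random 10-feature subsets, one shallow tree per subset, and class probabilities averaged over the trees; the performance-weighted subset selection is omitted for brevity, and the data are synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))            # hypothetical feature matrix
y = (X[:, 0] + X[:, 5] > 0).astype(int)   # hypothetical class

trees, subsets = [], []
for _ in range(10):
    cols = rng.choice(X.shape[1], size=10, replace=False)
    # Shallow tree: leaves keep class frequencies rather than pure classes.
    tree = DecisionTreeClassifier(max_depth=3).fit(X[:, cols], y)
    trees.append(tree)
    subsets.append(cols)

# Class probability estimate: average leaf frequencies over the trees;
# class prediction: the class with the higher averaged frequency.
probs = np.mean([t.predict_proba(X[:, c]) for t, c in zip(trees, subsets)], axis=0)
pred = probs.argmax(axis=1)
print("train accuracy:", (pred == y).mean())
```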
Tests
- IPO data
- 5 tables w/ small, simple schema
- Majority of fields were in the main table, i.e. T
- The only numeric fields were in the main table, so no aggregation of numeric features was needed
- Other tables had a key plus one data field
- Max path length of 2 to reach all tables, no recursion
- Predicate on one field in T used as the class
- Tested against
- First-order logic aggregation
- Extract clauses using an ILP system
- Append evaluated clauses to each t
- Various ILP systems
- Using just data in T (or T and AH features?)
Test results
- See paper for numbers
- Accuracy with aggregate features
- Up to 10% increase over using only features from T
- Depends on which and how many extra features used
- Most predictive feature was in a separate table
- Expect accuracy increase as more info available
- Shows info was not destroyed by aggregation
- Vector distance features better
- Generalization
Interesting ideas (I), benefits (B)
- Taxonomy
- I Division into stages of aggregation
- Slot in any procedure per stage
- Estimate complexity per stage
- B Might get the discussion going
- Aggregate features
- I Identifying a main table
- Others get aggregated
- I Forming partitions to aggregate over
- Using queries with joins to pull in other tables
- Abstract partitioning based on predicate
- I Comparing case against reference histograms
- I Separate comparison method and reference
Interesting ideas (I), benefits (B)
- Learning
- I Decision tree tricks
- Cut DT induction off short to get class freqs
- Starve DT of features to improve generalization
Issues
- Some worrying lapses...
- Lacked standard terms for common concepts
- "position i of vector has the number of instances of ith value..." -> histogram
- "abstraction hierarchy" -> schema
- "value order" -> enumeration
- Defined (and emphasized) terms for trivial and commonly used things
- Imprecise use of terms
- variance for (something like) second moment
- I'm not confident they know what Mahalanobis distance is
- They say left outer join but show the inner join symbol
Issues
- Some worrying lapses...
- Did not connect the reference vector and variance vector to underlying statistics
- Should relate to the bag prior and field-value conditional probability, not just weighted
- Did not acknowledge loss of correlation info from splitting up joined u tuples in their features
- Assumes fields are independent
- Dependency was mentioned in the taxonomy
- The Fig. 1 schema cannot support the second example query
- Missing a necessary foreign key reference
Issues
- Some worrying lapses...
- Their formal statement of the task did not show aggregation as dependent on t
- Needed for c(t) partitioning
- Did not clearly distinguish when t did or did not contain the class
- No need to put it in there at all
- No, the higher Gaussian moments are not all zero!
- Only the odd ones are. Yeesh.
- The correct reason we don't need them is that all can be computed from the mean and variance
- Uuugly notation
Issues
- Some worrying lapses...
- Did not cite other uses of histograms or distributions extracted as features
- Spike-triggered average / covariance / etc.
- Used by all of neurobiology, neurocomputation
- E.g. de Ruyter van Steveninck & Bialek
- Response-conditional ensemble
- Used by our own Adrienne Fairhall & colleagues
- E.g. Agüera y Arcas, Fairhall, Bialek
- Event-triggered distribution
- Used by me ?
- E.g. CSE528 project
Issues
- Some worrying lapses...
- Did not cite other uses of histograms or distributions extracted as features...
- So, did not use standard tricks
- Dimension reduction
- Treat histogram as a vector
- Do PCA, keep the top few eigenmodes; the new features are the projections (see the sketch below)
- Nor special tricks
- Subtract prior covariance before PCA
- Likewise competing the classes is not new
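A sketch of the dimension-reduction trick with scikit-learn PCA: treat each case histogram as a vector, keep the top few eigenmodes, and use the projections as features (the subtract-prior-covariance refinement is not shown); the histograms are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical case histograms: one row per target tuple, one column per value f_i.
histograms = rng.poisson(lam=3.0, size=(100, 20)).astype(float)

# Keep the top few eigenmodes; the projections become the new features.
pca = PCA(n_components=3)
features = pca.fit_transform(histograms)
print(features.shape)            # (100, 3)
print(pca.explained_variance_ratio_)
```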
Issues
- Non-goof issues
- Would need bookkeeping to maintain the variance vector for online learning
- Don't have sufficient statistics
- Histograms are actual samples
- Adding new data doesn't add new samples; it changes existing ones
- Could subtract the old contribution, add the new one
- Use a triggered query
- Don't bin those nice numerical variables!
- Binning makes vectors out of scalars
- Scalar fields can be ganged into a vector across fields!
- Do (e.g.) clustering on the bag of vectors
- That's enough of that