1
Aggregate features for relational data
Claudia Perlich, Foster Provost
Pat Tressel 16-May-2005
2
Overview
  • Perlich and Provost provide...
  • Hierarchy of aggregation methods
  • Survey of existing aggregation methods
  • New aggregation methods
  • Concerned w/ supervised learning only
  • But much seems applicable to clustering

3
The issues
  • Most classifiers use feature vectors
  • Individual features have fixed arity
  • No links to other objects
  • How do we get feature vectors from relational
    data?
  • Flatten it
  • Joins
  • Aggregation
  • (Are feature vectors all there are?)

4
Joins
  • Why consider them?
  • Yield flat feature vectors
  • Preserve all the data
  • Why not use them?
  • They emphasize data with many references
  • OK if that's what we want
  • Not ok if sampling was skewed
  • Cascaded or transitive joins blow up

5
Joins
  • They emphasize data with many references
  • Lots more Joes than there were before...
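
A minimal sketch of this duplication effect (not from the slides; pandas, with made-up people/visits tables):

    import pandas as pd

    # Hypothetical target table T ("people") and a related 1:n table ("visits").
    people = pd.DataFrame({"pid": [1, 2], "name": ["Joe", "Ann"]})
    visits = pd.DataFrame({"pid": [1, 1, 1, 2], "store": ["A", "B", "C", "A"]})

    joined = people.merge(visits, on="pid")   # join on the foreign key
    print(joined)                             # Joe appears 3 times, Ann once
    print(joined["name"].value_counts())      # statistics over the join are weighted toward Joe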

6
Joins
  • Why not use them?
  • What if we don't know the references?
  • Try out everything with everything else
  • Cross product yields all combinations
  • Adds fictitious relationships
  • Combinatorial blowup

7
Joins
  • What if we don't know the references?

8
Aggregates
  • Why use them?
  • Yield flat feature vectors
  • No blowup in number of tuples
  • Can group tuples in all related tables
  • Can keep as detailed stats as desired
  • Not just max, mean, etc.
  • Parametric dists from sufficient stats
  • Can apply tests for grouping
  • Choice of aggregates can be model-based
  • Better generalization
  • Include domain knowledge in model choice
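
A hedged sketch of aggregation-based flattening, again with hypothetical tables and pandas: the 1:n side is collapsed to one row per target key, so each target tuple keeps exactly one feature vector.

    import pandas as pd

    people = pd.DataFrame({"pid": [1, 2], "name": ["Joe", "Ann"]})
    visits = pd.DataFrame({"pid": [1, 1, 1, 2], "amount": [10.0, 25.0, 5.0, 40.0]})

    # Collapse the 1:n side to one row per pid, then attach to the target table.
    per_person = (visits.groupby("pid")["amount"]
                  .agg(["count", "mean", "min", "max", "sum"])
                  .reset_index())
    flat = people.merge(per_person, on="pid", how="left")
    print(flat)   # one feature vector per target tuple, no row blowup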

9
Aggregates
  • Anything wrong with them?
  • Data is lost
  • Relational structure is lost
  • Influential individuals are lumped in
  • Doesn't discover critical individuals
  • Dominates other data
  • Any choice of aggregates assumes a model
  • What if it's wrong?
  • Adding new data can require recalculation
  • But can avoid issue by keeping sufficient
    statistics

10
Taxonomy of aggregates
  • Why is this useful?
  • Promote deliberate use of aggregates
  • Point out gaps in current use of aggregates
  • Find appropriate techniques for each class
  • Based on complexity due to
  • Relational structure
  • Cardinality of the relations (1:1, 1:n, m:n)
  • Feature extraction
  • Computing the aggregates
  • Class prediction

11
Taxonomy of aggregates
  • Formal statement of the task
  • Notation (here and on following slides)
  • t, tuple (from target table T, with main
    features)
  • y, class (known per t if training)
  • ?, aggregation function
  • F, classification function
  • s, select operation (where joins preserve t)
  • O, all tables; B, any other table; b in B
  • u, fields to be added to t from other tables
  • f, a field in u
  • More that doesn't fit on this slide

12
Taxonomy of aggregates
  • Formal statement of the task
  • Notation (here and on following slides)
  • Caution! Simplified from what's in the paper!
  • t, tuple (from target table T, with main
    features)
  • y, class (known per t if training)
  • ?, aggregation function
  • F, classification function
  • s, select operation (where joins preserve t)
  • O, all tables; B, any other table; b, a tuple in B
  • u, fields to be added to t from joined tables
  • f, a field in u
  • More that doesn't fit on this slide

13
Aggregation complexity
  • Simple
  • One field from one object type

14
Aggregation complexity
  • Multi-dimensional
  • Multiple fields, one object type

15
Aggregation complexity
  • Multi-type
  • Multiple object types

16
Relational concept complexity
  • Propositional
  • No aggregation
  • Single tuple, 1-1 or n-1 joins
  • n-1 is just a shared object
  • Not relational per se; already flat

17
Relational concept complexity
  • Independent fields
  • Separate aggregation per field
  • Separate 1-n joins with T

18
Relational concept complexity
  • Dependent fields in same table
  • Multi-dimensional aggregation
  • Separate 1-n joins with T

19
Relational concept complexity
  • Dependent fields over multiple tables
  • Multi-type aggregation
  • Separate 1-n joins, still only with T

20
Relational concept complexity
  • Global
  • Any joins or combinations of fields
  • Multi-type aggregation
  • Multi-way joins
  • Joins among tables other than T

21
Current relational aggregation
  • First-order logic
  • Find clauses that directly predict the class
  • The aggregation function is OR
  • Form binary features from tests
  • Logical and arithmetic tests
  • These go in the feature vector
  • F (the classification function) is any ordinary classifier

22
Current relational aggregation
  • The usual database aggregates
  • For numerical values
  • mean, min, max, count, sum, etc.
  • For categorical values
  • Most common value
  • Count per value
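
A small illustration of these standard categorical aggregates, using pandas on an invented visits table (the most common value, and a per-value count table):

    import pandas as pd

    visits = pd.DataFrame({"pid": [1, 1, 1, 2, 2],
                           "store": ["A", "B", "A", "C", "C"]})

    # Most common value of the categorical field per target tuple.
    mode_per_pid = visits.groupby("pid")["store"].agg(lambda s: s.mode().iloc[0])

    # Count per value: one column per category, one row per target tuple.
    counts_per_pid = pd.crosstab(visits["pid"], visits["store"])
    print(mode_per_pid)
    print(counts_per_pid)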

23
Current relational aggregation
  • Set distance
  • Two tuples, each with a set of related tuples
  • Distance metric between related fields
  • Euclidean for numerical data
  • Edit distance for categorical
  • Distance between sets is distance of closest pair
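
A minimal sketch of this set distance, assuming numeric fields and plain Euclidean distance between tuples; the function name and data are mine:

    import numpy as np

    def set_distance(bag_a, bag_b):
        # Distance between two bags of related tuples: Euclidean distance of the
        # closest pair, one tuple drawn from each bag (arrays of shape (n, d)).
        diffs = bag_a[:, None, :] - bag_b[None, :, :]      # all pairwise differences
        return np.sqrt((diffs ** 2).sum(axis=-1)).min()

    bag_a = np.array([[0.0, 0.0], [5.0, 5.0]])
    bag_b = np.array([[1.0, 0.0], [9.0, 9.0]])
    print(set_distance(bag_a, bag_b))   # 1.0 -- closest pair is (0, 0) and (1, 0)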

24
Proposed relational aggregation
  • Recall the point of this work
  • Tuple t from table T is part of a feature vector
  • Want to augment w/ info from other tables
  • Info added to t must be consistent w/ values in t
  • Need to flatten the added info to yield one
    vector per tuple t
  • Use that to
  • Train classifier given class y for t
  • Predict class y for t

25
Proposed relational aggregation
  • Outline of steps
  • Do query to get more info u from other tables
  • Partition the results based on
  • Main features t
  • Class y
  • Predicates on t
  • Extract distributions over results for fields in
    u
  • Get distribution for each partition
  • For now, limit to categorical fields
  • Suggest extension to numerical fields
  • Derive features from distributions

26
Do query to get info from other tables
  • Select
  • Based on the target table T
  • If training, known class y is included in T
  • Joins must preserve distinct values from T
  • Join on as much of T's key as is present in the
    other table
  • Maybe need to constrain other fields?
  • Not a problem for correctly normalized tables
  • Project
  • Include all of t
  • Append additional fields u from joined tables
  • Anything up to all fields from joins

27
Extract distributions
  • Partition query results various ways, e.g.
  • Into cases per each t
  • For training, include the (known) class y in t
  • Also (if training) split per each class
  • Want this for class priors
  • Split per some (unspecified) predicate c(t)
  • For each partition
  • There is a bag of associated u tuples
  • Ignore the t part; it's already a flat vector
  • Split vertically to get bags of individual values
    per each field f in u
  • Note this breaks association between fields!

28
Distributions for categorical fields
  • Let the categorical field be f, with values f_i
  • Form a histogram for each partition
  • Count instances of each value f_i of f in a bag
  • These are sufficient statistics for
  • Distribution over f_i values
  • Probability of each bag in the partition
  • Start with one per each tuple t and field f
  • C^f_t, the (per-)case vector
  • Component C^f_{t,i}, the count for f_i
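
A small sketch of building the case vectors C^f_t with pandas, assuming the joined result carries a key back to the target tuple (column names are mine):

    import pandas as pd

    # Joined query result: one row per related tuple, keyed back to the target tuple.
    joined = pd.DataFrame({"tid": [1, 1, 1, 2, 2],
                           "f":   ["a", "b", "a", "c", "a"]})

    # Case vectors C^f_t: one histogram row per target tuple, one column per value f_i.
    case_vectors = pd.crosstab(joined["tid"], joined["f"])
    print(case_vectors)
    #          a  b  c
    # tid 1    2  1  0
    # tid 2    1  0  1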

29
Distributions for categorical fields
  • Distribution of histograms per predicate c(t) and
    field f
  • Treat histogram counts as random variables
  • Regard c(t) true partition as a collection of
    histogram samples
  • Regard histograms as vectors of random variables,
    one per field value fi
  • Extract moments of these histogram count
    distributions
  • mean (sort of): the reference vector
  • variance (sort of): the variance vector

30
Distributions for categorical fields
  • Net histogram per predicate c(t), field f
  • c(t) partitions tuples t into two groups
  • Only histogram the c(t) true group
  • Could include c as a predicate if we want
  • Don't re-count!
  • Already have histograms for each t and f: the case
    reference vectors
  • Sum the case reference vectors columnwise
  • Call this a reference vector, R^f_c
  • Proportional to the average histogram over t for c(t)
    true (weighted by samples per t)
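
A minimal sketch of the reference vector, assuming the case vectors above and a class label per target tuple; the predicate here is c(t): t.y = 1.

    import pandas as pd

    # Case vectors from the previous step (values hypothetical), plus a label per tuple.
    case_vectors = pd.DataFrame({"a": [2, 1, 0], "b": [1, 0, 3], "c": [0, 1, 1]},
                                index=pd.Index([1, 2, 3], name="tid"))
    labels = pd.Series({1: 1, 2: 0, 3: 1}, name="y")

    # Reference vector R^f_c for the predicate c(t): t.y == 1 --
    # the columnwise sum of the case vectors of the tuples satisfying c(t).
    R_pos = case_vectors[labels == 1].sum(axis=0)
    print(R_pos)   # a: 2, b: 4, c: 1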

31
Distributions for categorical fields
  • Variance of the case histograms per predicate c(t)
    and field f
  • Define the variance vector, V^f_c
  • Columnwise sum of squares of the case reference
    vectors, divided by the number of samples with c(t) true
  • Not an actual variance
  • Squared means are not subtracted
  • Don't care
  • It's indicative of the variance...
  • Throw in means-based features as well to give
    classifier full variance info
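
A matching sketch of the variance vector as defined on this slide (sum of squared case vectors over the count; no squared mean subtracted), with the same invented data:

    import pandas as pd

    case_vectors = pd.DataFrame({"a": [2, 1, 0], "b": [1, 0, 3], "c": [0, 1, 1]},
                                index=pd.Index([1, 2, 3], name="tid"))
    labels = pd.Series({1: 1, 2: 0, 3: 1}, name="y")

    mask = labels == 1
    # "Variance" vector V^f_c per the slide: columnwise sum of squared case
    # vectors divided by the number of tuples with c(t) true.
    V_pos = (case_vectors[mask] ** 2).sum(axis=0) / mask.sum()
    print(V_pos)   # a: 2.0, b: 5.0, c: 0.5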

32
Distributions for categorical fields
  • What predicates might we use?
  • Unconditionally true, c(t) true
  • Result is net distribution independent of t
  • Unconditional reference vector, R
  • Per class k, c_k(t): (t.y = k)
  • Class priors
  • Recall for training data, y is a field in t
  • Per-class reference vector, R^f_{t.y=k}

33
Distributions for categorical fields
  • Summary of notation
  • c(t), a predicate based on values in a tuple t
  • f, a categorical field from a join with T
  • f_i, the values of f
  • R^f_c, reference vector
  • histogram over f_i values in the bag for c(t) true
  • C^f_t, case vector
  • histogram over f_i values for t's bag
  • R, unconditional reference vector
  • V^f_c, variance vector
  • columnwise average of the squared case vectors
  • X_i, the i-th value in some ref. vector X

34
Distributions for numerical data
  • Same general idea: representative distributions per
    the various partitions
  • Can use categorical techniques if we
  • Bin the numerical values
  • Treat each bin as a categorical value
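
A small sketch of the suggested extension, using pandas: bin the numeric field (bin edges are arbitrary here) and then histogram the bins exactly as for a categorical field.

    import pandas as pd

    joined = pd.DataFrame({"tid": [1, 1, 2, 2, 2],
                           "amount": [3.0, 12.5, 48.0, 7.0, 30.0]})

    # Discretize the numeric field into labelled bins, then reuse the
    # categorical histogram machinery.
    joined["amount_bin"] = pd.cut(joined["amount"], bins=[0, 10, 25, 50],
                                  labels=["low", "mid", "high"])
    case_vectors = pd.crosstab(joined["tid"], joined["amount_bin"])
    print(case_vectors)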

35
Feature extraction
  • Base features on ref. and variance vectors
  • Two kinds
  • Interesting values
  • one value from case reference vector per t
  • same column in vector for all t
  • assorted options for choosing column
  • choices depend on predicate ref. vectors
  • Vector distances
  • distance between case ref. vector and predicate
    ref. vector
  • various distance metrics
  • More notation: an acronym for each feature type

36
Feature extraction interesting values
  • For a given c and f, select the f_i which is...
  • MOC: most common overall
  • argmax_i R_i
  • Most common in each class
  • For a binary class y
  • Positive is y = 1, negative is y = 0
  • MOP = argmax_i R^f_{t.y=1,i}
  • MON = argmax_i R^f_{t.y=0,i}
  • Most distinctive per class
  • Common in one class but not in the other(s)
  • MOD = argmax_i (R^f_{t.y=1,i} - R^f_{t.y=0,i})
  • MOM = argmax_i (R^f_{t.y=1,i} - R^f_{t.y=0,i}) /
    (V^f_{t.y=1,i} - V^f_{t.y=0,i})
  • Normalizes for variance (sort of)
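
A hedged numpy sketch of these selections, with invented reference and variance vectors; the small epsilon guarding the MOM denominator is my addition. The per-case feature is then the count in the selected column of each case vector C^f_t.

    import numpy as np

    # Toy reference and "variance" vectors over the values f_i (numbers invented).
    R_all = np.array([5.0, 9.0, 2.0])   # unconditional reference vector R
    R_pos = np.array([4.0, 2.0, 1.0])   # R^f_{t.y=1}
    R_neg = np.array([1.0, 7.0, 1.0])   # R^f_{t.y=0}
    V_pos = np.array([3.0, 1.5, 0.5])   # V^f_{t.y=1}
    V_neg = np.array([0.5, 4.0, 0.5])   # V^f_{t.y=0}

    MOC = np.argmax(R_all)                  # most common overall
    MOP = np.argmax(R_pos)                  # most common in the positive class
    MON = np.argmax(R_neg)                  # most common in the negative class
    MOD = np.argmax(R_pos - R_neg)          # most distinctive for the positive class
    eps = 1e-9                              # my guard against a zero denominator
    MOM = np.argmax((R_pos - R_neg) / (V_pos - V_neg + eps))   # variance-normalized, per the slide
    print(MOC, MOP, MON, MOD, MOM)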

37
Feature extraction vector distance
  • Distance between a given ref. vector and each case
    vector
  • Distance metrics
  • ED: "Edit" (not defined)
  • Sum of abs. diffs, a.k.a. Manhattan distance?
  • Σ_i |C_i - R_i|
  • EU: Euclidean
  • √((C - R)^T (C - R)); omit the √ for speed
  • MA: Mahalanobis
  • √((C - R)^T S^-1 (C - R)); omit the √ for speed
  • S should be a covariance... of what?
  • CO: Cosine, 1 - cos(angle between vectors)
  • 1 - C^T R / √((C^T C)(R^T R))
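
A numpy sketch of the four metrics between one case vector and one reference vector; the covariance used for Mahalanobis is estimated from a few case vectors and regularized, which is my choice since the slide leaves it open.

    import numpy as np

    C = np.array([2.0, 1.0, 0.0])    # a case vector
    R = np.array([5.0, 9.0, 2.0])    # a reference vector

    manhattan = np.abs(C - R).sum()                      # "ED": sum of absolute differences
    euclidean = np.sqrt((C - R) @ (C - R))               # EU (omit the sqrt for speed)

    # MA: Mahalanobis needs a covariance; here it is estimated from a few case
    # vectors and regularized -- my choice, the slide does not say what to use.
    cases = np.array([[2.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 3.0, 1.0]])
    S = np.cov(cases, rowvar=False) + 1e-6 * np.eye(3)
    mahalanobis = np.sqrt((C - R) @ np.linalg.inv(S) @ (C - R))

    cosine = 1.0 - (C @ R) / np.sqrt((C @ C) * (R @ R))  # CO: 1 - cos(angle)
    print(manhattan, euclidean, mahalanobis, cosine)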

38
Feature extraction vector distance
  • Apply each metric w/ various ref. vectors
  • Acronym is metric w/ suffix for ref. vector
  • (No suffix) Unconditional ref. vector
  • P: per-class positive ref. vector, R^f_{t.y=1}
  • N: per-class negative ref. vector, R^f_{t.y=0}
  • D: difference between the P and N distances
  • Alphabet soup, e.g. EUP, MAD,...

39
Feature extraction
  • Other features added for tests
  • Not part of their aggregation proposal
  • AH: abstraction hierarchy (?)
  • Pull into T all fields that are just shared
    records via n:1 references
  • AC: autocorrelation aggregation
  • For joins back into T, get other cases linked
    to each t
  • Fraction of positive cases among others

40
Learning
  • Find linked tables
  • Starting from T, do breadth-first walk of schema
    graph
  • Up to some max depth
  • Cap number of paths followed
  • For each path, know T is linked to last table in
    path
  • Extract aggregate fields
  • Pull in all fields of last table in path
  • Aggregate them (using new aggregates) per t
  • Append aggregates to t
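
A hedged sketch of the breadth-first schema walk, with the schema represented as a hypothetical adjacency dict of foreign-key links:

    from collections import deque

    # Hypothetical schema graph: table -> tables reachable via a foreign-key link.
    schema = {"T": ["Orders", "Firms"], "Orders": ["Items"],
              "Firms": ["Sectors"], "Items": [], "Sectors": []}

    def linked_paths(schema, start="T", max_depth=2, max_paths=20):
        # Breadth-first enumeration of join paths out of the target table,
        # capped in depth and in total number of paths.
        paths, queue = [], deque([[start]])
        while queue and len(paths) < max_paths:
            path = queue.popleft()
            if len(path) > 1:
                paths.append(path)        # T is linked to the last table in the path
            if len(path) - 1 < max_depth:
                for nxt in schema[path[-1]]:
                    queue.append(path + [nxt])
        return paths

    for p in linked_paths(schema):
        print(" -> ".join(p))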

41
Learning
  • Classifier
  • Pick 10 subsets each w/ 10 features
  • Random choice, weighted by performance
  • But there's no classifier yet... so how do the
    features predict the class?
  • Build a decision tree for each feature set
  • Have class frequencies at leaves
  • Features might not completely distinguish classes
  • Class prediction
  • Select class with higher frequency
  • Class probability estimation
  • Average frequencies over trees
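
A rough sketch of this ensemble with scikit-learn, on synthetic data; the feature subsets are drawn uniformly here, whereas the paper weights the draw by feature performance.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 40))             # 40 candidate aggregate features (synthetic)
    y = (X[:, 3] + X[:, 17] > 0).astype(int)   # synthetic class

    trees, subsets = [], []
    for _ in range(10):                        # 10 subsets of 10 features each
        cols = rng.choice(X.shape[1], size=10, replace=False)  # uniform; the paper weights by performance
        trees.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:, cols], y))
        subsets.append(cols)

    # Class-probability estimate: average the leaf class frequencies over the trees;
    # class prediction: take the class with the higher average frequency.
    proba = np.mean([t.predict_proba(X[:, c]) for t, c in zip(trees, subsets)], axis=0)
    pred = proba.argmax(axis=1)
    print(proba[:3], pred[:3])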

42
Tests
  • IPO data
  • 5 tables
  • Most fields in the main table, used as T
  • Other tables had a key plus one data field
  • Predicate on one field in T used as the class
  • Tested against
  • First-order logic aggregation
  • Extract clauses using an ILP system
  • Append evaluated clauses to each t
  • Various ILP systems
  • Using just data in T
  • (Or T and AH features?)

43
Tests
  • IPO data
  • 5 tables w/ small, simple schema
  • Majority of fields were in the main table, i.e.
    T
  • The only numeric fields were in main table, so no
    aggregation of numeric features needed
  • Other tables had a key plus one data field
  • Max path length 2 to reach all tables, no
    recursion
  • Predicate on one field in T used as the class
  • Tested against
  • First-order logic aggregation
  • Extract clauses using an ILP system
  • Append evaluated clauses to each t
  • Various ILP systems
  • Using just data in T (or T and AH features?)

44
Test results
  • See paper for numbers
  • Accuracy with aggregate features
  • Up to a 10% increase over using only features from T
  • Depends on which and how many extra features used
  • Most predictive feature was in a separate table
  • Expect accuracy increase as more info available
  • Shows info was not destroyed by aggregation
  • Vector distance features better
  • Generalization

45
Interesting ideas (I) benefits (B)
  • Taxonomy
  • I: Division into stages of aggregation
  • Slot in any procedure per stage
  • Estimate complexity per stage
  • B: Might get the discussion going
  • Aggregate features
  • I: Identifying a main table
  • Others get aggregated
  • I: Forming partitions to aggregate over
  • Using queries with joins to pull in other tables
  • Abstract partitioning based on a predicate
  • I: Comparing case against reference histograms
  • I: Separate comparison method and reference
46
Interesting ideas (I) benefits (B)
  • Learning
  • I: Decision-tree tricks
  • Cut DT induction off short to get class freqs
  • Starve DT of features to improve generalization

47
Issues
  • Some worrying lapses...
  • Lacked standard terms for common concepts
  • "position i of vector has the number of
    instances of i-th value..." → histogram
  • "abstraction hierarchy" → schema
  • "value order" → enumeration
  • Defined (and emphasized) terms for trivial and
    commonly used things
  • Imprecise use of terms
  • "variance" for (something like) the second moment
  • I'm not confident they know what Mahalanobis
    distance is
  • They say "left outer join" but show the inner-join
    symbol

48
Issues
  • Some worrying lapses...
  • Did not connect reference vector and variance
    vector to underlying statistics
  • Should relate to the bag prior and the field-value
    conditional probability, not just weighted counts
  • Did not acknowledge loss of correlation info from
    splitting up joined u tuples in their features
  • Assumes fields are independent
  • Dependency was mentioned in the taxonomy
  • The Fig. 1 schema cannot support the Sec. 2 example
    query
  • It is missing a necessary foreign-key reference

49
Issues
  • Some worrying lapses...
  • Their formal statement of the task did not show
    aggregation as dependent on t
  • Needed for c(t) partitioning
  • Did not clearly distinguish when t did or did not
    contain class
  • No need to put it in there at all
  • No, the higher Gaussian moments are not all zero!
  • Only the odd ones are. Yeesh.
  • The correct reason we don't need them is that all of
    them can be computed from the mean and variance
  • Uuugly notation

50
Issues
  • Some worrying lapses...
  • Did not cite other uses of histograms or
    distributions extracted as features
  • Spike-triggered average / covariance / etc.
  • Used by all of neurobiology and neurocomputation
  • E.g., de Ruyter van Steveninck & Bialek
  • Response-conditional ensemble
  • Used by our own Adrienne Fairhall & colleagues
  • E.g., Agüera y Arcas, Fairhall, & Bialek
  • Event-triggered distribution
  • Used by me
  • E.g. CSE528 project

51
Issues
  • Some worrying lapses...
  • Did not cite other uses of histograms or
    distributions extracted as features...
  • So, did not use standard tricks
  • Dimension reduction
  • Treat histogram as a vector
  • Do PCA, keep the top few eigenmodes; the new features
    are the projections (see the sketch after this list)
  • Nor special tricks
  • Subtract prior covariance before PCA
  • Likewise, contrasting the classes against each other is
    not new
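
A minimal sketch of the dimension-reduction trick mentioned above: treat each case histogram as a vector, run PCA, and keep the projections onto the top eigenmodes (synthetic histograms, scikit-learn).

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    histograms = rng.poisson(lam=3.0, size=(100, 20)).astype(float)  # 100 case histograms, 20 bins

    pca = PCA(n_components=3)                  # keep the top few eigenmodes
    features = pca.fit_transform(histograms)   # the projections become the new features
    print(features.shape)                      # (100, 3)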

52
Issues
  • Non-goof issues
  • Would need bookkeeping to maintain variance
    vector for online learning
  • Don't have sufficient statistics
  • Histograms are actual samples
  • Adding new data doesn't add new samples; it
    changes existing ones
  • Could subtract old contribution, add new one
  • Use a triggered query
  • Don't bin those nice numerical variables!
  • Binning makes vectors out of scalars
  • Scalar fields can be ganged into a vector across
    fields!
  • Do (e.g.) clustering on the bag of vectors
  • That's enough of that