Title: Feature Generation and Selection in SRL
1Feature Generation and Selection in SRL
- Alexandrin Popescul Lyle H. Ungar
- Presented By Stef Schoenmackers
2Overview
- Structural Generalized Linear Regression (SGLR)
Overview - Design Motivations
- Experiments
- Conclusions
3SGLR Overview
- Adds statistical methods to ILP
- SQL as the logical language
- Generalized Linear Regression as statistical
method - Uses clustering to generate new relations
- Builds discriminative models
- Targeted at large problems where generative
models impossible - Integrates feature generation and problem modeling
4SGLR Loop
5SGLR Method
- Clusters data and adds clusters as new relations
- Searches the space of SQL query refinements
- Features are numerical SQL aggregates
- Test feature with statistical measure (e.g. AIC,
BIC) - Add only significantly predictive features
- Examine each feature only once
- Use current set of features to guide search
6Overview
- Structural Generalized Linear Regression (SGLR)
Overview - Design Motivations
- Experiments
- Conclusions
7SQL Motivation
- Most of the worlds data is in relational
databases - Can exploit schema and meta-information
- SQL uses a fairly expressive language
- Non-recursive first-order logic formulas
- Relational DBs have been studied and optimized
for decades, so should be more scalable than
other alternatives
8Clustering Motivation
- Dimensionality reduction
- Clusters are added as relations (new first-class
concepts) - Increases expressivity of the language describing
patterns in the data - Can lead to a more rapid discovery of predictive
features - Done as a pre-processing step
- cost(clustering) ltlt cost(feature search)
9Aggregation Motivation
- Summarizes the information in a table into scalar
values usable by a statistical model - average, max, min, count, average, empty/exists
(0/1) - Exploits database work into making them efficient
- Provides a richer space of features to choose
from
10Dynamic Feature Generation
- Most features do not provide useful information
- In large domains, feature generation is
expensive, and precomputing all possible features
is far too time consuming - Solution Use a smarter search strategy and
dynamically generate features. Let the features
already selected influence which features are
added - Focuses only on the promising areas in the search
space
11Feature Streams
- Put features into different evaluation queues
- Choose next feature from the best stream
- If feature in multiple streams, only evaluate
once - Stream design can use prior knowledge/bias
12Refinement Graphs (in ILP)
- Start with most general rule, and refines it to
produce more specific clauses - Single variable substitution
- Add predicate involving 1 existing variables
- Uses top-down breadth-first search to find the
most general rule that covers only positive
examples - Performs poorly in noisy domains
13Refinement Graphs (in SGLR)
- Adds one relation to a query and expands it into
all possible configurations of equality
conditions of new attributes with a new or old
attribute - Contains at least one equality condition between
a new and old attribute - Any attribute can be set to a constant
- High-level variable typing/classes are enforced
- Not all refinements are most general, but
simplifies pruning of equivalent subspaces
(accounts only for the type and number of
relations joined in a query)
14Example Refinement Graph
Query(d)
Cites(d,d1)
Author_of(d, a)
Word_count(d, w, int)
Author_of(d, aSmith)
Cites(d,d1),Cites(d1,d2)
DB Tables
Cites(d,d1), Author_of(d1, a)
Cites(d,d1), Author_of(d1, aDomingos)
15Overview
- Structural Generalized Linear Regression (SGLR)
Overview - Design Motivations
- Experiments
- Conclusions
16Experiments
- Used CiteSeer data
- Citation(doc1, doc2), Author(doc, person),
PublishedIn(doc, venue), HasWord(doc,word) - 60k Docs, 131k Authors, 173k Citations, 6.8M
Words - Two Tasks
- Predict the publication venue
- Predict existence of a citation
17Experiments
- Cluster all many-to-many relations
- K-means
- Added 6 new relations
- Use logistic regression for prediction
- BFS of search space
- 5k/5k- examples for venue prediction
- 2.5k/2.5k- examples for citation prediction
18Results
Venue (87.2)
Citation (93.1)
19Dynamic Feature Generation
- Query expressions generated Breadth-First
- Baseline puts all queries into one queue
- Dynamic strategy enqueues queries into separate
streams - Stream 1 exists and count over table
- Stream 2 other aggregates (counts of unique
elements in individual columns) - Chooses next feature from stream where
(featuresAdded1)/(featuresTried1) is max - Stop when a stream is empty
20Results
Venue
No Clusters
Clusters
Citation
21Time Results
Venue
No Clusters
Clusters
Citation
22Domain Independent Learning
- Most citation prediction features are
research-area generic - Can we train a model for one area and test on
another?
23Domain Independent Results
- Used KDD-Cup 2003 data (High Energy Physics
papers in arXiv)
Train On Test On Accuracy
CiteSeer arXiv 92.9
CiteSeer CiteSeer 92.6
arXiv arXiv 96.0
24Conclusions
- Cluster-based features add expressivity, and
apply to any domain or SRL method - Generating queries dynamically can reduce search
time and increase accuracy
25Questions?