Title: Raghu Ramakrishnan
1. Mirrors and Crystal Balls: A Personal Perspective on Data Mining
2. Outline
- This award recognizes the work of many people, and I represent the many
- A warp-speed tour of some earlier work
- What's a data mining talk without predictions?
- Some exciting directions for data mining that we're working on at Yahoo!
3. A Look in the Mirror
(and the faces I found there; unfortunately, I couldn't find photos for some people)
(and apologies in advance for not discussing the related work that provided context and, often, tools and motivation)
4. 1987 – 2007
5. Sequences, Streams
- SEQ
- Sequence Data Processing. P. Seshadri, M. Livny, and R. Ramakrishnan. SIGMOD 1994
- SEQ: A Model for Sequence Databases. P. Seshadri, M. Livny, and R. Ramakrishnan. ICDE 1995
- The Design and Implementation of a Sequence Database System. P. Seshadri, M. Livny, and R. Ramakrishnan. VLDB 1996
- SRQL
- SRQL: Sorted Relational Query Language. R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SSDBM 1998
6. Scalable Clustering
- BIRCH
- BIRCH: A Clustering Algorithm for Large Multidimensional Datasets. T. Zhang, R. Ramakrishnan, and M. Livny. SIGMOD 1996
- Fast Density Estimation Using CF-Kernels. T. Zhang, R. Ramakrishnan, and M. Livny. KDD 1999
- Clustering Large Databases in Arbitrary Metric Spaces. V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French. ICDE 1999
- Clustering Categorical Data
- CACTUS: A Scalable Clustering Algorithm for Categorical Data. V. Ganti, J. Gehrke, and R. Ramakrishnan. KDD 1999
7. Scalable Decision Trees
- RainForest
- RainForest: A Framework for Fast Decision Tree Construction of Large Datasets. J. Gehrke, R. Ramakrishnan, and V. Ganti. VLDB 1998
- BOAT
- BOAT: Optimistic Decision Tree Construction. J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. SIGMOD 1999
8. Streaming and Evolving Data, Incremental Mining
- FOCUS
- FOCUS: A Framework for Measuring Changes in Data Characteristics. V. Ganti, J. Gehrke, R. Ramakrishnan, and W.-Y. Loh. PODS 1999
- DEMON
- DEMON: Mining and Monitoring Evolving Data. V. Ganti, J. Gehrke, and R. Ramakrishnan. ICDE 1999
9. Mass Collaboration
- The QUIQ Engine: A Hybrid IR-DB System. N. Kabra, R. Ramakrishnan, and V. Ercegovac. ICDE 2003
- Mass Collaboration: A Case Study. R. Ramakrishnan, A. Baptist, V. Ercegovac, M. Hanselman, N. Kabra, A. Marathe, and U. Shaft. IDEAS 2004
10. OLAP, Hierarchies, and Exploratory Mining
- Prediction Cubes. B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. VLDB 2005
- Bellwether Analysis: Predicting Global Aggregates from Local Regions. B.-C. Chen, R. Ramakrishnan, J.W. Shavlik, and P. Tamma. VLDB 2006
11. Hierarchies Redux
- OLAP Over Uncertain and Imprecise Data. D. Burdick, P. Deshpande, T.S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. VLDB 2005
- Efficient Allocation Algorithms for OLAP Over Imprecise Data. D. Burdick, P.M. Deshpande, T.S. Jayram, R. Ramakrishnan, and S. Vaithyanathan.
- Learning from Aggregate Views. B.-C. Chen, L. Chen, D. Musicant, and R. Ramakrishnan. ICDE 2006
- Mondrian Multidimensional K-Anonymity. K. LeFevre, D.J. DeWitt, and R. Ramakrishnan. ICDE 2006
- Workload-Aware Anonymization. K. LeFevre, D.J. DeWitt, and R. Ramakrishnan. KDD 2006
- Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge. B.-C. Chen, R. Ramakrishnan, and K. LeFevre. VLDB 2007
- Composite Subset Measures. L. Chen, R. Ramakrishnan, P. Barford, B.-C. Chen, and V. Yegneswaran. VLDB 2006
12. Many Other Connections
- Scalable Inference
- Optimizing MPF Queries: Decision Support and Probabilistic Inference. H. Corrada Bravo and R. Ramakrishnan. SIGMOD 2007
- Relational Learning
- View Learning for Statistical Relational Learning, with an Application to Mammography. J. Davis, E.S. Burnside, I. Dutra, D. Page, R. Ramakrishnan, V. Santos Costa, and J.W. Shavlik.
13. Community Information Management
- Efficient Information Extraction over Evolving Text Data. F. Chen, A. Doan, J. Yang, and R. Ramakrishnan. ICDE 2008
- Toward Best-Effort Information Extraction. W. Shen, P. DeRose, R. McCann, A. Doan, and R. Ramakrishnan. SIGMOD 2008
- Declarative Information Extraction Using Datalog with Embedded Extraction Predicates. W. Shen, A. Doan, J.F. Naughton, and R. Ramakrishnan. VLDB 2007
- Source-aware Entity Matching: A Compositional Approach. W. Shen, P. DeRose, L. Vu, A. Doan, and R. Ramakrishnan. ICDE 2007
14. Through the Looking Glass
"Prediction is very hard, especially about the future." (Yogi Berra)
15. Information Extraction and the challenge of managing it
17. DBLife
- Integrated information about a (focused) real-world community
- Collaboratively built and maintained by the community
- CIMple software package
18. Search Results of the Future
yelp.com
Gawker
babycenter
New York Times
epicurious
LinkedIn
answers.com
webmd
(Slide courtesy Andrew Tomkins)
19. Opening Up Yahoo! Search
Phase 2: Giving site owners and developers control over the appearance of Yahoo! Search results.
BOSS takes Yahoo!'s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.
(Slide courtesy Prabhakar Raghavan)
20. Custom Search Experiences
Social Search
Vertical Search
Visual Search
(Slide courtesy Prabhakar Raghavan)
21. Economics of IE
- Data is cheap, supervision is expensive
- The cost of supervision, especially large, high-quality training sets, is high
- By comparison, the cost of data is low
- Therefore:
- Rapid training set construction / active learning techniques (see the sketch below)
- Tolerance for little (or low-quality) supervision
- Take feedback and iterate rapidly
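Since rapid training-set construction is the crux, here is a minimal sketch of one standard technique, uncertainty-sampling active learning; the model choice and names are illustrative assumptions, not the talk's actual system.

```python
# Minimal uncertainty-sampling loop (illustrative; model and names are
# assumptions, not the system described in the talk).
import numpy as np
from sklearn.linear_model import LogisticRegression

def next_labeling_batch(X_labeled, y_labeled, X_pool, batch_size=10):
    """Fit on current labels, then pick the pool items the model is least sure of."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)
    p = model.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(p - 0.5)                # closest to 0.5 = least certain
    return np.argsort(uncertainty)[-batch_size:]  # indices to send to a labeler
```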
22. Example: Accepted Papers
- Every conference comes with a slightly different format for accepted papers
- We want to extract accepted papers directly (before they make their way into DBLP, etc.)
- Assume:
- Lots of background knowledge (e.g., DBLP from last year)
- No supervision on the target page
- What can you do?
24. Down the Page a Bit
25. Record Identification
- To get started, we need to identify records
- Hey, we could write an XPath, no?
- So, what if no supervision is allowed?
- Given a crude classifier for paper records, can we recursively split up this page? (see the sketch below)
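A rough illustration of the recursive-splitting idea; `looks_like_record` stands in for the crude classifier, and the tree interface is a simplifying assumption rather than the system's actual heuristics.

```python
# Sketch of recursive page splitting driven by a crude record classifier.
# `looks_like_record` and `children` are hypothetical stand-ins.

def find_records(node, looks_like_record, children):
    """Descend the page's tree structure; stop splitting once a subtree
    looks like a single paper record."""
    if looks_like_record(node):
        return [node]
    kids = children(node)
    if not kids:          # leaf that never looked like a record
        return []
    records = []
    for child in kids:
        records.extend(find_records(child, looks_like_record, children))
    return records
```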
26. First-Level Splits
27. After More Splits
28. Now Get the Records
- Goal: To extract fields of individual records
- We need training examples, right?
- But these papers are new
- The best we can do without supervision is noisy labels, from having seen other such pages (see the sketch below)
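One plausible way to get such noisy labels, sketched under the assumption that the background knowledge is a list of known author names (e.g., from last year's DBLP); the matching rule is an illustrative stand-in.

```python
# Hedged sketch: the new papers are not yet in DBLP, but their authors often
# are, so background knowledge yields noisy (incomplete, sometimes wrong)
# labels rather than clean supervision.

def noisy_labels(lines, known_authors):
    """Tag each text line as a probable author line if it mentions a known author."""
    return [
        ("author" if any(name in line for name in known_authors) else "unknown",
         line)
        for line in lines
    ]
```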
29. Partial, Noisy Labels
30. Extracted Records
31. Refining Results via Feedback
- Now let's shift slightly to consider extraction of publications from academic home pages
- Must identify publication sections of faculty home pages, and extract paper citations from them
- Underlying data model for extracted data:
- A flexible graph-based model (similar to RDF or an ER conceptual model)
- Confidence scores per attribute or relationship (see the sketch below)
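A minimal sketch of the kind of graph-based model described: RDF-like triples carrying a per-fact confidence score and a lineage pointer back to the extraction source. The field names are illustrative assumptions.

```python
# Illustrative graph-style fact store with per-fact confidence and lineage.
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str       # e.g., "person:some_faculty_member"
    predicate: str     # e.g., "authored"
    obj: str           # e.g., "pub:some_title"
    confidence: float  # per-attribute/relationship score
    source: str        # lineage: page/section the extractor used

def dubious(facts, threshold=0.5):
    """Low-confidence facts are natural candidates for user feedback."""
    return [f for f in facts if f.confidence < threshold]
```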
32. Extracted Publication Titles
33. A Dubious Extracted Publication
PSOX provides declarative lineage tracking over operator executions
34. Where's the Problem?
Use lineage to find the source of the problem...
35. Source Page
Hmm, not a publication page... (but it may have looked like one to a classifier)
36. Feedback
The user corrects the classification of that section...
37. Faculty or Student?
- NLP?
- Build a classifier?
- Or...
38. Another Clue
39. Stepping Back
- Leads to large-scale, partially labeled relational learning
- Involving different types of entities and links
[Diagram: Prof-List and Student-List pages containing Prof and Student entities, connected by AdvisorOf links]
40. Maximizing the Value of What You Select to Show Users
41. Content Optimization
- PROBLEM: Match-making between content, user, and context
- Content:
- Programmed (e.g., by editors) or acquired (e.g., RSS feeds, UGC)
- User:
- Individual (e.g., b-cookie) or user segment
- Context:
- E.g., Y! or non-Y! property, device, time period
- APPROACH: Scalable algorithms that select content to show, using editorially determined content mixes, and respecting editorially set constraints and policies.
42. Team from Y! Research
Bee-Chung Chen
Pradheep Elango
Deepak Agarwal
Raghu Ramakrishnan
Wei Chu
Seung-Taek Park
43. Team from Y! Engineering
Nitin Motgi
Joe Zachariah
Scott Roy
Todd Beaupre
Kenneth Fox
44. Yahoo! Home Page Featured Box
- It is the top-center part of the Y! Front Page
- It has four tabs: Featured, Entertainment, Sports, and Video
45. Traditional Role of Editors
- Strict quality control
- Preserve the Yahoo! Voice
- E.g., typical mix of content
- Community standards
- Quality guidelines
- E.g., topical articles shown for a limited time
- Program articles periodically
- New ones pushed, old ones taken out
- A few tens of unique articles per day
- 16 articles at any given time; editors keep up with novel articles and remove fading ones
- Choose which articles appear in which tabs
46. Content Optimization Approach
- Editors continue to determine content sources, program some content, determine policies to ensure quality, and specify business constraints
- But we use a statistically based machine learning algorithm to determine which articles to show where when a user visits the FP
47. Modeling Approach
- Pure feature-based (did not work well)
- Article URL, keywords, categories
- Build offline models to predict CTR when an article is shown to users
- Models considered:
- Logistic regression with feature selection
- Decision trees; feature segments through clustering
- Track CTR per article in user segments through online models
- This worked well; it is the approach we eventually took (see the sketch below)
48. Challenges
- Non-stationary CTR
- To ensure webpage stability, we show the same article until we find a better one
- CTR decays sharply over time at F1
- Time-of-day and day-of-week effects in CTR
49. Modeling Approach
- Track item scores through dynamic linear models (fast Kalman filter algorithms)
- We model decay explicitly in our models
- We have a global time-of-day curve explicitly in our online models (see the sketch below)
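An illustrative one-dimensional dynamic linear model for an article's CTR: a textbook Kalman-style update with explicit decay and a global time-of-day multiplier. The constants and structure are assumptions, not the production model.

```python
# Scalar dynamic linear model: random-walk CTR state with decay and a
# time-of-day factor, updated by a standard Kalman filter step.
class ArticleCTRState:
    def __init__(self, ctr0=0.02, var0=1e-4, process_var=1e-7, decay=0.995):
        self.ctr, self.var = ctr0, var0
        self.process_var, self.decay = process_var, decay

    def predict(self, time_of_day_factor=1.0):
        self.ctr *= self.decay          # explicit decay of the item score
        self.var += self.process_var    # uncertainty grows between observations
        return self.ctr * time_of_day_factor

    def update(self, observed_ctr, obs_var=1e-4):
        gain = self.var / (self.var + obs_var)   # Kalman gain
        self.ctr += gain * (observed_ctr - self.ctr)
        self.var *= 1.0 - gain
```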
50. Explore/Exploit
- What is the best strategy for new articles?
- If we show it and it's bad, we lose clicks
- If we delay and it's good, we lose clicks
- Solution: Show it while we don't have much data, if it looks promising (see the sketch below)
- Classical multi-armed bandit type problem
- Our setup is different from the ones studied in the literature: a new ML problem
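A hedged sketch of the explore/exploit tension using epsilon-greedy over a changing article pool; as the next slide notes, the real setting (batch updates, arms arriving and leaving, non-stationary CTR) goes beyond this classical scheme.

```python
# Epsilon-greedy selection over a dynamic article pool (illustrative only).
import random

def choose_article(pool, estimated_ctr, epsilon=0.1):
    """pool: currently live article ids; estimated_ctr: id -> tracked CTR."""
    if random.random() < epsilon:
        return random.choice(pool)                    # explore, including new arms
    return max(pool, key=lambda a: estimated_ctr[a])  # exploit the current best
```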
51. Novel Aspects
- Classical: Arms assumed fixed over time
- We gain and lose arms over time
- Some theoretical work by Whittle in the '80s (operations research)
- Classical: Serving rule updated after each pull
- We compute an optimal design in batch mode
- Classical: CTR generally assumed stationary
- We have highly dynamic, non-stationary CTRs
52. Some Other Complications
- We run multiple (possibly correlated) experiments simultaneously; effective sample size calculation is a challenge
- Serving bias: It is incorrect to learn from data for serving scheme A and apply it to serving scheme B
- Need an unbiased quality score (see the sketch below)
- Bias sources: positional effects, time effects, set of articles shown together
- Incorporating feature-based techniques:
- Regression style, e.g., logistic regression
- Tree-based (hierarchical bandit)
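One standard way to de-bias such estimates is inverse-propensity weighting; this sketch is an assumption on my part, since the talk does not spell out its estimator.

```python
# Self-normalized inverse-propensity estimate of an article's CTR, aiming to be
# independent of the serving scheme that logged the data. `propensity` is the
# probability the logging scheme showed the article in that context.
def unbiased_ctr(impressions):
    """impressions: iterable of (clicked: bool, propensity: float) pairs."""
    w_clicks = sum(int(c) / p for c, p in impressions)
    w_views = sum(1.0 / p for _, p in impressions)
    return w_clicks / w_views if w_views else 0.0
```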
53. System Challenges
- Highly dynamic system characteristics
- Short article lifetimes; pool constantly changing; user population is dynamic; CTRs non-stationary
- Quick adaptation is key to success
- Scalability
- Thousands of page views/sec; data collection, model training, and article scoring done under tight latency constraints
54. Results
- We built an experimental infrastructure to test new content serving schemes
- Ran side-by-side experiments on live traffic
- Experiments were performed for several months; we consistently outperformed the old system
- Results showed we get more clicks by engaging more users
- Editorial overrides
- Did not reduce lift numbers substantially
55. Comparing Buckets
56. Experiments
- Daily CTR lift relative to editorial serving (defined below)
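For reference, the lift metric as conventionally defined (an assumption; the slides show only the plots):

```python
# Relative CTR lift of the learned serving scheme against the editorial baseline.
def ctr_lift(ctr_model, ctr_editorial):
    return (ctr_model - ctr_editorial) / ctr_editorial  # 0.15 -> 15% lift
```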
57. Lift Is Due to Increased Reach
- Lift in the fraction of clicking users
58. Related Work
- Amazon, Netflix, Y! Music, etc.
- Collaborative filtering with a large content pool
- Achieve lift by eliminating bad articles
- We have a small number of high-quality articles
- Search, Advertising
- Matching problem with a large content pool
- Match through feature-based models
59. Summary of Approach
- Offline models to initialize online models
- Online models to track performance
- Explore/exploit to converge fast
- Study user visit patterns and behavior; program content accordingly
60. Summary
- There are some exciting "grand challenge" problems that will require us to bring to bear ideas from data management, statistics, learning, and optimization
- i.e., data mining problems!
- Our field is too young to think about growing old, but the best is yet to be