Raghu Ramakrishnan - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Mirrors and Crystal BallsA Personal Perspective
on Data Mining
  • Raghu Ramakrishnan

2
Outline
  • This award recognizes the work of many people,
    and I represent the many
  • A warp-speed tour of some earlier work
  • What's a data mining talk without predictions?
  • Some exciting directions for data mining that
    we're working on at Yahoo!

3
A Look in the Mirror (and the faces I found
there; unfortunately, I couldn't find photos for
some people) (and apologies in advance for not
discussing the related work that provided context
and, often, tools and motivation)
4
1987
2007
5
Sequences, Streams
  • SEQ
  • Sequence Data Processing. P. Seshadri, M. Livny
    and R. Ramakrishnan. SIGMOD 1994
  • SEQ: A Model for Sequence Databases. P.
    Seshadri, M. Livny, and R. Ramakrishnan. ICDE
    1995
  • The Design and Implementation of a Sequence
    Database System. P. Seshadri, M. Livny and R.
    Ramakrishnan. VLDB 1996
  • SRQL
  • SRQL: Sorted Relational Query Language. R.
    Ramakrishnan, D. Donjerkovic, A. Ranganathan, K.
    Beyer, and M. Krishnaprasad. SSDBM 1998

6
Scalable Clustering
  • Birch
  • BIRCH: A Clustering Algorithm for Large
    Multidimensional Datasets. T. Zhang, R.
    Ramakrishnan, and M. Livny. SIGMOD 1996
  • Fast Density Estimation Using CF-Kernels. T.
    Zhang, R. Ramakrishnan, and M. Livny. KDD 1999
  • Clustering Large Databases in Arbitrary Metric
    Spaces. V. Ganti, R. Ramakrishnan, J. Gehrke, A.
    Powell, and J. French. ICDE 1999
  • Clustering Categorical Data
  • CACTUS: A Scalable Clustering Algorithm for
    Categorical Data. V. Ganti, J. Gehrke, and R.
    Ramakrishnan. KDD 1999

7
Scalable Decision Trees
  • Rain Forest
  • RainForest: A Framework for Fast Decision Tree
    Construction of Large Datasets. J. Gehrke, R.
    Ramakrishnan, and V. Ganti. VLDB 1998
  • Boat
  • BOAT: Optimistic Decision Tree Construction. J.
    Gehrke, V. Ganti, R. Ramakrishnan, and W-Y. Loh.
    SIGMOD 1999

8
Streaming and Evolving Data, Incremental Mining
  • FOCUS
  • FOCUS: A Framework for Measuring Changes in Data
    Characteristics. V. Ganti, J. Gehrke, R.
    Ramakrishnan, and W-Y. Loh. PODS 1999
  • DEMON
  • DEMON: Mining and Monitoring Evolving Data. V.
    Ganti, J. Gehrke, and R. Ramakrishnan. ICDE 1999

9
Mass Collaboration
  • The QUIQ Engine: A Hybrid IR-DB System. N.
    Kabra, R. Ramakrishnan, and V. Ercegovac. ICDE
    2003
  • Mass Collaboration: A Case Study. R.
    Ramakrishnan, A. Baptist, V. Ercegovac, M.
    Hanselman, N. Kabra, A. Marathe, U. Shaft. IDEAS
    2004

10
OLAP, Hierarchies, and Exploratory Mining
  • Prediction Cubes. B-C. Chen, L. Chen, Y. Lin, R.
    Ramakrishnan. VLDB 2005
  • Bellwether Analysis: Predicting Global Aggregates
    from Local Regions. B-C. Chen, R. Ramakrishnan,
    J.W. Shavlik, P. Tamma. VLDB 2006

11
Hierarchies Redux
  • OLAP Over Uncertain and Imprecise Data. D.
    Burdick, P. Deshpande, T.S. Jayram, R.
    Ramakrishnan, S. Vaithyanathan. VLDB 2005
  • Efficient Allocation Algorithms for OLAP Over
    Imprecise Data. D. Burdick, P.M. Deshpande, T.S.
    Jayram, R. Ramakrishnan, S. Vaithyanathan.
  • Learning from Aggregate Views. B-C. Chen, L.
    Chen, D. Musicant, and R. Ramakrishnan. ICDE 2006
  • Mondrian Multidimensional K-Anonymity. K.
    LeFevre, D.J. DeWitt, R. Ramakrishnan. ICDE 2006
  • Workload-Aware Anonymization. K. LeFevre, D.J.
    DeWitt, R. Ramakrishnan. KDD 2006
  • Privacy Skyline: Privacy with Multidimensional
    Adversarial Knowledge. B-C. Chen, R.
    Ramakrishnan, K. LeFevre. VLDB 2007
  • Composite Subset Measures. L. Chen, R.
    Ramakrishnan, P. Barford, B-C. Chen, V.
    Yegneswaran. VLDB 2006

12
Many Other Connections
  • Scalable Inference
  • Optimizing MPF Queries: Decision Support and
    Probabilistic Inference. H. Corrada Bravo, R.
    Ramakrishnan. SIGMOD 2007
  • Relational Learning
  • View Learning for Statistical Relational
    Learning, with an Application to Mammography. J.
    Davis, E.S. Burnside, I. Dutra, David Page, R.
    Ramakrishnan, V. Santos Costa, J.W. Shavlik.

13
Community Information Management
  • Efficient Information Extraction over Evolving
    Text Data. F. Chen, A. Doan, J. Yang, R.
    Ramakrishnan. ICDE 2008
  • Toward Best-Effort Information Extraction. W.
    Shen, P. DeRose, R. McCann, A. Doan, R.
    Ramakrishnan. SIGMOD 2008
  • Declarative Information Extraction Using Datalog
    with Embedded Extraction Predicates. W. Shen, A.
    Doan, J.F. Naughton, R. Ramakrishnan. VLDB 2007
  • Source-aware Entity Matching: A Compositional
    Approach. W. Shen, P. DeRose, L. Vu, A. Doan, R.
    Ramakrishnan. ICDE 2007

14
Through the Looking Glass
"Prediction is very hard, especially about the
future." (Yogi Berra)
15
Information Extraction and the challenge of
managing it
16
(No Transcript)
17
DBLife
  • Integrated information about a (focused)
    real-world community
  • Collaboratively built and maintained by the
    community
  • CIMple software package

18
Search Results of the Future
yelp.com
Gawker
babycenter
New York Times
epicurious
LinkedIn
answers.com
webmd
(Slide courtesy Andrew Tomkins)
19
Opening Up Yahoo! Search
  • Phase 1: Giving site owners and developers
    control over the appearance of Yahoo! Search
    results.
  • Phase 2: BOSS takes Yahoo!'s open strategy to the
    next level by providing Yahoo! Search
    infrastructure and technology to developers and
    companies to help them build their own search
    experiences.
(Slide courtesy Prabhakar Raghavan)
20
Custom Search Experiences
Social Search
Vertical Search
Visual Search
(Slide courtesy Prabhakar Raghavan)
21
Economics of IE
  • Data is cheap; supervision is expensive
  • The cost of supervision, especially large,
    high-quality training sets, is high
  • By comparison, the cost of data is low
  • Therefore:
  • Rapid training-set construction/active learning
    techniques (see the sketch below)
  • Tolerance for little (or low-quality) supervision
  • Take feedback and iterate rapidly
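The emphasis on cheap supervision points to techniques such as active
learning. Here is a minimal uncertainty-sampling sketch in Python (a
hypothetical illustration, not the system from the talk; the scorer, item
set, and budget are invented): rank unlabeled items by how unsure a crude
model is about them, and send only the most uncertain ones to a human
labeler.

# Minimal uncertainty-sampling loop (hypothetical sketch, not the talk's system).
# A crude probabilistic scorer ranks unlabeled items; the least confident
# ones are handed to a human for labeling.

def uncertainty(prob):
    # 0.5 is maximally uncertain; 0.0 or 1.0 means the scorer is confident.
    return 1.0 - abs(prob - 0.5) * 2.0

def select_for_labeling(unlabeled, score_fn, budget=10):
    # Return the `budget` items whose predicted probabilities are least confident.
    ranked = sorted(unlabeled, key=lambda item: uncertainty(score_fn(item)), reverse=True)
    return ranked[:budget]

if __name__ == "__main__":
    # Toy scorer: longer strings "look more like" paper-record text.
    candidates = ["BIRCH: A Clustering Algorithm ...", "home", "Accepted Papers", "contact"]
    toy_scorer = lambda text: min(len(text) / 30.0, 1.0)
    print(select_for_labeling(candidates, toy_scorer, budget=2))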

22
Example Accepted Papers
  • Every conference comes with a slightly different
    format for accepted papers
  • We want to extract accepted papers directly
    (before they make their way into DBLP etc.)
  • Assume
  • Lots of background knowledge (e.g., DBLP from
    last year)
  • No supervision on the target page
  • What can you do?

23
(No Transcript)
24
Down the Page a Bit
25
Record Identification
  • To get started, we need to identify records
  • Hey, we could write an XPath, no?
  • So, what if no supervision is allowed?
  • Given a crude classifier for paper records, can
    we recursively split up this page?
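One way to picture that last question, as a hypothetical Python sketch (the
node representation, classifier, and helper names are invented; this is not
the actual algorithm): keep subdividing a page region until a crude
classifier says each remaining piece looks like a single paper record.

# Hypothetical sketch: recursively split a page region until each piece
# looks like one paper record according to a crude, possibly noisy classifier.

def split_into_records(region, looks_like_one_record, children_of):
    # region: an opaque page node; children_of(region) returns its sub-regions;
    # looks_like_one_record(region) is the crude boolean classifier.
    if looks_like_one_record(region):
        return [region]
    records = []
    for child in children_of(region):
        records.extend(split_into_records(child, looks_like_one_record, children_of))
    return records

if __name__ == "__main__":
    # Toy page: nested lists stand in for nested page regions; strings are records.
    page = [["paper A", "paper B"], [["paper C"], ["paper D", "paper E"]]]
    is_record = lambda r: isinstance(r, str)
    sub_regions = lambda r: r if isinstance(r, list) else []
    print(split_into_records(page, is_record, sub_regions))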

26
First Level Splits
27
After More Splits
28
Now Get the Records
  • Goal: To extract fields of individual records
  • We need training examples, right?
  • But these papers are new!
  • The best we can do without supervision is noisy
    labels, from having seen other such pages (a toy
    example follows)
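As a toy example of where such noisy labels could come from (a sketch; the
KNOWN_AUTHORS set is an invented stand-in for background knowledge such as
last year's DBLP): label a text piece as an author only if it matches
something already known, accepting that the labels will be partial and noisy.

# Hypothetical sketch: derive noisy field labels for a text segment by matching
# its pieces against background knowledge (here, a tiny set of known authors).

KNOWN_AUTHORS = {"r. ramakrishnan", "j. gehrke", "v. ganti"}  # stand-in for DBLP

def noisy_author_labels(segment):
    # Label each comma-separated piece AUTHOR if it matches background knowledge.
    labels = []
    for piece in segment.split(","):
        name = piece.strip()
        labels.append((name, "AUTHOR" if name.lower() in KNOWN_AUTHORS else "OTHER"))
    return labels

if __name__ == "__main__":
    print(noisy_author_labels("RainForest, J. Gehrke, R. Ramakrishnan, V. Ganti"))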

29
Partial, Noisy Labels
30
Extracted Records
31
Refining Results via Feedback
  • Now let's shift slightly to consider extraction
    of publications from academic home pages
  • Must identify publication sections of faculty
    home pages, and extract paper citations from them
  • The underlying data model for extracted data is:
  • A flexible graph-based model (similar to RDF or
    the ER conceptual model)
  • Confidence scores per attribute or relationship
    (illustrated below)
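A minimal illustration of such a data model (illustrative only; the class and
field names are invented, not the actual schema): entities carry
per-attribute confidence scores, and relationships between entities carry
their own confidence.

# Illustrative sketch of a graph-based extraction data model with per-attribute
# and per-relationship confidence scores (invented names, not the real schema).

from dataclasses import dataclass, field

@dataclass
class Entity:
    kind: str                                  # e.g., "Person", "Publication"
    attrs: dict = field(default_factory=dict)  # attribute name -> (value, confidence)

@dataclass
class Relationship:
    kind: str          # e.g., "authorOf"
    src: Entity
    dst: Entity
    confidence: float  # how sure the extractor is about this edge

if __name__ == "__main__":
    person = Entity("Person", {"name": ("R. Ramakrishnan", 0.98)})
    paper = Entity("Publication", {"title": ("Some Extracted Title", 0.61)})
    edge = Relationship("authorOf", person, paper, confidence=0.55)
    print(edge.kind, edge.confidence, paper.attrs["title"])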

32
Extracted Publication Titles
33
A Dubious Extracted Publication
PSOX provides declarative lineage tracking over
operator executions
34
Where's the Problem?
Use lineage to find the source of the problem...
35
Source Page
Hmm, not a publication page... (but it may have
looked like one to a classifier)
36
Feedback
User corrects the classification of that section...
37
Faculty or Student?
  • NLP
  • Build a Classifier
  • Or...

38
Another Clue
39
Stepping Back
  • Leads to large-scale, partially-labeled
    relational learning
  • Involving different types of entities and links

(Diagram: entity types Prof-List, Prof,
Student-List, and Student, connected by AdvisorOf
links)
40
Maximizing the Value of What You Select to Show
Users
41
Content Optimization
  • PROBLEM: Match-making between content, user, and
    context
  • Content
  • Programmed (e.g., by editors) or acquired (e.g.,
    RSS feeds, UGC)
  • User
  • Individual (e.g., b-cookie), or user segment
  • Context
  • E.g., Y! or non-Y! property, device, time period
  • APPROACH: Scalable algorithms that select content
    to show, using editorially determined content
    mixes, and respecting editorially set constraints
    and policies.

42
Team from Y! Research
Bee-Chung Chen
Pradheep Elango
Deepak Agarwal
Raghu Ramakrishnan
Wei Chu
Seung-Taek Park
43
Team from Y! Engineering
Nitin Motgi
Joe Zachariah
Scott Roy
Todd Beaupre
Kenneth Fox
44
Yahoo! Home Page Featured Box
  • It is the top-center part of the Y! Front Page
  • It has four tabs: Featured, Entertainment,
    Sports, and Video

45
Traditional Role of Editors
  • Strict quality control
  • Preserve Yahoo! Voice
  • E.g., typical mix of content
  • Community standards
  • Quality guidelines
  • E.g., topical articles shown for a limited time
  • Program articles periodically
  • New ones pushed, old ones taken out
  • Few tens of unique articles per day
  • 16 articles at any given time; editors keep up
    with novel articles and remove fading ones
  • Choose which articles appear in which tabs

46
Content Optimization Approach
  • Editors continue to determine content sources,
    program some content, determine policies to
    ensure quality, and specify business constraints
  • But we use a statistically based machine learning
    algorithm to determine which articles to show, and
    where, when a user visits the Front Page

47
Modeling Approach
  • Pure feature-based (did not work well)
  • Article URL, keywords, categories
  • Build offline models to predict CTR when an
    article is shown to users
  • Models considered:
  • Logistic regression with feature selection
  • Decision trees; feature segments through
    clustering
  • Track CTR per article in user segments through
    online models (a toy sketch follows)
  • This worked well; the approach we took eventually
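A toy rendering of the per-article online CTR idea (a sketch, not the
production system; the prior CTR here stands in for what an offline
feature-based model might predict): keep per-(segment, article) click and
view counts, smoothed toward the prior.

# Toy online CTR tracker per (user segment, article), smoothed by a prior CTR
# that an offline feature-based model might supply. A sketch, not the real system.

from collections import defaultdict

class OnlineCTR:
    def __init__(self, prior_ctr=0.05, prior_strength=20.0):
        self.prior_ctr = prior_ctr            # e.g., an offline model's prediction
        self.prior_strength = prior_strength  # pseudo-views behind the prior
        self.views = defaultdict(float)
        self.clicks = defaultdict(float)

    def update(self, segment, article, views, clicks):
        key = (segment, article)
        self.views[key] += views
        self.clicks[key] += clicks

    def estimate(self, segment, article):
        key = (segment, article)
        num = self.clicks[key] + self.prior_ctr * self.prior_strength
        den = self.views[key] + self.prior_strength
        return num / den

if __name__ == "__main__":
    ctr = OnlineCTR()
    ctr.update("sports_fans", "article_42", views=1000, clicks=80)
    print(round(ctr.estimate("sports_fans", "article_42"), 4))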

48
Challenges
  • Non-stationary CTR
  • To ensure webpage stability, we show the same
    article until we find a better one
  • CTR decays over time, sharply at F1
  • Time-of-day and day-of-week effects in CTR

49
Modeling Approach
  • Track item scores through dynamic linear models
    (fast Kalman filter algorithms); a one-dimensional
    sketch appears below
  • We model decay explicitly in our models
  • We have a global time-of-day curve explicitly in
    our online models
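To make the dynamic-linear-model bullet concrete, here is a one-dimensional
Kalman filter sketch (a simplification with invented noise parameters; the
actual models also handle decay and the time-of-day curve): the article's
true score is treated as a slowly drifting hidden state, updated from each
observed batch CTR.

# One-dimensional Kalman filter tracking a slowly drifting article score.
# A sketch of the dynamic-linear-model idea with made-up noise parameters;
# the real models also handle decay and a global time-of-day curve.

class ScalarKalman:
    def __init__(self, init_mean=0.05, init_var=1.0, process_var=1e-4, obs_var=1e-3):
        self.mean, self.var = init_mean, init_var
        self.process_var = process_var  # how fast the true score can drift
        self.obs_var = obs_var          # noise in each observed batch CTR

    def step(self, observed_ctr):
        # Predict: the hidden score drifts, so uncertainty grows.
        self.var += self.process_var
        # Update: blend prediction and observation by their precisions.
        gain = self.var / (self.var + self.obs_var)
        self.mean += gain * (observed_ctr - self.mean)
        self.var *= (1.0 - gain)
        return self.mean

if __name__ == "__main__":
    kf = ScalarKalman()
    for batch_ctr in [0.08, 0.07, 0.05, 0.04, 0.035]:  # a decaying CTR stream
        print(round(kf.step(batch_ctr), 4))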

50
Explore/Exploit
  • What is the best strategy for new articles?
  • If we show it and it's bad, we lose clicks
  • If we delay and it's good, we lose clicks
  • Solution: show it while we don't have much data,
    if it looks promising
  • Classical multi-armed bandit type problem (a
    textbook-style sketch follows)
  • Our setup is different from the ones studied in
    the literature: a new ML problem
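For intuition about explore/exploit, a textbook-style epsilon-greedy sketch
(not the deployed scheme, which computes optimal designs in batch over a
changing, non-stationary pool): mostly show the article with the best
observed CTR, but occasionally show a random one to gather data on newcomers.

# Epsilon-greedy article selection: exploit the best-known CTR most of the time,
# explore at random otherwise. A textbook-style sketch, not the deployed scheme.

import random

def choose_article(stats, epsilon=0.1):
    # stats: {article_id: (clicks, views)}; returns the article to show next.
    if not stats:
        return None
    if random.random() < epsilon:
        return random.choice(list(stats))        # explore
    def observed_ctr(article):
        clicks, views = stats[article]
        return clicks / views if views else 0.0
    return max(stats, key=observed_ctr)          # exploit

if __name__ == "__main__":
    pool = {"a1": (50, 1000), "a2": (5, 40), "a3": (0, 0)}  # a3 is brand new
    random.seed(0)
    print([choose_article(pool) for _ in range(5)])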

51
Novel Aspects
  • Classical: arms assumed fixed over time
  • We gain and lose arms over time
  • Some theoretical work by Whittle in the 1980s
    (operations research)
  • Classical: serving rule updated after each pull
  • We compute an optimal design in batch mode
  • Classical: CTR generally assumed stationary
  • We have highly dynamic, non-stationary CTRs

52
Some Other Complications
  • We run multiple experiments (possibly correlated)
    simultaneously; effective sample-size calculation
    is a challenge
  • Serving bias: it is incorrect to learn from data
    collected under serving scheme A and apply the
    result to serving scheme B
  • Need an unbiased quality score (one standard
    correction is sketched below)
  • Bias sources: positional effects, time effects,
    the set of articles shown together
  • Incorporating feature-based techniques
  • Regression-style, e.g., logistic regression
  • Tree-based (hierarchical bandits)
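One standard way to build an unbiased quality score from logged data is
inverse propensity weighting; this generic sketch is only an illustration
(the deck does not say which correction was actually used, and the
propensities below are invented): weight each logged impression by the
inverse of the probability that the serving scheme showed the article.

# Inverse propensity weighting (generic illustration, not the method named in
# the talk): reweight logged impressions by 1 / probability-of-being-shown to
# correct for the bias of the serving scheme that produced the log.

def ipw_ctr(logs):
    # logs: list of (shown_prob, clicked) pairs for one article under scheme A.
    num = sum(clicked / prob for prob, clicked in logs)
    den = sum(1.0 / prob for prob, clicked in logs)
    return num / den if den else 0.0

if __name__ == "__main__":
    # The article was mostly shown in favorable contexts (prob 0.9), rarely explored.
    logs = [(0.9, 1), (0.9, 0), (0.9, 0), (0.1, 1)]
    print(round(ipw_ctr(logs), 3))  # contrast with the naive CTR of 0.5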

53
System Challenges
  • Highly dynamic system characteristics
  • Short article lifetimes, pool constantly
    changing, user population is dynamic, CTRs
    non-stationary
  • Quick adaptation is key to success
  • Scalability
  • 1000s of page views/sec; data collection, model
    training, and article scoring must be done under
    tight latency constraints

54
Results
  • We built an experimental infrastructure to test
    new content serving schemes
  • Ran side-by-side experiments on live traffic
  • Experiments were run for several months; we
    consistently outperformed the old system
  • Results showed we get more clicks by engaging
    more users
  • Editorial overrides
  • Did not reduce lift numbers substantially

55
Comparing buckets
56
Experiments
  • Daily CTR Lift relative to editorial serving

57
Lift is Due to Increased Reach
  • Lift in fraction of clicking users

58
Related Work
  • Amazon, Netflix, Y! Music, etc.
  • Collaborative filtering with a large content pool
  • Achieve lift by eliminating bad articles
  • We have a small number of high-quality articles
  • Search, Advertising
  • Matching problem with a large content pool
  • Match through feature-based models

59
Summary of Approach
  • Offline models to initialize online models
  • Online models to track performance
  • Explore/exploit to converge fast
  • Study user visit patterns and behavior; program
    content accordingly

60
Summary
  • There are some exciting "grand challenge"
    problems that will require us to bring to bear
    ideas from data management, statistics, learning,
    and optimization
  • i.e., data mining problems!
  • Our field is too young to think about growing
    old, but the best is yet to be