Title: Internet Research: What's hot in Search, Advertising, Cloud Computing
1. Internet Research: What's hot in Search, Advertising, Cloud Computing
Rajeev Rastogi, Yahoo! Labs Bangalore
2. The most visited site on the internet
- 600 million users per month
- Super popular properties
- News, finance, sports
- Answers, flickr, del.icio.us
- Mail, messaging
- Search
3. Unparalleled scale
- 25 terabytes of data collected each day
- Over 4 billion clicks every day
- Over 4 billion emails per day
- Over 6 billion instant messages per day
- Over 20 billion web documents indexed
- Over 4 billion images searchable
No other company on the planet processes as much
data as we do!
4. Yahoo! Labs Bangalore
- Focus is on basic and applied research
- Search
- Advertising
- Cloud computing
- University relations
- Faculty research grants
- Summer internships
- Sharing data/computing infrastructure
- Conference sponsorships
- PhD co-op program
5. Web Search
6. What does search look like today?
7. Search results of the future: Structured abstracts
Example sites with structured abstracts: yelp.com, Gawker, babycenter, New York Times, epicurious, LinkedIn, answers.com, webmd
8. Search results of the future: Query refinement
9. Search results of the future: Rich media
10. Technologies that are enabling search transformation
- Information extraction (structured abstracts)
- Web page classification (query refinement)
- Multimedia search (rich media)
11. Information extraction (IE)
- Goal: Extract structured records from Web pages
- Example attributes: Name, Category, Address, Map, Phone, Price, Reviews
12. Multiple verticals
- Business, social networking, video, ...
13. One schema per vertical
14. IE on the Web is a hard problem
- Web pages are noisy
- Pages belonging to different Web sites have different layouts
15. Web page types
Hand-crafted
16. Template-based pages
- Pages within a Web site are generated using scripts and have very similar structure
- Can be leveraged for extraction
- 30% of crawled Web pages
- Information rich, frequently appear in the top results of search queries
- E.g. search query "Chinese Mirch New York": 9 template-based pages in the top 10 results
17. Wrapper Induction
- Enables extraction from template-based pages
- Workflow: sample pages from a Web site → annotate the sample pages → learn wrappers (XPath rules) → apply the wrappers to the site's pages → extract records
18. Example
- Generalize XPath: /html/body/div/div/div/div/div/div/span → /html/body//div//span
19. Filters
- Apply filters to prune from multiple candidates that match the XPath expression
- XPath: /html/body//div//span
- Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
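To make the wrapper mechanics concrete, here is a minimal sketch (not the production system) of applying a learned rule: the generalized XPath selects candidate nodes and the regex filter prunes non-phone matches. The HTML snippet, XPath, and phone pattern are illustrative assumptions.

```python
# Minimal sketch: apply a learned wrapper (generalized XPath + regex filter)
# to a template-based page. The page, XPath, and regex are illustrative.
import re
from lxml import html

PHONE_XPATH = "/html/body//div//span"                 # generalized XPath rule
PHONE_REGEX = re.compile(r"\(\d{3}\) \d{3}-\d{4}")    # prunes non-phone candidates

def extract_phone(page_html: str):
    tree = html.fromstring(page_html)
    # All nodes matching the XPath are candidates; the regex filter keeps
    # only those whose text looks like a US phone number.
    for node in tree.xpath(PHONE_XPATH):
        text = (node.text_content() or "").strip()
        if PHONE_REGEX.fullmatch(text):
            return text
    return None

page = """<html><body><div><span>Chinese Mirch</span>
          <div><span>(212) 532-3663</span></div></div></body></html>"""
print(extract_phone(page))   # -> "(212) 532-3663"
```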
20. Limitations of wrappers
- Won't work across Web sites due to different page layouts
- Scaling to thousands of sites can be a challenge
- Need to learn a separate wrapper for each site
- Annotating example pages from thousands of sites can be time-consuming and expensive
21. Research challenge
- Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site
- Only annotate pages from a few sites initially as training data
22. Conditional Random Fields (CRFs)
- Models the conditional probability distribution of a label sequence y = y1, ..., yn given an input sequence x = x1, ..., xn
- f_k: features, λ_k: weights
- Choose λ_k to maximize the log-likelihood of the training data
- Use the Viterbi algorithm to compute the label sequence y with the highest probability
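The slide's notation (label sequence y, input sequence x, features f_k, weights λ_k) matches the standard linear-chain CRF; as a hedged reconstruction of the model being referred to, with Z(x) the usual partition function:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\Big(\sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y_{i-1}, y_i, x, i)\Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big(\sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y'_{i-1}, y'_i, x, i)\Big)
```

Training maximizes the log-likelihood of the annotated pages with respect to the λ_k; decoding uses Viterbi, as stated above.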
23. CRF-based IE
- Web pages can be viewed as labeled sequences
- Train a CRF using pages from a few Web sites
- Then use the trained CRF to extract from the remaining sites
24. Drawbacks of CRFs
- Require too many training examples
- Have previously been used to segment short strings with similar structure
- However, may not work too well across Web sites that
- contain long pages with lots of noise
- have very different structure
25. An alternate approach that exploits site knowledge
- Build attribute classifiers for each attribute
- Use pages from a few initial Web sites
- For each page from a new Web site:
- Segment the page into a sequence of fields (using static repeating text)
- Use attribute classifiers to assign attribute labels to fields
- Use constraints to disambiguate labels (see the sketch after this list)
- Uniqueness: an attribute occurs at most once in a page
- Proximity: attribute values appear close together in a page
- Structural: relative positions of attributes are identical across pages of a Web site
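As a toy illustration of only the uniqueness constraint (with made-up classifier scores, not the production algorithm), labels can be disambiguated by choosing the assignment of attributes to fields that maximizes total classifier confidence while using each attribute at most once:

```python
# Sketch: uniqueness-constrained disambiguation as a maximum-score matching.
# Classifier scores below are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import linear_sum_assignment

fields = ["21 Club", "American", "21 W 52nd St New York, NY 10019", "(212) 582 7200"]
attrs = ["Name", "Category", "Address", "Phone"]

# score[i][j] = classifier confidence that field i carries attribute j
score = np.array([
    [0.7, 0.4, 0.1, 0.0],   # "21 Club": plausible Name or Category
    [0.5, 0.8, 0.1, 0.0],   # "American": plausible Name or Category
    [0.0, 0.1, 0.9, 0.1],   # address-like field
    [0.0, 0.0, 0.1, 0.9],   # phone-like field
])

# Uniqueness: each attribute labels at most one field -> best one-to-one matching.
rows, cols = linear_sum_assignment(-score)   # negate to maximize total score
for i, j in zip(rows, cols):
    print(f"{attrs[j]:8s} <- {fields[i]}")
```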
26. Attribute classifiers + constraints example
Page 1 (Chinese Mirch): Name = Chinese Mirch; Category = Chinese, Indian; Address = 120 Lexington Avenue, New York, NY 10016; Phone = (212) 532 3663
Page 2 (Jewel of India): Name = Jewel of India; Category = Indian; Address = 15 W 44th St, New York, NY 10016; Phone = (212) 869 5544
Page 3 (21 Club), classifier output ambiguous: "21 Club" = {Category, Name}; "American" = {Name, Noise}; Address = 21 W 52nd St, New York, NY 10019; Phone = (212) 582 7200
Applying the uniqueness constraint on Name and the precedence constraint Name < Category resolves Page 3 to: Name = 21 Club; Category = American; Address = 21 W 52nd St, New York, NY 10019; Phone = (212) 582 7200
27. Other IE scenarios: Browse page extraction
- Similar-structured records
28. IE big picture/taxonomy
- Things to extract from
- Template-based, browse, hand-crafted pages, text
- Things to extract
- Records, tables, lists, named entities
- Techniques used
- Structure-based (HTML tags, DOM tree paths), e.g. wrappers
- Content-based (attribute values/models), e.g. dictionaries
- Structure + content (sequential/hierarchical relationships among attribute values), e.g. hierarchical CRFs
- Level of automation
- Manual, supervised, unsupervised
29. Web Page Classification: Requirements
- Quality
- High Precision and Recall
- Leverage structured input (links, co-citations) and output (taxonomy)
- Scalability
- Large numbers of training examples, features, and classes
- Complex structured input and output
- Cost
- Small human effort (for labeling of pages)
- Compact classifier model
- Low prediction time
30. Structured Output Learning
- Structured Output Examples
- Multi-class
- Taxonomy
- Naïve approach
- Separate binary classifier per class
- Separate classifier for each taxonomy level
- Better approach: single (SVM) classifier
- Higher accuracy, more efficient
- Sequential Dual Method (SDM)
- Visit each example sequentially and solve the associated QP problem (in the dual) efficiently
- Order of magnitude faster
Example taxonomy: Health (Fitness, Medicine); Sport (Cricket (One-day, Test), Soccer)
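A small scikit-learn sketch of the contrast the slide draws, on synthetic data standing in for labeled Web pages. This is illustrative only and does not implement the Sequential Dual Method itself.

```python
# Naive approach (one binary classifier per class) vs. a single multi-class SVM
# trained on one joint objective (Crammer-Singer). Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy multi-class data standing in for labeled Web pages.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=20,
                           n_classes=5, random_state=0)

# Naive approach: a separate binary classifier per class (one-vs-rest).
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

# Better approach: a single multi-class SVM with one joint objective.
single = LinearSVC(multi_class="crammer_singer").fit(X, y)

print("one-vs-rest train accuracy:      ", ovr.score(X, y))
print("single multi-class SVM accuracy: ", single.score(X, y))
```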
31. Classification with Relational Information
- Relational information
- Web page links, structural similarity
- Graph representation
- Pages as nodes (with labels)
- Edge weights s(j,k): page similarity, out-link/co-citation existence, etc.
- Classification can be expressed as an optimization problem (a typical formulation is sketched below)
(Figure: a page graph with out-link, co-citation, and similar-structure edges)
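The slide does not spell out the objective itself; a common graph-regularized formulation consistent with the edge weights s(j,k) above fits the known labels while penalizing disagreement between strongly connected or similar pages (here ℓ is a loss against observed labels ŷ_j and μ is an assumed trade-off parameter):

```latex
\min_{y}\; \sum_{j \in \text{labeled}} \ell\big(y_j, \hat{y}_j\big)
\;+\; \mu \sum_{(j,k) \in E} s(j,k)\,\lVert y_j - y_k \rVert^2
```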
32. Multimedia Search
- Availability and consumption of multimedia content on the Internet is increasing
- 500 billion images will be captured in 2010
- Leveraging content and metadata is important for MM search
- Some big technical challenges are
- Results diversity
- Relevance
- Image classification, e.g., pornography
33. Near-Duplicate Detection
- Multiple near-similar versions of an image exist on the internet
- scaled, cropped, captioned, small scene change, etc.
- Near-duplicates adversely impact user experience
- Can we use a compact description and dedup in constant time?
- Fourier-Mellin Transform (FMT): translation, rotation, and scale invariant
- Signature generation using a small number of low-frequency coefficients of the FMT (sketch below)
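A minimal numpy/scipy sketch of an FMT-style signature under the usual construction: FFT magnitude for translation invariance, log-polar resampling so rotation and scale become shifts, a second FFT magnitude to cancel those shifts, then a handful of low-frequency coefficients. Grid sizes, coefficient counts, and the similarity test are illustrative assumptions, not the production pipeline.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def fmt_signature(img, n_coeffs=32):
    # 1. Translation invariance: magnitude of the 2D FFT.
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = spectrum.shape
    cy, cx = h / 2.0, w / 2.0
    # 2. Resample the magnitude spectrum on a log-polar grid:
    #    rotation -> shift in angle, scaling -> shift in log-radius.
    n_r, n_t = 64, 64
    log_r = np.linspace(0.0, np.log(min(cy, cx)), n_r)
    theta = np.linspace(0.0, 2 * np.pi, n_t, endpoint=False)
    rr, tt = np.meshgrid(np.exp(log_r), theta, indexing="ij")
    ys, xs = cy + rr * np.sin(tt), cx + rr * np.cos(tt)
    logpolar = map_coordinates(spectrum, [ys, xs], order=1, mode="nearest")
    # 3. Rotation/scale invariance: magnitude of the FFT of the log-polar image.
    fmt = np.abs(np.fft.fft2(logpolar))
    # 4. Signature: a small number of low-frequency coefficients, normalized.
    sig = fmt[:8, :8].flatten()[:n_coeffs]
    return sig / (np.linalg.norm(sig) + 1e-12)

img = np.random.rand(128, 128)
s1, s2 = fmt_signature(img), fmt_signature(np.rot90(img))   # 90-degree rotated copy
print(float(np.dot(s1, s2)))   # close to 1.0 for near-duplicates
```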
34. Filtering noisy tags to improve relevance
- Measures such as IDF may assign high weights to noisy tags
- Treat tag sets as bag-of-words, a random collection of terms
- Boosting the weights of tags based on their co-occurrence with other tags can filter out noise (toy sketch below)
(Figure: tag weights under IDF vs. co-occurrence boosting)
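A toy sketch of the idea with a hypothetical photo/tag corpus: IDF alone would rank the rare noisy tag highest, while boosting by co-occurrence with the photo's other tags pushes its weight down.

```python
# Illustrative only: boost each tag's IDF weight by how often it co-occurs
# with the photo's other tags elsewhere in the corpus.
import math

photos = [
    ["beach", "sunset", "sea"],
    ["beach", "sea", "sand"],
    ["sunset", "sea", "xyz123"],   # "xyz123" is a rare, noisy tag with a high IDF
]

n = len(photos)
df = {}
for tags in photos:
    for t in set(tags):
        df[t] = df.get(t, 0) + 1
idf = {t: math.log(n / df[t]) for t in df}

def cooccur_elsewhere(photo_idx, a, b):
    # How often tags a and b appear together in *other* photos.
    return sum(1 for i, tags in enumerate(photos)
               if i != photo_idx and a in tags and b in tags)

def boosted_weights(photo_idx):
    tags = photos[photo_idx]
    weights = {}
    for t in tags:
        others = [o for o in tags if o != t]
        boost = sum(cooccur_elsewhere(photo_idx, t, o) for o in others) / max(len(others), 1)
        weights[t] = idf[t] * boost
    return weights

print(boosted_weights(2))
# IDF alone would rank "xyz123" highest; co-occurrence boosting drives its
# weight to 0 because it never co-occurs with the photo's other tags elsewhere.
```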
35. Online Advertising
36. Sponsored search ads
Search query
Ad
37. How it works
Ad Index
Advertiser
Sponsored search engine
- Engine decides when/where to show this ad on the search results page
- Advertiser pays only if the user clicks on the ad
38. Ad selection criterion
- Problem: which ads to show from among the ads containing the keyword?
- Ads with the highest bid may not maximize revenue
- Choose ads with maximum expected revenue
- Weigh the bid amount by the click probability (tiny sketch below)
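A tiny sketch of the selection rule with illustrative numbers: ranking by bid alone would pick ad A, while expected revenue (bid × click probability) picks ad B.

```python
# Rank ads by bid * estimated click probability instead of by bid alone.
ads = [
    {"name": "A", "bid": 2.00, "p_click": 0.01},
    {"name": "B", "bid": 0.50, "p_click": 0.08},   # lower bid, much higher CTR
]
ranked = sorted(ads, key=lambda ad: ad["bid"] * ad["p_click"], reverse=True)
for ad in ranked:
    print(ad["name"], "expected revenue per impression =", ad["bid"] * ad["p_click"])
# Ad B wins despite the lower bid (0.04 vs. 0.02 expected revenue).
```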
39. Contextual Advertising
Ads
40. Contextual ads
- Similar to sponsored search, but now ads are shown on general Web pages as opposed to only search pages
- Advertisers bid on keywords
- Advertiser pays only if the user clicks; Y! and the publisher share the paid amount
- Ad matching engine ranks ads based on expected revenue (bid amount × click probability)
41. Estimating click probability
- Use a logistic regression model
- p(click | ad, page, user)
- f_i: ith feature for (ad, page, user)
- w_i: weight for feature f_i
- Training data: ad click logs (all clicks + non-click samples)
- Optimize the log-likelihood to learn the weights
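The slide lists the ingredients but not the formula; the standard logistic regression form implied by the f_i / w_i notation, together with the log-likelihood objective it mentions (y ∈ {0,1} indicates a click, p(x) the predicted probability for example x), is:

```latex
p(\text{click} \mid \text{ad}, \text{page}, \text{user})
  = \frac{1}{1 + \exp\!\big(-\sum_i w_i\, f_i(\text{ad}, \text{page}, \text{user})\big)},
\qquad
\max_{w}\ \sum_{(x,\, y)\, \in\, \text{logs}} \Big[\, y \log p(x) + (1 - y)\log\big(1 - p(x)\big) \Big]
```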
42. Features
- Ad: bid terms, title, body, category, ...
- Page: url, title, keywords in body, category, ...
- User
- Geographic (location, time)
- Demographic (age, gender)
- Behavioral
- Combine the above to get (billions of) richer features
- E.g. (apple in ad title) AND (ipod in page body) AND (20 < user age < 30)
- Select the subset that leads to an improvement in likelihood (hashing sketch below)
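A short sketch of how crossed (ad × page × user) features might be materialized; since the cross product runs into billions of features, a fixed-size hashed index space is a common trick. All names and sizes here are assumptions for illustration.

```python
# Cross features with feature hashing (illustrative, not the production pipeline).
import hashlib

NUM_BUCKETS = 2 ** 24   # fixed-size hashed feature space

def feature_index(*parts):
    key = "|".join(parts).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_BUCKETS

ad_title_terms = ["apple", "ipod"]
page_body_terms = ["ipod", "music"]
user_age_bucket = "age_20_30"

# Cross features: (ad term) x (page term) x (user age bucket)
crossed = [feature_index("ad_title:" + a, "page_body:" + p, "user:" + user_age_bucket)
           for a in ad_title_terms for p in page_body_terms]
print(crossed)
```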
43. Banner ads
- Show Web page with display ads
- Creates brand awareness
44. How it works
Ad Index
Advertiser: "I want 1M impressions on finance.yahoo.com, gender = male, age 20-30, during the month of April 2009"
Banner Ad Engine
- Engine guarantees 1M impressions
- Advertiser pays a fixed price
- No dependence on clicks
- Engine does admission control, decides allocation of ads to pages
45. Allocation Example
(Figure: two allocations of the demands (Gender=Male, 12M) and (Age>30, 12M) over supply pools with (Qty, Price) entries such as (10M, 10), (10M, 20), (6M, 10), (6M, 20). A suboptimal allocation leaves unallocated inventory worth 60M; the optimal allocation leaves unallocated inventory worth 120M.)
46. Research problem
- Goal: Allocate demands so that the value of unallocated inventory is maximized
- Similar to the transportation problem
47. Transportation problem
(Figure: a bipartite graph with demand nodes 1, 2, ..., i on one side, each with demand d_i, and supply/region nodes 1, 2, ..., j on the other, each with supply s_j and price p_j; demand i has edges to its eligible regions R_i, and x_ij denotes the units of demand i allocated to region j.)
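A hedged sketch of the allocation as a transportation-style linear program using scipy: every demand d_i must be met from its eligible regions, each region j has supply s_j and price p_j, and maximizing the value of unallocated inventory is equivalent to minimizing the price-weighted inventory consumed. The numbers and eligibility edges are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

d = np.array([12.0, 12.0])          # demands d_i (e.g. 12M impressions each)
s = np.array([10.0, 10.0, 6.0])     # supplies s_j per region/pool
p = np.array([10.0, 20.0, 10.0])    # price p_j per unit in region j
# eligible[i][j] = 1 if demand i may be served from region j (targeting constraint)
eligible = np.array([[1, 1, 0],
                     [0, 1, 1]])

n_d, n_s = len(d), len(s)
# Decision variables x_ij flattened as x[i * n_s + j].
c = np.tile(p, n_d)                         # minimize sum_j p_j * sum_i x_ij

A_eq = np.zeros((n_d, n_d * n_s))           # meet every demand: sum_j x_ij = d_i
for i in range(n_d):
    A_eq[i, i * n_s:(i + 1) * n_s] = eligible[i]

A_ub = np.zeros((n_s, n_d * n_s))           # respect supply: sum_i x_ij <= s_j
for j in range(n_s):
    A_ub[j, j::n_s] = 1.0

bounds = [(0, None) if eligible[i, j] else (0, 0)
          for i in range(n_d) for j in range(n_s)]

res = linprog(c, A_ub=A_ub, b_ub=s, A_eq=A_eq, b_eq=d, bounds=bounds)
x = res.x.reshape(n_d, n_s)
print("allocation x_ij:\n", x)
print("value of unallocated inventory:", float(p @ (s - x.sum(axis=0))))
```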
48. Ads taxonomy
Online Ads
- Search pages: Sponsored search -- Targeting: keywords; Guarantees: NG (non-guaranteed); Model: CPC
- Web pages: Contextual -- Targeting: keywords; Guarantees: NG; Model: CPC
- Web pages: Banner -- Targeting: attributes; Guarantees: G (guaranteed) or NG; Model: CPM or CPM/CPC
49. Major trend: Ads convergence
Separate systems for contextual (CPC) and display (CPM) ads today
- Unified Ads marketplace
- Unify contextual and display
- Increase supply and demand
- Enable better matching
- CPC, CPM ads compete
- Advertiser creates demand; publisher creates the supply of pages
50. Research challenge
- Which ad to select between competing CPC and CPM ads?
- Use eCPM
- For CPM ads: eCPM = bid
- For CPC ads: eCPM = bid × Pr(click)
- Select the ad with max eCPM to maximize revenue
- Problem: the ad with the highest eCPM may not get selected
- eCPMs are estimated from historical data, which can differ from actual eCPMs
- Variance in estimated eCPMs is higher for CPC ads
- Selection gets biased towards ads with higher variance, as they have a higher probability of over-estimated eCPMs
(Figure: distributions of estimated eCPM around the actual eCPM for a CPC ad and a CPM ad; the CPC ad's estimate has the higher variance.)
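A tiny sketch of the eCPM comparison with illustrative numbers, following the slide's definitions (the scaling to "per 1000 impressions" is made explicit for the CPC ad):

```python
# eCPM-based selection between competing CPM and CPC ads (illustrative numbers).
cpm_ad = {"type": "CPM", "bid": 3.00}                      # eCPM = bid (per 1000 impressions)
cpc_ad = {"type": "CPC", "bid": 0.40, "p_click": 0.01}     # eCPM = bid * Pr(click), per 1000

def ecpm(ad):
    if ad["type"] == "CPM":
        return ad["bid"]
    return ad["bid"] * ad["p_click"] * 1000

winner = max([cpm_ad, cpc_ad], key=ecpm)
print(winner["type"], ecpm(winner))    # CPC ad wins: 0.40 * 0.01 * 1000 = 4.0 > 3.0
# Caveat from the slide: Pr(click) is estimated from history, so CPC eCPMs carry
# more variance and naive max-selection is biased toward over-estimated CPC ads.
```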
51. Cloud Computing
52. Much of the stuff we do is compute/data-intensive
- Search
- Index 100 billion crawled Web pages
- Build Web graph, compute PageRank
- Advertising
- Construct ML models to predict click probability
- Cluster, classify Web pages
- Improve search relevance, ad matching
- Data mining
- Analyze TBs of Web logs to compute correlations between (billions of) user profiles and page views
53. Solution: Cloud computing
- A cloud consists of
- 1000s of commodity machines (e.g., Linux PCs)
- Software layer for
- Distributing data across machines
- Parallelizing application execution across the cluster
- Detecting and recovering from failures
- Yahoo!'s software layer is based on Hadoop (open source)
54. Cloud computing benefits
- Enables processing of massive compute-intensive tasks
- Reduces computing and storage costs
- Resource sharing leads to efficient utilization
- Commodity hardware, open source
- Shields application developers from the complexity of building reliability and scalability into their programs
- In large clusters, machines fail every day
- Parallel programming is hard
55. Cloud computing at Yahoo!
- 10,000s of nodes running Hadoop, TBs of RAM, PBs of disk
- Multiple clusters; the largest is a 1600-node cluster
56. Hadoop's Map/Reduce Framework
- Framework for parallel computation over massive data sets on large clusters
- As an example, consider the problem of creating an index for word search
- Input: thousands of documents/web pages
- Output: a mapping of word to document IDs
57. Hadoop's Map/Reduce: Index example (contd.)
(Figure: input splits → Map tasks → sorted intermediate output → shuffle → Reduce tasks)
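A minimal in-process sketch of the word → document-IDs index example, mirroring the map, shuffle/sort, and reduce phases (plain Python for illustration; on Hadoop the same map and reduce functions would run as distributed tasks over input splits).

```python
from collections import defaultdict

docs = {
    "doc1": "cloud computing at yahoo",
    "doc2": "hadoop map reduce on the cloud",
}

# Map: emit (word, doc_id) for every word in every document.
def map_phase(doc_id, text):
    for word in text.split():
        yield word, doc_id

# Shuffle/sort: group all emitted values by key across mapper outputs.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

# Reduce: for each word, output the sorted set of document IDs containing it.
def reduce_phase(word, doc_ids):
    return word, sorted(set(doc_ids))

mapped = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = dict(reduce_phase(w, ids) for w, ids in shuffle(mapped))
print(index["cloud"])   # ['doc1', 'doc2']
```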
58. Research challenges
Data Distribution and Replication
(Figure: compute nodes in racks, Rack 1, Rack 2, ..., Rack i, ..., Rack n)
- Data blocks for a given job are distributed and replicated across nodes within a rack and across racks
- Challenges
- Optimize distribution to provide maximum locality
- Optimize replication to provide the best fault tolerance
Job Scheduling
Job Queues based on priorities and SLAs
- Challenges
- Schedule jobs to maximize resource utilization while preserving SLAs
- Schedule jobs to maximize data locality
- Performance modeling
(Figure: job queues with priority shares, e.g. SDS Q1 40%, YST Q2 35%, ..., ATG Qm 25%)
59. Summary
- The Internet is an exciting place; plenty of research is needed to improve
- User experience
- Monetization
- Scalability
- Search → information extraction, classification, ...
- Advertising → click prediction, ad placement, ...
- Cloud computing → job scheduling, performance modeling, ...
- Solving these problems will require techniques from multiple disciplines: ML, statistics, economics, algorithms, systems, ...