Title: Internet Research: What's hot in Search, Advertising, Cloud Computing
1. Internet Research: What's hot in Search, Advertising, Cloud Computing
Rajeev Rastogi, Yahoo! Labs Bangalore
2. The most visited site on the internet
- 600 million users per month
- Super popular properties
- News, finance, sports
- Answers, flickr, del.icio.us
- Mail, messaging
- Search
3. Unparalleled scale
- 25 terabytes of data collected each day
- Over 4 billion clicks every day
- Over 4 billion emails per day
- Over 6 billion instant messages per day
- Over 20 billion web documents indexed
- Over 4 billion images searchable
No other company on the planet processes as much
data as we do!
4. Yahoo! Labs Bangalore
- Focus is on basic and applied research
- Search
- Advertising
- Cloud computing
- University relations
- Faculty research grants
- Summer internships
- Sharing data/computing infrastructure
- Conference sponsorships
- PhD co-op program
5. Web Search
6. What does search look like today?
7. Search results of the future: Structured abstracts
Example sites with structured abstracts: yelp.com, Gawker, babycenter, New York Times, epicurious, LinkedIn, answers.com, webmd
8. Search results of the future: Query refinement
9. Search results of the future: Rich media
10. Technologies that are enabling search transformation
- Information extraction (structured abstracts)
- Web page classification (query refinement)
- Multimedia search (rich media)
11. Information extraction (IE)
- Goal: Extract structured records from Web pages
- Example attributes: Name, Category, Address, Map, Phone, Price, Reviews
12. Multiple verticals
- Business, social networking, video, ...
13. One schema per vertical
14. IE on the Web is a hard problem
- Web pages are noisy
- Pages belonging to different Web sites have different layouts
15. Web page types
Hand-crafted
16. Template-based pages
- Pages within a Web site are generated using scripts and have very similar structure
- Can be leveraged for extraction
- 30% of crawled Web pages
- Information rich, frequently appear in the top results of search queries
- E.g. search query "Chinese Mirch New York": 9 template-based pages in the top 10 results
17. Wrapper Induction
- Enables extraction from template-based pages
- Workflow: sample pages from a Web site → annotate the sample pages → learn wrappers (XPath rules) → apply the wrappers to the site's pages → extract records
18. Example
- Generalize XPath: /html/body/div/div/div/div/div/div/span → /html/body//div//span
19. Filters
- Apply filters to prune from multiple candidates that match the XPath expression
- XPath: /html/body//div//span
- Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
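To make the wrapper mechanics concrete, here is a minimal sketch (not the production system) of applying a learned rule: the generalized XPath selects candidate nodes and the regex filter prunes non-phone matches. The HTML snippet, XPath, and phone pattern are illustrative assumptions.

```python
# Minimal sketch: apply a learned wrapper (generalized XPath + regex filter)
# to a template-based page. The page, XPath, and regex are illustrative.
import re
from lxml import html

PHONE_XPATH = "/html/body//div//span"                 # generalized XPath rule
PHONE_REGEX = re.compile(r"\(\d{3}\) \d{3}-\d{4}")    # prunes non-phone candidates

def extract_phone(page_html: str):
    tree = html.fromstring(page_html)
    # All nodes matching the XPath are candidates; the regex filter keeps
    # only those whose text looks like a US phone number.
    for node in tree.xpath(PHONE_XPATH):
        text = (node.text_content() or "").strip()
        if PHONE_REGEX.fullmatch(text):
            return text
    return None

page = """<html><body><div><span>Chinese Mirch</span>
          <div><span>(212) 532-3663</span></div></div></body></html>"""
print(extract_phone(page))   # -> "(212) 532-3663"
```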
20. Limitations of wrappers
- Won't work across Web sites due to different page layouts
- Scaling to thousands of sites can be a challenge
- Need to learn a separate wrapper for each site
- Annotating example pages from thousands of sites can be time-consuming and expensive
21. Research challenge
- Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site
- Only annotate pages from a few sites initially as training data
22. Conditional Random Fields (CRFs)
- Models the conditional probability distribution of a label sequence y = y1, ..., yn given an input sequence x = x1, ..., xn
- f_k: features, λ_k: weights
- Choose λ_k to maximize the log-likelihood of the training data
- Use the Viterbi algorithm to compute the label sequence y with the highest probability
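The slide's notation (label sequence y, input sequence x, features f_k, weights λ_k) matches the standard linear-chain CRF; as a hedged reconstruction of the model being referred to, with Z(x) the usual partition function:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\Big(\sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y_{i-1}, y_i, x, i)\Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big(\sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y'_{i-1}, y'_i, x, i)\Big)
```

Training maximizes the log-likelihood of the annotated pages with respect to the λ_k; decoding uses Viterbi, as stated above.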
23. CRF-based IE
- Web pages can be viewed as labeled sequences
- Train a CRF using pages from a few Web sites
- Then use the trained CRF to extract from the remaining sites
24. Drawbacks of CRFs
- Require too many training examples
- Have previously been used to segment short strings with similar structure
- However, may not work too well across Web sites that
- contain long pages with lots of noise
- have very different structure
25. An alternate approach that exploits site knowledge
- Build attribute classifiers for each attribute
- Use pages from a few initial Web sites
- For each page from a new Web site:
- Segment the page into a sequence of fields (using static repeating text)
- Use attribute classifiers to assign attribute labels to fields
- Use constraints to disambiguate labels (see the sketch after this list)
- Uniqueness: an attribute occurs at most once in a page
- Proximity: attribute values appear close together in a page
- Structural: relative positions of attributes are identical across pages of a Web site
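As a toy illustration of only the uniqueness constraint (with made-up classifier scores, not the production algorithm), labels can be disambiguated by choosing the assignment of attributes to fields that maximizes total classifier confidence while using each attribute at most once:

```python
# Sketch: uniqueness-constrained disambiguation as a maximum-score matching.
# Classifier scores below are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import linear_sum_assignment

fields = ["21 Club", "American", "21 W 52nd St New York, NY 10019", "(212) 582 7200"]
attrs = ["Name", "Category", "Address", "Phone"]

# score[i][j] = classifier confidence that field i carries attribute j
score = np.array([
    [0.7, 0.4, 0.1, 0.0],   # "21 Club": plausible Name or Category
    [0.5, 0.8, 0.1, 0.0],   # "American": plausible Name or Category
    [0.0, 0.1, 0.9, 0.1],   # address-like field
    [0.0, 0.0, 0.1, 0.9],   # phone-like field
])

# Uniqueness: each attribute labels at most one field -> best one-to-one matching.
rows, cols = linear_sum_assignment(-score)   # negate to maximize total score
for i, j in zip(rows, cols):
    print(f"{attrs[j]:8s} <- {fields[i]}")
```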
26. Attribute classifiers + constraints example
Page 1 (Chinese Mirch): Name = Chinese Mirch; Category = Chinese, Indian; Address = 120 Lexington Avenue, New York, NY 10016; Phone = (212) 532 3663
Page 2 (Jewel of India): Name = Jewel of India; Category = Indian; Address = 15 W 44th St, New York, NY 10016; Phone = (212) 869 5544
Page 3 (21 Club), classifier output ambiguous: "21 Club" = {Category, Name}; "American" = {Name, Noise}; Address = 21 W 52nd St, New York, NY 10019; Phone = (212) 582 7200
Applying the uniqueness constraint on Name and the precedence constraint Name < Category resolves Page 3 to: Name = 21 Club; Category = American; Address = 21 W 52nd St, New York, NY 10019; Phone = (212) 582 7200
27. Other IE scenarios: Browse page extraction
- Similar-structured records
28. IE big picture/taxonomy
- Things to extract from
- Template-based, browse, hand-crafted pages, text
- Things to extract
- Records, tables, lists, named entities
- Techniques used
- Structure-based (HTML tags, DOM tree paths), e.g. wrappers
- Content-based (attribute values/models), e.g. dictionaries
- Structure + content (sequential/hierarchical relationships among attribute values), e.g. hierarchical CRFs
- Level of automation
- Manual, supervised, unsupervised
29. Web Page Classification: Requirements
- Quality
- High Precision and Recall
- Leverage structured input (links, co-citations) and output (taxonomy)
- Scalability
- Large numbers of training examples, features, and classes
- Complex structured input and output
- Cost
- Small human effort (for labeling of pages)
- Compact classifier model
- Low prediction time
30. Structured Output Learning
- Structured Output Examples
- Multi-class
- Taxonomy
- Naïve approach
- Separate binary classifier per class
- Separate classifier for each taxonomy level
- Better approach: single (SVM) classifier
- Higher accuracy, more efficient
- Sequential Dual Method (SDM)
- Visit each example sequentially and solve the associated QP problem (in the dual) efficiently
- Order of magnitude faster
Example taxonomy: Health (Fitness, Medicine); Sport (Cricket (One-day, Test), Soccer)
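A small scikit-learn sketch of the contrast the slide draws, on synthetic data standing in for labeled Web pages. This is illustrative only and does not implement the Sequential Dual Method itself.

```python
# Naive approach (one binary classifier per class) vs. a single multi-class SVM
# trained on one joint objective (Crammer-Singer). Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy multi-class data standing in for labeled Web pages.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=20,
                           n_classes=5, random_state=0)

# Naive approach: a separate binary classifier per class (one-vs-rest).
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

# Better approach: a single multi-class SVM with one joint objective.
single = LinearSVC(multi_class="crammer_singer").fit(X, y)

print("one-vs-rest train accuracy:      ", ovr.score(X, y))
print("single multi-class SVM accuracy: ", single.score(X, y))
```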
31. Classification with Relational Information
- Relational information
- Web page links, structural similarity
- Graph representation
- Pages as nodes (with labels)
- Edge weights s(j,k): page similarity, out-link/co-citation existence, etc.
- Classification can be expressed as an optimization problem (a typical formulation is sketched below)
(Figure: a page graph with out-link, co-citation, and similar-structure edges)
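The slide does not spell out the objective itself; a common graph-regularized formulation consistent with the edge weights s(j,k) above fits the known labels while penalizing disagreement between strongly connected or similar pages (here ℓ is a loss against observed labels ŷ_j and μ is an assumed trade-off parameter):

```latex
\min_{y}\; \sum_{j \in \text{labeled}} \ell\big(y_j, \hat{y}_j\big)
\;+\; \mu \sum_{(j,k) \in E} s(j,k)\,\lVert y_j - y_k \rVert^2
```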
32. Multimedia Search
- Availability and consumption of multimedia content on the Internet is increasing
- 500 billion images will be captured in 2010
- Leveraging content and metadata is important for MM search
- Some big technical challenges are
- Results diversity
- Relevance
- Image classification, e.g., pornography
33. Near-Duplicate Detection
- Multiple near-similar versions of an image exist on the internet
- scaled, cropped, captioned, small scene change, etc.
- Near-duplicates adversely impact user experience
- Can we use a compact description and dedup in constant time?
- Fourier-Mellin Transform (FMT): translation, rotation, and scale invariant
- Signature generation using a small number of low-frequency coefficients of the FMT (sketch below)
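A minimal numpy/scipy sketch of an FMT-style signature under the usual construction: FFT magnitude for translation invariance, log-polar resampling so rotation and scale become shifts, a second FFT magnitude to cancel those shifts, then a handful of low-frequency coefficients. Grid sizes, coefficient counts, and the similarity test are illustrative assumptions, not the production pipeline.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def fmt_signature(img, n_coeffs=32):
    # 1. Translation invariance: magnitude of the 2D FFT.
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = spectrum.shape
    cy, cx = h / 2.0, w / 2.0
    # 2. Resample the magnitude spectrum on a log-polar grid:
    #    rotation -> shift in angle, scaling -> shift in log-radius.
    n_r, n_t = 64, 64
    log_r = np.linspace(0.0, np.log(min(cy, cx)), n_r)
    theta = np.linspace(0.0, 2 * np.pi, n_t, endpoint=False)
    rr, tt = np.meshgrid(np.exp(log_r), theta, indexing="ij")
    ys, xs = cy + rr * np.sin(tt), cx + rr * np.cos(tt)
    logpolar = map_coordinates(spectrum, [ys, xs], order=1, mode="nearest")
    # 3. Rotation/scale invariance: magnitude of the FFT of the log-polar image.
    fmt = np.abs(np.fft.fft2(logpolar))
    # 4. Signature: a small number of low-frequency coefficients, normalized.
    sig = fmt[:8, :8].flatten()[:n_coeffs]
    return sig / (np.linalg.norm(sig) + 1e-12)

img = np.random.rand(128, 128)
s1, s2 = fmt_signature(img), fmt_signature(np.rot90(img))   # 90-degree rotated copy
print(float(np.dot(s1, s2)))   # close to 1.0 for near-duplicates
```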
34. Filtering noisy tags to improve relevance
- Measures such as IDF may assign high weights to noisy tags
- Treat tag sets as bag-of-words, a random collection of terms
- Boosting the weights of tags based on their co-occurrence with other tags can filter out noise (toy sketch below)
(Figure: tag weights under IDF vs. co-occurrence boosting)
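A toy sketch of the idea with a hypothetical photo/tag corpus: IDF alone would rank the rare noisy tag highest, while boosting by co-occurrence with the photo's other tags pushes its weight down.

```python
# Illustrative only: boost each tag's IDF weight by how often it co-occurs
# with the photo's other tags elsewhere in the corpus.
import math

photos = [
    ["beach", "sunset", "sea"],
    ["beach", "sea", "sand"],
    ["sunset", "sea", "xyz123"],   # "xyz123" is a rare, noisy tag with a high IDF
]

n = len(photos)
df = {}
for tags in photos:
    for t in set(tags):
        df[t] = df.get(t, 0) + 1
idf = {t: math.log(n / df[t]) for t in df}

def cooccur_elsewhere(photo_idx, a, b):
    # How often tags a and b appear together in *other* photos.
    return sum(1 for i, tags in enumerate(photos)
               if i != photo_idx and a in tags and b in tags)

def boosted_weights(photo_idx):
    tags = photos[photo_idx]
    weights = {}
    for t in tags:
        others = [o for o in tags if o != t]
        boost = sum(cooccur_elsewhere(photo_idx, t, o) for o in others) / max(len(others), 1)
        weights[t] = idf[t] * boost
    return weights

print(boosted_weights(2))
# IDF alone would rank "xyz123" highest; co-occurrence boosting drives its
# weight to 0 because it never co-occurs with the photo's other tags elsewhere.
```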
35. Online Advertising
36. Sponsored search ads
Search query
Ad
37. How it works
Ad Index
Advertiser
Sponsored search engine
- Engine decides when/where to show this ad on the search results page
- Advertiser pays only if the user clicks on the ad
38. Ad selection criterion
- Problem: which ads to show from among the ads containing the keyword?
- Ads with the highest bid may not maximize revenue
- Choose ads with maximum expected revenue
- Weigh the bid amount by the click probability (tiny sketch below)
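A tiny sketch of the selection rule with illustrative numbers: ranking by bid alone would pick ad A, while expected revenue (bid × click probability) picks ad B.

```python
# Rank ads by bid * estimated click probability instead of by bid alone.
ads = [
    {"name": "A", "bid": 2.00, "p_click": 0.01},
    {"name": "B", "bid": 0.50, "p_click": 0.08},   # lower bid, much higher CTR
]
ranked = sorted(ads, key=lambda ad: ad["bid"] * ad["p_click"], reverse=True)
for ad in ranked:
    print(ad["name"], "expected revenue per impression =", ad["bid"] * ad["p_click"])
# Ad B wins despite the lower bid (0.04 vs. 0.02 expected revenue).
```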
39. Contextual Advertising
Ads
40. Contextual ads
- Similar to sponsored search, but now ads are shown on general Web pages as opposed to only search pages
- Advertisers bid on keywords
- Advertiser pays only if the user clicks; Y! and the publisher share the paid amount
- Ad matching engine ranks ads based on expected revenue (bid amount × click probability)
41. Estimating click probability
- Use a logistic regression model
- p(click | ad, page, user)
- f_i: ith feature for (ad, page, user)
- w_i: weight for feature f_i
- Training data: ad click logs (all clicks + non-click samples)
- Optimize the log-likelihood to learn the weights
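The slide lists the ingredients but not the formula; the standard logistic regression form implied by the f_i / w_i notation, together with the log-likelihood objective it mentions (y ∈ {0,1} indicates a click, p(x) the predicted probability for example x), is:

```latex
p(\text{click} \mid \text{ad}, \text{page}, \text{user})
  = \frac{1}{1 + \exp\!\big(-\sum_i w_i\, f_i(\text{ad}, \text{page}, \text{user})\big)},
\qquad
\max_{w}\ \sum_{(x,\, y)\, \in\, \text{logs}} \Big[\, y \log p(x) + (1 - y)\log\big(1 - p(x)\big) \Big]
```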
42. Features
- Ad: bid terms, title, body, category, ...
- Page: url, title, keywords in body, category, ...
- User
- Geographic (location, time)
- Demographic (age, gender)
- Behavioral
- Combine the above to get (billions of) richer features
- E.g. (apple in ad title) AND (ipod in page body) AND (20 < user age < 30)
- Select the subset that leads to an improvement in likelihood (hashing sketch below)
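A short sketch of how crossed (ad × page × user) features might be materialized; since the cross product runs into billions of features, a fixed-size hashed index space is a common trick. All names and sizes here are assumptions for illustration.

```python
# Cross features with feature hashing (illustrative, not the production pipeline).
import hashlib

NUM_BUCKETS = 2 ** 24   # fixed-size hashed feature space

def feature_index(*parts):
    key = "|".join(parts).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_BUCKETS

ad_title_terms = ["apple", "ipod"]
page_body_terms = ["ipod", "music"]
user_age_bucket = "age_20_30"

# Cross features: (ad term) x (page term) x (user age bucket)
crossed = [feature_index("ad_title:" + a, "page_body:" + p, "user:" + user_age_bucket)
           for a in ad_title_terms for p in page_body_terms]
print(crossed)
```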
43. Banner ads
- Show Web page with display ads
- Creates brand awareness
44. How it works
Ad Index
Advertiser: "I want 1M impressions on finance.yahoo.com, gender = male, age 20-30, during the month of April 2009"
Banner Ad Engine
- Engine guarantees 1M impressions
- Advertiser pays a fixed price
- No dependence on clicks
- Engine does admission control, decides allocation of ads to pages
45. Allocation Example
(Figure: two allocations of the demands (Gender=Male, 12M) and (Age>30, 12M) over supply pools with (Qty, Price) entries such as (10M, 10), (10M, 20), (6M, 10), (6M, 20). A suboptimal allocation leaves unallocated inventory worth 60M; the optimal allocation leaves unallocated inventory worth 120M.)
46. Research problem
- Goal: Allocate demands so that the value of unallocated inventory is maximized
- Similar to the transportation problem
47. Transportation problem
(Figure: a bipartite graph with demand nodes 1, 2, ..., i on one side, each with demand d_i, and supply/region nodes 1, 2, ..., j on the other, each with supply s_j and price p_j; demand i has edges to its eligible regions R_i, and x_ij denotes the units of demand i allocated to region j.)
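A hedged sketch of the allocation as a transportation-style linear program using scipy: every demand d_i must be met from its eligible regions, each region j has supply s_j and price p_j, and maximizing the value of unallocated inventory is equivalent to minimizing the price-weighted inventory consumed. The numbers and eligibility edges are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

d = np.array([12.0, 12.0])          # demands d_i (e.g. 12M impressions each)
s = np.array([10.0, 10.0, 6.0])     # supplies s_j per region/pool
p = np.array([10.0, 20.0, 10.0])    # price p_j per unit in region j
# eligible[i][j] = 1 if demand i may be served from region j (targeting constraint)
eligible = np.array([[1, 1, 0],
                     [0, 1, 1]])

n_d, n_s = len(d), len(s)
# Decision variables x_ij flattened as x[i * n_s + j].
c = np.tile(p, n_d)                         # minimize sum_j p_j * sum_i x_ij

A_eq = np.zeros((n_d, n_d * n_s))           # meet every demand: sum_j x_ij = d_i
for i in range(n_d):
    A_eq[i, i * n_s:(i + 1) * n_s] = eligible[i]

A_ub = np.zeros((n_s, n_d * n_s))           # respect supply: sum_i x_ij <= s_j
for j in range(n_s):
    A_ub[j, j::n_s] = 1.0

bounds = [(0, None) if eligible[i, j] else (0, 0)
          for i in range(n_d) for j in range(n_s)]

res = linprog(c, A_ub=A_ub, b_ub=s, A_eq=A_eq, b_eq=d, bounds=bounds)
x = res.x.reshape(n_d, n_s)
print("allocation x_ij:\n", x)
print("value of unallocated inventory:", float(p @ (s - x.sum(axis=0))))
```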
48. Ads taxonomy
Online Ads
- Search pages: Sponsored search -- Targeting: keywords; Guarantees: NG (non-guaranteed); Model: CPC
- Web pages: Contextual -- Targeting: keywords; Guarantees: NG; Model: CPC
- Web pages: Banner -- Targeting: attributes; Guarantees: G (guaranteed) or NG; Model: CPM or CPM/CPC
49. Major trend: Ads convergence
Separate systems for contextual (CPC) and display (CPM) ads today
- Unified Ads marketplace
- Unify contextual and display
- Increase supply and demand
- Enable better matching
- CPC, CPM ads compete
- Advertiser creates demand; publisher creates the supply of pages
50. Research challenge
- Which ad to select between competing CPC and CPM ads?
- Use eCPM
- For CPM ads: eCPM = bid
- For CPC ads: eCPM = bid × Pr(click)
- Select the ad with max eCPM to maximize revenue
- Problem: the ad with the highest eCPM may not get selected
- eCPMs are estimated from historical data, which can differ from actual eCPMs
- Variance in estimated eCPMs is higher for CPC ads
- Selection gets biased towards ads with higher variance, as they have a higher probability of over-estimated eCPMs
(Figure: distributions of estimated eCPM around the actual eCPM for a CPC ad and a CPM ad; the CPC ad's estimate has the higher variance.)
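A tiny sketch of the eCPM comparison with illustrative numbers, following the slide's definitions (the scaling to "per 1000 impressions" is made explicit for the CPC ad):

```python
# eCPM-based selection between competing CPM and CPC ads (illustrative numbers).
cpm_ad = {"type": "CPM", "bid": 3.00}                      # eCPM = bid (per 1000 impressions)
cpc_ad = {"type": "CPC", "bid": 0.40, "p_click": 0.01}     # eCPM = bid * Pr(click), per 1000

def ecpm(ad):
    if ad["type"] == "CPM":
        return ad["bid"]
    return ad["bid"] * ad["p_click"] * 1000

winner = max([cpm_ad, cpc_ad], key=ecpm)
print(winner["type"], ecpm(winner))    # CPC ad wins: 0.40 * 0.01 * 1000 = 4.0 > 3.0
# Caveat from the slide: Pr(click) is estimated from history, so CPC eCPMs carry
# more variance and naive max-selection is biased toward over-estimated CPC ads.
```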
51. Cloud Computing
52. Much of the stuff we do is compute/data-intensive
- Search
- Index 100 billion crawled Web pages
- Build Web graph, compute PageRank
- Advertising
- Construct ML models to predict click probability
- Cluster, classify Web pages
- Improve search relevance, ad matching
- Data mining
- Analyze TBs of Web logs to compute correlations between (billions of) user profiles and page views
53. Solution: Cloud computing
- A cloud consists of
- 1000s of commodity machines (e.g., Linux PCs)
- Software layer for
- Distributing data across machines
- Parallelizing application execution across the cluster
- Detecting and recovering from failures
- Yahoo!'s software layer is based on Hadoop (open source)
54. Cloud computing benefits
- Enables processing of massive compute-intensive tasks
- Reduces computing and storage costs
- Resource sharing leads to efficient utilization
- Commodity hardware, open source
- Shields application developers from the complexity of building reliability and scalability into their programs
- In large clusters, machines fail every day
- Parallel programming is hard
55. Cloud computing at Yahoo!
- 10,000s of nodes running Hadoop, TBs of RAM, PBs of disk
- Multiple clusters; the largest is a 1600-node cluster
56. Hadoop's Map/Reduce Framework
- Framework for parallel computation over massive data sets on large clusters
- As an example, consider the problem of creating an index for word search
- Input: thousands of documents/web pages
- Output: a mapping of word to document IDs
57. Hadoop's Map/Reduce: Index example (contd.)
(Figure: input splits → Map tasks → sorted intermediate output → shuffle → Reduce tasks)
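A minimal in-process sketch of the word → document-IDs index example, mirroring the map, shuffle/sort, and reduce phases (plain Python for illustration; on Hadoop the same map and reduce functions would run as distributed tasks over input splits).

```python
from collections import defaultdict

docs = {
    "doc1": "cloud computing at yahoo",
    "doc2": "hadoop map reduce on the cloud",
}

# Map: emit (word, doc_id) for every word in every document.
def map_phase(doc_id, text):
    for word in text.split():
        yield word, doc_id

# Shuffle/sort: group all emitted values by key across mapper outputs.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

# Reduce: for each word, output the sorted set of document IDs containing it.
def reduce_phase(word, doc_ids):
    return word, sorted(set(doc_ids))

mapped = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = dict(reduce_phase(w, ids) for w, ids in shuffle(mapped))
print(index["cloud"])   # ['doc1', 'doc2']
```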
58. Research challenges
Data Distribution and Replication
(Figure: compute nodes in racks, Rack 1, Rack 2, ..., Rack i, ..., Rack n)
- Data blocks for a given job are distributed and replicated across nodes within a rack and across racks
- Challenges
- Optimize distribution to provide maximum locality
- Optimize replication to provide the best fault tolerance
Job Scheduling
Job Queues based on priorities and SLAs
- Challenges
- Schedule jobs to maximize resource utilization while preserving SLAs
- Schedule jobs to maximize data locality
- Performance modeling
(Figure: job queues with priority shares, e.g. SDS Q1 40%, YST Q2 35%, ..., ATG Qm 25%)
59. Summary
- The Internet is an exciting place; plenty of research is needed to improve
- User experience
- Monetization
- Scalability
- Search → information extraction, classification, ...
- Advertising → click prediction, ad placement, ...
- Cloud computing → job scheduling, performance modeling, ...
- Solving these problems will require techniques from multiple disciplines: ML, statistics, economics, algorithms, systems, ...