Hypertext Databases and Data Mining SIGMOD 1999 Tutorial

Slides: 51

Provided by: soumencha

Transcript and Presenter's Notes


1
Hypertext Databases and Data Mining (SIGMOD 1999 Tutorial)

Soumen Chakrabarti
Indian Institute of Technology Bombay
http://www.cse.iitb.ernet.in/~soumen
http://www.cs.berkeley.edu/~soumen
soumen@cse.iitb.ernet.in

2
The Web

350 million static HTML pages, 2 terabytes
0.8 to 1 million new pages created per day
600 GB of pages change per month
Average page changes in a few weeks
Average page has about ten links
Increasing volume of active pages and views
Boundaries between repositories blurred
Bigger than the sum of its parts

3
Hypertext databases

Academia
Digital library, web publication
Consumer
Newsgroups, communities, product reviews
Industry and organizations
Health care, customer service
Corporate email

4
What to expect

Write in decimal the exact circumference of a
circle of radius one inch
Is the distance between Tokyo and Rome more than
6000 miles?
What is the distance between Tokyo and Rome?
java
java coffee -applet
uninterrupt power suppl ups -parcel

5
Search products and services

Verity
Fulcrum
PLS
Oracle text extender
DB2 text extender
Infoseek Intranet
SMART (academic)
Glimpse (academic)

Inktomi (HotBot)
Alta Vista
Google!
Yahoo!
Infoseek Internet
Lycos
Excite

6
[Overview diagram. Labels: FTP, Gopher, HTML, XML, local data, more structure; crawling, indexing and search, WebSQL, WebL; relevance ranking, latent semantic indexing, social network of hyperlinks; clustering, Scatter-Gather, collaborative filtering, web communities; topic distillation, topic directories, automatic classification, semi-supervised learning, focused crawling, user profiling; web servers, web browsers; monitor, mine, modify]
7
Basic indexing and search
8
Keyword indexing

Boolean search
care AND NOT old
Stemming
gain*
Phrases and proximity
"new care"
loss <NEAR/5> care
<SENTENCE>

D1: My care is loss of care with old care done
D2: Your care is gain of care with new care won

Inverted index (term: document and word positions, counted from 0)
care: D1 (1, 5, 8); D2 (1, 5, 8)
new: D2 (7)
old: D1 (7)
loss: D1 (3)
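
The posting lists above can be rebuilt and queried with a few lines of Python. This is a minimal sketch, not the tutorial's implementation: the function names (`build_index`, `boolean_and_not`, `near`, `phrase`) are made up for illustration, tokenization is a plain whitespace split, and positions are counted from 0 to match the slide.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]}, with positions counted from 0."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and_not(index, include, exclude):
    """Documents that contain `include` but not `exclude`."""
    return set(index.get(include, {})) - set(index.get(exclude, {}))

def near(index, first, second, window):
    """Documents where the two terms occur fewer than `window` positions apart."""
    hits = set()
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = index.get(second, {}).get(doc_id, [])
        if any(abs(a - b) < window for a in first_positions for b in second_positions):
            hits.add(doc_id)
    return hits

def phrase(index, first, second):
    """Documents where `second` immediately follows `first`."""
    hits = set()
    for doc_id, first_positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(a + 1 in following for a in first_positions):
            hits.add(doc_id)
    return hits

docs = {
    "D1": "My care is loss of care with old care done",
    "D2": "Your care is gain of care with new care won",
}
index = build_index(docs)
print(index["care"])                           # posting list for "care"
print(boolean_and_not(index, "care", "old"))   # care AND NOT old
print(phrase(index, "new", "care"))            # the phrase "new care"
print(near(index, "loss", "care", 5))          # loss within 5 positions of care
```

Keeping positions in the posting list is what makes phrase and proximity queries possible; a pure Boolean index would only need the set of document ids per term.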
9
Tables and queries 1
POSTING(did, tid, pos)

select distinct did from POSTING where tid = 'care'
except
select distinct did from POSTING where tid like 'gain%'

with
  TPOS1(did, pos) as (select did, pos from POSTING where tid = 'new'),
  TPOS2(did, pos) as (select did, pos from POSTING where tid = 'care')
select distinct TPOS1.did from TPOS1, TPOS2
where TPOS1.did = TPOS2.did and proximity(TPOS1.pos, TPOS2.pos)

proximity(a, b) ::= a + 1 = b (phrase) | abs(a - b) < 5 (NEAR/5)
10
Relevance ranking

Recall = coverage
What fraction of relevant documents were reported
Precision = accuracy
What fraction of reported documents were relevant
Trade-off

[Figure: a query is sent to search, which returns an output sequence; each prefix of length k is compared with the true response]
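
The "consider prefix k" idea can be made concrete as follows. This is a small sketch under stated assumptions: the function name `precision_recall_at_k` and the document ids are hypothetical, and the true response is taken to be a known set of relevant ids.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall of the first k entries of a ranked output sequence."""
    prefix = ranked[:k]
    hits = sum(1 for doc in prefix if doc in relevant)
    precision = hits / len(prefix) if prefix else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

ranked = ["d3", "d7", "d1", "d9", "d4"]   # hypothetical output sequence
relevant = {"d1", "d3", "d5"}             # hypothetical true response

for k in range(1, len(ranked) + 1):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(k, round(p, 3), round(r, 3))
```

Walking k from 1 upward shows the trade-off: recall can only grow with k, while precision tends to fall as more non-relevant documents enter the prefix.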
11
Vector space model and TFIDF

Some words are more important than others
W.r.t. a document collection $D$:
$D^+$ have a term, $D^-$ do not
Inverse document frequency
Term frequency (TF)
Many variants
Probabilistic models

12
Tables and queries 2
VECTOR(did, tid, elem) ::=
with
  TEXT(did, tid, freq) as
    (select did, tid, count(distinct pos) from POSTING group by did, tid),
  LENGTH(did, len) as
    (select did, sum(freq) from TEXT group by did),
  DOCFREQ(tid, df) as
    (select tid, count(distinct did) from TEXT group by tid)
select TEXT.did, TEXT.tid,
       (freq / len) * (1 + log((select count(distinct did) from POSTING) / df))
from TEXT, LENGTH, DOCFREQ
where TEXT.did = LENGTH.did and TEXT.tid = DOCFREQ.tid
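
The same weighting can be sketched in Python. This assumes the reading of the query in which each vector element is (freq / len) times (1 + log(N / df)), where N is the number of documents; the function name `tfidf_vectors` and the whitespace tokenization are illustrative choices, and the slide itself notes that many variants exist.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return {did: {tid: elem}} with elem = (freq / len) * (1 + log(N / df))."""
    counts = {did: Counter(text.lower().split()) for did, text in docs.items()}
    n_docs = len(counts)
    # df: number of documents in which each term occurs at least once
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    vectors = {}
    for did, term_counts in counts.items():
        length = sum(term_counts.values())  # document length in tokens
        vectors[did] = {
            tid: (freq / length) * (1 + math.log(n_docs / doc_freq[tid]))
            for tid, freq in term_counts.items()
        }
    return vectors

docs = {
    "D1": "My care is loss of care with old care done",
    "D2": "Your care is gain of care with new care won",
}
vectors = tfidf_vectors(docs)
print(vectors["D1"]["care"], vectors["D1"]["old"])
```

In this two-document example "care" occurs in both documents, so its inverse document frequency factor is exactly 1, while "old" occurs in one document and receives the larger factor 1 + log 2.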
13
Relevance ranking
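select did, cosine(did, query)
from corpus
where candidate(did, query)
order by cosine(did, query) desc
fetch first k rows only

[Figure: a query with the terms "auto" and "car" scored against a matrix; surviving labels: now, query, D, A, T]
Find largest k columns of [matrix expression lost in extraction]
Exact computation: $O(n^2)$
All entries above the mean can be estimated with error $\epsilon$ within $O(n\epsilon^{-2})$ time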
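
A brute-force version of the ranking query looks like this in Python. It is only a sketch: `cosine` and `top_k` are hypothetical names, the candidate test is assumed to be "shares at least one term with the query", and the toy vectors are invented. It computes exact scores and does not attempt the approximate estimation mentioned on the slide.

```python
import heapq
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(corpus, query, k):
    """Score candidate documents and keep the k highest cosine scores."""
    candidates = (
        (did, cosine(vec, query))
        for did, vec in corpus.items()
        if any(term in vec for term in query)  # assumed candidate(did, query) test
    )
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

corpus = {  # hypothetical sparse document vectors
    "d1": {"auto": 1.0, "repair": 1.0},
    "d2": {"car": 2.0},
    "d3": {"coffee": 1.0},
}
query = {"auto": 1.0, "car": 1.0}
print(top_k(corpus, query, 2))
```

Using `heapq.nlargest` avoids sorting every candidate when only the first k rows are wanted, which mirrors the `fetch first k rows only` clause.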
14
Similarity and clustering
15
Clustering

Given an unlabeled collection of documents,
induce a taxonomy based on similarity
Need document similarity measure
Distance between normalized document vectors
Cosine of angle between document vectors
Top-down clustering is difficult because of huge
number of noisy dimensions
k-means, expectation maximization
Quadratic-time bottom-up clustering

16
Document model

Vocabulary $V$, term $w_i$, document $\alpha$ represented by $c(\alpha) = \big(f(w_i, \alpha)\big)_{w_i \in V}$
$f(w_i, \alpha)$ is the number of times $w_i$ occurs in document $\alpha$
Most $f$'s are zeroes for a single document
Monotone component-wise damping function $g$ such as log or square-root

17
Similarity
Normalized document profile
Profile for document group $\Gamma$
18
Group average clustering 1

Initially G is a collection of singleton groups,
each with one document
Repeat
Find $\Gamma$, $\Delta$ in $G$ with max $s(\Gamma \cup \Delta)$
Merge group $\Gamma$ with group $\Delta$
For each $\Gamma$ keep track of best $\Delta$
$O(n^2)$ algorithm

19
Group average clustering 2
Un-normalized group profile
Can show
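
The formulas on these two slides did not survive extraction, so the following is a sketch under explicit assumptions rather than a reconstruction of the slide. It assumes unit-length document profiles, takes the un-normalized group profile to be the sum of the member profiles, and uses the standard identity that the average pairwise cosine within a group of size n equals (sum·sum - n) / (n(n - 1)). All function names are hypothetical, and the pair search is the naive one, so this version is cubic; the quadratic bound on the slide needs the "keep track of best partner" bookkeeping, which is omitted here.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def avg_similarity(profile_sum, size):
    """Average pairwise cosine inside a group, computed from the summed profile."""
    return (dot(profile_sum, profile_sum) - size) / (size * (size - 1))

def group_average_clustering(profiles, k):
    """Greedy bottom-up merging of unit-length profiles until k groups remain."""
    # Each group is (member indices, un-normalized profile = sum of member profiles).
    groups = [([i], list(p)) for i, p in enumerate(profiles)]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                merged_sum = [a + b for a, b in zip(groups[i][1], groups[j][1])]
                size = len(groups[i][0]) + len(groups[j][0])
                score = avg_similarity(merged_sum, size)
                if best is None or score > best[0]:
                    best = (score, i, j, merged_sum)
        _, i, j, merged_sum = best
        merged = (groups[i][0] + groups[j][0], merged_sum)
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)] + [merged]
    return [sorted(members) for members, _ in groups]

profiles = [normalize(v) for v in [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]]
print(group_average_clustering(profiles, 2))
```

The point of carrying the summed profile is that the quality of a candidate merge is available from one vector addition and one dot product, without revisiting every pair of documents in the two groups.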
20
Rectangular time algorithm: Buckshot

Randomly sample $\sqrt{kn}$ documents
Run group average clustering algorithm to reduce to k groups or clusters
Iterate assign-to-nearest O(1) times
Move each document $\alpha$ to cluster $\Gamma$ with max $s(\alpha, \Gamma)$
Total time taken is $O(kn)$
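
The steps above can be sketched as follows. Several details are assumptions made for the example: the sample size is taken as the square root of kn (the value that makes a quadratic clustering of the sample cost O(kn)), profiles are unit-length vectors with non-negative entries, similarity to a cluster is the dot product with its normalized centroid, and the number of assign-to-nearest rounds defaults to 2. The names `buckshot`, `group_average`, and the `rng` parameter are hypothetical, and the inner clustering uses a naive pair search for brevity.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def vector_sum(vectors):
    return [sum(column) for column in zip(*vectors)]

def group_average(profiles, k):
    """Naive bottom-up group average clustering; returns lists of indices."""
    groups = [[i] for i in range(len(profiles))]

    def score(members):
        total = vector_sum([profiles[i] for i in members])
        n = len(members)
        return (dot(total, total) - n) / (n * (n - 1))

    while len(groups) > k:
        pairs = [(i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))]
        i, j = max(pairs, key=lambda p: score(groups[p[0]] + groups[p[1]]))
        merged = groups[i] + groups[j]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)] + [merged]
    return groups

def buckshot(profiles, k, rounds=2, rng=random):
    """Cluster a random sample, then assign every document to its nearest centroid."""
    n = len(profiles)
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = rng.sample(range(n), sample_size)
    seeds = group_average([profiles[i] for i in sample], k)
    centroids = [unit(vector_sum([profiles[sample[i]] for i in g])) for g in seeds]
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for doc, profile in enumerate(profiles):
            nearest = max(range(len(centroids)), key=lambda c: dot(profile, centroids[c]))
            clusters[nearest].append(doc)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        centroids = [
            unit(vector_sum([profiles[d] for d in members])) if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    return clusters

raw = [[1, 0], [1, 0.1], [1, 0.2], [0.9, 0.1], [0, 1], [0.1, 1], [0.2, 1], [0.1, 0.9]]
documents = [unit(v) for v in raw]
print(buckshot(documents, 2))
```

Because the seed clusters come from a random sample, different runs can return different partitions; passing a seeded `random.Random` instance as `rng` makes a run repeatable.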

21
Extended similarity

auto and car co-occur often
Therefore they must be related
Documents having related words are related
Useful for search and clustering
Two basic approaches
Hand-made thesaurus (WordNet)
Co-occurrence and associations

[Figure: documents in which "auto" and "car" occur together, suggesting that car and auto are related]
22
Latent semantic indexing
[Figure: singular value decomposition (SVD) of the term-by-document matrix A (t terms, d documents) into factors U, D, and V of rank r]
23
Collaborative recommendation

People = record, movies = features; cluster people
Both people and features can be clustered
For hypertext access, time of access is a feature
Need advanced models

24
A model for collaboration

People and movies belong to unknown classes
$P_k$ = probability a random person is in class $k$
$P_l$ = probability a random movie is in class $l$
$P_{kl}$ = probability of a class-$k$ person liking a class-$l$ movie
Gibbs sampling: iterate
Pick a person or movie at random and assign to a class with probability proportional to $P_k$ or $P_l$
Estimate new parameters

25
Supervised learning
26
Supervised learning (classification)

Many forms
Content: automatically organize the web per Yahoo!
Type: faculty, student, staff
Intent: education, discussion, comparison, advertisement
Applications
Relevance feedback for re-scoring query responses
Filtering news, email, etc.
Narrowing searches and selective data acquisition

27
Difficulties

Dimensionality
Decision tree classifiers: dozens of columns
Vector space model: 50,000 columns
Context-dependent noise
Can (v.) considered a stopword
Can (n.) may not be a stopword in /Yahoo/SocietyCulture/Environment/Recycling

28
More difficulties

Need for scalability
High dimension needs more data to learn
Class labels are from a hierarchy
All documents belong to the root node
Highest probability leaf may have low confidence

29
Techniques

Nearest neighbor
Standard keyword index also supports
classification
How to define similarity? (TFIDF may not work)
Wastes space by storing individual document info
Rule-based, decision-tree based
Very slow to train (but quick to test)
Good accuracy (but brittle rules)
Model-based
Fast training and testing with small footprint

30
More document models

Boolean vector (word counts ignored)
Toss one coin for each term in the universe
Bag of words (multinomial)
Repeatedly toss coin with a term on each face
Limited dependence models
Bayesian network where each feature has at most k
features as parents
Maximum entropy estimation

31
Bag-of-words

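Decide topic: topic $c$ is picked with prior probability $\pi(c)$, $\sum_c \pi(c) = 1$
Each topic $c$ has parameters $\theta(c,t)$ for terms $t$
Coin with face probabilities $\theta(c,t)$, $\sum_t \theta(c,t) = 1$
Fix document length and keep tossing coin
Given $c$, the probability of a document $d$ of length $\ell$ with $n(d,t)$ occurrences of term $t$ is $\binom{\ell}{\{n(d,t)\}} \prod_t \theta(c,t)^{n(d,t)}$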
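
A classifier built on this bag-of-words model can be sketched in a few lines. The Greek symbols on the slide were lost in extraction, so the code simply calls the topic prior `prior` and the per-topic term probabilities `theta`. Add-one smoothing is an assumption added here so that unseen term counts do not produce zero probabilities; the slide does not say which estimator is used. The function names and the training examples are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Estimate topic priors and per-topic term probabilities (add-one smoothed)."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in labeled_docs:
        terms = text.lower().split()
        class_counts[label] += 1
        term_counts[label].update(terms)
        vocabulary.update(terms)
    total_docs = sum(class_counts.values())
    prior = {c: n / total_docs for c, n in class_counts.items()}
    theta = {}
    for c in class_counts:
        denom = sum(term_counts[c].values()) + len(vocabulary)
        theta[c] = {t: (term_counts[c][t] + 1) / denom for t in vocabulary}
    return prior, theta

def log_score(text, c, prior, theta):
    """log prior(c) + sum over terms of count * log theta(c, term).

    The multinomial coefficient is identical for every topic, so it is dropped.
    Terms never seen in training are ignored.
    """
    counts = Counter(text.lower().split())
    return math.log(prior[c]) + sum(
        n * math.log(theta[c][t]) for t, n in counts.items() if t in theta[c]
    )

def classify(text, prior, theta):
    """Return the topic with the highest log score."""
    return max(prior, key=lambda c: log_score(text, c, prior, theta))

training = [  # hypothetical labeled examples
    ("recycling", "can bottle paper recycle"),
    ("sports", "team can win game"),
]
prior, theta = train(training)
print(classify("recycle paper bottle", prior, theta))
print(classify("team game", prior, theta))
```

Working in log space keeps long documents from underflowing, since the product of many small term probabilities becomes a sum of logarithms.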

32
Limitations

With the model
100th occurrence of term as surprising as first
No inter-term dependence
With using the model
Most observed $\theta(c,t)$ are zero and/or noisy
Have to pick a low-noise subset of the term
universe
Improves space, time, and accuracy
Have to fix low-support statistics

33
Feature selection
[Figure: models with unknown parameters $p_1, p_2, \ldots$ and $q_1, q_2, \ldots$ over the term set $T$; observed 0/1 data for $N$ documents; confidence intervals for the parameters]
Pick $F \subseteq T$ such that models built over $F$ have high separation confidence
34
Tables and queries 3
TAXONOMY

EGMAPR(did, kcid) ::=
  ((select did, kcid from EGMAP)
   union all
   (select e.did, t.pcid from EGMAPR as e, TAXONOMY as t
    where e.kcid = t.kcid))

STAT(pcid, tid, kcid, ksmc, ksnc) ::=
  (select pcid, tid, TAXONOMY.kcid, count(distinct TEXT.did), sum(freq)
   from EGMAPR, TAXONOMY, TEXT
   where TAXONOMY.kcid = EGMAPR.kcid and EGMAPR.did = TEXT.did
   group by pcid, tid, TAXONOMY.kcid)

[Figure: taxonomy tree with nodes 1 to 5, with the tables EGMAP and TEXT]
35
Analyzing hyperlink structure
36
Hyperlink graph analysis

Hypermedia is a social network
Telephoned, advised, co-authored, paid, cited
Social network theory (cf. Wasserman & Faust)
Extensive research applying graph notions
Centrality
Prestige and reflected prestige
Co-citation
Can be applied directly to Web search
HITS, Google, CLEVER, topic distillation

37
Hypertext models for classification

$c$ = class, $t$ = text, $N$ = neighbors
Text-only model: $\Pr[t \mid c]$
Using neighbors' text to judge my topic: $\Pr[t, t(N) \mid c]$
Better model: $\Pr[t, c(N) \mid c]$
Non-linear relaxation
38
Exploiting link features

9600 patents from 12 classes marked by USPTO
Patents have text and cite other patents
Expand test patent to include neighborhood
Forget a fraction of the neighbors' classes

39
Google and HITS

In-degree $\approx$ prestige
Not all votes are worth the same
Prestige of a page is the sum of prestige of citing pages: $p = Ep$
Pre-compute query independent prestige score

High prestige $\Leftrightarrow$ good authority
High reflected prestige $\Leftrightarrow$ good hub
Bipartite iteration
$a = Eh$
$h = E^T a$
$h = E^T E h$
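
The bipartite iteration can be sketched over an edge list, which sidesteps the question of how the matrix E is oriented. In this sketch a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The function name `hits`, the fixed iteration count, the Euclidean normalization, and the toy link set are all assumptions made for the example.

```python
import math

def hits(links, iterations=50):
    """links: iterable of (source, destination) pairs.

    Returns (authority, hub) score dictionaries, each scaled to unit length.
    """
    links = list(links)
    nodes = {node for edge in links for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores over incoming links.
        auth = {node: 0.0 for node in nodes}
        for src, dst in links:
            auth[dst] += hub[src]
        # Hub: sum of authority scores over outgoing links.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in links:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Rescale both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for node in scores:
                scores[node] /= norm
    return auth, hub

links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]  # toy graph
authority, hubs = hits(links)
print(max(authority, key=authority.get), max(hubs, key=hubs.get))
```

Substituting one update into the other gives the combined form on the slide: the hub vector is repeatedly multiplied by a fixed matrix, so the iteration is a power method and the scores converge to a principal eigenvector under the usual conditions.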

40
Tables and queries 4
delete from HUBS

insert into HUBS(url, score)
  (select urlsrc, sum(score * wtrev)
   from AUTH, LINK
   where authwt is not null and type = 'non-local'
     and ipdst <> ipsrc and url = urldst
   group by urlsrc)

update HUBS set (score) = score / (select sum(score) from HUBS)

update LINK as X set (wtfwd) = 1. /
  (select count(ipsrc) from LINK
   where ipsrc = X.ipsrc and urldst = X.urldst)
where type = 'non-local'

[Figure: tables HUBS and AUTH (each with a score column) and LINK (urlsrc @ipsrc, urldst @ipdst, wgtfwd, wgtrev)]
41
Querying/mining semi-structured data
42
Semi-structured database systems

Lore (Stanford)
Object exchange model, dataguides
WebSQL (Toronto), WebL (Compaq SRC)
Structured query languages for the Web
WHIRL (AT&T Research)
Approximate matches on multiple textual columns
Strudel (AT&T Research, U. Washington)
Web site generation and management

43
Queries combining structure and content

select x.url, x.title from Document x such that
"http://www.cs.wisc.edu" [link-path operators lost in extraction] x
where x mentions "semi-structured data"
Apart from cycling, find the most common topic found within link radius 2 of pages on cycling
Answer: first-aid
In the last year, how many links were made from environment protection pages to Exxon?
44
Resource discovery
45
Resource discovery results

High rate of harvesting relevant pages
Robust to perturbations of starting URLs
Great resources found 12 links from start set

46
Resource discovery results 1

High rate of harvesting relevant pages
Standard crawling neither necessary nor adequate
for answering specific queries

47
Resource discovery results 2

Robust to perturbations of starting URLs
Great resources found 12 links from start set

48
Database issues

Useful features
Concurrency and recovery (crawlers)
I/O-efficient representation of mining algorithms
Ad-hoc queries combining structure and content
Need better support for
Flexible choices for concurrency and recovery
Indexed scans over temporary table expressions
Efficient string storage and operations
Answering complex queries approximately

49
Resources
50
Research areas

Modeling, representation, and manipulation
More applications of machine learning
Approximate structure and content matching
Answering questions in specific domains
Interactive refinement of ill-defined queries
Tracking emergent topics in a discussion group
Content-based collaborative recommendation
Semantic prefetching and caching

51
Events and activities

Text REtrieval Conference (TREC)
Mature ad-hoc query and filtering tracks
(newswire)
New track for web search (2GB and 100GB corpus)
New track for question answering
DIMACS special years on Networks (-2000)
Includes applications such as information
retrieval, databases and the Web, multimedia
transmission and coding, distributed and
collaborative computing
Conferences: WWW, SIGIR, SIGMOD/VLDB?