Title: Hypertext Databases and Data Mining SIGMOD 1999 Tutorial
1 Hypertext Databases and Data Mining (SIGMOD 1999 Tutorial)
- Soumen Chakrabarti
- Indian Institute of Technology Bombay
- http://www.cse.iitb.ernet.in/soumen
- http://www.cs.berkeley.edu/soumen
- soumen@cse.iitb.ernet.in
2 The Web
- 350 million static HTML pages, 2 terabytes
- 0.8-1 million new pages created per day
- 600 GB of pages change per month
- Average page changes in a few weeks
- Average page has about ten links
- Increasing volume of active pages and views
- Boundaries between repositories blurred
- Bigger than the sum of its parts
3 Hypertext databases
- Academia
  - Digital library, web publication
- Consumer
  - Newsgroups, communities, product reviews
- Industry and organizations
  - Health care, customer service
  - Corporate email
4 What to expect
- Write in decimal the exact circumference of a circle of radius one inch
- Is the distance between Tokyo and Rome more than 6000 miles?
- What is the distance between Tokyo and Rome?
- java
- java coffee -applet
- uninterrupt power suppl ups -parcel
5 Search products and services
- Verity
- Fulcrum
- PLS
- Oracle text extender
- DB2 text extender
- Infoseek Intranet
- SMART (academic)
- Glimpse (academic)
- Inktomi (HotBot)
- Alta Vista
- Google!
- Yahoo!
- Infoseek Internet
- Lycos
- Excite
6 [Figure: diagram with labels: FTP, Gopher, HTML, XML, local data, more structure; crawling, indexing, search, WebSQL, WebL; relevance ranking, latent semantic indexing, social network of hyperlinks; clustering, Scatter-Gather, web communities, collaborative filtering; topic distillation, topic directories; semi-supervised learning, automatic classification, focused crawling; user profiling; monitor, mine, modify; web servers, web browsers]
7 Basic indexing and search
8 Keyword indexing
- Boolean search
  - care AND NOT old
- Stemming
  - gain
- Phrases and proximity
  - new care
  - loss <NEAR/5> care
  - <SENTENCE>
D1: My care is loss of care with old care done
D2: Your care is gain of care with new care won
Inverted index (term: document, word positions counted from 0):
- care: D1 1, 5, 8; D2 1, 5, 8
- new: D2 7
- old: D1 7
- loss: D1 3
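The posting lists above can be rebuilt in a few lines. The following is a minimal sketch, not the tutorial's own code: the helper names `build_index` and `near` are hypothetical, tokenization is plain whitespace splitting, and the proximity test uses the same `abs(a - b) < 5` style window as the deck's queries.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [word positions]}; positions start at 0."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def near(index, term_a, term_b, window):
    """Documents where term_a and term_b occur fewer than `window` positions apart."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = set()
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(a - b) < window
               for a in postings_a[doc_id] for b in postings_b[doc_id]):
            hits.add(doc_id)
    return hits

docs = {
    "D1": "My care is loss of care with old care done",
    "D2": "Your care is gain of care with new care won",
}
index = build_index(docs)

# Boolean query: care AND NOT old
care_and_not_old = set(index.get("care", {})) - set(index.get("old", {}))
# Proximity query: loss <NEAR/5> care
loss_near_care = near(index, "loss", "care", 5)
```

Storing positions rather than just document ids is what makes phrase and proximity queries answerable from the index alone.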
9 Tables and queries 1
POSTING(did, tid, pos)

select distinct did from POSTING where tid = 'care'
except
select distinct did from POSTING where tid like 'gain%'

with TPOS1(did, pos) as
       (select did, pos from POSTING where tid = 'new'),
     TPOS2(did, pos) as
       (select did, pos from POSTING where tid = 'care')
select distinct TPOS1.did from TPOS1, TPOS2
where TPOS1.did = TPOS2.did
  and proximity(TPOS1.pos, TPOS2.pos)

proximity(a, b) ::=
  a + 1 = b         (phrase)
  abs(a - b) < 5    (NEAR/5)
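These queries can be tried against an in-memory SQLite database. This is a sketch under assumptions: the column layout `POSTING(did, tid, pos)` is inferred from the queries, the two example documents come from the keyword-indexing slide, and the `proximity` predicate is inlined as the phrase condition `a + 1 = b`.

```python
import sqlite3

docs = {
    "D1": "My care is loss of care with old care done",
    "D2": "Your care is gain of care with new care won",
}

conn = sqlite3.connect(":memory:")
conn.execute("create table POSTING (did text, tid text, pos integer)")
for did, text in docs.items():
    conn.executemany(
        "insert into POSTING values (?, ?, ?)",
        [(did, tid, pos) for pos, tid in enumerate(text.lower().split())],
    )

# Boolean query: documents containing 'care' but no term starting with 'gain'.
boolean_rows = conn.execute(
    """
    select distinct did from POSTING where tid = 'care'
    except
    select distinct did from POSTING where tid like 'gain%'
    """
).fetchall()

# Phrase query "new care": 'care' must directly follow 'new'.
phrase_rows = conn.execute(
    """
    with TPOS1(did, pos) as (select did, pos from POSTING where tid = 'new'),
         TPOS2(did, pos) as (select did, pos from POSTING where tid = 'care')
    select distinct TPOS1.did from TPOS1, TPOS2
    where TPOS1.did = TPOS2.did and TPOS1.pos + 1 = TPOS2.pos
    """
).fetchall()

boolean_docs = sorted(row[0] for row in boolean_rows)
phrase_docs = sorted(row[0] for row in phrase_rows)
```

Replacing the phrase condition with `abs(TPOS1.pos - TPOS2.pos) < 5` gives the NEAR/5 variant.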
10 Relevance ranking
- Recall = coverage
  - What fraction of relevant documents were reported
- Precision = accuracy
  - What fraction of reported documents were relevant
- Trade-off
[Figure: the search output sequence for a query is compared with the true response, considering each prefix of length k]
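The prefix-by-prefix comparison can be sketched as follows. The document ids and the relevant set are hypothetical data invented for illustration, and `precision_recall_at_k` is a name chosen here, not one from the tutorial.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall of the first k results against the true response."""
    prefix = ranked[:k]
    hits = sum(1 for doc in prefix if doc in relevant)
    precision = hits / len(prefix) if prefix else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

ranked = ["d3", "d7", "d1", "d9", "d4"]  # hypothetical output sequence
relevant = {"d1", "d3", "d5"}            # hypothetical true response

# One (precision, recall) point per prefix length k = 1..5
curve = [precision_recall_at_k(ranked, relevant, k)
         for k in range(1, len(ranked) + 1)]
```

As k grows, recall can only stay flat or rise while precision tends to fall, which is the trade-off named on the slide.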
11 Vector space model and TFIDF
- Some words are more important than others
- W.r.t. a document collection D
- d+ have a term, d- do not
- Inverse document frequency (IDF)
- Term frequency (TF)
- Many variants
- Probabilistic models
12 Tables and queries 2
VECTOR(did, tid, elem) ::=
with TEXT(did, tid, freq) as
       (select did, tid, count(distinct pos)
        from POSTING group by did, tid),
     LENGTH(did, len) as
       (select did, sum(freq) from TEXT group by did),
     DOCFREQ(tid, df) as
       (select tid, count(distinct did) from TEXT group by tid)
select TEXT.did, TEXT.tid,
       (freq / len) * (1 + log((select count(distinct did) from POSTING) / df))
from TEXT, LENGTH, DOCFREQ
where TEXT.did = LENGTH.did and TEXT.tid = DOCFREQ.tid
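The same weighting can be computed directly in memory. This sketch assumes the formula (freq / len) * (1 + log(N / df)) with a natural logarithm and floating-point division; the function name `tfidf_vectors` and the whitespace tokenization are illustrative choices.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return {did: {tid: (freq / len) * (1 + log(N / df))}}."""
    text = {did: Counter(body.lower().split()) for did, body in docs.items()}
    n_docs = len(text)
    docfreq = Counter()
    for counts in text.values():
        docfreq.update(counts.keys())  # each document counts once per term
    vectors = {}
    for did, counts in text.items():
        length = sum(counts.values())
        vectors[did] = {
            tid: (freq / length) * (1 + math.log(n_docs / docfreq[tid]))
            for tid, freq in counts.items()
        }
    return vectors

docs = {
    "D1": "My care is loss of care with old care done",
    "D2": "Your care is gain of care with new care won",
}
vectors = tfidf_vectors(docs)
```

In D1 the term "old" outweighs "is" although both occur once, because "old" appears in only one of the two documents.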
13 Relevance ranking
select did, cosine(did, query)
from corpus
where candidate(did, query)
order by cosine(did, query) desc
fetch first k rows only
[Figure: a query vector over terms such as now, auto, car is matched against a matrix; the task is to find the largest k columns of a matrix product (surviving symbols: D, A, T)]
- Exact computation: O(n^2)
- All entries above mean can be estimated with error ε within O(n ε^-2) time
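The ranking query can be mimicked with sparse vectors held in dictionaries. This is a sketch of the exact computation only, not of the sampling-based estimate; the term weights are hypothetical, and `candidate` is interpreted here as "shares at least one term with the query", which is an assumption.

```python
import heapq
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(corpus, query, k):
    """Score only candidates sharing a term with the query; keep the best k."""
    candidates = (
        (cosine(vec, query), did)
        for did, vec in corpus.items()
        if vec.keys() & query.keys()
    )
    return [(did, score) for score, did in heapq.nlargest(k, candidates)]

corpus = {  # hypothetical term weights
    "d1": {"auto": 1.0, "car": 1.0},
    "d2": {"car": 2.0, "now": 1.0},
    "d3": {"now": 3.0},
}
query = {"auto": 1.0, "car": 1.0}
ranking = top_k(corpus, query, 2)
```

`heapq.nlargest` keeps only k scores at a time, which corresponds to `fetch first k rows only` without sorting the whole candidate set.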
14 Similarity and clustering
15 Clustering
- Given an unlabeled collection of documents, induce a taxonomy based on similarity
- Need document similarity measure
  - Distance between normalized document vectors
  - Cosine of angle between document vectors
- Top-down clustering is difficult because of huge number of noisy dimensions
  - k-means, expectation maximization
- Quadratic-time bottom-up clustering
16 Document model
- Vocabulary V, term w_i, document α represented by $c(\alpha) = \big(f(w_i, \alpha)\big)_{w_i \in V}$
- $f(w_i, \alpha)$ is the number of times w_i occurs in document α
- Most f's are zeroes for a single document
- Monotone component-wise damping function g such as log or square-root
17 Similarity
Normalized document profile
Profile for document group Γ
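The slide's formulas did not survive extraction, so the following is only one plausible reading of "normalized document profile": damp the raw term counts with g (square root here) and scale the result to unit length, so that the inner product of two profiles is the cosine of the angle between them. The names `profile` and `dot` are hypothetical, and the input text is assumed to be non-empty.

```python
import math
from collections import Counter

def profile(text, damp=math.sqrt):
    """Damp raw term counts with g, then scale the vector to unit length."""
    counts = Counter(text.lower().split())
    damped = {term: damp(freq) for term, freq in counts.items()}
    norm = math.sqrt(sum(x * x for x in damped.values()))
    return {term: x / norm for term, x in damped.items()}

def dot(p, q):
    """Inner product of two sparse profiles; a cosine when both have unit length."""
    return sum(x * q.get(term, 0.0) for term, x in p.items())

p1 = profile("My care is loss of care with old care done")
p2 = profile("Your care is gain of care with new care won")
similarity = dot(p1, p2)
```

With square-root damping the three occurrences of "care" contribute 3 rather than 9 to the unnormalized inner product, so a single repeated term dominates less.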
18 Group average clustering 1
- Initially G is a collection of singleton groups, each with one document
- Repeat
  - Find Γ, Δ in G with max s(Γ ∪ Δ)
  - Merge group Γ with group Δ
- For each Γ keep track of best Δ
- O(n^2) algorithm
19 Group average clustering 2
Un-normalized group profile
Can show
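The slide's equations are lost, so this sketch assumes one standard formulation. If every document profile $p(\alpha)$ has unit length and $\hat{p}(\Gamma) = \sum_{\alpha \in \Gamma} p(\alpha)$ is the un-normalized group profile, then the average pairwise similarity inside Γ is $s(\Gamma) = \dfrac{\|\hat{p}(\Gamma)\|^2 - |\Gamma|}{|\Gamma|(|\Gamma| - 1)}$, so a merge can be scored from two profile sums without revisiting individual documents. The code below is a naive version that rescans all pairs on every merge; it does not implement the "keep track of best Δ" bookkeeping that yields the O(n^2) bound. All names and the toy term vectors are hypothetical.

```python
import itertools
import math

def unit(vec):
    """Scale a sparse vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {t: x / norm for t, x in vec.items()}

def add(p, q):
    """Sum of two sparse vectors."""
    out = dict(p)
    for t, x in q.items():
        out[t] = out.get(t, 0.0) + x
    return out

def avg_sim(profile_sum, size):
    """Average pairwise similarity in a group, from its summed unit profiles."""
    sq = sum(x * x for x in profile_sum.values())
    return (sq - size) / (size * (size - 1))

def merged_score(g1, g2):
    """s(g1 union g2) computed from the two groups' profile sums."""
    return avg_sim(add(g1[1], g2[1]), len(g1[0]) + len(g2[0]))

def group_average(vectors, k):
    """Merge the pair whose union has the highest s until k groups remain."""
    groups = [([name], unit(vec)) for name, vec in vectors.items()]
    while len(groups) > k:
        i, j = max(itertools.combinations(range(len(groups)), 2),
                   key=lambda ij: merged_score(groups[ij[0]], groups[ij[1]]))
        merged = (groups[i][0] + groups[j][0], add(groups[i][1], groups[j][1]))
        groups = [g for n, g in enumerate(groups) if n not in (i, j)] + [merged]
    return [sorted(members) for members, _ in groups]

vectors = {  # hypothetical term vectors
    "a": {"auto": 1.0, "car": 1.0},
    "b": {"auto": 2.0, "car": 1.0},
    "c": {"java": 1.0, "coffee": 1.0},
    "d": {"java": 2.0, "applet": 1.0},
}
clusters = group_average(vectors, 2)
```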
20 Rectangular time algorithm: Buckshot
- Randomly sample documents
- Run group average clustering algorithm to reduce to k groups or clusters
- Iterate assign-to-nearest O(1) times
  - Move each document to cluster Γ with max s(α, Γ)
- Total time taken is O(kn)
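The assign-to-nearest step can be sketched on its own. This example takes the seed clusters as given (in Buckshot they would come from group average clustering on the sample) and assumes s(α, Γ) is the cosine between a document profile and the normalized cluster centroid; that choice, the function names, and the toy data are all assumptions made for illustration.

```python
import math

def unit(vec):
    """Scale a sparse vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {t: x / norm for t, x in vec.items()}

def dot(p, q):
    """Inner product of two sparse vectors."""
    return sum(x * q.get(t, 0.0) for t, x in p.items())

def centroid(members, profiles):
    """Normalized sum of the member profiles."""
    total = {}
    for name in members:
        for t, x in profiles[name].items():
            total[t] = total.get(t, 0.0) + x
    return unit(total)

def assign_to_nearest(profiles, seed_clusters, rounds=2):
    """Each round moves every document to its most similar centroid: O(kn) per round."""
    clusters = [list(c) for c in seed_clusters]
    for _ in range(rounds):
        centers = [centroid(c, profiles) for c in clusters if c]
        clusters = [[] for _ in centers]
        for name, p in profiles.items():
            best = max(range(len(centers)), key=lambda i: dot(p, centers[i]))
            clusters[best].append(name)
    return [sorted(c) for c in clusters]

vectors = {  # hypothetical term vectors
    "a": {"auto": 1.0, "car": 1.0},
    "b": {"auto": 2.0, "car": 1.0},
    "c": {"java": 1.0, "coffee": 1.0},
    "d": {"java": 2.0, "applet": 1.0},
}
profiles = {name: unit(vec) for name, vec in vectors.items()}
clusters = assign_to_nearest(profiles, [["a"], ["c"]])
```

Because the number of rounds is a constant, the refinement costs k similarity computations per document per round, which is where the O(kn) total comes from.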
21 Extended similarity
- auto and car co-occur often
- Therefore they must be related
- Documents having related words are related
- Useful for search and clustering
- Two basic approaches
  - Hand-made thesaurus (WordNet)
  - Co-occurrence and associations
[Figure: documents in which auto and car occur together, illustrating that the two terms are related]
22 Latent semantic indexing
[Figure: SVD of the term-document matrix A (t terms by d documents) into factors U, D, V, keeping r dimensions]
23 Collaborative recommendation
- People = record, movies = features; cluster people
- Both people and features can be clustered
- For hypertext access, time of access is a feature
- Need advanced models
24 A model for collaboration
- People and movies belong to unknown classes
- P_k = probability a random person is in class k
- P_l = probability a random movie is in class l
- P_kl = probability of a class-k person liking a class-l movie
- Gibbs sampling: iterate
  - Pick a person or movie at random and assign to a class with probability proportional to P_k or P_l
  - Estimate new parameters
25 Supervised learning
26 Supervised learning (classification)
- Many forms
  - Content: automatically organize the web per Yahoo!
  - Type: faculty, student, staff
  - Intent: education, discussion, comparison, advertisement
- Applications
  - Relevance feedback for re-scoring query responses
  - Filtering news, email, etc.
  - Narrowing searches and selective data acquisition
27 Difficulties
- Dimensionality
  - Decision tree classifiers: dozens of columns
  - Vector space model: 50,000 columns
- Context-dependent noise
  - "Can" (v.) is considered a stopword
  - "Can" (n.) may not be a stopword in /Yahoo/SocietyCulture/Environment/Recycling
28 More difficulties
- Need for scalability
  - High dimension needs more data to learn
- Class labels are from a hierarchy
  - All documents belong to the root node
  - Highest probability leaf may have low confidence
29 Techniques
- Nearest neighbor
  - Standard keyword index also supports classification
  - How to define similarity? (TFIDF may not work)
  - Wastes space by storing individual document info
- Rule-based, decision-tree based
  - Very slow to train (but quick to test)
  - Good accuracy (but brittle rules)
- Model-based
  - Fast training and testing with small footprint
30 More document models
- Boolean vector (word counts ignored)
  - Toss one coin for each term in the universe
- Bag of words (multinomial)
  - Repeatedly toss coin with a term on each face
- Limited dependence models
  - Bayesian network where each feature has at most k features as parents
  - Maximum entropy estimation
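31 Bag-of-words
- Decide topic: topic c is picked with prior probability π(c), where $\sum_c \pi(c) = 1$
- Each topic c has parameters θ(c,t) for terms t
- Coin with face probabilities θ(c,t), where $\sum_t \theta(c,t) = 1$
- Fix document length and keep tossing coin
- Given c, probability of document is $\Pr[d \mid c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$, where n(d,t) is the number of times t occurs in d and n(d) is the document length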
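A classifier built on this model picks the class that maximizes log π(c) plus the sum over terms of n(d,t) log θ(c,t); the multinomial coefficient is the same for every class and can be dropped. The sketch below is not the tutorial's implementation: additive (Laplace) smoothing is an assumption introduced so that unseen terms do not get zero probability, and the training examples are invented.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs, smoothing=1.0):
    """Estimate log pi(c) and log theta(c, t) with additive smoothing (an assumption)."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, c in labeled_docs:
        terms = text.lower().split()
        class_docs[c] += 1
        term_counts[c].update(terms)
        vocab.update(terms)
    total_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}
    log_theta = {}
    for c in class_docs:
        denom = sum(term_counts[c].values()) + smoothing * len(vocab)
        log_theta[c] = {t: math.log((term_counts[c][t] + smoothing) / denom)
                        for t in vocab}
    return log_prior, log_theta

def classify(text, log_prior, log_theta):
    """Return the class with the highest log pi(c) + sum_t n(d,t) log theta(c,t)."""
    counts = Counter(text.lower().split())

    def score(c):
        # Terms never seen in training are ignored.
        return log_prior[c] + sum(n * log_theta[c][t]
                                  for t, n in counts.items() if t in log_theta[c])

    return max(log_prior, key=score)

training = [  # hypothetical labeled documents
    ("java applet code", "programming"),
    ("java compiler code", "programming"),
    ("java coffee beans", "coffee"),
    ("coffee roast beans", "coffee"),
]
log_prior, log_theta = train(training)
label_1 = classify("java code", log_prior, log_theta)
label_2 = classify("coffee beans", log_prior, log_theta)
```

Working in log space avoids underflow when a long document multiplies many small probabilities.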
32 Limitations
- With the model
  - 100th occurrence of term as surprising as first
  - No inter-term dependence
- With using the model
  - Most observed θ(c,t) are zero and/or noisy
  - Have to pick a low-noise subset of the term universe
    - Improves space, time, and accuracy
  - Have to fix low-support statistics
33 Feature selection
[Figure: observed data (N documents over term set T) are used to fit models with unknown parameters p1, p2, ... and q1, q2, ..., each with confidence intervals]
- Pick F ⊆ T such that models built over F have high separation confidence
34 Tables and queries 3
EGMAPR(did, kcid) ::=
  ((select did, kcid from EGMAP)
   union all
   (select e.did, t.pcid
    from EGMAPR as e, TAXONOMY as t
    where e.kcid = t.kcid))

STAT(pcid, tid, kcid, ksmc, ksnc) ::=
  (select pcid, tid, TAXONOMY.kcid,
          count(distinct TEXT.did), sum(freq)
   from EGMAPR, TAXONOMY, TEXT
   where TAXONOMY.kcid = EGMAPR.kcid
     and EGMAPR.did = TEXT.did
   group by pcid, tid, TAXONOMY.kcid)
[Figure: TAXONOMY tree with nodes 1 to 5, linked to the EGMAP and TEXT tables]
35 Analyzing hyperlink structure
36 Hyperlink graph analysis
- Hypermedia is a social network
  - Telephoned, advised, co-authored, paid, cited
- Social network theory (cf. Wasserman & Faust)
  - Extensive research applying graph notions
  - Centrality
  - Prestige and reflected prestige
  - Co-citation
- Can be applied directly to Web search
  - HITS, Google, CLEVER, topic distillation
37 Hypertext models for classification
- c = class, t = text, N = neighbors
- Text-only model: Pr[t | c]
- Using neighbors' text to judge my topic: Pr[t, t(N) | c]
- Better model: Pr[t, c(N) | c]
- Non-linear relaxation
38 Exploiting link features
- 9600 patents from 12 classes marked by USPTO
- Patents have text and cite other patents
- Expand test patent to include neighborhood
- Forget a fraction of neighbors' classes
39 Google and HITS
- In-degree approximates prestige
- Not all votes are worth the same
- Prestige of a page is the sum of prestige of citing pages: p = Ep
- Pre-compute query independent prestige score
- High prestige means good authority
- High reflected prestige means good hub
- Bipartite iteration
  - a = Eh
  - h = E^T a
  - h = E^T E h
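The bipartite iteration can be sketched with plain dictionaries. This is a minimal illustration, not the CLEVER or Google implementation: the link list is hypothetical, scores are rescaled to unit Euclidean length after every round (a normalization choice assumed here), and at least one link is assumed to exist.

```python
import math

def hits(links, iterations=50):
    """Iterate a = E h, h = E^T a with rescaling; links are (source, target) pairs."""
    nodes = sorted({n for edge in links for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {n: 0.0 for n in nodes}
        for src, dst in links:
            auth[dst] += hub[src]  # authority collects hub scores of citing pages
        hub = {n: 0.0 for n in nodes}
        for src, dst in links:
            hub[src] += auth[dst]  # hub collects authority scores of cited pages
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values()))
            for n in scores:
                scores[n] /= norm
    return auth, hub

links = [  # hypothetical hyperlinks (source, target)
    ("h1", "a1"), ("h1", "a2"),
    ("h2", "a1"),
    ("h3", "a1"),
]
auth, hub = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

Without the rescaling step the scores would grow without bound; only their relative sizes matter for ranking.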
40 Tables and queries 4
delete from HUBS;
insert into HUBS(url, score)
  (select urlsrc, sum(score * wtrev)
   from AUTH, LINK
   where authwt is not null
     and type = 'non-local'
     and ipdst <> ipsrc
     and url = urldst
   group by urlsrc);
update HUBS set (score) = score / (select sum(score) from HUBS);

update LINK as X set (wtfwd) = 1. /
  (select count(ipsrc) from LINK
   where ipsrc = X.ipsrc and urldst = X.urldst)
where type = 'non-local'
[Figure: tables HUBS (url, score), AUTH (url, score), and LINK (urlsrc@ipsrc, urldst@ipdst, wgtfwd, wgtrev)]
41 Querying/mining semi-structured data
42 Semi-structured database systems
- Lore (Stanford)
  - Object exchange model, dataguides
- WebSQL (Toronto), WebL (Compaq SRC)
  - Structured query languages for the Web
- WHIRL (AT&T Research)
  - Approximate matches on multiple textual columns
- Strudel (AT&T Research, U. Washington)
  - Web site generation and management
43 Queries combining structure and content
- Select x.url, x.title from Document x such that http://www.cs.wisc.edu ? ? ? x where x mentions semi-structured data
- Apart from cycling, find the most common topic found within link radius 2 of pages on cycling
  - Answer: first-aid
- In the last year, how many links were made from environment protection pages to Exxon?
44 Resource discovery
45 Resource discovery results
- High rate of harvesting relevant pages
- Robust to perturbations of starting URLs
- Great resources found 12 links from start set
46 Resource discovery results 1
- High rate of harvesting relevant pages
- Standard crawling neither necessary nor adequate for answering specific queries
47 Resource discovery results 2
- Robust to perturbations of starting URLs
- Great resources found 12 links from start set
48 Database issues
- Useful features
  - Concurrency and recovery (crawlers)
  - I/O-efficient representation of mining algorithms
  - Ad-hoc queries combining structure and content
- Need better support for
  - Flexible choices for concurrency and recovery
  - Indexed scans over temporary table expressions
  - Efficient string storage and operations
  - Answering complex queries approximately
49 Resources
50 Research areas
- Modeling, representation, and manipulation
- More applications of machine learning
- Approximate structure and content matching
- Answering questions in specific domains
- Interactive refinement of ill-defined queries
- Tracking emergent topics in a discussion group
- Content-based collaborative recommendation
- Semantic prefetching and caching
51 Events and activities
- Text REtrieval Conference (TREC)
  - Mature ad-hoc query and filtering tracks (newswire)
  - New track for web search (2GB and 100GB corpus)
  - New track for question answering
- DIMACS special years on Networks (-2000)
  - Includes applications such as information retrieval, databases and the Web, multimedia transmission and coding, distributed and collaborative computing
- Conferences: WWW, SIGIR, SIGMOD/VLDB?