Mining the Web for Improving Information Retrieval Technology presentation

1
Mining the Web for Improving Information
Retrieval Technology

Lee-Feng Chien
Academia Sinica

2
Web Search vs. IR

Web search has benefited a lot from IR
Fundamental indexing and searching techniques
originated from IR research
"Search engine" is the popular term for Web IR systems
How can Web search continue to benefit from IR
research?

3
Outline

Trends of IR research
Lessons learned from Web search
Improving IR via Web mining
Summary

4
I. Trends of IR Research
6
What do People Want from IR (Viewpoint of a
Leading IR Scientist in 1995)

10. Relevance Feedback.
9. Information Extraction.
8. Multimedia Retrieval.
7. Effective Retrieval.
6. Routing and Filtering.
5. Interfaces and Browsing.
4. Vocabulary Expansion.
3. Efficient, Flexible Indexing and Retrieval.
2. Distributed IR.
1. Integrated Solutions.

7
What do People Want from IR


Efficiency Issues
8
What do People Want from IR


Effectiveness Issues
9
What do People Want from IR


Application Issues
10
Efficiency Issues

1. Integrated Solutions
A complete solution requires integration with other
systems, such as DBMS, office management, etc.
2. Distributed IR
Searching in a distributed environment that may
contain hundreds or even thousands of databases,
and merging the results.
3. Efficient, Flexible Indexing and Retrieval

11
Effectiveness Issues

4. Vocabulary Expansion
Users' information needs are often described using
different words than are found in relevant
documents.
One of the major causes of failures in IR systems
is vocabulary mismatch.
5. Interfaces and Browsing
Interfaces must support a range of functions
including presentation of retrieved information.
The challenge is to present this sophisticated
functionality in a conceptually simple way.
7. Effective Retrieval
10. Relevance Feedback

12
New IR Topics

Popular topics in the search industry
Web search, mobile search, email spam filtering,
P2P search, spatially-aware IR, ...
Popular topics in other communities
Multimedia search (Multimedia)
Text mining (Data Mining, NLP applications)
Spoken information retrieval (Speech)
Federated search (Digital Library)
Popular topics in SIGIR
Language modeling for IR, question answering,
cross-language IR, topic detection and tracking,
search result clustering, ...

13
Trends of IR Research

Topics important in the search industry may not be
as popular in SIGIR
Check whether a topic has a standard test bed
The effectiveness issues are still challenging
Lessons learned from Web search can serve as a reference

14
II. Lessons Learned from Web Search
15
Traditional IR vs. Web Search

New information sources
Document -> page, blog, Web image, ...
New media types
Text -> image, video, speech, music, map, ...
New infrastructures
Plain text file -> hypertext, P2P, semantic Web, ...
New applications
Crawler, email spam filter, MP3 search, mobile
search, ...
Major impacts were not all derived from
breakthroughs in IR techniques

16
Facts of Web Search

Short query
English: 2.35 words (Altavista, 1998)
Chinese: 3.55 chars (H. T. Pu, 1999)
Precision-biased
Users normally browse the first result page
(Altavista, 1998)

17
Terms Per Query 1997-2001
Reference: Amanda Spink & Bernard J. Jansen
(2004). Web Search: Public Searching of the Web.
Springer.
18
Queries Per User 1997-2001

19
Pages Viewed Per User 1997-2001
20
Ranking via Document Similarity Is Ineffective
for Short Queries
A huge number of pages on the Web match the
query terms
[Figure: query-document similarity; example query: "information retrieval"]
21
Short Query Retrieval
[Figure: query space and doc space linked by query-document similarity; example query: "information retrieval"]
22
Users Need Document Authority
[Figure: query space and doc space linked by query-document similarity and by authority]
Query: "information retrieval"
Concept: IR book, IR systems, SIGIR Web sites
Authority: representative IR book
24
DEF
25
Books
26
Tools
27
Hypotheses of Traditional IR

Query is long
TREC's average topic description length is 15
terms
Evaluation considers both precision and recall
Average recall-precision value of top 1000
retrieved results
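
To make the evaluation bullet concrete, here is a minimal Python sketch of precision, recall, and non-interpolated average precision over a ranked list. Reading the slide's "average recall-precision value" as average precision is an assumption, and the document ids, the relevance set, and the function names are hypothetical illustrations rather than anything taken from the talk or from TREC tooling.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k ranked document ids."""
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant, cutoff=1000):
    """Non-interpolated average precision over the top `cutoff` results."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document's rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranking and relevance judgments, for illustration only.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p, r = precision_recall_at_k(ranked, relevant, k=5)
ap = average_precision(ranked, relevant)
print(p, r, ap)
```

A Web user who reads only the first result page effectively cares about precision at a small k, whereas the measure above rewards recall deep into the ranking.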

28
Retrieval Models in Traditional IR

Retrieval models are mostly document-similarity-based
Queries and documents are represented as
n-dimensional vectors, e.g., vector-space model.
Rank the documents according to their degree of
similarity to the query
QA and language modeling approaches are mostly
document-similarity-based

29
Vector Similarity

Cosine measure or normalized correlation
coefficient
Euclidean Distance
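
The two measures named on this slide can be sketched in Python as follows. This is a minimal illustration assuming raw term-frequency vectors and whitespace tokenization; the helper names and the sample texts are hypothetical, and a real system would normally add term weighting such as tf-idf.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance between two sparse term vectors."""
    return math.sqrt(sum((u[t] - v[t]) ** 2 for t in u.keys() | v.keys()))


# Hypothetical query and documents, for illustration only.
query = term_vector("information retrieval")
doc_a = term_vector("information retrieval systems")
doc_b = term_vector("web image search")

print(cosine(query, doc_a), cosine(query, doc_b))
print(euclidean(query, doc_a), euclidean(query, doc_b))
```

With a two-word query, every page containing both words scores highly, which is why similarity alone separates Web pages poorly.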

30
Using LM in IR

Principle 1
Document D: language model P(w|M_D)
Query Q: sequence of words q1, q2, ..., qn
(unigrams)
Matching: P(Q|M_D)
Principle 2
Document D: language model P(w|M_D)
Query Q: language model P(w|M_Q)
Matching: comparison between P(.|M_D) and P(.|M_Q)
Principle 3
Translate D to Q
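
Principle 1 (query likelihood) can be sketched as below: each document gets a unigram model, and documents are ranked by the log-probability that the model generates the query. The slide does not say how unseen words are handled, so the Jelinek-Mercer interpolation with a collection model, the weight `lam=0.5`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(Q | M_D) under a unigram model with Jelinek-Mercer smoothing.

    query, doc, collection are lists of tokens; lam is an assumed weight.
    """
    doc_tf = Counter(doc)
    col_tf = Counter(collection)
    score = 0.0
    for q in query:
        p_doc = doc_tf[q] / len(doc) if doc else 0.0
        p_col = col_tf[q] / len(collection)
        p = lam * p_doc + (1 - lam) * p_col
        if p == 0.0:
            # The word occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical two-document collection, for illustration only.
docs = {
    "d1": "information retrieval on the web".split(),
    "d2": "web image search engine".split(),
}
collection = [w for tokens in docs.values() for w in tokens]
query = "information retrieval".split()

ranking = sorted(
    docs,
    key=lambda name: query_log_likelihood(query, docs[name], collection),
    reverse=True,
)
print(ranking)
```

Principle 2 would instead estimate a second model from the query and compare the two distributions, for example with a divergence measure.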

31
New Retrieval Model Required
[Figure: query space, doc space, user space, and author space; query-document similarity, document authority, and information needs; example query: "information retrieval"]
32
Other Challenge Problems

Search result organizing
Web search results are a long list of snippets;
it is hard for users to browse the results in a
conceptual manner.
Scalability
All techniques used should scale with increasing
data size and number of users
It is difficult for academic/university labs to
establish the same environment

33
III. Improving IR via Web Mining
34
Web Mining

Web mining is the study of how to discover
knowledge from diverse data resources on the Web
to benefit Web information systems

[Figure labels: Mining; Web texts, images, logs; Search Engine; Behaviors of Millions of Users]
35
Mining Techniques

Mining the Web: Discovering Knowledge from
Hypertext Data (S. Chakrabarti, 2003)
Web usage mining (R. Baeza-Yates)
Taxonomy of Web mining (R. Cooley)

36
Document Space Mining

PageRank
Focused crawler
Search result clustering

37
PageRank: Document Authority

It is a measure of a web page's citation
importance that corresponds well with people's
subjective idea of importance.
Given a document A, let C(A) be the number of links
going out of A, let T1 .. Tn be the documents linking
to A, and let d be a constant (the damping factor).
Then the PageRank of A is
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
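
A minimal iterative sketch of this definition in Python, assuming the formulation from Brin and Page's original paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). The value d = 0.85, the fixed iteration count, and the three-page link graph are assumptions chosen for illustration; production systems work on sparse matrices and treat pages without out-links specially.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    out_count = {p: len(links.get(p, ())) for p in pages}
    inbound = {p: [] for p in pages}
    for src, targets in links.items():
        for target in targets:
            inbound[target].append(src)

    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Build the new scores from the previous iteration's scores only.
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in inbound[p])
            for p in pages
        }
    return pr


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph C collects links from both other pages and ends up with the highest score, independent of any query.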

38
Other Techniques Used in Document Space

Focused crawling for domain-specific information
CMU Cora (1998)
Search result clustering
Document clustering
Weak in comprehension
Term clustering
STC (Zamir, WWW99)
DisCover (Kummamuru, WWW04)
Salient phrases ranking (Zeng, SIGIR04)
Link-based clustering
Contents-Link (Wang, CIKM02)

39
Existing Clustering Engines

Problems exist, e.g., comprehension of clustered
results, clustering complexity.

40
User Space Mining

Log analysis: relevant concepts
Query log, query session log, clicked stream log
Query clustering
Query taxonomy generation

41

Challenge: Short Query
[Figure labels: Document Space, SE, Users, Authors, Taiwan University, Document Taxonomy, Information Use]
42

Hidden Information to Know Users
Query Sessions: q1, q2, ...
[Figure labels: Document Space, SE, Users, Authors, Taiwan University, Document Taxonomy, Information Use]
45
Log-based vs. Document-based
Query-Session-Based: Sina, yahoo, yam, Yahoo, tomail,
pchome, free email, Yahoo Chinese, Kimo, search engine
Document-Based: Sina, copyright, IP rights, chat, news,
chat room, personal finance, banking, mail box
46

Hidden Information to Know Users
Query Sessions: q1, q2, ...
Clicked Streams: Q, URL
[Figure labels: Document Space, SE, Users, Authors, Taiwan University, Document Taxonomy, Information Use]
47
Clustering of Queries and URLs

Beeferman and Berger (KDD2000)
Goal
Clustering queries and corresponding URLs.
500,000 click-through records from the Lycos search
engine.
Results
Apply bipartite-graph-based iterative clustering
Many Web pages are diverse in query terms
A huge log is necessary for clustering all queries

Queries
URLs
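
The bipartite-graph-based iterative clustering can be sketched roughly as follows. This is a simplified reading of the approach, not the authors' implementation: queries and URLs form the two sides of a click graph, the similarity of two nodes on one side is assumed to be the Jaccard overlap of their neighbor sets, and the most similar query pair and URL pair are merged in alternation until no pair exceeds a threshold. The click records and function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two neighbor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_most_similar(adj, other_adj, threshold):
    """Merge the most similar pair of clusters on one side of the graph."""
    best_sim, best_pair = threshold, None
    for x, y in combinations(sorted(adj), 2):
        sim = jaccard(adj[x], adj[y])
        if sim > best_sim:
            best_sim, best_pair = sim, (x, y)
    if best_pair is None:
        return False
    x, y = best_pair
    merged = tuple(sorted(x + y))
    adj[merged] = adj.pop(x) | adj.pop(y)
    # Re-point the other side's edges at the merged cluster.
    for neighbor in adj[merged]:
        other_adj[neighbor] -= {x, y}
        other_adj[neighbor].add(merged)
    return True


def cluster_click_graph(clicks, threshold=0.0):
    """clicks: iterable of (query, url) pairs. Returns (query clusters, URL clusters)."""
    q_adj, u_adj = {}, {}
    for query, url in clicks:
        q, u = (query,), (url,)
        q_adj.setdefault(q, set()).add(u)
        u_adj.setdefault(u, set()).add(q)
    merged = True
    while merged:
        merged_q = merge_most_similar(q_adj, u_adj, threshold)
        merged_u = merge_most_similar(u_adj, q_adj, threshold)
        merged = merged_q or merged_u
    return sorted(q_adj), sorted(u_adj)


# Hypothetical click-through records, for illustration only.
clicks = [
    ("ir book", "a.com"),
    ("information retrieval book", "a.com"),
    ("information retrieval book", "b.com"),
    ("ir textbook", "b.com"),
    ("mp3 search", "c.com"),
    ("music search", "c.com"),
]
query_clusters, url_clusters = cluster_click_graph(clicks)
print(query_clusters)
print(url_clusters)
```

Note that "ir book" and "ir textbook" share no clicked URL, yet they end up together once a.com and b.com have been merged; this content-ignorant transitivity is the appeal of the method, and it also explains why sparse or noisy clicks hurt it.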
48
Problems with Beeferman's Approach

Users' clicks are not always reliable
Limited to the top search results
Very sparse matrix for training (terms x URLs)
High-frequency terms: too much noise
Low-frequency terms: hard to find relevant terms

49

Hidden Information to Know Users
Query Sessions: q1, q2, ...
Clicked Streams: Q, URL; Q, URL, Page Content
[Figure labels: Document Space, SE, Users, Authors, Taiwan University, Document Taxonomy, Information Use]
50

Organizing Query Space
[Figure labels: Document Space, Query Space, SE, Taiwan University, Query Taxonomy, Document Taxonomy, Users, Authors, Information Use, Information Need]
51

Organizing Query Space
[Figure labels: Document Space, Query Space, SE, Taiwan University, Query Taxonomy, Document Taxonomy, Users, Authors, Mapping, Information Use, Information Need]
52
Query Taxonomy (S. L. Chuang & L. F. Chien, ACM
TOIS 2005)

To know more about what users search for
Understand users' search interests and the
vocabulary they use, help users formulate queries,
complement document taxonomies
The number of queries is large, so hierarchical
clustering is a must
Query classification, query clustering and query
taxonomy have different values and applications
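
As a rough illustration of hierarchical clustering over queries, the Python sketch below runs plain average-link hierarchical agglomerative clustering (HAC) and returns a binary tree. It is not the method of the cited paper: the feature vectors are hypothetical hand-made term weights (in practice they might come from search-result snippets or query logs), and the sketch stops at a binary tree without any later partitioning into a multi-way taxonomy.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_link(a, b, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def hac(vectors):
    """Average-link HAC over a non-empty dict of vectors; returns a nested-tuple tree."""
    clusters = [(name, [name]) for name in sorted(vectors)]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]][1], clusters[p[1]][1], vectors),
        )
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical query feature vectors, for illustration only.
vectors = {
    "sina": {"portal": 1, "news": 1},
    "yahoo": {"portal": 1, "search": 1},
    "mp3": {"music": 2, "download": 1},
    "lyrics": {"music": 1, "song": 1},
}
tree = hac(vectors)
print(tree)
```

The naive pairwise search above is far too slow for a real query log and only shows the shape of the computation.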

53
Query Clustering Results
54
Query Taxonomy
HAC
55
Query Taxonomy
HAC+P
56
Concept-based Search via Query Taxonomy
57
Research at WDK, IIS NTU

LiveConcept
LiveImage
LiveTrans

63
LiveTrans
64
Discovered Knowledge
Relevant concepts
Concept classes
Relevant translations
[Figure labels: Web logs, texts, images, ...; Search Engine (concept search, image search, cross-language search, ...); Millions of Users]
65
Search Result Clustering
69
Topic Class
Topic
71
Term Extraction

Term Extraction
PAT-tree-based (SIGIR95)
Relevant Terms Finding
Query session log mining (JASIST 2002)
Anchor text mining
Search result page mining

72
Term Classification and Clustering

Term classification
Using seed query terms (DSS04)
Using Web corpora (WWW04, WI05)
Thematic Metadata Extraction (ACM TALIP04)
Term clustering (ICDM02)
Taxonomy generation (CIKM04, TOIS05)
Query taxonomy generation (OIR04)

73
Term Translation

Direct translation
Anchor text mining (COLING02, ACM TALIP02)
Search result page mining (ACL04, JASIST05)
Transitive translation (ICDM02, ACM TOIS04)
CLIR application (SIGIR05)

74
Summary

Trends of IR research
Effectiveness issues are still challenging
A lot of new topics appear
Lessons learned from Web search
The short-query problem and precision bias are tough
problems not well investigated in conventional
document-similarity-based approaches
Effective retrieval models should consider both
users' information needs and document authority
Improving IR via Web mining
Mining techniques have been applied to analyzing
document authority
Query space mining has great potential for
further investigation
