Mining the Web for Improving Information Retrieval Technology

Title: Mining the Web for Improving Information Retrieval Technology


1
Mining the Web for Improving Information
Retrieval Technology
  • Lee-Feng Chien
  • Academia Sinica

2
Web Search vs. IR
  • Web search has benefited a lot from IR
  • Fundamental indexing and searching techniques
    originated in IR research
  • "Search engine" is the popular term for Web IR systems
  • How can Web search continue to benefit from IR
    research?

3
Outline
  • Trends of IR research
  • Lessons learned from Web search
  • Improving IR via Web mining
  • Summary

4
I. Trends of IR Research
5
(No Transcript)
6
What Do People Want from IR? (Viewpoint of a
leading IR scientist in 1995)
  • 10. Relevance Feedback.
  • 9. Information Extraction.
  • 8. Multimedia Retrieval.
  • 7. Effective Retrieval.
  • 6. Routing and Filtering.
  • 5. Interfaces and Browsing.
  • 4. Vocabulary Expansion.
  • 3. Efficient, Flexible Indexing and Retrieval.
  • 2. Distributed IR.
  • 1. Integrated Solutions.

7
What do People Want from IR
  • 10. Relevance Feedback.
  • 9. Information Extraction.
  • 8. Multimedia Retrieval.
  • 7. Effective Retrieval.
  • 6. Routing and Filtering.
  • 5. Interfaces and Browsing.
  • 4. Vocabulary Expansion.
  • 3. Efficient, Flexible Indexing and Retrieval.
  • 2. Distributed IR.
  • 1. Integrated Solutions.

Efficiency Issues
8
What do People Want from IR
  • 10. Relevance Feedback.
  • 9. Information Extraction.
  • 8. Multimedia Retrieval.
  • 7. Effective Retrieval.
  • 6. Routing and Filtering.
  • 5. Interfaces and Browsing.
  • 4. Vocabulary Expansion.
  • 3. Efficient, Flexible Indexing and Retrieval.
  • 2. Distributed IR.
  • 1. Integrated Solutions.

Effectiveness Issues
9
What do People Want from IR
  • 10. Relevance Feedback.
  • 9. Information Extraction.
  • 8. Multimedia Retrieval.
  • 7. Effective Retrieval.
  • 6. Routing and Filtering.
  • 5. Interfaces and Browsing.
  • 4. Vocabulary Expansion.
  • 3. Efficient, Flexible Indexing and Retrieval.
  • 2. Distributed IR.
  • 1. Integrated Solutions.

Application Issues
10
Efficiency Issues
  • 1. Integrated Solutions
  • Complete solution requires integration with other
    systems, such as DBMS, office management, etc.
  • 2. Distributed IR
  • Search in a distributed environment that may
    contain hundreds or even thousands of databases,
    and merging the results.
  • 3. Efficient, Flexible Indexing and Retrieval

11
Effectiveness Issues
  • 4. Vocabulary Expansion
  • Users' information needs are often described using
    different words than those found in relevant
    documents.
  • One of the major causes of failures in IR systems
    is vocabulary mismatch.
  • 5. Interfaces and Browsing
  • Interfaces must support a range of functions
    including presentation of retrieved information.
  • The challenge is to present this sophisticated
    functionality in a conceptually simple way.
  • 7. Effective Retrieval
  • 10. Relevance Feedback

12
New IR Topics
  • Popular topics in the search industry
  • Web search, mobile search, email spam filtering,
    P2P search, spatially-aware IR, ...
  • Popular topics in other communities
  • Multimedia search (Multimedia)
  • Text mining (Data Mining, NLP applications)
  • Spoken information retrieval (Speech)
  • Federated search (Digital Library)
  • Popular topics in SIGIR
  • Language modeling for IR, question answering,
    cross-language IR, topic detection and tracking,
    search result clustering, ...

13
Trends of IR Research
  • Topics important in the search industry may not be as
    popular in SIGIR
  • Check whether a topic has a standard test bed
  • The effectiveness issues are still challenging
  • Lessons learned from Web search can serve as a reference

14
II. Lessons Learned from Web Search
15
Traditional IR vs. Web Search
  • New information sources
  • Document -> page, blog, Web image, ...
  • New media types
  • Text -> image, video, speech, music, map, ...
  • New infrastructures
  • Plain text file -> hypertext, P2P, semantic Web, ...
  • New applications
  • Crawler, email spam filter, MP3 search, mobile
    search, ...
  • Not all major impacts derived from
    breakthroughs in IR techniques

16
Facts of Web Search
  • Short query
  • English: 2.35 words (AltaVista, 1998)
  • Chinese: 3.55 characters (H. T. Pu, 1999)
  • Precision-biased
  • Users normally browse the first result page
    (Altavista, 1998)

17
Terms Per Query 1997-2001
Reference: Amanda Spink & Bernard J. Jansen
(2004). Web Search: Public Searching of the Web.
Springer.
18
Queries Per User 1997-2001

19
Pages Viewed Per User 1997-2001
20
Ranking via Document Similarity Is Ineffective
for Short Queries
(Figure labels: Document; Query: information retrieval; Similarity. Note: a huge number of pages on the Web contain the matched query terms.)
21
Short Query Retrieval
(Figure labels: Query Space; Doc Space; Document; Query: information retrieval; Similarity)
22
Users Need Document Authority
(Figure labels: Query Space; Doc Space; Document; Query: information retrieval; Similarity; Concept: IR book, IR systems, SIGIR Web sites; Authority: representative IR book)
23
(No Transcript)
24
DEF
25
Books
26
Tools
27
Hypotheses of Traditional IR
  • Query is long
  • TREC's average topic description length is 15
    terms
  • Evaluation considers both precision and recall
  • Average recall-precision value of the top 1000
    retrieved results
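
As a rough illustration of an evaluation that weighs both precision and recall, the sketch below computes precision and recall at a cutoff, plus non-interpolated average precision over the top k results. The function names, the cutoff parameter k, and the toy document IDs are assumptions made for illustration; the slides do not spell out TREC's exact procedure.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k results of a ranked list."""
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant, k=1000):
    """Mean of the precision values at each rank where a relevant doc appears.

    Relevant documents missing from the top k contribute zero, so the
    score reflects recall as well as precision.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranked = ["d3", "d7", "d1", "d9", "d4"]  # hypothetical system output
relevant = {"d3", "d1", "d8"}            # hypothetical relevance judgments
p_at_3, r_at_3 = precision_recall_at_k(ranked, relevant, 3)
ap = average_precision(ranked, relevant)
print(p_at_3, r_at_3, ap)
```

A Web user who reads only the first result page is effectively looking at precision with a small k, while the traditional setting averages over a much deeper ranking.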

28
Retrieval Models in Traditional IR
  • Retrieval models are mostly document-similarity-based
  • Queries and documents are represented as n-dimensional
    vectors, e.g., in the vector-space model.
  • Rank the documents according to their degree of
    similarity to the query
  • QA and language modeling approaches are mostly
    document-similarity-based

29
Vector Similarity
  • Cosine measure or normalized correlation
    coefficient
  • Euclidean Distance
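
The two measures named on this slide can be sketched as follows, using the standard definitions: cosine(u, v) = u·v / (|u| |v|) and Euclidean distance = sqrt(sum_i (u_i - v_i)^2). The term-weight vectors are made-up toy values, not data from the presentation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two term-weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy vectors over a three-term vocabulary (hypothetical weights).
query = [1, 1, 0]
doc_a = [2, 2, 0]  # same direction as the query, but longer
doc_b = [0, 1, 1]  # shares one term with the query

cos_a = cosine_similarity(query, doc_a)
cos_b = cosine_similarity(query, doc_b)
dist_a = euclidean_distance(query, doc_a)
dist_b = euclidean_distance(query, doc_b)
print(cos_a, cos_b, dist_a, dist_b)
```

Cosine ignores vector length, so doc_a matches the query perfectly by cosine even though its Euclidean distance to the query is the same as doc_b's.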

30
Using LM in IR
  • Principle 1
  • Document D: language model P(w | M_D)
  • Query Q: sequence of words q1, q2, ..., qn
    (uni-grams)
  • Matching: P(Q | M_D)
  • Principle 2
  • Document D: language model P(w | M_D)
  • Query Q: language model P(w | M_Q)
  • Matching: comparison between P(. | M_D) and P(. | M_Q)
  • Principle 3
  • Translate D to Q
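
A minimal sketch of the first two principles, under stated assumptions: unigram models estimated by relative frequency, linear interpolation with a collection model for smoothing (the slides name no smoothing method, so this choice and the weight lam are assumptions), and KL divergence as the model comparison for Principle 2. The two toy documents are invented for illustration.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: P(w) = count(w) / total."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_prob(word, doc_model, collection_model, lam):
    """Interpolate document and collection estimates (assumed smoothing)."""
    return lam * doc_model.get(word, 0.0) + (1 - lam) * collection_model.get(word, 0.0)


def query_log_likelihood(query, doc_model, collection_model, lam=0.5):
    """Principle 1: log P(Q | M_D) = sum_i log P(q_i | M_D)."""
    score = 0.0
    for word in query:
        p = smoothed_prob(word, doc_model, collection_model, lam)
        if p == 0.0:
            return float("-inf")  # word unseen in the whole collection
        score += math.log(p)
    return score


def kl_divergence(query_model, doc_model, collection_model, lam=0.5):
    """Principle 2: KL(M_Q || M_D); a smaller value means a closer match."""
    total = 0.0
    for word, pq in query_model.items():
        pd = smoothed_prob(word, doc_model, collection_model, lam)
        if pd == 0.0:
            return float("inf")
        total += pq * math.log(pq / pd)
    return total


docs = {  # hypothetical two-document collection
    "d1": "information retrieval on the web".split(),
    "d2": "web mining of query logs".split(),
}
collection_model = unigram_model([w for toks in docs.values() for w in toks])
doc_models = {name: unigram_model(toks) for name, toks in docs.items()}

query = ["information", "retrieval"]
ll_scores = {n: query_log_likelihood(query, m, collection_model) for n, m in doc_models.items()}
kl_scores = {n: kl_divergence(unigram_model(query), m, collection_model) for n, m in doc_models.items()}
print(ll_scores, kl_scores)
```

Both principles still rank by how well the document's text matches the query's words, which is why the slides group language modeling with the document-similarity-based approaches.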

31
New Retrieval Model Required
(Figure labels: Query Space; Doc Space; User Space; Author Space; Document; Query: information retrieval; Similarity; Document Authority; Information Needs)
32
Other Challenge Problems
  • Search result organization
  • Web search results are a long list of snippets;
    it is hard for users to browse them in a
    conceptual manner.
  • Scalability
  • All techniques used should scale with growth in
    data size and number of users
  • It is difficult for academic/university labs to
    establish the same environment

33
III. Improving IR via Web Mining
34
Web Mining
  • Web mining is the study of how to discover
    knowledge from diverse data resources on the Web
    to benefit Web information systems

(Figure labels: Mining; Web texts, images, logs; Search Engine; Behaviors of Millions of Users)
35
Mining Techniques
  • Mining the Web: Discovering Knowledge from
    Hypertext Data (S. Chakrabarti, 2003)
  • Web usage mining (R. Baeza-Yates)
  • Taxonomy of Web mining (R. Cooley)

36
Document Space Mining
  • PageRank
  • Focused crawler
  • Search result clustering

37
PageRank: Document Authority
  • It is a measure of a web page's citation
    importance that corresponds well with people's
    subjective idea of importance.
  • Given a document A, let C(A) be the number of links
    going out of A, let T1 .. Tn be the documents linking
    to A, and let d be a constant. Then the PageRank of A is
    PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
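
A minimal sketch of computing PageRank by repeated substitution, assuming the commonly cited form from Brin and Page's original description, PR(A) = (1 - d) + d * sum_i PR(Ti)/C(Ti), with C(T) the number of links going out of T. The three-page link graph, the value d = 0.85, and the iteration count are illustrative assumptions.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    out_count = {p: len(links.get(p, [])) for p in pages}  # C(p)
    inbound = {p: [] for p in pages}                       # pages linking to p
    for src, targets in links.items():
        for dst in targets:
            inbound[dst].append(src)

    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in inbound[p])
            for p in pages
        }
    return pr


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C collects rank from both A and B, so it ends up with the highest score even though no page content is examined, which is the sense in which PageRank measures authority, not similarity.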

38
Other Techniques Used in Document Space
  • Focused crawling for domain-specific information
  • CMU Cora (1998)
  • Search result clustering
  • Document clustering
  • Weak in comprehension
  • Term clustering
  • STC (Zamir, WWW99)
  • DisCover (Kummamuru, WWW04)
  • Salient phrases ranking (Zeng, SIGIR04)
  • Link-based clustering
  • Contents-Link (Wang, CIKM02)

39
Existing Clustering Engines
  • Problems exist, e.g., comprehension of clustered
    results, clustering complexity.

40
User Space Mining
  • Log analysis: relevant concepts
  • Query log, query session log, click-stream log
  • Query clustering
  • Query taxonomy generation

41

Challenge: Short Query
(Figure labels: Document Space; SE; Users; Authors; Taiwan University; Document Taxonomy; Information Use)
42

Hidden Information to Know Users
(Figure labels: Document Space; SE; Users; Authors; Taiwan University; Document Taxonomy; Information Use. Hidden information: query sessions q1, q2, ...)
43
(No Transcript)
44
(No Transcript)
45
Log-based vs. Document-based
Example for the term Sina (English glosses):
  • Query-session-based: yahoo, yam, Yahoo, tomail, pchome, free email, Yahoo Chinese, Kimo, search engine
  • Document-based: copyright, IP rights, chat, news, chat room, personal finance, banking, mail box
46

Hidden Information to Know Users
(Figure labels: Document Space; SE; Users; Authors; Taiwan University; Document Taxonomy; Information Use. Hidden information: query sessions q1, q2, ...; clicked streams of query and URL)
47
Clustering of Queries and URLs
  • Beeferman and Berger (KDD2000)
  • Goal
  • Clustering queries and corresponding URLs.
  • 500,000 click-through records from Lycos search
    engine.
  • Results
  • Apply bipartite-graph-based iterative clustering
  • Many Web pages are diverse in query terms
  • Huge log is necessary for clustering all queries
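
The sketch below is a simplified illustration in the spirit of bipartite-graph-based iterative clustering, not a reproduction of Beeferman and Berger's implementation. It assumes click-through records are (query, URL) pairs, measures similarity between two nodes on the same side by the Jaccard overlap of their neighbor sets, and alternates between merging the most similar pair of queries and the most similar pair of URLs until no pair exceeds a threshold. The similarity measure, the threshold, and the toy records are assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two neighbor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_best(side, other, threshold):
    """Merge the most similar pair of clusters on one side of the graph.

    side maps each cluster (a tuple of members) to its set of neighboring
    clusters; other is the reverse adjacency and is updated in place.
    Returns True if a merge happened.
    """
    best, best_pair = threshold, None
    for x, y in combinations(sorted(side), 2):
        sim = jaccard(side[x], side[y])
        if sim > best:
            best, best_pair = sim, (x, y)
    if best_pair is None:
        return False
    x, y = best_pair
    merged = x + y
    neighbours = side.pop(x) | side.pop(y)
    side[merged] = neighbours
    for n in neighbours:  # re-point neighbors at the merged cluster
        other[n] -= {x, y}
        other[n].add(merged)
    return True


def cluster_clicks(clicks, threshold=0.0):
    """Alternately merge queries and URLs until nothing is similar enough."""
    queries, urls = {}, {}
    for q, u in clicks:
        queries.setdefault((q,), set()).add((u,))
        urls.setdefault((u,), set()).add((q,))
    while True:
        merged_q = merge_best(queries, urls, threshold)
        merged_u = merge_best(urls, queries, threshold)
        if not (merged_q or merged_u):
            break
    return sorted(queries), sorted(urls)


clicks = [  # hypothetical click-through records
    ("sina", "portal.example"),
    ("sina mail", "portal.example"),
    ("sina mail", "mail.example"),
    ("free email", "mail.example"),
    ("ir book", "books.example"),
]
q_clusters, u_clusters = cluster_clicks(clicks)
print(q_clusters)
print(u_clusters)
```

Note that "sina" and "free email" share no clicked URL at the start; they end up together only after the two URLs they lead to have been merged, which is what the alternation buys. Each merge step here scans all pairs, so a log of the size mentioned on the slide would need a far more efficient implementation.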

Queries
URLs
48
Problems of Beeferman's Approach
  • Users' clicks are not always reliable
  • Limited to the top search results
  • Very sparse matrix for training (terms × URLs)
  • High-frequency terms: too much noise
  • Low-frequency terms: hard to find relevant terms

49

Hidden Information to Know Users
(Figure labels: Document Space; SE; Users; Authors; Taiwan University; Document Taxonomy; Information Use. Hidden information: query sessions q1, q2, ...; clicked streams of query and URL; query, URL, and page content)
50

Organizing Query Space
(Figure labels: Document Space; Query Space; SE; Taiwan University; Query Taxonomy; Document Taxonomy; Users; Authors; Information Use; Information Need)
51

Organizing Query Space
(Figure labels: Document Space; Query Space; SE; Taiwan University; Query Taxonomy; Document Taxonomy; Users; Authors; Mapping; Information Use; Information Need)
52
Query Taxonomy (S. L. Chuang & L. F. Chien, ACM
TOIS 2005)
  • To know more about what users search for
  • Understand users' search interests and the
    vocabularies they use, help users formulate queries,
    complement document taxonomies
  • The number of queries is large, so hierarchical
    clustering is a must
  • Query classification, query clustering and query
    taxonomy have different values and applications

53
Query Clustering Results
54
Query Taxonomy
HAC
55
Query Taxonomy
HACP
56
Concept-based Search via Query Taxonomy
57
Research at WDK, IIS NTU
  • LiveConcept
  • LiveImage
  • LiveTrans

58
(No Transcript)
59
(No Transcript)
60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
LiveTrans
64
Discovered Knowledge
  • Relevant concepts
  • Concept classes
  • Relevant translations
(Figure labels: Web logs, texts, images, ...; Search Engine: concept search, image search, cross-language search, ...; Millions of Users)
65
Search Result Clustering
66
(No Transcript)
67
(No Transcript)
68
(No Transcript)
69
Topic Class
Topic
70
(No Transcript)
71
Term Extraction
  • Term Extraction
  • PAT-tree-based (SIGIR95)
  • Relevant Terms Finding
  • Query session log mining (JASIST 2002)
  • Anchor text mining
  • Search result page mining

72
Term Classification & Clustering
  • Term classification
  • Using seed query terms (DSS04)
  • Using Web corpora (WWW04, WI05)
  • Thematic Metadata Extraction (ACM TALIP04)
  • Term clustering (ICDM02)
  • Taxonomy generation (CIKM04, TOIS05)
  • Query taxonomy generation (OIR04)

73
Term Translation
  • Direct translation
  • Anchor text mining (COLING02, ACM TALIP02)
  • Search result page mining (ACL04, JASIST05)
  • Transitive translation (ICDM02, ACM TOIS04)
  • CLIR application (SIGIR05)

74
Summary
  • Trends of IR research
  • Effectiveness issues are still challenging
  • A lot of new topics appear
  • Lessons learned from Web search
  • The short-query problem and precision bias are tough
    problems not well investigated in conventional
    document-similarity-based approaches
  • Effective retrieval models should consider both
    users' information needs and document authority
  • Improving IR via Web mining
  • Mining techniques have been applied to analyzing
    document authority
  • Query space mining has great potential for further
    investigation

75
Q&A
  • Thank You!