Title: Mining the Web for Improving Information Retrieval Technology
1Mining the Web for Improving Information
Retrieval Technology
- Lee-Feng Chien
- Academia Sinica
2Web Search vs. IR
- Web search has benefited a lot from IR
- Fundamental indexing and searching techniques originated from IR research
- "Search engine" is a popular term for Web IR systems
- How can Web search continue to benefit from IR research?
3Outline
- Trends of IR research
- Lessons learned from Web search
- Improving IR via Web mining
- Summary
4I. Trends of IR Research
6What do People Want from IR (Viewpoint of a leading IR scientist in 1995)
- 10. Relevance Feedback.
- 9. Information Extraction.
- 8. Multimedia Retrieval.
- 7. Effective Retrieval.
- 6. Routing and Filtering.
- 5. Interfaces and Browsing.
- 4. Vocabulary Expansion.
- 3. Efficient, Flexible Indexing and Retrieval.
- 2. Distributed IR.
- 1. Integrated Solutions.
7What do People Want from IR
Efficiency Issues
8What do People Want from IR
Effectiveness Issues
9What do People Want from IR
Application Issues
10Efficiency Issues
- 1. Integrated Solutions
- A complete solution requires integration with other systems, such as DBMS, office management, etc.
- 2. Distributed IR
- Searching in a distributed environment that may contain hundreds or even thousands of databases, and merging the results.
- 3. Efficient, Flexible Indexing and Retrieval
11Effectiveness Issues
- 4. Vocabulary Expansion
- Users' information needs are often described using different words than those found in relevant documents.
- One of the major causes of failure in IR systems is vocabulary mismatch.
- 5. Interfaces and Browsing
- Interfaces must support a range of functions, including presentation of retrieved information.
- The challenge is to present this sophisticated functionality in a conceptually simple way.
- 7. Effective Retrieval
- 10. Relevance Feedback
12New IR Topics
- Popular topics in the search industry
- Web search, mobile search, email spam filtering, P2P search, spatially-aware IR, ...
- Popular topics in other communities
- Multimedia search (Multimedia)
- Text mining (Data Mining, NLP applications)
- Spoken information retrieval (Speech)
- Federated search (Digital Library)
- Popular topics in SIGIR
- Language modeling for IR, question answering, cross-language IR, topic detection and tracking, search result clustering, ...
13Trends of IR Research
- Topics important in the search industry may not be as popular in SIGIR
- It depends on whether a topic has a standard test bed
- The effectiveness issues are still challenging
- Lessons learned from Web search can serve as a reference
14II. Lessons Learned from Web Search
15Traditional IR vs. Web Search
- New information sources
- Document -> page, blog, Web image, ...
- New media types
- Text -> image, video, speech, music, map, ...
- New infrastructures
- Plain text file -> hypertext, P2P, semantic Web, ...
- New applications
- Crawler, email spam filter, MP3 search, mobile search, ...
- Major impacts were not all derived from breakthroughs in IR techniques
16Facts of Web Search
- Short queries
- English: 2.35 words (Altavista, 1998)
- Chinese: 3.55 characters (H. T. Pu, 1999)
- Precision-biased
- Users normally browse the first result page (Altavista, 1998)
17Terms Per Query 1997-2001
Reference: Amanda Spink & Bernard J. Jansen (2004). Web Search: Public Searching of the Web. Springer.
18Queries Per User 1997-2001
19Pages Viewed Per User 1997-2001
20Ranking via Document Similarity is Ineffective for Short Queries
[Figure labels: Document; Query ("information retrieval"); Similarity. A huge number of pages on the Web contain the matched query terms.]
21Short Query Retrieval
[Figure labels: Query Space; Doc Space; Document; Query ("information retrieval"); Similarity.]
22Users Need Document Authority
[Figure labels: Query Space; Doc Space; Document; Query ("information retrieval"); Similarity. Concept: IR book, IR systems, SIGIR Web sites. Authority: representative IR book.]
24DEF
25Books
26Tools
27Hypotheses of Traditional IR
- Queries are long
- TREC's average topic description length is 15 terms
- Evaluation considers both precision and recall
- Average recall-precision value of the top 1000 retrieved results
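The evaluation measures named above can be sketched as follows. This is a minimal illustration, assuming that "average recall-precision value" is read as non-interpolated average precision with a cutoff of 1000; the function names, document ids and relevance judgments are hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k entries of a ranked result list."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant, cutoff=1000):
    """Mean of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranked list and relevance judgments.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

p_at_2, r_at_2 = precision_recall_at_k(ranked, relevant, k=2)
ap = average_precision(ranked, relevant)
```

A precision-biased Web user mostly cares about `p_at_2`-style numbers for the first result page, whereas the traditional setting rewards systems that keep finding relevant documents deep in the list.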
28Retrieval Models in Traditional IR
- Retrieval models are mostly document-similarity-based
- Queries and documents are represented as n-dimensional vectors, e.g., in the vector-space model.
- Rank the documents according to their degree of similarity to the query
- QA and language modeling approaches are mostly document-similarity-based
29Vector Similarity
- Cosine measure or normalized correlation coefficient
- Euclidean distance
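The two measures can be sketched in a few lines. This is an illustrative example only, assuming queries and documents are already encoded as equal-length term-weight vectors; the toy vectors are hypothetical.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_q * norm_d)


def euclidean_distance(q, d):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))


# Hypothetical 3-term vocabulary; doc_a repeats the query terms twice.
query = [1.0, 1.0, 0.0]
doc_a = [2.0, 2.0, 0.0]
doc_b = [0.0, 1.0, 3.0]

cos_a = cosine_similarity(query, doc_a)
cos_b = cosine_similarity(query, doc_b)
dist_a = euclidean_distance(query, doc_a)
```

Cosine ignores vector length, so `doc_a` matches the query perfectly even though its Euclidean distance from the query is not zero. That length sensitivity is one reason the cosine measure is the more common choice for ranking.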
30Using LM in IR
- Principle 1
- Document D: language model P(w|M_D)
- Query Q: sequence of words q1, q2, ..., qn (unigrams)
- Matching: P(Q|M_D)
- Principle 2
- Document D: language model P(w|M_D)
- Query Q: language model P(w|M_Q)
- Matching: comparison between P(.|M_D) and P(.|M_Q)
- Principle 3
- Translate D to Q
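Principles 1 and 2 can be sketched with unigram models. This is a minimal illustration, not a reference implementation: the Jelinek-Mercer smoothing, the weight `lam`, the use of KL divergence as the "comparison", and the toy collection are all assumptions that the slide does not specify.

```python
import math
from collections import Counter


def make_doc_model(doc_tokens, coll_counts, coll_size, lam=0.5):
    """Return a function w -> P(w | M_D), smoothed against the collection.

    Jelinek-Mercer interpolation with weight `lam` is an assumption of this
    sketch; the slide does not name a smoothing method.
    """
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)

    def prob(word):
        p_doc = counts[word] / doc_len if doc_len else 0.0
        p_coll = coll_counts[word] / coll_size
        return lam * p_doc + (1.0 - lam) * p_coll

    return prob


def query_log_likelihood(query_tokens, doc_model):
    """Principle 1: log P(Q | M_D) = sum_i log P(q_i | M_D)."""
    return sum(math.log(doc_model(q)) for q in query_tokens)


def kl_divergence(query_tokens, doc_model):
    """Principle 2: KL(P(. | M_Q) || P(. | M_D)), with M_Q estimated by
    maximum likelihood from the query; lower means a closer match."""
    q_counts = Counter(query_tokens)
    q_len = len(query_tokens)
    return sum(
        (c / q_len) * math.log((c / q_len) / doc_model(w))
        for w, c in q_counts.items()
    )


# Toy collection (hypothetical data). Every query word must occur somewhere
# in the collection, otherwise its smoothed probability is zero.
docs = {
    "d1": "information retrieval systems rank documents".split(),
    "d2": "web mining discovers knowledge from web logs".split(),
}
coll_counts = Counter(w for tokens in docs.values() for w in tokens)
coll_size = sum(coll_counts.values())
query = ["information", "retrieval"]

models = {
    name: make_doc_model(tokens, coll_counts, coll_size)
    for name, tokens in docs.items()
}
likelihood = {name: query_log_likelihood(query, m) for name, m in models.items()}
divergence = {name: kl_divergence(query, m) for name, m in models.items()}
```

Both scores rank `d1` ahead of `d2` for this query. With a two-word query there is very little evidence to separate the many documents that contain both words, which is the short-query problem raised earlier in the talk.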
31New Retrieval Model Required
[Figure labels: Query Space; Doc Space; User Space; Author Space; Document; Query ("information retrieval"); Similarity; Document Authority; Information Needs.]
32Other Challenging Problems
- Search result organization
- Web search results are a long list of snippets; it is hard for users to browse the results in a conceptual manner.
- Scalability
- All techniques used should scale with increases in data size and number of users
- It is difficult for academic/university labs to establish the same environment
33III. Improving IR via Web Mining
34Web Mining
- Web mining is the study of how to discover knowledge from diverse data resources on the Web and benefit Web information systems
[Figure labels: Mining; Web texts, images, logs; Search Engine; Behaviors of Millions of Users.]
35Mining Techniques
- Mining the Web: Discovering Knowledge from Hypertext Data (S. Chakrabarti, 2003)
- Web usage mining (R. Baeza-Yates)
- Taxonomy of Web mining (R. Cooley)
36Document Space Mining
- Page rank
- Focused crawler
- Search result clustering
37PageRank: Document Authority
- It is a measure of a web page's citation importance that corresponds well with people's subjective idea of importance.
- Given a document A, let C(A) be the number of links going out of A, let T1 .. Tn be the documents linking to A, and let d be a constant. Then the PageRank of A is
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
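The definition above can be evaluated by simple fixed-point iteration. The sketch below is illustrative only: it uses the form PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), and the value d = 0.85, the iteration count, and the three-page link graph are assumptions chosen for the example.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to, so C(T) is
    len(links[T]). Every page in this sketch is assumed to have at least
    one outgoing link.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    inbound = {p: [] for p in pages}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)

    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / len(links[t]) for t in inbound[p])
            for p in pages
        }
    return pr


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Page C receives links from both A and B and ends up with the highest score, while B, cited only by A (which splits its vote in two), ends up lowest. The score reflects who links to a page, not how well its text matches a query.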
38Other Techniques Used in Document Space
- Focused crawling for domain-specific information
- CMU Cora (1998)
- Search result clustering
- Document clustering
- Weak in comprehension
- Term clustering
- STC (Zamir, WWW99)
- DisCover (Kummamuru, WWW04)
- Salient phrases ranking (Zeng, SIGIR04)
- Link-based clustering
- Contents-Link (Wang, CIKM02)
39Existing Clustering Engines
- Problems exist, e.g., comprehension of clustered
results, clustering complexity.
40User Space Mining
- Log analysis relevant concepts
- Query log, query session log, clicked stream log
- Query clustering
- Query taxonomy generation
41Challenge: Short Query
[Figure labels: Document Space; SE; Users; Authors; query "Taiwan University"; Document Taxonomy; Information Use.]
42Hidden Information to Know Users
[Figure labels: Document Space; SE; Users; Authors; query "Taiwan University"; Document Taxonomy; Query Sessions: q1, q2, ...; Information Use.]
45Log-based vs. Document-based
Query-Session-Based
Sina: yahoo, yam, Yahoo, tomail, pchome, free email, Yahoo Chinese, Kimo, search engine
Document-Based
Sina: copyright, IP rights, chat, news, chat room, personal finance, banking, mail box
46Hidden Information to Know Users
[Figure labels: Document Space; SE; Users; Authors; query "Taiwan University"; Document Taxonomy; Query Sessions: q1, q2, ...; Clicked Streams: Q, URL; Information Use.]
47Clustering of Queries and URLs
- Beeferman and Berger (KDD 2000)
- Goal
- Clustering queries and corresponding URLs.
- 500,000 click-through records from the Lycos search engine.
- Results
- Apply bipartite-graph-based iterative clustering
- Many Web pages are diverse in query terms
- A huge log is necessary for clustering all queries
[Figure labels: Queries; URLs.]
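The bipartite-graph-based iterative clustering can be sketched as follows. This is a simplified illustration and not the authors' implementation: Jaccard overlap of neighbor sets as the similarity, the stopping threshold, and the toy click-through pairs are assumptions made for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two neighbor sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_pair(neighbors, threshold):
    """Most similar pair of clusters on one side, or None if none beats threshold."""
    best, best_sim = None, threshold
    for x, y in combinations(sorted(neighbors), 2):
        sim = jaccard(neighbors[x], neighbors[y])
        if sim > best_sim:
            best, best_sim = (x, y), sim
    return best


def merge(side, other, x, y):
    """Merge clusters x and y on one side and relabel them on the other side."""
    merged = x + y  # clusters are tuples of member names
    side[merged] = side.pop(x) | side.pop(y)
    for node in side[merged]:
        other[node] -= {x, y}
        other[node].add(merged)


def cluster_clicks(clicks, threshold=0.0):
    """Alternately merge the most similar queries and the most similar URLs."""
    q_nbrs, u_nbrs = {}, {}
    for query, url in clicks:
        q, u = (query,), (url,)
        q_nbrs.setdefault(q, set()).add(u)
        u_nbrs.setdefault(u, set()).add(q)
    while True:
        merged_any = False
        for side, other in ((q_nbrs, u_nbrs), (u_nbrs, q_nbrs)):
            pair = best_pair(side, threshold)
            if pair:
                merge(side, other, *pair)
                merged_any = True
        if not merged_any:
            break
    return sorted(q_nbrs), sorted(u_nbrs)


# Hypothetical (query, clicked URL) records.
clicks = [
    ("ir book", "u1"),
    ("information retrieval", "u1"),
    ("information retrieval", "u2"),
    ("sigir", "u2"),
    ("mp3 search", "u3"),
]
query_clusters, url_clusters = cluster_clicks(clicks)
```

Note that "ir book" and "sigir" share no clicked URL directly. They end up together only because merging on one side of the graph changes the similarities on the other side, which is why the procedure alternates between the two sides.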
48Problems of Beeferman's Approach
- Users' clicks are not always reliable
- Limited to the top search results
- Very sparse matrix for training (terms x URLs)
- High-frequency terms: too much noise
- Low-frequency terms: hard to find relevant terms
49Hidden Information to Know Users
[Figure labels: Document Space; SE; Users; Authors; query "Taiwan University"; Document Taxonomy; Query Sessions: q1, q2, ...; Clicked Streams: Q, URL; Q, URL, Page Content; Information Use.]
50Organizing Query Space
[Figure labels: Document Space; Query Space; SE; query "Taiwan University"; Query Taxonomy; Document Taxonomy; Users; Authors; Information Use; Information Need.]
51Organizing Query Space
[Figure labels: Document Space; Query Space; SE; query "Taiwan University"; Query Taxonomy; Document Taxonomy; Users; Authors; Mapping; Information Use; Information Need.]
52Query Taxonomy (S. L. Chuang & L. F. Chien, ACM TOIS 2005)
- To know more about what users search for
- Understand users' search interests and vocabularies, help users formulate queries, complement document taxonomies
- The number of queries is large, and hierarchical clustering is a must
- Query classification, query clustering and query taxonomy have different values and applications
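Hierarchical agglomerative clustering (HAC) of queries can be sketched as below. This is a generic average-link HAC under stated assumptions, not the method of the cited paper: the sparse term-weight vectors describing each query, the similarity threshold, and the example queries are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [cosine(vectors[x], vectors[y]) for x in c1 for y in c2]
    return sum(sims) / len(sims)


def hac(vectors, min_similarity=0.3):
    """Merge the closest pair of clusters until none is similar enough.

    Returns the final clusters and the ordered merge history, which
    encodes the hierarchy from the bottom up.
    """
    clusters = [[name] for name in sorted(vectors)]
    merges = []
    while len(clusters) > 1:
        best_sim, best_i, best_j = -1.0, None, None
        for i, j in combinations(range(len(clusters)), 2):
            sim = average_link(clusters[i], clusters[j], vectors)
            if sim > best_sim:
                best_sim, best_i, best_j = sim, i, j
        if best_sim < min_similarity:
            break
        merges.append((clusters[best_i], clusters[best_j], best_sim))
        merged = clusters[best_i] + clusters[best_j]
        clusters = [
            c for k, c in enumerate(clusters) if k not in (best_i, best_j)
        ] + [merged]
    return clusters, merges


# Hypothetical query feature vectors (e.g. weights of co-occurring terms).
query_vectors = {
    "yahoo": {"portal": 1.0, "search": 1.0},
    "sina": {"portal": 1.0, "news": 1.0},
    "free email": {"mail": 1.0, "account": 1.0},
    "mail box": {"mail": 1.0, "inbox": 1.0},
}
clusters, merges = hac(query_vectors)
```

The pairwise search makes each merge step quadratic in the number of clusters, so a sketch like this only suits small query sets; a real query log would need a more scalable scheme.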
53Query Clustering Results
54Query Taxonomy
HAC
55Query Taxonomy
HACP
56Concept-based Search via Query Taxonomy
57Research at WDK, IIS NTU
- LiveConcept
- LiveImage
- LiveTrans
63LiveTrans
64Discovered Knowledge
[Figure labels: relevant concepts, concept classes, relevant translations; Web logs, texts, images, ...; Search Engine (concept search, image search, cross-language search, ...); Millions of Users.]
65Search Result Clustering
69Topic Class
Topic
71Term Extraction
- Term Extraction
- PAT-tree-based (SIGIR95)
- Relevant Terms Finding
- Query session log mining (JASIST 2002)
- Anchor text mining
- Search result page mining
72Term Classification and Clustering
- Term classification
- Using seed query terms (DSS04)
- Using Web corpora (WWW04, WI05)
- Thematic Metadata Extraction (ACM TALIP04)
- Term clustering (ICDM02)
- Taxonomy generation (CIKM04, TOIS05)
- Query taxonomy generation (OIR04)
73Term Translation
- Direct translation
- Anchor text mining (COLING02, ACM TALIP02)
- Search result page mining (ACL04, JASIST05)
- Transitive translation (ICDM02, ACM TOIS04)
- CLIR application (SIGIR05)
74Summary
- Trends of IR research
- Effectiveness issues are still challenging
- A lot of new topics appear
- Lessons learned from Web search
- The short query problem and precision bias are tough problems not well investigated in conventional document-similarity-based approaches
- Effective retrieval models should consider both users' information needs and document authority
- Improving IR via Web mining
- Mining techniques have been applied to analyzing document authority
- Query space mining has great potential to be investigated
75Q&A