Semantic, Hierarchical, Online Clustering of Web Search Results - PowerPoint PPT Presentation

About This Presentation

Title:

Semantic, Hierarchical, Online Clustering of Web Search Results

Description:

Title: Clustering Web Search Results Author: Iwona Bialynicka-Birula Last modified by: AILAB Created Date: 4/5/2004 9:34:13 AM Document presentation format – PowerPoint PPT presentation

Slides: 18

Provided by: Iwona1

Transcript and Presenter's Notes

Title: Semantic, Hierarchical, Online Clustering of Web Search Results

1
Semantic, Hierarchical, Online Clustering of Web
Search Results

Yisheng Dong

2
Overview

Introduction
Previous Related Work
SHOC Approach
Prototype System
Conclusion

3
Introduction

Motivation
The Web is the biggest data source.
Search engines are the most commonly used tools for
Web information retrieval.
Their current state is far from satisfactory.
Solution
Clustering Web search results would help
considerably.
SHOC can generate clusters that are both reasonable
and readable.

4
Basic requirements (clustering approach for Web
search results)

Semantic
Each cluster should correspond to a concept.
Avoid confining each Web page to only one cluster.
Each cluster's label should describe its topic well.
Hierarchical
A tree structure that is easy to browse by eye.
Takes advantage of the relationships between
clusters.
Online
Provides fresh clustering results just in time.

5
Previous Related Work

Scatter/Gather system
Uses a traditional heuristic clustering algorithm.
It has some limitations.
Hyperlink-based approaches
Need to download and parse the original Web pages.
Cannot cluster immediately.
STC (Suffix Tree Clustering)
Not appropriate for Oriental languages.
Extracts many meaningless partial phrases.
Synonymy and polysemy are not considered.

6
SHOC steps

Data acquisition
Data cleaning
Feature extraction
Identifying base clusters
Combining base clusters

7
Data acquisition

The data acquisition task here is actually
meta-search.
Uses a two-level parallelization mechanism:
Call several search engines simultaneously.
Fetch all of each engine's search results simultaneously.
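
A minimal Python sketch of the two-level parallelization described above, under stated assumptions: `fetch_page`, `fetch_engine`, `meta_search`, and the engine names are hypothetical, and the stub returns fake result strings instead of issuing real HTTP requests. It only illustrates the shape of the idea (engines in parallel, and each engine's result pages in parallel); it is not the WICE implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(engine, query, page):
    # Hypothetical stub: a real system would request one result page here.
    return [f"{engine}:{query}:p{page}:r{k}" for k in range(2)]


def fetch_engine(engine, query, pages):
    # Level 2: fetch all result pages of one engine in parallel.
    with ThreadPoolExecutor(max_workers=pages) as pool:
        chunks = list(pool.map(lambda p: fetch_page(engine, query, p),
                               range(1, pages + 1)))
    return [hit for chunk in chunks for hit in chunk]


def meta_search(engines, query, pages=3):
    # Level 1: call all engines in parallel.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        per_engine = list(pool.map(lambda e: fetch_engine(e, query, pages),
                                   engines))
    return [hit for hits in per_engine for hit in hits]


results = meta_search(["engine_a", "engine_b"], "object oriented", pages=3)
```

Threads are used here because the work is I/O-bound; `pool.map` keeps results in submission order, so the merged list is deterministic.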

8
Data cleaning

Sentence boundaries are identified via the
following:
punctuation marks (e.g. '.', ',', '?', etc.)
HTML tags (e.g. <p>, <br>, <li>, <td>, etc.)
Non-word tokens (e.g. punctuation marks and HTML
tags) are stripped.
Redundant spaces are compressed.
A stemming algorithm may be applied (for English
text).
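
The cleaning steps above can be sketched in Python roughly as follows. The regular expressions, the exact set of boundary punctuation, and the function name `clean` are assumptions made for illustration; stemming is left out.

```python
import re

# Sentence boundaries: block-level HTML tags and punctuation marks (assumed set).
BOUNDARY = re.compile(r"<\s*/?\s*(?:p|br|li|td)\b[^>]*>|[.,;?!]", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")      # any remaining HTML tag
NON_WORD = re.compile(r"[^\w\s]")  # any remaining non-word character


def clean(snippet):
    sentences = []
    for part in BOUNDARY.split(snippet):
        part = TAG.sub(" ", part)        # strip leftover HTML tags
        part = NON_WORD.sub(" ", part)   # strip non-word tokens
        part = " ".join(part.split())    # compress redundant spaces
        if part:
            sentences.append(part)
    return sentences


cleaned = clean("<p>Object  oriented programming.</p>"
                "<li>Design <b>patterns</b>, explained</li>")
```

Inline tags such as `<b>` are removed without ending the sentence, while `<p>` and `<li>` act as boundaries.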

9
Feature extraction (Overview)

Words
Most clustering algorithms treat a document as a
bag of words,
ignoring word order and proximity.
Key phrases
Advantages
Improve the quality of the clusters.
Useful in constructing labels.
Data structures (key phrase discovery)
Suffix tree
Performance is related to the alphabet size of the language.
Suffix array
Scales well with alphabet size.

10
Feature extraction (key phrase discovery)

Completeness
Left-completeness
Right-completeness
Stability (Mutual Information)
S = c_1 c_2 ... c_p, S_L = c_1 ... c_{p-1}, S_R = c_2 ... c_p
Significance
se(S) = freq(S) × g(|S|)
g(x) = 0 (x ≤ 1)
g(x) = log2 x (2 ≤ x ≤ 8)
g(x) = 3 (x > 8)
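
To make completeness and significance concrete, here is a brute-force Python sketch. It assumes the usual reading of completeness: a repeated substring is left-complete if its occurrences are not all preceded by the same token, and right-complete if they are not all followed by the same token (the text boundary counts as a distinct token). Tokens are single characters here purely for illustration, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def complete_substrings(text):
    """Map each repeated, left- and right-complete substring to its frequency."""
    n = len(text)
    starts = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts[text[i:j]].append(i)
    found = {}
    for s, pos in starts.items():
        if len(pos) < 2:
            continue  # not repeated
        left = {text[p - 1] if p > 0 else None for p in pos}
        right = {text[p + len(s)] if p + len(s) < n else None for p in pos}
        if len(left) > 1 and len(right) > 1:
            found[s] = len(pos)
    return found


def g(x):
    # Length weight from the slide: 0, log2(x), or capped at 3.
    if x <= 1:
        return 0.0
    if x <= 8:
        return math.log2(x)
    return 3.0


def significance(phrase, freq):
    return freq * g(len(phrase))


cs = complete_substrings("to_be_or_not_to_be")
scores = {s: significance(s, f) for s, f in cs.items()}
```

Single-token candidates score 0 because g(1) = 0, so of the four complete substrings in this example only "to_be" receives a positive significance.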

11
Feature extraction (Suffix array)

Suffix array
An array of all N suffixes, sorted alphabetically
LCP (Longest Common Prefix)
Used to accelerate searching in the text
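
A naive Python sketch of both structures for the example string "to_be_or_not_to_be". It sorts suffixes directly, which is quadratic or worse and only suitable for short snippets; a production system would use a dedicated construction algorithm. Function names are illustrative, and `lcp[r]` is taken to be the longest common prefix of the suffixes at sorted ranks r and r+1 (0-based).

```python
def suffix_array(text):
    # Start positions of all suffixes, sorted alphabetically by suffix.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    def common(a, b):
        k = 0
        while max(a, b) + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    # lcp[r] compares the suffixes at sorted ranks r and r + 1.
    return [common(sa[r], sa[r + 1]) for r in range(len(sa) - 1)]


text = "to_be_or_not_to_be"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
rows = [(rank + 1, text[start:]) for rank, start in enumerate(sa)]
```

With this ordering the first sorted suffix is "_be" and the last is the whole string.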

[Figure: suffix array and lcp of "to_be_or_not_to_be"]

12
Feature extraction (Discover rcs)

void discover_rcs() {
    typedef struct {
        int ID;
        int frequency;
    } RCSTYPE;
    RCSTYPE rcs_stack[N];   // N is the document's length
    Initialize rcs_stack;
    int sp = -1;            // the stack pointer
    int i = 1;
    while (i < N+1) {
        if (sp < 0) {       // the stack is empty
            if (lcp[i] > 0) {
                sp++;
                rcs_stack[sp].ID = i;
                rcs_stack[sp].frequency = 2;
            }
            i++;
        } else {
            int r = rcs_stack[sp].ID;
            if (lcp[r] < lcp[i]) {
                sp++;
                rcs_stack[sp].ID = i;
                rcs_stack[sp].frequency = 2;
                i++;
            } else if (lcp[r] == lcp[i]) {
                rcs_stack[sp].frequency++;
                i++;
            } else {
                Output rcs_stack[sp];   // ID and frequency
                int f = rcs_stack[sp].frequency;
                sp--;
                if (sp >= 0) {
                    rcs_stack[sp].frequency =
                        rcs_stack[sp].frequency + f - 1;
                }
            }
        }
    }
}
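
A runnable Python sketch of the same stack scan over the lcp array, using the example string "to_be_or_not_to_be". It is not a literal port. One adjustment is an assumption of this sketch: when an entry is popped and the current lcp value is still larger than that of the entry below it (or the stack is empty), the new entry starts with the popped frequency plus one rather than 2, so that the suffixes already seen are still counted. Without that, "_" would be counted 4 times instead of 5 in this example. The suffix array is built naively and all names are illustrative.

```python
def adjacent_lcp(text):
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    def common(a, b):
        k = 0
        while max(a, b) + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    return sa, [common(sa[r], sa[r + 1]) for r in range(len(sa) - 1)]


def discover_rcs(text):
    sa, lcp = adjacent_lcp(text)
    vals = [0] + lcp + [0]  # 1-based like the slide; trailing 0 flushes the stack
    n = len(text)
    stack, out = [], []     # stack entries are [ID, frequency]
    i = 1
    while i <= n:
        top = vals[stack[-1][0]] if stack else 0
        if vals[i] > top:            # a longer repeated prefix starts here
            stack.append([i, 2])
            i += 1
        elif vals[i] == top:         # one more suffix shares the top prefix
            if stack:
                stack[-1][1] += 1
            i += 1
        else:                        # the top prefix ends before rank i + 1
            rid, freq = stack.pop()
            out.append((rid, freq))
            parent = vals[stack[-1][0]] if stack else 0
            if stack and parent >= vals[i]:
                stack[-1][1] += freq - 1     # fold the child into its parent
            elif vals[i] > parent:
                stack.append([i, freq + 1])  # assumed fix: carry the count
                i += 1
    return sorted((rid, freq, text[sa[rid - 1]:sa[rid - 1] + vals[rid]])
                  for rid, freq in out)


rcs = discover_rcs("to_be_or_not_to_be")
```

Each result is an (ID, frequency, RCS) triple, where ID is the 1-based rank in the suffix array at which the phrase was opened.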

13
Feature extraction (Intersect lcs_rcs)

void intersect_lcs_rcs(sorted lcs array, sorted rcs array) {
    int i = 0, j = 0;
    while (i < L && j < R) {
        string str_l = the LCS denoted by lcs[i].ID;
        string str_r = the RCS denoted by rcs[j].ID;
        if (str_l == str_r) {
            Output lcs[i];
            i++;
            j++;
        }
        if (str_l < str_r)
            i++;
        if (str_l > str_r)
            j++;
    }
}

rcs array
ID    frequency    RCS
1     2            _be
2     5            _
6     2            be
8     2            e
11    2            o_be
12    4            o
16    3            t
17    2            to_be

cs array
ID    frequency    CS
2     5            _
12    4            o
16    3            t
17    2            to_be
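
The intersection step is a two-pointer merge over two sorted lists. In this Python sketch each entry is a (phrase, frequency) pair rather than an ID into a suffix array, which is a simplification. The rcs entries are those of the example string "to_be_or_not_to_be"; the lcs entries are not given on the slides and were worked out by hand for the same string, so treat them as an assumption.

```python
def intersect_lcs_rcs(lcs, rcs):
    lcs, rcs = sorted(lcs), sorted(rcs)  # both lists must be in the same order
    i = j = 0
    complete = []
    while i < len(lcs) and j < len(rcs):
        str_l, str_r = lcs[i][0], rcs[j][0]
        if str_l == str_r:        # both left- and right-complete
            complete.append(lcs[i])
            i += 1
            j += 1
        elif str_l < str_r:
            i += 1
        else:
            j += 1
    return complete


rcs = [("_be", 2), ("_", 5), ("be", 2), ("e", 2),
       ("o_be", 2), ("o", 4), ("t", 3), ("to_be", 2)]
# Hand-derived left-complete substrings (assumption, not from the slides).
lcs = [("_", 5), ("o", 4), ("t", 3), ("to", 2),
       ("to_", 2), ("to_b", 2), ("to_be", 2)]
cs = intersect_lcs_rcs(lcs, rcs)
```

Because both inputs are sorted, the merge runs in time linear in the total number of entries.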

14
Identifying base clusters

15
Combining base clusters

Combine base clusters X and Y
if ( |X ∩ Y| / |X ∪ Y| > t1 )
    X and Y are merged into one cluster
else
    if ( |X| > |Y| )
        if ( |X ∩ Y| / |Y| > t2 )
            let Y become X's child
    else
        if ( |X ∩ Y| / |X| > t2 )
            let X become Y's child

Merging labels
if ( label_x is a substring of label_y )
    label_xy = label_y
else if ( label_y is a substring of label_x )
    label_xy = label_x
else
    label_xy = label_x + label_y
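
The two rules above can be sketched in Python as follows. Clusters are modelled as non-empty sets of document IDs. The threshold values used in the example, the returned strings, and joining two labels with a space are assumptions; the slides do not specify them.

```python
def combine(x, y, t1, t2):
    """Decide how two non-empty base clusters relate."""
    x, y = set(x), set(y)
    overlap = len(x & y)
    if overlap / len(x | y) > t1:
        return "merge"
    if len(x) > len(y):
        return "y_child_of_x" if overlap / len(y) > t2 else "separate"
    return "x_child_of_y" if overlap / len(x) > t2 else "separate"


def merge_label(label_x, label_y):
    if label_x in label_y:
        return label_y
    if label_y in label_x:
        return label_x
    return label_x + " " + label_y  # separator is an assumption


decisions = [
    combine({1, 2, 3, 4}, {1, 2, 3}, t1=0.5, t2=0.5),
    combine({1, 2, 3, 4, 5, 6}, {1, 2}, t1=0.5, t2=0.5),
    combine({1, 2}, {3, 4}, t1=0.5, t2=0.5),
]
label = merge_label("object oriented", "object oriented programming")
```

The first test is the Jaccard coefficient of the two clusters; the second measures how much of the smaller cluster is covered by the larger one, which is what turns a flat set of base clusters into a hierarchy.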

16
Prototype system

Created a prototype system named WICE (Web
Information Clustering Engine).
Deals well with the special problems
related to Chinese.
Output for the query "object oriented":
object oriented programming
object oriented analysis, etc.

17
Conclusion

Main contributions
The benefit of using key phrases.
A suffix-array-based method for key phrase discovery.
The concept of orthogonal clustering.
The WICE system is designed and implemented.
Further work
Detailed analysis.
Further experimenting.
Interpretation of experiment results.
Comparing with other clustering algorithms.
