Using Memex to archive and mine community Web browsing experience

Slides: 20
Provided by: soumencha

1
Using Memex to archive and mine community Web
browsing experience
  • Soumen Chakrabarti, Sandeep Srivastava, Mallela
    Subramanyam, Mitul Tiwari
  • Indian Institute of Technology Bombay

2
Information sources on the Web
  • Web page contents
  • Early keyword search engines
  • Hyperlink structure
  • Later engines: Google, Raging Search
  • Searching behavior
  • Search sites monitor clicks on search results
  • Browsing behavior
  • Easily captured in stand-alone hypermedia
  • Need software infrastructure for the Web

3
Personal Memex
  • Archiving is feasible
  • 25 GB in a lifetime
  • Why archive?
  • Recall past events
  • Create a profile
  • Correlate with sites, directories, searches
  • Challenges
  • Flexible architecture
  • Analysis techniques

Your husband died, but here is his Memex
(From Jim Gray's Turing Award Lecture)
4
Searching the personal Memex
  • Keyword search (never lose a page)
  • Advanced queries
  • Recreate my recent surfing history w.r.t. the
    topic bicycling
  • Extract from the MIT Web site all pages that
    match my compiler research profile
  • Topic taxonomy plays a central role
  • Characterized by bookmark folders
  • More familiar than universal directories

5
Archiving architecture choices
  • Bookmarks only or all click history
  • Installed application or plug-in
  • Closer integration, e.g. with COM
  • CGI and JavaScript
  • Slow, hard to monitor all clicks
  • Applet-servlet
  • Portable, better UI compared to HTML
  • Proxy or wiretap
  • Proxy involves configuring browser

6
Memex block diagram
[Diagram labels]
Components: Browser, Running client applet, Client JAR, Memex server, Event-handler servlets, Mining demons
Events: Visit, Search, Attach, Folder, Download, Context
Mining tasks: Taxonomy synthesis, Resource discovery, Recommendation, Classification, Clustering
Storage: Archive, Relational metadata, Text index, Topic models
Memex client-server protocol and workload sharing negotiations
7
Document workflow
Page visit and bookmarking events logged
[Diagram labels]
Components: Browser, Memex client, NODE table, Crawler
Per-document version queue: Push new version, Pop and discard old version
Demon Registry: Search indexer, Classifier service, Clustering service, Garbage collector
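
The workflow on this slide (new versions pushed into a per-document queue, old versions popped and discarded, registered demons such as the search indexer and classifier service processing each version) can be sketched roughly as follows. This is a minimal illustration, not the Memex server: the class names `VersionQueue` and `DemonRegistry`, the `max_versions` limit, and the toy demon callbacks are all assumptions introduced for the example.

```python
from collections import defaultdict, deque


class VersionQueue:
    """Per-document queue of fetched versions (hypothetical structure)."""

    def __init__(self, max_versions=2):
        self.max_versions = max_versions  # assumed retention limit
        self.queues = defaultdict(deque)

    def push(self, url, content):
        """Push a new version; pop and discard the oldest beyond the limit."""
        q = self.queues[url]
        q.append(content)
        discarded = []
        while len(q) > self.max_versions:
            discarded.append(q.popleft())
        return discarded

    def latest(self, url):
        return self.queues[url][-1]


class DemonRegistry:
    """Registry of demons that are run on each new document version."""

    def __init__(self):
        self.demons = {}

    def register(self, name, callback):
        self.demons[name] = callback

    def notify(self, url, content):
        # Run every registered demon on the new version, in registration order.
        return {name: cb(url, content) for name, cb in self.demons.items()}


queue = VersionQueue(max_versions=2)
registry = DemonRegistry()
# Toy stand-ins for the search indexer and classifier service.
registry.register("search_indexer", lambda url, text: sorted(set(text.lower().split())))
registry.register("classifier", lambda url, text: "cycling" if "bike" in text.lower() else "other")

url = "http://example.org/bikes"
for text in ["Bike shops", "Bike shops and trails", "Bike trails only"]:
    dropped = queue.push(url, text)
    results = registry.notify(url, text)
```

Keeping the queue bounded per document is one way to read "pop and discard old version"; the slide does not say how many versions are retained.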
8
Folder tab
  • File-manager-like interface
  • Valuable user input and feedback on topics and
    example documents

9
Context tab
  • Replay of recent browsing context restricted to
    chosen topic
  • Choice of topic context
  • Active browser monitoring and dynamic layout of
    new incremental context graph
  • Better mobility than the one-dimensional history
    provided by popular browsers
10
Search tab
Search using keywords and visit statistics
  • Find the paper about collaborative filtering I
    was reading a month back
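
A query like the one above combines a keyword match with a constraint on when the page was visited. The sketch below is only an illustration under stated assumptions: the `Visit` record, the `search` function, and the age window in days are hypothetical names and parameters, not the Memex search interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Visit:
    url: str
    text: str
    when: datetime


def search(visits, keywords, now, min_age_days, max_age_days):
    """Return URLs whose text contains all keywords and whose visit falls in
    the age window, most frequently visited first."""
    wanted = [k.lower() for k in keywords]
    counts = {}
    for v in visits:
        age = now - v.when
        if not (timedelta(days=min_age_days) <= age <= timedelta(days=max_age_days)):
            continue
        words = set(v.text.lower().split())
        if all(k in words for k in wanted):
            counts[v.url] = counts.get(v.url, 0) + 1
    # Visit statistics: rank by number of matching visits, then by URL.
    return sorted(counts, key=lambda u: (-counts[u], u))


now = datetime(2000, 5, 15)
visits = [
    Visit("http://example.org/cf-paper", "collaborative filtering paper", now - timedelta(days=30)),
    Visit("http://example.org/cf-paper", "collaborative filtering paper", now - timedelta(days=28)),
    Visit("http://example.org/cf-survey", "collaborative filtering survey", now - timedelta(days=33)),
    Visit("http://example.org/bikes", "bicycling routes", now - timedelta(days=29)),
    Visit("http://example.org/cf-old", "collaborative filtering notes", now - timedelta(days=200)),
]
# "A month back" is read here as three to six weeks ago (an assumption).
hits = search(visits, ["collaborative", "filtering"], now, min_age_days=21, max_age_days=42)
```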

11
Mining issues
  • Two relations
  • occurs_in(term, document)
  • bookmarked_into(document, folder)
  • (Ignore hyperlinks for now)
  • Document classification and clustering
  • Exploit bookmarked_into
  • Taxonomy synthesis
  • Reconcile folders from a community of users into
    coherent themes
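
The two relations can be held as plain sets of pairs and joined on the document to give each folder a term profile, which is the input that classification, clustering, and taxonomy synthesis would work from. This is a small sketch with made-up documents and folders; the helper `folder_term_counts` is a hypothetical name, and storing `occurs_in` as a set records only term presence, not frequency.

```python
from collections import defaultdict

# occurs_in(term, document) and bookmarked_into(document, folder) as sets of pairs.
occurs_in = {
    ("bike", "d1"), ("shop", "d1"),
    ("bike", "d2"), ("trail", "d2"),
    ("movie", "d3"),
}
bookmarked_into = {
    ("d1", "user1/Cycling"),
    ("d2", "user2/Sports"),
    ("d3", "user1/Movies"),
}


def folder_term_counts(occurs_in, bookmarked_into):
    """Join the two relations on document to count terms per folder."""
    folders_of = defaultdict(set)
    for doc, folder in bookmarked_into:
        folders_of[doc].add(folder)
    counts = defaultdict(lambda: defaultdict(int))
    for term, doc in occurs_in:
        for folder in folders_of[doc]:
            counts[folder][term] += 1
    return {f: dict(t) for f, t in counts.items()}


profile = folder_term_counts(occurs_in, bookmarked_into)
```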

12
Taxonomy synthesis motivation
  • Autonomy vs collaboration
  • Personalization ≠ picking folders from Yahoo
  • Complex relations between users' interests
  • Need the simplest common ground

[Diagram: folder trees of User1, User2, User3 and Yahoo]
Folder labels: Sports, Cycling, Hiking, Biz, Shops, Bikeshops
Annotations: Subsumption, Tree inversion
13
Taxonomy synthesis intuition
[Diagram: graph of folders and the documents bookmarked into them]
Folders: Media, Broadcasting, Entertainment, Studios
Documents: kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com
14
Taxonomy synthesis intuition
[Diagram: the same graph with theme nodes placed between folders and documents]
Folders: Media, Broadcasting, Entertainment, Studios
Themes: Radio, TV, Movies
Documents: kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com
15
Trade-off
  • Using theme nodes can simplify graph
  • Shannon encoding of folder or theme ID
  • Increases distortion of term distribution
  • Kullback-Leibler (KL) distance of distorted
    folder w.r.t. true folder
  • Compare cost in bits
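
Both sides of this trade-off can be measured in bits. The sketch below computes the KL distance between smoothed term distributions and the bits needed to name a folder or theme ID. It rests on assumptions the slide does not state: add-one smoothing over a fixed vocabulary, and a uniform code over IDs (so each ID costs log2 of the number of choices). The function names and toy counts are hypothetical.

```python
import math


def term_distribution(counts, vocabulary, smoothing=1.0):
    """Smoothed term distribution over a fixed vocabulary (smoothing is an assumption)."""
    total = sum(counts.get(t, 0) for t in vocabulary) + smoothing * len(vocabulary)
    return {t: (counts.get(t, 0) + smoothing) / total for t in vocabulary}


def kl_bits(p, q):
    """KL(p || q) in bits; both distributions must share the same support."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


def id_bits(num_choices):
    """Bits to name one of num_choices IDs, assuming they are equally likely."""
    return math.log2(num_choices) if num_choices > 1 else 0.0


vocab = ["bike", "shop", "trail", "movie"]
cycling = term_distribution({"bike": 6, "shop": 2}, vocab)
sports = term_distribution({"bike": 4, "trail": 4}, vocab)
movies = term_distribution({"movie": 8}, vocab)

# Distortion if the Cycling folder were represented by another node.
d_sports = kl_bits(cycling, sports)
d_movies = kl_bits(cycling, movies)
```

Replacing the Cycling folder by a Sports-like theme distorts its term distribution far less than replacing it by a Movies-like theme, while fewer theme IDs make each ID cheaper to encode.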

16
Algorithm BestSingle
  • Pool all documents
  • Find bottom-up hierarchical clustering (HAC)
    using text only
  • Map each original folder to the one HAC node at
    the smallest KL distance
  • Low mapping cost, high distortion
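
The steps above can be sketched on toy data. This is an illustration under assumptions, not the authors' implementation: cosine similarity of pooled term counts as the HAC merge criterion, add-one smoothing, and all function names and example documents are choices made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hac_nodes(doc_counts):
    """Bottom-up clustering on text only; returns every node as (doc_ids, pooled_counts)."""
    active = [(frozenset([d]), Counter(c)) for d, c in sorted(doc_counts.items())]
    nodes = list(active)
    while len(active) > 1:
        # Merge the most similar pair of active clusters.
        i, j = max(
            ((i, j) for i in range(len(active)) for j in range(i + 1, len(active))),
            key=lambda ij: cosine(active[ij[0]][1], active[ij[1]][1]),
        )
        merged = (active[i][0] | active[j][0], active[i][1] + active[j][1])
        active = [n for k, n in enumerate(active) if k not in (i, j)] + [merged]
        nodes.append(merged)
    return nodes


def distribution(counts, vocab, smoothing=1.0):
    total = sum(counts.get(t, 0) for t in vocab) + smoothing * len(vocab)
    return {t: (counts.get(t, 0) + smoothing) / total for t in vocab}


def kl_bits(p, q):
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


def best_single(doc_counts, folders):
    """Map each folder to the single HAC node at the smallest KL distance."""
    vocab = sorted({t for c in doc_counts.values() for t in c})
    nodes = hac_nodes(doc_counts)  # pool all documents, ignoring folders
    mapping = {}
    for name, docs in folders.items():
        pooled = sum((Counter(doc_counts[d]) for d in docs), Counter())
        p = distribution(pooled, vocab)
        mapping[name] = min(nodes, key=lambda n: kl_bits(p, distribution(n[1], vocab)))[0]
    return mapping


doc_counts = {
    "d1": {"bike": 3, "shop": 1},
    "d2": {"bike": 2, "trail": 2},
    "d3": {"movie": 4, "studio": 1},
    "d4": {"movie": 2, "tv": 2},
}
folders = {"user1/Cycling": ["d1", "d2"], "user2/Films": ["d3", "d4"], "user3/Bikeshops": ["d1"]}
mapping = best_single(doc_counts, folders)
```

In this toy case every folder coincides with some HAC node, so the distortion is zero. With real folders the nearest node generally differs from the folder, which is where the high distortion noted on the slide comes from.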

17
PatchHAC and Bicriteria
  • PatchHAC
  • Start with BestSingle
  • Greedily introduce additional mappings from
    folders to HAC nodes
  • Bicriteria
  • Start with each document a theme
  • Collapse greedily while total code length
    decreases
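
The Bicriteria idea (start with each document as its own theme, then collapse greedily while the total code length decreases) can be sketched as below. The slides do not define the code length, so the cost used here is an assumption: log2 of the number of themes per document for the theme ID, plus the document length times the KL distance from the document to its theme. Function names and data are hypothetical, and PatchHAC is not covered.

```python
import math
from collections import Counter
from itertools import combinations


def distribution(counts, vocab, smoothing=1.0):
    total = sum(counts.get(t, 0) for t in vocab) + smoothing * len(vocab)
    return {t: (counts.get(t, 0) + smoothing) / total for t in vocab}


def kl_bits(p, q):
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


def code_length(themes, doc_counts, vocab):
    """Assumed cost: theme-ID bits per document plus length-weighted KL distortion."""
    id_bits = math.log2(len(themes)) if len(themes) > 1 else 0.0
    total = id_bits * len(doc_counts)
    for theme in themes:
        pooled = sum((Counter(doc_counts[d]) for d in theme), Counter())
        q = distribution(pooled, vocab)
        for d in theme:
            p = distribution(doc_counts[d], vocab)
            total += sum(doc_counts[d].values()) * kl_bits(p, q)
    return total


def bicriteria(doc_counts):
    vocab = sorted({t for c in doc_counts.values() for t in c})
    themes = [frozenset([d]) for d in sorted(doc_counts)]  # each document a theme
    cost = code_length(themes, doc_counts, vocab)
    while len(themes) > 1:
        # Evaluate every pairwise collapse and keep the cheapest one.
        candidates = []
        for a, b in combinations(themes, 2):
            merged = [t for t in themes if t not in (a, b)] + [a | b]
            candidates.append((code_length(merged, doc_counts, vocab), merged))
        best_cost, best = min(candidates, key=lambda c: c[0])
        if best_cost >= cost:
            break  # no collapse shortens the total code length
        themes, cost = best, best_cost
    return themes, cost


doc_counts = {
    "d1": {"bike": 4},
    "d2": {"bike": 4},
    "d3": {"movie": 4},
    "d4": {"movie": 4},
}
themes, cost = bicriteria(doc_counts)
```

On this data the two bike documents and the two movie documents collapse into one theme each; merging those two themes would save the remaining ID bits but add more distortion than it saves, so the greedy loop stops.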

18
Conclusion
19
Related work
  • Archiving, searching, categorization
  • Vistabar (Alta Vista)
  • Bookmark organizer (IBM Haifa)
  • PowerBookmarks (NEC)
  • Purple Yogi
  • Netscape roaming access, Backflip
  • Mining
  • Attribute similarity via external probes
  • Non-linear dynamical systems