Title: A Model for Fast Web Mining Prototyping


1
A Model for Fast Web Mining Prototyping
Álvaro Pereira
  • Nivio Ziviani
  • UFMG, Brazil

Ricardo Baeza-Yates
  • Jesus Bisbal
  • UPF, Spain
2
Motivation
  • Our focus
  • Web mining as the process of discovering useful
    information in Web data by means of data mining
    techniques
  • Web mining
  • Computation-intensive task
  • Iterative process
  • Prototyping plays an important role
  • Experimenting with different alternatives
  • Incorporating the knowledge from previous
    iterations
  • Mining software is developed ad hoc
  • Time-consuming tasks
  • Not scalable
  • Not reusable

3
Main Objective: Design and Development of WIM
  • WIM: Web Information Mining model
  • WIM goal: facilitate fast Web mining prototyping
  • Main research challenges
  • Data model
  • Algebra
  • Software prototype
  • Architecture and implementation issues

4
Web Mining Problems WIM Has Been Applied To So Far
  • Study of genealogical trees on the Web (WWW'08)
  • A study on how the Web textual content evolves
  • A usage PageRank for ranking improvement
  • A logical graph is created based on usage data
  • Linkage Evolution for New Pages
  • Hypothesis: duplicates tend to have no evolution
    of links (inlinks)
  • A user intent study
  • Identifying queries that cannot be classified as
    either navigational or informational
  • Creation of a reference dataset for learning to
    rank

5
Outline
  • Related work
  • WIM data model
  • WIM algebra
  • Software architecture
  • Conclusions and future work

6
  • Related Work

7
First Research Line: Data Mining Tools
  • Business-driven solutions
  • Not specially designed for Web data
  • SQL extensions
  • Examples
  • Microsoft SQL Server
  • Oracle Data Mining
  • IBM DB2 Intelligent Miner
  • BI tools
  • Angoss, Infor CRM Epiphany, Portrait Software,
    SAS
  • Weka

8
Second Research Line: Query Languages for Web Data
  • Not for mining
  • Web data manipulation
  • Acquisition, storage, management
  • Examples
  • TSIMMIS, W3QL, WebLog, WebSQL, ARANEUS, StruQL,
    WebOQL, Whoweda, WEBMINER, WUM, Squeal, WebBase,
    WEBVIEW

9
  • Data Model

10
Data Model Design Goals
  • Feasibility
  • Simplicity
  • Extensibility
  • Data representativeness
  • Uniformity among operators
  • Applicability to other scenarios

11
Relation Type
  • Node relations represent nodes of a graph, such
    as
  • Documents of a Web dataset
  • Terms of a document
  • Queries of a query log
  • Sessions of a query log
  • Link relations represent edges of a graph, such
    as
  • Links between Web documents
  • Word distance among terms of a document
  • Similarity among queries
  • Clicks of a query log
  • Association between queries and sessions
  • Usage data can be represented as either node or
    link relations
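
The two relation types can be pictured with a minimal sketch. The class names, field names, and example URLs below are hypothetical and are not taken from the WIM prototype; the sketch only assumes that a node relation holds one tuple per node and a link relation holds one tuple per edge.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRelation:
    """One tuple per graph node, keyed by node id (hypothetical layout)."""
    name: str
    rows: dict = field(default_factory=dict)  # node id -> attribute dict


@dataclass
class LinkRelation:
    """One tuple per graph edge, identified by its start and end nodes."""
    name: str
    rows: list = field(default_factory=list)  # (start id, end id, attributes)


# Documents of a Web dataset stored as a node relation.
docs = NodeRelation("docs", {
    1: {"url": "http://a.example/"},
    2: {"url": "http://b.example/"},
})

# Links between Web documents stored as a link relation.
links = LinkRelation("links", [(1, 2, {"anchor": "see b"})])
```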

12
Node Relation
13
Link Relation
  • Main difference: link relations must represent
    start and end nodes of a graph

14
Compatibility
  • A link relation is compatible with a node relation
    if the nodes of the graph (link relation) are
    foreign keys into the node relation
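
A compatibility test along these lines could look like the sketch below. The function name `is_compatible` and the plain set/tuple representation are assumptions made for illustration, not the prototype's interface.

```python
def is_compatible(link_rows, node_ids):
    """Return True if every start and end node of the link relation
    refers to an existing key of the node relation."""
    return all(start in node_ids and end in node_ids
               for start, end in link_rows)


node_ids = {1, 2, 3}            # keys of a node relation
links_ok = [(1, 2), (2, 3)]     # every endpoint is a known node
links_bad = [(1, 4)]            # node 4 is missing from the node relation
```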

15
Operation
  • The act of applying an operator to a relation
  • An operator is a function defined by the WIM
    algebra
  • Unary or binary

16
WIM Program
  • Sequence of operations applied to relations
  • Result of users' interaction through the WIM
    language
  • The WIM language
  • Is built upon the WIM algebra
  • Is declarative
  • Is a dataflow programming language
  • Facilitates parallelism
  • Allows graphical implementation
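
The idea of a program as a sequence of operations can be sketched as a simple dataflow pipeline. This is not the WIM language: `run_program`, `select_long`, and `calculate_kb` are hypothetical names, and only unary operators over a list-of-dicts relation are assumed.

```python
from functools import reduce


def run_program(relation, operations):
    """Apply each operation to the output of the previous one."""
    return reduce(lambda rel, op: op(rel), operations, relation)


def select_long(rel):
    # Hypothetical Select-like step: keep documents longer than 100 bytes.
    return [t for t in rel if t["length"] > 100]


def calculate_kb(rel):
    # Hypothetical Calculate-like step: derive a new attribute per tuple.
    return [{**t, "kb": t["length"] / 1024} for t in rel]


docs = [{"id": 1, "length": 2048}, {"id": 2, "length": 50}]
result = run_program(docs, [select_long, calculate_kb])
```

Because each step depends only on the relation it receives, independent branches of such a program could in principle run in parallel.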

17
WIM Program Example: Genealogical Tree Study
18
WIM Program Example: Genealogical Tree Study
28
  • WIM Algebra

29
Two Classes of Operators
  • Seven data manipulation operators
  • Select, Calculate, CalcGraph, Aggregate, Set,
    Join, Materialize
  • Eight data mining operators
  • Search, Compare, CompGraph, Cluster, Disconnect,
    Associate, Analyze, Relink

30
Select
  • Select tuples from the input

31
Calculate
  • For mathematical and statistical calculations

32
CalcGraph
  • For calculations between nodes of the graph

33
Aggregate
  • Group tuples with the same value for one or two
    attributes
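
Grouping by one or two attributes might be sketched as follows. The function signature and the click-log example are assumptions for illustration only.

```python
from collections import defaultdict


def aggregate(rel, keys):
    """Group tuples that share the same values for the given attributes."""
    groups = defaultdict(list)
    for t in rel:
        groups[tuple(t[k] for k in keys)].append(t)
    return dict(groups)


clicks = [
    {"query": "ufmg", "doc": 1},
    {"query": "ufmg", "doc": 2},
    {"query": "upf", "doc": 3},
]
by_query = aggregate(clicks, ["query"])
by_query_and_doc = aggregate(clicks, ["query", "doc"])
```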

34
Set
  • For union, intersection and difference of tuples
    in two relations
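
A minimal sketch of the three set operations, assuming relations are collections of hashable tuples (the name `set_op` is hypothetical):

```python
def set_op(rel_a, rel_b, kind):
    """Return the union, intersection, or difference of two relations."""
    a, b = set(rel_a), set(rel_b)
    if kind == "union":
        return a | b
    if kind == "intersection":
        return a & b
    if kind == "difference":
        return a - b
    raise ValueError(f"unknown set operation: {kind}")


crawl_1 = [(1, "a.example"), (2, "b.example")]
crawl_2 = [(2, "b.example"), (3, "c.example")]
```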

35
Join
  • Add an external attribute into a given relation
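
Adding an external attribute can be pictured as a keyed lookup. The sketch below assumes list-of-dicts relations; `join_attribute` and the attribute names are hypothetical.

```python
def join_attribute(rel, external, key, attr):
    """Copy attribute `attr` from `external` into each tuple of `rel`,
    matching on `key`; tuples without a match get None."""
    lookup = {t[key]: t[attr] for t in external}
    return [{**t, attr: lookup.get(t[key])} for t in rel]


docs = [{"id": 1, "url": "a.example"}, {"id": 2, "url": "b.example"}]
ranks = [{"id": 1, "rank": 0.7}]
docs_with_rank = join_attribute(docs, ranks, "id", "rank")
```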

36
Search
  • Used for querying (TF-IDF, BM25, AND, OR)
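
The Boolean AND/OR options can be illustrated with a small inverted index. This is a sketch under simplifying assumptions (whitespace tokenization, lowercase matching, hypothetical function names); the ranked options TF-IDF and BM25 are not covered.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, terms, mode="AND"):
    """Return ids of documents matching all terms (AND) or any term (OR)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


docs = {1: "web mining model", 2: "web data", 3: "data mining tools"}
index = build_index(docs)
```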

37
Compare
  • Compare elements of a textual attribute
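
The slide does not say which comparison method is used. As one possible illustration, the sketch below compares two text values with Jaccard similarity over word sets; the choice of measure and the function name are assumptions.

```python
def jaccard(text_a, text_b):
    """Jaccard similarity between the word sets of two text values."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


similarity = jaccard("web mining model", "web mining tool")
```

A threshold on such a score could be used, for example, to flag near-duplicate documents.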

38
Disconnect
  • Identify clusters in a graph
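
One common way to identify clusters in a graph is to compute its connected components. Whether Disconnect works this way is an assumption; the sketch below treats edges as undirected and uses a hypothetical function name.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected graph."""
    adj = defaultdict(set)
    for start, end in edges:
        adj[start].add(end)
        adj[end].add(start)

    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        # Depth-first traversal from an unvisited node.
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n] - seen)
        components.append(comp)
    return components


clusters = connected_components([(1, 2), (2, 3), (4, 5)])
```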

39
Analyze
  • For link analysis (PageRank, Authority, Indegree)
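
Two of the listed measures can be sketched over an edge list. This is a textbook power-iteration PageRank and a simple indegree count, not the prototype's code; the damping factor, iteration count, and function names are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (start, end) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for start, end in edges:
        out[start].append(end)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outlinks is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1 - damping) / count + damping * dangling / count
        new = {n: base for n in nodes}
        for start in nodes:
            for end in out[start]:
                new[end] += damping * rank[start] / len(out[start])
        rank = new
    return rank


def indegree(edges):
    """Count the incoming links of each node that has at least one."""
    counts = {}
    for _, end in edges:
        counts[end] = counts.get(end, 0) + 1
    return counts


edges = [(1, 3), (2, 3), (3, 1)]
ranks = pagerank(edges)
```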

40
  • Software Architecture

41
Software Architecture
42
  • Conclusions and Future Work

43
Conclusions
  • WIM: a model and software for fast Web mining
    prototyping
  • Data model
  • Algebra
  • A software prototype
  • Efficient
  • Several tens of millions of tuples
  • Running time is higher for the mining operations
  • Ad-hoc solutions also need the mining step
  • Scalable
  • Future implementation could have the attributes
    stored in different servers and different parts
    of programs running distributively

44
Conclusions
  • Extensible
  • New operators, and new options/methods for the
    current operators, can be added
  • We have designed and implemented an extension of
    operator Analyze
  • Calculate PageRank taking into account the label
    of the graph
  • Effective for a set of Web mining applications

45
Future Work on WIM
  • Finish the implementation and make a version of
    the prototype available
  • Users could contribute extensions
  • Improve the prototype to become a tool
  • Design new operators for other mining tasks
  • Integrate a Web crawler and a data visualization
    interface
  • Implement a graphical interface to the WIM
    language

46
  • Thank you!
  • alvaro_at_dcc.ufmg.br