Title: Contents
1. Contents
- Introduction
- Knowledge discovery from text and links
- Knowledge discovery from usage data
- Important open issues
2. WWW: the new face of the Net
Once upon a time, the Internet was a forum for exchanging information. Then came the Web.
The Web introduced new capabilities and attracted many more people, increasing commercial interest and turning the Net into a real forum.
3. Information overload
As more people started using the Web, the quantity of online information increased; the growing quantity of information attracted even more people, increasing it further and leading to information overload for the users.
4. WWW: an expanding forum
- The Web is large and volatile:
- More than 600,000,000 users online.
- More than 800,000 sign up every day.
- More than 9,000,000 Web sites.
- More than 300,000,000,000 pages online.
- Less than 50% of Web sites will be there next year.
- ...leading to the abundance problem: 99% of online information is of no interest to 99% of the people.
5. Information access services
- A number of services aim to help the user gain access to online information and products...
- But can they really cope?
6. New requirements
- Current indexing does not allow for wide coverage: less than 5% of the Web is covered by search engines.
- What I want is hardly ever ranked high enough.
- Product information in catalogues is often biased towards specific suppliers and outdated.
- Product descriptions are incomplete and insufficient for comparison purposes.
- The E in E-commerce stands for English: more than 70% of the Web is in English.
- ...and many more problems lead to the conclusion that more intelligent solutions are needed!
7. A new generation of services
- Some have already made their way to the market; many more are being developed as I speak.
8. Approaches to Web mining
- Primary data (Web content):
- Mainly text,
- with some multimedia content (increasing),
- and mark-up commands, including hyperlinks.
- Underlying databases (not directly accessible).
- Knowledge discovery from text and links:
- Pattern discovery in unstructured textual data.
- Pattern discovery in the Web graph / hypertext.
9. Approaches to Web mining
- Secondary data (Web usage):
- Access logs collected by servers,
- potentially using cookies,
- and a variety of navigational information collected by Web clients (mainly JavaScript agents).
- Knowledge discovery from usage data:
- Discovery of interesting usage patterns, mainly from server logs.
- Web personalization -> Web intelligence.
10. Contents
- Introduction
- Knowledge discovery from text and links
- Introduction
- Information filtering and retrieval
- Ontology learning
- Knowledge discovery from usage data
- Important open issues
11. Information access
- Goals:
- Organize documents into categories.
- Assign new documents to the categories.
- Retrieve information that matches a user query.
- Dominating statistical idea:
- TF-IDF (term frequency / inverse document frequency; spelled out below).
- Problems on the Web:
- Huge scale and high volatility demand automation.
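For reference, the standard form of the TF-IDF weight (the notation is introduced here; the slide only names the idea):

    w(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}

where tf(t, d) is the frequency of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents in the collection.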
12. Text mining
- Knowledge (pattern) discovery in textual data.
- Clarifying common misconceptions:
- Text mining is NOT about assigning documents to thematic categories, but about learning document classifiers.
- Text mining is NOT about extracting information from text, but about learning information extraction patterns.
- Difficulty: the unstructured format of textual data.
13. Approaches to text mining
- Combination of language engineering (LE), machine learning (ML) and statistical methods.
14. Hyperlink information is useful
- Information access can be improved by identifying authoritative pages (authorities) and resource index pages (hubs).
- Linked pages often contain complementary information (e.g. product offers).
- Thematically related pages are often linked, either directly or indirectly.
15. Document category modelling
Input: training documents (pre-classified).
- Pre-processing: stopword removal (and, the, etc.), stemming (played → play), bag-of-words coding.
- Dimensionality reduction: statistical selection/combination of characteristic terms (MI, PCA).
- Machine learning: supervised classifier learning.
Output: category models (classifiers).
16. Document category modelling
- Example: filtering spam email.
- Task: classify incoming email as spam or legitimate (2 document categories).
- Simple blacklist and keyword-based methods have failed.
- More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modelling).
17. Document category modelling
- Step 1 (linguistic pre-processing): tokenization, removal of stopwords, stemming/lemmatization.
- Step 2 (vector representation): bag-of-words or n-gram modelling (n = 2, 3).
- Step 3 (feature selection): information gain evaluation.
- Step 4 (machine learning): Bayesian modelling, using word/n-gram frequencies (see the sketch below).
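A minimal Python sketch of steps 1-4, assuming scikit-learn is available; the vectorizer handles tokenization and stopword removal, and chi-squared feature selection stands in for information gain. The toy messages are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

mails = ["buy cheap pills buy now", "meeting agenda attached",
         "cheap pills now buy pills", "lunch tomorrow at noon"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate

model = Pipeline([
    ("bow", CountVectorizer(stop_words="english", ngram_range=(1, 2))),  # steps 1-2
    ("select", SelectKBest(chi2, k=5)),                                  # step 3
    ("bayes", MultinomialNB()),                                          # step 4
])
model.fit(mails, labels)
print(model.predict(["buy cheap pills"]))  # -> [1] (spam)

A real filter would of course be trained on thousands of labelled messages, but the pipeline structure is the same.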
18. Link structure analysis
- Improve information retrieval by scoring Web pages according to their importance in the Web, or in a thematic sub-domain of it.
- Nodes with large fan-in (authorities) provide high-quality information.
- Nodes with large fan-out (hubs) are good starting points.
19. Link structure analysis
- The HITS algorithm [Kleinberg, Journal of the ACM, 1999]:
- Given a set of Web pages, e.g. as generated by a query,
- expand the base set by including pages that are linked to by the ones in the initial set, or that link to them,
- assign a hub and an authority weight to each page, initialised to 1,
- update the authority weight of page p according to the hub weights of the pages that link to it,
- update the hub weight of page p according to the authority weights of the pages that it links to,
- repeat the weight updates a given number of times,
- return a list of the pages ranked by their weights. (A sketch of the update loop follows.)
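A minimal Python sketch of the HITS update loop on a toy graph; the base-set expansion step is assumed to have happened already, and the normalisation (standard in Kleinberg's formulation, not spelled out on the slide) keeps the weights bounded:

def hits(links, iterations=20):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}   # hub weights, initialised to 1
    auth = {p: 1.0 for p in pages}  # authority weights, initialised to 1
    for _ in range(iterations):
        # authority of p: sum of the hub weights of the pages linking to it
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # hub weight of p: sum of the authority weights of the pages it links to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # normalise so the weights stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

graph = {"a": ["c"], "b": ["c"], "c": []}   # invented toy graph
auth, hub = hits(graph)
print(max(auth, key=auth.get))  # 'c': the page everyone links to is the authority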
20. Link structure analysis
- Interesting issues:
- Does the social network hypothesis hold, i.e. are authorities highly cited? This may be unrealistic in competitive commercial domains.
- What happens if the link structure adapts to the method, e.g. unrelated pages link to each other to increase their rating?
- What about interesting new pages? How will people get to them?
21. Focused crawling / spidering
- Crawling/spidering: automatic navigation through the Web by robots, with the aim of indexing the Web.
- Crawling vs. spidering (a subjective distinction): inter-site vs. intra-site navigation.
- Focused crawling/spidering: efficient, thematic indexing of relevant Web pages, e.g. for maintaining a thematic portal.
- Underlying assumption, similar to HITS: thematically similar pages are linked.
22. Focused crawling
- Focused crawling [Chakrabarti et al., WWW 1999]:
- Given an initial set of Web pages about a topic, e.g. as found in a Web directory,
- use document category modelling to build a topic classifier,
- extract the hyperlinks within the initial set of pages and add them to a queue of pages to be visited,
- retrieve pages from the queue,
- use the classifier to assess the relevance of retrieved pages,
- use a variant of HITS to assign a hub score to pages and to the hyperlinks in the queue,
- re-sort the links in the queue according to their hub score,
- continue the retrieval of new pages, periodically updating the scores of the hyperlinks in the queue. (A sketch of the crawl loop follows.)
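A skeletal sketch of the crawl loop over an invented in-memory "Web"; the relevance and hub_score helpers are simplistic stand-ins for the topic classifier and the HITS-based scoring described above, not the method of the paper:

import heapq

# Toy in-memory "Web": url -> (page text, outgoing links). Invented data.
WEB = {
    "seed": ("tourism portal hotels", ["a", "b"]),
    "a": ("hotel prices and offers", ["c"]),
    "b": ("celebrity gossip", []),
    "c": ("cheap hotel rooms", []),
}

def relevance(text):
    # Stand-in for the learned topic classifier (e.g. naive Bayes).
    return 1.0 if "hotel" in text else 0.0

def hub_score(url):
    # Stand-in for HITS-style link scoring; here: number of outgoing links.
    return len(WEB.get(url, ("", []))[1])

def focused_crawl(seeds, budget=10):
    queue = [(-1.0, u) for u in seeds]   # priority queue: best-scored link first
    heapq.heapify(queue)
    visited, relevant = set(), []
    while queue and len(visited) < budget:
        _, url = heapq.heappop(queue)
        if url in visited or url not in WEB:
            continue
        visited.add(url)
        text, links = WEB[url]
        if relevance(text) > 0.5:        # keep only on-topic pages
            relevant.append(url)
            for link in links:           # enqueue outgoing links by their score
                heapq.heappush(queue, (-hub_score(link), link))
    return relevant

print(focused_crawl(["seed"]))  # ['seed', 'a', 'c']: the off-topic page is skipped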
23. Focused crawling / spidering
- Domain-specific spidering:
- Goal: retrieve the interesting pages without traversing the whole site.
- Differences from crawling:
- The site is much more restricted in size and thematic diversity than the whole of the Web.
- Social network analysis is less relevant within a site (no hubs and authorities).
- Requirement: link scoring using local features, e.g. the anchor text and its textual context.
24. Information extraction
- Goals:
- Identify interesting events in unstructured text.
- Extract information related to the events and store it in structured templates.
- Typical application:
- Information extraction from newsfeeds.
- Difficulties:
- Deals with unstructured or semi-structured text.
- Identification of entities and relations.
- Usually requires some understanding of the text.
25. A typical extraction system
Input: unstructured text and a database schema (event templates).
- Morphology: sentence and word separation, lemmatization (said → say), part-of-speech tagging, etc.
- Syntax: shallow syntactic parsing.
- Semantics: named-entity recognition, co-reference resolution, sense disambiguation.
- Discourse: pattern matching.
Output: structured data (filled templates).
26. Wrappers / fact extraction
- Simplified information extraction:
- Extract interesting facts from Web documents.
- Assumes structure in the documents (usually dynamically generated from databases).
- Reduced demand for pre-processing and LE.
- Typical application:
- Product comparison services (price, availability, ...).
- Difficulties:
- Semi-structured data.
- Different underlying database schemata and presentation formats.
27. Wrappers / fact extraction
Example page P:
<HTML><TITLE> Some Country Codes </TITLE>
<BODY><B> Some Country Codes </B><P>
<B> Congo </B> <I> 242 </I>
<B> Egypt </B> <I> 20 </I>
<B> Greece </B> <I> 30 </I>
<B> Spain </B> <I> 34 </I>
<HR><B> End </B></BODY></HTML>

Wrapper(page P):
  Skip past the first occurrence of <P> in P
  While (the next <B> is before the next <HR> in P)
    For each (l, r) in {(<B>, </B>), (<I>, </I>)}
      Extract the text between l and r
  Return the extracted (country, code) pairs
Country Code
Congo 242
Egypt 20
Greece 30
Spain 34
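A runnable Python version of the same wrapper, using only the standard library; the skip-past-<P>, stop-at-<HR> logic and the tag pairs mirror the pseudocode above:

import re

PAGE = ("<HTML><TITLE> Some Country Codes </TITLE><BODY><B> Some Country Codes </B><P>"
        "<B> Congo </B><I> 242 </I><B> Egypt </B><I> 20 </I>"
        "<B> Greece </B><I> 30 </I><B> Spain </B><I> 34 </I>"
        "<HR><B> End </B></BODY></HTML>")

def wrapper(page):
    body = page.split("<P>", 1)[1]    # skip past the first <P>
    body = body.split("<HR>", 1)[0]   # stop at <HR>
    # extract the text between each (<B>, </B>) and (<I>, </I>) pair
    countries = re.findall(r"<B>(.*?)</B>", body)
    codes = re.findall(r"<I>(.*?)</I>", body)
    return [(c.strip(), k.strip()) for c, k in zip(countries, codes)]

print(wrapper(PAGE))  # [('Congo', '242'), ('Egypt', '20'), ('Greece', '30'), ('Spain', '34')]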
28. Wrapper induction
Input: training documents (semi-structured) and a database schema (the interesting facts).
- Data pre-processing: abstraction of the mark-up structure (often omitted).
- Machine learning: structural/sequence learning.
Output: fact extraction patterns (a wrapper).
29. Ontology learning
Input: training documents (unclassified).
- Pre-processing: stopword removal (and, the, etc.), stemming (played → play), syntactic/semantic analysis, bag-of-words coding.
- Dimensionality reduction: hand-made thesauri (WordNet), term co-occurrence (LSI).
- Machine learning: unsupervised learning (clustering and association discovery).
Output: ontologies.
30. Ontology learning
- Hierarchical clustering is most suitable:
- Agglomerative clustering.
- Conceptual clustering (COBWEB).
- Model-based clustering (EM-type, MCLUST).
- ...but flat clustering can also be adapted:
- K-means and its variants.
- Bayesian clustering (AutoClass).
- Neural networks (self-organizing maps).
- Association discovery (e.g. Apriori) for non-taxonomic relations. (A small clustering sketch follows.)
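For a flavour of the agglomerative route, a minimal sketch using SciPy on invented term vectors; the tree it builds (terms merged bottom-up by similarity) is the raw material for a concept taxonomy:

from scipy.cluster.hierarchy import linkage, fcluster

# Toy co-occurrence vectors for terms; rows are terms, columns are contexts.
terms = ["hotel", "hostel", "beach", "museum"]
vectors = [
    [3, 0, 2, 1],   # hotel
    [2, 0, 2, 1],   # hostel
    [0, 3, 1, 0],   # beach
    [0, 2, 0, 3],   # museum
]
# Bottom-up (agglomerative) clustering with cosine distance.
tree = linkage(vectors, method="average", metric="cosine")
print(fcluster(tree, t=2, criterion="maxclust"))  # e.g. [1 1 2 2]: hotel/hostel group together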
31. Ontology learning
- Example: acquisition of an ontology for tourist information (based on Maedche & Staab, ECAI 2000).
32. Ontology learning
- Source data: Web pages of tourist sites.
- Background knowledge: generic and domain-specific ontologies.
- Target users: tourist directories, large travel agencies.
- Goals:
- Identify types of page (e.g. room descriptions) and terms/entities inside pages (e.g. hotel addresses).
- Identify taxonomic relations between concepts (e.g. accommodation -> hotel).
- Identify non-taxonomic relations between concepts (e.g. accommodation - area).
33. Ontology learning
- Heavy linguistic pre-processing:
- Syntactic analysis, e.g. verb subcategorization frames: verb(arrive) -> prep(at), dir_obj(Torino).
- Semantic analysis, e.g. named-entity recognition: Via Lagrange -> street name; or special dependency relations: Hotel Concord in Torino.
34. Contents
- Introduction
- Knowledge discovery from text and links
- Knowledge discovery from usage data
- Personalization on the Web
- Data collection and preparation issues
- Personalized assistants
- Discovering generic user models
- Sequential pattern discovery
- Knowledge discovery in action
- Important open issues
35. Personalized information access
(Diagram: information flows from the content sources through a personalization server to the receivers.)
36. Personalization vs. intelligence
- Better service for the user:
- Reduction of the information overload.
- More accurate information retrieval and extraction.
- Recommendation and guidance.
37. Personalized assistants
- Personalized crawling [Lieberman et al., Communications of the ACM, 2001]:
- The system knows the user (log-in).
- It uses heuristics to extract important terms from the Web pages that the user visits, and adds them to thematic profiles.
- Each time the user views a page, the system:
- searches the Web for related pages,
- filters them according to the relevant thematic profile,
- and constructs a list of recommended links for the user.
- The Letizia version of the system searches the Web locally, following outgoing links from the current page.
- The Powerscout version uses a search engine to explore the Web.
38. Personalized assistants
- Adaptive Web interfaces [Jörding, UM 1999]:
- The TELLIM system collects user information (e.g. the selection of a link) using a Java applet.
- User information is used as training data to create generic models reflecting the users' interest in different products.
- The system creates short-term personal models using the generic models and the current user's behaviour.
- Web pages containing more detailed information about these products, together with multimedia content and VRML presentations, are created dynamically and presented to the users.
39. User modelling
- Basic elements:
- Constructing models that can be used to adapt the system to the user's requirements.
- Different types of requirement: interests (sports and finance news), knowledge level (novice to expert), preferences (no-frame GUI), etc.
- Different types of model: personal vs. generic.
- Knowledge discovery facilitates the acquisition of user models from data.
40. User models
- User model (type A), PERSONAL:
- User x -> sports, stock market
- User model (type B), PERSONAL:
- User x, Age: 26, Male -> sports, stock market
- User community, GENERIC:
- Users x, y, z -> sports, stock market
- User stereotype, GENERIC:
- Users x, y, z, Age: 20..30, Male -> sports, stock market
41. Generic user models
- Stereotypes: models that represent a type of user, associating personal characteristics with parameters of the system,
- e.g. male users of age 20-30 are interested in sports and politics.
- Communities: models that represent a group of users with common preferences,
- e.g. users that are interested in sports and politics.
42. Learning user models
43. Knowledge discovery process
- Data collection: collection of usage data by the server and the client.
- Data pre-processing: data cleaning, user identification, session identification.
- Pattern discovery: construction of user models.
- Knowledge post-processing: report generation, visualization, personalization module.
44. Pre-processing usage data
- Cleaning:
- Log entries that correspond to error responses.
- Trails of robots.
- Pages that have not been requested explicitly by the user (mainly image files, loaded automatically); the cleaning rules should be domain-specific.
- User identification:
- Identification by log-in.
- Cookies and JavaScript.
- Extended Log Format (browser and OS version).
- Bookmarking a user-specific URL.
- Various other heuristics.
45. Pre-processing usage data
- User session / transaction identification in log files:
- Time-based methods, e.g. a 30-minute silence interval (see the sketch after this list). Caching causes problems; partial solutions include special HTTP headers and Java agents.
- Context-based methods, e.g. separating pages into navigational and content pages and imposing heuristics on the types of page a user session may consist of.
- User sessions can be subdivided into smaller transaction sequences, e.g. by identifying a backward reference in the sequence of requests.
- Encoding of training data:
- Bag-of-pages representation of sessions/transactions.
- Transition-based representation of sessions/transactions.
- Manually determined features of interest.
46. Collaborative filtering
- Information filtering according to the choices of similar users.
- Avoids semantic content analysis.
- Cold-start problem with new users.
- Approaches:
- memory-based learning,
- model-based clustering,
- item-based recommendation.
47. Memory-based learning
- Nearest-neighbour approach:
- Construct a model for each user, often using explicit user ratings for each item.
- Index the user in the space of system parameters, e.g. item ratings.
- For each new user:
- index the user in the same space, and
- find the k closest neighbours.
- Simple metrics measure the similarity between users, e.g. Pearson correlation.
- Recommend the items that the new user has not seen and that are popular among the neighbours. (A sketch follows.)
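A compact sketch of the neighbourhood method with Pearson correlation, on an invented toy ratings matrix (dicts of user -> {item: rating}):

from math import sqrt

def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = (sqrt(sum((a[i] - ma) ** 2 for i in common))
           * sqrt(sum((b[i] - mb) ** 2 for i in common)))
    return num / den if den else 0.0

def recommend(user, ratings, k=2):
    # the k most similar other users
    neighbours = sorted((u for u in ratings if u != user),
                        key=lambda u: pearson(ratings[user], ratings[u]),
                        reverse=True)[:k]
    seen = set(ratings[user])
    # unseen items, ranked by their popularity among the neighbours
    candidates = {i for u in neighbours for i in ratings[u]} - seen
    return sorted(candidates,
                  key=lambda i: sum(ratings[u].get(i, 0) for u in neighbours),
                  reverse=True)

ratings = {"ann": {"sports": 5, "finance": 4},
           "bob": {"sports": 5, "finance": 4, "world": 2},
           "eve": {"sports": 1, "finance": 2, "politics": 5}}
print(recommend("ann", ratings))  # ['politics', 'world']

A real system would weight the votes by similarity and drop negatively correlated neighbours; this sketch only shows the mechanics.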
48. Model-based clustering
- Clustering users into communities.
- Methods used:
- Conceptual clustering (COBWEB).
- Graph-based clustering (cluster mining).
- Statistical clustering (AutoClass).
- Neural networks (self-organizing maps).
- Model-based clustering (EM-type).
- BIRCH.
- Community models: the cluster descriptions.
49. Model-based clustering
(Figure: a graph of users connected by similarity-weighted edges, with weights such as 0.9, 0.8, 0.5, 0.4 and 0.1; strongly connected groups of users form communities.)
50. Item-based recommendation
- Focus on item usage in the profiles, instead of on the users themselves.
- Practically useful in e-commerce, e.g. for cross-sell recommendations.
- A simple modification of the clique-based clustering method: a graph of items instead of a graph of users.
- Related to frequent itemset discovery in association rule mining. (A sketch of the item graph follows.)
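A toy sketch of the item-graph idea: connect items whose co-occurrence across sessions is strong, then read strongly connected groups off the graph. The Jaccard measure and the threshold are illustrative choices, not from the slides:

from itertools import combinations

sessions = [{"sports", "politics"}, {"sports", "politics", "world"},
            {"world", "finance"}, {"sports", "politics"}]

def item_graph(sessions, threshold=0.5):
    items = set().union(*sessions)
    edges = {}
    for a, b in combinations(sorted(items), 2):
        both = sum(1 for s in sessions if a in s and b in s)
        either = sum(1 for s in sessions if a in s or b in s)
        weight = both / either if either else 0.0   # Jaccard co-usage
        if weight >= threshold:
            edges[(a, b)] = weight                  # keep only strong edges
    return edges

print(item_graph(sessions))  # {('finance', 'world'): 0.5, ('politics', 'sports'): 1.0}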
51. Item-based recommendation
(Figure: an item graph with nodes Politics, Sports, World and Finance, connected by weighted edges such as 0.9, 0.8, 0.5, 0.4 and 0.1.)
52. Contents
- Introduction
- Knowledge discovery from text and links
- Knowledge discovery from usage data
- Personalization on the Web
- Data collection and preparation issues
- Personalized assistants
- Discovering generic user models
- Sequential pattern discovery
- Knowledge discovery in action
- Important open issues
53. Sequential pattern discovery
- Identifying navigational patterns, rather than bag-of-pages models.
- Methods:
- Clustering of transitions between pages.
- First-order Markov models (see the sketch after this list).
- Probabilistic grammar induction.
- Association-rule sequence mining.
- Path traversal through graphs.
- Personal and community navigation models.
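As a taste of the first-order Markov approach, a minimal sketch that estimates page-to-page transition probabilities from toy session sequences; these are maximum-likelihood counts only, and real models would add smoothing:

from collections import Counter, defaultdict

sessions = [["home", "sports", "politics"],
            ["home", "finance", "sports"],
            ["home", "sports", "finance"]]

counts = defaultdict(Counter)
for s in sessions:
    for src, dst in zip(s, s[1:]):
        counts[src][dst] += 1   # count observed page-to-page transitions

# Normalise the counts into first-order transition probabilities P(dst | src)
model = {src: {dst: n / sum(c.values()) for dst, n in c.items()}
         for src, c in counts.items()}
print(model["home"])  # {'sports': 0.666..., 'finance': 0.333...}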
54. Sequential pattern discovery
- Clique-based transition clustering: a small modification of the model-based item clustering approach, where an item is a transition between pages.
(Figure: a graph of page transitions, with nodes Sports->Politics, Finance->Politics, Sports->Finance and Finance->Sports, connected by weighted edges such as 0.9, 0.8, 0.5, 0.4 and 0.1.)
55. References
- J. Borges and M. Levene. Data mining of user navigation patterns. Proceedings of the Workshop on Web Usage Analysis and User Profiling (WEBKDD), in conjunction with the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, pp. 31-36, 1999.
- S. Chakrabarti, M. H. van den Berg and B. E. Dom. Focused crawling: a new approach to topic-specific Web resource discovery. Proceedings of the Eighth International World Wide Web Conference (WWW), Toronto, Canada, May 1999.
- T. Jörding. A Temporary User Modeling Approach for Adaptive Shopping on the Web. Proceedings of the 2nd Workshop on Adaptive Systems and User Modeling on the WWW, UM'99, Banff, Canada, 1999.
- J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, vol. 46, 1999.
- H. Lieberman, C. Fry and L. Weitzman. Exploring the Web with Reconnaissance Agents. Communications of the ACM, August 2001, pp. 69-75.
- A. Maedche and S. Staab. Discovering Conceptual Relations from Text. In W. Horn (ed.), ECAI 2000: Proceedings of the 14th European Conference on Artificial Intelligence (ECAI), Berlin, August 21-25, 2000.
- A. McCallum, D. Freitag and F. Pereira. Maximum Entropy Markov Models for Information Extraction and Segmentation. Proceedings of the International Conference on Machine Learning (ICML), Stanford, CA, pp. 591-598, 2000.
- I. Muslea, S. Minton and C. Knoblock. STALKER: Learning extraction rules for semistructured Web-based information sources. Proceedings of the National Conference on Artificial Intelligence (AAAI), Madison, Wisconsin, 1998.
- C. Nédellec. Corpus-based learning of semantic relations by the ILP system Asium. In J. Cussens and S. Dzeroski (eds.), Learning Language in Logic, Springer Verlag, September 2000.
- J. Rennie and A. McCallum. Efficient Web Spidering with Reinforcement Learning. Proceedings of the International Conference on Machine Learning (ICML), 1999.
- E. I. Schwartz. Webonomics. New York: Broadway Books, 1997.
- E. Schwarzkopf. An adaptive Web site for the UM2001 conference. Proceedings of the Workshop on Machine Learning for User Modeling, in conjunction with the International Conference on User Modeling (UM), pp. 77-86, 2001.