Title: Contents
1. Contents
- Introduction
- Knowledge discovery from text and links
- Knowledge discovery from usage data
- Important open issues
2. WWW: the new face of the Net
Once upon a time, the Internet was a forum for exchanging information. Then came the Web.
The Web introduced new capabilities and attracted many more people, increasing commercial interest and turning the Net into a real forum.
3. Information overload
As more people started using the Web, the quantity of online information increased; the growing quantity of information attracted even more people, increasing it further and leading to information overload for the users.
4. WWW: an expanding forum
- The Web is large and volatile:
- More than 600,000,000 users online.
- More than 800,000 sign up every day.
- More than 9,000,000 Web sites.
- More than 300,000,000,000 pages online.
- Less than 50% of Web sites will be there next year.
- ...leading to the abundance problem: 99% of online information is of no interest to 99% of the people.
5. Information access services
- A number of services aim to help the user gain access to online information and products...
- But can they really cope?
6. New requirements
- Current indexing does not allow for wide coverage: less than 5% of the Web is covered by search engines.
- What I want is hardly ever ranked high enough.
- Product information in catalogues is often biased towards specific suppliers and outdated.
- Product descriptions are incomplete and insufficient for comparison purposes.
- The E in E-commerce stands for English: more than 70% of the Web is in English.
- ...and many more problems lead to the conclusion that more intelligent solutions are needed!
7. A new generation of services
- Some have already made their way to the market; many more are being developed as I speak.
8. Approaches to Web mining
- Primary data (Web content):
- Mainly text,
- with some multimedia content (increasing),
- and mark-up commands, including hyperlinks.
- Underlying databases (not directly accessible).
- Knowledge discovery from text and links:
- Pattern discovery in unstructured textual data.
- Pattern discovery in the Web graph / hypertext.
9. Approaches to Web mining
- Secondary data (Web usage):
- Access logs collected by servers,
- potentially using cookies,
- and a variety of navigational information collected by Web clients (mainly JavaScript agents).
- Knowledge discovery from usage data:
- Discovery of interesting usage patterns, mainly from server logs.
- Web personalization -> Web intelligence.
10. Contents
- Introduction
- Knowledge discovery from text and links
- Introduction
- Information filtering and retrieval
- Ontology learning
- Knowledge discovery from usage data
- Important open issues
11. Information access
- Goals:
- Organize documents into categories.
- Assign new documents to the categories.
- Retrieve information that matches a user query.
- Dominating statistical idea:
- TF-IDF (term frequency / inverse document frequency; spelled out below).
- Problems on the Web:
- Huge scale and high volatility demand automation.
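For reference, the standard form of the TF-IDF weight (the notation is introduced here; the slide only names the idea):

    w(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}

where tf(t, d) is the frequency of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents in the collection.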
12. Text mining
- Knowledge (pattern) discovery in textual data.
- Clarifying common misconceptions:
- Text mining is NOT about assigning documents to thematic categories, but about learning document classifiers.
- Text mining is NOT about extracting information from text, but about learning information extraction patterns.
- Difficulty: the unstructured format of textual data.
13. Approaches to text mining
- Combination of language engineering (LE), machine learning (ML) and statistical methods.
14. Hyperlink information is useful
- Information access can be improved by identifying authoritative pages (authorities) and resource index pages (hubs).
- Linked pages often contain complementary information (e.g. product offers).
- Thematically related pages are often linked, either directly or indirectly.
15. Document category modelling
Input: training documents (pre-classified).
- Pre-processing: stopword removal (and, the, etc.), stemming (played → play), bag-of-words coding.
- Dimensionality reduction: statistical selection/combination of characteristic terms (MI, PCA).
- Machine learning: supervised classifier learning.
Output: category models (classifiers).
16. Document category modelling
- Example: filtering spam email.
- Task: classify incoming email as spam or legitimate (2 document categories).
- Simple blacklist and keyword-based methods have failed.
- More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modelling).
17. Document category modelling
- Step 1 (linguistic pre-processing): tokenization, removal of stopwords, stemming/lemmatization.
- Step 2 (vector representation): bag-of-words or n-gram modelling (n = 2, 3).
- Step 3 (feature selection): information gain evaluation.
- Step 4 (machine learning): Bayesian modelling, using word/n-gram frequencies (see the sketch below).
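A minimal Python sketch of steps 1-4, assuming scikit-learn is available; the vectorizer handles tokenization and stopword removal, and chi-squared feature selection stands in for information gain. The toy messages are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

mails = ["buy cheap pills buy now", "meeting agenda attached",
         "cheap pills now buy pills", "lunch tomorrow at noon"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate

model = Pipeline([
    ("bow", CountVectorizer(stop_words="english", ngram_range=(1, 2))),  # steps 1-2
    ("select", SelectKBest(chi2, k=5)),                                  # step 3
    ("bayes", MultinomialNB()),                                          # step 4
])
model.fit(mails, labels)
print(model.predict(["buy cheap pills"]))  # -> [1] (spam)

A real filter would of course be trained on thousands of labelled messages, but the pipeline structure is the same.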
18. Link structure analysis
- Improve information retrieval by scoring Web pages according to their importance in the Web, or in a thematic sub-domain of it.
- Nodes with large fan-in (authorities) provide high-quality information.
- Nodes with large fan-out (hubs) are good starting points.
19. Link structure analysis
- The HITS algorithm [Kleinberg, Journal of the ACM, 1999]:
- Given a set of Web pages, e.g. as generated by a query,
- expand the base set by including pages that are linked to by the ones in the initial set, or that link to them,
- assign a hub and an authority weight to each page, initialised to 1,
- update the authority weight of page p according to the hub weights of the pages that link to it,
- update the hub weight of page p according to the authority weights of the pages that it links to,
- repeat the weight updates a given number of times,
- return a list of the pages ranked by their weights. (A sketch of the update loop follows.)
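A minimal Python sketch of the HITS update loop on a toy graph; the base-set expansion step is assumed to have happened already, and the normalisation (standard in Kleinberg's formulation, not spelled out on the slide) keeps the weights bounded:

def hits(links, iterations=20):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}   # hub weights, initialised to 1
    auth = {p: 1.0 for p in pages}  # authority weights, initialised to 1
    for _ in range(iterations):
        # authority of p: sum of the hub weights of the pages linking to it
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # hub weight of p: sum of the authority weights of the pages it links to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # normalise so the weights stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

graph = {"a": ["c"], "b": ["c"], "c": []}   # invented toy graph
auth, hub = hits(graph)
print(max(auth, key=auth.get))  # 'c': the page everyone links to is the authority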
20. Link structure analysis
- Interesting issues:
- Does the social network hypothesis hold, i.e. are authorities highly cited? This may be unrealistic in competitive commercial domains.
- What happens if the link structure adapts to the method, e.g. unrelated pages link to each other to increase their rating?
- What about interesting new pages? How will people get to them?
21. Focused crawling / spidering
- Crawling/spidering: automatic navigation through the Web by robots, with the aim of indexing the Web.
- Crawling vs. spidering (a subjective distinction): inter-site vs. intra-site navigation.
- Focused crawling/spidering: efficient, thematic indexing of relevant Web pages, e.g. for maintaining a thematic portal.
- Underlying assumption, similar to HITS: thematically similar pages are linked.
22. Focused crawling
- Focused crawling [Chakrabarti et al., WWW 1999]:
- Given an initial set of Web pages about a topic, e.g. as found in a Web directory,
- use document category modelling to build a topic classifier,
- extract the hyperlinks within the initial set of pages and add them to a queue of pages to be visited,
- retrieve pages from the queue,
- use the classifier to assess the relevance of retrieved pages,
- use a variant of HITS to assign a hub score to pages and to the hyperlinks in the queue,
- re-sort the links in the queue according to their hub score,
- continue the retrieval of new pages, periodically updating the scores of the hyperlinks in the queue. (A sketch of the crawl loop follows.)
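A skeletal sketch of the crawl loop over an invented in-memory "Web"; the relevance and hub_score helpers are simplistic stand-ins for the topic classifier and the HITS-based scoring described above, not the method of the paper:

import heapq

# Toy in-memory "Web": url -> (page text, outgoing links). Invented data.
WEB = {
    "seed": ("tourism portal hotels", ["a", "b"]),
    "a": ("hotel prices and offers", ["c"]),
    "b": ("celebrity gossip", []),
    "c": ("cheap hotel rooms", []),
}

def relevance(text):
    # Stand-in for the learned topic classifier (e.g. naive Bayes).
    return 1.0 if "hotel" in text else 0.0

def hub_score(url):
    # Stand-in for HITS-style link scoring; here: number of outgoing links.
    return len(WEB.get(url, ("", []))[1])

def focused_crawl(seeds, budget=10):
    queue = [(-1.0, u) for u in seeds]   # priority queue: best-scored link first
    heapq.heapify(queue)
    visited, relevant = set(), []
    while queue and len(visited) < budget:
        _, url = heapq.heappop(queue)
        if url in visited or url not in WEB:
            continue
        visited.add(url)
        text, links = WEB[url]
        if relevance(text) > 0.5:        # keep only on-topic pages
            relevant.append(url)
            for link in links:           # enqueue outgoing links by their score
                heapq.heappush(queue, (-hub_score(link), link))
    return relevant

print(focused_crawl(["seed"]))  # ['seed', 'a', 'c']: the off-topic page is skipped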
23. Focused crawling / spidering
- Domain-specific spidering:
- Goal: retrieve the interesting pages without traversing the whole site.
- Differences from crawling:
- The site is much more restricted in size and thematic diversity than the whole of the Web.
- Social network analysis is less relevant within a site (no hubs and authorities).
- Requirement: link scoring using local features, e.g. the anchor text and its textual context.
24. Information extraction
- Goals:
- Identify interesting events in unstructured text.
- Extract information related to the events and store it in structured templates.
- Typical application:
- Information extraction from newsfeeds.
- Difficulties:
- Deals with unstructured or semi-structured text.
- Identification of entities and relations.
- Usually requires some understanding of the text.
25. A typical extraction system
Input: unstructured text and a database schema (event templates).
- Morphology: sentence and word separation, lemmatization (said → say), part-of-speech tagging, etc.
- Syntax: shallow syntactic parsing.
- Semantics: named-entity recognition, co-reference resolution, sense disambiguation.
- Discourse: pattern matching.
Output: structured data (filled templates).
26. Wrappers / fact extraction
- Simplified information extraction:
- Extract interesting facts from Web documents.
- Assumes structure in the documents (usually dynamically generated from databases).
- Reduced demand for pre-processing and LE.
- Typical application:
- Product comparison services (price, availability, ...).
- Difficulties:
- Semi-structured data.
- Different underlying database schemata and presentation formats.
27. Wrappers / fact extraction
Example page P:
<HTML><TITLE> Some Country Codes </TITLE>
<BODY><B> Some Country Codes </B><P>
<B> Congo </B> <I> 242 </I>
<B> Egypt </B> <I> 20 </I>
<B> Greece </B> <I> 30 </I>
<B> Spain </B> <I> 34 </I>
<HR><B> End </B></BODY></HTML>

Wrapper(page P):
  Skip past the first occurrence of <P> in P
  While (the next <B> is before the next <HR> in P)
    For each (l, r) in {(<B>, </B>), (<I>, </I>)}
      Extract the text between l and r
  Return the extracted (country, code) pairs
Country Code
Congo 242
Egypt 20
Greece 30
Spain 34
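A runnable Python version of the same wrapper, using only the standard library; the skip-past-<P>, stop-at-<HR> logic and the tag pairs mirror the pseudocode above:

import re

PAGE = ("<HTML><TITLE> Some Country Codes </TITLE><BODY><B> Some Country Codes </B><P>"
        "<B> Congo </B><I> 242 </I><B> Egypt </B><I> 20 </I>"
        "<B> Greece </B><I> 30 </I><B> Spain </B><I> 34 </I>"
        "<HR><B> End </B></BODY></HTML>")

def wrapper(page):
    body = page.split("<P>", 1)[1]    # skip past the first <P>
    body = body.split("<HR>", 1)[0]   # stop at <HR>
    # extract the text between each (<B>, </B>) and (<I>, </I>) pair
    countries = re.findall(r"<B>(.*?)</B>", body)
    codes = re.findall(r"<I>(.*?)</I>", body)
    return [(c.strip(), k.strip()) for c, k in zip(countries, codes)]

print(wrapper(PAGE))  # [('Congo', '242'), ('Egypt', '20'), ('Greece', '30'), ('Spain', '34')]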
28. Wrapper induction
Input: training documents (semi-structured) and a database schema (the interesting facts).
- Data pre-processing: abstraction of the mark-up structure (often omitted).
- Machine learning: structural/sequence learning.
Output: fact extraction patterns (a wrapper).
29. Ontology learning
Input: training documents (unclassified).
- Pre-processing: stopword removal (and, the, etc.), stemming (played → play), syntactic/semantic analysis, bag-of-words coding.
- Dimensionality reduction: hand-made thesauri (WordNet), term co-occurrence (LSI).
- Machine learning: unsupervised learning (clustering and association discovery).
Output: ontologies.
30. Ontology learning
- Hierarchical clustering is most suitable:
- Agglomerative clustering.
- Conceptual clustering (COBWEB).
- Model-based clustering (EM-type, MCLUST).
- ...but flat clustering can also be adapted:
- K-means and its variants.
- Bayesian clustering (AutoClass).
- Neural networks (self-organizing maps).
- Association discovery (e.g. Apriori) for non-taxonomic relations. (A small clustering sketch follows.)
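For a flavour of the agglomerative route, a minimal sketch using SciPy on invented term vectors; the tree it builds (terms merged bottom-up by similarity) is the raw material for a concept taxonomy:

from scipy.cluster.hierarchy import linkage, fcluster

# Toy co-occurrence vectors for terms; rows are terms, columns are contexts.
terms = ["hotel", "hostel", "beach", "museum"]
vectors = [
    [3, 0, 2, 1],   # hotel
    [2, 0, 2, 1],   # hostel
    [0, 3, 1, 0],   # beach
    [0, 2, 0, 3],   # museum
]
# Bottom-up (agglomerative) clustering with cosine distance.
tree = linkage(vectors, method="average", metric="cosine")
print(fcluster(tree, t=2, criterion="maxclust"))  # e.g. [1 1 2 2]: hotel/hostel group together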
31. Ontology learning
- Example: acquisition of an ontology for tourist information (based on Maedche & Staab, ECAI 2000).
32. Ontology learning
- Source data: Web pages of tourist sites.
- Background knowledge: generic and domain-specific ontologies.
- Target users: tourist directories, large travel agencies.
- Goals:
- Identify types of page (e.g. room descriptions) and terms/entities inside pages (e.g. hotel addresses).
- Identify taxonomic relations between concepts (e.g. accommodation -> hotel).
- Identify non-taxonomic relations between concepts (e.g. accommodation - area).
33. Ontology learning
- Heavy linguistic pre-processing:
- Syntactic analysis, e.g. verb subcategorization frames: verb(arrive) -> prep(at), dir_obj(Torino).
- Semantic analysis, e.g. named-entity recognition: Via Lagrange -> street name; or special dependency relations: Hotel Concord in Torino.
34. Contents
- Introduction
- Knowledge discovery from text and links
- Knowledge discovery from usage data
- Personalization on the Web
- Data collection and preparation issues
- Personalized assistants
- Discovering generic user models
- Sequential pattern discovery
- Knowledge discovery in action
- Important open issues
35. Personalized information access
(Diagram: information flows from the content sources through a personalization server to the receivers.)
36. Personalization vs. intelligence
- Better service for the user:
- Reduction of the information overload.
- More accurate information retrieval and extraction.
- Recommendation and guidance.
37. Personalized assistants
- Personalized crawling [Lieberman et al., Communications of the ACM, 2001]:
- The system knows the user (log-in).
- It uses heuristics to extract important terms from the Web pages that the user visits, and adds them to thematic profiles.
- Each time the user views a page, the system:
- searches the Web for related pages,
- filters them according to the relevant thematic profile,
- and constructs a list of recommended links for the user.
- The Letizia version of the system searches the Web locally, following outgoing links from the current page.
- The Powerscout version uses a search engine to explore the Web.
38. Personalized assistants
- Adaptive Web interfaces [Jörding, UM 1999]:
- The TELLIM system collects user information (e.g. the selection of a link) using a Java applet.
- User information is used as training data to create generic models reflecting the users' interest in different products.
- The system creates short-term personal models using the generic models and the current user's behaviour.
- Web pages containing more detailed information about these products, together with multimedia content and VRML presentations, are created dynamically and presented to the users.
39. User modelling
- Basic elements:
- Constructing models that can be used to adapt the system to the user's requirements.
- Different types of requirement: interests (sports and finance news), knowledge level (novice to expert), preferences (no-frame GUI), etc.
- Different types of model: personal vs. generic.
- Knowledge discovery facilitates the acquisition of user models from data.
40. User models
- User model (type A), PERSONAL:
- User x -> sports, stock market
- User model (type B), PERSONAL:
- User x, Age: 26, Male -> sports, stock market
- User community, GENERIC:
- Users x, y, z -> sports, stock market
- User stereotype, GENERIC:
- Users x, y, z, Age: 20..30, Male -> sports, stock market
41. Generic user models
- Stereotypes: models that represent a type of user, associating personal characteristics with parameters of the system,
- e.g. male users of age 20-30 are interested in sports and politics.
- Communities: models that represent a group of users with common preferences,
- e.g. users that are interested in sports and politics.
42. Learning user models
43. Knowledge discovery process
- Data collection: collection of usage data by the server and the client.
- Data pre-processing: data cleaning, user identification, session identification.
- Pattern discovery: construction of user models.
- Knowledge post-processing: report generation, visualization, personalization module.
44. Pre-processing usage data
- Cleaning:
- Log entries that correspond to error responses.
- Trails of robots.
- Pages that have not been requested explicitly by the user (mainly image files, loaded automatically); the cleaning rules should be domain-specific.
- User identification:
- Identification by log-in.
- Cookies and JavaScript.
- Extended Log Format (browser and OS version).
- Bookmarking a user-specific URL.
- Various other heuristics.
45. Pre-processing usage data
- User session / transaction identification in log files:
- Time-based methods, e.g. a 30-minute silence interval (see the sketch after this list). Caching causes problems; partial solutions include special HTTP headers and Java agents.
- Context-based methods, e.g. separating pages into navigational and content pages and imposing heuristics on the types of page a user session may consist of.
- User sessions can be subdivided into smaller transaction sequences, e.g. by identifying a backward reference in the sequence of requests.
- Encoding of training data:
- Bag-of-pages representation of sessions/transactions.
- Transition-based representation of sessions/transactions.
- Manually determined features of interest.
46. Collaborative filtering
- Information filtering according to the choices of similar users.
- Avoids semantic content analysis.
- Cold-start problem with new users.
- Approaches:
- memory-based learning,
- model-based clustering,
- item-based recommendation.
47. Memory-based learning
- Nearest-neighbour approach:
- Construct a model for each user, often using explicit user ratings for each item.
- Index the user in the space of system parameters, e.g. item ratings.
- For each new user:
- index the user in the same space, and
- find the k closest neighbours.
- Simple metrics measure the similarity between users, e.g. Pearson correlation.
- Recommend the items that the new user has not seen and that are popular among the neighbours. (A sketch follows.)
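A compact sketch of the neighbourhood method with Pearson correlation, on an invented toy ratings matrix (dicts of user -> {item: rating}):

from math import sqrt

def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = (sqrt(sum((a[i] - ma) ** 2 for i in common))
           * sqrt(sum((b[i] - mb) ** 2 for i in common)))
    return num / den if den else 0.0

def recommend(user, ratings, k=2):
    # the k most similar other users
    neighbours = sorted((u for u in ratings if u != user),
                        key=lambda u: pearson(ratings[user], ratings[u]),
                        reverse=True)[:k]
    seen = set(ratings[user])
    # unseen items, ranked by their popularity among the neighbours
    candidates = {i for u in neighbours for i in ratings[u]} - seen
    return sorted(candidates,
                  key=lambda i: sum(ratings[u].get(i, 0) for u in neighbours),
                  reverse=True)

ratings = {"ann": {"sports": 5, "finance": 4},
           "bob": {"sports": 5, "finance": 4, "world": 2},
           "eve": {"sports": 1, "finance": 2, "politics": 5}}
print(recommend("ann", ratings))  # ['politics', 'world']

A real system would weight the votes by similarity and drop negatively correlated neighbours; this sketch only shows the mechanics.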
48. Model-based clustering
- Clustering users into communities.
- Methods used:
- Conceptual clustering (COBWEB).
- Graph-based clustering (cluster mining).
- Statistical clustering (AutoClass).
- Neural networks (self-organizing maps).
- Model-based clustering (EM-type).
- BIRCH.
- Community models: the cluster descriptions.
49. Model-based clustering
(Figure: a graph of users connected by similarity-weighted edges, with weights such as 0.9, 0.8, 0.5, 0.4 and 0.1; strongly connected groups of users form communities.)
50. Item-based recommendation
- Focus on item usage in the profiles, instead of on the users themselves.
- Practically useful in e-commerce, e.g. for cross-sell recommendations.
- A simple modification of the clique-based clustering method: a graph of items instead of a graph of users.
- Related to frequent itemset discovery in association rule mining. (A sketch of the item graph follows.)
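A toy sketch of the item-graph idea: connect items whose co-occurrence across sessions is strong, then read strongly connected groups off the graph. The Jaccard measure and the threshold are illustrative choices, not from the slides:

from itertools import combinations

sessions = [{"sports", "politics"}, {"sports", "politics", "world"},
            {"world", "finance"}, {"sports", "politics"}]

def item_graph(sessions, threshold=0.5):
    items = set().union(*sessions)
    edges = {}
    for a, b in combinations(sorted(items), 2):
        both = sum(1 for s in sessions if a in s and b in s)
        either = sum(1 for s in sessions if a in s or b in s)
        weight = both / either if either else 0.0   # Jaccard co-usage
        if weight >= threshold:
            edges[(a, b)] = weight                  # keep only strong edges
    return edges

print(item_graph(sessions))  # {('finance', 'world'): 0.5, ('politics', 'sports'): 1.0}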
51. Item-based recommendation
(Figure: an item graph with nodes Politics, Sports, World and Finance, connected by weighted edges such as 0.9, 0.8, 0.5, 0.4 and 0.1.)
52. Contents
- Introduction
- Knowledge discovery from text and links
- Knowledge discovery from usage data
- Personalization on the Web
- Data collection and preparation issues
- Personalized assistants
- Discovering generic user models
- Sequential pattern discovery
- Knowledge discovery in action
- Important open issues
53. Sequential pattern discovery
- Identifying navigational patterns, rather than bag-of-pages models.
- Methods:
- Clustering of transitions between pages.
- First-order Markov models (see the sketch after this list).
- Probabilistic grammar induction.
- Association-rule sequence mining.
- Path traversal through graphs.
- Personal and community navigation models.
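As a taste of the first-order Markov approach, a minimal sketch that estimates page-to-page transition probabilities from toy session sequences; these are maximum-likelihood counts only, and real models would add smoothing:

from collections import Counter, defaultdict

sessions = [["home", "sports", "politics"],
            ["home", "finance", "sports"],
            ["home", "sports", "finance"]]

counts = defaultdict(Counter)
for s in sessions:
    for src, dst in zip(s, s[1:]):
        counts[src][dst] += 1   # count observed page-to-page transitions

# Normalise the counts into first-order transition probabilities P(dst | src)
model = {src: {dst: n / sum(c.values()) for dst, n in c.items()}
         for src, c in counts.items()}
print(model["home"])  # {'sports': 0.666..., 'finance': 0.333...}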
54. Sequential pattern discovery
- Clique-based transition clustering: a small modification of the model-based item clustering approach, where an item is a transition between pages.
(Figure: a graph of page transitions, with nodes Sports->Politics, Finance->Politics, Sports->Finance and Finance->Sports, connected by weighted edges such as 0.9, 0.8, 0.5, 0.4 and 0.1.)
55. References
- J. Borges and M. Levene. Data mining of user navigation patterns. Proceedings of the Workshop on Web Usage Analysis and User Profiling (WEBKDD), in conjunction with the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, pp. 31-36, 1999.
- S. Chakrabarti, M. H. van den Berg and B. E. Dom. Focused crawling: a new approach to topic-specific Web resource discovery. Proceedings of the Eighth International World Wide Web Conference (WWW), Toronto, Canada, May 1999.
- T. Jörding. A Temporary User Modeling Approach for Adaptive Shopping on the Web. Proceedings of the 2nd Workshop on Adaptive Systems and User Modeling on the WWW, UM'99, Banff, Canada, 1999.
- J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, vol. 46, 1999.
- H. Lieberman, C. Fry and L. Weitzman. Exploring the Web with Reconnaissance Agents. Communications of the ACM, August 2001, pp. 69-75.
- A. Maedche and S. Staab. Discovering Conceptual Relations from Text. In W. Horn (ed.), ECAI 2000: Proceedings of the 14th European Conference on Artificial Intelligence (ECAI), Berlin, August 21-25, 2000.
- A. McCallum, D. Freitag and F. Pereira. Maximum Entropy Markov Models for Information Extraction and Segmentation. Proceedings of the International Conference on Machine Learning (ICML), Stanford, CA, pp. 591-598, 2000.
- I. Muslea, S. Minton and C. Knoblock. STALKER: Learning extraction rules for semistructured Web-based information sources. Proceedings of the National Conference on Artificial Intelligence (AAAI), Madison, Wisconsin, 1998.
- C. Nédellec. Corpus-based learning of semantic relations by the ILP system Asium. In J. Cussens and S. Dzeroski (eds.), Learning Language in Logic, Springer Verlag, September 2000.
- J. Rennie and A. McCallum. Efficient Web Spidering with Reinforcement Learning. Proceedings of the International Conference on Machine Learning (ICML), 1999.
- E. I. Schwartz. Webonomics. New York: Broadway Books, 1997.
- E. Schwarzkopf. An adaptive Web site for the UM2001 conference. Proceedings of the Workshop on Machine Learning for User Modeling, in conjunction with the International Conference on User Modeling (UM), pp. 77-86, 2001.