Title: MINING USERS' NAVIGATION PATTERNS AND PREDICTING THEIR NEXT STEPS Mark Levene School of Computer Science and Information Systems Birkbeck University of London
1 MINING USERS' NAVIGATION PATTERNS AND PREDICTING THEIR NEXT STEPS
Mark Levene, School of Computer Science and Information Systems, Birkbeck, University of London
2 Observation 1: The Web is a Complex Network (Map of the Internet, Bell Labs, 1998)
3 Observation 2: Need to make sense of practically infinite information
- cr008r01-3.sac2.fastsearch.net - - [21/Sep/2003:00:48:40 +0100] "GET /mark/handheld.html HTTP/1.0" 200 1730 "-" "FAST-WebCrawler/3.8 (atw-crawler at fast dot no; http://fast.no/support/crawler.asp)"
- cr008r01-3.sac2.fastsearch.net - - [21/Sep/2003:00:49:16 +0100] "GET /mark/games.html HTTP/1.0" 200 6582 "-" "FAST-WebCrawler/3.8 (atw-crawler at fast dot no; http://fast.no/support/crawler.asp)"
- cr008r01-3.sac2.fastsearch.net - - [21/Sep/2003:01:02:21 +0100] "GET /mark/bookshops.html HTTP/1.0" 200 3568 "-" "FAST-WebCrawler/3.8 (atw-crawler at fast dot no; http://fast.no/support/crawler.asp)"
- cr008r01-3.sac2.fastsearch.net - - [21/Sep/2003:01:41:04 +0100] "GET /mark/web.html HTTP/1.0" 200 14639 "-" "FAST-WebCrawler/3.8 (atw-crawler at fast dot no; http://fast.no/support/crawler.asp)"
- cr008r01-3.sac2.fastsearch.net - - [21/Sep/2003:01:42:17 +0100] "GET /mark/download/optdb_plan.pdf HTTP/1.0" 304 - "-" "FAST-WebCrawler/3.8 (atw-crawler at fast dot no; http://fast.no/support/crawler.asp)"
- ip68-98-199-25.mc.at.cox.net - - [21/Sep/2003:02:10:27 +0100] "GET /mark/download/optdb_integrity_constraints.pdf HTTP/1.0" 200 32768 "http://search.yahoo.com/search?pdefinitionofsuperkeyseiUTF-8frfp-topn20fl0xwrt" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
- ip68-98-199-25.mc.at.cox.net - - [21/Sep/2003:02:10:28 +0100] "GET /mark/download/optdb_integrity_constraints.pdf HTTP/1.0" 206 158146 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
- adsl-68-74-73-241.dsl.emhril.ameritech.net - - [21/Sep/2003:02:39:29 +0100] "GET /mark/book.html HTTP/1.1" 200 3373 "http://www.google.com/search?hlenieUTF-8oeUTF-8qrelationaldatabasesbasic" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
- adsl-68-74-73-241.dsl.emhril.ameritech.net - - [21/Sep/2003:02:39:30 +0100] "GET /mark/front_cover.gif HTTP/1.1" 200 64168 "http://www.dcs.bbk.ac.uk/mark/book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
- crawler14.googlebot.com - - [21/Sep/2003:03:35:52 +0100] "GET /mark/games.html HTTP/1.0" 200 6582 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
- cr008r01-3.sac2.fastsearch.net - - [21/Sep/2003:04:15:59 +0100] "GET /mark/download/optdb_table.pdf HTTP/1.0" 304 - "-" "FAST-WebCrawler/3.8 (atw-crawler at fast dot no; http://fast.no/support/crawler.asp)"
- drone10.sv.av.com - - [21/Sep/2003:04:47:09 +0100] "GET /mark/ HTTP/1.0" 200 5183 "-" "Scooter/3.3_SF"
- crawler14.googlebot.com - - [21/Sep/2003:04:49:22 +0100] "GET /mark HTTP/1.0" 301 309 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
- cr008r01-3.sac2.fastsearch.net - - [21/Sep/2003:05:18:46 +0100] "GET /mark/optdb_mailing_list.html HTTP/1.0" 200 622 "-" "FAST-WebCrawler/3.8 (atw-crawler at fast dot no; http://fast.no/support/crawler.asp)"
- pool-68-162-19-184.nwrk.east.verizon.net - - [21/Sep/2003:05:35:01 +0100] "GET /mark/download/optdb_erd.pdf HTTP/1.1" 200 0 "http://www.google.com/search?qentityrelationshipconcepthlzh-TWlrieUTF-8oeUTF-8start10saN" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; YComp 5.0.2.6)"
- pool-68-162-19-184.nwrk.east.verizon.net - - [21/Sep/2003:05:35:02 +0100] "GET /mark/download/optdb_erd.pdf HTTP/1.1" 206 0 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; YComp 5.0.2.6)"
- pool-68-162-19-184.nwrk.east.verizon.net - - [21/Sep/2003:05:35:07 +0100] "GET /mark/download/optdb_erd.pdf HTTP/1.1" 206 275480 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; YComp 5.0.2.6)"
- crawler14.googlebot.com - - [21/Sep/2003:05:49:47 +0100] "GET /mark/ HTTP/1.0" 200 5183 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
- cr008r01-3.sac2.fastsearch.net - - [21/Sep/2003:06:14:12 +0100] "GET /mark/download/webgraph.pdf HTTP/1.0" 304 - "-" "FAST-WebCrawler/3.8 (atw-crawler at fast dot no; http://fast.no/support/crawler.asp)"
4 Overview
- Mining of sequential patterns
- Web mining
- Prediction
- Detection of unexpected events
- Mobile mining
- A navigation engine
- What next?
5 Availability of Sequential Log Data
- Log files (web site, search engine, wireless network, ...) contain access data (time-stamp, id e.g. IP, cookie, tag, location, query, clickstream data, ...)
- Time-ordered sessions of sequential clicks or movement can be mined.
- Logs are a valuable source of information for understanding what users are doing and how a space is being used.
- Log files are not enough!
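As a rough illustration of how time-ordered sessions might be recovered from log records, the sketch below groups clicks by user id and starts a new trail after a period of inactivity. The record layout `(user_id, timestamp, page)`, the function name `build_trails`, the 30-minute timeout and the sample records are all assumptions made for this example, not details from the slides.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_trails(records, timeout=timedelta(minutes=30)):
    """Group (user_id, timestamp, page) records into time-ordered trails.

    A new trail starts whenever a user is idle for longer than `timeout`
    (an assumed heuristic, not taken from the slides).
    """
    by_user = defaultdict(list)
    for user_id, ts, page in records:
        by_user[user_id].append((ts, page))

    trails = []
    for user_id in sorted(by_user):
        clicks = sorted(by_user[user_id])  # order each user's clicks by time
        current = [clicks[0][1]]
        for (prev_ts, _), (ts, page) in zip(clicks, clicks[1:]):
            if ts - prev_ts > timeout:
                trails.append((user_id, current))  # close the idle session
                current = []
            current.append(page)
        trails.append((user_id, current))
    return trails


# Hypothetical records in the spirit of the log excerpt above.
records = [
    ("u1", datetime(2003, 9, 21, 2, 39, 29), "/mark/book.html"),
    ("u2", datetime(2003, 9, 21, 2, 10, 27), "/mark/download/optdb_erd.pdf"),
    ("u1", datetime(2003, 9, 21, 2, 39, 30), "/mark/front_cover.gif"),
    ("u1", datetime(2003, 9, 21, 5, 0, 0), "/mark/"),
]
trails = build_trails(records)
```

In practice crawler traffic would probably be filtered out first, since robots do not follow trails in the way human users do.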
6 Mining Navigation Patterns
- Each user session induces a user trail through the site/space.
- A trail is a sequence of accesses (interactions with landmarks) followed by a user during a session, ordered by time of access.
- A pattern in this context is a popular trail.
- Co-occurrence of accesses is important, e.g. shopping-basket and checkout.
- Use a variable length history Markov chain model. (It matters where you came from!)
7 Interaction Network (1st Order Markov Chain, no past history) of Users and Landmarks
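A first-order Markov chain of this kind can be estimated by counting transitions between consecutive accesses. The following is a minimal sketch under that assumption; the function name `first_order_model` and the toy trails are invented for illustration.

```python
from collections import Counter, defaultdict


def first_order_model(trails):
    """Estimate P(next | current) from transition counts, ignoring history."""
    counts = defaultdict(Counter)
    for trail in trails:
        for src, dst in zip(trail, trail[1:]):
            counts[src][dst] += 1
    model = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        model[src] = {dst: n / total for dst, n in outgoing.items()}
    return model


# Hypothetical trails over three landmarks.
trails = [
    ["home", "basket", "checkout"],
    ["home", "basket", "home"],
    ["home", "search"],
]
model = first_order_model(trails)
```

Here `model["home"]` assigns 2/3 to `basket` and 1/3 to `search`, regardless of how the user arrived at `home`.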
8 Interaction Tree (Suffix Tree) Representing a Variable Length Markov Chain (Record access information for each node)
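One simple way to approximate a variable length Markov chain is to store next-access counts for every context up to some maximum order and to predict from the longest context that has been seen. The sketch below uses a plain dictionary keyed by context tuples as a stand-in for the suffix tree; the names `build_context_counts` and `next_link_distribution`, the `max_order` parameter and the toy trails are assumptions for this example.

```python
from collections import Counter, defaultdict


def build_context_counts(trails, max_order=2):
    """Count which access follows each context of length 1..max_order."""
    counts = defaultdict(Counter)
    for trail in trails:
        for i in range(1, len(trail)):
            for k in range(1, max_order + 1):
                if i - k < 0:
                    break  # not enough history for a context this long
                counts[tuple(trail[i - k:i])][trail[i]] += 1
    return counts


def next_link_distribution(counts, history, max_order=2):
    """Return (matched context, distribution) for the longest matching suffix."""
    for k in range(min(max_order, len(history)), 0, -1):
        context = tuple(history[-k:])
        if context in counts:
            total = sum(counts[context].values())
            return context, {p: n / total for p, n in counts[context].items()}
    return (), {}  # nothing in the history has been seen before


trails = [["A", "B", "C"], ["D", "B", "E"], ["A", "B", "C"]]
counts = build_context_counts(trails)
known = next_link_distribution(counts, ["D", "B"])    # matches ("D", "B")
fallback = next_link_distribution(counts, ["X", "B"])  # falls back to ("B",)
```

Arriving at `B` from `D` predicts `E` with certainty, whereas an unseen history ending in `B` falls back to the order-1 estimate, which favours `C`. This is the sense in which it matters where you came from.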
9 Web Usage Mining
- Analyse trails that emerge from a user (or a group of users) surfing through a web space.
- Applications
  - Prefetching and caching web pages
  - Prediction (Recommender systems)
  - Personalisation
  - Clickstream analysis, e.g. eCommerce
  - Web site reorganisation
  - Detection of unexpected accesses
10 Hit and Miss Prediction
- Try to predict the next link the user will follow from the longest suffix of a trail that can be matched in the suffix tree.
- Assume that the maximum probability link was followed.
- Count a hit as 1 and a miss as 0.
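The hit and miss score can be sketched as follows, assuming the model has already produced a probability distribution over candidate next links. The function name `hit_or_miss` and the sample observations are hypothetical; ties are broken by dictionary order, which is an arbitrary choice.

```python
def hit_or_miss(distribution, followed):
    """Return 1 if the maximum probability link is the one followed, else 0."""
    if not distribution:
        return 0  # no prediction could be made; counted here as a miss
    predicted = max(distribution, key=distribution.get)
    return 1 if predicted == followed else 0


# Each observation pairs a predicted distribution with the link actually followed.
observations = [
    ({"C": 0.7, "E": 0.3}, "C"),
    ({"C": 0.7, "E": 0.3}, "E"),
    ({"home": 0.5, "search": 0.3, "basket": 0.2}, "home"),
    ({}, "C"),
]
scores = [hit_or_miss(dist, followed) for dist, followed in observations]
hit_rate = sum(scores) / len(scores)
```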
11 Rank-based Prediction
- Try to predict the next link as before.
- Rank the links from 1 upwards by decreasing probability; if the rth link was followed, record the error as r-1. The MAE (Mean Absolute Error) is the mean of these errors.
- Can generalise to top-n prediction.
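A small sketch of the rank-based score, again assuming a predicted distribution over next links is available. The helper names are invented, and ranking a link the model has never seen just after all known links is an assumption of this example rather than something the slides specify.

```python
def rank_of_followed(distribution, followed):
    """Rank (1 = most probable) of the link that was actually followed."""
    ranked = sorted(distribution, key=distribution.get, reverse=True)
    if followed in distribution:
        return ranked.index(followed) + 1
    return len(ranked) + 1  # assumption: unseen links rank after known ones


def mean_absolute_error(observations):
    """Mean of r - 1 over all (distribution, followed) observations."""
    errors = [rank_of_followed(d, f) - 1 for d, f in observations]
    return sum(errors) / len(errors)


def top_n_hit_rate(observations, n):
    """Fraction of observations where the followed link is in the top n."""
    hits = [rank_of_followed(d, f) <= n for d, f in observations]
    return sum(hits) / len(hits)


dist = {"home": 0.5, "search": 0.3, "basket": 0.2}
observations = [(dist, "home"), (dist, "search"), (dist, "basket")]
mae = mean_absolute_error(observations)      # errors are 0, 1 and 2
top2 = top_n_hit_rate(observations, n=2)
```

With n = 1 the top-n hit rate reduces to the hit and miss score.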
12 Probability-based Prediction
- Try to predict the next link as before.
- Let p be the probability of the maximum probability link.
- The score recorded is the ignorance, defined as -log2(p).
- Ignorance is nonlinear, ranging from zero to infinity, i.e. the penalty for performing worse than random is large.
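The ignorance score and its inverse can be written directly from the definition -log2(p). In this sketch the function names are invented, and the comparison with a uniform guess over four links is an illustrative assumption.

```python
import math


def ignorance(p):
    """Ignorance in bits for a probability p; infinite when p is zero."""
    if p <= 0:
        return math.inf
    return -math.log2(p)


def probability_from_ignorance(ign):
    """Invert the score: p = 2^(-ignorance)."""
    return 2.0 ** (-ign)


# A uniform guess over 4 links costs 2 bits; lower probabilities cost
# more, and the cost grows without bound as p approaches zero.
uniform_cost = ignorance(1 / 4)
costs = [ignorance(p) for p in (0.5, 0.25, 0.125, 0.0)]
recovered = probability_from_ignorance(ignorance(0.2))
```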
13 Unexpected Accesses
- Cannot predict rare events but can detect them.
- If the probability of an access is small, say less than or equal to alpha, then the access is unexpected.
- Alternatively, could use the 80/20 rule.
- If prediction is good for expected events, can detect the unexpected.
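The threshold test might look like the sketch below, where an access is flagged when the model assigns it a probability of at most alpha. The function name `is_unexpected`, the default alpha of 0.05 and the sample distribution are assumptions chosen for illustration.

```python
def is_unexpected(distribution, accessed, alpha=0.05):
    """Flag an access whose predicted probability is at most alpha.

    Links missing from the distribution are treated as probability 0.
    """
    return distribution.get(accessed, 0.0) <= alpha


# Hypothetical predicted distribution for the next access.
dist = {"basket": 0.70, "search": 0.26, "admin": 0.04}
flags = {page: is_unexpected(dist, page) for page in ("basket", "admin", "logout")}
```

Raising alpha flags more accesses as unexpected, so the threshold trades sensitivity against false alarms.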
14 Proof of Concept
- Temporal evaluation: split data into k sequences ordered by time, infer the model from sequences 1 to i and evaluate prediction on trails from sequence i+1.
- Analysis in progress!
- I present trends on a sample of trails, in collaboration with Jose Borges from the University of Porto.
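The temporal evaluation scheme can be sketched as a generator of (training, test) pairs. Representing each trail as `(start_time, pages)`, using equal-sized chunks and the name `temporal_folds` are assumptions of this example; the slides do not say how the k sequences are delimited.

```python
def temporal_folds(trails, k):
    """Yield (train, test) pairs: train on chunks 1..i, test on chunk i+1.

    `trails` is a list of (start_time, pages) tuples. Chunks are equal-sized
    where possible, so fewer than k chunks may result for small inputs.
    """
    ordered = sorted(trails, key=lambda trail: trail[0])
    size = -(-len(ordered) // k)  # ceiling division
    chunks = [ordered[j:j + size] for j in range(0, len(ordered), size)]
    for i in range(1, len(chunks)):
        train = [trail for chunk in chunks[:i] for trail in chunk]
        yield train, chunks[i]


# Hypothetical trails with integer start times, given out of order.
trails = [(3, ["A", "B"]), (1, ["A", "C"]), (6, ["B", "C"]),
          (2, ["C", "A"]), (5, ["A", "B", "C"]), (4, ["B", "A"])]
folds = list(temporal_folds(trails, k=3))
fold_times = [([t for t, _ in train], [t for t, _ in test])
              for train, test in folds]
```

Because training data always precedes test data in time, the model is never evaluated on trails older than those it was inferred from.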
15 Data Analysis: Hit and Miss error against model order
16 Data Analysis: MAE error against model order
17 Data Analysis: Ignorance error against model order
18 Ignorance Probabilities: 2^(-Ignorance)

MSWEB Probabilities
All     0       0.01    0.025   0.05    0.1
0.0450  0.0574  0.0938  0.1158  0.1290  0.1875
0.0342  0.0871  0.1186  0.1439  0.1684  0.2263
0.0249  0.1053  0.1357  0.1630  0.1896  0.2478
0.0227  0.1101  0.1426  0.1720  0.1997  0.2596

LTM Probabilities
All     0       0.01    0.025   0.05    0.1
0.0722  0.1220  0.2562  0.3513  0.4670  0.6073
0.0702  0.2085  0.3199  0.4026  0.5059  0.6171
0.0624  0.2502  0.3578  0.4323  0.5280  0.6382
0.0568  0.2807  0.3913  0.4623  0.5506  0.6536
19 Mobile Usage Mining
- Analyse trails that emerge from a user (or a group of users) moving through a physical space.
- Applications
  - Prediction (Recommender systems)
  - Personalisation (location awareness)
  - Movement analysis
  - Ubiquitous search
  - Space reorganisation
  - Detection of unexpected interactions
- In collaboration with colleagues from Birkbeck, University of London
20 Trail in a Museum Exhibition
21 Trail from entrance to exit in a Zoo
22 Navigation Engine Architecture
23 Query Interface
24 Landmark Query
25 Trail Query
26 Hot Spots
27 Popular Trails
28 What Next?
- More evaluation.
- Implement and evaluate prediction and pattern detection algorithms within the navigation engine.
- Investigate novel visualisation methods of usage within the navigation engine.
- Thank you!
29 Data Analysis: Hit and Miss
30 Data Analysis: MAE
31 Data Analysis: Ignorance