Title: Discovering Digital Behavior: Data Mining and the Web
1 Discovering Digital Behavior: Data Mining and the Web
- Padhraic Smyth
- Information and Computer Science
- University of California, Irvine
- and
- Jet Propulsion Laboratory
- smyth_at_ics.uci.edu
- www.ics.uci.edu/smyth/
2 Outline
- Data Mining
- techniques for analyzing massive data sets
- ideas from computer science and statistics
- Digital Behavior
- behavior of individuals in a digital environment
- e.g., Web, Windows, virtual reality, etc.
- Web Mining
- inferring behavioral patterns from user data
- applications: clustering, prediction, personalization
3 Massive Data Sets
[Figure: data table with rows numbered 1, 2, ..., N]
- Characteristics
- very large N (billions)
- very high dimensionality d (thousands or millions)
- heterogeneous data types
- dynamic, non-stationary
- Large N is relatively easy
- Dimensionality, heterogeneity, non-stationarity are hard
4 Data Mining
- What is data mining?
- the search for structure and patterns in (massive) data sets
- interdisciplinary: typically uses ideas from computer science and statistics as appropriate
- data-driven rather than theory-driven: EDA-like
- applications-focused
- emphasis often on massive data sets
5 Origins of Data Mining
[Figure: timeline (pre-1960, 1960s, 1970s, 1980s, 1990s) of the threads leading to data mining: Hardware (sensors, storage, computation), Relational Databases, Machine Learning, AI, Pattern Recognition, Flexible Models, EDA, Pencil and Paper, Data Dredging]
6 Data Mining Research Communities
[Figure: research communities: Statistics, Databases, KDD and Data Mining, Visualization, Machine Learning, Applications, Artificial Intelligence]
7 What do people want to do with their data?
- Explore
- visualization, EDA, etc.
- Summarize
- aggregate patterns, data compression (no notion of inference)
- Understand
- build generative and descriptive models, e.g., clusters
- Prediction
- predictive modeling (classification, regression, etc.)
- Change Detection
- discover trends, unusual patterns, outliers, etc.
8 Data Mining of your Telephone Calls
- Background
- AT&T has about 100 million customers
- It logs 300 million calls per day, 40 attributes each
- 350 million unique telephone numbers
- The Data Mining Approach (Pregibon and Cortes, KDD 1998)
- Statistical model trained to adaptively track p(business | calling data)
- Every call to or from an AT&T number is used to update the models
- 350 million models (one per phone number)
- Significant systems engineering
- data are downloaded nightly, model updated
- 20 processors, 6Gb RAM, terabyte disk farm
- Provides AT&T with a unique daily snapshot of US calling patterns
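The idea of adaptively tracking one small model per phone number can be illustrated with a minimal sketch. It assumes each call yields a rough business-likeness score in [0, 1] and folds that score into a per-number estimate with an exponentially weighted update. The score, the rate `lam`, the 0.5 starting value, and the name `update_estimate` are illustrative assumptions, not details of the Pregibon and Cortes system.

```python
from collections import defaultdict

def update_estimate(estimates, number, call_score, lam=0.1):
    """Blend one call's evidence into the running estimate for `number`.

    call_score is a hypothetical per-call business-likeness score in [0, 1];
    lam controls how quickly old evidence is forgotten.
    """
    estimates[number] = (1.0 - lam) * estimates[number] + lam * call_score
    return estimates[number]

# One small piece of state per phone number, starting from an uninformative 0.5.
estimates = defaultdict(lambda: 0.5)

# Toy call log (made-up numbers): (phone number, score of that call).
calls = [("555-0100", 1.0), ("555-0100", 1.0), ("555-0199", 0.0), ("555-0100", 1.0)]
for number, score in calls:
    update_estimate(estimates, number, score)
```

Because each update touches only the state of the numbers involved in a call, this style of model can in principle be refreshed from a nightly batch of call records without revisiting older data.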
9 Digital Behavior
- Modeling behavior of individuals in a digital environment
- Motivated by availability of massive data sets
- mouse clicks, keystrokes
- Web navigation
- search queries
- biometrics (cameras)
- Goal: develop better models of dynamic behavior
- population level (aggregate)
- individual level (personalized)
- Use these models for improved design, feedback, prediction
- Privacy issues
10 Digital Data Sets
- Navigation Patterns
- Server Side Web Access Logs
- several gigabytes per day is not unusual
- data can be noisy (difficult to identify users)
- Client Side Browser Monitoring Software
- e.g., Alexa.com software assistant which downloads all page requests nightly
- Web Connectivity
- patterns of connectivity, graphs
- Demographics
- background information on the user
11 Models for Digital Behavior
- Information goals
- hidden, difficult to assess
- we can try to infer what we can
- Behavioral Patterns
- type of behavior: reading, searching, browsing, etc.
- dynamics of this process
- click rates
- a function of the user's background and general characteristics
- Coupled models
- behavior is driven by information goals and static characteristics
- vary over time, context-dependent
12 Dynamic Behavior
[Diagram: Dynamic Behavior at the Population, Group, and Individual levels; Real-Time Behavior]
13 Information Goals
[Diagram, extended: Information Goals and Dynamic Behavior, each at the Population, Group, and Individual levels; Real-Time Behavior]
14 Information Goals
[Diagram, extended: Information Goals, Dynamic Behavior, and Static Characteristics, each at the Population, Group, and Individual levels; Real-Time Behavior]
15 Information Goals
[Diagram, complete: Information Goals, Dynamic Behavior, and Static Characteristics, each at the Population, Group, and Individual levels; Real-Time Behavior; Digital Environment]
16 Where Data Mining fits in
- Observed data
- the real-time behavior (pages navigated, timing between clicks, etc.)
- the environment (Web page content, connectivity, etc.)
- static characteristics (perhaps), e.g., demographics of each user
- Hidden data
- information goals
- behavioral characteristics
- Data Mining
- postulate relatively simple but flexible models for behavior and information goals
- discover which models best describe your individuals and populations from the data, i.e., fit to the data
17 Examples of Models
- Information Goal Modeling
- information retrieval: model a document as a bag of words
- term vector: vector counting occurrences of phrases in a document
- model the user's interests the same way:
- a weighted term vector, weights depending on interest
- Behavior Modeling
- static models
- histograms of pages a user tends to visit
- dynamic models
- stochastic finite-state machines, Markov models
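A minimal sketch of the term-vector idea: documents become word-count vectors, the user's interests become a weighted vector over the same terms, and the two are compared. Cosine similarity, whitespace tokenization into single words (rather than phrases), and the example weights are assumptions chosen for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term vector: counts of lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical user profile: terms weighted by inferred interest.
user_interest = {"wine": 0.9, "french": 0.7, "cartoon": 0.05}

wine_page = term_vector("French wine regions and French wine tasting notes")
cartoon_page = term_vector("Classic cartoon rabbit episodes and cartoon history")

wine_score = cosine(user_interest, wine_page)
cartoon_score = cosine(user_interest, cartoon_page)
```

With these made-up weights the wine page scores higher than the cartoon page, which is the kind of ranking a weighted term vector is meant to support.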
18 Behavior Models at the Population Level
- Huberman et al. (Xerox PARC)
- Science 1997
- model the value of a page for any user as V
- $V_t = V_{t-1} + e$
- where e is random zero-mean Gaussian noise
- Expected stopping time seems to match observed session lengths for large populations of Web users
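The random-walk model of page value can be simulated to see what kind of session lengths it produces. This is a sketch under stated assumptions: the user is assumed to stop once the value drops below a threshold, and the starting value, threshold, noise scale `sigma`, and the cap `max_clicks` are illustrative choices that do not come from the slide.

```python
import random
import statistics

def session_length(rng, start_value=1.0, threshold=0.0, sigma=0.5, max_clicks=1000):
    """Number of clicks until the perceived page value drops below threshold."""
    value = start_value
    for clicks in range(1, max_clicks + 1):
        value += rng.gauss(0.0, sigma)   # V_t = V_{t-1} + e, with e ~ N(0, sigma^2)
        if value < threshold:
            return clicks
    return max_clicks  # session truncated at the cap

rng = random.Random(0)
lengths = [session_length(rng) for _ in range(2000)]
median_length = statistics.median(lengths)
share_short = sum(n <= 10 for n in lengths) / len(lengths)
```

The simulated lengths are heavily skewed: many sessions end after a few clicks while a few run very long, which is why the sketch summarizes them with a median and a share of short sessions rather than a mean.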
19 Behavior Models at the Individual Level
- Zukerman et al.
- User Modeling Conference 1999
- model each user's navigation patterns as a Markov model (a finite state machine)
- learn a user's model from observed data over time
- use the model to predict next page for the user
- some empirical success
- intended application is pre-fetching
- but this type of prediction is very hard indeed
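Next-page prediction with a first-order Markov model can be sketched as follows. This is a generic illustration, not the specific model of Zukerman et al.; the page names, session data, and function names are hypothetical.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Count page-to-page transitions over a user's observed sessions."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(counts, current):
    """Most frequent successor of `current` and its estimated probability."""
    successors = counts.get(current)
    if not successors:
        return None, 0.0
    page, n = successors.most_common(1)[0]
    return page, n / sum(successors.values())

# Hypothetical sessions for one user (page names are made up).
sessions = [
    ["home", "news", "sports"],
    ["home", "news", "weather"],
    ["home", "news", "sports"],
    ["home", "search"],
]
counts = fit_transitions(sessions)
page, prob = predict_next(counts, "news")
```

A pre-fetcher built on this would request `page` only when `prob` is high enough to justify the bandwidth; with sparse per-user data the estimates are noisy, which is one reason this kind of prediction is hard.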
20 Dice Factories and the Reverend Bayes
[Diagram: hierarchy of Population Parameters, Group Parameters, Individual Parameters, and Observed Data]
Bayesian framework => infer parameters given data
21 Application of Hierarchical Models
- Clustering of individuals given their page-sequences
- Clustering Markov models (Smyth, 1997)
- Generative model: mixture of Markov processes
- each group is characterized by a Markov state machine
- different groups have different navigation patterns
- can use the EM algorithm to learn the different Markov models given the data
- probabilistic model handles different sequence lengths naturally
- Applied to large commercial Web log (with Igor Cadez)
- produced novel insights into user behavior
- valuable as an exploration tool for massive Web logs
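The clustering procedure above can be sketched as EM for a mixture of first-order Markov chains. This is a minimal pure-Python illustration, not the implementation used on the commercial log: it assumes pages are coded as integers, every sequence is non-empty, and a small smoothing constant keeps probabilities away from zero. The function name and toy data are hypothetical.

```python
import math
import random

def fit_markov_mixture(seqs, n_states, n_clusters, n_iter=50, seed=0):
    """EM for a mixture of first-order Markov chains over integer-coded pages."""
    rng = random.Random(seed)

    def random_dist(n):
        draws = [rng.random() + 0.1 for _ in range(n)]
        total = sum(draws)
        return [d / total for d in draws]

    weights = [1.0 / n_clusters] * n_clusters
    init = [random_dist(n_states) for _ in range(n_clusters)]
    trans = [[random_dist(n_states) for _ in range(n_states)] for _ in range(n_clusters)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each sequence.
        resp = []
        for s in seqs:
            logp = []
            for k in range(n_clusters):
                lp = math.log(weights[k]) + math.log(init[k][s[0]])
                lp += sum(math.log(trans[k][a][b]) for a, b in zip(s, s[1:]))
                logp.append(lp)
            top = max(logp)
            unnorm = [math.exp(lp - top) for lp in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: responsibility-weighted counts with a small smoothing term.
        eps = 1e-3
        w_counts = [eps] * n_clusters
        init_c = [[eps] * n_states for _ in range(n_clusters)]
        trans_c = [[[eps] * n_states for _ in range(n_states)] for _ in range(n_clusters)]
        for s, r in zip(seqs, resp):
            for k in range(n_clusters):
                w_counts[k] += r[k]
                init_c[k][s[0]] += r[k]
                for a, b in zip(s, s[1:]):
                    trans_c[k][a][b] += r[k]
        weights = [c / sum(w_counts) for c in w_counts]
        init = [[c / sum(row) for c in row] for row in init_c]
        trans = [[[c / sum(row) for c in row] for row in mat] for mat in trans_c]
    return weights, init, trans, resp

# Toy sessions over three integer-coded pages (made-up data).
seqs = [[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1], [0, 1, 0, 1],
        [2, 2, 2, 2, 2], [2, 2, 2, 0, 2, 2], [2, 2, 2, 2]]
weights, init, trans, resp = fit_markov_mixture(seqs, n_states=3, n_clusters=2)
```

Each row of `resp` gives one sequence's cluster membership probabilities. Sequences of different lengths need no padding because each contributes the product of its own transition probabilities. EM can settle in a local optimum, so several random seeds are usually worth trying.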
22 Hierarchical Models for Online Prediction
- Model a user's information goal as a term-vector
- infer (from Web page content) that a user is interested in French wine
- can infer (from a population model) that user is unlikely to be interested in Bugs Bunny
- population model acts as a prior
23 Hierarchical Models for Online Prediction
- but wait
- user is really a parent ordering Christmas gifts
- online data drives us away from initial prior (once we see enough data)
- Bayesian framework provides a robust mechanism for these inferences
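How a population prior gives way to an individual's data can be sketched with a conjugate update. The sketch assumes interests are topic probabilities with a Dirichlet prior whose pseudo-counts come from the population model, and that each click on a topic counts as one observation. The topics and numbers are made up for illustration.

```python
def posterior_interest(prior_pseudocounts, observed_counts):
    """Posterior mean topic probabilities under a Dirichlet prior."""
    topics = set(prior_pseudocounts) | set(observed_counts)
    totals = {t: prior_pseudocounts.get(t, 0.0) + observed_counts.get(t, 0) for t in topics}
    norm = sum(totals.values())
    return {t: c / norm for t, c in totals.items()}

# Hypothetical population prior for visitors who arrive via French wine pages.
prior = {"french wine": 8.0, "bugs bunny": 1.0, "other": 1.0}

# After two cartoon clicks the prior still dominates.
few_clicks = posterior_interest(prior, {"bugs bunny": 2})

# After forty cartoon clicks the observed data outweighs the prior.
many_clicks = posterior_interest(prior, {"bugs bunny": 40})
```

The size of the pseudo-counts sets how much individual evidence is needed before the estimate moves away from the population's expectation.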
24 Research on Better Dynamic Models
- Markov models are not ideal
- constrain state-durations to be geometric
- semi-Markov models provide a useful generalization
- Introduce time (not just order)
- model bursts of activity as a Poisson process
- modulated by other variables, e.g., age/experience
- Couple to dynamic evolving information goals
- use input-driven Markov models
- inputs are coming from inferred information goals
- much more realistic
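The geometric-duration constraint can be made concrete with a small calculation. In a Markov chain whose state has self-transition probability a, the probability of staying exactly d steps is a^(d-1) (1 - a). The value 0.8 below is an arbitrary illustrative choice.

```python
def duration_pmf(self_prob, max_duration):
    """P(stay exactly d steps in a state) for d = 1..max_duration."""
    return [self_prob ** (d - 1) * (1.0 - self_prob) for d in range(1, max_duration + 1)]

self_prob = 0.8  # hypothetical self-transition probability
pmf = duration_pmf(self_prob, max_duration=200)

# Mean duration approaches 1 / (1 - self_prob) as max_duration grows.
mean_duration = sum(d * p for d, p in enumerate(pmf, start=1))

# The single most probable duration under a geometric law.
most_likely = max(range(1, len(pmf) + 1), key=lambda d: pmf[d - 1])
```

Whatever the value of `self_prob`, the most probable duration is a single step, which fits poorly with activities such as reading a page that have a typical non-trivial length. A semi-Markov model removes this restriction by giving each state its own duration distribution.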
25 Applications
- Understanding
- visualization of navigation patterns
- Discovery
- population patterns
- individual behavior
- Feedback
- better design, real-time tools for information feedback
- Prediction
- trend detection, population-level predictions, bootstrapping a new Web site, etc.
26 Summary
- Data Mining
- searching for structure in massive data sets
- Digital Logs
- records of individual behavior in digital environments
- Web Mining
- extracting models of user behavior from Web data
- relatively early, but expect to see more new ideas
- challenging research problems, useful applications