ICS 278: Data Mining Lecture 2: Measurement and Data

1
ICS 278: Data Mining
Lecture 2: Measurement and Data

2
Today's lecture
  • Questions on homework?
  • Office hours tomorrow, 9:30 to 11
  • Outline of today's lecture
  • From lecture 1: various tasks in data mining
  • Chapter 2: Measurement and Data
  • Types of measurement
  • Distance measures
  • Multidimensional scaling
  • Discussion of class projects

3
Slides from Lecture 1
4
Different Data Mining Tasks
  • Exploratory Data Analysis
  • Descriptive Modeling
  • Predictive Modeling
  • Discovering Patterns and Rules
  • others.

5
Exploratory Data Analysis
  • Getting an overall sense of the data set
  • Computing summary statistics
  • Number of distinct values, max, min, mean,
    median, variance, skewness,..
  • Visualization is widely used
  • 1d histograms
  • 2d scatter plots
  • Higher-dimensional methods
  • Useful for data checking
  • E.g., finding that a variable is always integer
    valued or positive
  • Finding that some variables are highly skewed
  • Simple exploratory analysis can be extremely
    valuable
  • You should always look at your data before
    applying any data mining algorithms
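
As a minimal sketch of the summary statistics listed on this slide, the following uses only the Python standard library. The `summarize` helper and the `ages` column are hypothetical, made up for illustration; skewness is computed here as the third standardized moment, which is one common definition among several.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)  # population variance
    std = var ** 0.5
    # Skewness as the third standardized moment (0 for symmetric data).
    if std > 0:
        skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    else:
        skew = 0.0
    return {
        "n_distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "variance": var,
        "skewness": skew,
        # Simple data checks of the kind mentioned on the slide.
        "all_integer": all(float(v).is_integer() for v in values),
        "all_positive": all(v > 0 for v in values),
    }

# Hypothetical column of values, made up for illustration.
ages = [21, 22, 22, 25, 29, 31, 33, 41, 50, 81]
summary = summarize(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A clearly positive skewness, as this made-up column gives, is the kind of signal that suggests plotting a histogram before applying any algorithm.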

6
Example of Exploratory Data Analysis (Pima Indians data, scatter plot matrix)
7
Different Data Mining Tasks
  • Exploratory Data Analysis
  • Descriptive Modeling
  • Predictive Modeling
  • Discovering Patterns and Rules
  • others.

8
Descriptive Modeling
  • Goal is to build a descriptive model
  • e.g., a model that could simulate the data if
    needed
  • models the underlying process
  • Examples
  • Density estimation
  • estimate the joint distribution P(x1, ..., xp)
  • Cluster analysis
  • Find natural groups in the data
  • Dependency models among the p variables
  • Learning a Bayesian network for the data

9
Example of Descriptive Modeling
Control Group
Anemia Group
10
Example of Descriptive Modeling
Control Group
Anemia Group
11
Learning User Navigation Patterns from Web Logs
128.195.36.195, -, 3/22/00, 103511, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 103516, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 103517, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 161850, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 161858, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 161859, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205437, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 205455, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 205455, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205507, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205536, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 205536, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205539, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205603, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 205604, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205633, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 205652, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
[Figure: numeric sequences for User 1 through User 5]
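
As a rough sketch of the first step toward learning navigation patterns, the raw log records on this slide can be grouped into one request sequence per client. The field positions used below (client IP first, HTTP method and URL in the 13th and 14th fields) are assumptions read off the records themselves, and `sessions_by_client` is a hypothetical helper name.

```python
from collections import defaultdict

# Four records copied from the log on this slide. Field meanings are
# assumed from position: client IP first, method and URL near the end.
LOG_LINES = [
    "128.195.36.195, -, 3/22/00, 103511, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
    "128.195.36.195, -, 3/22/00, 103516, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,",
    "128.195.36.195, -, 3/22/00, 103517, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,",
    "128.195.36.101, -, 3/22/00, 161850, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,",
]

def sessions_by_client(lines):
    """Group (method, url) requests by client IP, preserving log order."""
    sequences = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        client_ip, method, url = fields[0], fields[12], fields[13]
        sequences[client_ip].append((method, url))
    return dict(sequences)

sequences = sessions_by_client(LOG_LINES)
for client, requests in sequences.items():
    print(client, [url for _, url in requests])
```

Mapping each URL to a small set of page categories would then give one short symbolic sequence per user, which is the kind of input a sequence-clustering model works with.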


12
Clusters of Probabilistic State Machines
Cadez, Heckerman, et al., 2003
Motivation: capture heterogeneity of Web surfing behavior
[Figure: three state-machine diagrams labeled Cluster 1, Cluster 2, and Cluster 3, each over states A, B, C, E]
13
WebCanvas algorithm and software - currently
in new SQLServer
14
Another Example of Descriptive Modeling
  • Learning Directed Graphical Models (aka Bayes
    Nets)
  • goal: learn directed relationships among p
    variables
  • techniques: directed (causal) graphs
  • challenge: distinguishing between correlation and
    causation
  • example: Do yellow fingers cause lung cancer?

hidden cause: smoking
15
Different Data Mining Tasks
  • Exploratory Data Analysis
  • Descriptive Modeling
  • Predictive Modeling
  • Discovering Patterns and Rules
  • others.

16
Predictive Modeling
  • Predict one variable Y given a set of other
    variables X
  • Here X could be a p-dimensional vector
  • Classification: Y is categorical
  • Regression: Y is real-valued
  • In effect this is function approximation,
    learning the relationship between Y and X
  • Many, many algorithms for predictive modeling in
    statistics and machine learning
  • Often the emphasis is on predictive accuracy,
    less emphasis on understanding the model

17
Predictive Modeling: Fraud Detection
  • Credit card fraud detection
  • Credit card losses in the US are over $1 billion
    per year
  • Roughly 1 in 50k transactions are fraudulent
  • Approach
  • For each transaction estimate p(fraudulent |
    transaction)
  • Model is built on historical data of known
    fraud/non-fraud
  • High probability transactions investigated by
    fraud police
  • Example
  • Fair-Isaac/HNC's fraud detection software, based
    on neural networks, led to reported fraud
    decreases of 30 to 50%
  • http://www.fairisaac.com/fairisaac
  • Issues
  • Significant feature engineering/preprocessing
  • False alarm rate vs. missed detection: what is
    the tradeoff?
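
The false alarm versus missed detection tradeoff can be illustrated with a short sketch that sweeps the decision threshold on estimated fraud probabilities. The scores and labels below are made up for illustration, and `alarm_rates` is a hypothetical helper, not part of any fraud detection product.

```python
def alarm_rates(scores, labels, threshold):
    """Return (false_alarm_rate, missed_detection_rate) at a threshold.

    scores: estimated p(fraud) for each transaction
    labels: True if the transaction was actually fraudulent
    """
    flagged = [s >= threshold for s in scores]
    n_fraud = sum(labels)
    n_legit = len(labels) - n_fraud
    # Legitimate transactions that were flagged for investigation.
    false_alarms = sum(f and not y for f, y in zip(flagged, labels))
    # Fraudulent transactions that were not flagged.
    misses = sum(y and not f for f, y in zip(flagged, labels))
    return false_alarms / n_legit, misses / n_fraud

# Made-up scores and ground truth for eight transactions.
scores = [0.02, 0.10, 0.35, 0.40, 0.65, 0.80, 0.05, 0.90]
labels = [False, False, False, True, False, True, False, True]

for t in (0.3, 0.5, 0.7):
    fa, md = alarm_rates(scores, labels, t)
    print(f"threshold={t}: false alarm rate={fa:.2f}, missed detection rate={md:.2f}")
```

Raising the threshold reduces false alarms but lets more fraud through; where to set it depends on the relative cost of an investigation and of an undetected fraud.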

18
Predictive Modeling: Customer Scoring
  • Example: a bank has a database of 1 million past
    customers, 10% of whom took out mortgages
  • Use machine learning to rank new customers as a
    function of p(mortgage | customer data)
  • Customer data
  • History of transactions with the bank
  • Other credit data (obtained from Experian, etc)
  • Demographic data on the customer or where they
    live
  • Techniques
  • Binary classification: logistic regression,
    decision trees, etc.
  • Many, many applications of this nature

19
Predictive Modeling: Telephone Call Modeling
  • Background
  • AT&T has about 100 million customers
  • It logs 200 million calls per day, 40 attributes
    each
  • 250 million unique telephone numbers
  • Which are business and which are residential?
  • Approach (Pregibon and Cortes, AT&T, 1997)
  • Proprietary model, using a few attributes,
    trained on known business customers to adaptively
    track p(business | data)
  • Significant systems engineering: data are
    downloaded nightly, model updated (20 processors,
    6 GB RAM, terabyte disk farm)
  • Status
  • running daily at AT&T
  • HTML interface used by AT&T marketing

20
From C. Cortes and D. Pregibon, Giga-mining, in
Proceedings of the ACM SIGKDD Conference, 1997
21
Different Data Mining Tasks
  • Exploratory Data Analysis
  • Descriptive Modeling
  • Predictive Modeling
  • Discovering Patterns and Rules
  • others.

22
Structure: Models and Patterns
  • Model: abstract representation of a process
  • e.g., very simple linear model structure
  • Y = aX + b
  • a and b are parameters determined from the data
  • Y = aX + b is the model structure
  • Y = 0.9X + 0.3 is a particular model
  • "All models are wrong, some are useful" (G.E.
    Box)
  • Pattern: represents local structure in a data
    set
  • E.g., if X > x then Y > y with probability p
  • or a pattern might be a small cluster of
    outliers in multi-dimensional space

23
Pattern Discovery
  • Goal is to discover interesting "local" patterns
    in the data rather than to characterize the data
    globally
  • Given market basket data, we might discover that
  • If customers buy wine and bread then they buy
    cheese with probability 0.9
  • These are known as association rules
  • Given multivariate data on astronomical objects
  • We might find a small group of previously
    undiscovered objects that are very self-similar
    in our feature space, but are very far away in
    feature space from all other objects
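
The probability attached to an association rule can be estimated directly from basket counts. The sketch below computes it for the wine-and-bread rule on a handful of made-up baskets; `rule_confidence` is a hypothetical helper name.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent) as a fraction of baskets."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets that contain every item on the "if" side of the rule.
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return None  # the rule never applies in this data
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)

# Made-up market basket data for illustration.
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese", "olives"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "cheese"},
    {"wine", "bread", "cheese"},
]

conf = rule_confidence(baskets, {"wine", "bread"}, {"cheese"})
print(f"P(cheese | wine, bread) = {conf}")
```

Four of these baskets contain both wine and bread, and three of those also contain cheese, so the estimate is 0.75.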

24
Example of Pattern Discovery
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDAB
ABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDAD
BDBBCBBCCBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDD
BDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCD
DDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCA
BACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDB
AADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCC
BBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDAB
AABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBB
BBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBA
AADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
25
Example of Pattern Discovery
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDAB
ABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDAD
BDBBCBBCCBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDD
BDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCD
DDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCA
BACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDB
AADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCC
BBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDAB
AABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBB
BBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBA
AADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
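
One simple way to start looking for repeated local structure in a symbol sequence like this one is to count how often each short substring occurs. This is only a sketch of that idea, not the method used in the lecture; `kgram_counts` is a hypothetical helper, and `prefix` is the first line of the sequence shown on the slide.

```python
from collections import Counter

def kgram_counts(sequence, k):
    """Count every length-k substring (k-gram) in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# First line of the sequence shown on the slide.
prefix = "ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDAB"

counts = kgram_counts(prefix, 4)
for gram, count in counts.most_common(3):
    print(gram, count)
```

Substrings that occur far more often than chance would predict are candidates for a closer look.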
26
Example of Pattern Discovery
  • IBM Advanced Scout System
  • Bhandari et al. (1997)
  • Every NBA basketball game is annotated,
  • e.g., time = 6 mins, 32 seconds; event = 3-point
    basket; player = Michael Jordan
  • This creates a huge untapped database of
    information
  • IBM algorithms search for rules of the form
    "If player A is in the game, player B's scoring
    rate increases from 3.2 points per quarter to 8.7
    points per quarter"
  • IBM claimed around 1998 that all NBA teams except
    1 were using this software; the other team was
    Chicago.

28
Components of Data Mining Algorithms
  • Representation
  • Determining the nature and structure of the
    representation to be used
  • Score function
  • quantifying and comparing how well different
    representations fit the data
  • Search/Optimization method
  • Choosing an algorithmic process to optimize the
    score function and
  • Data Management
  • Deciding what principles of data management are
    required to implement the algorithms efficiently.

29
What's in a Data Mining Algorithm?
Task
Representation
Score Function
Search/Optimization
Data Management
Models, Parameters
30
An Example: Linear Regression
  • Task: Regression
  • Representation: Y = weighted linear sum of X's
  • Score Function: Least-squares
  • Search/Optimization: Gaussian elimination
  • Data Management: None specified
  • Models, Parameters: Regression coefficients
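
The components on this slide can be sketched for a single input variable: the representation is Y = aX + b, the score is the sum of squared errors, and minimizing it leads to a small linear system (the normal equations) that Gaussian elimination solves. The function names below are hypothetical, the data are generated from the particular model Y = 0.9X + 0.3 for illustration, and the solver does not guard against singular systems.

```python
def gaussian_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b via the normal equations."""
    n = len(xs)
    sxx = sum(x * x for x in xs)
    sx = sum(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)
    a, b = gaussian_solve([[sxx, sx], [sx, n]], [sxy, sy])
    return a, b

# Noise-free data generated from Y = 0.9X + 0.3, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.9 * x + 0.3 for x in xs]
a, b = fit_line(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With several input variables the same approach applies; the linear system simply grows to one equation per coefficient.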
31
An Example: Decision Trees (C4.5 or CART)
  • Task: Classification
  • Representation: Hierarchy of axis-parallel linear class
    boundaries
  • Score Function: Cross-validated accuracy
  • Search/Optimization: Greedy Search
  • Data Management: None specified
  • Models, Parameters: Decision tree classifier
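
A single greedy step of tree building can be sketched as a search over all axis-parallel splits for the one that classifies the training data best. This is a simplification: it scores splits by plain training accuracy, whereas C4.5 and CART use impurity-based criteria and the slide lists cross-validated accuracy as the overall score. The `best_split` helper and the data are hypothetical.

```python
def best_split(rows, labels):
    """Find the single axis-parallel split with the highest training accuracy.

    Returns (feature_index, threshold, accuracy).
    """
    best = (None, None, 0.0)
    n_features = len(rows[0])
    for j in range(n_features):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            # Each side predicts its majority class.
            correct = (max(left.count(c) for c in set(left))
                       + max(right.count(c) for c in set(right)))
            acc = correct / len(labels)
            if acc > best[2]:
                best = (j, t, acc)
    return best

# Made-up two-feature data set for illustration.
rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (6.0, 2.0), (7.0, 5.5), (8.0, 1.5)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

feature, threshold, accuracy = best_split(rows, labels)
print(f"split on feature {feature} at {threshold}: accuracy {accuracy:.2f}")
```

A full tree learner applies this step recursively to each side of the chosen split, never revisiting earlier choices, which is what makes the search greedy.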
32
An Example: Hierarchical Clustering
  • Task: Clustering
  • Representation: Tree of clusters
  • Score Function: Various
  • Search/Optimization: Greedy search
  • Data Management: None specified
  • Models, Parameters: Dendrogram
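
The greedy search on this slide can be sketched as agglomerative clustering: start with every point in its own cluster and repeatedly merge the two closest clusters. The sketch assumes one-dimensional points and single-link distance (the distance between the closest pair of members), which is just one of the "various" possible score functions; `single_link_merges` is a hypothetical helper name.

```python
def single_link_merges(points):
    """Greedy agglomerative clustering of 1-D points with single linkage.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    which is the information a dendrogram displays.
    """
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under single-link distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Made-up points for illustration.
merges = single_link_merges([1.0, 2.0, 6.0, 7.5, 20.0])
for a, b, d in merges:
    print(f"merge {a} and {b} at distance {d}")
```

The merge distances grow from step to step, and a large jump (here, the final merge that absorbs the isolated point) suggests a natural place to cut the tree.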
33
An Example: Association Rules
  • Task: Pattern Discovery
  • Representation: Rules: if A and B then C with prob p
  • Score Function: No explicit score
  • Search/Optimization: Systematic search
  • Data Management: Multiple linear scans
  • Models, Parameters: Set of Rules
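
The "systematic search" and "multiple linear scans" entries can be sketched as a level-wise search for frequent itemsets, in the style of the Apriori algorithm: one pass over the data counts all candidate sets of size k, and only sets whose subsets were all frequent become candidates of size k + 1. The `frequent_itemsets` helper, the baskets, and the `min_count` threshold are hypothetical choices for illustration.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_count):
    """Level-wise search: one linear scan of the data per itemset size."""
    items = sorted(set().union(*baskets))
    frequent = {}
    candidates = [frozenset([i]) for i in items]
    k = 1
    while candidates:
        # One linear scan over all baskets counts every size-k candidate.
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Size-(k+1) candidates: every size-k subset must be frequent.
        k += 1
        survivors = sorted(set().union(*level)) if level else []
        candidates = [
            frozenset(combo)
            for combo in combinations(survivors, k)
            if all(frozenset(sub) in level for sub in combinations(combo, k - 1))
        ]
    return frequent

# Made-up baskets over items A, B, C.
baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"}, {"B", "C"}, {"A", "C"}]
frequent = frequent_itemsets(baskets, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Rules of the form "if A and B then C with prob p" are then read off by dividing the count of a frequent itemset by the count of the subset on its "if" side.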
34
Next Lecture
  • Chapter 3
  • Exploratory data analysis and visualization