Data mining An overview of techniques and applications presentation

1
Data mining: An overview of techniques and applications

Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ernet.in/sunita

2
Data mining

Process of semi-automatically analyzing large databases to find patterns that are:
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: should be possible to act on the item
understandable: humans should be able to interpret the pattern
Other names: Knowledge discovery in databases, Data analysis

3
Relationship with other fields

Overlaps with machine learning, statistics, artificial intelligence, databases and visualization, but with more stress on:
scalability in the number of features and instances
algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
automation for handling large, heterogeneous data

4
Outline

Mining operations
Classification
Clustering
Association rule mining
Sequence mining
Two applications
Intrusion detection
Information extraction

5
Classification
X1    X2    ...   Xn    Y
2     6.7   ...   BB
5     3.4   ...   CA
...
10    0.9   ...   CX    -

Given: a table D of rows with columns X1..Xn, Y
Each Xi can be numeric or string
Special attribute Y: the class label
Training:
Learn a classifier C that can predict the label Y in terms of X1, X2, .., Xn
C must hold for:
examples in D
unseen data
Application:
Use C to predict Y for new Xs

Example: new row 10 0.9 ... CX with Y = ?

6
Automatic loan approval

Given old data about customers and payments, predict new applicants' loan eligibility.

[Figure: previous customers' data (Age, Salary, Profession, Location, Customer type: good/bad) is used to train a classifier, which yields decision rules (e.g. Salary > 5 L, Prof. = Exec) that are applied to new applicants' data (Age, Salary, Profession, Location)]

7
Drug design: molecular bioactivity

Predict activity of compounds binding to thrombin
Library of compounds included:
1909 known molecules (42 actively binding
thrombin)
139,351 binary features describe the 3-D
structure of each compound
636 new compounds with unknown capacity to bind
thrombin

8
Automatic webpage classification

Several large categorized search engines:
Yahoo, Dmoz used in Google/Altavista
Web: 2 billion pages, and only a subset is in the directories
Existing taxonomies are manually created
Need to automatically classify new pages

9
Several classification methods

Choose based on:
data type (numeric, categorical)
number of attributes
number of classes
number of training examples
need for interpretation

Regression
Decision tree classifier
Rule-learners
Neural networks
Generative models
Nearest neighbor
Support vector machines

10
Nearest neighbor

Define similarity between instances
Find neighbors of new instance in training data
k-NN approach: assign the majority class among the k nearest neighbours
Weighted regression: learn a new regression equation by weighting each training instance based on its distance from the new instance

Cons
Slow during application.
No feature selection.
Notion of proximity vague

Pros
Fast training
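
The k-NN approach on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: it assumes numeric features and Euclidean distance as the similarity measure, and the names `euclidean` and `knn_predict` are made up for the example. Note that "training" is just storing the data, while every prediction scans all training instances, which matches the pros and cons listed above.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3, dist=euclidean):
    """Return the majority label among the k training rows closest to query.

    train is a list of (features, label) pairs; dist is an assumed,
    pluggable distance function.
    """
    neighbours = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train = [
    ((1.0, 1.0), "good"),
    ((1.2, 0.8), "good"),
    ((0.9, 1.1), "good"),
    ((5.0, 5.0), "bad"),
    ((5.2, 4.9), "bad"),
]
print(knn_predict(train, (1.1, 1.0), k=3))
```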

11
Decision tree classifiers

Widely used learning method
Easy to interpret: can be re-represented as if-then-else rules
Approximates the function by piecewise constant regions
Does not require any prior knowledge of data
distribution, works well on noisy data.

12
Decision trees

Tree where internal nodes are simple decision
rules on one or more attributes and leaf nodes
are predicted class labels.

[Figure: example tree with node tests Salary < 20K, Profession = teacher, Age < 30]
13
Algorithm for tree building

Greedy top-down construction.

Gen_Tree(node, data):
Make node a leaf? If yes, stop.
Otherwise, find the best attribute and the best split on that attribute (selection criteria)
Partition data on the split condition
For each child j of node: Gen_Tree(node_j, data_j)
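
A compact sketch of the greedy Gen_Tree recursion is given below. The slide does not say which selection criterion is used, so this example assumes Gini impurity, numeric attributes with binary threshold splits, and a depth limit as the stopping rule; all function names are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed selection criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """Return (attribute index, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini([y for _, y in rows])
    for a in range(len(rows[0][0])):
        for t in sorted({x[a] for x, _ in rows}):
            left = [y for x, y in rows if x[a] < t]
            right = [y for x, y in rows if x[a] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (a, t), score
    return best


def gen_tree(rows, depth=0, max_depth=5):
    """Greedy top-down construction; a leaf is simply the majority label."""
    split = None if depth >= max_depth else best_split(rows)
    if split is None:  # make node a leaf and stop
        return Counter(y for _, y in rows).most_common(1)[0][0]
    a, t = split
    left = [r for r in rows if r[0][a] < t]
    right = [r for r in rows if r[0][a] >= t]
    return (a, t, gen_tree(left, depth + 1, max_depth),
            gen_tree(right, depth + 1, max_depth))


def predict(tree, x):
    """Walk from the root to a leaf following the split conditions."""
    while isinstance(tree, tuple):
        a, t, left, right = tree
        tree = left if x[a] < t else right
    return tree
```

Internal nodes are stored as `(attribute, threshold, left, right)` tuples and leaves as plain labels, which keeps the example short; a fuller implementation would also handle categorical attributes and pruning.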
14
Support vector machines

Binary classifier: find the hyperplane providing the maximum margin between vectors of the two classes

[Figure: plot with feature axes fi and fj]
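
For reference, one standard way to write the maximum-margin problem for linearly separable data is shown below. The symbols are notation assumed here rather than taken from the slides: $w$ is the normal vector of the hyperplane, $b$ its offset, $x_i$ a training vector and $y_i \in \{-1, +1\}$ its class.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 \quad \text{for every training pair } (x_i, y_i)
```

Under these constraints the distance between the two supporting hyperplanes is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.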
15
Support Vector Machines

Extendable to:
Non-separable problems (Cortes & Vapnik, 1995)
Non-linear classifiers (Boser et al., 1992)
Good generalization performance:
OCR (Boser et al.)
Vision (Poggio et al.)
Text classification (Joachims)
Requires tuning: which kernel, what parameters?
Several freely available packages: SVMTorch

16
Neural networks

Useful for learning complex data like
handwriting, speech and image recognition

[Figure: decision boundaries of a neural network, a classification tree and linear regression]
17
Bayesian learning

Assume a probability model on the generation of data.
Apply Bayes' theorem to find the most likely class.
Naïve Bayes: assume attributes are conditionally independent given the class value
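
Using the attribute names X1..Xn and class label Y from the classification slide, the most likely class under Bayes' theorem can be written as below. This is the textbook form, given as a sketch; the symbol $\hat{y}$ for the predicted class is introduced here for illustration.

```latex
\hat{y}
= \arg\max_{c} P(Y = c \mid X_1, \dots, X_n)
= \arg\max_{c} \frac{P(X_1, \dots, X_n \mid Y = c)\, P(Y = c)}{P(X_1, \dots, X_n)}
= \arg\max_{c} P(Y = c) \prod_{i=1}^{n} P(X_i \mid Y = c)
```

The last step drops the denominator, which does not depend on $c$, and applies the naïve Bayes assumption that the attributes are conditionally independent given the class.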

18
Meta learning methods

No single classifier is good under all cases
Difficult to evaluate the conditions in advance
Meta learning: combine the effects of the classifiers
Voting: sum up votes of component classifiers
Combiners: learn a new classifier on the outcomes of previous ones
Boosting: staged classifiers
Disadvantage: interpretation is hard
Knowledge probing: learn a single classifier to mimic the meta classifier
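
Of the combination schemes listed, voting is the simplest to illustrate. The sketch below assumes unweighted majority voting, and the three component classifiers are hypothetical hand-written rules invented for the example, not learned models.

```python
from collections import Counter


def vote(classifiers, x):
    """Return the label predicted by the largest number of component classifiers."""
    ballots = Counter(clf(x) for clf in classifiers)
    return ballots.most_common(1)[0][0]


# Hypothetical component classifiers on a loan applicant record.
def by_salary(applicant):
    return "good" if applicant["salary"] > 500000 else "bad"


def by_age(applicant):
    return "good" if applicant["age"] >= 30 else "bad"


def by_profession(applicant):
    return "good" if applicant["profession"] == "exec" else "bad"


applicant = {"salary": 600000, "age": 25, "profession": "exec"}
print(vote([by_salary, by_age, by_profession], applicant))
```

An odd number of component classifiers avoids ties in the two-class case; with ties, `Counter.most_common` falls back to the order in which labels were first seen.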

19
Outline

Mining operations
Classification
Clustering
Association rule mining
Sequence mining

20
What is Cluster Analysis?

Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis:
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications:
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms

21
Applications

Customer segmentation, e.g. for targeted marketing
Group/cluster existing customers based on the time series of their payment history, so that similar customers fall in the same cluster.
Identify micro-markets and develop policies for each
Image processing
Text clustering, e.g. scatter/gather
Compression

22
Distance functions

Numeric data: Euclidean, Manhattan distances
Minkowski metric: $\left(\sum_i |x_i - y_i|^m\right)^{1/m}$
Larger m gives higher weight to larger distances
Categorical data: 0/1 to indicate presence/absence
Euclidean distance: equal weightage to 1-1 and 0-0 matches
Hamming distance: number of dissimilar positions
Jaccard coefficient: number of positions where both are 1 / number of positions where either is 1 (0-0 matches not important)
Data-dependent measures: similarity of A and B depends on co-occurrence with C
Combined numeric and categorical data: weighted normalized distance
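
The measures on this slide are short enough to write out directly. The sketch below uses the standard definitions; the function names are illustrative, and returning 0.0 for the Jaccard coefficient of two all-zero vectors is a convention assumed here.

```python
def minkowski(x, y, m=2):
    """Minkowski metric; m=1 gives Manhattan distance, m=2 gives Euclidean."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


def hamming(x, y):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def jaccard(x, y):
    """Jaccard similarity of two 0/1 vectors; 0-0 matches are ignored."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0  # assumed convention when no 1s occur


print(minkowski((0, 0), (3, 4), m=1), minkowski((0, 0), (3, 4), m=2))
print(hamming((1, 0, 1, 0), (1, 1, 0, 0)), jaccard((1, 0, 1, 0), (1, 1, 0, 0)))
```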

23
Clustering methods

Hierarchical clustering
agglomerative vs. divisive
single link vs. complete link
Partitional clustering
distance-based: K-means
model-based: EM
density-based
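
As an illustration of the distance-based partitional method, here is a minimal K-means sketch. It assumes Euclidean distance and, to stay deterministic, takes the first k points as the initial centres; practical implementations usually use random or smarter initialization. It needs Python 3.8+ for `math.dist`.

```python
import math


def kmeans(points, k, iters=100):
    """Plain K-means on a list of numeric tuples.

    Returns (centroids, assignment), where assignment[i] is the cluster
    index of points[i].
    """
    centroids = [points[i] for i in range(k)]  # assumed initialization
    assignment = None
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```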

24
Outline

Mining operations
Classification
Clustering
Association rule mining
Sequence mining
Two applications
Intrusion detection
Information extraction

25
Intrusion via privileged programs

Attacks exploit a loophole in the program to do illegal actions
Example: exploit buffer overflows to run user code
What should be monitored in an executing privileged program to detect attacks?

open lseek lstat mmap execve ioctl ioctl close execve close unlink

Sequence of system calls
S: set of all possible system calls (around 100)
Mining problem: given traces of previous normal executions, monitor a new execution and flag it as attack or normal
Challenge: is it possible to do this given widely varying normal conditions?

26
Detecting attacks on privileged programs

Short sequences of system calls made during normal executions of a program are very consistent, yet different from the sequences of its abnormal executions
Each execution: a trace of system calls
(ignore online traces for the moment)
Two approaches:
STIDE
Create a dictionary of the unique k-windows in normal traces, count what fraction occur in new traces, and threshold.
IDS
next...
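
The STIDE idea described above can be sketched in a few lines. This is one reading of the slide, not a reference implementation: it scores a new trace by the fraction of its k-windows that are absent from the normal dictionary, and the threshold value is an arbitrary assumption. All function names are illustrative.

```python
def windows(trace, k):
    """All contiguous k-windows of a system-call trace, as tuples."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def build_dictionary(normal_traces, k):
    """Set of unique k-windows seen in any normal trace."""
    return {w for trace in normal_traces for w in windows(trace, k)}


def mismatch_fraction(trace, dictionary, k):
    """Fraction of the trace's k-windows that never occurred in normal traces."""
    ws = windows(trace, k)
    if not ws:
        return 0.0
    return sum(w not in dictionary for w in ws) / len(ws)


def is_attack(trace, dictionary, k, threshold=0.2):
    """Flag the trace when too many windows are unseen (threshold is assumed)."""
    return mismatch_fraction(trace, dictionary, k) > threshold


normal = [
    ["open", "lseek", "lstat", "mmap", "close"],
    ["open", "lstat", "mmap", "close"],
]
dictionary = build_dictionary(normal, k=3)
suspect = ["open", "execve", "execve", "unlink", "close"]
print(mismatch_fraction(suspect, dictionary, k=3), is_attack(suspect, dictionary, k=3))
```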

27
Classification models on k-grams

When both normal and abnormal data are available:
class label = normal/abnormal
When only normal traces are available:
class label = k-th system call

Learn rules to predict the class label: RIPPER
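
The two ways of turning traces into labelled training rows can be sketched as below. RIPPER itself is not shown; this only prepares the (features, label) pairs that a rule learner would consume, and both function names are hypothetical.

```python
def rows_with_trace_label(trace, k, label):
    """Both kinds of data available: each k-gram is a row labelled normal/abnormal."""
    return [(tuple(trace[i:i + k]), label) for i in range(len(trace) - k + 1)]


def rows_predict_kth_call(trace, k):
    """Only normal traces: the first k-1 calls are features, the k-th is the label."""
    return [
        (tuple(trace[i:i + k - 1]), trace[i + k - 1])
        for i in range(len(trace) - k + 1)
    ]


trace = ["open", "lseek", "lstat", "mmap"]
print(rows_with_trace_label(trace, 3, "normal"))
print(rows_predict_kth_call(trace, 3))
```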
28
Examples of output RIPPER rules

Both-traces
if the 2nd system call is vtimes and the 7th is
vtrace, then the sequence is normal
if the 6th system call is lseek and the 7th is
sigvec, then the sequence is normal
if none of the above, then the sequence is
abnormal
Only-normal
if the 3rd system call is lstat and the 4th is
write, then the 7th is stat
if the 1st system call is sigblock and the 4th is
bind, then the 7th is setsockopt
if none of the above, then the 7th is open

29
Experimental results on sendmail

The output rule sets contain 250 rules, each
with 2 or 3 attribute tests
Score each trace by counting the fraction of mismatches and thresholding
Summary: normal traces alone are sufficient to detect intrusions
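
A sketch of this scoring step, using the two "only-normal" example rules and the default rule from the previous slide. The rule encoding (1-based positions mapped to call names) and the function names are assumptions made for the example, and no threshold value is given on the slides, so only the mismatch fraction is computed.

```python
# Ordered rule list: ({position: required call}, predicted 7th call).
RULES = [
    ({3: "lstat", 4: "write"}, "stat"),
    ({1: "sigblock", 4: "bind"}, "setsockopt"),
]
DEFAULT = "open"  # "if none of the above, then the 7th is open"


def predict_7th(first_six):
    """Apply the first matching rule to the first six calls of a 7-gram."""
    for conditions, prediction in RULES:
        if all(first_six[pos - 1] == call for pos, call in conditions.items()):
            return prediction
    return DEFAULT


def mismatch_score(trace, k=7):
    """Fraction of 7-grams whose actual 7th call differs from the predicted one."""
    grams = [trace[i:i + k] for i in range(len(trace) - k + 1)]
    if not grams:
        return 0.0
    misses = sum(predict_7th(g[:k - 1]) != g[k - 1] for g in grams)
    return misses / len(grams)


print(mismatch_score(["close", "close", "lstat", "write", "mmap", "ioctl", "stat"]))
```

A trace would then be flagged as an intrusion when its score exceeds a chosen threshold.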

30
Information extraction

Automatically extract structured fields from
unstructured documents
by learning from examples
Technology
Graph models (Hidden Markov Models)
Probabilistic parsers
Applications
Comparison shopping agents
Bibliography databases (citeseer)
Address elementization (IIT Bombay)

31
Problem definition

Source: concatenation of structured elements with limited reordering and some missing fields
Example: addresses, bibliographic records

Address elements: House number, Building, Road, Area, City, Zip
156 Hillside ctype Scenic drive Powai Mumbai 400076

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.

32
Learning to segment

Given:
a list of structured elements
several examples showing the position of structured elements in text
Train a model to identify them in unseen text

At the top level, a classification problem

Issues
What are the input features?
Build per-element classifiers or a single joint
classifier?
Which type of classifier to use?
How much training data is required?

33
Input features

Content of the element
Specific keywords like street, zip, vol, pp
Properties of words like capitalization, part of speech, whether it is a number
Inter-element sequencing
Intra-element sequencing
Element length

34
IE with Hidden Markov Models

Probabilistic models for IE

[Figure: HMM with states Title, Author, Journal, Year]
35
HMM Structure

Naïve model: one state per element

Mahatma Gandhi Road Near Parkland ...
Mahatma Gandhi Road Near Landmark Parkland
...
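
To make the naïve one-state-per-element model concrete, the sketch below decodes the address fragment with the Viterbi algorithm. Everything numeric here is invented for illustration: the two states, the start, transition and emission probabilities, the `unseen` floor for unknown words, and the choice to treat "Near" as part of the Landmark element are all assumptions, not values from the presentation.

```python
def viterbi(tokens, states, start_p, trans_p, emit_p, unseen=1e-3):
    """Most likely state sequence for tokens under a simple HMM."""
    if not tokens:
        return []
    # best[t][s] = (probability of the best path ending in s at token t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(tokens[0], unseen), None) for s in states}]
    for tok in tokens[1:]:
        column = {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r][0] * trans_p[r][s])
            prob = best[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(tok, unseen)
            column[s] = (prob, prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy parameters, invented for this example.
states = ["Road", "Landmark"]
start_p = {"Road": 0.8, "Landmark": 0.2}
trans_p = {
    "Road": {"Road": 0.7, "Landmark": 0.3},
    "Landmark": {"Road": 0.1, "Landmark": 0.9},
}
emit_p = {
    "Road": {"mahatma": 0.3, "gandhi": 0.3, "road": 0.3},
    "Landmark": {"near": 0.5, "parkland": 0.4},
}
tokens = "Mahatma Gandhi Road Near Parkland".lower().split()
print(list(zip(tokens, viterbi(tokens, states, start_p, trans_p, emit_p))))
```

In a trained model these probabilities would be estimated from the labelled example segmentations rather than set by hand.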
36
Results: Comparative Evaluation

Dataset                  Instances  Elements
IITB student addresses   2388       17
Company addresses        769        6
US addresses             740        6

The Nested model does best in all three cases

37
Mining market

Around 20 to 30 mining tool vendors
Major tool players:
SAS's Enterprise Miner
IBM's Intelligent Miner
SGI's MineSet
All pretty much the same set of tools
Many embedded products:
fraud detection
electronic commerce applications
health care
customer relationship management: Epiphany
