Data mining An overview of techniques and applications - PowerPoint PPT Presentation

About This Presentation
Title:

Data mining An overview of techniques and applications

Description:

Given old data about customers and payments, predict new applicant's loan eligibility. ... Word. 0.3. 0.1. 0.5. dddd. dd. 0.8. 0.2. HMM Structure ... Mahatma ... – PowerPoint PPT presentation

Number of Views:93
Avg rating:3.0/5.0
Slides: 38
Provided by: Sun43
Category:

less

Transcript and Presenter's Notes

Title: Data mining An overview of techniques and applications


1
Data miningAn overview of techniques and
applications
  • Sunita Sarawagi
  • IIT Bombay
  • http//www.it.iitb.ernet.in/sunita

2
Data mining
  • Process of semi-automatically analyzing large
    databases to find patterns that are
  • valid hold on new data with some certainity
  • novel non-obvious to the system
  • useful should be possible to act on the item
  • understandable humans should be able to
    interpret the pattern
  • Other names Knowledge discovery in databases,
    Data analysis

3
Relationship with other fields
  • Overlaps with machine learning, statistics,
    artificial intelligence, databases, visualization
    but more stress on
  • scalability of number of features and instances
  • stress on algorithms and architectures whereas
    foundations of methods and formulations provided
    by statistics and machine learning.
  • automation for handling large, heterogeneous data

4
Outline
  • Mining operations
  • Classification
  • Clustering
  • Association rule mining
  • Sequence mining
  • Two applications
  • Intrusion detection
  • Information extraction

5
Classification
X1 X2 ... Xn Y
2 6.7 BB
5 3.4 .. CA
..
..

.. ..

..
10 0.9 CX -
  • Given, a table D of rows with columns X1..Xn,Y
  • Xi could numeric or string
  • Special attribute Y, the class-label
  • Training
  • Learn a classifier C that can predict the label Y
    in terms of X1,X2..Xn
  • C must hold for
  • examples in D
  • unseen data
  • Application
  • Use C to predict Y for new X-s

10 0.9 CX ?
6
Automatic loan approval
  • Given old data about customers and payments,
    predict new applicants loan eligibility.

Previous customers
Classifier
Decision rules
Salary gt 5 L
Age Salary Profession Location Customer type
Good/ bad
Prof. Exec
Age Salary Profession Location
New applicants data
7
Drug design molecular Bioactivity
  • Predict activity of compounds binding to thrombin
  • Library of compounds included
  • 1909 known molecules (42 actively binding
    thrombin)
  • 139,351 binary features describe the 3-D
    structure of each compound
  • 636 new compounds with unknown capacity to bind
    thrombin

8
Automatic webpage classification
  • Several large categorized search engines
  • Yahoo, Dmoz used in Google/Altavista
  • Web 2 billion pages and only a subset in the
    directories
  • Existing taxonomies manually created
  • Need to automatically classify new pages

9
Several classification methods
  • Choose based on
  • data type (numeric,categorical)
  • number of attributes
  • number of classes
  • number of training examples
  • need for interpretation
  • Regression
  • Decision tree classifier
  • Rule-learners
  • Neural networks
  • Generative models
  • Nearest neighbor
  • Support vector machines

10
Nearest neighbor
  • Define similarity between instances
  • Find neighbors of new instance in training data
  • K-NN approach assign majority class amongst k
    nearest neighbour
  • weighted regression learn a new regression
    equation by weighting each training instance
    based on distance from new instance
  • Cons
  • Slow during application.
  • No feature selection.
  • Notion of proximity vague
  • Pros
  • Fast training

11
Decision tree classifiers
  • Widely used learning method
  • Easy to interpret can be re-represented as
    if-then-else rules
  • Approximates function by piece wise constant
    regions
  • Does not require any prior knowledge of data
    distribution, works well on noisy data.

12
Decision trees
  • Tree where internal nodes are simple decision
    rules on one or more attributes and leaf nodes
    are predicted class labels.

Salary lt 20K
Profession teacher
Age lt 30
13
Algorithm for tree building
  • Greedy top-down construction.

Gen_Tree (Node, data)
Yes
make node a leaf?
Stop
Selection criteria
Find best attribute and best split on attribute
Partition data on split condition
For each child j of node Gen_Tree (node_j,
data_j)
14
Support vector machines
  • Binary classifier find hyper-plane providing
    maximum margin between vectors of the two classes

fj
fi
15
Support Vector Machines
  • Extendable to
  • Non-separable problems (Cortes Vapnik, 1995)
  • Non-linear classifiers (Boser et al., 1992)
  • Good generalization performance
  • OCR (Boser et al.)
  • Vision (Poggio et al.)
  • Text classification (Joachims)
  • Requires tuning which kernel, what parameters?
  • Several freely available packages SVMTorch

16
Neural networks
  • Useful for learning complex data like
    handwriting, speech and image recognition

Decision boundaries
Neural network
Classification tree
Linear regression
17
Bayesian learning
  • Assume a probability model on generation of data.
  • Apply Bayes theorem to find most likely class as
  • Naïve bayes Assume attributes conditionally
    independent given class value

18
Meta learning methods
  • No single classifier good under all cases
  • Difficult to evaluate in advance the conditions
  • Meta learning combine the effects of the
    classifiers
  • Voting sum up votes of component classifiers
  • Combiners learn a new classifier on the outcomes
    of previous ones
  • Boosting staged classifiers
  • Disadvantage interpretation hard
  • Knowledge probing learn single classifier to
    mimick meta classifier

19
Outline
  • Mining operations
  • Classification
  • Clustering
  • Association rule mining
  • Sequence mining

20
What is Cluster Analysis?
  • Cluster a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
  • Cluster analysis
  • Grouping a set of data objects into clusters
  • Clustering is unsupervised classification no
    predefined classes
  • Typical applications
  • As a stand-alone tool to get insight into data
    distribution
  • As a preprocessing step for other algorithms

21
Applications
  • Customer segmentation e.g. for targeted marketing
  • Group/cluster existing customers based on time
    series of payment history such that similar
    customers in same cluster.
  • Identify micro-markets and develop policies
    foreach
  • Image processing
  • Text clustering e.g. scatter/gather
  • Compression

22
Distance functions
  • Numeric data euclidean, manhattan distances
  • Minkowski metric sum(xi-yi)m(1/m)
  • Larger m gives higher weight to larger distances
  • Categorical data 0/1 to indicate
    presence/absence
  • Euclidean distance equal weightage to 1 and 0
    match
  • Hamming distance ( dissimilarity)
  • Jaccard coefficients similarity in 1s/( of 1s)
    (0-0 matches not important
  • data dependent measures similarity of A and B
    depends on co-occurance with C.
  • Combined numeric and categorical dataweighted
    normalized distance

23
Clustering methods
  • Hierarchical clustering
  • agglomerative Vs divisive
  • single link Vs complete link
  • Partitional clustering
  • distance-based K-means
  • model-based EM
  • density-based

24
Outline
  • Mining operations
  • Classification
  • Clustering
  • Association rule mining
  • Sequence mining
  • Two applications
  • Intrusion detection
  • Information extraction

25
Intrusion via privileged programs
  • Attacks exploit a loophole in the program to do
    illegal actions
  • Example exploit buffer over-flows to run
    user-code
  • What to monitor of an executing privileged
    program to detect attacks?

open lseek lstat mmap execve ioctl ioctl close e
xecve close unlink
  • Sequence of system calls
  • S set of all possible system calls 100
  • Mining problem given traces of previous normal
    execution, monitor a new execution and flag
    attack or normal
  • Challenge is it possible to do this given widely
    varying normal conditions?

26
Detecting attacks on privileged programs
  • Short sequences of system calls made during
    normal execution of system calls are very
    consistent, yet different from the sequences of
    its abnormal executions
  • Each execution a trace of system calls
  • ignore online traces for the moment
  • Two approaches
  • STIDE
  • Create dictionary of unique k-windows in normal
    traces, count what fraction occur in new traces
    and threshold.
  • IDS
  • next...

27
Classification models on k-grams
  • When both normal and abnormal data available
  • class label normal/abnormal
  • When only normal trace,
  • class-labelk-th system call

Learn rules to predict class-label RIPPER
28
Examples of output RIPPER rules
  • Both-traces
  • if the 2nd system call is vtimes and the 7th is
    vtrace, then the sequence is normal
  • if the 6th system call is lseek and the 7th is
    sigvec, then the sequence is normal
  • if none of the above, then the sequence is
    abnormal
  • Only-normal
  • if the 3rd system call is lstat and the 4th is
    write, then the 7th is stat
  • if the 1st system call is sigblock and the 4th is
    bind, then the 7th is setsockopt
  • if none of the above, then the 7th is open

29
Experimental results on sendmail
  • The output rule sets contain 250 rules, each
    with 2 or 3 attribute tests
  • Score each trace by counting fraction of
    mismatches and thresholding
  • Summary Only normal traces sufficient to
    detect intrusions

30
Information extraction
  • Automatically extract structured fields from
    unstructured documents
  • by learning from examples
  • Technology
  • Graph models (Hidden Markov Models)
  • Probabilistic parsers
  • Applications
  • Comparison shopping agents
  • Bibliography databases (citeseer)
  • Address elementization (IIT Bombay)

31
Problem definition
  • Source concatenation of structured elements with
    limited reordering and some missing fields
  • Example Addresses, bib records

House number
Zip
City
Building
Road
Area
156 Hillside ctype Scenic drive Powai Mumbai
400076
P.P.Wangikar, T.P. Graycar, D.A. Estell, D.S.
Clark, J.S. Dordick (1993) Protein and Solvent
Engineering of Subtilising BPN' in Nearly
Anhydrous Organic Media J.Amer. Chem. Soc. 115,
12231-12237.
32
Learning to segment
  • Given,
  • list of structured elements
  • several examples showing position of structured
    elements in text,
  • Train a model to identify them in unseen text

At top-level a classification problem
  • Issues
  • What are the input features?
  • Build per-element classifiers or a single joint
    classifier?
  • Which type of classifier to use?
  • How much training data is required?

33
Input features
  • Content of the element
  • Specific keywords like street, zip, vol, pp,
  • Properties of words like capitalization, parts
    of speech, number?
  • Inter-element sequencing
  • Intra-element sequencing
  • Element length

34
IE with Hidden Markov Models
  • Probabilistic models for IE

Title
Author
Journal
Year
35
HMM Structure
  • Naïve Model One state per element

Mahatma Gandhi Road Near Parkland ...
Mahatma Gandhi Road Near Landmark Parkland
...
36
Results Comparative Evaluation
Dataset instances Elements
IITB student Addresses 2388 17
Company Addresses 769 6
US Addresses 740 6
The Nested model does best in all three cases
37
Mining market
  • Around 20 to 30 mining tool vendors
  • Major tool players
  • SASs Enterprise Miner.
  • IBMs Intelligent Miner,
  • SGIs MineSet,
  • All pretty much the same set of tools
  • Many embedded products
  • fraud detection
  • electronic commerce applications,
  • health care,
  • customer relationship management Epiphany
Write a Comment
User Comments (0)
About PowerShow.com