CS 459995

About This Presentation

Slides: 34

Provided by: ksu7

Learn more at: http://www.cs.kent.edu

Transcript and Presenter's Notes

1
Introduction

CS 459995
Introduction to Data Mining

2
Outline

What is data mining?
Basic Data Mining Tasks
Classification
Clustering
Association
Data mining Algorithms
Are all the patterns interesting?

3
What is Data Mining

The amount of data in databases and files grows
exponentially: 9 petabytes for an Earth observation
project in 2010 and 14 petabytes in 2015.
Data mining is concerned with finding information
in these huge data sources.
Typical database query: SQL, Access, and other
database languages are used to get data.
A data mining query differs from a database query:
The query is not well formulated
The data reside in many sources
The output is mostly either visual or multimedia
Data mining algorithms that extract the information
consist of three parts:
Model: the purpose of the algorithm is to fit the
model to the data
Preferences: criteria to decide which model is
better
Search: all algorithms require some search
technique

4
[Figure: data mining and its related fields: information retrieval, statistics, knowledge-based systems, algorithms, machine learning]

5
Statistics is not Data Mining

A big objection to data mining was that it looked
for so many vague connections that it was
sure to find some that were bogus.
The Rhine Paradox: a great example of how not to
conduct scientific research.
Joseph Rhine was a parapsychologist in the 1950s
who hypothesized that some people had
Extra-Sensory Perception (ESP).
He devised an experiment where subjects were
asked to guess 10 hidden cards, each red or blue.
He discovered that almost 1 in 1000 had ESP:
they were able to get all 10 right!

6
Example(cont)

He told these people they had ESP and called them
in for another test of the same type.
Alas, he discovered that almost all of them had
lost their ESP.
What did he conclude?
You shouldn't tell people that they have ESP: it
causes them to lose it

7
Example (cont)

What really happened?
There are 2^10 = 1024 possible red/blue sequences
of length 10, so a pure guesser gets all 10 right
with probability 1/1024.
Thus, given enough subjects, with probability 0.98
at least one person will guess the red/blue
sequence correctly by chance alone
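
The arithmetic behind this slide can be checked with a short sketch. The slide does not say how many subjects were tested, so the subject counts below are assumptions chosen for illustration; under the assumption that every subject guesses at random, roughly 4,000 subjects are enough to reach the quoted 0.98.

```python
def prob_at_least_one_perfect(n_subjects: int, n_cards: int = 10) -> float:
    """Probability that at least one of n_subjects guesses every card
    correctly when each guess is an independent red/blue coin flip."""
    p_single = 0.5 ** n_cards                     # 1/1024 for 10 cards
    return 1.0 - (1.0 - p_single) ** n_subjects


# Subject counts are hypothetical; the slide does not state the sample size.
for n in (1000, 4000):
    print(n, round(prob_at_least_one_perfect(n), 2))
```

The point of the example is that a "discovery" rate of about 1 in 1000 is exactly what random guessing predicts, so the re-test losing the effect is expected.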

8
Knowledge-Based Systems are not Data Mining

The KDD process selects the data and finds knowledge
in the data
Data mining, in addition, tries to make inferences
from the data
However, the boundaries are not easy to define

9
Machine Learning is not Data Mining

Machine learning designs systems that can learn
while processing data
A checkers program designed by one scientist
eventually learned to play better than its
designer
Data mining incorporates machine learning
methods but also benefits from the methods of
other disciplines such as databases and statistics

10
What is Data Mining

The major task of data mining is to find all and only
the interesting patterns in a set of data sources
Finding all interesting patterns means
completeness
Can it be done?
Heuristic vs. exhaustive search
Finding only interesting patterns means consistency
Is it possible?
Approaches: generate all patterns and filter out
the uninteresting ones, or generate only patterns
that are interesting

11
Data Mining On What Kind of Data?

Relational databases: universal relation vs.
multirelational search
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Sensor Data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW

12
Data Mining On What Kind of Data?

Attribute Types
Categorical: an attribute that has a finite number
of values
Ordinal: attributes that can be ordered by their
values
Attribute Transformations
Continuous: an attribute that may have an infinite
but countable set of values. These attributes
can always be ordered
Interval scale
Boolean
Nominal: attributes that cannot be ordered by
their values
Operational: for example, the measurement of programming
productivity as $a\,m\,(n+m)\log(a+b)/(2b)$, where a is
the number of unique operators, b is the number of
unique operands, n is the total number of operator
occurrences, and m is the total number of operand
occurrences
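
As an illustration of an operational attribute, the measure on this slide can be computed directly from the four counts. This is a minimal sketch that reads the slide's formula as a·m·(n+m)·log(a+b)/(2b), which assumes the plus signs were lost in extraction; the base-2 logarithm and the function name `operational_measure` are also assumptions, since the slide only says "log".

```python
import math


def operational_measure(a: int, b: int, n: int, m: int) -> float:
    """Compute a*m*(n+m)*log(a+b) / (2*b), where
    a = unique operators, b = unique operands,
    n = total operator occurrences, m = total operand occurrences."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    # Base-2 logarithm is an assumption; the slide does not give the base.
    return a * m * (n + m) * math.log2(a + b) / (2 * b)


# Hypothetical counts for a small program.
print(operational_measure(a=10, b=5, n=20, m=15))
```

The attribute is "operational" in the sense that its value is defined by a procedure applied to other measured quantities rather than observed directly.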

13
Data Mining Models
[Figure: taxonomy of data mining models]
Predictive models: Classification, Regression, Time Series, Prediction
Descriptive models: Clustering, Summarization, Association Rules, Sequence Discovery

14
Classification

Given a set of classes, distribute the data into
those classes so that newly arrived data will,
with high probability, fall
into one of the classes.
Credit card example, 4 classes: authorize,
request more info, do not authorize, contact
police
The data is a set of credit card applications that
contain name, age, credit score, address, income,
whether the primary residence is owned or rented, etc.
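
To make the credit card example concrete, here is a minimal rule-based sketch that assigns each application to one of the four classes. Every threshold, the `Application` type, and the `reported_stolen_identity` flag are hypothetical and invented for illustration; in practice a classifier would be learned from labeled historical applications rather than written by hand.

```python
from dataclasses import dataclass


@dataclass
class Application:
    name: str
    age: int
    credit_score: int
    income: float
    owns_residence: bool
    reported_stolen_identity: bool = False   # hypothetical field, not on the slide


def classify(app: Application) -> str:
    """Assign one of the four classes from the slide.
    All thresholds below are made up for illustration."""
    if app.reported_stolen_identity:
        return "contact police"
    if app.credit_score >= 700 and app.income >= 30_000:
        return "authorize"
    if app.credit_score >= 600:
        return "request more info"
    return "do not authorize"


print(classify(Application("A. Smith", 41, 720, 55_000.0, True)))
```

Every application lands in exactly one class, which is the property the slide's definition asks for.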

15
Regression

Regression is the process of mapping given data
to some function. Regression may be linear
(mapping the given data to a linear function)
or non-linear.
For example, one may map savings amount to a
person's age as follows:
$samt = a \cdot age + b$, where the constants
a and b are determined by existing data
Fitting the rest of the data to the defined
function should give the least possible error
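
A minimal sketch of the linear case, reading the slide's formula as samt = a * age + b and using ordinary least squares to determine the constants. The ages and savings amounts are made-up illustration data, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                 # slope
    b = mean_y - a * mean_x       # intercept
    return a, b


# Made-up data: age in years, savings amount in dollars.
ages = [25, 35, 45, 55]
savings = [5000.0, 15000.0, 25000.0, 35000.0]
a, b = fit_line(ages, savings)
print(a, b)
print(a * 40 + b)   # predicted savings for a 40-year-old
```

Least squares chooses a and b to minimize the sum of squared differences between the observed and fitted savings, which is one common way to formalize "least possible error".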

16
Time Series Analysis

Given data that changes with time, predict the
data's behavior based on the known data
Example: predict the stock market, or predict the stock
price of a specific company
Visualization is an important tool of time series
analysis
There are special operations on time series that
facilitate time series analysis
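
The slide does not name the special operations it has in mind. As one common example, a moving average smooths short-term fluctuations so that the trend is easier to see; the sketch below assumes that operation, and the price series is made up.

```python
def moving_average(series, window):
    """Return the simple moving average of `series` over `window` points."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Made-up daily closing prices.
prices = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]
print(moving_average(prices, 3))
```

Each output value summarizes three consecutive observations, so the smoothed series is shorter than the input by `window - 1` points.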

17
Prediction

Differences between classification and
prediction:
Classification deals with existing data
Prediction deals with future events
Mathematical models are normally used for
prediction: weather forecasts, earthquake forecasts, etc.

18
Clustering

Clustering is the process of distributing given
data into several sets so that the distance between
different sets is larger than the distance
between elements in the same set
The difference between clustering and classification
is that the number of clusters is not known in
advance, whereas the number of classes is known
in advance.
Examples
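
A minimal sketch of the idea for one-dimensional data: sort the values and start a new cluster wherever the gap to the previous value exceeds a threshold. This is equivalent to single-linkage clustering cut at that distance, and the number of clusters comes out of the data rather than being fixed in advance. The function name, the threshold, and the income values are assumptions for illustration.

```python
def cluster_by_gap(values, max_gap):
    """Group 1-D values so that sorted neighbors no more than max_gap
    apart share a cluster (single-linkage clustering cut at max_gap)."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)      # close enough: same cluster
        else:
            clusters.append([v])        # large gap: start a new cluster
    return clusters


# Made-up customer incomes, in thousands.
incomes = [21, 58, 23, 110, 60, 22, 61]
print(cluster_by_gap(incomes, max_gap=5))
```

Within each resulting cluster, neighboring values are at most `max_gap` apart, while any two clusters are separated by more than `max_gap`.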

19
Association Rules and Sequence Discovery

Association rule discovery relates to uncovering
unexpected relationships between data attribute
values. For example, people who buy coffee may not
buy tea, or men who buy diapers also buy beer.
However, women who buy diapers do not buy beer
Sequence discovery: the ability to determine
sequential patterns in the data
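
The slide does not say how the strength of a rule such as "diapers implies beer" is measured. Two standard measures are support (the fraction of transactions containing all the items) and confidence (the fraction of transactions containing the left-hand side that also contain the right-hand side). The sketch below computes both; the baskets are made-up illustration data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    if not transactions:
        return 0.0
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    sup_a = support(transactions, antecedent)
    if sup_a == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / sup_a


# Made-up market baskets.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "chips"},
    {"diapers", "milk"},
    {"coffee", "sugar"},
    {"tea"},
]
print(support(baskets, {"diapers", "beer"}))
print(confidence(baskets, {"diapers"}, {"beer"}))
```

A rule is usually reported only when both measures exceed thresholds chosen by the analyst.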

20
Data Mining Tasks

Data Selection
Data Integration
Data Cleaning
Data Transformation
Data Mining
Outlier Analysis
Result Interpretation
Trend and Evolution Analysis

21
Data Visualization

Graphical interface: bar charts, histograms,
line graphs
Geometric: scatter diagram techniques
Icon-based: figures and colors to improve the
presentation of results
Hierarchical: divide the display area into
segments
Hybrid: a combination of all of the above

22
Data Mining Major Issues

Human Interface
Model Selection
How to deal with outliers
Interpretation of Results
Visualization of Results
Dealing with large amounts of data
Curse of Dimensionality
Multimedia Data
Missing Data
Irrelevant data
Integration
Application

23
Data Mining Major Issues

Mining Methodology
Mining different types of data in databases
Interactive data mining
Incorporation of known data
Noise and incomplete data
Performance and scalability
Social impact: data privacy and security

24
Potential Applications

Database analysis and decision support
Market analysis and management:
target marketing, customer relationship management,
market basket analysis, cross-selling, market
segmentation
Risk analysis and management:
forecasting, customer retention, improved
underwriting, quality control, competitive
analysis
Fraud detection and management
Other applications
Text mining (newsgroups, email, documents) and
Web analysis
Intelligent query answering

25
Market Analysis and Management (1)

Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public)
lifestyle studies
Target marketing
Find clusters of model customers who share the
same characteristics interest, income level,
spending habits, etc.
Determine customer purchasing patterns over time
Conversion of a single to a joint bank account:
marriage, etc.
Cross-market analysis
Associations and correlations between product sales
Prediction based on the association information

26
Market Analysis and Management (2)

Customer profiling
data mining can tell you what types of customers
buy what products (clustering or classification)
Identifying customer requirements
identifying the best products for different
customers
use prediction to find what factors will attract
new customers
Provides summary information
various multidimensional summary reports
statistical summary information (data central
tendency and variation)

27
Corporate Analysis and Risk Management

Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis
(financial-ratio, trend analysis, etc.)
Resource planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and apply a class-based
pricing procedure
set pricing strategy in a highly competitive
market

28
Fraud Detection and Management (1)

Applications
widely used in health care, retail, credit card
services, telecommunications (phone card fraud),
etc.
Approach
use historical data to build models of fraudulent
behavior and use data mining to help identify
similar instances
Examples
auto insurance: detect a group of people who
stage accidents to collect on insurance
money laundering: detect suspicious money
transactions (US Treasury's Financial Crimes
Enforcement Network)
medical insurance: detect professional patients,
rings of doctors, and rings of references

29
Fraud Detection and Management (2)

Detecting inappropriate medical treatment
Detecting telephone fraud
Telephone call model: destination of the call,
duration, time of day or week. Analyze patterns
that deviate from an expected norm.
British Telecom identified discrete groups of
callers with frequent intra-group calls,
especially on mobile phones, and broke a
multimillion-dollar fraud.
Retail
Analysts estimate that 38% of retail shrink is
due to dishonest employees.
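
For the telephone call model on this slide, one simple way to find "patterns that deviate from an expected norm" is to flag values that lie far from the mean in units of standard deviation (a z-score test). This is a sketch under that assumption, not the method any of the named organizations used; the threshold of 2.0 and the call durations are made up.

```python
from statistics import mean, stdev


def flag_unusual(durations, threshold=2.0):
    """Return the durations whose z-score exceeds `threshold`."""
    if len(durations) < 2:
        return []
    mu = mean(durations)
    sigma = stdev(durations)            # sample standard deviation
    if sigma == 0:
        return []                       # all calls identical: nothing deviates
    return [d for d in durations if abs(d - mu) / sigma > threshold]


# Made-up call durations in minutes for one account.
calls = [3, 4, 5, 4, 3, 5, 4, 4, 3, 120]
print(flag_unusual(calls))
```

Because an extreme value inflates both the mean and the standard deviation, robust alternatives such as the median and median absolute deviation are often preferred on real data.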

30
Other Applications

Sports
IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain
competitive advantage for New York Knicks and
Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22
quasars with the help of data mining
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preference and behavior pages,
analyzing effectiveness of Web marketing,
improving Web site organization, etc.

31
Data Mining System Architecture

Database, data warehouse, data files: the set of data
to be mined. Data cleaning and data integration
may be performed at this stage
The database or data warehouse server is responsible
for fetching relevant data. How should relevancy be
defined?
Knowledge base: domain knowledge that drives the
search for patterns, such as concept hierarchies, user
beliefs, and interestingness constraints
Data mining engine: functional algorithms that
perform the search for domain experts
Pattern evaluation: uses the knowledge base and other
methods to narrow the search for domain patterns
GUI: communicator between users and the data mining
system

32
Architecture of a Typical Data Mining System
[Figure: layered architecture, top to bottom: graphical user interface; pattern evaluation; data mining engine; knowledge base; database or data warehouse server; filtering, data cleaning, and data integration; data warehouse and databases]

33
Summary

Data mining: discovering interesting patterns
from large amounts of data
A natural evolution of database technology, in
great demand, with wide applications
A KDD process includes data cleaning, data
integration, data selection, transformation, data
mining, pattern evaluation, and knowledge
presentation
Mining can be performed in a variety of
information repositories
Data mining functionalities: characterization,
discrimination, association, classification,
clustering, outlier and trend analysis, etc.
Classification of data mining systems
Major issues in data mining
