CS 459995

About This Presentation
Slides: 34
Provided by: ksu7
Learn more at: http://www.cs.kent.edu

Transcript and Presenter's Notes


1
Introduction
  • CS 459995
  • Introduction to Data Mining

2
Outline
  • What is data mining?
  • Basic Data Mining Tasks
  • Classification
  • Clustering
  • Association
  • Data mining Algorithms
  • Are all the patterns interesting?

3
What is Data Mining
  • The amount of data in databases and files grows
    exponentially: 9 petabytes for an Earth observation
    project in 2010 and 14 petabytes in 2015.
  • Data mining is concerned with finding information
    in these huge data sources
  • Typical database query: SQL, Access, and other
    database languages are used to get data
  • A data mining query differs from a database query:
  • The query is not well formulated
  • The data is in many sources
  • The output is mostly either visual or multimedia
  • Data mining algorithms for extracting the
    information consist of three parts:
  • Model: the purpose of the algorithm is to fit the
    model to the data
  • Preferences: criteria to decide which model is
    better
  • Search: all algorithms require some search
    technique

4
[Diagram: Data Mining and related disciplines: Information Retrieval, Statistics, Knowledge-Based Systems, Algorithms, Machine Learning]

5
Statistics is not Data Mining
  • A big objection to data mining was that it
    looked for so many vague connections that it was
    sure to find some that were bogus
  • The Rhine Paradox: a great example of how not to
    conduct scientific research.
  • Joseph Rhine was a parapsychologist in the 1950s
    who hypothesized that some people had
    Extra-Sensory Perception (ESP).
  • He devised an experiment in which subjects were
    asked to guess 10 hidden cards, each red or blue.
  • He discovered that almost 1 in 1000 had ESP:
    they were able to get all 10 right!

6
Example (cont)
  • He told these people they had ESP and called them
    in for another test of the same type.
  • Alas, he discovered that almost all of them had
    lost their ESP.
  • What did he conclude?
  • You shouldn't tell people that they have ESP: it
    causes them to lose it

7
Example (cont)
  • What really happened?
  • There are 2^10 = 1024 combinations of red and blue
    of length 10, so a subject guessing at random gets
    all 10 right with probability 1/1024.
  • Thus, given enough subjects, with probability 0.98
    at least one person will guess the sequence of
    red and blue correctly by chance
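
The slide gives the 0.98 figure without saying how many subjects were tested. The sketch below is a rough check of the arithmetic, assuming every subject guesses independently and uniformly at random. The function name and the subject counts are illustrative assumptions, not taken from the slides.

```python
def prob_at_least_one_perfect(num_subjects: int, num_cards: int = 10) -> float:
    """Probability that at least one of num_subjects random guessers
    gets all num_cards red/blue cards right."""
    p_single = 0.5 ** num_cards  # 1/1024 for 10 cards
    return 1.0 - (1.0 - p_single) ** num_subjects


# Subject counts are hypothetical; the slides do not state how many were tested.
for n in (1000, 4000):
    print(n, round(prob_at_least_one_perfect(n), 2))
```

Under these assumptions, a pool of roughly 4000 subjects is what it takes to reach the 0.98 quoted on the slide.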

8
Knowledge Based Systems are not Data Mining
  • The KDD process selects the data and finds
    knowledge in the data
  • Data mining, in addition, tries to make inferences
    from the data
  • However, the boundaries are not easy to define

9
Machine Learning is not Data Mining
  • Machine learning designs systems that can learn
    while processing data
  • A checkers program designed by one scientist
    eventually learned to play better than its
    designer
  • Data mining incorporates machine learning
    methods but also benefits from the methods of
    other disciplines such as databases and statistics

10
What is Data Mining
  • The major task of data mining is to find all and
    only the interesting patterns in a set of data
    sources
  • Finding all interesting patterns means
    completeness
  • Can it be done?
  • Heuristic vs. exhaustive search
  • Finding only interesting patterns means
    consistency
  • Is it possible?
  • Approaches: generate all patterns and filter out
    the uninteresting ones, or generate only patterns
    that are interesting

11
Data Mining On What Kind of Data?
  • Relational databases: universal relation vs.
    multirelational search
  • Data warehouses
  • Transactional databases
  • Advanced DB and information repositories
  • Object-oriented and object-relational databases
  • Spatial databases
  • Time-series data and temporal data
  • Sensor Data
  • Text databases and multimedia databases
  • Heterogeneous and legacy databases
  • WWW

12
Data Mining On What Kind of Data?
  • Attribute types
  • Categorical: an attribute that has a finite number
    of values
  • Ordinal: attributes that can be ordered by their
    values
  • Attribute transformations
  • Continuous: an attribute that may have an infinite
    but countable set of values. These attributes
    can always be ordered
  • Interval scale
  • Boolean
  • Nominal: attributes that cannot be ordered by
    their values
  • Operational: for example, a measurement of
    programming productivity as
    a*m*(n + m)*log(a + b)/(2b), where a is
    the number of unique operators, b is the number of
    unique operands, n is the total number of operator
    occurrences, and m is the total number of operand
    occurrences
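
The operational measure can be evaluated directly once the four counts are known. The sketch below reads the productivity formula as a*m*(n + m)*log(a + b)/(2b), on the assumption that the plus signs were lost in the transcript; that reading appears to match the form of Halstead's effort measure. The base-2 logarithm, the function name, and the sample counts are also assumptions.

```python
import math


def operational_measure(a: int, b: int, n: int, m: int) -> float:
    """Evaluate a*m*(n + m)*log2(a + b) / (2*b).

    a: unique operators, b: unique operands,
    n: total operator occurrences, m: total operand occurrences.
    The base-2 logarithm is an assumption; the slide only says "log".
    """
    if a < 0 or b <= 0:
        raise ValueError("need a >= 0 and at least one unique operand")
    return a * m * (n + m) * math.log2(a + b) / (2 * b)


# Hypothetical counts for a small program.
print(operational_measure(a=4, b=4, n=10, m=6))
```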

13
Data Mining Models
[Diagram: data mining models]
  • Predictive models: Classification, Regression,
    Time Series Analysis, Prediction
  • Descriptive models: Clustering, Summarization,
    Association Rules, Sequence Discovery

14
Classification
  • Given a set of classes, distribute the data among
    them so that newly arriving data will, with high
    probability, fall into one of the classes.
  • Credit card example: 4 classes (authorize,
    request more info, do not authorize, contact
    police)
  • The data is a set of credit card applications that
    contain name, age, credit score, address, income,
    whether the primary residence is owned or rented,
    etc.
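
A minimal sketch of the credit card example follows, using hand-written rules in place of a learned model. The four class names come from the slide; the `fraud_flag` field and every threshold are hypothetical values chosen only to make the example run.

```python
from dataclasses import dataclass

CLASSES = ("authorize", "request more info", "do not authorize", "contact police")


@dataclass
class Application:
    credit_score: int
    income: float
    fraud_flag: bool = False  # hypothetical field, not in the slides


def classify(app: Application) -> str:
    """Assign an application to exactly one of the four classes.

    All thresholds are made-up illustrative values.
    """
    if app.fraud_flag:
        return "contact police"
    if app.credit_score >= 700 and app.income >= 30_000:
        return "authorize"
    if app.credit_score >= 600:
        return "request more info"
    return "do not authorize"


print(classify(Application(credit_score=720, income=50_000)))
```

In practice the rules would be learned from labeled applications instead of being written by hand; the point here is only that every new application lands in one of the predefined classes.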

15
Regression
  • Regression is the process of mapping given data
    to some function. Regression may be linear
    (mapping the given data into a linear function)
    or non-linear.
  • For example, one may map savings amount to a
    person's age as follows:
  • samt = a * age + b, where the constants a and b
    are determined by existing data
  • Fitting the rest of the data to the defined
    function should give the least possible error
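
One common way to determine a and b in samt = a * age + b is ordinary least squares, which minimizes the sum of squared errors. The slides do not name a fitting method, so this is a sketch under that assumption; the function name and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical (age, savings) pairs that lie exactly on samt = 500*age - 5000.
ages = [20, 30, 40, 50]
savings = [5000, 10000, 15000, 20000]
a, b = fit_line(ages, savings)
print(a, b)
```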

16
Time Series Analysis
  • Given data that changes with time, predict its
    future behavior based on the known data
  • Example: predict the stock market, or predict the
    stock price of a specific company
  • Visualization is an important tool of time series
    analysis
  • There are special operations on time series that
    facilitate the time series analysis
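
The slides do not list the special operations they have in mind. A simple moving average is one widely used example: it smooths short-term fluctuations so that a trend is easier to see. The function name and the prices below are illustrative assumptions.

```python
def moving_average(series, window):
    """Simple moving average; returns len(series) - window + 1 values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


# Hypothetical daily closing prices.
prices = [10, 12, 11, 13, 15, 14]
print(moving_average(prices, 3))
```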

17
Prediction
  • Differences between classification and
    prediction:
  • Classification deals with existing data
  • Prediction deals with future events
  • Mathematical models are normally used for
    prediction: weather forecasts, earthquake
    forecasts, etc.

18
Clustering
  • Clustering is the process of distributing given
    data into several sets so that the distance
    between different sets is larger than the distance
    between elements in the same set
  • The difference between clustering and
    classification is that the number of clusters is
    not known in advance, whereas the number of
    classes is known in advance.
  • Examples
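
As a small illustration of the number of clusters not being fixed in advance, the sketch below groups one-dimensional values by splitting wherever the gap between sorted neighbors exceeds a threshold. In one dimension this is equivalent to cutting single-linkage clustering at that distance. The function name, the threshold, and the data are hypothetical.

```python
def cluster_by_gap(values, max_gap):
    """Group 1-D values: a new cluster starts whenever the gap to the
    previous sorted value exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


# Hypothetical customer ages; the number of clusters is not specified.
print(cluster_by_gap([23, 25, 24, 41, 43, 70], max_gap=5))
```

The number of clusters comes out of the data and the distance threshold, in contrast to classification, where the classes are given up front.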

19
Association Rules and Sequence Discovery
  • Association rule discovery relates to uncovering
    unexpected relationships between data attribute
    values. For example, people who buy coffee may not
    buy tea, or men who buy diapers also buy beer.
    However, women who buy diapers do not buy beer
  • Sequence discovery: the ability to determine
    sequential patterns in the data
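
Association rules such as "diapers → beer" are usually scored with support and confidence. The slides do not define these measures, so the sketch below uses their standard definitions; the function names and the baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs")
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical market baskets, each a set of purchased items.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "coffee"},
    {"diapers", "milk"},
    {"coffee", "milk"},
]
print(support(baskets, {"diapers", "beer"}))
print(confidence(baskets, {"diapers"}, {"beer"}))
```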

20
Data Mining Tasks
  • Data Selection
  • Data Integration
  • Data Cleaning
  • Data Transformation
  • Data Mining
  • Outlier Analysis
  • Result Interpretation
  • Trend and Evolution Analysis

21
Data Visualization
  • Graphical: bar charts, histograms,
    line graphs
  • Geometric: scatter diagram techniques
  • Icon-based: figures and colors to improve the
    presentation of results
  • Hierarchical: divide a display area into
    segments
  • Hybrid: a combination of all of the above

22
Data Mining Major Issues
  • Human Interface
  • Model Selection
  • How to deal with outliers
  • Interpretation of results
  • Visualization of results
  • Dealing with large amounts of data
  • Curse of dimensionality
  • Multimedia Data
  • Missing Data
  • Irrelevant data
  • Integration
  • Application

23
Data Mining Major Issues
  • Mining Methodology
  • Mining different types of data in databases
  • Interactive data mining
  • Incorporation of known data
  • Noise and incomplete data
  • Performance and scalability
  • Social impact: data privacy and security

24
Potential Applications
  • Database analysis and decision support
  • Market analysis and management
  • Target marketing, customer relationship
    management, market basket analysis, cross-selling,
    market segmentation
  • Risk analysis and management
  • Forecasting, customer retention, improved
    underwriting, quality control, competitive
    analysis
  • Fraud detection and management
  • Other Applications
  • Text mining (newsgroups, email, documents) and
    Web analysis.
  • Intelligent query answering

25
Market Analysis and Management (1)
  • Where are the data sources for analysis?
  • Credit card transactions, loyalty cards, discount
    coupons, customer complaint calls, plus (public)
    lifestyle studies
  • Target marketing
  • Find clusters of model customers who share the
    same characteristics: interests, income level,
    spending habits, etc.
  • Determine customer purchasing patterns over time
  • Conversion of a single to a joint bank account:
    marriage, etc.
  • Cross-market analysis
  • Associations/correlations between product sales
  • Prediction based on the association information

26
Market Analysis and Management (2)
  • Customer profiling
  • data mining can tell you what types of customers
    buy what products (clustering or classification)
  • Identifying customer requirements
  • identifying the best products for different
    customers
  • use prediction to find what factors will attract
    new customers
  • Provides summary information
  • various multidimensional summary reports
  • statistical summary information (data central
    tendency and variation)

27
Corporate Analysis and Risk Management
  • Finance planning and asset evaluation
  • cash flow analysis and prediction
  • contingent claim analysis to evaluate assets
  • cross-sectional and time series analysis
    (financial-ratio, trend analysis, etc.)
  • Resource planning
  • summarize and compare the resources and spending
  • Competition
  • monitor competitors and market directions
  • group customers into classes and apply a
    class-based pricing procedure
  • set pricing strategy in a highly competitive
    market

28
Fraud Detection and Management (1)
  • Applications
  • widely used in health care, retail, credit card
    services, telecommunications (phone card fraud),
    etc.
  • Approach
  • use historical data to build models of fraudulent
    behavior and use data mining to help identify
    similar instances
  • Examples
  • Auto insurance: detect groups of people who
    stage accidents to collect on insurance
  • Money laundering: detect suspicious money
    transactions (US Treasury's Financial Crimes
    Enforcement Network)
  • Medical insurance: detect professional patients
    and rings of doctors and rings of references

29
Fraud Detection and Management (2)
  • Detecting inappropriate medical treatment
  • Detecting telephone fraud
  • Telephone call model: destination of the call,
    duration, time of day or week. Analyze patterns
    that deviate from an expected norm.
  • British Telecom identified discrete groups of
    callers with frequent intra-group calls,
    especially mobile phones, and broke a
    multimillion-dollar fraud.
  • Retail
  • Analysts estimate that 38% of retail shrink is
    due to dishonest employees.
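
For the telephone call model above, one simple way to find patterns that deviate from an expected norm is to flag values that lie far from the mean in units of standard deviation (a z-score test). The slides do not prescribe a method, so this is only a sketch; the function name, the threshold, and the call durations are hypothetical.

```python
from statistics import mean, stdev


def flag_deviations(durations, threshold=2.0):
    """Return the durations whose z-score magnitude exceeds threshold."""
    mu = mean(durations)
    sigma = stdev(durations)  # sample standard deviation
    if sigma == 0:
        return []
    return [d for d in durations if abs(d - mu) / sigma > threshold]


# Hypothetical call durations in minutes for one account.
calls = [3, 4, 5, 4, 3, 5, 4, 60]
print(flag_deviations(calls))
```

A real fraud model would combine several features, such as destination and time of day, instead of testing duration alone.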

30
Other Applications
  • Sports
  • IBM Advanced Scout analyzed NBA game statistics
    (shots blocked, assists, and fouls) to gain a
    competitive advantage for the New York Knicks and
    the Miami Heat
  • Astronomy
  • JPL and the Palomar Observatory discovered 22
    quasars with the help of data mining
  • Internet Web Surf-Aid
  • IBM Surf-Aid applies data mining algorithms to
    Web access logs for market-related pages to
    discover customer preferences and behavior,
    analyze the effectiveness of Web marketing,
    improve Web site organization, etc.

31
Data Mining System Architecture
  • Database, data warehouse, data files: the set of
    data to be mined. Data cleaning and data
    integration may be performed at this stage
  • The database or data warehouse server is
    responsible for fetching relevant data. How to
    define relevancy?
  • Knowledge base: domain knowledge that drives the
    search for patterns. Concept hierarchies, user
    beliefs, interestingness constraints
  • Data mining engine: functional algorithms that
    perform the search for domain patterns
  • Pattern evaluation: uses the knowledge base and
    other methods to narrow the search for domain
    patterns
  • GUI: communicator between users and the data
    mining system

32
Architecture of a Typical Data Mining System
[Diagram, layers from top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Knowledge base; Database or data warehouse server; Data cleaning, data integration, and filtering; Databases and Data Warehouse]

33
Summary
  • Data mining: discovering interesting patterns
    from large amounts of data
  • A natural evolution of database technology, in
    great demand, with wide applications
  • A KDD process includes data cleaning, data
    integration, data selection, transformation, data
    mining, pattern evaluation, and knowledge
    presentation
  • Mining can be performed in a variety of
    information repositories
  • Data mining functionalities: characterization,
    discrimination, association, classification,
    clustering, outlier and trend analysis, etc.
  • Classification of data mining systems
  • Major issues in data mining