Data Mining: An Introduction - PowerPoint PPT Presentation

About This Presentation
Title:

Data Mining: An Introduction

Description:

Data Mining: An Introduction Billy Mutell – PowerPoint PPT presentation

Slides: 24
Provided by: William1354
Learn more at: https://www.math.wm.edu

Transcript and Presenter's Notes

Title: Data Mining: An Introduction


1
Data Mining An Introduction
  • Billy Mutell

2
The Library of Babel Analogy
  • Network of bookshelves with every book ever
    written
  • All the books one could possibly imagine must
    exist somewhere in this library
  • Books have titles like "Axaxxas mlo", "The Bible",
    "Tomorrow's Winning Lottery Numbers"
  • Roughly 25^(1,312,000), or about
    1.956 x 10^(1,834,097), volumes in the library
  • May be viewed as a metaphor for information in
    today's society, where there is a growing amount
    of data but not enough information
3
Content
  • General Information
  • Approaches to searching for information
  • Project and plans

4
What is Data Mining?
  • The nontrivial extraction of implicit, previously
    unknown, and potentially useful information from
    data
  • The science of extracting useful information from
    large data sets or databases

5
How Did it Evolve to What We Have Today?
  • With increased data, techniques needed to be
    created

Information Retrieval
Statistics
Database Management
Data Mining
Algorithms
Machine Learning
6
Practical Applications
Government Intelligence
Insurance
Bank Finance
Branch Evaluation
Pharmaceutical Reactions in Patients
7
Content
  • General Information
  • Approaches to searching for information
  • Project and plans

8
There are two models for mining data
  • Predictive: Makes projected conclusions about
    values based on known results from different
    data
  • Includes Regression, Classification, and Time
    Series Analysis
  • Classification: Maps data into predefined groups
    (Example: identifying potential credit risks)
  • Time Series Analysis: Examines the value of an
    attribute as it varies over time
    (Example: choosing stocks)
9
There are two models for mining data
  • Descriptive: Identifies patterns or relationships
    in data
  • Includes Clustering, Association Rules, and
    Sequence Discovery
  • Clustering: Very similar to Classification, but
    groups are defined by the data and not predefined
  • Association Rules: Identifies specific types of
    data pairings (Example: if someone buys jelly,
    they're probably buying peanut butter)
  • Sequence Discovery: Highlights patterns in
    temporal sequences (Example: if someone buys a
    CD player, they'll probably buy CDs within a
    week)
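
The jelly and peanut butter pairing can be checked numerically. The sketch below computes support and confidence, the two measures commonly used to score an association rule; these measures are not named on the slide, and the baskets are hypothetical data invented for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions that contain everything on the left-hand side of the rule.
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    # Of those, the ones that also contain the right-hand side.
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical market baskets (assumed data, not from the presentation).
BASKETS = [
    {"jelly", "peanut butter", "bread"},
    {"jelly", "peanut butter"},
    {"jelly", "milk"},
    {"bread", "milk"},
    {"peanut butter", "bread"},
]

support, confidence = rule_stats(BASKETS, {"jelly"}, {"peanut butter"})
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A high confidence means that most baskets containing jelly also contain peanut butter, which is what "probably buying" expresses.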
10
Information Analysis
  • Statistical Based Algorithms
  • Decision Tree Based Algorithms
  • Rule Based Algorithms
  • Distance Based Algorithms

11
Linear Regression Examples
Regression: Estimation of an output value based on
input values. It takes the input data and fits it
to a formula that describes the output
12
Statistical Based Algorithms
By determining the regression coefficients c_0,
c_1, ..., c_n, we can estimate the relationship
between the output parameter, y, and the input
parameters, x_1, ..., x_n:
  y = c_0 + c_1 x_1 + ... + c_n x_n
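
One way to determine the coefficients is ordinary least squares. The following is a minimal sketch using only the Python standard library; it assumes the linear form y = c_0 + c_1 x_1 + ... + c_n x_n, and the function names `fit_linear_regression` and `predict` are illustrative, not taken from the presentation.

```python
def fit_linear_regression(rows, targets):
    """Least-squares estimate of [c0, c1, ..., cn] for y = c0 + c1*x1 + ... + cn*xn."""
    # Prepend a constant 1 to each row so that c0 acts as the intercept.
    design = [[1.0, *row] for row in rows]
    size = len(design[0])
    # Build the normal equations (A^T A) c = A^T y as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(size)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(size)
    ]
    # Solve with Gauss-Jordan elimination and partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("inputs are linearly dependent; no unique fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def predict(coefficients, row):
    """Evaluate y = c0 + c1*x1 + ... + cn*xn for one input row."""
    return coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], row))


# Made-up data generated from y = 1 + 2*x1 + 3*x2, so the fit should recover it.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
targets = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_linear_regression(rows, targets)
print([round(c, 6) for c in coefficients])
```

Solving the normal equations directly is fine for a handful of inputs; for larger or ill-conditioned problems a numerical library would be the safer choice.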
13
Decision Tree Example: 20 Questions
14
Rule Based Algorithms
  • Works well for performing classification through
    if-then analysis
  • Trees have an implied order in which splitting
    occurs; rules have no order
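
A small sketch of if-then classification follows. The rules and their height thresholds are assumptions, chosen only so that they agree with the height table on the k-NN slide later in this deck; they do not appear in the presentation itself.

```python
# Hypothetical if-then rules as (condition, class) pairs. The thresholds are
# assumed values that happen to match the deck's height table.
RULES = [
    (lambda h: h <= 1.7, "Short"),
    (lambda h: 1.7 < h < 2.0, "Medium"),
    (lambda h: h >= 2.0, "Tall"),
]


def classify(height, rules=RULES):
    """Return the class of the single rule whose condition holds for height."""
    matches = [label for condition, label in rules if condition(height)]
    if len(matches) != 1:
        raise ValueError("rules must be mutually exclusive and exhaustive")
    return matches[0]


print(classify(1.6), classify(1.88), classify(2.1))
```

Because the conditions do not overlap, the rules can be listed in any order and give the same answer, whereas a decision tree fixes the order in which its tests are applied.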
15
Parametric vs Nonparametric Models
  • Parametric Model: Describes the relationship
    between input and output through algebraic
    equations in which some parameters aren't
    specified
  • Nonparametric Model: Data driven and more
    appropriate for mining applications
  • Nonparametric methods create models based on the
    input, while parametric methods assume models
    ahead of time
  • Nonparametric models are more flexible than
    parametric models and generally easier to work
    with
16
Content
  • General Information
  • Approaches to searching for information
  • Project and plans

17
Netflix: A Case Study
  • Quest to improve customer/movie predictability
    through data mining and linear regression
  • The winning team receives a $1,000,000 prize
  • Must beat Cinematch, Netflix's current program to
    predict movie preferences
  • http://www.netflixprize.com/

18
What others have done so far
  • "If I have seen further, it is by standing on the
    shoulders of giants." - Isaac Newton, 1676
  • There are currently 31,443 contestants on 25,713
    teams from 167 different countries
  • Important to remember that everyone is given the
    same amount of incomplete data, and we have to
    use that to predict the rest of the data (unknown
    to us, known to Netflix)
  • The current leaders are from Budapest, Hungary,
    and they've predicted the data 8.7% more
    accurately than Cinematch
19
K-Nearest Neighbor Algorithm (k-NN)
A set of n pairs (x_1, θ_1), ..., (x_n, θ_n) is
given, where the x_i's take values in a metric
space X upon which is defined a metric d, and the
θ_i's take values in the set {1, 2, ..., M} of
possible classes. Each θ_i is considered to be an
index of the category to which the ith individual
belongs, and each x_i is the outcome of the set of
measurements made upon that individual. A new pair
(x, θ) is given, where only the measurement x is
observable, and it is desired to estimate θ by
using information in the set of correctly
classified points. Thus, we will call x_n' in
{x_1, ..., x_n} the nearest neighbor of x if
  min d(x_i, x) = d(x_n', x),  i = 1, 2, ..., n
The Nearest-Neighbor classification decision
method gives to x the category θ_n' of its nearest
neighbor x_n'
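
The decision rule above can be sketched in a few lines. This version assumes points are coordinate tuples and uses Euclidean distance (`math.dist`) as the default metric d; any other metric can be passed in. The sample points are invented for illustration.

```python
import math


def nearest_neighbor_classify(samples, x, d=math.dist):
    """samples is a list of (x_i, theta_i) pairs; return theta of the x_i closest to x."""
    # min() selects the pair whose point minimizes d(x_i, x).
    _, theta = min(samples, key=lambda pair: d(pair[0], x))
    return theta


# Hypothetical labeled points with classes drawn from {1, 2, 3}.
samples = [((0.0, 0.0), 1), ((5.0, 5.0), 2), ((0.0, 4.0), 3)]
print(nearest_neighbor_classify(samples, (4.0, 4.5)))
```

If several points are equally close, `min` returns the first one in the list, so ties are resolved by input order in this sketch.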

20
K-Nearest Neighbor Algorithm (k-NN)
  • If k = 3, we classify the dot as a triangle
  • If k = 5, we classify the dot as a rectangle
21
Suppose we want to know what the entry <Pat, F,
1.6> would be classified as

Name       Gender  Height (m)  Output
Kristina   F       1.6         Short
Jim        M       2           Tall
Maggie     F       1.9         Medium
Martha     F       1.88        Medium
Stephanie  F       1.7         Short
Bob        M       1.85        Medium
Kathy      F       1.6         Short
Dave       M       1.7         Short
Worth      M       2.2         Tall
Steven     M       2.1         Tall
Debbie     F       1.8         Medium
Todd       M       1.95        Medium
Kim        F       1.9         Medium
Amy        F       1.8         Medium
Wynette    F       1.75        Medium

Set K = 5 and find the K nearest neighbors:
  <Kristina, F, 1.6>  => SHORT
  <Kathy, F, 1.6>     => SHORT
  <Stephanie, F, 1.7> => SHORT
  <Dave, M, 1.7>      => SHORT
  <Wynette, F, 1.75>  => MEDIUM
Thus KNN would classify <Pat, F, 1.6> as SHORT
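
The worked example can be reproduced with a short script. This sketch assumes that distance is measured on height alone and that gender is ignored, which is consistent with Dave (M) appearing among the neighbors; the function name `knn_classify` is illustrative.

```python
from collections import Counter

# Training data copied from the table: (name, gender, height in m, class).
PEOPLE = [
    ("Kristina", "F", 1.6, "Short"), ("Jim", "M", 2.0, "Tall"),
    ("Maggie", "F", 1.9, "Medium"), ("Martha", "F", 1.88, "Medium"),
    ("Stephanie", "F", 1.7, "Short"), ("Bob", "M", 1.85, "Medium"),
    ("Kathy", "F", 1.6, "Short"), ("Dave", "M", 1.7, "Short"),
    ("Worth", "M", 2.2, "Tall"), ("Steven", "M", 2.1, "Tall"),
    ("Debbie", "F", 1.8, "Medium"), ("Todd", "M", 1.95, "Medium"),
    ("Kim", "F", 1.9, "Medium"), ("Amy", "F", 1.8, "Medium"),
    ("Wynette", "F", 1.75, "Medium"),
]


def knn_classify(height, k, people=PEOPLE):
    """Return (majority class, the k nearest rows) using |height difference|."""
    # sorted() is stable, so rows at equal distance keep their table order.
    nearest = sorted(people, key=lambda p: abs(p[2] - height))[:k]
    votes = Counter(p[3] for p in nearest)
    return votes.most_common(1)[0][0], nearest


label, neighbors = knn_classify(1.6, k=5)
print(label, [name for name, *_ in neighbors])
```

With k = 5 the vote is four Short to one Medium, matching the slide's conclusion.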
22
What I plan to do from here
  • Take data from Netflix and sift through it
  • Develop a function that maps non-linear data to a
    linear format so that it may be clustered and
    regressed
  • Map data to matrices in R^n
  • Use Support Vector Machines to map input vectors
    to a higher-dimensional space where a maximal
    separating hyperplane is constructed
  • Create a way to interpret this data in the form
    of movie recommendations
  • Also: Use a k-NN approach along with Latent
    Semantic Indexing techniques to analyze scripts
    and key thematic plots and look for
    correlations/clusters
23
Questions?