Title: Artificial Neural Networks and Data Mining
1 Artificial Neural Networks and Data Mining
Wismar Business School
www.wi.hs-wismar.de/laemmel
Uwe.Laemmel_at_hs-wismar.de
2 Content
- Data Mining
- Classification approach
- Data Mining Cup
  - 2004: Who will cancel?
  - 2007: Who will get a rebate coupon?
  - 2008: How long will someone participate in a lottery?
  - 2009: ?
- Clustering approach
  - Behaviour of bank customers
3 Data Mining
- Data Mining is the
  - systematic and automated discovery and extraction
  - of previously unknown knowledge
  - out of huge amounts of data.
- "KDD: Knowledge Discovery in Databases" is used as a synonym.
- The notion is misleading: gold mining extracts gold, whereas data mining extracts knowledge, not data.
4 Data Mining Applications
- classification
- clustering
- association
- prediction
- text mining
- web mining
5 Data Mining Process
CRISP-DM model
7 Classification using NN
- prerequisite
  - set of training patterns (many patterns)
- approach
  - code the values
  - divide the set of training patterns into
    - training set
    - test set
  - build a network
  - train the network using the training set
  - check the network quality using the test set
(Figure labels: real data, training patterns, coded patterns, training set, test set)
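
The approach above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the lecture: the attribute names, the one-hot coding of nominal values, the min-max scaling and the 50/50 split are all assumptions made for the example.

```python
import random

# Hypothetical raw patterns: (age, payment type, class label).
raw_patterns = [
    (23, "cash", "loyal"),
    (45, "card", "canceller"),
    (31, "card", "loyal"),
    (58, "cash", "canceller"),
    (37, "transfer", "loyal"),
    (64, "transfer", "loyal"),
]

def code_pattern(pattern, min_age, max_age, payment_types):
    """Code one raw pattern as (input values, target value), all in [0, 1]."""
    age, payment, label = pattern
    scaled_age = (age - min_age) / (max_age - min_age)      # min-max scaling
    one_hot = [1.0 if payment == p else 0.0 for p in payment_types]
    target = 1.0 if label == "canceller" else 0.0           # class attribute
    return [scaled_age] + one_hot, target

def split_patterns(patterns, train_fraction, seed=0):
    """Shuffle the coded patterns and divide them into training and test set."""
    shuffled = patterns[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

ages = [p[0] for p in raw_patterns]
payment_types = sorted({p[1] for p in raw_patterns})
coded = [code_pattern(p, min(ages), max(ages), payment_types) for p in raw_patterns]
training_set, test_set = split_patterns(coded, train_fraction=0.5)
print(len(training_set), "training patterns,", len(test_set), "test patterns")
```

The test set is kept apart from training so that the network quality is checked on patterns the network has not seen.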
8 Development of an NN Application
9 Build an Artificial Neural Network
- Number of input neurons?
  - depends on the number of attributes
  - depends on the coding
- Number of output neurons?
  - depends on the coding of the class attribute
- Number of hidden neurons?
  - experiments necessary
  - generally not more than the number of input neurons
  - a quarter to a half of the number of input neurons may work
  - see capacity of a neural network
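
As a rough sketch, these rules of thumb can be turned into a small helper. It assumes one input neuron per numeric attribute and a one-hot coding for nominal attributes; other codings give other counts, and the function names and the example data set are hypothetical.

```python
def count_input_neurons(numeric_attributes, nominal_value_counts):
    """One neuron per numeric attribute plus one per value of each nominal
    attribute (assumes one-hot coding of nominal attributes)."""
    return numeric_attributes + sum(nominal_value_counts)

def hidden_neuron_candidates(input_neurons):
    """Candidate hidden-layer sizes between a quarter and a half of the inputs."""
    low = max(1, input_neurons // 4)
    high = max(low, input_neurons // 2)
    return list(range(low, high + 1))

# Hypothetical data set: 12 numeric attributes, two nominal ones with 3 and 5 values.
inputs = count_input_neurons(12, [3, 5])
candidates = hidden_neuron_candidates(inputs)
print(inputs, candidates)
```

The candidates are only starting points for experiments; which size actually works has to be found by training and testing.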
10 Experiments using JavaNNS
- Build a network
- Load the training patterns
- Open the Error Graph
- Open the Control Panel
- Initialize the network
- Try different learning parameters: 0.1, 0.2, 0.5, 0.8
- Start learning
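
The same kind of experiment can be imitated outside JavaNNS. The following is a minimal sketch in plain Python, not JavaNNS itself: it assumes a network with one hidden layer, online backpropagation and a toy pattern set (logical OR) in place of a loaded pattern file, and it compares the error for the learning parameters listed above.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(patterns, hidden, learning_rate, epochs, seed=1):
    """Train a one-hidden-layer network with online backpropagation.
    Returns the sum of squared errors (SSE) accumulated in every epoch."""
    rng = random.Random(seed)
    n_in = len(patterns[0][0])
    # One extra weight per neuron acts as bias (constant input 1.0).
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    errors = []
    for _ in range(epochs):
        sse = 0.0
        for inputs, target in patterns:
            # Forward pass.
            x = inputs + [1.0]
            h = [sigmoid(sum(w * v for w, v in zip(ws, x))) for ws in w_hidden]
            hb = h + [1.0]
            out = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
            sse += (target - out) ** 2
            # Backward pass: deltas for the output and the hidden neurons.
            delta_out = (target - out) * out * (1.0 - out)
            delta_h = [delta_out * w_out[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            # Weight updates, scaled by the learning parameter.
            for j in range(hidden + 1):
                w_out[j] += learning_rate * delta_out * hb[j]
            for j in range(hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] += learning_rate * delta_h[j] * x[i]
        errors.append(sse)
    return errors

# Toy training patterns (logical OR), standing in for a loaded pattern file.
patterns = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

final_errors = {}
for rate in (0.1, 0.2, 0.5, 0.8):
    curve = train(patterns, hidden=2, learning_rate=rate, epochs=2000)
    final_errors[rate] = (curve[0], curve[-1])
    print(f"learning rate {rate}: SSE {curve[0]:.3f} -> {curve[-1]:.3f}")
```

All runs start from the same initial weights (same seed), so differences in the final error come from the learning parameter alone.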
11 Getting Results
- evaluate the error
- Finally
  - make the test pattern set the current one
  - Save Data
    - include output files
    - save as a .res file
  - evaluate the .res file
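
Evaluating the results could look like the sketch below. It assumes that pairs of (teaching output, network output) have already been read from the result file; reading the file itself is not shown, and the 0.5 threshold and the sample values are assumptions.

```python
def evaluate(results, threshold=0.5):
    """Count hits and misses for pairs of (teaching output, network output)."""
    counts = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for target, output in results:
        predicted = output >= threshold
        actual = target >= 0.5
        if predicted and actual:
            counts["true_pos"] += 1
        elif predicted and not actual:
            counts["false_pos"] += 1
        elif not predicted and actual:
            counts["false_neg"] += 1
        else:
            counts["true_neg"] += 1
    correct = counts["true_pos"] + counts["true_neg"]
    counts["accuracy"] = correct / len(results)
    return counts

# Hypothetical values, not taken from a real result file.
results = [(1.0, 0.91), (0.0, 0.12), (1.0, 0.40), (0.0, 0.65), (0.0, 0.05)]
summary = evaluate(results)
print(summary)
```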
12 Experiments
- How can we improve the results?
  - Data pre-processing?
  - Architecture of the ANN?
  - Learning parameters?
  - Evaluation of the results (post-processing)?
- Record your work!
14 Data Mining Cup (www.dataminingcup.de)
- annual competition for students
- runs from April to May/June
- real-world problem
- problem
  - set of training data
  - set of data for classification
  - classification to be developed
- supported by many companies (data/software)
- 200 to 300 participants
- workshop (user day)
15 DMC 2004: A Mailing Action
- mailing action of a company
  - special offer
  - estimated annual income per customer
- given
  - 10,000 sets of customer data containing 1,000 cancellers (training)
- problem
  - test set contains 10,000 customer data sets
  - Who will cancel?
  - Whom to send an offer?
16 Mailing Action: Aim?
- no mailing action
  - 9,000 x 72.00 = 648,000
- everybody gets an offer
  - 1,000 x 43.80 + 9,000 x 66.30 = 640,500
- maximum (100% correct classification)
  - 1,000 x 43.80 + 9,000 x 72.00 = 691,800
17 Goal Function: Lift
- basis: no mailing action, 9,000 x 72.00
- goal: extra income
- lift_M = 43.80 x c_M + 66.30 x nk_M - 72.00 x nk_M
  - c_M: cancellers in the mailing group M, nk_M: non-cancellers in M
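
The goal function can be checked against the three scenarios of the mailing action with a short sketch. It reads c_M as the number of cancellers and nk_M as the number of non-cancellers in the mailing group M, which is an interpretation of the slide's symbols; the constant and function names are made up for the example.

```python
INCOME_CANCELLER_WITH_OFFER = 43.80   # canceller who receives the offer
INCOME_LOYAL_WITH_OFFER = 66.30       # non-canceller who receives the offer
INCOME_LOYAL_NO_OFFER = 72.00         # non-canceller without an offer

def lift(cancellers_in_m, non_cancellers_in_m):
    """Extra income of mailing group M compared with no mailing action."""
    return (INCOME_CANCELLER_WITH_OFFER * cancellers_in_m
            + INCOME_LOYAL_WITH_OFFER * non_cancellers_in_m
            - INCOME_LOYAL_NO_OFFER * non_cancellers_in_m)

baseline = 9000 * INCOME_LOYAL_NO_OFFER     # no mailing action
everybody = baseline + lift(1000, 9000)     # everybody gets an offer
perfect = baseline + lift(1000, 0)          # only the 1,000 cancellers get an offer

print(round(baseline, 2), round(everybody, 2), round(perfect, 2))
```

Sending the offer to everybody gives a negative lift, because the loss on the 9,000 loyal customers outweighs the gain on the 1,000 cancellers.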
18 Data
(Figure: data table with 32 input data columns, "important results" and missing values)
19 Feed Forward Network: What to do?
- train the net with the training set (10,000)
- test the net using the test set (another 10,000)
- classify all 10,000 customers as canceller or loyal
- evaluate the additional income
20 Results
data mining cup 2002
- gain
  - additional income from the mailing action if the target group was chosen according to the analysis
21 DMC 2007: Rebate System
- Check-out couponing allows individual coupons to be generated at the check-out.
- The coupon is printed at the end of the sales slip, depending on the current customer.
- Questions
  - How can the retailer identify whether a customer is a potential couponing customer?
  - Which coupons will the customer respond to?
22 Couponing
- Print
  - coupon A
  - coupon B
  - no coupon
- 50,000 customer cards for training
- Classify another 50,000 customers!
- Cost function
  - coupon not redeemed (false assignment to A or B): -1
  - coupon A redeemed (correct assignment to A): +3
  - coupon B redeemed (correct assignment to B): +6
- Maximize the value!
23 Data Understanding
- What is the meaning of the attributes?
- Type and range of values?
24 20-20-2 Network
Profit = 3 x AA + 6 x BB - (NA + NB + BA + AB)
- results
  - winner 2007: 7,890
  - my version: 6,714
  - our students: 6,468 (73/230)
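
The profit can be computed from a confusion matrix, as in the following sketch. It assumes that a two-letter symbol such as BA counts customers of actual class B who were assigned coupon A (N: no coupon), and that a redeemed A coupon scores 3, a redeemed B coupon 6 and every coupon that is not redeemed costs 1. All counts below are invented for illustration.

```python
def coupon_profit(confusion):
    """confusion[(actual, assigned)] -> number of customers; classes 'N', 'A', 'B'."""
    redeemed_a = confusion.get(("A", "A"), 0)
    redeemed_b = confusion.get(("B", "B"), 0)
    # Coupons handed out to customers who do not redeem them.
    not_redeemed = sum(confusion.get(key, 0)
                       for key in (("N", "A"), ("N", "B"), ("B", "A"), ("A", "B")))
    return 3 * redeemed_a + 6 * redeemed_b - not_redeemed

# Hypothetical counts for illustration only.
confusion = {
    ("A", "A"): 400, ("B", "B"): 250,
    ("N", "A"): 300, ("N", "B"): 120, ("B", "A"): 40, ("A", "B"): 30,
    ("N", "N"): 5000, ("A", "N"): 200, ("B", "N"): 100,
}
profit = coupon_profit(confusion)
print(profit)
```

Customers who get no coupon do not enter the profit at all, so a missed coupon customer costs nothing directly but also earns nothing.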
25 DMC 2008: Participation in a Lottery
- Predicting, at the beginning of the lottery, how long participants will participate
  - 0: The first ticket has not been paid for
  - 1: Only the ticket for the first class has been paid for
  - 2: Only the first two classes were played
  - 3: The lottery was played until the end but no ticket purchased for the following lottery
  - 4: At least the first ticket for the following lottery purchased
(Figure: cost matrix)
26 Data
- 113,476 patterns!
- 69 attributes
  - new customer (yes/no)
  - age
  - bank
  - car
27 100-40-20-5 Network
- results
  - 1,030,240 RWTH Aachen (1)
  - 1,024,535 RWTH Aachen (8)
  - 865,565 Bauhaus Univ. Weimar (100)
  - Univ. Wismar: 878,550 and 835,035
  - 1,494,315 (212)
28 DMC 2009?
30 Clustering Transaction Data
- Cooperation
  - Hochschule Wismar
  - HypoVereinsbank
  - Medienhaus Rostock
- Issue
  - What information can be extracted from turnover time series?
- Strategy
  - Clustering time series data
  - Assign customers/accounts to clusters
  - Examine clusters
31 Transaction Data: Time Series
- Corporate clients
- 223 branches
- Cumulated transactions per
  - Month
  - Account
  - Type of transaction
- ... for a total of 6 years
- Original financial data not suitable
  - Order of values is important
  - Time displacements are problematic
32 Fourier versus Original Data
- No displacement
  - Similarity detected on both
    - transaction curve and
    - frequency spectrum
- Data is displaced
  - frequency spectrum shows similarity
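
Why the frequency spectrum tolerates displacement can be seen in a small sketch. It assumes a cyclic displacement, for which the magnitudes of the discrete Fourier transform stay exactly the same; real displaced data only approximates this. The turnover values are made up.

```python
import cmath

def magnitude_spectrum(series):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(series)
    return [abs(sum(value * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, value in enumerate(series)))
            for k in range(n)]

def shift(series, steps):
    """Displace a series cyclically by the given number of time steps."""
    steps %= len(series)
    return series[-steps:] + series[:-steps]

# Hypothetical monthly turnover of one account, and the same curve two months later.
original = [10.0, 40.0, 15.0, 5.0, 10.0, 45.0, 12.0, 6.0, 9.0, 42.0, 14.0, 4.0]
displaced = shift(original, 2)

pointwise_distance = sum(abs(a - b) for a, b in zip(original, displaced))
spectrum_distance = sum(abs(a - b) for a, b in zip(magnitude_spectrum(original),
                                                   magnitude_spectrum(displaced)))
print(pointwise_distance, spectrum_distance)
```

Compared value by value, the two curves look very different, while their magnitude spectra agree up to rounding error.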
33 Using a classification model
34 Clustering Prediction Results
- 140,000 records
  - 1 record = 1 account
- 6x5 SOM: max. 30 clusters
- average changes of cluster assignments: ca. 19

Variability per Business Sector
22.3   Taxi                  239/1070
22.3   Ship Broker Offices   64/471
20.9   Churches              228/1091
20.2   Trucking              1010/5008
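
How a 6x5 self-organizing map (SOM) assigns records to at most 30 clusters can be sketched as follows. This is a minimal plain-Python SOM, not the tool used in the project: the random feature vectors, the linear decay of learning rate and neighbourhood radius, and all parameter values are assumptions.

```python
import math
import random

ROWS, COLS = 6, 5   # 6x5 map, i.e. at most 30 clusters

def best_matching_unit(weights, vector):
    """Index of the map neuron whose weight vector is closest to the input."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], vector)))

def train_som(data, epochs=20, seed=0):
    """Train the map; returns one weight vector per map neuron."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(ROWS * COLS)]
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius shrink over time.
        fraction = 1.0 - epoch / epochs
        rate = 0.5 * fraction
        radius = 1.0 + 2.0 * fraction
        for vector in data:
            bmu = best_matching_unit(weights, vector)
            bmu_row, bmu_col = divmod(bmu, COLS)
            for i, weight in enumerate(weights):
                row, col = divmod(i, COLS)
                grid_dist_sq = (row - bmu_row) ** 2 + (col - bmu_col) ** 2
                influence = math.exp(-grid_dist_sq / (2.0 * radius ** 2))
                # Move the neuron towards the input, weighted by map distance.
                for d in range(dim):
                    weight[d] += rate * influence * (vector[d] - weight[d])
    return weights

# Hypothetical records: one feature vector per account, values already in [0, 1].
rng = random.Random(42)
records = [[rng.random() for _ in range(4)] for _ in range(200)]
som = train_som(records)
clusters = [best_matching_unit(som, record) for record in records]
print(len(set(clusters)), "of", ROWS * COLS, "clusters used")
```

Each record ends up in the cluster of its best matching map neuron, so the map size is an upper bound on the number of clusters, not the number actually used.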
35 End