Basic Data Mining Techniques. Prepared by: Dr. Tsung-Nan Tsai

Title: Data Mining: A Tutorial-Based Primer. Author: Richard J. Roiger. Keywords: Addison-Wesley Publishing. Last modified by: Stoney Tsai.

1
Basic Data Mining Techniques
Prepared by Dr. Tsung-Nan Tsai
  • Chapter 3

2
Contents
  • ?????????
  • ????????????
  • ??????????????
  • K-means algorithm
  • ?????????????????????????
  • ???????????????

3
??????
????
???????
????????
?????????
??
????
????
????
??? (?????)
??? (?????)
CN2
ITRULE
CART
CART
C4.5/C5 ID3
M5
Cubist
4
3.1 Decision Trees
5
??????
  • ????????????????????????????
  • ????????????????????????Yes ? Stop?
  • ??????????????????????????????????Yes ? Stop?

6
??????
  • ? T ????????
  • ?T????????????????.
  • ?????????????????
  • ???????????,?????????????????????????
  • ?????????????????????
  • ????step 3??????
  • ????????????????,???????????????,????????????????
  • ?????????????????????????,?? T ????????????? step
    2.

7
??????
  • ??????????????????,?????????????
  • C4.5???
  • With n equally likely outcomes, C4.5 uses $-\log_2(1/n)$ bits. For example,
    with n = 4, $-\log_2(1/4) = 2$.
  • ????? (00, 01, 10, 11) ?????????????????????A,?
    $-\log_2(1/2) = 1$ ? 1?????????????
  • Lift(???) 1 ? C4.5 ???
  • C4.5 ???????????????????

8
C4.5
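  • Proposed by Quinlan (1979), building on Shannon's (1949) information theory.
  • Given k possible outcomes with probabilities $p_i$, the information I (entropy) is
    $I = -(p_1 \log_2 p_1 + p_2 \log_2 p_2 + \cdots + p_k \log_2 p_k)$
  • Example 1: k = 4, with $p_1 = 0.25$, $p_2 = 0.25$, $p_3 = 0.25$, $p_4 = 0.25$:
    $I = -(0.25 \times \log_2 0.25 \times 4) = 2$
  • Example 2: k = 2, with $p_1 = 0$, $p_2 = 0.5$, $p_3 = 0$, $p_4 = 0.5$:
    $I = -(0.5 \times \log_2 0.5 \times 2) = 1$
  • Example 3: k = 1, with $p_1 = 1$, $p_2 = 0$, $p_3 = 0$, $p_4 = 0$:
    $I = -(1 \times \log_2 1) = 0$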
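The three examples can be checked with a few lines of Python. This is a minimal sketch: the function name `entropy` is chosen here for illustration, and zero-probability outcomes are skipped on the usual convention that 0 × log2(0) counts as 0.

```python
import math

def entropy(probabilities):
    """Return -sum(p * log2(p)) in bits, skipping zero-probability outcomes."""
    total = 0.0
    for p in probabilities:
        if p > 0:  # the term 0 * log2(0) is treated as 0
            total -= p * math.log2(p)
    return total

print(entropy([0.25, 0.25, 0.25, 0.25]))  # Example 1
print(entropy([0.0, 0.5, 0.0, 0.5]))      # Example 2
print(entropy([1.0, 0.0, 0.0, 0.0]))      # Example 3
```

The more evenly the probabilities are spread, the more bits are needed. A single certain outcome carries no information.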

9
?????????
(Categorical)
(Categorical)
(Continuous)
10
??
11
Accuracy: 11/15 = 0.73; score: 0.73/4 = 0.183
????
?? ????
12
Accuracy: 9/15 = 0.6; score: 0.6/2 = 0.3
See page 72.
13
Accuracy: 12/15 = 0.8; score: 0.8/2 = 0.4
????
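The arithmetic on slides 11 to 13 can be reproduced as below. This sketch assumes that each figure is a split's accuracy on the 15 training instances divided by the number of branches the candidate attribute creates (4, 2 and 2). The slides show only the divisions, so that reading and the name `attribute_score` are assumptions.

```python
def attribute_score(correct, total, branches):
    """Accuracy of a one-attribute split divided by its number of branches."""
    accuracy = correct / total
    return accuracy / branches

# (correctly classified instances, branches) for the three candidate attributes
for correct, branches in [(11, 4), (9, 2), (12, 2)]:
    print(round(correct / 15, 2), round(attribute_score(correct, 15, branches), 3))
```

Under this reading the third attribute scores highest (0.4), so it would be the preferred top-level split.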
14
  • ????????

15
C4.5
16
C4.5
17
ID3
18
ID3
19
?????
20
?????
21
?????
22
A Rule for the Tree in Figure 3.4
23
?????????
  • CART?????????. ???????????
  • CART?C4.5??
  • CART ??????(???????)
  • CART????????????????????
  • CHAID ???SAS?SPSS??????????????

24
?????
  • ?????
  • ?????????????
  • ??????????
  • ?????????????
  • ?????????

25
?????
  • ?????????.
  • ?????????.
  • ??????.
  • ?????????????????????.

26
3.2 Association Rules
27
?????
  • ?????(affinity analysis) ?????????????????
  • ???????????????????
  • ?????????????????
  • ????????????
  • Example
  • ????????
  • ????????
  • ???/? ??? ????

28
???????
  • ??If A then B, ?????????,?A???B?????????
  • Example: if 10,000 transactions contain A and 5,000 of them also contain B,
    the confidence of the rule is 5000/10000 = 50%.
  • Support: the minimum percentage of instances in the database that contain
    all items listed in a given association rule.
  • ????????????????????????????????
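Both measures are simple ratios over the transaction list. The sketch below uses the standard definitions. The basket data and the function names are hypothetical. A rule is kept only if its support reaches the chosen minimum threshold.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket data
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"milk", "eggs"},
]
print(support(baskets, {"milk", "bread"}))       # 2 of 5 baskets
print(confidence(baskets, {"milk"}, {"bread"}))  # 2 of the 4 milk baskets
```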

29
  • ????????

30
??????
  • Apriori??? ???????(item set) ????????-???????????
    ???????????,?????
  • Apriori ????
  • ??????
  • ????????????????
  • ???3.3????,(??????????)

31
??????
  • ?1????4?????????????(gt3) ?

7Y/3N
32
??????
33
??????
34
????????
See pages 83 and 84.
??7??????? ??5???????????
??????? ????? ? ????? ? ????? (????,
????) ????? ? ????? ? ???? (?)
35
????????
  • ?3?? ??????????????????????

1
????? ? ????? ? ?????? (????? 3)
36
????????
37
????
  • We are interested in association rules that show
    a lift in product sales, where the lift is the
    result of the product's association with one or
    more other products.
  • We are also interested in association rules that
    show a lower than expected confidence for a
    particular association.
  • ???????????????????????(??????????????)
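One common way to quantify "higher or lower than expected" is the lift ratio: the rule's confidence divided by the consequent's overall frequency. The slides use "lift" informally, so treating it as this ratio is an assumption. The counts below are hypothetical.

```python
def lift(n_total, n_a, n_b, n_both):
    """Confidence of A -> B divided by the baseline frequency of B."""
    confidence = n_both / n_a   # share of A-transactions that also contain B
    expected = n_b / n_total    # share of all transactions that contain B
    return confidence / expected

print(lift(1000, 200, 250, 100))  # above 1: B sells better alongside A
print(lift(1000, 200, 250, 25))   # below 1: lower than expected confidence
```

A value near 1 means A tells us little about B. Values well above or below 1 are the rules worth inspecting.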

38
3.3 The K-Means Algorithm
  1. Choose a value for K, the total number of
    clusters.
  2. Randomly choose K points as the initial cluster
    centers.
  3. Assign the remaining instances to their closest
    cluster center.
  4. Calculate a new cluster center for each cluster.
  5. Repeat steps 3 and 4 until the cluster centers do
    not change.
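The five steps translate almost line for line into code. The following is a minimal sketch, not a production implementation: the function name, the `initial_centers` parameter (added so that a run can be made repeatable) and the sample points are all assumptions made for illustration.

```python
import math
import random

def k_means(points, k, initial_centers=None, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Step 2: pick K starting centers (randomly unless the caller supplies them)
    centers = list(initial_centers) if initial_centers else random.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign every point to its closest center
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:  # Step 5: nothing moved, so stop
            break
        assignment = new_assignment
        # Step 4: recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment

# Hypothetical data: two well-separated groups
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, labels = k_means(points, 2, initial_centers=[(1.0, 1.0), (8.0, 8.0)])
print(centers, labels)
```

Stopping when the assignments no longer change is equivalent to stopping when the centers no longer change, because the centers are computed from the assignments alone.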

39
K-means ??
  • ????????????
  • ??x?y,???x-y????,?????3.6???

40
K-means ?? (?3.6)
C1
C2
41
K-means ??
  • Step 1: set K = 2 and pick two instances as the initial cluster centers.
  • Here instances 1 and 3 are chosen as the initial centers.

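Distance from C2 to instance 1: $\sqrt{(2-1)^2 + (1.5-1.5)^2} = 1$
Distance from C1 to instance 1: $\sqrt{(1-1)^2 + (1.5-1.5)^2} = 0$
Distance from C2 to instance 2: $\sqrt{(2-1)^2 + (1.5-4.5)^2} = 3.16$
Distance from C1 to instance 2: $\sqrt{(1-1)^2 + (1.5-4.5)^2} = 3$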
42
K-means ??
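  • C1 contains instances 1 and 2. C2 contains instances 3, 4, 5, and 6.
  • Compute the new cluster centers:
  • C1
  • x = (1.0 + 1.0)/2 = 1.0
  • y = (1.5 + 4.5)/2 = 3.0
  • C2
  • x = (2.0 + 2.0 + 3.0 + 5.0)/4 = 3.0
  • y = (1.5 + 3.5 + 2.5 + 6.0)/4 = 3.375
  • New cluster centers: C1 = (1.0, 3.0) and C2 = (3.0, 3.375)
  • See page 89.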
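The first iteration of this example can be reproduced as a check. The six coordinates below are not listed in the transcript. They are inferred from the distance and averaging arithmetic on these slides, with instances 1 and 3 taken as the starting centers, so treat them as an assumption.

```python
import math

# Coordinates inferred from the arithmetic on the slides (instances 1 to 6)
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
centers = [points[0], points[2]]  # instances 1 and 3 as initial C1 and C2

# Assign each instance to the nearer of the two centers
clusters = [[], []]
for p in points:
    distances = [math.dist(p, c) for c in centers]
    clusters[distances.index(min(distances))].append(p)

# Average the members of each cluster to get the new centers
new_centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) for cl in clusters]
print(round(math.dist(points[1], centers[1]), 2))  # C2 to instance 2: 3.16
print(new_centers)  # [(1.0, 3.0), (3.0, 3.375)]
```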

43
K-means ??
44
K-means ??
45
K-means ?? - Tanagra
46
K-means?????
  • ???????
  • ???????.
  • ?????????.
  • ???????????????????.
  • ???????????????.
  • ??????.

47
3.4 Genetic Algorithm
  • ???????????????????,??John Holland
    (1986)????,??????????????
  • ?????????????????????????????????
  • ??????????????????????????

48
3.4 Genetic Algorithm (GA)
  • ?????
  • ?n??????????P,?????????????????????
  • ??????????
  • ?????????????????,??????????????????,??????P??
  • ??????m???(mn) ,??????????(n-m)????,???????????

49
3.4 Genetic Algorithm (GA)
50
GA???
51
??
52
3.4 Genetic Algorithm
  • ???????
  • Selection
  • Crossover
  • Mutation
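A small sketch shows how the three operators fit together. Everything specific here is an assumption made for illustration: the bit-string encoding, the size-2 tournament used for selection, single-point crossover, the parameter values, and the toy "count the 1 bits" fitness function.

```python
import random

def genetic_search(fitness, length=12, pop_size=20, generations=40,
                   mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def select():
        # Selection: the fitter of two randomly drawn individuals (a size-2 tournament)
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        best = max(population, key=fitness)
        next_gen = [best[:]]  # carry the current best forward unchanged
        while len(next_gen) < pop_size:
            mother, father = select(), select()
            # Crossover: splice the two parents at a random cut point
            cut = rng.randint(1, length - 1)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)

best = genetic_search(sum)  # toy fitness: the number of 1 bits
print(best, sum(best))
```

Carrying the best individual forward unchanged means the best fitness never drops between generations. It still does not guarantee that the global optimum is found.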

53
??
  • Roulette wheel selection
  • Tournament selection
  • Steady-state selection
  • Ranking and scaling
  • Sharing

54
??
  • GAs?????????????????????????,????????????????????
    ???????,????????????
  • Arithmetical crossover
  • Heuristic crossover

55
??
  • ???????????????????????????
  • ????????,?????????????
  • ???????????????????,??????????

56
???????????
57
???????????
P
????
58
??
  • ????????????????,??????????????????????????

???
59
??
  • ????????????????,??????????????????????????

60
??
  • ????????????????,??????????????????????????

61
??
  • ????????????????,??????????????????????????

62
Genetic Algorithms and Unsupervised Clustering
63
?????????
64
?????????
65
GA ????
  • Global optimization is not a guarantee.
  • The fitness function determines the complexity of
    the algorithm.
  • Genetic algorithms can explain their results,
    provided the fitness function is understandable.
  • Transforming the data to a form suitable for
    genetic learning can be a challenge.

66
3.5 Choosing a Data Mining Technique
  • Is learning supervised or unsupervised?
  • Is explanation required?
  • What is the interaction between input and output
    attributes?
  • What are the data types of the input and output
    attributes?

67
????????-????
  • Do We Know the Distribution of the Data?
  • Do We Know Which Attributes Best Define the Data?
  • Does the Data Contain Missing Values?
  • Is Time an Issue?
  • Which Technique Is Most Likely to Give a Best
    Test Set Accuracy?

68
Q1
???? Yes No
Yes 8 1
No 1 5
69
Q2
  • IF age > 43 THEN life insurance promotion = no
  • IF age <= 43 & sex = female THEN life insurance promotion = yes
  • IF age <= 43 & sex = male & credit card insurance = no
    THEN life insurance promotion = no
  • IF age <= 43 & sex = male & credit card insurance = yes
    THEN life insurance promotion = yes
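Read as a decision list, the four rules can be written as one small function. This sketch assumes that the stripped operators stand for > and <= (with = and & between conditions). The function and argument names are illustrative.

```python
def life_insurance_promotion(age, sex, credit_card_insurance):
    """Apply the four rules in order and return 'yes' or 'no'."""
    if age > 43:
        return "no"
    if sex == "female":
        return "yes"
    # Remaining case: age <= 43 and sex == "male"
    return "yes" if credit_card_insurance == "yes" else "no"

print(life_insurance_promotion(50, "female", "yes"))  # no
print(life_insurance_promotion(30, "female", "no"))   # yes
print(life_insurance_promotion(30, "male", "no"))     # no
print(life_insurance_promotion(30, "male", "yes"))    # yes
```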