Title: OCR
1?????? ?????? ???????? II 1015-1215 ??????????
??????? ????? 1015-1155 ????????? ?? ??
2 3??????
??????
??????????
?????????
?????????? ????????? ??DB??? ??????? OLAP
?????? (??????? Data Warehouse)
????????? (????????)
??????????
- ????????????
- ???????????
- ????
4Association Rules
5????? X ? Y ??????No ? ??????????Yes
???? Pr(X??Y) ? 5 ??? Pr(YX) ?
32 ????????????? interesting ????
Interesting Rules ??????
?? B ? C ? interesting Pr(BC) ????? Pr(B) ?
Pr(C) ?????
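The surviving fragments above compare Pr(BC) with Pr(B) and Pr(C), and mention Pr(X and Y) together with Pr(Y|X). A minimal sketch of how these quantities could be estimated from transaction data, assuming each probability is the fraction of transactions containing the items. The function names and the toy `transactions` list are illustrative and not taken from the slides.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate Pr(rhs | lhs) = Pr(lhs and rhs) / Pr(lhs)."""
    s_lhs = support(transactions, lhs)
    if s_lhs == 0:
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / s_lhs


def independence_ratio(transactions, lhs, rhs):
    """Ratio Pr(lhs and rhs) / (Pr(lhs) * Pr(rhs)); 1.0 means independent."""
    denom = support(transactions, lhs) * support(transactions, rhs)
    if denom == 0:
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / denom


# Hypothetical toy data: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "D"},
    {"B", "C", "D"},
    {"A", "B"},
]

print(support(transactions, {"B", "C"}))
print(confidence(transactions, {"B"}, {"C"}))
print(independence_ratio(transactions, {"B"}, {"C"}))
```

A ratio well above 1 suggests that B and C co-occur more often than independence would predict, which is one common reading of "interesting".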
6???
- ?????????????? ?? ???????????
- ?????????????? ??????????????
7????????????????(???????)??? ?????????????? ????
???
????A,B,C ? ABC ??????
8ABCD
????????????????(???????)??? ?????????????? ????
??? ??? Pr(AB) lt ?? ? Pr(ABC) lt ?? ???
B ? C ???? Pr(CB)Pr(BC)/ Pr(B) ??????????
ABC
ABD
ACD
BCD
AB
AC
BC
AD
BD
CD
A
B
C
D
f
A Pr(A)??? AB Pr(AB)lt??
9??????????
?????????????
?????????????????????????
ACDE
10??????????
?????????????
?????????????????????????
ACDE
Hash table
11??????????
?????????????
?????????????????????????
ACDE
Hash table
12??????????
?????????????
?????????????????????????
ABDE
Hash table
13????????????
???????????????????
? ????????5???
14????????????
ABCD
ABC
ABD
ACD
BCD
AB
AC
BC
AD
BD
CD
A
B
C
D
f
???1? ????? ?????
15ABCD
ABC
ABD
ACD
BCD
AB
AC
BC
AD
BD
CD
???2 ???
???
A
B
C
D
f
???1? ????? ?????
16ABCD
ABC
ABD
ACD
BCD
???
???3 ???
AB
AC
BC
AD
BD
CD
???2 ???
A
B
C
D
f
17ABCD
???1? ?????? ??
???
ABC
ABD
ACD
BCD
???3 ???
AB
AC
BC
AD
BD
CD
???2 ???
A
B
C
D
f
18???1? ?????? ??
ABCD
? 1 ? ? ? ?
ABC
ABD
ACD
BCD
???3 ???
AB
AC
BC
AD
BD
CD
???2? ????
? ? ?
A
B
C
D
f
???1? ????? ?????? ???
19A priori ??? 20??4???????????????
ABCD
???1? ?????? ??
? 1 ? ? ? ?
ABC
ABD
ACD
BCD
???3? ????
? ? ?
AB
AC
BC
AD
BD
CD
???2? ????
A
B
C
D
f
???1? ????? ?????? ???
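The itemset lattice above (f, A..D, AB..CD, ABC..BCD, ABCD) is scanned level by level, and a superset such as ABC cannot be frequent when a subset such as AB falls below the threshold. A minimal sketch of that level-wise search in the style of Apriori, assuming a minimum-support threshold given as a fraction. The function name, the threshold value and the toy data are illustrative.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(candidate):
        return sum(1 for t in transactions if candidate <= t) / n

    # Level 1: single items.
    items = sorted({i for t in transactions for i in t})
    level = {}
    for item in items:
        c = frozenset([item])
        s = sup(c)
        if s >= min_support:
            level[c] = s

    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        prev = list(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        next_level = {}
        for c in candidates:
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in level for s in combinations(c, k - 1)):
                s = sup(c)
                if s >= min_support:
                    next_level[c] = s
        level = next_level
    return frequent


# Hypothetical toy data over the items A, B, C, D used in the lattice.
data = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "D"}]
result = apriori(data, min_support=0.5)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(itemset)), result[itemset])
```

The prune step is what keeps the number of database scans per candidate low: a candidate is counted only if all of its immediate subsets survived the previous level.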
20?????R ? ????????Yes
????
Pr(?????R )?10??????
???????????
21?????R ? ????????Yes
????
Pr(?????R )?10??????
???????????
???80??? Pr(?????R )??
22?????R ? ????????Yes ?? Pr(?????R) ???
?? ??????????? R
???? X ? ( Pr(?????X) , Pr(?????X,????????
Yes)
23?????R ? ????????Yes ?? Pr(?????R) ???
?? ??????????? R
???? X ? ( Pr(?????X) , Pr(?????X,????????
Yes)
O(M log M), where M is the number of records
25Clockwise Search
27Counter Clockwise Search
Clockwise, Counter Clockwise ?????????1????? ??
28(??,????)?S ? ????????Yes
32???
????
X????
?????
p( (??,????)?S ) ????S?????? ???????
????????????????????????S ????????
????????????????????????S
33(??,????)? S ? ????????Yes
???? M, ????? n
??????? ?????????????? $O(n^{1.5})$ ?????
????
???X???????????? ?????????????? X???$O(nM)$?????$O(n^{1.5} M)$ ?????? n ? log M
?????????????? P NP ?????????
??
???????
????????
34(??,????)? S ? ????????Yes
S
p( ??,????)?S, ????????Yes )
p((??,????)?S)
36Hand Probing ???????
1?? hand probing ???? X???? $O(n)$ ????? $O(n^{1.5})$
hand probing ????O(log M)
37y ?x a
?? a ????
- ????????????????
- ???????????????
38?????? - ????????????
????????????? ?????????????? ????????????????
10-fold Cross Validation
39Classification
40???
?????? ????????????????
?? ??? ???? ??? GPT GOT
????
41???
?????? ????????????????
?? lt 125
Yes
No
Yes
No
????
Yes
?????????? ???? ??????????? ?? ????????????????
No
42??? ??????????
?????
?????
43??? ??????????
Quinlan??????? ???
?????
?????
n
$\mathrm{Ent}_1 = -(p \log p + q \log q)$
Ent2
n1
n2
p
q
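The labels on this slide (n, n1, n2, p, q, Ent1, Ent2) describe a split of n records into two parts, each with its own entropy. A minimal sketch of the entropy of one part and the size-weighted entropy after a split, assuming q = 1 - p and logarithms to base 2 (the slide does not state the base). The function names and the example counts are illustrative.

```python
import math


def entropy(p):
    """Binary entropy -(p log p + q log q) with q = 1 - p, log base 2."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a pure part carries no uncertainty
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))


def split_entropy(n1, pos1, n2, pos2):
    """Size-weighted entropy of two parts with n1 and n2 records.

    pos1 and pos2 are the numbers of positive (Yes) records in each part.
    """
    n = n1 + n2
    ent1 = entropy(pos1 / n1) if n1 else 0.0
    ent2 = entropy(pos2 / n2) if n2 else 0.0
    return (n1 / n) * ent1 + (n2 / n) * ent2


# Hypothetical counts: 8 records, 4 positive, split into two parts of 4.
before = entropy(4 / 8)
after_good = split_entropy(4, 4, 4, 0)   # perfectly separating split
after_bad = split_entropy(4, 2, 4, 2)    # uninformative split
print(before, after_good, after_bad)
```

A split is preferred when the weighted entropy after the split is as far below the entropy before the split as possible.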
44S
???????????? ???????????? ????????? Hand
Probing ??? ?????????? (????????? ????????????)
S????????
S??????
45Ent(???XYZ??????) ? min(Ent(X),Ent(Y),ENT(Z))
?? Ent(Z)? ???????????? ?????? Branch and Bound
Search
???????O(logM)?Hand Probing
46???????
UC Irvine, Repository of Machine Learning Databases
http://www.ics.uci.edu/~mlearn/MLRepository.html
10-fold Cross Validation
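The evaluations in this deck use 10-fold cross validation. A minimal sketch of the procedure, assuming records are shuffled once and dealt into k folds of nearly equal size, with at least k records available. The `fit` and `predict` callables are hypothetical placeholders for whatever classifier is being evaluated.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(records, labels, fit, predict, k=10, seed=0):
    """Average test error rate over k folds (assumes len(records) >= k)."""
    errors = []
    for train, test in k_fold_indices(len(records), k, seed):
        model = fit([records[i] for i in train], [labels[i] for i in train])
        wrong = sum(1 for i in test if predict(model, records[i]) != labels[i])
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)
```

Each record is used for testing exactly once and for training in the other k - 1 rounds, so the averaged error is computed only on data the model did not see during fitting.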
47??? (Regression Tree)
BPS       GDM       YEN       TB3M  TB30Y  SP500   GOLD
1.443530  0.407460  0.004980  7.02  9.31   210.88  326.00
1.446120  0.408050  0.004950  7.04  9.28   205.96  339.45
49Yes
No
No
Yes
50?
D2
D1
???
µ1
µ2
??????????????
51A
µ
?
D2
D1
???
µ2
µ1
??????? ???
D1?D2
$|D_1| (\mu - \mu_1)^2 + |D_2| (\mu - \mu_2)^2$
??????????
D1?D2
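The quantity on this slide scores a split of the data into D1 and D2 by how far the two part means µ1 and µ2 lie from the overall mean µ, weighted by the part sizes. A minimal sketch that finds the best threshold on a single numeric attribute under that score. Restricting the split to one attribute and a single threshold is a simplifying assumption made here, and the function name and toy data are illustrative.

```python
def best_split(xs, ys):
    """Return (threshold, score) maximizing |D1|(mu-mu1)^2 + |D2|(mu-mu2)^2.

    D1 holds the records with x below the threshold, D2 the rest.
    Returns (None, -1.0) when no split between distinct x values exists.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    mu = total / n
    best = (None, -1.0)
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a threshold between equal x values
        mu1 = left_sum / i
        mu2 = (total - left_sum) / (n - i)
        score = i * (mu - mu1) ** 2 + (n - i) * (mu - mu2) ** 2
        if score > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


# Hypothetical data: the target jumps from 1 to 5 between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [1, 1, 1, 5, 5, 5]
print(best_split(xs, ys))
```

After sorting, the running sum lets every candidate threshold be scored in constant time, so the scan itself is linear in the number of records.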
52S
???????????? ???????????? ????????? Hand
Probing ??? ?????????? Branch and Bound
Search ?????O(log M)
S???? ????? ????
S??????
53???????
http://www.cs.utoronto.ca/~delve/data/datasets.html
10-fold Cross Validation
??????(???????) ?????? ?????? ??? X?? ??? ?? ???
add10            9792  10  0.141   0.123   0.156   0.185
abalone          4177   8  0.521   0.515   0.534   0.539
kin-8fh          8192   8  0.447   0.433   0.459   0.479
kin-8fm          8192   8  0.225   0.197   0.257   0.249
kin-8nh          8192   8  0.649   0.618   0.619   0.655
kin-8nm          8192   8  0.494   0.449   0.478   0.541
pumadyn-kin-8fh  8192   8  0.412   0.402   0.409   0.410
pumadyn-kin-8fh  8192   8  0.0604  0.0595  0.0653  0.0632
pumadyn-kin-8fh  8192   8  0.347   0.337   0.353   0.355
pumadyn-kin-8fh  8192   8  0.0530  0.0496  0.0550  0.0535
55??? ???, ??,
???? (3102?) ????????
?? 102 103 ?
58??? ???, ??, ??????, ????, ???, ...
???? (102107?) ??????, SNP, ...
?? 102 104 ?
59Clustering
60Expression Patterns of Genes in Various Tissues
Brain in embryo
Five brain tissues of adult mouse
61Clustering genes via expression patterns is promising.
- A set of genes is expected to share common roles in cellular processes.
- Genes in the same group would be observed in the same tissue at the same time.
- Their expression patterns would be similar.
- Clustering genes by expression patterns would provide substantial insight into real groups of genes.
62Graphical Representation of Expression Patterns
63Cluster of genes coding ribosomal proteins
64Tightness of a cluster C of points
- diameter: $\max \{ \|x - y\| : x, y \in C \}$
- intra-class variance: $\frac{1}{|C|} \sum_{x \in C} \|x - c(C)\|^2$
- $|C|$: number of points in C
- $c(C)$: centroid (mean) of C, $\frac{1}{|C|} \sum_{x \in C} x$
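The two tightness measures above can be computed directly from their definitions. A minimal sketch, assuming points are tuples of numbers and distances are Euclidean. The function names and the example square are illustrative.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean c(C) of a non-empty list of points."""
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))


def diameter(points):
    """Largest Euclidean distance between any two points of the cluster."""
    return max((math.dist(x, y) for x, y in combinations(points, 2)), default=0.0)


def intra_class_variance(points):
    """Mean squared Euclidean distance from each point to the centroid."""
    c = centroid(points)
    return sum(math.dist(x, c) ** 2 for x in points) / len(points)


# Hypothetical cluster: the four corners of a 2 x 2 square.
square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(centroid(square), diameter(square), intra_class_variance(square))
```

The diameter depends only on the two most distant points, while the intra-class variance reflects how all points spread around the centroid.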
66Diameter Problem
- NP-hard if k is treated as a variable
- Approximation within a factor $\alpha$ of the optimal diameter is NP-hard for $\alpha < 2$.
- An approximation factor of 2 is achieved by the furthest point heuristic in $O(nk)$ time (n: number of points).
Diameter1 Diameter2
Intra-class variance1 >> Intra-class variance2
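The furthest point heuristic mentioned on this slide picks k centers greedily, each time taking the point farthest from the centers chosen so far, and then assigns every point to its nearest center. A minimal sketch in O(nk) distance evaluations for choosing the centers, assuming Euclidean distance and an arbitrary first center (here the first point). The function names and the toy points are illustrative.

```python
import math


def furthest_point_centers(points, k):
    """Greedily choose k centers; each new center is farthest from the rest."""
    centers = [points[0]]  # arbitrary starting center (an assumption)
    # dist[i] = distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[idx])
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers


def assign(points, centers):
    """Group points by their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        clusters[j].append(p)
    return clusters


# Hypothetical toy points forming three well-separated groups.
pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20)]
centers = furthest_point_centers(pts, 3)
print(centers)
print(assign(pts, centers))
```

Keeping the nearest-center distance for every point is what holds the cost of each round to a single pass over the data.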
67Intra-class Variance Problem
- $O(n^{(d+2)k+1})$-time algorithm (d: number of dimensions)
- $O(n (1/\epsilon)^d)$-time $\epsilon$-approximate 2-clustering algorithm
Problems of k-clustering
- It is hard to guess an appropriate value for k,
beforehand.
- It is not easy to avoid generating a
false-positive cluster of large intra-class
variance that may contain genes of different
functions.
Our Approach
- Perform hierarchical clustering by $\epsilon$-approximate 2-clustering.
- Stop dividing a cluster if its intra-class
variance is no more than a given threshold.
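The approach above can be sketched as a recursion: split a cluster in two, and stop once its intra-class variance is at most the threshold. The sketch below does not implement the approximate 2-clustering algorithm named on the slide. As a plainly labeled stand-in it uses a simple 2-means (Lloyd) iteration seeded with two far-apart points. All function names, the iteration limit and the toy data are illustrative assumptions.

```python
import math


def centroid(points):
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))


def variance(points):
    """Intra-class variance: mean squared distance to the centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points) / len(points)


def two_clustering(points, iterations=20):
    """Stand-in 2-clustering: 2-means seeded with two far-apart points."""
    a = max(points, key=lambda p: math.dist(p, points[0]))
    b = max(points, key=lambda p: math.dist(p, a))
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
        if not left or not right:
            break
        new_a, new_b = centroid(left), centroid(right)
        if (new_a, new_b) == (a, b):
            break  # assignments have stabilized
        a, b = new_a, new_b
    return left, right


def hierarchical(points, threshold):
    """Divide recursively until each cluster's variance is <= threshold."""
    if len(points) < 2 or variance(points) <= threshold:
        return [points]
    left, right = two_clustering(points)
    if not left or not right:
        return [points]  # could not divide further
    return hierarchical(left, threshold) + hierarchical(right, threshold)


# Hypothetical toy data: two tight groups far apart.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(hierarchical(data, threshold=1.0))
```

Because the stopping rule is a variance threshold, the number of clusters is an outcome of the procedure and does not have to be guessed in advance.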
68Cluster of genes coding ribosomal proteins: intra-class variance 209
Clusters of genes coding myelin: intra-class variance 128
69?????
70- ??????????
- Apriori
- Dynamic Itemset Counting
- ????
- ????
- Correlation
- ???????
- 2?????
- ????? ?????
- ?????
- NP?? NP??
- ?????
- ????
71- ???? / ??? / ???
- C4.5
- CART
- ??????
- NP-hardness / Parallel Search
- Optimized Ranges / Regions
- Boosting / Bagging / Weighted Majority
- ???????
- NP??
- ?????
- ???
72- ??????
- ???????
- ???????? Google / Clever
- ?????????
- Clustering / Nearest Neighborhood
- k-means / k-clustering
- ???????
- ????????
- ?????????