Title: Classifiers in Atlas
1Classifiers in Atlas
2Data Mining
- Classifiers
- Bayesian classifiers
- Decision trees
- The Apriori Algorithm
- DBSCAN Clustering
- http://wis.cs.ucla.edu/atlas/examples.html
3The Classification Task
- Input: a training set of tuples, each labelled with one class label
- Output: a model (classifier) which assigns a class label to each tuple based on the other attributes
- The model can be used to predict the class of new tuples, for which the class label is missing or unknown
- Some natural applications:
- credit approval
- medical diagnosis
- treatment effectiveness analysis
4Train and Test
- The tuples (observations, samples) are partitioned into a training set and a test set.
- Classification is performed in two steps:
- Training: build the model from the training set
- Testing: evaluate the model on the test set (for accuracy, etc.)
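The two steps can be sketched in a few lines of Python. This is only an illustration: the helper names `train_test_split` and `accuracy`, the 30% test fraction, and the toy parity data are assumptions, not part of the slides.

```python
import random


def train_test_split(tuples, test_fraction=0.3, seed=0):
    """Randomly partition labelled tuples into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = list(tuples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(classifier, test_set):
    """Fraction of test tuples whose predicted label equals the true label."""
    correct = sum(1 for attrs, label in test_set if classifier(attrs) == label)
    return correct / len(test_set)


# Toy labelled tuples (hypothetical): (attributes, class label).
data = [((i,), "P" if i % 2 == 0 else "N") for i in range(10)]
train, test = train_test_split(data)


# Stand-in for a trained model: a rule that happens to fit the toy data.
def parity_model(attrs):
    return "P" if attrs[0] % 2 == 0 else "N"


test_accuracy = accuracy(parity_model, test)
```

In practice the model would be built from `train` only, and `test` would be held back until the accuracy check.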
5Classical example: play tennis?
Training set from Quinlan's book. Seq could have been used to generate the RID column.
6Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities
- P(C|X) = prob. that the sample tuple X = <x1,…,xk> is of class C.
- E.g. P(class=N | outlook=sunny, windy=true, …)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
7Estimating a-posteriori probabilities
- Bayes theorem:
- P(C|X) = P(X|C)·P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative freq of class C samples
- C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum
8Naïve Bayesian Classification
- Naïve assumption: attribute independence
- P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
- For categorical attributes, P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C
- Computationally this is a count with grouping
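The "count with grouping" can be sketched in Python as follows. The rows are the standard 14-tuple play-tennis training set, assumed here to be the one the slides use; the names `ROWS`, `group_counts` and `cond_prob` are illustrative only.

```python
from collections import Counter
from fractions import Fraction

ATTRS = ("outlook", "temperature", "humidity", "windy")
# Standard play-tennis training set (assumed); last field is the class label.
ROWS = [
    ("sunny", "hot", "high", "false", "n"),
    ("sunny", "hot", "high", "true", "n"),
    ("overcast", "hot", "high", "false", "p"),
    ("rain", "mild", "high", "false", "p"),
    ("rain", "cool", "normal", "false", "p"),
    ("rain", "cool", "normal", "true", "n"),
    ("overcast", "cool", "normal", "true", "p"),
    ("sunny", "mild", "high", "false", "n"),
    ("sunny", "cool", "normal", "false", "p"),
    ("rain", "mild", "normal", "false", "p"),
    ("sunny", "mild", "normal", "true", "p"),
    ("overcast", "mild", "high", "true", "p"),
    ("overcast", "hot", "normal", "false", "p"),
    ("rain", "mild", "high", "true", "n"),
]

# Number of samples per class.
class_counts = Counter(row[-1] for row in ROWS)
# Count with grouping: one counter per (attribute, value, class) group.
group_counts = Counter(
    (attr, row[i], row[-1]) for row in ROWS for i, attr in enumerate(ATTRS)
)


def cond_prob(attr, value, cls):
    """Relative-frequency estimate of P(attr=value | cls)."""
    return Fraction(group_counts[(attr, value, cls)], class_counts[cls])
```

For instance, `cond_prob("outlook", "sunny", "p")` evaluates to the fraction 2/9.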
9Play-tennis example: estimating P(xi|C)
outlook
P(sunny|p) = 2/9       P(sunny|n) = 3/5
P(overcast|p) = 4/9    P(overcast|n) = 0
P(rain|p) = 3/9        P(rain|n) = 2/5
temperature
P(hot|p) = 2/9         P(hot|n) = 2/5
P(mild|p) = 4/9        P(mild|n) = 2/5
P(cool|p) = 3/9        P(cool|n) = 1/5
humidity
P(high|p) = 3/9        P(high|n) = 4/5
P(normal|p) = 6/9      P(normal|n) = 1/5
windy
P(true|p) = 3/9        P(true|n) = 3/5
P(false|p) = 6/9       P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
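As a worked example, the estimates above can be used to classify an unseen sample. The sample X = <rain, hot, high, false> is chosen here purely for illustration, and the variable names are assumptions; the probabilities are copied from the table.

```python
from fractions import Fraction as F

# Priors and the conditional estimates needed for the sample, from the table.
prior = {"p": F(9, 14), "n": F(5, 14)}
likelihood = {
    "p": {"rain": F(3, 9), "hot": F(2, 9), "high": F(3, 9), "false": F(6, 9)},
    "n": {"rain": F(2, 5), "hot": F(2, 5), "high": F(4, 5), "false": F(2, 5)},
}
# Illustrative unseen sample: outlook, temperature, humidity, windy.
sample = ["rain", "hot", "high", "false"]


def score(cls):
    """P(X|C) * P(C) under the naive independence assumption."""
    result = prior[cls]
    for value in sample:
        result *= likelihood[cls][value]
    return result


scores = {cls: score(cls) for cls in prior}
# P(X) is the same for both classes, so comparing the scores is enough.
predicted = max(scores, key=scores.get)
```

Here the score for class p is 2/189 (about 0.0106) and for class n is 16/875 (about 0.0183), so the sample is assigned class n.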
10Bayesian Classifiers
- The training can be done by SQL count and grouping sets (but that might require many passes through the data).
- If the results are stored in a table called SUMMARY, then the testing is a simple SQL query on SUMMARY
- First operation is to verticalize the table
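A rough sketch of this pipeline, using Python's built-in `sqlite3` module in place of ATLaS. The table and column names (`PlayTennis`, `Vertical`, `Col`, `Val`, `Label`, `Cnt`) and the `classify` helper are assumptions made for the example; only the name SUMMARY comes from the slides. The data is the standard play-tennis set, assumed to match the slides.

```python
import sqlite3

# Standard play-tennis training set (assumed); last field is the class label.
DATA = [
    ("sunny", "hot", "high", "false", "n"), ("sunny", "hot", "high", "true", "n"),
    ("overcast", "hot", "high", "false", "p"), ("rain", "mild", "high", "false", "p"),
    ("rain", "cool", "normal", "false", "p"), ("rain", "cool", "normal", "true", "n"),
    ("overcast", "cool", "normal", "true", "p"), ("sunny", "mild", "high", "false", "n"),
    ("sunny", "cool", "normal", "false", "p"), ("rain", "mild", "normal", "false", "p"),
    ("sunny", "mild", "normal", "true", "p"), ("overcast", "mild", "high", "true", "p"),
    ("overcast", "hot", "normal", "false", "p"), ("rain", "mild", "high", "true", "n"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PlayTennis (RID INTEGER, Outlook TEXT, Temperature TEXT,"
    " Humidity TEXT, Windy TEXT, Label TEXT)"
)
conn.executemany(
    "INSERT INTO PlayTennis VALUES (?, ?, ?, ?, ?, ?)",
    [(rid, *row) for rid, row in enumerate(DATA, start=1)],
)

# Verticalize: one (RID, Col, Val, Label) row per attribute value.
conn.execute("""
    CREATE TABLE Vertical AS
    SELECT RID, 1 AS Col, Outlook AS Val, Label FROM PlayTennis
    UNION ALL SELECT RID, 2, Temperature, Label FROM PlayTennis
    UNION ALL SELECT RID, 3, Humidity, Label FROM PlayTennis
    UNION ALL SELECT RID, 4, Windy, Label FROM PlayTennis
""")

# Training: a single count with grouping over the vertical table.
conn.execute("""
    CREATE TABLE SUMMARY AS
    SELECT Col, Val, Label, COUNT(*) AS Cnt FROM Vertical
    GROUP BY Col, Val, Label
""")


def classify(sample):
    """Pick the class maximizing P(C) * prod P(xi|C), reading counts from SUMMARY."""
    totals = dict(conn.execute("SELECT Label, COUNT(*) FROM PlayTennis GROUP BY Label"))
    n_rows = sum(totals.values())
    scores = {}
    for label, total in totals.items():
        score = total / n_rows
        for col, val in enumerate(sample, start=1):
            row = conn.execute(
                "SELECT Cnt FROM SUMMARY WHERE Col = ? AND Val = ? AND Label = ?",
                (col, val, label),
            ).fetchone()
            # A missing group means a zero count for that class.
            score *= (row[0] if row else 0) / total
        scores[label] = score
    return max(scores, key=scores.get)
```

Because the table is verticalized first, one GROUP BY covers every attribute, instead of one pass per column.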
11Decision tree obtained with ID3 (Quinlan 86)
[Figure: decision tree with nodes numbered 0 to 4]
12Decision Tree Classifiers
- Computed in a recursive fashion
- Various ways to split and to compute the splitting function
- First operation is to verticalize the table
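The recursive construction can be sketched in Python as below. This is not the ATLaS program: it works on ordinary rows rather than a verticalized table, and it uses the Gini index as the splitting function, which is just one of the possible choices (ID3 itself uses information gain). The data is the standard play-tennis set, assumed to match the slides, and all function names are illustrative.

```python
from collections import Counter

ATTRS = ("outlook", "temperature", "humidity", "windy")
# Standard play-tennis training set (assumed); last field is the class label.
ROWS = [
    ("sunny", "hot", "high", "false", "n"), ("sunny", "hot", "high", "true", "n"),
    ("overcast", "hot", "high", "false", "p"), ("rain", "mild", "high", "false", "p"),
    ("rain", "cool", "normal", "false", "p"), ("rain", "cool", "normal", "true", "n"),
    ("overcast", "cool", "normal", "true", "p"), ("sunny", "mild", "high", "false", "n"),
    ("sunny", "cool", "normal", "false", "p"), ("rain", "mild", "normal", "false", "p"),
    ("sunny", "mild", "normal", "true", "p"), ("overcast", "mild", "high", "true", "p"),
    ("overcast", "hot", "normal", "false", "p"), ("rain", "mild", "high", "true", "n"),
]


def gini(rows):
    """Gini index of the class labels in rows."""
    counts = Counter(r[-1] for r in rows)
    return 1 - sum((c / len(rows)) ** 2 for c in counts.values())


def partition(rows, col):
    """Group rows by their value in column col."""
    parts = {}
    for r in rows:
        parts.setdefault(r[col], []).append(r)
    return parts


def split_score(rows, col):
    """Size-weighted Gini index of splitting rows on column col."""
    return sum(gini(p) * len(p) / len(rows) for p in partition(rows, col).values())


def build(rows, cols):
    """Return a class label (leaf) or a pair (attribute, {value: subtree})."""
    labels = Counter(r[-1] for r in rows)
    if len(labels) == 1 or not cols:
        return labels.most_common(1)[0][0]
    best = min(cols, key=lambda c: split_score(rows, c))
    rest = [c for c in cols if c != best]  # the chosen column is dropped
    return (ATTRS[best], {v: build(p, rest) for v, p in partition(rows, best).items()})


tree = build(ROWS, list(range(len(ATTRS))))
```

On this data the root split is on outlook, with humidity tested under sunny and windy tested under rain.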
13Classical example
14Initial state: the node column
Training set from Quinlan's book
15First Level (Outlook will then be deleted)
16Gini index
- E.g., two classes, Pos and Neg, and dataset S with p Pos-elements and n Neg-elements.
- fp = p/(p+n), fn = n/(p+n)
- gini(S) = 1 - fp^2 - fn^2
- If dataset S is split into S1, S2, S3 then
- ginisplit(S1,S2,S3) = gini(S1)·(p1+n1)/(p+n) + gini(S2)·(p2+n2)/(p+n) + gini(S3)·(p3+n3)/(p+n)
- These computations can be easily expressed in ATLaS
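For illustration, the same computation in Python with exact fractions (the slides express it in ATLaS, which is not shown here). The function names are assumptions, and the counts in the example are the class counts for the outlook attribute in the play-tennis data: sunny (2 Pos, 3 Neg), overcast (4, 0), rain (3, 2).

```python
from fractions import Fraction


def gini(p, n):
    """gini(S) = 1 - fp^2 - fn^2 for p Pos-elements and n Neg-elements."""
    total = p + n
    fp, fn = Fraction(p, total), Fraction(n, total)
    return 1 - fp**2 - fn**2


def gini_split(parts):
    """Sum of gini(Si) * (pi + ni) / (p + n) over non-empty partitions (pi, ni)."""
    total = sum(p + n for p, n in parts)
    return sum(gini(p, n) * Fraction(p + n, total) for p, n in parts)


# Whole play-tennis dataset: 9 Pos, 5 Neg.
whole = gini(9, 5)
# Split on outlook: sunny, overcast, rain.
by_outlook = gini_split([(2, 3), (4, 0), (3, 2)])
```

Here `whole` is 45/98 (about 0.459) and `by_outlook` is 12/35 (about 0.343), so splitting on outlook lowers the index.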
17Programming in ATLaS
- Table-based programming is powerful and natural for data-intensive applications
- SQL can be awkward, and many extensions are possible
- But even SQL "as is" is adequate
18The ATLaS System
- The system compiles ATLaS programs into C programs, which execute on the Berkeley DB record manager
- The 100-line Apriori program compiles into 2,800 lines of C
- Other data structures (R-trees, in-memory tables) have been added using the same API.
- The system is now 54,000 lines of C code.
19ATLaS Conclusions
- A native extensibility mechanism for SQL, and a simple one. More efficient than Java or PL/SQL
- Effective with Data Mining applications
- Also OLAP applications, recursive queries, and temporal database applications
- Complements current mechanisms based on UDFs and Data Blades
- Supports and favors streaming aggregates (SQL's implicit default is blocking)
- Good basis for determining program properties, e.g. (non)monotonic and blocking behavior
- These are lessons that future QLs cannot easily ignore.
20The Future
- Continuous queries on Data Streams
- Other extensions and improvements
- Stay tuned: www.wis.ucla.edu