Classifiers in Atlas

Provided by: rk82
Learn more at: http://web.cs.ucla.edu

1
Classifiers in Atlas
  • CS240B
  • Class Notes
  • UCLA

2
Data Mining
  • Classifiers
  • Bayesian classifiers
  • Decision trees
  • The Apriori Algorithm
  • DBSCAN Clustering
  • http://wis.cs.ucla.edu/atlas/examples.html

3
The Classification Task
  • Input: a training set of tuples, each labelled
    with one class label
  • Output: a model (classifier) which assigns a
    class label to each tuple based on the other
    attributes.
  • The model can be used to predict the class of new
    tuples, for which the class label is missing or
    unknown
  • Some natural applications:
  • credit approval
  • medical diagnosis
  • treatment effectiveness analysis

4
Train & Test
  • The tuples (observations, samples) are
    partitioned into a training set and a test set.
  • Classification is performed in two steps:
  • Training - build the model from the training set
  • Testing - evaluate the model (for accuracy, etc.)
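
The two steps above can be sketched as follows. This is a minimal illustration, not the procedure used in the slides: the 30% test fraction, the toy data, and the majority-class "model" are all assumptions made for the example.

```python
import random

def train_test_split(tuples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the tuples and cut it into (training set, test set)."""
    shuffled = list(tuples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(classify, test_set):
    """Fraction of test tuples whose predicted label equals the stored label."""
    hits = sum(1 for attrs, label in test_set if classify(attrs) == label)
    return hits / len(test_set)

# Toy data (hypothetical): (attributes, class label) pairs with made-up labels.
data = [((i,), "p" if i % 2 == 0 else "n") for i in range(10)]
train, test = train_test_split(data)

# Step 1 (training): a trivial stand-in model that predicts the majority class.
labels = [label for _, label in train]
majority = max(set(labels), key=labels.count)

# Step 2 (testing): accuracy is measured on the held-out tuples only.
test_accuracy = accuracy(lambda attrs: majority, test)
```

Any real classifier can be passed to `accuracy` in place of the majority-class lambda; the point is only that the test tuples never take part in training.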

5
Classical example: play tennis?
Training set from Quinlan's book. Seq could
have been used to generate the RID column.
6
Bayesian classification
  • The classification problem may be formalized
    using a-posteriori probabilities:
  • $P(C \mid X)$ = probability that the sample tuple
    $X = \langle x_1, \ldots, x_k \rangle$ is of class $C$.
  • E.g. $P(\text{class} = N \mid \text{outlook} = \text{sunny}, \text{windy} = \text{true}, \ldots)$
  • Idea: assign to sample $X$ the class label $C$ such
    that $P(C \mid X)$ is maximal

7
Estimating a-posteriori probabilities
  • Bayes theorem:
  • $P(C \mid X) = P(X \mid C)\,P(C) / P(X)$
  • $P(X)$ is constant for all classes
  • $P(C)$ = relative frequency of class $C$ samples
  • $C$ such that $P(C \mid X)$ is maximum = $C$ such that
    $P(X \mid C)\,P(C)$ is maximum

8
Naïve Bayesian Classification
  • Naïve assumption: attribute independence
  • $P(x_1, \ldots, x_k \mid C) = P(x_1 \mid C) \cdots P(x_k \mid C)$
  • For categorical attributes, $P(x_i \mid C)$ is estimated
    as the relative frequency of samples having value $x_i$
    as i-th attribute in class $C$
  • Computationally, this is a count with grouping
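
The "count with grouping" can be sketched in a few lines of Python. The 14 rows below are the commonly reproduced play-tennis table from Quinlan; the slide's own table did not survive in this transcript, so the rows, the class codes `p`/`n`, and all function names are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction

ATTRS = ("outlook", "temperature", "humidity", "windy")
# Assumed training set: outlook, temperature, humidity, windy, class.
ROWS = """
sunny hot high false n
sunny hot high true n
overcast hot high false p
rain mild high false p
rain cool normal false p
rain cool normal true n
overcast cool normal true p
sunny mild high false n
sunny cool normal false p
rain mild normal false p
sunny mild normal true p
overcast mild high true p
overcast hot normal false p
rain mild high true n
"""
TRAIN = [tuple(line.split()) for line in ROWS.strip().splitlines()]

# Count with grouping: one count per class, one per (attribute, value, class).
class_count = Counter(row[-1] for row in TRAIN)
value_count = Counter(
    (attr, row[i], row[-1]) for row in TRAIN for i, attr in enumerate(ATTRS)
)

def p_value_given_class(attr, value, cls):
    """Relative frequency of `value` for `attr` among the samples of class `cls`."""
    return Fraction(value_count[(attr, value, cls)], class_count[cls])

def p_class(cls):
    """Relative frequency of class `cls` in the training set."""
    return Fraction(class_count[cls], len(TRAIN))
```

For instance, `p_value_given_class("outlook", "sunny", "p")` divides the number of `p` rows with a sunny outlook by the number of `p` rows.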

9
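Play-tennis example: estimating $P(x_i \mid C)$
outlook
$P(\text{sunny} \mid p) = 2/9$    $P(\text{sunny} \mid n) = 3/5$
$P(\text{overcast} \mid p) = 4/9$    $P(\text{overcast} \mid n) = 0$
$P(\text{rain} \mid p) = 3/9$    $P(\text{rain} \mid n) = 2/5$
temperature
$P(\text{hot} \mid p) = 2/9$    $P(\text{hot} \mid n) = 2/5$
$P(\text{mild} \mid p) = 4/9$    $P(\text{mild} \mid n) = 2/5$
$P(\text{cool} \mid p) = 3/9$    $P(\text{cool} \mid n) = 1/5$
humidity
$P(\text{high} \mid p) = 3/9$    $P(\text{high} \mid n) = 4/5$
$P(\text{normal} \mid p) = 6/9$    $P(\text{normal} \mid n) = 1/5$
windy
$P(\text{true} \mid p) = 3/9$    $P(\text{true} \mid n) = 3/5$
$P(\text{false} \mid p) = 6/9$    $P(\text{false} \mid n) = 2/5$
$P(p) = 9/14$
$P(n) = 5/14$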
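
As a worked example, the estimates in the table can be used to classify an unseen sample. The sample X = <rain, hot, high, false> is a hypothetical one chosen for illustration, and the dictionary layout and names are assumptions; only the fractions come from the table.

```python
from fractions import Fraction as F

# Estimates copied from the table, restricted to the sample's attribute values.
likelihood = {
    "p": {"outlook=rain": F(3, 9), "temperature=hot": F(2, 9),
          "humidity=high": F(3, 9), "windy=false": F(6, 9)},
    "n": {"outlook=rain": F(2, 5), "temperature=hot": F(2, 5),
          "humidity=high": F(4, 5), "windy=false": F(2, 5)},
}
prior = {"p": F(9, 14), "n": F(5, 14)}

def score(cls):
    """P(X|C) * P(C) under the naive independence assumption."""
    result = prior[cls]
    for estimate in likelihood[cls].values():
        result *= estimate
    return result

scores = {cls: score(cls) for cls in prior}
predicted = max(scores, key=scores.get)
print({cls: float(s) for cls, s in scores.items()}, predicted)
```

The scores come out as 2/189 (about 0.0106) for class p and 16/875 (about 0.0183) for class n, so this sample is labelled n. P(X) is never computed because it is the same for both classes.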
10
Bayesian Classifiers
  • Training can be done with SQL count and
    grouping sets (but that might require many passes
    through the data). If the results are stored in a
    table called SUMMARY, then testing is a simple
    SQL query on SUMMARY
  • The first operation is to verticalize the table
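
A rough sketch of this pipeline, using plain SQL through Python's `sqlite3` module rather than ATLaS. "Verticalize" is taken here to mean turning each training tuple into one (rid, col, val, cls) row per attribute; that reading, the table and column names other than SUMMARY, the training rows (the commonly reproduced play-tennis table), and the test sample are all assumptions.

```python
import sqlite3

ATTRS = ("outlook", "temperature", "humidity", "windy")
# Assumed training set: outlook, temperature, humidity, windy, class.
ROWS = """\
sunny hot high false n
sunny hot high true n
overcast hot high false p
rain mild high false p
rain cool normal false p
rain cool normal true n
overcast cool normal true p
sunny mild high false n
sunny cool normal false p
rain mild normal false p
sunny mild normal true p
overcast mild high true p
overcast hot normal false p
rain mild high true n"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE playtennis (rid INTEGER, outlook TEXT, temperature TEXT,"
            " humidity TEXT, windy TEXT, cls TEXT)")
con.executemany("INSERT INTO playtennis VALUES (?, ?, ?, ?, ?, ?)",
                [(rid, *line.split()) for rid, line in enumerate(ROWS.splitlines(), 1)])

# Verticalize: one (rid, col, val, cls) row per attribute of each tuple.
con.execute("CREATE TABLE vertical (rid INTEGER, col TEXT, val TEXT, cls TEXT)")
for col in ATTRS:
    con.execute(f"INSERT INTO vertical SELECT rid, '{col}', {col}, cls FROM playtennis")

# Training: a single count-with-grouping pass over the vertical table.
con.execute("""
    CREATE TABLE summary AS
    SELECT col, val, cls, COUNT(*) AS cnt
    FROM vertical GROUP BY col, val, cls
""")

# Testing: verticalize the sample too, then join it with SUMMARY.
sample = {"outlook": "rain", "temperature": "hot", "humidity": "high", "windy": "false"}
con.execute("CREATE TABLE sample (col TEXT, val TEXT)")
con.executemany("INSERT INTO sample VALUES (?, ?)", sample.items())
class_total = dict(con.execute("SELECT cls, COUNT(*) FROM playtennis GROUP BY cls"))
matches = con.execute("""
    SELECT s.cls, s.cnt FROM summary AS s
    JOIN sample AS t ON s.col = t.col AND s.val = t.val
""").fetchall()

n_rows = sum(class_total.values())
scores = {}
for cls, total in class_total.items():
    counts = [cnt for c, cnt in matches if c == cls]
    score = total / n_rows                 # prior P(C)
    for cnt in counts:
        score *= cnt / total               # P(x_i | C)
    # A (col, val, cls) group with no rows is absent from SUMMARY: probability 0.
    scores[cls] = score if len(counts) == len(sample) else 0.0
predicted = max(scores, key=scores.get)
```

Because the table is vertical, one GROUP BY covers every attribute at once, instead of one grouping query per column. The final product is taken in Python only because SQLite has no built-in product aggregate.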

11
Decision tree obtained with ID3 (Quinlan 86)
[Figure: decision tree with nodes numbered 0 to 4]
12
Decision Tree Classifiers
  • Computed in a recursive fashion
  • Various ways to split and to compute the
    splitting function
  • The first operation is to verticalize the table
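
The recursive computation can be sketched as below. This is an in-memory illustration, not the ATLaS program: it uses the size-weighted Gini index as the splitting function (one of the "various ways"; ID3 itself uses information gain), and the training rows (the commonly reproduced play-tennis table) and all names are assumptions.

```python
from collections import Counter

ATTRS = ("outlook", "temperature", "humidity", "windy")
# Assumed training set: outlook, temperature, humidity, windy, class.
ROWS = """
sunny hot high false n
sunny hot high true n
overcast hot high false p
rain mild high false p
rain cool normal false p
rain cool normal true n
overcast cool normal true p
sunny mild high false n
sunny cool normal false p
rain mild normal false p
sunny mild normal true p
overcast mild high true p
overcast hot normal false p
rain mild high true n
"""
TRAIN = [tuple(line.split()) for line in ROWS.strip().splitlines()]

def gini(rows):
    """1 minus the sum of squared class frequencies (class is the last field)."""
    total = len(rows)
    return 1.0 - sum((c / total) ** 2 for c in Counter(r[-1] for r in rows).values())

def partition(rows, i):
    """Group rows by the value of attribute number i."""
    parts = {}
    for r in rows:
        parts.setdefault(r[i], []).append(r)
    return parts

def build(rows, attrs):
    """Return a class label (leaf) or (attribute name, {value: subtree})."""
    classes = Counter(r[-1] for r in rows)
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]

    def split_cost(i):
        return sum(len(p) / len(rows) * gini(p) for p in partition(rows, i).values())

    best = min(attrs, key=split_cost)
    rest = [i for i in attrs if i != best]   # the column used for the split is dropped
    return ATTRS[best], {v: build(p, rest) for v, p in partition(rows, best).items()}

tree = build(TRAIN, list(range(len(ATTRS))))
```

On these rows the root splits on outlook, the sunny branch splits on humidity, the rain branch splits on windy, and the overcast branch is a pure `p` leaf.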

13
Classical example
14
Initial state: the node column
Training set from Quinlan's book
15
First Level (Outlook will then be deleted)
16
Gini index
  • E.g., two classes, Pos and Neg, and dataset S
    with p Pos-elements and n Neg-elements.
  • $f_p = p/(p+n)$, $f_n = n/(p+n)$
  • $\mathrm{gini}(S) = 1 - f_p^2 - f_n^2$
  • If dataset S is split into $S_1, S_2, S_3$, then
  • $\mathrm{gini}_{split}(S_1, S_2, S_3) = \mathrm{gini}(S_1)(p_1+n_1)/(p+n) + \mathrm{gini}(S_2)(p_2+n_2)/(p+n) + \mathrm{gini}(S_3)(p_3+n_3)/(p+n)$
  • These computations can be easily expressed in
    ATLaS
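
To make the formulas concrete, here is a small sketch in Python (not ATLaS) that evaluates them exactly on the play-tennis counts: 9 Pos and 5 Neg overall, and (2, 3), (4, 0), (3, 2) for the sunny, overcast and rain partitions, as implied by the outlook estimates earlier in the deck. The function names are assumptions.

```python
from fractions import Fraction

def gini(p, n):
    """gini(S) = 1 - fp^2 - fn^2 for a dataset with p Pos- and n Neg-elements."""
    fp, fn = Fraction(p, p + n), Fraction(n, p + n)
    return 1 - fp**2 - fn**2

def gini_split(parts):
    """Size-weighted sum of gini(Si) over partitions [(p1, n1), (p2, n2), ...]."""
    total = sum(p + n for p, n in parts)
    return sum(gini(p, n) * Fraction(p + n, total) for p, n in parts)

whole = gini(9, 5)                                   # before any split
by_outlook = gini_split([(2, 3), (4, 0), (3, 2)])    # after splitting on outlook
print(whole, by_outlook)
```

The split lowers the index from 45/98 (about 0.459) to 12/35 (about 0.343); the pure overcast partition contributes nothing to the sum.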

17
Programming in ATLaS
  • Table-based programming is powerful and natural
    for data-intensive applications
  • SQL can be awkward and many extensions are
    possible
  • But even SQL as is is adequate

18
The ATLaS System
  • The system compiles ATLaS programs into C++
    programs, which execute on the Berkeley DB
    record manager
  • The 100-line Apriori program compiles into 2,800 lines
    of C++
  • Other data structures (R-trees, in-memory tables)
    have been added using the same API.
  • The system is now 54,000 lines of C++ code.

19
ATLaS Conclusions
  • A native extensibility mechanism for SQL, and a
    simple one. More efficient than Java or PL/SQL
  • Effective with Data Mining applications
  • Also OLAP applications, recursive queries,
    and temporal database applications
  • Complements current mechanisms based on UDFs and
    Data Blades
  • Supports and favors streaming aggregates (SQL's
    implicit default is blocking)
  • Good basis for determining program properties,
    e.g. (non)monotonic and blocking behavior
  • These are lessons that future QLs cannot easily
    ignore.

20
The Future
  • Continuous queries on Data Streams
  • Other extensions and improvements
  • Stay tuned: www.wis.ucla.edu