C20.0046: Database Management Systems Lecture
1
C20.0046: Database Management Systems, Lecture 26
  • M.P. Johnson
  • Stern School of Business, NYU
  • Spring, 2005

2
Agenda
  • Last time
  • OLAP Data Warehouses
  • Data Mining
  • Websearch
  • Etc.

3
Goals after today
  • Be aware of what problems DM solves
  • Know what the main algorithms are
  • Understand supervised vs. unsupervised learning
  • Understand how to find frequent sets, and why

4
New topic: Data Mining
  • Situation: data rich but knowledge poor
  • Terabytes and terabytes of data
  • Searching for needles in haystacks
  • Can collect lots of data
  • credit-card transactions
  • bar codes
  • club cards
  • webserver logs
  • call centers
  • 311 calls in NYC
  • cameras

5
Lots of data
  • Can store this data
  • DBMSs, data warehousing
  • Can query it
  • SQL
  • DW/OLAP queries
  • "find the things such that ..."
  • "find the totals for each of ..."
  • Can get answers to specific questions, but what
    does it all mean?

6
But how to learn from this data?
  • What kinds of people will buy my widgets?
  • Whom should I send my direct-mail literature to?
  • How can I segment my market?
  • http://www.joelonsoftware.com/articles/CamelsandRubberDuckies.html
  • Whom should we approve for a gold card?
  • Who gets approved for car insurance?
  • And what should we charge?

7
Knowledge Discovery
  • Goal: extract interesting/actionable knowledge
    from large collections of data
  • Finding rules
  • Finding patterns
  • Classifying instances
  • KD = DM = Business Analytics
  • Business Intelligence = semi-intelligent
    reporting/OLAP

8
Data Mining: at the intersection of disciplines
  • DBMS
  • query processing, data warehousing, OLAP
  • Statistics
  • mathematical modeling, regression
  • Visualization
  • 3D representation
  • AI/machine learning
  • neural networks, decision trees, etc.
  • Except: the algorithms must really scale

9
ML/DM: two basic kinds
  • Supervised learning
  • Classification
  • Regression
  • Unsupervised learning
  • Clustering
  • Or: "find something interesting"

10
Supervised learning
  • Situation: you have many particular instances
  • Transactions
  • Customers
  • Credit applications
  • Have various fields describing them
  • But one other property you're interested in
  • E.g., credit-worthiness
  • Goal: infer this dependent property from the
    other data

11
Supervised learning
  • Supervised learning starts with training data
  • Many instances including the dependent property
  • Use the training data to build a model
  • Many different algorithms
  • Given the model, you can now determine the
    dependent property for new instances
  • And ideally, the answers are correct
  • Categorical property → classification
  • Numerical property → regression

12
k-Nearest Neighbor
  • Very simple algorithm
  • Sometimes works
  • Map training data points into a space
  • Given any two points, can compute the distance
    between them
  • E.g., Euclidean distance
  • Given an unlabeled point, label it this way:
  • Find the k nearest points
  • Let them vote on the label
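The find-the-k-nearest-then-vote procedure can be sketched in a few lines of Python (a toy illustration; the training points, labels, and k below are made up):

```python
import math
from collections import Counter

def knn_label(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `training` is a list of (point, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(training, key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy training data: two clumps in the plane
training = [((0, 0), "reject"), ((0, 1), "reject"), ((1, 0), "reject"),
            ((5, 5), "approve"), ((5, 6), "approve"), ((6, 5), "approve")]
print(knn_label(training, (5.5, 5.5)))  # the 3 nearest neighbors all vote "approve"
```

With k = 3 the query near the upper clump is outvoted 3–0 by "approve" points; an even k can tie, which is one reason odd k is common.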

13
Neural Networks (skip?)
  • Hill climbing
  • based on connections between neurons in the brain
  • simple NN:
  • input layer with 3 nodes
  • hidden layer with 2 nodes
  • output layer with 1 node
  • each node points to each node in the next level
  • each node has an activation level and something
    like a critical mass (a threshold)
  • Draw picture
  • What kind of graph is this?

14
Neural Networks (skip?)
  • values passed into input nodes represent the
    problem instance
  • given the weighted sum of its inputs, a neuron
    sends out a pulse only if the sum is greater than
    its threshold
  • values output by hidden nodes are sent to the
    output node
  • if the weighted sum going into the output node is
    high enough, it outputs 1; otherwise 0

15
NN applications (skip?)
  • plausible application: we have data about
    potential customers
  • party registration
  • married or not
  • gender
  • income level
  • or have credit applicant information
  • employment
  • income
  • home ownership
  • bankruptcy
  • should we give a credit card to him?

16
How NNs work (skip?)
  • hope: plug in a customer → out comes whether we
    should market toward him
  • How does it get the right answer?
  • Initially, all weights are random!
  • But we assume we have data for lots of people
    known to be either interested in our products or
    not (let's say)
  • we have data for both kinds
  • So when we plug in one of these customers, we
    know what the right answer is supposed to be

17
How NNs work (skip?)
  • can use the Backpropagation algorithm
  • for each known problem instance, plug in and look
    at the answer
  • if the answer is wrong, change edge weights in one
    way
  • o.w., change them the opposite way (details
    omitted)
  • repeat
  • the more iterations we do, the more the NN learns
    our known data
  • with enough confidence, can apply NN to unknown
    customer data to learn whether to market toward
    them

18
LBS example (skip?)
  • Investments
  • goal: maximize return on investments
  • buy/sell the right securities at the right time
  • lots of time-series data for different properties
    of different stocks
  • returns, market signals
  • pick the right ones
  • react
  • solution: create an NN for each stock
  • retrain weekly

19
Decision Trees
  • Another use of (rooted) trees
  • Trees but not BSTs
  • Each node: one attribute
  • Its children: possible values of that attribute
  • E.g., each node is some field on a credit app
  • Each path from root to leaf is one rule
  • "If these fields have these values, then make this
    decision"

20
Decision Trees
  • Details
  • for a binary property, two out-edges, but there may
    be more
  • for a continuous property (income), divide values
    into discrete ranges
  • a property may appear more than once
  • Example: top node: history of bankruptcy?
  • if yes, REJECT
  • if no, then: employed?
  • If no, (maybe look for high monthly housing
    payment)
  • If yes,
  • Particular algorithms: ID3, CART, etc.
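The paths-as-rules idea can be sketched as nested conditionals in Python. This is only an illustration of the tree shape, not a learned tree: the field names, the $1,500 threshold, and the accept-if-employed branch (which the example leaves open) are all hypothetical.

```python
def credit_decision(app):
    """Each test is a tree node; each path to a return is one rule.

    `app` is a dict with hypothetical fields: 'bankruptcy' (bool),
    'employed' (bool), 'housing_payment' (monthly, in dollars).
    """
    if app["bankruptcy"]:               # top node: history of bankruptcy?
        return "REJECT"
    if app["employed"]:                 # employed? (yes-branch is assumed)
        return "ACCEPT"
    if app["housing_payment"] > 1500:   # hypothetical cutoff: a continuous
        return "REJECT"                 # property divided into two ranges
    return "ACCEPT"

print(credit_decision({"bankruptcy": False, "employed": True,
                       "housing_payment": 900}))
```

Algorithms like ID3 and CART pick which attribute to test at each node automatically (e.g., by information gain); here the tree is written by hand.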

21
Naïve Bayes Classifier
  • Bayes' Theorem: Pr(B|A) = Pr(B,A)/Pr(A)
  •  = (Pr(A|B)·Pr(B))/Pr(A)
  • Or: Pr(S|W) = Pr(S,W)/Pr(W)
  •  = (Pr(W|S)·Pr(S))/Pr(W)
  • Used in many spam filters
  • W means the msg has the words W1,W2,...,Wn
  • S means it's spam
  • Goal: given a new msg with certain words, is it
    spam?
  • Is Pr(S|W) > 50%?

22
Naïve Bayes Classifier
  • This is supervised learning, so we first have a
    training phase
  • Look at lots of spam messages and my non-spam
    messages
  • Training phase:
  • For each word Wi, compute Pr(Wi)
  • For each word Wi, compute Pr(Wi|S)
  • Compute Pr(S)
  • That's it!
  • Now, we wait for email to arrive

23
Naïve Bayes Classifier
  • When a new msg with words W = W1...Wn arrives, we
    compute
  • Pr(S|W) = (Pr(W|S)·Pr(S))/Pr(W)
  • What's Pr(W) and Pr(W|S)?
  • Assuming words are independent (obviously false),
    we have
  • Pr(W) = Pr(W1)·Pr(W2)···Pr(Wn)
  • Pr(W|S) = Pr(W1|S)·Pr(W2|S)···Pr(Wn|S)
  • Each number here, we have precomputed!
  • Except for new words
  • To decide spam status of a message, then, we just
    do the math!
  • Very simple, but works surprisingly well in
    practice
  • Really simple: can write it in a page of Perl code
  • See also Paul Graham: http://www.paulgraham.com/spam.html
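A minimal sketch of the scoring step in Python, assuming the training phase has already produced the Pr(Wi), Pr(Wi|S), and Pr(S) tables (the numbers below are made up; a real filter would also smooth estimates for unseen words rather than skip them, and this deck's example uses Perl):

```python
import math

def spam_probability(words, pr_w, pr_w_given_s, pr_s):
    """Pr(S|W) = (Pr(W|S)·Pr(S)) / Pr(W), with words assumed independent.

    `pr_w` and `pr_w_given_s` map each known word to its precomputed
    probability; unknown words are skipped here.  Summing logs instead of
    multiplying probabilities avoids underflow on long messages.
    """
    log_num = math.log(pr_s)        # accumulates log(Pr(W|S)·Pr(S))
    log_den = 0.0                   # accumulates log(Pr(W))
    for w in words:
        if w in pr_w and w in pr_w_given_s:
            log_num += math.log(pr_w_given_s[w])
            log_den += math.log(pr_w[w])
    return math.exp(log_num - log_den)

# Made-up training estimates (chosen to be mutually consistent)
pr_w = {"viagra": 0.023, "meeting": 0.10}         # Pr(Wi)
pr_w_given_s = {"viagra": 0.05, "meeting": 0.01}  # Pr(Wi|S)
pr_s = 0.4                                        # Pr(S)

print(spam_probability(["viagra"], pr_w, pr_w_given_s, pr_s) > 0.5)
```

Here Pr(S|"viagra") = 0.05·0.4/0.023 ≈ 0.87 > 50%, so the message is classified as spam, while "meeting" gives 0.01·0.4/0.10 = 0.04.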

24
Naïve Bayes Classifier
  • From: "mrs bulama shettima" <aminabulama@pnetmail.co.za>
  • Date: September 11, 2004, 4:32:14 AM EDT
  • To: aminabulama@pnetmail.co.za
  • Subject: SALAM.

SALAM. My name is Hajia Amina Shettima Mohammed
Bulama, the eldest wife of Shettima Mohammed Bulama,
who was the erstwhile managing director of the Bank
of the North Nig. Plc. I am contacting you in a
benevolent spirit, utmost confidence and trust to
enable us provide a solution to a money transfer of
$12M that is presentlt the last resort of my family.
My husband has just been retired from office and has
also been placed under survelience of which he can
not travel out of the country for now due to
allegations levelled against him while in office as
the Managing director of the bank of the north Nig.
Plc of which he is

The reason is because no names were used to lodge in
the consignment containing the funds. Instead, he
used PERSONAL IDENTIFICATION NUMBERS (PIN) and
declared the contents as Bearer Bonds and Treasury
Bills. Also the firm issued him with a certificate
of deposit of the consignment. Note that I have
these information in my custody. Right now, my
husband has asked me to negotiate with a foreigner
who would assist us lay claim to the consignment, as
the eldest wife of my husband, I believe that I owe
the entire family an obligation to ensure that the
US$12M is successfully transferred abroad for
investment purposes. With the present situation, I
cannot do it all by myself. It is based on this that
I am making this contact with you. I have done a
thorough homework and fine-tuned the best way to
create you as the beneficiary to the consignment
containing the funds and effect the transfer of the
consignment accordingly. It is rest assured that the
modalities I have resolved to finalize the entire
project guarantees our safety and the successful
transfer of the funds. So, you will be absolutely
right when you say that this project is risk free
and viable. If you are capable and willing to
assist, contact me at once via this email for more
details. Believe me, there is no one else we can
trust again. All my husband's friends have deserted
us after exploiting us on the pretence of trying to
help my husband. As it is said, it is at the time of
problems that you know your true friends. So long as
you keep everything to yourself, we would definitely
have no problems. For your assistance, I am ready to
give you as much as 25% of the total funds after
confirmation in your possession and invest a
reasonable percentage into any viable business you
may suggest. Please, I need your assistance to make
this happen and please do not undermine it because
it will also be a source of up liftment to you also.
You have absolutely nothing to loose in assisting us
instead, you have so much to gain. I will appreciate
if you can contact me at my private email
mrsaminabulama@gawab.com once. Awaiting your urgent
and positive response. Thanks and Allah be with you.
Hajia (Mrs) Amina Shettima Bulama.
25
Judging the results
  • (Relatively) easy to create a model that does
    very well on the training data
  • What matters: doing well on future, unseen data
  • Common problem: overfitting
  • Model is too close to the training data
  • Needs to be pruned
  • Common approach: cross-validation
  • Divide training data into 10 parts
  • Train on 9/10, test on 1/10
  • Do this all 10 ways
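The 10-fold split itself can be sketched in a few lines (a minimal sketch; real implementations usually shuffle first and may stratify by class):

```python
def ten_fold_splits(data):
    """Yield 10 (train, test) pairs: test on each tenth, train on the rest."""
    folds = [data[i::10] for i in range(10)]   # 10 roughly equal parts
    for i in range(10):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(100))            # stand-in for 100 training instances
splits = list(ten_fold_splits(data))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each instance appears in exactly one test fold, so averaging the 10 test scores estimates performance on unseen data.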

26
Association applications
  • Spam filtering
  • Network intrusion detection
  • Trading strategies
  • Political marketing
  • Mail-order tulip bulbs → Conservative party
    voters
  • NYTimes last year

27
New topic: Clustering
  • Unsupervised learning
  • divide items up into clusters of the same type
  • What are the different types of customers we
    have?
  • What types of webpages are there?
  • Each item becomes a point in the space
  • One dimension for every property
  • Possible plotting for webpages: a dimension for
    each word
  • Value in dimension d is (occurrences of d) / (all
    word occurrences)
  • k-means

28
k-means
  • Simple clustering algorithm
  • Want to partition data points into sets of
    similar instances
  • Plot in n-space
  • → partition points into clumps in space
  • Like an unsupervised analog of k-NN
  • But iterative

29
k-means
  • Algorithm:
  • Choose initial means (centers)
  • Do:
  • assign each point to the closest mean
  • recompute means of clumps
  • until change in means < ε
  • Visualization
  • http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
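The assign/recompute loop can be sketched directly in Python (toy data; the choice of k random data points as initial means is one common convention, and the result can depend on it):

```python
import math
import random

def kmeans(points, k, tol=1e-6, seed=0):
    """Assign each point to the closest mean, recompute means, repeat."""
    rng = random.Random(seed)
    means = rng.sample(points, k)            # initial means: k data points
    while True:
        # assign each point to the closest mean
        clumps = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            clumps[i].append(p)
        # recompute the mean of each clump (keep old mean if clump is empty)
        new_means = [
            tuple(sum(c) / len(clump) for c in zip(*clump)) if clump else m
            for clump, m in zip(clumps, means)]
        shift = max(math.dist(m, n) for m, n in zip(means, new_means))
        means = new_means
        if shift < tol:                      # until change in means < epsilon
            return means, clumps

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
means, clumps = kmeans(points, 2)
```

On this data the two clumps of three points are recovered regardless of which points the initial means land on.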

30
  • Amazon's "Page You Made"
  • mpj9@col
  • How does this work?

31
New topic: Frequent Sets
  • Find sets of items that go together
  • Intuition: market-basket data
  • Suppose you're Wal-Mart, with 460TB of data
  • http://developers.slashdot.org/article.pl?sid=04/11/14/2057228
  • Might not know customer history (but maybe club
    cards!)
  • One thing you do know: groupings of items
  • Each set of items rung up in an individual
    purchase
  • Famous (claimed) example: beer and diapers are
    positively correlated
  • Intuition: babies at home, not at bars
  • → put chips in between

32
Finding Frequent Sets
  • Situation: one table
  • Baskets(id, item)
  • One row for each purchase of each item
  • Alternate interpretation: words on webpages
  • Goal: given support s, find sets of items that
    appear together in > s baskets
  • For pairs, an obvious attempt:

SELECT B1.item, B2.item, COUNT(*)
FROM Baskets B1, Baskets B2
WHERE B1.id = B2.id AND B1.item < B2.item
GROUP BY B1.item, B2.item
HAVING COUNT(*) > s
What's wrong?
33
A Priori Property
  • a priori = "prior to"
  • Prior to investigation
  • Deductive/from reason/non-empirical
  • As opposed to a posteriori = "post-"
  • After investigation/experience
  • Logical observation: if support(X) > s, then for
    every individual x in X, support(x) > s
  • → any item y with support < s can be ignored

34
A Priori Property
  • E.g.: you sell 10,000 items; store's record of
    1,000,000 baskets, with an avg of 20 items each
  • → 20,000,000 rows in Baskets
  • # pairs of items from 1,000,000 baskets:
  • C(20,2)·1,000,000 = 190,000,000 pairs!
  • One idea:
  • Find support-s items
  • Run old query on them

35
A Priori Property
  • Suppose we're looking for sup-10,000 pairs
  • Each item in a pair must be sup-10,000
  • Counting argument: have only 20,000,000 individual
    items purchased
  • Can have at most 2,000 = 20,000,000/10,000 popular
    items
  • Eliminated 4/5 of the item types!
  • Actually, probably much more
  • Will also lower the average basket size
  • Suppose we now have 500,000 rows with avg basket
    size 10
  • # pairs = 500,000·C(10,2) = 22,500,000 rows

36
A Priori Algorithm
  • But a frequent itemset may have > 2 items
  • Idea: build frequent sets (not just pairs)
    iteratively
  • Algorithm:
  • First, find all frequent items
  • These are the size-1 frequent sets
  • Next, for each size-k frequent set:
  • For each frequent item:
  • Check whether the union is frequent
  • If so, add it to the collection of size-(k+1)
    frequent sets
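A sketch of this level-wise algorithm in Python (toy baskets; real a-priori implementations count supports in one scan per level instead of rescanning per candidate, and prune candidates more aggressively):

```python
def frequent_sets(baskets, s):
    """Grow size-(k+1) frequent sets from size-k ones, one item at a time.

    `baskets` is a list of sets of items; returns all itemsets (as
    frozensets) that appear in more than `s` baskets.
    """
    def support(itemset):
        return sum(1 for b in baskets if itemset <= b)

    items = {i for b in baskets for i in b}
    # size-1 frequent sets: the frequent items
    current = [frozenset([i]) for i in items if support(frozenset([i])) > s]
    result = list(current)
    while current:
        # union each size-k frequent set with each item; dedupe via a set
        candidates = {fs | {i} for fs in current for i in items if i not in fs}
        current = [c for c in candidates if support(c) > s]
        result.extend(current)
    return result

baskets = [{"beer", "diapers", "chips"}, {"beer", "diapers"},
           {"beer", "chips"}, {"diapers"}]
print(sorted(map(sorted, frequent_sets(baskets, 1))))
```

With support threshold > 1, the pairs {beer, diapers} and {beer, chips} survive, but {diapers, chips} (one basket) does not, so no size-3 candidate can be frequent either.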

37
Mining for rules
  • Frequent sets tell us which things go together
  • But they don't tell us causality
  • Sometimes want to say a causes b
  • Or at least: the presence of a makes b likely
  • E.g., razor purchase → future razorblade
    purchases
  • but not vice versa
  • → make razors cheap

38
Association rules
  • Here's some data: what implies what?

39
Association rules
  • Candidates
  • Pen → ink
  • Ink → pen
  • Water → juice
  • Etc.
  • How to pick?
  • Two main measures

40
Judging association rules
  • Support
  • Support(X → Y) = support(X ∪ Y)
  •  = Pr(X ∪ Y)
  • Does the rule matter?
  • Confidence
  • Conf(X → Y) = support(X ∪ Y) / support(X)
  •  = Pr(Y|X)
  • Should we believe the rule?
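Both measures fall out of one support computation; here is a sketch in Python in the pen/ink style of the example (the baskets themselves are made up):

```python
def support(itemset, baskets):
    """Fraction of baskets containing every item in `itemset`: Pr(X u Y)."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(x, y, baskets):
    """Conf(X -> Y) = support(X u Y) / support(X) = Pr(Y|X)."""
    return support(x | y, baskets) / support(x, baskets)

baskets = [{"pen", "ink"}, {"pen", "ink"}, {"pen"}, {"water", "juice"}]
print(confidence({"pen"}, {"ink"}, baskets))  # 2 of the 3 pen baskets have ink
print(confidence({"ink"}, {"pen"}, baskets))  # every ink basket has a pen
```

Here pen → ink has confidence 2/3 while ink → pen has confidence 1, even though both rules have the same support (1/2), which is why the two measures are judged separately.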

41
Association rules
  • What are the supports and confidences?
  • Pen → ink
  • Ink → pen
  • Water → juice
  • Which matter?

42
Discovering association rules
  • Association rules only matter if their support
    and confidence are both high enough
  • User specifies the minimum allowed for each
  • First, support:
  • High support(X → Y) = high support(X ∪ Y)
  • → X ∪ Y is a frequent set
  • So, to find a good association rule:
  • Generate a frequent set Z
  • Divide it into subsets X, Y in all possible ways
  • Check whether conf(X → Y) is high enough

43
Association rules
  • Suppose {pen, ink} is frequent
  • Divide into subsets both ways:
  • Pen → ink
  • Ink → pen
  • Which do we choose?

44
Other kinds of baskets
  • Here, the basket was a single transaction
    (purchase)
  • But could have other baskets
  • All purchases from each customer
  • All purchases on the first day of each month
  • Etc.

45
Frequent set/association applications
  • Store/website/catalog layout
  • Page You Made
  • Direct marketing
  • Fraud detection

46
Mining v. warehousing
  • Warehousing: let the user search, group by
    interesting properties
  • "Give me the sales of A4s by year and dealer, for
    these colors"
  • User tries to learn from the results which
    properties are important/interesting
  • What's driving sales?
  • Mining: tell the user what the interesting
    properties are
  • How can I increase sales?

47
Social/political concerns
  • Privacy
  • TIA
  • Sensitive data
  • Allow mining but not queries
  • Opt-in/opt-out
  • "Don't be evil."

48
For more info
  • See Dhar & Stein, Seven Methods for Transforming
    Corporate Data into Business Intelligence (1997)
  • Drawn on above
  • A few years old, but very accessible
  • http://www.kdnuggets.com/
  • Data mining courses offered here

49
Future
  • RAID
  • Websearch
  • Proj5 due
  • Print by 10:30 and turn in on time (at 11)
  • If not, email
  • Final Exam: next Thursday, 5/5, 10:00–11:50am
  • Info is up