Title: C20.0046: Database Management Systems Lecture
1 C20.0046: Database Management Systems, Lecture 26
- M.P. Johnson
- Stern School of Business, NYU
- Spring, 2005
2 Agenda
- Last time
- OLAP Data Warehouses
- Data Mining
- Websearch
- Etc.
3 Goals after today
- Be aware of what problems DM solves
- What the algorithms are
- Supervised v. unsupervised learning
- Understand how to find frequent sets and why
4 New topic: Data Mining
- Situation: data rich but knowledge poor
- Terabytes and terabytes of data
- Searching for needles in haystacks
- Can collect lots of data
- credit card transactions
- bar codes
- club cards
- webserver logs
- call centers
- 311 calls in NYC
- cameras
5 Lots of data
- Can store this data
- DBMSs, data warehousing
- Can query it
- SQL
- DW/OLAP queries
- "Find the things such that …"
- "Find the totals for each of …"
- Can get answers to specific questions, but what
does it all mean?
6 But how to learn from this data?
- What kinds of people will buy my widgets?
- Whom should I send my direct-mail literature to?
- How can I segment my market?
- http://www.joelonsoftware.com/articles/CamelsandRubberDuckies.html
- Whom should we approve for a gold card?
- Who gets approved for car insurance?
- And what should we charge?
7 Knowledge Discovery
- Goal: extract interesting/actionable knowledge from large collections of data
- Finding rules
- Finding patterns
- Classifying instances
- KD = DM = Business Analytics
- Business Intelligence = semi-intelligent reporting/OLAP
8 Data Mining: at the intersection of disciplines
- DBMS
- query processing, data warehousing, OLAP
- Statistics
- mathematical modeling, regression
- Visualization
- 3d representation
- AI/machine learning
- neural networks, decision trees, etc.
- Except the algorithms must really scale
9 ML/DM: two basic kinds
- Supervised learning
- Classification
- Regression
- Unsupervised learning
- Clustering
- Or: find something interesting
10 Supervised learning
- Situation: you have many particular instances
- Transactions
- Customers
- Credit applications
- Have various fields describing them
- But one other property you're interested in
- E.g., credit-worthiness
- Goal: infer this dependent property from the other data
11 Supervised learning
- Supervised learning starts with training data
- Many instances including the dependent property
- Use the training data to build a model
- Many different algorithms
- Given the model, you can now determine the dependent property for new instances
- And ideally, the answers are correct
- Categorical property → classification
- Numerical property → regression
12 k-Nearest Neighbor
- Very simple algorithm
- Sometimes works
- Map training data points in a space
- Given any two points, can compute the distance between them
- E.g., Euclidean distance
- Given an unlabeled point, label it this way:
- Find the k nearest points
- Let them vote on the label
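A minimal sketch of the idea in Python (the feature vectors, labels, and choice of k here are illustrative, not from the lecture):

    from collections import Counter
    import math

    def knn_classify(training, query, k=3):
        """training: list of (feature_vector, label); query: an unlabeled feature vector."""
        # distance from the query point to every training point (Euclidean)
        dists = [(math.dist(query, x), label) for x, label in training]
        # find the k nearest points and let them vote on the label
        nearest = sorted(dists)[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    # toy data: (income in $K, age) -> creditworthiness
    training = [((30, 22), "reject"), ((25, 30), "reject"),
                ((80, 45), "approve"), ((75, 38), "approve")]
    print(knn_classify(training, (70, 40)))   # -> "approve"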
13 Neural Networks (skip?)
- Hill climbing
- based on connections between neurons in brain
- simple NN
- input layer with 3 nodes
- hidden layer with 2 nodes
- output layer with 1 node
- each node points to each node in next level
- each node has some activation level and something like a critical mass
- Draw picture
- What kind of graph is this?
14 Neural Networks (skip?)
- values passed into input nodes represent the problem instance
- given the weighted sum of its inputs, a neuron sends out a pulse only if the sum is greater than its threshold
- values output by hidden nodes are sent to the output node
- if the weighted sum going into output node is
high enough, it outputs 1, otherwise 0
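A small sketch of one such threshold unit and of the 3-2-1 network described above (the weights and thresholds would normally be learned; here they are just parameters):

    def fires(inputs, weights, threshold):
        # a node sends out a pulse (1) only if its weighted input sum exceeds its threshold
        total = sum(w * x for w, x in zip(weights, inputs))
        return 1 if total > threshold else 0

    def tiny_nn(x, hidden_ws, hidden_ts, out_w, out_t):
        # input layer: the 3 values in x; hidden layer: 2 threshold units; output layer: 1 unit
        hidden = [fires(x, w, t) for w, t in zip(hidden_ws, hidden_ts)]
        return fires(hidden, out_w, out_t)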
15 NN applications (skip?)
- plausible application: we have data about potential customers
- party registration
- married or not
- gender
- income level
- or have credit applicant information
- employment
- income
- home ownership
- bankruptcy
- should we give a credit card to him?
16 How NNs work (skip?)
- hope: plug in a customer → out comes whether we should market toward him
- How does it get the right answer?
- Initially, all weights are random!
- But we assume we have data for lots of people whom we know to be either interested in our products or not (let's say)
- we have data for both kinds
- So when we plug in one of these customers, we
know what the right answer is supposed to be
17 How NNs work (skip?)
- can use the Backpropagation algorithm
- for each known problem instance, plug it in and look at the answer
- if the answer is wrong, change the edge weights in one way
- otherwise, change them the opposite way (details omitted)
- repeat
- the more iterations we do, the more the NN learns our known data
- with enough confidence, can apply the NN to unknown customer data to learn whether to market toward them
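The slide leaves out the details of backpropagation; as a stand-in, here is the simpler perceptron-style version of the same outer loop (plug in a known instance, compare to the known right answer, nudge the weights one way or the other, repeat):

    def train(instances, weights, threshold, rate=0.1, rounds=100):
        # instances: list of (input_vector, correct_answer), where correct_answer is 0 or 1
        for _ in range(rounds):
            for x, target in instances:
                answer = fires(x, weights, threshold)   # fires() as in the earlier sketch
                error = target - answer                 # 0 if right, +1 or -1 if wrong
                # wrong answer -> shift the weights in the direction that reduces the error
                weights = [w + rate * error * xi for w, xi in zip(weights, x)]
        return weights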
18 LBS example (skip?)
- Investments
- goal: maximize return on investments
- buy/sell the right securities at the right time
- lots of time-series data for different properties of different stocks
- returns, market signals
- pick the right ones
- react
- solution: create a NN for each stock
- retrain weekly
19 Decision Trees
- Another use of (rooted) trees
- Trees but not BSTs
- Each node: one attribute
- Its children: possible values of that attribute
- E.g., each node is some field on a credit app
- Each path from root to leaf is one rule
- If these fields have these values then make this
decision
20 Decision Trees
- Details
- for a binary property, two out-edges, but there may be more
- for a continuous property (income), divide values into discrete ranges
- a property may appear more than once
- Example: top node: history of bankruptcy?
- if yes, REJECT
- if no, then: employed?
- If no, (maybe look for high monthly housing payment)
- If yes, …
- Particular algorithms: ID3, CART, etc.
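The example path above, written out as code (a sketch only; the housing-payment cutoff and the leaf decisions are invented for illustration):

    def credit_decision(app):
        # each if-test is one node; each root-to-leaf path is one rule
        if app["bankruptcy"]:                 # top node: history of bankruptcy?
            return "REJECT"
        if not app["employed"]:               # next node: employed?
            # maybe look for a high monthly housing payment
            return "REJECT" if app["housing_payment"] > 1500 else "REVIEW"
        return "APPROVE"

    print(credit_decision({"bankruptcy": False, "employed": True, "housing_payment": 900}))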
21 Naïve Bayes Classifier
- Bayes' Theorem: Pr(B|A) = Pr(B,A)/Pr(A)
- = (Pr(A|B)·Pr(B))/Pr(A)
- Or: Pr(S|W) = Pr(S,W)/Pr(W)
- = (Pr(W|S)·Pr(S))/Pr(W)
- Used in many spam filters
- W means the msg has the words W1, W2, …, Wn
- S means it's spam
- Goal: given a new msg with certain words, is this spam?
- Is Pr(S|W) > 50%?
22 Naïve Bayes Classifier
- This is supervised learning, so we first have a training phase
- Look at lots of spam messages and my non-spam messages
- Training phase:
- For each word Wi, compute Pr(Wi)
- For each word Wi, compute Pr(Wi|S)
- Compute Pr(S)
- That's it!
- Now, we wait for email to arrive
23 Naïve Bayes Classifier
- When a new msg with words W = W1…Wn arrives, we compute
- Pr(S|W) = (Pr(W|S)·Pr(S))/Pr(W)
- What are Pr(W) and Pr(W|S)?
- Assuming words are independent (obviously false), we have
- Pr(W) = Pr(W1)·Pr(W2)·…·Pr(Wn)
- Pr(W|S) = Pr(W1|S)·Pr(W2|S)·…·Pr(Wn|S)
- Each number here, we have precomputed!
- Except for new words
- To decide the spam status of a message, then, we just do the math!
- Very simple, but works surprisingly well in practice
- Really simple: can write it in a page of Perl code
- See also Paul Graham: http://www.paulgraham.com/spam.html
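A minimal sketch of both phases (training, then scoring a new message); the word model is a bare split on whitespace and there is no smoothing, so treat it as illustrative only:

    from collections import Counter

    def train(spam_msgs, ham_msgs):
        """Precompute Pr(S), Pr(Wi), and Pr(Wi|S) from labeled training messages."""
        all_msgs = spam_msgs + ham_msgs
        pr_s = len(spam_msgs) / len(all_msgs)
        in_any = Counter(w for m in all_msgs for w in set(m.split()))
        in_spam = Counter(w for m in spam_msgs for w in set(m.split()))
        pr_w = {w: c / len(all_msgs) for w, c in in_any.items()}
        pr_w_s = {w: in_spam[w] / len(spam_msgs) for w in in_any}
        return pr_s, pr_w, pr_w_s

    def pr_spam(msg, pr_s, pr_w, pr_w_s):
        """Pr(S|W) = Pr(W|S)*Pr(S)/Pr(W), treating the words as independent."""
        p_w = p_w_s = 1.0
        for w in set(msg.split()):
            if w in pr_w:                      # new words are simply skipped
                p_w *= pr_w[w]
                p_w_s *= pr_w_s[w]
        return p_w_s * pr_s / p_w

    # classify as spam if pr_spam(msg, *model) > 0.5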
24 Naïve Bayes Classifier
- From: "mrs bulama shettima" <aminabulama@pnetmail.co.za>
- Date: September 11, 2004 4:32:14 AM EDT
- To: aminabulama@pnetmail.co.za
- Subject: SALAM.

SALAM. My name is Hajia Amina Shettima Mohammed Bulama the eldest wife of Shettima Mohammed Bulama who was the est. while managing director of the Bank of the North Nig, Plc. I am contacting you in a benevolent spirit utmost confidence and trust to enable us provide a solution to a money transfer of $12M that is presentlt the last resort of my family. My husband has just been retired from office and has also been placed under survelience of which he can not travel out of the country for now due to allegations levelled against him while in office as the Managing director of the bank of the north Nig. Plc of which he is […]

The reason is because no names were used to lodge in the consignment containing the funds. Instead, he used PERSONAL IDENTIFICATION NUMBERS (PIN) and declared the contents as Bearer Bonds and Treasury Bills. Also the firm issued him with a certificate of deposit of the consignment. Note that I have these information in my custody. Right now, my husband has asked me to negotiate with a foreigner who would assist us lay claim to the consignment, as the eldest wife of my husband, I believe that I owe the entire family an obligation to ensure that the US$12M is successfully transferred abroad for investment purposes. With the present situation, I cannot do it all by myself. It is based on this that I am making this contact with you. I have done a thorough homework and fine-tuned the best way to create you as the beneficiary to the consignment containing the funds and effect the transfer of the consignment accordingly. It is rest assured that the modalities I have resolved to finalize the entire project guarantees our safety and the successful transfer of the funds. So, you will be absolutely right when you say that this project is risk free and viable. If you are capable and willing to assist, contact me at once via this email for more details. Believe me, there is no one else we can trust again. All my husband's friends have deserted us after exploiting us on the pretence of trying to help my husband. As it is said, it is at the time of problems that you know your true friends. So long as you keep everything to yourself, we would definitely have no problems. For your assistance, I am ready to give you as much as 25% of the total funds after confirmation in your possession and invest a reasonable percentage into any viable business you may suggest. Please, I need your assistance to make this happen and please do not undermine it because it will also be a source of up liftment to you also. You have absolutely nothing to loose in assisting us instead, you have so much to gain. I will appreciate if you can contact me at my private email mrsaminabulama@gawab.com once. Awaiting your urgent and positive response. Thanks and Allah be with you. Hajia (Mrs) Amina Shettima Bulama.
25 Judging the results
- (Relatively) easy to create a model that does very well on the training data
- What matters: should do well on future, unseen data
- Common problem: overfitting
- Model is too close to the training data
- Needs to be pruned
- Common approach: cross-validation
- Divide training data into 10 parts
- Train on 9/10, test on 1/10
- Do this all 10 ways
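A sketch of 10-fold cross-validation; build_model and accuracy stand in for whatever learning algorithm and scoring function are being evaluated:

    def cross_validate(data, build_model, accuracy, folds=10):
        # train on 9/10 of the data, test on the held-out 1/10, and do this all 10 ways
        scores = []
        for i in range(folds):
            test = data[i::folds]
            training = [x for j, x in enumerate(data) if j % folds != i]
            model = build_model(training)
            scores.append(accuracy(model, test))
        return sum(scores) / folds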
26 Association applications
- Spam filtering
- Network intrusion detection
- Trading strategies
- Political marketing
- Mail-order tulip bulbs → Conservative party voters
- NY Times, last year
27 New topic: Clustering
- Unsupervised learning
- divide items up into clusters of same type
- What are the different types of customers we
have? - What types of webpages are there?
- Each item becomes a point in the space
- One dimension for every property
- Possible plotting for webpages: a dimension for each word
- Value in dimension d is (occurrences of d) / (all word occurrences)
- k-means
28 k-means
- Simple clustering algorithm
- Want to partition data points into sets of similar instances
- Plot in n-space
- → partition points into clumps in space
- Like an unsupervised analog to k-NN
- But iterative
29 k-means
- Algorithm:
- Choose initial means (centers)
- Do:
- assign each point to the closest mean
- recompute the means of the clumps
- until change in means < ε
- Visualization:
- http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
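A minimal sketch of that loop for points given as coordinate tuples; taking the first k points as the initial means is just one simple choice:

    import math

    def kmeans(points, k, eps=1e-6):
        means = points[:k]                               # choose initial means (centers)
        while True:
            # assign each point to the closest mean
            clumps = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: math.dist(p, means[j]))
                clumps[i].append(p)
            # recompute the mean of each clump (keep the old mean if a clump is empty)
            new_means = [tuple(sum(c) / len(clump) for c in zip(*clump)) if clump else means[i]
                         for i, clump in enumerate(clumps)]
            # stop once the change in the means is < eps
            if max(math.dist(m, n) for m, n in zip(means, new_means)) < eps:
                return new_means, clumps
            means = new_means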
30 Amazon's "Page You Made"
- mpj9_at_col
- How does this work?
31 New topic: Frequent Sets
- Find sets of items that go together
- Intuition: market-basket data
- Suppose you're Wal-Mart, with 460TB of data
- http://developers.slashdot.org/article.pl?sid=04/11/14/2057228
- Might not know customer history (but maybe club cards!)
- One thing you do know: groupings of items
- Each set of items rung up in an individual purchase
- Famous (claimed) example: beer and diapers are positively correlated
- Intuition: babies at home, not at bars
- → put chips in between
32 Finding Frequent Sets
- Situation: one table
- Baskets(id, item)
- One row for each purchase of each item
- Alternative interpretation: words on webpages
- Goal: given support s, find sets of items that appear together in > s baskets
- For pairs, obvious attempt:
SELECT B1.item, B2.item, COUNT(B1.id)
FROM Baskets B1, Baskets B2
WHERE B1.id = B2.id AND B1.item < B2.item
GROUP BY B1.item, B2.item
HAVING COUNT(B1.id) > s
What's wrong?
33 A Priori Property
- a priori = prior to
- Prior to investigation
- Deductive/from reason/non-empirical
- As opposed to a posteriori = post-
- After investigation/experience
- Logical observation: if support(X) > s, then for every individual item x in X, support(x) > s
- → any item y with support < s can be ignored
34 A Priori Property
- E.g.: you sell 10,000 items, store records of 1,000,000 baskets, with an avg of 20 items each
- → 20,000,000 rows in Baskets
- pairs of items from 1,000,000 baskets:
- C(20,2) × 1,000,000 = 190,000,000 pairs!
- One idea
- Find support-s items
- Run old query on them
35 A Priori Property
- Suppose we're looking for sup-10,000 pairs
- Each item in a pair must be sup-10,000
- Counting argument: have only 20,000,000 individual item purchases
- Can have at most 2,000 = 20,000,000/10,000 popular items
- Eliminated 4/5 of the item types!
- Actually, probably much more
- Will also lower the average basket size
- Suppose we now have 500,000 baskets with avg basket size 10
- pairs: 500,000 × C(10,2) = 22,500,000
36 A Priori Algorithm
- But a frequent itemset may have > 2 items
- Idea: build frequent sets (not just pairs) iteratively
- Algorithm:
- First, find all frequent items
- These are the size-1 frequent sets
- Next, for each size-k frequent set:
- For each frequent item:
- Check whether the union is frequent
- If so, add it to the collection of size-(k+1) frequent sets
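A sketch of that level-wise construction over baskets represented as Python sets, with support s given as an absolute basket count:

    def apriori(baskets, s):
        def support(itemset):
            return sum(1 for b in baskets if itemset <= b)

        # first, find all frequent items: these are the size-1 frequent sets
        items = {i for b in baskets for i in b}
        freq_items = {i for i in items if support(frozenset([i])) > s}
        levels = [{frozenset([i]) for i in freq_items}]
        # next, for each size-k frequent set and each frequent item,
        # check whether the union is frequent; if so, it is a size-(k+1) frequent set
        while levels[-1]:
            nxt = set()
            for fs in levels[-1]:
                for i in freq_items - fs:
                    if support(fs | {i}) > s:
                        nxt.add(fs | {i})
            levels.append(nxt)
        return [fs for level in levels for fs in level]

    # e.g. apriori([{"beer", "diapers", "chips"}, {"beer", "diapers"}, {"chips"}], s=1)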
37 Mining for rules
- Frequent sets tell us which things go together
- But they don't tell us causality
- Sometimes want to say a causes b
- Or at least: the presence of a makes b likely
- E.g.: razor purchase → future razor-blade purchases
- but not vice versa
- → make razors cheap
38 Association rules
- Here's some data: what implies what?
39 Association rules
- Candidates:
- Pen → ink
- Ink → pen
- Water → juice
- Etc.
- How to pick?
- Two main measures
40 Judging association rules
- Support
- Support(X → Y) = support(X ∪ Y)
- = Pr(X ∪ Y)
- Does the rule matter?
- Confidence
- Conf(X → Y) = support(X ∪ Y) / support(X)
- = Pr(Y|X)
- Should we believe the rule?
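A short sketch that computes both measures directly from a list of baskets (the baskets below are made up just to give numbers for the candidate rules):

    def support(itemset, baskets):
        # support(Z) = fraction of baskets containing every item in Z
        return sum(1 for b in baskets if itemset <= b) / len(baskets)

    def confidence(x, y, baskets):
        # conf(X -> Y) = support(X ∪ Y) / support(X) = Pr(Y|X)
        return support(x | y, baskets) / support(x, baskets)

    baskets = [{"pen", "ink"}, {"pen", "ink", "water"}, {"pen", "water", "juice"}, {"pen"}]
    print(support({"pen", "ink"}, baskets))        # support of pen -> ink and ink -> pen: 0.5
    print(confidence({"pen"}, {"ink"}, baskets))   # 0.5
    print(confidence({"ink"}, {"pen"}, baskets))   # 1.0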
41 Association rules
- What are the supports and confidences?
- Pen → ink
- Ink → pen
- Water → juice
- Which matter?
42 Discovering association rules
- Association rules only matter if their support and confidence are both high enough
- The user specifies the minimum allowed for each
- First, support:
- High support(X → Y) → high support(X ∪ Y)
- → X ∪ Y is a frequent set
- So, to find a good association rule:
- Generate a frequent set Z
- Divide it into subsets X, Y in all possible ways
- Check if conf(X → Y) is high enough
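A sketch of that procedure, reusing the confidence() helper from the earlier support/confidence sketch; min_conf is the user's minimum confidence:

    from itertools import combinations

    def rules_from_frequent_set(z, baskets, min_conf):
        # z: a frequent set (a Python set); divide it into X and Y in all possible ways,
        # keeping X -> Y whenever its confidence is high enough
        rules = []
        for r in range(1, len(z)):
            for xs in combinations(z, r):
                x = set(xs)
                y = z - x
                if confidence(x, y, baskets) >= min_conf:
                    rules.append((x, y))
        return rules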
43 Association rules
- Suppose {pen, ink} is frequent
- Divide it into subsets both ways:
- Pen → ink
- Ink → pen
- Which do we choose?
44 Other kinds of baskets
- Here, the basket was a single transaction (purchase)
- But could have other baskets:
- All purchases from each customer
- All purchases on the first day of each month
- Etc.
45 Frequent set/association applications
- Store/website/catalog layout
- Page You Made
- Direct marketing
- Fraud detection
46 Mining v. warehousing
- Warehousing: let the user search and group by interesting properties
- "Give me the sales of A4s by year and dealer, for these colors"
- The user tries to learn from the results which properties are important/interesting
- What's driving sales?
- Mining: tell the user what the interesting properties are
- How can I increase sales?
47 Social/political concerns
- Privacy
- TIA
- Sensitive data
- Allow mining but not queries
- Opt-in/opt-out
- "Don't be evil."
48 For more info
- See Dhar & Stein, Seven Methods for Transforming Corporate Data into Business Intelligence (1997)
- Drawn on above
- A few years old, but very accessible
- http://www.kdnuggets.com/
- Data mining courses offered here
49 Future
- RAID
- Websearch
- Proj5 due
- Print by 10:30 and turn in on time (at 11)
- If not, email
- Final Exam: next Thursday, 5/5, 10-11:50am
- Info is up