Title: C20.0046: Database Management Systems Lecture
1 C20.0046: Database Management Systems, Lecture 26
- M.P. Johnson
- Stern School of Business, NYU
- Spring, 2005
2 Agenda
- Last time
- OLAP Data Warehouses
- Data Mining
- Websearch
- Etc.
3 Goals after today
- Be aware of what problems DM solves
- What the algorithms are
- Supervised v. unsupervised learning
- Understand how to find frequent sets and why
4 New topic: Data Mining
- Situation: data rich but knowledge poor
- Terabytes and terabytes of data
- Searching for needles in haystacks
- Can collect lots of data
- credit card transactions
- bar codes
- club cards
- webserver logs
- call centers
- 311 calls in NYC
- cameras
5 Lots of data
- Can store this data
- DBMSs, data warehousing
- Can query it
- SQL
- DW/OLAP queries
- "Find the things such that …"
- "Find the totals for each of …"
- Can get answers to specific questions, but what
does it all mean?
6 But how to learn from this data?
- What kinds of people will buy my widgets?
- Whom should I send my direct-mail literature to?
- How can I segment my market?
- http://www.joelonsoftware.com/articles/CamelsandRubberDuckies.html
- Whom should we approve for a gold card?
- Who gets approved for car insurance?
- And what should we charge?
7 Knowledge Discovery
- Goal: extract interesting/actionable knowledge from large collections of data
- Finding rules
- Finding patterns
- Classifying instances
- KD = DM = Business Analytics
- Business Intelligence = semi-intelligent reporting/OLAP
8 Data Mining: at the intersection of disciplines
- DBMS
- query processing, data warehousing, OLAP
- Statistics
- mathematical modeling, regression
- Visualization
- 3d representation
- AI/machine learning
- neural networks, decision trees, etc.
- Except the algorithms must really scale
9 ML/DM: two basic kinds
- Supervised learning
- Classification
- Regression
- Unsupervised learning
- Clustering
- Or: find something interesting
10 Supervised learning
- Situation: you have many particular instances
- Transactions
- Customers
- Credit applications
- Have various fields describing them
- But one other property you're interested in
- E.g., credit-worthiness
- Goal: infer this dependent property from the other data
11 Supervised learning
- Supervised learning starts with training data
- Many instances including the dependent property
- Use the training data to build a model
- Many different algorithms
- Given the model, you can now determine the dependent property for new instances
- And ideally, the answers are correct
- Categorical property → classification
- Numerical property → regression
12 k-Nearest Neighbor
- Very simple algorithm
- Sometimes works
- Map training data points in a space
- Given any two points, can compute the distance between them
- E.g., Euclidean distance
- Given an unlabeled point, label it this way:
- Find the k nearest points
- Let them vote on the label
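A minimal sketch of the idea in Python (the feature vectors, labels, and choice of k here are illustrative, not from the lecture):

    from collections import Counter
    import math

    def knn_classify(training, query, k=3):
        """training: list of (feature_vector, label); query: an unlabeled feature vector."""
        # distance from the query point to every training point (Euclidean)
        dists = [(math.dist(query, x), label) for x, label in training]
        # find the k nearest points and let them vote on the label
        nearest = sorted(dists)[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    # toy data: (income in $K, age) -> creditworthiness
    training = [((30, 22), "reject"), ((25, 30), "reject"),
                ((80, 45), "approve"), ((75, 38), "approve")]
    print(knn_classify(training, (70, 40)))   # -> "approve"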
13 Neural Networks (skip?)
- Hill climbing
- based on connections between neurons in brain
- simple NN
- input layer with 3 nodes
- hidden layer with 2 nodes
- output layer with 1 node
- each node points to each node in next level
- each node has some activation level and something like a critical mass
- Draw picture
- What kind of graph is this?
14 Neural Networks (skip?)
- values passed into input nodes represent the problem instance
- given the weighted sum of its inputs, a neuron sends out a pulse only if the sum is greater than its threshold
- values output by hidden nodes are sent to the output node
- if the weighted sum going into output node is
high enough, it outputs 1, otherwise 0
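A small sketch of one such threshold unit and of the 3-2-1 network described above (the weights and thresholds would normally be learned; here they are just parameters):

    def fires(inputs, weights, threshold):
        # a node sends out a pulse (1) only if its weighted input sum exceeds its threshold
        total = sum(w * x for w, x in zip(weights, inputs))
        return 1 if total > threshold else 0

    def tiny_nn(x, hidden_ws, hidden_ts, out_w, out_t):
        # input layer: the 3 values in x; hidden layer: 2 threshold units; output layer: 1 unit
        hidden = [fires(x, w, t) for w, t in zip(hidden_ws, hidden_ts)]
        return fires(hidden, out_w, out_t)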
15 NN applications (skip?)
- plausible application: we have data about potential customers
- party registration
- married or not
- gender
- income level
- or have credit applicant information
- employment
- income
- home ownership
- bankruptcy
- should we give a credit card to him?
16 How NNs work (skip?)
- hope: plug in a customer → out comes whether we should market toward him
- How does it get the right answer?
- Initially, all weights are random!
- But we assume we have data for lots of people whom we know to be either interested in our products or not (let's say)
- we have data for both kinds
- So when we plug in one of these customers, we
know what the right answer is supposed to be
17 How NNs work (skip?)
- can use the Backpropagation algorithm
- for each known problem instance, plug it in and look at the answer
- if the answer is wrong, change the edge weights in one way
- otherwise, change them the opposite way (details omitted)
- repeat
- the more iterations we do, the more the NN learns our known data
- with enough confidence, can apply the NN to unknown customer data to learn whether to market toward them
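The slide leaves out the details of backpropagation; as a stand-in, here is the simpler perceptron-style version of the same outer loop (plug in a known instance, compare to the known right answer, nudge the weights one way or the other, repeat):

    def train(instances, weights, threshold, rate=0.1, rounds=100):
        # instances: list of (input_vector, correct_answer), where correct_answer is 0 or 1
        for _ in range(rounds):
            for x, target in instances:
                answer = fires(x, weights, threshold)   # fires() as in the earlier sketch
                error = target - answer                 # 0 if right, +1 or -1 if wrong
                # wrong answer -> shift the weights in the direction that reduces the error
                weights = [w + rate * error * xi for w, xi in zip(weights, x)]
        return weights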
18 LBS example (skip?)
- Investments
- goal: maximize return on investments
- buy/sell the right securities at the right time
- lots of time-series data for different properties of different stocks
- returns, market signals
- pick the right ones
- react
- solution: create a NN for each stock
- retrain weekly
19 Decision Trees
- Another use of (rooted) trees
- Trees but not BSTs
- Each node: one attribute
- Its children: possible values of that attribute
- E.g., each node is some field on a credit app
- Each path from root to leaf is one rule
- If these fields have these values then make this
decision
20 Decision Trees
- Details
- for a binary property, two out-edges, but there may be more
- for a continuous property (income), divide values into discrete ranges
- a property may appear more than once
- Example: top node: history of bankruptcy?
- if yes, REJECT
- if no, then: employed?
- If no, (maybe look for high monthly housing payment)
- If yes, …
- Particular algorithms: ID3, CART, etc.
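The example path above, written out as code (a sketch only; the housing-payment cutoff and the leaf decisions are invented for illustration):

    def credit_decision(app):
        # each if-test is one node; each root-to-leaf path is one rule
        if app["bankruptcy"]:                 # top node: history of bankruptcy?
            return "REJECT"
        if not app["employed"]:               # next node: employed?
            # maybe look for a high monthly housing payment
            return "REJECT" if app["housing_payment"] > 1500 else "REVIEW"
        return "APPROVE"

    print(credit_decision({"bankruptcy": False, "employed": True, "housing_payment": 900}))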
21 Naïve Bayes Classifier
- Bayes' Theorem: Pr(B|A) = Pr(B,A)/Pr(A)
- = (Pr(A|B)·Pr(B))/Pr(A)
- Or: Pr(S|W) = Pr(S,W)/Pr(W)
- = (Pr(W|S)·Pr(S))/Pr(W)
- Used in many spam filters
- W means the msg has the words W1, W2, …, Wn
- S means it's spam
- Goal: given a new msg with certain words, is this spam?
- Is Pr(S|W) > 50%?
22 Naïve Bayes Classifier
- This is supervised learning, so we first have a training phase
- Look at lots of spam messages and my non-spam messages
- Training phase:
- For each word Wi, compute Pr(Wi)
- For each word Wi, compute Pr(Wi|S)
- Compute Pr(S)
- That's it!
- Now, we wait for email to arrive
23 Naïve Bayes Classifier
- When a new msg with words W = W1…Wn arrives, we compute
- Pr(S|W) = (Pr(W|S)·Pr(S))/Pr(W)
- What are Pr(W) and Pr(W|S)?
- Assuming words are independent (obviously false), we have
- Pr(W) = Pr(W1)·Pr(W2)·…·Pr(Wn)
- Pr(W|S) = Pr(W1|S)·Pr(W2|S)·…·Pr(Wn|S)
- Each number here, we have precomputed!
- Except for new words
- To decide the spam status of a message, then, we just do the math!
- Very simple, but works surprisingly well in practice
- Really simple: can write it in a page of Perl code
- See also Paul Graham: http://www.paulgraham.com/spam.html
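A minimal sketch of both phases (training, then scoring a new message); the word model is a bare split on whitespace and there is no smoothing, so treat it as illustrative only:

    from collections import Counter

    def train(spam_msgs, ham_msgs):
        """Precompute Pr(S), Pr(Wi), and Pr(Wi|S) from labeled training messages."""
        all_msgs = spam_msgs + ham_msgs
        pr_s = len(spam_msgs) / len(all_msgs)
        in_any = Counter(w for m in all_msgs for w in set(m.split()))
        in_spam = Counter(w for m in spam_msgs for w in set(m.split()))
        pr_w = {w: c / len(all_msgs) for w, c in in_any.items()}
        pr_w_s = {w: in_spam[w] / len(spam_msgs) for w in in_any}
        return pr_s, pr_w, pr_w_s

    def pr_spam(msg, pr_s, pr_w, pr_w_s):
        """Pr(S|W) = Pr(W|S)*Pr(S)/Pr(W), treating the words as independent."""
        p_w = p_w_s = 1.0
        for w in set(msg.split()):
            if w in pr_w:                      # new words are simply skipped
                p_w *= pr_w[w]
                p_w_s *= pr_w_s[w]
        return p_w_s * pr_s / p_w

    # classify as spam if pr_spam(msg, *model) > 0.5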
24 Naïve Bayes Classifier
- From: "mrs bulama shettima" <aminabulama@pnetmail.co.za>
- Date: September 11, 2004 4:32:14 AM EDT
- To: aminabulama@pnetmail.co.za
- Subject: SALAM.

SALAM. My name is Hajia Amina Shettima Mohammed Bulama the eldest wife of Shettima Mohammed Bulama who was the est. while managing director of the Bank of the North Nig, Plc. I am contacting you in a benevolent spirit utmost confidence and trust to enable us provide a solution to a money transfer of $12M that is presentlt the last resort of my family. My husband has just been retired from office and has also been placed under survelience of which he can not travel out of the country for now due to allegations levelled against him while in office as the Managing director of the bank of the north Nig. Plc of which he is […]

The reason is because no names were used to lodge in the consignment containing the funds. Instead, he used PERSONAL IDENTIFICATION NUMBERS (PIN) and declared the contents as Bearer Bonds and Treasury Bills. Also the firm issued him with a certificate of deposit of the consignment. Note that I have these information in my custody. Right now, my husband has asked me to negotiate with a foreigner who would assist us lay claim to the consignment, as the eldest wife of my husband, I believe that I owe the entire family an obligation to ensure that the US$12M is successfully transferred abroad for investment purposes. With the present situation, I cannot do it all by myself. It is based on this that I am making this contact with you. I have done a thorough homework and fine-tuned the best way to create you as the beneficiary to the consignment containing the funds and effect the transfer of the consignment accordingly. It is rest assured that the modalities I have resolved to finalize the entire project guarantees our safety and the successful transfer of the funds. So, you will be absolutely right when you say that this project is risk free and viable. If you are capable and willing to assist, contact me at once via this email for more details. Believe me, there is no one else we can trust again. All my husband's friends have deserted us after exploiting us on the pretence of trying to help my husband. As it is said, it is at the time of problems that you know your true friends. So long as you keep everything to yourself, we would definitely have no problems. For your assistance, I am ready to give you as much as 25% of the total funds after confirmation in your possession and invest a reasonable percentage into any viable business you may suggest. Please, I need your assistance to make this happen and please do not undermine it because it will also be a source of up liftment to you also. You have absolutely nothing to loose in assisting us instead, you have so much to gain. I will appreciate if you can contact me at my private email mrsaminabulama@gawab.com once. Awaiting your urgent and positive response. Thanks and Allah be with you. Hajia (Mrs) Amina Shettima Bulama.
25 Judging the results
- (Relatively) easy to create a model that does very well on the training data
- What matters: should do well on future, unseen data
- Common problem: overfitting
- Model is too close to the training data
- Needs to be pruned
- Common approach: cross-validation
- Divide training data into 10 parts
- Train on 9/10, test on 1/10
- Do this all 10 ways
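A sketch of 10-fold cross-validation; build_model and accuracy stand in for whatever learning algorithm and scoring function are being evaluated:

    def cross_validate(data, build_model, accuracy, folds=10):
        # train on 9/10 of the data, test on the held-out 1/10, and do this all 10 ways
        scores = []
        for i in range(folds):
            test = data[i::folds]
            training = [x for j, x in enumerate(data) if j % folds != i]
            model = build_model(training)
            scores.append(accuracy(model, test))
        return sum(scores) / folds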
26 Association applications
- Spam filtering
- Network intrusion detection
- Trading strategies
- Political marketing
- Mail-order tulip bulbs → Conservative party voters
- NY Times, last year
27 New topic: Clustering
- Unsupervised learning
- divide items up into clusters of same type
- What are the different types of customers we
have? - What types of webpages are there?
- Each item becomes a point in the space
- One dimension for every property
- Possible plotting for webpages: a dimension for each word
- Value in dimension d is (occurrences of d) / (all word occurrences)
- k-means
28 k-means
- Simple clustering algorithm
- Want to partition data points into sets of similar instances
- Plot in n-space
- → partition points into clumps in space
- Like an unsupervised analog to k-NN
- But iterative
29 k-means
- Algorithm:
- Choose initial means (centers)
- Do:
- assign each point to the closest mean
- recompute the means of the clumps
- until change in means < ε
- Visualization:
- http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
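A minimal sketch of that loop for points given as coordinate tuples; taking the first k points as the initial means is just one simple choice:

    import math

    def kmeans(points, k, eps=1e-6):
        means = points[:k]                               # choose initial means (centers)
        while True:
            # assign each point to the closest mean
            clumps = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: math.dist(p, means[j]))
                clumps[i].append(p)
            # recompute the mean of each clump (keep the old mean if a clump is empty)
            new_means = [tuple(sum(c) / len(clump) for c in zip(*clump)) if clump else means[i]
                         for i, clump in enumerate(clumps)]
            # stop once the change in the means is < eps
            if max(math.dist(m, n) for m, n in zip(means, new_means)) < eps:
                return new_means, clumps
            means = new_means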
30 Amazon's "Page You Made"
- mpj9_at_col
- How does this work?
31 New topic: Frequent Sets
- Find sets of items that go together
- Intuition: market-basket data
- Suppose you're Wal-Mart, with 460TB of data
- http://developers.slashdot.org/article.pl?sid=04/11/14/2057228
- Might not know customer history (but maybe club cards!)
- One thing you do know: groupings of items
- Each set of items rung up in an individual purchase
- Famous (claimed) example: beer and diapers are positively correlated
- Intuition: babies at home, not at bars
- → put chips in between
32 Finding Frequent Sets
- Situation: one table
- Baskets(id, item)
- One row for each purchase of each item
- Alternative interpretation: words on webpages
- Goal: given support s, find sets of items that appear together in > s baskets
- For pairs, obvious attempt:
SELECT B1.item, B2.item, COUNT(B1.id)
FROM Baskets B1, Baskets B2
WHERE B1.id = B2.id AND B1.item < B2.item
GROUP BY B1.item, B2.item
HAVING COUNT(B1.id) > s
What's wrong?
33 A Priori Property
- a priori = prior to
- Prior to investigation
- Deductive/from reason/non-empirical
- As opposed to a posteriori = post-
- After investigation/experience
- Logical observation: if support(X) > s, then for every individual item x in X, support(x) > s
- → any item y with support < s can be ignored
34 A Priori Property
- E.g.: you sell 10,000 items, store records of 1,000,000 baskets, with an avg of 20 items each
- → 20,000,000 rows in Baskets
- pairs of items from 1,000,000 baskets:
- C(20,2) × 1,000,000 = 190,000,000 pairs!
- One idea
- Find support-s items
- Run old query on them
35 A Priori Property
- Suppose we're looking for sup-10,000 pairs
- Each item in a pair must be sup-10,000
- Counting argument: have only 20,000,000 individual item purchases
- Can have at most 2,000 = 20,000,000/10,000 popular items
- Eliminated 4/5 of the item types!
- Actually, probably much more
- Will also lower the average basket size
- Suppose we now have 500,000 baskets with avg basket size 10
- pairs: 500,000 × C(10,2) = 22,500,000
36 A Priori Algorithm
- But a frequent itemset may have > 2 items
- Idea: build frequent sets (not just pairs) iteratively
- Algorithm:
- First, find all frequent items
- These are the size-1 frequent sets
- Next, for each size-k frequent set:
- For each frequent item:
- Check whether the union is frequent
- If so, add it to the collection of size-(k+1) frequent sets
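A sketch of that level-wise construction over baskets represented as Python sets, with support s given as an absolute basket count:

    def apriori(baskets, s):
        def support(itemset):
            return sum(1 for b in baskets if itemset <= b)

        # first, find all frequent items: these are the size-1 frequent sets
        items = {i for b in baskets for i in b}
        freq_items = {i for i in items if support(frozenset([i])) > s}
        levels = [{frozenset([i]) for i in freq_items}]
        # next, for each size-k frequent set and each frequent item,
        # check whether the union is frequent; if so, it is a size-(k+1) frequent set
        while levels[-1]:
            nxt = set()
            for fs in levels[-1]:
                for i in freq_items - fs:
                    if support(fs | {i}) > s:
                        nxt.add(fs | {i})
            levels.append(nxt)
        return [fs for level in levels for fs in level]

    # e.g. apriori([{"beer", "diapers", "chips"}, {"beer", "diapers"}, {"chips"}], s=1)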
37 Mining for rules
- Frequent sets tell us which things go together
- But they don't tell us causality
- Sometimes want to say a causes b
- Or at least: the presence of a makes b likely
- E.g.: razor purchase → future razor-blade purchases
- but not vice versa
- → make razors cheap
38 Association rules
- Here's some data: what implies what?
39 Association rules
- Candidates:
- Pen → ink
- Ink → pen
- Water → juice
- Etc.
- How to pick?
- Two main measures
40 Judging association rules
- Support
- Support(X → Y) = support(X ∪ Y)
- = Pr(X ∪ Y)
- Does the rule matter?
- Confidence
- Conf(X → Y) = support(X ∪ Y) / support(X)
- = Pr(Y|X)
- Should we believe the rule?
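A short sketch that computes both measures directly from a list of baskets (the baskets below are made up just to give numbers for the candidate rules):

    def support(itemset, baskets):
        # support(Z) = fraction of baskets containing every item in Z
        return sum(1 for b in baskets if itemset <= b) / len(baskets)

    def confidence(x, y, baskets):
        # conf(X -> Y) = support(X ∪ Y) / support(X) = Pr(Y|X)
        return support(x | y, baskets) / support(x, baskets)

    baskets = [{"pen", "ink"}, {"pen", "ink", "water"}, {"pen", "water", "juice"}, {"pen"}]
    print(support({"pen", "ink"}, baskets))        # support of pen -> ink and ink -> pen: 0.5
    print(confidence({"pen"}, {"ink"}, baskets))   # 0.5
    print(confidence({"ink"}, {"pen"}, baskets))   # 1.0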
41 Association rules
- What are the supports and confidences?
- Pen → ink
- Ink → pen
- Water → juice
- Which matter?
42 Discovering association rules
- Association rules only matter if their support and confidence are both high enough
- The user specifies the minimum allowed for each
- First, support:
- High support(X → Y) → high support(X ∪ Y)
- → X ∪ Y is a frequent set
- So, to find a good association rule:
- Generate a frequent set Z
- Divide it into subsets X, Y in all possible ways
- Check if conf(X → Y) is high enough
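A sketch of that procedure, reusing the confidence() helper from the earlier support/confidence sketch; min_conf is the user's minimum confidence:

    from itertools import combinations

    def rules_from_frequent_set(z, baskets, min_conf):
        # z: a frequent set (a Python set); divide it into X and Y in all possible ways,
        # keeping X -> Y whenever its confidence is high enough
        rules = []
        for r in range(1, len(z)):
            for xs in combinations(z, r):
                x = set(xs)
                y = z - x
                if confidence(x, y, baskets) >= min_conf:
                    rules.append((x, y))
        return rules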
43 Association rules
- Suppose {pen, ink} is frequent
- Divide it into subsets both ways:
- Pen → ink
- Ink → pen
- Which do we choose?
44 Other kinds of baskets
- Here, the basket was a single transaction (purchase)
- But could have other baskets:
- All purchases from each customer
- All purchases on the first day of each month
- Etc.
45 Frequent set/association applications
- Store/website/catalog layout
- Page You Made
- Direct marketing
- Fraud detection
46 Mining v. warehousing
- Warehousing: let the user search and group by interesting properties
- "Give me the sales of A4s by year and dealer, for these colors"
- The user tries to learn from the results which properties are important/interesting
- What's driving sales?
- Mining: tell the user what the interesting properties are
- How can I increase sales?
47 Social/political concerns
- Privacy
- TIA
- Sensitive data
- Allow mining but not queries
- Opt-in/opt-out
- "Don't be evil."
48 For more info
- See Dhar & Stein, Seven Methods for Transforming Corporate Data into Business Intelligence (1997)
- Drawn on above
- A few years old, but very accessible
- http://www.kdnuggets.com/
- Data mining courses offered here
49 Future
- RAID
- Websearch
- Proj5 due
- Print by 10:30 and turn in on time (at 11)
- If not, email
- Final Exam: next Thursday, 5/5, 10-11:50am
- Info is up