Title: Advanced Database Systems F24DS2 F29AT2
Slide 1: Advanced Database Systems, F24DS2 / F29AT2
Slide 2: Data Mining has Many Meanings
- There are lots of things you can do with a database.
- Access data via straightforward queries, to answer straightforward questions about instances. E.g.
  - What is Ellen McArthur's home phone number?
  - What is the ISBN of "Eats, Shoots and Leaves" by Lynne Truss?
  - What grade did Larry Page get for the Internet module?
  - Give me a list of all pages on the WWW that contain the phrase "fried egg sandwich".
Slide 3: Data Mining has Many Meanings (cont.)
- Generate simple reports about data via straightforward queries, to answer questions about sets of instances. E.g.
  - How many of our customers are called Trevor?
  - Which of our books has been borrowed more times in the last month than "Eats, Shoots and Leaves" by Lynne Truss?
  - Which student has the highest average mark?
  - What percentage of house owners also own a car?
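The difference between accessing and reporting can be sketched in code. The snippet below is an illustration only: the `customers` table, its columns and its three rows are made up for this example, and Python's built-in `sqlite3` module stands in for whatever database is actually in use.

```python
import sqlite3

# Hypothetical in-memory table; names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO customers (first_name, phone) VALUES (?, ?)",
    [("Trevor", "555-0101"), ("Ellen", "555-0102"), ("Trevor", "555-0103")],
)

# Accessing: a question about one instance.
phone = conn.execute(
    "SELECT phone FROM customers WHERE first_name = ?", ("Ellen",)
).fetchone()[0]

# Reporting: a question about a set of instances.
trevor_count = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE first_name = ?", ("Trevor",)
).fetchone()[0]

print(phone, trevor_count)
```

Both questions are answered by a single query that the user already knows how to write, which is what separates them from the later kinds of data mining.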
Slide 4: Data Mining has Many Meanings (cont.)
- Generate complex and/or comprehensive statistical reports about the database as a whole, to summarise and understand the data. This is what tends to be done in the analysis stage of data cleaning. E.g.
  - For each field, generate a histogram of the values.
  - Run one or more clustering algorithms to find the clusters in the data.
Slide 5: Data Mining has Many Meanings (cont.)
- Build predictive models that may then be useful for business or research. For example:
  - Based on stock market data, we can construct a model that attempts to predict tomorrow's Dow index closing price, given the previous few days' prices.
  - Based on blood test data from past patients, we can construct a model that attempts to predict whether or not a patient is developing hepatitis.
  - Based on historic data on vibrations, we can build a model that tries to predict beforehand if an aircraft wing is likely to fail.
Slide 6: Data Mining has Many Meanings (cont.)
- Discover INTERESTING and USEFUL rules that are hidden in the data. For example:
  - An analysis of supermarket basket data will show a surprising number of baskets that contain both beer and nappies.
  - Analysis of crime records data may find that the violent crime rate in Newcastle seems to reduce significantly whenever the violent crime rate in Sunderland increases significantly.
Slide 7: All Together
1. Accessing
2. Reporting
3. Clustering/Histograms
4. Predictive models
5. Discovery of interesting/surprising things
Slide 8: Data Mining
1. Accessing
2. Reporting
3. Clustering/Histograms
4. Predictive models
5. Discovery of interesting/surprising things
- When you hear the term "data mining", it can mean any of 2, 3, 4 and 5. In business/industry, 2 and 3 are called data mining. In academia we usually take data mining to mean mainly 4 and 5.
Slide 9: Notes on 3/4
1. Accessing
2. Reporting
3. Clustering/Histograms
4. Predictive models
5. Discovery of interesting/surprising things
- These are the things that you would look at more closely in a machine learning course. The predictive models are things like neural networks, decision trees and rulesets.
Slide 10: What DM means for us
1. Accessing
2. Reporting
3. Clustering/Histograms
4. Predictive models
5. Discovery of interesting/surprising things
- So, 3 and 4 are dealt with in another course. 5 could be an entire MSc course on its own, but that is what DM means for us.
- In particular, we take a small bite of it that is relevant to practical discovery of interesting things in very large DBs. We look at a fast algorithm that can discover interesting rules in transaction databases, and that is a component in several advanced commercial systems.
Slide 11: First, some important motivational/explanatory notes
- Why do we need something like type 5 data mining (discovery of interesting things) at all? Couldn't the beer and nappies thing have been found by types 2 or 3 DM?
- The next slide shows a tiny supermarket basket database. E.g. record 11 is a customer who bought eggs and glue only; record 12 records a transaction where the basket contained only apples.
Slide 12: [Table: basket database with columns ID, apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream]
Slide 13: Numbers
- Our example DB has 20 records of supermarket transactions, from a supermarket that only sells 9 things.
- One month in a large supermarket with five stores spread around a reasonably sized city might easily yield a DB of 20,000,000 baskets, each containing a set of products from a pool of around 1,000.
Slide 14: Rules
- A rule is something like this:
  - If a basket contains apples and cheese, then it also contains beer.
- Any such rule has two associated measures:
  - confidence: when the "if" part is true, how often is the "then" bit true?
  - coverage (or support): how much of the database contains the "if" part?
Slide 15: Example
- What are the confidence and coverage of:
  - If the basket contains beer and cheese, then it also contains honey
- 2/20 of the records contain both beer and cheese, so coverage is 10%.
- Of these 2, 1 contains honey, so confidence is 50%.
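The two measures can be computed directly from their definitions. The following is a minimal sketch, not the lecture's own code: the function name `rule_stats` and the five toy baskets are assumptions made for illustration, and they are not the 20-record database from the slides.

```python
def rule_stats(baskets, if_items, then_items):
    """Return (coverage, confidence) of the rule 'if if_items then then_items'."""
    if_items, then_items = set(if_items), set(then_items)
    # Baskets where the "if" part holds.
    matching = [b for b in baskets if if_items <= b]
    coverage = len(matching) / len(baskets)
    if not matching:
        return coverage, None  # confidence is undefined with no matching baskets
    # Of those, baskets where the "then" part also holds.
    hits = sum(1 for b in matching if then_items <= b)
    return coverage, hits / len(matching)


# Made-up baskets, for illustration only.
toy_baskets = [
    {"beer", "cheese", "honey"},
    {"beer", "cheese"},
    {"apples"},
    {"eggs", "glue"},
    {"beer", "fish"},
]

coverage, confidence = rule_stats(toy_baskets, {"beer", "cheese"}, {"honey"})
print(coverage, confidence)
```

With these toy baskets, 2 of 5 contain beer and cheese and 1 of those 2 contains honey, giving coverage 0.4 and confidence 0.5. Coverage here counts baskets that match the "if" part only, following the definition on the slide.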
Slide 16: Interesting/Useful rules
- Statistically, anything that is interesting is something that happens significantly more than you would expect by chance.
- E.g. basic statistical analysis of basket data may show that 10% of baskets contain bread, and 4% of baskets contain washing-up powder. I.e.
  - There is a probability of 0.1 that a basket contains bread.
  - There is a probability of 0.04 that a basket contains washing-up powder.
Slide 17: Bread and washing-up powder
- What is the probability of a basket containing both bread and washing-up powder? The laws of probability say:
  - If these two things are independent, the chance is 0.1 × 0.04 = 0.004.
  - That is, we would expect 0.4% of baskets to contain both bread and washing-up powder.
Slide 18: Interesting means surprising
- We therefore have a prior expectation that just 4 in 1,000 baskets should contain both bread and washing-up powder.
- If we investigate, and discover that really it is 20 in 1,000 baskets, then we will be very surprised. It tells us that:
  - Something is going on in shoppers' minds: bread and washing-up powder are connected in some way.
  - There may be ways to exploit this discovery: put the powder and bread at opposite ends of the supermarket?
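The comparison between expected and observed co-occurrence can be written as a short calculation. This sketch uses the figures from the slides; the variable names and the ratio itself (often called lift in the association-rule literature) are additions for illustration rather than something the lecture defines.

```python
# Figures taken from the slides.
p_bread = 0.10             # 10% of baskets contain bread
p_powder = 0.04            # 4% of baskets contain washing-up powder
observed_both = 20 / 1000  # 20 in 1,000 baskets contain both

# Expected co-occurrence if the two purchases were independent.
expected_both = p_bread * p_powder

# Ratio of observed to expected; values well above 1 are "surprising".
ratio = observed_both / expected_both

print(f"expected {expected_both:.3f}, observed {observed_both:.3f}, ratio {ratio:.1f}")
```

Here the pair turns up about five times as often as independence would predict.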
Slide 19: Finding surprising rules
- Suppose we ask "what is the most surprising rule in this database?" This would be, presumably, a rule whose accuracy (confidence) is more different from its expected accuracy than any other's. But it also has to have a suitable level of coverage, or else it may be just a statistical blip, and/or unexploitable.
- Looking only at rules of the form:
  - if basket contains X and Y, then it also contains Z
- our realistic numbers tell us that there may be around 500,000,000 distinct possible rules. For each of these we need to work out its accuracy and coverage, by trawling through a database of around 20,000,000 basket records: c. 10^16 operations.
- Yes, it's easy to use type 2 DM (reporting), say, to work out the confidence and coverage of a given rule. But type 5 DM is all about searching through, somehow, 500,000,000 (or usually immensely more) rules to sniff out what may be the interesting ones.
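The figure of around 500,000,000 rules can be checked with a little counting. The sketch below assumes, as a simplification, that X and Y are an unordered pair of distinct products and Z is any third product; the variable names are illustrative.

```python
from math import comb

n_products = 1_000      # pool of products, from the slides
n_baskets = 20_000_000  # basket records, from the slides

# Choose the unordered pair {X, Y}, then any of the remaining products as Z.
n_rules = comb(n_products, 2) * (n_products - 2)

# One pass over every basket for every rule.
n_checks = n_rules * n_baskets

print(f"{n_rules:,} rules, about {n_checks:.1e} basket checks")
```

Under that assumption the count is 498,501,000 rules, consistent with the "around 500,000,000" quoted on the slide, and checking each one against every basket takes roughly 10^16 basket inspections.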
Slide 20: Here are some interesting ones in our mini basket DB
- If a basket contains glue, then it also contains either beer or eggs
  - confidence 100%, coverage 25%
- If a basket contains apples and dates, then it also contains honey
  - confidence 100%, coverage 20%
Slide 21: What this lecture was about
- The many different meanings of data mining
- Warming up for the next lecture, via gentle
discussion on transaction databases, rules,
confidence, coverage, and what it takes for a
rule to be interesting.
Slide 22: Next
- A classic fast algorithm for finding useful rules in large databases.