Advanced Database Systems F24DS2 F29AT2 - PowerPoint PPT Presentation

Slides: 23
Provided by: macs3

1
Advanced Database Systems F24DS2 / F29AT2
  • About Data Mining

2
Data Mining has Many meanings
  • There are lots of things you can do with a
    database.
  • Access data via straightforward queries, to
    answer straightforward questions about
    instances. E.g.
  • What is Ellen McArthur's home phone number?
  • What is the ISBN of "Eats, Shoots and Leaves"
    by Lynne Truss?
  • What grade did Larry Page get for the Internet
    module?
  • Give me a list of all pages on the WWW that
    contain the phrase "fried egg sandwich".

3
Data Mining has Many meanings (cont)
  • Generate simple reports about data via
    straightforward queries, to answer questions
    about sets of instances. E.g.
  • How many of our customers are called Trevor?
  • Which of our books has been borrowed more times
    in the last month than "Eats, Shoots and Leaves"
    by Lynne Truss?
  • Which student has the highest average marks?
  • What percentage of house owners also own a
    car?

4
Data Mining has Many meanings (cont)
  • Generate complex and/or comprehensive
    statistical reports about the database as a
    whole, to summarise and understand the data.
    This is what tends to be done in the Analysis
    stage of Data Cleaning. E.g.
  • For each field, generate a histogram of the
    values.
  • Run one or more clustering algorithms to find the
    clusters in the data.

5
Data Mining has Many meanings (cont)
  • Build predictive models that may then be useful
    for business or research. For example:
  • Based on stock market data, we can construct a
    model that attempts to predict tomorrow's Dow
    Index closing price, given the previous few days'
    prices.
  • Based on blood test data from past patients, we
    can construct a model that attempts to predict
    whether or not a patient is developing hepatitis.
  • Based on historic data on vibrations, we can
    build a model that tries to predict beforehand if
    an aircraft wing is likely to fail.

6
Data Mining has Many meanings (cont)
  • Discover INTERESTING and USEFUL rules that are
    hidden in the data. For example:
  • An analysis of supermarket basket data will show
    a surprising number of baskets that contain both
    beer and nappies.
  • Analysis of crime records data may find that the
    violent crime rate in Newcastle seems to reduce
    significantly whenever the violent crime rate in
    Sunderland increases significantly.

7
All Together
  1. Accessing
  2. Reporting
  3. Clustering/Histograms
  4. Predictive models
  5. Discovery of interesting/surprising things

8
Data Mining
  1. Accessing
  2. Reporting
  3. Clustering/Histograms
  4. Predictive models
  5. Discovery of interesting/surprising things
  • When you hear the term "data mining", it can mean
    any of 2, 3, 4 and 5. In business/industry, 2
    and 3 are called data mining. In academia we
    usually take data mining to mean mainly 4 and
    5.

9
Notes on 3/4
  1. Accessing
  2. Reporting
  3. Clustering/Histograms
  4. Predictive models
  5. Discovery of interesting/surprising things
  • Items 3 and 4 are the things that you would look
    at more closely in a machine learning course. The
    predictive models are things like neural
    networks, decision trees and rulesets.

10
What DM means for us
  1. Accessing
  2. Reporting
  3. Clustering/Histograms
  4. Predictive models
  5. Discovery of interesting/surprising things
  • So, 3 and 4 are dealt with in another course. 5
    could be an entire MSc course on its own, but
    that is what DM means for us.
  • In particular, we take a small bite of it that
    is relevant to practical discovery of interesting
    things in very large DBs. We look at a fast
    algorithm that can discover interesting rules in
    transaction databases, and that is a component in
    several advanced commercial systems.

11
First, some important motivational/explanatory
notes
  • Why do we need something like type 5 data
    mining at all? Couldn't the beer and nappies
    thing have been found by types 2 or 3 DM?

The next slide shows a tiny supermarket basket
database. E.g. record 11 is a customer who
bought eggs and glue only; record 12 records a
transaction where the basket contained only
apples.

12
Table columns: ID, apples, beer, cheese, dates, eggs,
fish, glue, honey, ice-cream

13
Numbers
  • Our example DB has 20 records of supermarket
    transactions, from a supermarket that only sells
    9 things
  • One month in a large supermarket with five stores
    spread around a reasonably sized city might
    easily yield a DB of 20,000,000 baskets, each
    containing a set of products from a pool of
    around 1,000

14
Rules
  • A rule is something like this:
  • If a basket contains apples and cheese, then it
    also contains beer.
  • Any such rule has two associated measures:
  • confidence: when the "if" part is true, how
    often is the "then" part true?
  • coverage (or support): how much of the database
    contains the "if" part?
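
As a minimal sketch of the two measures defined above, the following Python computes coverage and confidence for a rule over a list of baskets. The function name `coverage_and_confidence` and the five baskets are hypothetical, made up for illustration; they are not the lecture's 20-record table.

```python
from typing import FrozenSet, Iterable, List, Tuple


def coverage_and_confidence(
    baskets: List[FrozenSet[str]],
    if_items: Iterable[str],
    then_items: Iterable[str],
) -> Tuple[float, float]:
    """Return (coverage, confidence) for the rule 'if_items -> then_items'.

    coverage   = fraction of all baskets that contain every 'if' item
    confidence = fraction of those baskets that also contain every 'then' item
    """
    if_set = frozenset(if_items)
    then_set = frozenset(then_items)

    # Baskets in which the "if" part is true.
    matching_if = [b for b in baskets if if_set <= b]
    coverage = len(matching_if) / len(baskets)

    if not matching_if:
        # The "if" part never occurs, so confidence is undefined; report 0.0.
        return coverage, 0.0

    # Of those, baskets in which the "then" part is also true.
    matching_both = [b for b in matching_if if then_set <= b]
    confidence = len(matching_both) / len(matching_if)
    return coverage, confidence


# Hypothetical baskets (assumption: illustrative data only).
baskets = [
    frozenset({"apples", "cheese", "beer"}),
    frozenset({"apples", "cheese"}),
    frozenset({"beer", "eggs"}),
    frozenset({"apples"}),
    frozenset({"cheese", "dates"}),
]

cov, conf = coverage_and_confidence(baskets, {"apples", "cheese"}, {"beer"})
print(f"coverage = {cov:.0%}, confidence = {conf:.0%}")
```

With these made-up baskets, 2 of 5 contain both apples and cheese, and 1 of those 2 also contains beer.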

15
Example
  • What are the confidence and coverage of
  • If the basket contains beer and cheese, then it
    also contains honey

2/20 of the records contain both beer and cheese,
so coverage is 10%.
Of these 2, 1 contains honey, so confidence is 50%.

16
Interesting/Useful rules
  • Statistically, anything that is interesting is
    something that happens significantly more than
    you would expect by chance.
  • E.g. basic statistical analysis of basket data
    may show that 10% of baskets contain bread, and
    4% of baskets contain washing-up powder. I.e.
  • There is a probability 0.1 that a basket contains
    bread.
  • There is a probability 0.04 that a basket
    contains washing-up powder.

17
Bread and washing up powder
  • What is the probability of a basket containing
    both bread and washing-up powder? The laws of
    probability say:
  • If these two things are independent, the chance
    is 0.1 × 0.04 = 0.004.
  • That is, we would expect 0.4% of baskets to
    contain both bread and washing-up powder.

18
Interesting means surprising
  • We therefore have a prior expectation that just
    4 in 1,000 baskets should contain both bread and
    washing-up powder.
  • If we investigate, and discover that really it
    is 20 in 1,000 baskets, then we will be very
    surprised. It tells us that:
  • Something is going on in shoppers' minds: bread
    and washing-up powder are connected in some way.
  • There may be ways to exploit this discovery: put
    the powder and bread at opposite ends of the
    supermarket?
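
The bread and washing-up powder arithmetic can be sketched in a few lines of Python. The function names are hypothetical; the numbers are the ones from the slides. The ratio of observed to expected co-occurrence computed here is commonly called "lift" in association rule mining, a term the slides themselves do not use.

```python
def expected_joint(p_a: float, p_b: float) -> float:
    """Probability that both items appear, assuming they are independent."""
    return p_a * p_b


def surprise_ratio(observed_joint: float, p_a: float, p_b: float) -> float:
    """How many times more often the pair occurs than independence predicts."""
    return observed_joint / expected_joint(p_a, p_b)


p_bread = 0.10            # 10% of baskets contain bread
p_powder = 0.04           # 4% of baskets contain washing-up powder
observed_both = 20 / 1000  # 20 in 1,000 baskets contain both

expected = expected_joint(p_bread, p_powder)
ratio = surprise_ratio(observed_both, p_bread, p_powder)

print(f"expected: {expected * 1000:.0f} in 1,000 baskets")
print(f"observed: {observed_both * 1000:.0f} in 1,000 baskets")
print(f"observed / expected = {ratio:.1f}")
```

A ratio near 1 would mean nothing surprising is going on; here the pair turns up about five times as often as chance alone would suggest.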

19
Finding surprising rules
  • Suppose we ask: what is the most surprising rule
    in this database? This would be, presumably, a
    rule whose accuracy is more different from its
    expected accuracy than any other's. But it also
    has to have a suitable level of coverage, or else
    it may be just a statistical blip, and/or
    unexploitable.
  • Looking only at rules of the form
  • if basket contains X and Y, then it also
    contains Z
  • our realistic numbers (about 1,000 products)
    tell us that there may be around 500,000,000
    distinct possible rules. For each of these we
    need to work out its accuracy and coverage, by
    trawling through a database of around 20,000,000
    basket records: roughly 10^16 operations.
  • Yes, it's easy to use type 2 DM, say, to work
    out the confidence and coverage of a given rule.
    But type 5 DM is all about searching through,
    somehow, 500,000,000 (or usually immensely more)
    rules to sniff out what may be the interesting
    ones.
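
The size of this search can be checked with a short calculation. This is a sketch under one assumption about how rules are counted: X and Y form an unordered pair of distinct products, and Z is any third product. The name `count_rules` is hypothetical.

```python
from math import comb


def count_rules(num_products: int) -> int:
    """Count rules 'if X and Y then Z' with X, Y, Z distinct products.

    Assumption: {X, Y} is an unordered pair, and Z is any of the
    remaining products.
    """
    return comb(num_products, 2) * (num_products - 2)


num_products = 1_000      # product pool from the "Numbers" slide
num_baskets = 20_000_000  # one month of baskets from the "Numbers" slide

rules = count_rules(num_products)
basket_checks = rules * num_baskets  # one pass over every basket per rule

print(f"possible rules:  {rules:,}")
print(f"basket checks:   {basket_checks:.2e}")
print(f"mini DB (9 products): {count_rules(9)} rules")
```

Under this counting, 1,000 products give 498,501,000 rules (about 500,000,000), and checking each rule against 20,000,000 baskets takes just under 10^16 basket checks, which matches the slide's estimate.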

20
Here are some interesting ones in our mini basket
DB
  • If a basket contains glue, then it also contains
    either beer or eggs
  • confidence 100%, coverage 25%
  • If a basket contains apples and dates, then it
    also contains honey
  • confidence 100%, coverage 20%

21
What this lecture was about
  • The many different meanings of data mining
  • Warming up for the next lecture, via gentle
    discussion on transaction databases, rules,
    confidence, coverage, and what it takes for a
    rule to be interesting.

22
Next
  • A classic fast algorithm for finding useful rules
    in large databases.