Transcript and Presenter's Notes

Title: Data Mining (and machine learning), Lecture 2: Some common data processing tasks - David Corne, Heriot-Watt University


1
Data Mining(and machine learning)
  • Lecture 2 Some common data processing tasks

2
Today
  • About classification
  • A simple classification method (1-NN)
  • Normalisation of instances and fields
  • Discretization of fields
  • Coursework B

3
Some notation
  • A dataset has N records R1, R2, ..., RN,
  • each of which has F fields A1, A2, ..., AF.
  • Field AF is the target field: the one we want to
    predict.
  • v(i,j) is the value of field Aj in record Ri
  • So, for example, what is

4
Some more terminology
  • Sometimes the values of a field are numeric. We
    call this a numeric field (!)
  • Sometimes the values of a field are words or
    categories. E.g. red, blue, green
  • true, false
  • male, female
  • low, medium, high
  • this type of field is called a categorical
    field, or a nominal field.

5
So, what is this mostly about?
You have some data
3 7 2 9 1 A
5 8 2 9 2 B
4 1 1 1 3 B
6 8 2 1 4 B
6
So, what is this mostly about?
You have some data
3 7 2 9 1 A
5 8 2 9 2 B
4 1 1 1 3 B
6 8 2 1 4 B
You get a new line of data, but it is not
classified
3 6 1 8 1 ?
7
So, what is this mostly about?
You have some data
3 7 2 9 1 A
5 8 2 9 2 B
4 1 1 1 3 B
6 8 2 1 4 B
You get a new line of data, but it is not
classified
3 6 1 8 1 ?
You need to predict the class value
8
Broadly how it's done
In order to be able to predict the class value of
any new line of data that comes along, this is
what happens
9
Broadly how it's done
In order to be able to predict the class value of
any new line of data that comes along, this is
what happens
Get your already-classified dataset
Run a machine learning method on these data,
which produces a classifier
Use the classifier to make predictions of the
class value for new instances
10
E.g. here is some data we saw before
3 7 2 9 1 A
5 8 2 9 2 B
4 1 1 1 3 B
6 8 2 1 4 B
If we run a machine learning method like RIPPER,
the classifier that comes out is a rule (or a set
of rules), e.g.
If (A1 < 5) AND (A5 < 2) THEN class is A
If (A2 > 1) THEN class is B
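Such a rule set is easy to read as code. Here is a minimal Python sketch (my own illustration, not from the slides; it assumes the rules are tried in order and that None is returned if no rule fires):

def classify_by_rules(record):
    # record is [A1, A2, A3, A4, A5] (attribute values only, no class value)
    a1, a2, a3, a4, a5 = record
    if a1 < 5 and a5 < 2:
        return 'A'
    if a2 > 1:
        return 'B'
    return None  # no rule fired

print(classify_by_rules([3, 6, 1, 8, 1]))  # -> 'A'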
11
E.g. here is some data we saw before
3 7 2 9 1 A
5 8 2 9 2 B
4 1 1 1 3 B
6 8 2 1 4 B
If instead we do machine learning with a Neural
Network, or Naive Bayes, or Regression, etc.,
then the output classifier is (more or less) a
thresholded mathematical function of the
attribute values. The type of function varies a
lot between methods, but it might be, e.g., a
weighted sum of the attribute values, with the
convention that if this is < 0 the class is A,
otherwise B.
12
  • Or the classifier could be a decision tree, such
    as

[Decision tree diagram: the root node tests A1>2?; its yes/no branches lead to further tests A4<5? and A4>7?, whose yes/no branches end in the leaves Class is A, Class is B and Class is B.]
13
Applications
  • Just a reminder: you should already be developing
    an awareness of this from previous lectures, from
    browsing data repositories, and from using your
    imagination.
  • Business, Commerce, Engineering, Medicine, and
    Science provide endless important applications
    for this process, in which there are existing
    classified datasets that we can learn from, and
    for which producing a classifier that can predict
    class values for new data instances is highly
    valuable.

14
An Important Issue
  • Overfitting: It is very possible to learn a
    classifier that gives accurate performance on the
    existing classified dataset, but which performs
    very poorly in its predictions on new, previously
    unseen data.
  • I will have more to say on overfitting, and how
    to combat it, in later lectures.

15
The main issue for today
  • Dataset preparation
  • There are various things we can and should do to
    our dataset so that we make it easier, or indeed
    possible, to apply our chosen machine learning
    method.
  • E.g. we may want to apply the ID3 Decision Tree
    learning algorithm but this only works with
    categorical data.
  • We may want to apply a Neural Network, but this
    needs the data to be all numeric, and for all
    values to be relatively small (e.g. between -1
    and 1).
  • Etc..

16
Useful interlude: 1-NN classification
  • Almost the simplest possible classification
    method.
  • The classifier is just the dataset itself
  • Predict the class of a new instance new by
    finding the instance c in the dataset that is
    closest to new, and predicting that it will have
    the same class value as c
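As a rough Python sketch of this idea (my own illustration; it assumes each record is a list of field values with the class value last, and that a distance function is supplied, such as the SSD defined on the next slide):

def predict_1nn(dataset, new_record, distance):
    """Predict the class of new_record as the class of its single nearest neighbour."""
    closest = min(dataset, key=lambda rec: distance(rec[:-1], new_record))
    return closest[-1]

For example, with the dataset from slide 5 and the SSD distance, predict_1nn(data, [3, 6, 1, 8, 1], ssd) returns 'A'.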

17
How to work out distance between two instances?
  • Depends on the data, and on your common sense.
    More on that later, but for now let's say we are
    only dealing with numeric data. One way to work
    out distance is the SSD (sum-squared distance):

If you have two records a and b, then the
distance between them can be given by
SSD(a,b) = the sum over the fields i of ( v(a,i) - v(b,i) )^2.
Just add up the squared differences of the
fields, omitting the class field. Why don't I
take the square root of this?
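A matching Python sketch of the SSD (my own illustration of the sum above):

def ssd(a, b):
    """Sum of squared differences between two equal-length lists of numeric field values."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(ssd([3, 7, 2, 9, 1], [3, 6, 1, 8, 1]))  # -> 3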
18
Visualisation of 1-NN
Suppose you have data with two numeric fields and
one class field, which is either red or green. We
can treat the data as 2D points, coloured by the
class. Suppose this is the pre-classified
dataset
19
Visualisation of 1-NN
Suppose you have data with two numeric fields and
one class field, which is either red or green. We
can treat the data as 2D points, coloured by the
class. Suppose this is the pre-classified
dataset
A new unclassified instance comes along: what is
the class?
20
Visualisation of 1-NN
Suppose you have data with two numeric fields and
one class field, which is either red or green. We
can treat the data as 2D points, coloured by the
class. Suppose this is the pre-classified
dataset
The way you answer this visually is very
similar to the way 1-NN works
21
Life is difficult / interesting
Just an aside to point out that, in many
interesting and important classification
problems, the situation is more like this

22
The performance of 1-NN classification
  • Calculate it for any dataset of R records like
    this:
  • Initialise accuracy to 0.
  • For each record r:
  • Find the closest other record, o, in the
    dataset
  • If the class value of o is the same as the
    class value of r, then add 1 to accuracy
  • Return 100 * accuracy / R
  • Thereby, you obtain a percentage accuracy result
    (a code sketch of this procedure follows below)
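A minimal Python sketch of this procedure (my own illustration; it reuses the ssd function and the record layout from the earlier sketches):

def one_nn_accuracy(dataset):
    """Percentage of records whose closest *other* record has the same class value."""
    correct = 0
    for i, r in enumerate(dataset):
        others = dataset[:i] + dataset[i + 1:]          # every record except r itself
        o = min(others, key=lambda rec: ssd(rec[:-1], r[:-1]))
        if o[-1] == r[-1]:
            correct += 1
    return 100 * correct / len(dataset)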

23
Scaling, Normalisation and Discretization
  • So we will learn these things
  • Ways to do normalisation and scaling (which may
    be necessary or sensible, depending on how the
    data were generated, and on what DM/ML methods
    you want to use)
  • Ways to do discretization (for turning numeric
    fields into categorical fields)

24
Normalisation
  • You have data about word-counts for specific
    words on web pages. You want to be able to
    predict whether the page is about sport or
    business.

winner game team sales run Page content
16 22 81 75 10 business
12 14 44 16 12 business
4 7 20 0 2 sport
2 3 7 6 1 ???

What category would 1-NN predict for the 4th
record?
25
Normalisation
  • The closest record to the 4th is the 3rd, so it
    would be predicted to be sport. But this is
    probably wrong. Why?

winner game team sales run Page content
16 22 81 75 10 business
12 14 44 16 12 business
4 7 20 0 2 sport
2 3 7 6 1 ???


26
Normalisation
  • The closest record to the 4th is the 3rd, so it
    would be predicted to be sport. But this is
    probably wrong. Why?

winner game team sales run Page content
16 22 81 75 10 business
12 14 44 16 12 business
4 7 20 0 2 sport
2 3 7 6 1 ???

How could you pre-process the data to help ML
methods?
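To see the problem concretely, here is the raw-count prediction, reusing the predict_1nn and ssd sketches from earlier (my own illustration):

pages = [[16, 22, 81, 75, 10, 'business'],
         [12, 14, 44, 16, 12, 'business'],
         [4, 7, 20, 0, 2, 'sport']]
new_page = [2, 3, 7, 6, 1]
print(predict_1nn(pages, new_page, ssd))  # -> 'sport': the small raw counts look closest to record 3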

27
Record-wise Normalisation
  • In these data, the sensible thing to do is
    transform each record so that it is normalised by
    the total number of words. This makes them more
    comparable, each providing a fingerprint in
    terms of the relative proportions of the probe
    words.
  • Sometimes this is necessary, sometimes it is
    useful; it just takes common sense.

28
Record-wise normalised version

Each row is simply scaled so that the numeric
fields add up to 1.
winner game team sales run Page content
0.0784 0.1078 0.397 0.368 0.049 business
0.12 0.14 0.45 0.16 0.12 business
0.12 0.21 0.61 0 0.061 sport
0.105 0.158 0.368 0.316 0.053 ???

1-NN now shows record 4 closer to 1 and 2
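A minimal Python sketch of record-wise normalisation (my own illustration; it assumes the class value is the last field, all other fields are numeric, and it reuses pages, new_page, predict_1nn and ssd from the earlier sketches):

def normalise_record(record):
    """Scale one record's numeric fields so that they sum to 1, keeping its class value."""
    values, cls = record[:-1], record[-1]
    total = sum(values)
    return [v / total for v in values] + [cls]

normalised_pages = [normalise_record(r) for r in pages]
new_norm = [v / sum(new_page) for v in new_page]      # normalise the new record the same way
print(predict_1nn(normalised_pages, new_norm, ssd))   # -> 'business' (record 4 is now closest to records 1 and 2)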

29
Record-wise Normalisation
  • Using the notation in slide 3: given a dataset of
    N records with F numeric fields, how would you
    represent record-wise normalization of record r?

30
Min-Max Normalisation
  • This is columnwise normalisation, done separately
    for each field.
  • For each numeric attribute, scale it so that each
    value is in a specific range [a, b], e.g. usually
    [0, 1], sometimes [-1, 1], etc. If we use [0, 1],
    we do this for each record r:

The normalised value is v'(r,i) = ( v(r,i) - min(i) ) / range(i),
where min(i) is the smallest value for field i in
the dataset, max(i) is the largest, and range(i)
is max(i) - min(i).
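A minimal Python sketch of min-max normalisation to [0, 1] for one column of values (my own illustration; in practice the class column is left out):

def min_max(column):
    """Scale a list of numeric values for one field into the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

print(min_max([1856, 1502, 1904, 1775, 1901]))  # heights: 1502 -> 0.0, 1904 -> 1.0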
31
When min-max normalisation might be useful
Height (mm) Weight (kg) IQ Mark in DMML module
1856 75 95 70
1502 56 101 82
1904 86 112 90
1775 61 110 90
1901 81 92 ???
32

Height (mm) Weight (kg) IQ Mark in DMML module
1856 75 95 70
1502 56 101 82
1904 86 112 90
1775 61 110 90
1901 81 92 ???
What is the closest record to record 5?
33

Height (mm) Weight (kg) IQ Mark in DMML module
1856 75 95 70
1502 56 101 82
1904 86 112 90
1775 61 110 90
1901 81 92 ???
What is the closest record to record 5? Will this
be a good predictor for DMML Mark?
34

Height (mm) Weight (kg) IQ Mark in DMML module
1856 75 95 70
1502 56 101 82
1904 86 112 90
1775 61 110 90
1901 81 92 ???
What is the closest record to record 5? Will this
be a good predictor for DMML Mark? Which field
will be most important for predicting Mark?
35

Height (mm) Weight (kg) IQ Mark in DMML module
1856 75 95 70
1502 56 101 82
1904 86 112 90
1775 61 110 90
1901 81 92 ???
What is the closest record to record 5? Will this
be a good predictor for DMML Mark? Which field
will be most important for predicting Mark? How
does min-max normalisation help?
36
Discretization
Height (mm) Weight (kg) IQ Mark in DMML module
1856 75 95 70
1502 56 101 82
1904 86 112 90
1775 61 110 90
1901 81 92 65

We can't run the ID3 decision tree algorithm on
this.
37
Discretization
Height (mm) Weight (kg) IQ Mark in DMML module
tall heavy medium A
short light medium A
tall heavy smart A
medium medium smart A
tall heavy medium B

We can run the ID3 decision tree algorithm on
this.
38
Discretization
Discretization is simply a process that converts
a numerical field into a class field. How
might you do this?

39
Equal Width Binning (EWB)
  • EWB(5), for example, means you discretize into 5
    values, where each bin has equal width.
  • If field f values range from 0 to 100, then
    each bin has width 20. In the converted dataset,
    we can just label these bins 1, 2, ..., 5, or we
    can use appropriate linguistic terms, such as
  • very low, low, medium, high, very high
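A minimal Python sketch of EWB (my own illustration): split the observed range of a field into k equal-width intervals and label each value with its bin number.

def ewb(column, k):
    """Equal Width Binning: map each value to a bin label in 1..k."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / k
    labels = []
    for v in column:
        b = int((v - lo) / width) + 1   # which equal-width bin the value falls into
        labels.append(min(b, k))        # the maximum value belongs to the top bin
    return labels

print(ewb([1856, 1502, 1904, 1775, 1901], 2))  # -> [2, 1, 2, 2, 2], i.e. high, low, high, high, high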

40
Equal Frequency Binning (EFB)
  • EFB(3), for example, means you discretize into 3
    values, where each bin covers around the same
    amount of data.
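And a matching sketch of EFB (my own illustration): sort the values and cut the sorted order into k groups of roughly equal size.

def efb(column, k):
    """Equal Frequency Binning: each of the k bins gets roughly the same number of values."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    labels = [None] * len(column)
    for rank, i in enumerate(order):
        labels[i] = (rank * k) // len(column) + 1   # bin label in 1..k, assigned by rank
    return labels

print(efb([10, 200, 30, 40, 5, 7], 3))  # -> [2, 3, 2, 3, 1, 1]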

41
Contrived Example for EWB(2)
Height (mm) Weight (kg) IQ DMML mark
1856 75 95 70
1502 74 101 82
1904 77 112 90
1775 61 110 90
1901 76 92 70
Height (mm) Weight (kg) IQ DMML mark
high high low low
low high low high
high high high high
high low high high
high high low low
42
Other discretization methods
  • There are others, which almost all differ from
    EWB and EFB in one respect: they use class value
    information in an attempt to find bins that make
    good sense with regard to classification.
  • We may look at those later

43
Finally: Missing values
  • Often, a dataset has missing values for some
    fields in some records. E.g. a certain blood test
    was not taken for some patients. A questionnaire
    response for some question was unreadable, etc

44
Coursework B
  • There are several ways to deal with missing
    values. Find out for yourself:
  • Do some research with Google, or Google Scholar
    (or your favourite other search engine).
  • Write a report of only 100 words, entitled
    "Methods for dealing with missing values in
    datasets".
  • Provide at least 2 relevant references (to papers
    or URLs) at the end of your report.
  • The whole thing should only take 30 mins to 1 hr.

45
Next week
  • Basic statistics
  • Histograms
  • A little more normalisation and discretization
  • Coursework 1

46
Extra slides on Data Quality / Data Cleaning
  • I am not teaching Data Quality / Data Cleaning as
    part of F21DL this year. But here are some slides
    you may be interested in anyway

47
On Data Quality
  • Suppose you have a database sitting in front of
    you, and I ask: "Is it a good quality database?"
  • What is your answer? What does quality depend
    on?
  • Note: this is about the data themselves, not the
    system used to access them.

48
A Conventional Definition of Data Quality
  • Good quality data are
  • Accurate, Complete, Unique,
  • Up-to-date, and Consistent
  • meaning

49
A Conventional Definition of Data Quality,
continued
  • Accurate: This refers to how the data were
    recorded in the first place. What might be the
    inaccurately recorded datum in the following
    table?

Barratt John 22 Maths BSc Male
Burns Robert 24 CS BSc Male
Carter Laura 20 Physics MSc Female
Davies Michael 12 CS BSc Male
50
A Conventional Definition of Data Quality,
continued
  • Complete: This refers to whether or not the
    database really contains everything it is
    supposed to contain.
  • E.g. a patient's medical records should contain
    references to all medication prescribed to date
    for that patient.
  • The BBC TV Licensing DB should contain an entry
    for every address in the country. Does it?

51
A Conventional Definition of Data Quality,
continued
  • Unique: Every separate datum appears only once.
    How many Data Quality errors can you find in
    the following table, and what types are they?

Surname Firstname DoB Driving test passed
Smith J. 17/12/85 17/12/05
Smith Jack 17/12/85 17/12/2005
Smith Jock 17/12/95 17/12/2005
52
A Conventional Definition of Data Quality,
continued
  • Up-to-date: The data are kept up to date.
  • The post office recently changed my postcode from
    EX6 8RA to EX6 8NU. Why does this make it
    difficult for me to get a sensible quote for home
    insurance or car insurance?
  • Can you think of a DB where it doesn't matter
    whether or not the data are kept up to date??

53
A Conventional Definition of Data Quality,
continued
  • Consistent: The data contain no logical errors
    or impossibilities. The database makes sense in
    and of itself.
  • Why is the following mini DB inconsistent?

Date Sales Returns Net income
23rd Nov 25,609 1,003 24,506
24th Nov 26,202 1,601 24,601
25th Nov 28,936 1,178 25,758
54
Note: this definition of data quality is not much
use, since there is no way to measure DQ sensibly
  • Completeness: how will we know??
  • Uniqueness: it is hard to tell whether two
    entries are similar, or duplicates!
  • Up-to-date-ness: how do we know?
  • Consistent: consistency errors can be very
    hard to find, especially in a very large DB

The database research world is actively engaged
in finding ways to measure data quality sensibly.
In the meantime, we just use common sense to
avoid dirty data at all points of the DQ
continuum
55
The Data Quality Continuum
  • It's rare that a datum is entered once into a DB
    and then left alone. Usually, a datum has a long
    and varied life, during which errors can arise at
    each and every stage. The continuum is:
  • Data gathering
  • Data delivery
  • Data storage
  • Data integration
  • Data retrieval
  • Data analysis
  • So, if we want to monitor DQ, we need to monitor
    it at each of these stages

56
DQ Continuum Example
  • This is an example I am familiar with, helping to
    illustrate the DQ continuum.
  • The International Seismological Centre (ISC) is
    in Thatcham, in Berkshire. It's a charity funded
    by various governments. Their role is to be the
    repository for recording all earthquake events on
    the planet.

57
DQ Continuum, ISC example: gathering
[Diagram: data gathering centres around the world sending data to the ISC]

58
See 2006 earthquake data via my dmml page
59
DQ Continuum, ISC example: delivery
  • Raw seismograph data from local collection points
    to DG centres.
  • Email or ftp to ISC; some centres provide raw
    data, some provide interpreted data (e.g. they
    maybe won't send some data if they believe it in
    error in the first place)

60
DQ Continuum, ISC example: integration
  • The ISC's role is actually to figure out where
    and when the Earth tremors were (there are
    hundreds per month) based on reverse engineering
    from seismograph readings. They integrate the raw
    data and attempt to do this, largely by hand and
    brain, and record their findings in archival CDs.

61
DQ Continuum, ISC example: retrieval/analysis
  • You can get a CD from ISC anytime, for the earth
    tremor activity on any particular day.
  • I'm not sure whether you can get the raw data
    from them.
  • Naturally, you can analyse the data and see if
    you can find inconsistencies or errors.

62
The ISC DQ Continuum

Where might errors occur, of Accuracy?
Completeness? Uniqueness? Timeliness?
Consistency? What else is important in this case?
63
Where DQ problems occur (gathering)
  • Manual data entry (how can we improve this?)
  • Lack of uniform standards for format and content.
  • Duplicates arising from parallel entry
  • Approximations, alternatives, entries altered in
    order to cope with s/w and/or h/w constraints.
  • Measurement errors.

64
Where DQ problems occur (delivery)
  • Multiple hops from source to DB: problems can
    happen anywhere
  • Inappropriate pre-processing (e.g. removing some
    small seismograph readings before sending on to
    ISC; rounding up or down when the destination
    needs more accurate data).
  • Transmission problems: buffer overflows, checks
    (did all files arrive, and all correctly?)

65
Where DQ problems occur (storage)
  • Poor, out-of-date or inappropriate metadata
  • Missing timestamps
  • Conversion to storage format (e.g. to Excel
    files, to higher/lower precision)

66
Where DQ problems occur (integration)
  • This is the business of combining datasets, e.g.
    from different parts of a company, from
    (previously) different companies following an
    acquisition, from different government agencies,
    etc.
  • Different keys, different fields, different
    formats
  • Different definitions (customer, income, ...)
  • Sociological factors: reluctance to share!

67
Where DQ problems occur (retrieval/analysis)
  • The problem here is usually the quality of DBs
    that store the retrieved data, or the use of the
    retrieved data in general. Problems arise
    because
  • The source DB is not properly understood!
  • Straightforward mistakes in the queries that
    retrieve the relevant data.
  • E.g. A database of genes contains entries that
    indicate whether or not each gene has a known or
    suspected link with cancer. A retrieval/analysis
    task leads to publishing a list of genes that are
    not relevant to cancer. What is the problem here?

68
What Keeps DBs Dirty
  • A good DBMS will have built-in tools for:
  • Consistency in data types
  • Consistency in field values
  • Constraints and checks that deal with Null
    values, Outliers, Duplication.
  • Automatic timestamps
  • Powerful query language (makes retrieval logic
    errors less likely)
  • so, why are you refused a loan, have mail
    delivered to the wrong address, and get charged
    too much for your mobile calls?

69
Despite all this ...
  • Consistency constraints are often not applied, or
    are applied!
  • suppose height is not allowed to go over 2
    metres in a school student DB
  • My postcode problem
  • The data are just too numerous, complex and
    ill-understood. Cleaning them would cost too
    much!
  • Undetectable problems: incorrect values, missing
    entries
  • Metadata not maintained properly

70
Single Source vs Multiple Source, Schema Level vs
Instance Level
  • One useful way to categorize problems,
    independent of how we did so in the last lecture,
    is according to whether the problems are the sort
    we can get if we have just one source of data, or
    whether the problem arises directly from trying
    to combine data from multiple sources
  • Problems can also be schema level or instance
    level

71
Single Source / Schema level examples
Scope | Problem | Unclean example | Notes
attribute | Illegal values | DoB: 30.02.88 | Values out of range
record | Violated attribute dependencies | Car-owner: No, Make: Toyota | Make should clearly have a Null value here.
record type | Uniqueness violations | Name: Jo Smith, NUS no. 3067; Name: Ed Brown, NUS no. 2124 | NUS no.s should be unique
source | Referential integrity violation | Name: D Corne, Office: EM G.92 | Where is G.92?

72
Single Source / Instance level examples
Scope | Problem | Unclean examples | Notes
attribute | Missing values, mis-spellings, abbreviations, misfields, embedded values | Top speed: 0 mph; Title: Dark Side of the Moan; FullName: J. Smith; Colour: 160mph; Phone: Dan Jones 0207 308653 | Dummy entries, values unavailable at entry time, human error
record | Violated attribute dependencies | City: Edinburgh, Postcode: EX6 |
record type | Word transposition, duplicates, contradictions | Name: Jo Smith / Name: Carr, Jim; Name: J. Smith / Name: Joe Smith; Name: Jo Smith, DoB 17/12/62 / Name: Jo Smith, DoB 17/11/62 |
source | Wrong references | Name: D Corne, Office: EM G.46 | EM G.46 exists, but is not my office.

73
Multiple Source Problems / Instance and Schema
level examples
The Smiths buy books and music online from
company A

Customer ID Name Street City Sex
102 Luke Smith 5 Chewie Rd Dawlish, Devon 0
175 Leia Smith Chewie St, 5 Dawlish 1
They also buy books and music online from company
B
Client ID LastName Other names Phone Gender
23 Smith Luke Michael 378988 Male
35 Smith Leia S. 44(0)1626 378988 F
74
When Companies A and B merge, various problems
arise when they merge their DBs.
Combining customer fields and client fields: are
they really the same things? How to ensure that
Company A's customer 37 and Company B's client
37 get separate entries in the new DB? Are Luke
Smith and Luke Michael Smith the same
person?? Do Luke and Leia live at the same
address? Etc. A forced, fast resolution to
these problems will usually lead to errors in the
new integrated DB.

75
A Special but Common type of Problem: Semantic
Complexity
  • Semantic Complexity (SC) is the state of play
    where different users of a DB have different
    conceptions of what the data represent.
  • E.g. a local Police DB keeps a record of all
    crimes in an area, where the key is the victim's
    name. When someone who was a victim moves to a
    different area, they remove all records relating
    to that person. The local council use this DB to
    produce a report of the total amount of crime
    every month. Why does it give figures that are
    too low?

76
Semantic Complexity: Missing/Default Values
  • One source of semantic complexity is the
    different meanings that missing values can have.
  • E.g. suppose the histogram of value types in the
    mobile phone no. field shows ordinary numbers
    alongside NULL, all-zero and all-nine entries.
77
What does NULL mean?
  • A. This record is of someone who does not have a
    mobile phone?
  • B. This record is of someone who has a mobile
    phone, but chose not to supply the number?
  • C. This record is of someone who has a mobile
    phone, but who forgot to supply the number, or
    it was hard to decipher and recorded as NULL?
  • Maybe some are of type A and some are of type B
    and some are of type C. For some
    applications/analyses, we may wish to know the
    breakdown into types.
  • What about the "all zero" and "all nine" entries?
    Precisely the same can be said of them. Or,
    perhaps the protocols for recording the entries
    indicated NULL for type A, 0000000 for type B and
    9999999 for type C.
  • The above relate to a quite simple form of
    semantic complexity, but what if someone uses
    this DB to estimate the proportion of people who
    have never had a mobile phone?

78
Cleaning via basic data analysis
  • Data Profiling: examine the instances to see how
    the attributes vary. E.g. automatically generate
    a histogram of values for that attribute.

How does the histogram help us in finding
problems in this case?
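A minimal Python sketch of this kind of profiling (my own illustration, with made-up values): counting how often each value occurs in an attribute quickly exposes dummy and default entries.

from collections import Counter

def profile_attribute(values):
    """Histogram of the values of one attribute, most frequent first."""
    return Counter(values).most_common()

phones = ['378988', 'NULL', '0000000', 'NULL', '9999999', '412633', 'NULL']
print(profile_attribute(phones))  # e.g. [('NULL', 3), ('378988', 1), ...]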
79
What problems does this analysis alert us to?
80
Acknowledgements
  • I adapted the data quality/cleaning material from
    various sources, most notably
  • A ppt presentation called "Data Quality and Data
    Cleaning: An Overview" by Tamraparni Dasu and
    Theodore Johnson, at AT&T Labs
  • A paper called "Data Cleaning: Problems and
    Current Approaches", by Erhard Rahm and Hong Hai
    Do, University of Leipzig, Germany.
  • My thanks to these researchers for making their
    materials freely available online.