Title: Categorisation of Short Text Fields
1Categorisation of Short Text Fields Using Text Mining
Mark Bentley and Peter Yeates
2Contents
- Information from Data
- Text Mining Primer
- Case Study
3The Problem - Information or Just Data?
- Legacy computer systems store data as free-form text
- Every imaginable description and spelling
- Data becomes information only if the free-form text can be decoded
4The Desire - Reconstruct the Information
- Want to get the information that should have been captured
- Want to do this quickly, easily, and objectively
5The Desire - Reconstruct the Information
- Method
- Manually classify?
- Manually identify and apply rules?
- Automate the identification and application of rules?
6Automated Rule Discovery
- Data-mining techniques can be used to discover rules automatically
- Knowledge of a business expert can be used to classify the data
- Easy, quick, and can do a better job than other methods
7Decision Trees
- Models that aim to allocate data points into homogeneous groups
- Operate by successively partitioning data
- The tree-fitting process chooses the best partitions to use
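One way to picture "choosing the best partition" is to score every candidate yes/no split by how mixed the resulting groups are and keep the split that leaves them least mixed. The sketch below assumes Gini impurity as the measure of homogeneity and word-presence indicators as the features; the slides do not say which criterion or software was used, and the function names and toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, weighted impurity) for the best yes/no partition.

    rows   -- list of dicts mapping a feature name to 0 or 1
    labels -- list of category labels, one per row
    """
    best = (None, gini(labels))  # baseline: no split at all
    for feature in sorted(rows[0]):
        yes = [lab for row, lab in zip(rows, labels) if row[feature] == 1]
        no = [lab for row, lab in zip(rows, labels) if row[feature] == 0]
        if not yes or not no:
            continue  # this feature does not partition the data
        weighted = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if weighted < best[1]:
            best = (feature, weighted)
    return best

# Hypothetical toy data: word-presence indicators and expert-assigned labels.
rows = [
    {"shed": 1, "cow": 0},
    {"shed": 1, "cow": 0},
    {"shed": 1, "cow": 0},
    {"shed": 0, "cow": 1},
    {"shed": 0, "cow": 1},
    {"shed": 0, "cow": 0},
]
labels = ["building", "building", "building", "animal", "animal", "tool"]

feature, impurity = best_split(rows, labels)
print(feature, round(impurity, 3))
```

A full tree would apply `best_split` again to each of the two resulting groups until the groups are homogeneous enough.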
8Decision Trees
9Pruning
- Decision trees can be small or large depending on the amount of data and the degree of fit required
- Various statistical tools are used to prune a tree back to a suitable size
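The slides do not name the statistical tools used for pruning. As an illustration only, the sketch below uses reduced-error pruning, a simple and well-known option: a subtree is collapsed into a single leaf whenever doing so does not increase the number of errors on held-out data. The tree layout (nested dicts with `word`, `majority`, `yes`, `no`, `label` keys) and the toy data are assumptions made for this example.

```python
def predict(node, row):
    """Follow yes/no word tests down to a leaf and return its label."""
    while "label" not in node:
        node = node["yes"] if row.get(node["word"], 0) else node["no"]
    return node["label"]

def errors(node, rows, labels):
    """Count held-out rows that the (sub)tree classifies incorrectly."""
    return sum(predict(node, r) != lab for r, lab in zip(rows, labels))

def count_leaves(node):
    if "label" in node:
        return 1
    return count_leaves(node["yes"]) + count_leaves(node["no"])

def prune(node, rows, labels):
    """Reduced-error pruning: collapse a subtree into a leaf whenever that
    does not increase the error count on the held-out rows reaching it."""
    if "label" in node:
        return node
    yes_idx = [i for i, r in enumerate(rows) if r.get(node["word"], 0)]
    no_idx = [i for i, r in enumerate(rows) if not r.get(node["word"], 0)]
    pruned = {
        "word": node["word"],
        "majority": node["majority"],
        "yes": prune(node["yes"],
                     [rows[i] for i in yes_idx], [labels[i] for i in yes_idx]),
        "no": prune(node["no"],
                    [rows[i] for i in no_idx], [labels[i] for i in no_idx]),
    }
    leaf = {"label": node["majority"]}
    if errors(leaf, rows, labels) <= errors(pruned, rows, labels):
        return leaf
    return pruned

# Hypothetical over-fitted tree: the "red" split only captured noise.
tree = {
    "word": "shed", "majority": "building",
    "yes": {"word": "red", "majority": "building",
            "yes": {"label": "tool"}, "no": {"label": "building"}},
    "no": {"label": "animal"},
}
# Hypothetical held-out descriptions, as word-presence dicts, with true labels.
held_out = [{"shed": 1, "red": 1}, {"shed": 1}, {"cow": 1}, {"cow": 1, "red": 1}]
truth = ["building", "building", "animal", "animal"]

smaller = prune(tree, held_out, truth)
print(count_leaves(tree), count_leaves(smaller))
```

The useful "shed" split survives because removing it would misclassify the animals, while the "red" split is removed because it only adds errors on the held-out rows.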
10Categorisation of Farm Assets
- Want to set robust premium rates for categories of farm assets
- Asset data captured only as free-form text
- 30K short-text descriptions
- 6,000 unique words
- Assets: buildings, tools, animals, etc.
11Framing the Problem
- Categorical variables
- Classify by starting letter
- Extra field for "Shed"
12Framing the Problem
- A common alternative
- Indicator variables
- Most frequent words
- Stemming
- Drop connector words
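The indicator-variable framing listed above can be sketched as follows: drop connector words, reduce each remaining word to a stem, keep the most frequent stems, and record a 0/1 flag per stem for every description. The connector-word list and the crude plural-stripping stemmer are assumptions for illustration (the slides do not say which list or stemming method was used), and the sample descriptions are made up.

```python
from collections import Counter

# Assumed connector-word list; the slides do not say which words were dropped.
CONNECTORS = {"and", "or", "the", "a", "of", "with", "for", "in"}

def stem(word):
    """Crude stemmer: strips a plural 's' only (an assumption, not Porter)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def tokens(description):
    """Lower-case, split on spaces and commas, drop connectors, then stem."""
    words = description.lower().replace(",", " ").split()
    return [stem(w) for w in words if w not in CONNECTORS]

def indicator_table(descriptions, top_n):
    """Return (vocabulary, rows) where each row is a list of 0/1 flags."""
    tokenised = [tokens(d) for d in descriptions]
    # Count how many descriptions each stem appears in.
    counts = Counter(w for toks in tokenised for w in set(toks))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [w for w, _ in ranked[:top_n]]
    rows = [[1 if w in toks else 0 for w in vocab] for toks in tokenised]
    return vocab, rows

# Made-up descriptions in the style of the farm-asset data.
descriptions = ["Hay shed", "Sheds and tractors", "Tractor", "Dairy cows, hay"]
vocab, rows = indicator_table(descriptions, top_n=3)
print(vocab)
for row in rows:
    print(row)
```

Each row can then be fed to a decision tree as a set of yes/no features, one per retained word.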
13Training Data
- Information is needed to train the model
- This was created by a business expert manually classifying a 1% random sample of the data
- Don't need a "why" for each classification
- Categories made up as needed
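Drawing the sample for the business expert to classify can be as simple as the sketch below. It assumes a sampling fraction of 1% and uses placeholder records in place of the real asset descriptions; the function name and the fixed seed are illustrative choices, not details from the slides.

```python
import random

def draw_training_sample(descriptions, fraction, seed=0):
    """Pick a simple random sample of descriptions for manual classification."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    size = max(1, round(len(descriptions) * fraction))
    return rng.sample(descriptions, size)

# Placeholder records standing in for the real asset descriptions.
descriptions = [f"asset description {i}" for i in range(30_000)]
sample = draw_training_sample(descriptions, fraction=0.01)
print(len(sample))
```

Sampling without replacement keeps the expert from classifying the same record twice.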
14Impression
- First model
- From inspection, appeared to work well
- Some asset types often misclassified
- An additional sample of problem assets was taken and manually classified
- Model was rebuilt using the expanded training data
15Second Cut
- Substantially improved
- Close to manual classification?
- Classification errors
- "Tractor and Hay Barn": one category required, two assets described, so unclassifiable
16Testing Results
- Testing that models work is essential
- Visual inspection: models worked well
- Numerical test: gains chart
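For readers unfamiliar with gains charts, the sketch below computes the points of a cumulative gains curve for a single category: records are ranked by model score, and at each cut-off the curve shows what share of the true members of the category has been found. Treating one category at a time, the function name, and the toy scores are all assumptions for illustration; the slides do not show how the chart was built.

```python
def cumulative_gains(scores, actual, bins=10):
    """Return [(fraction of records examined, fraction of positives found)].

    scores -- model score per record (higher means more likely in the category)
    actual -- 1 if the record truly belongs to the category, else 0
    """
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda p: -p[0])]
    total = sum(ranked)
    if total == 0:
        raise ValueError("no positive records to find")
    points = []
    for b in range(1, bins + 1):
        cutoff = round(len(ranked) * b / bins)
        points.append((b / bins, sum(ranked[:cutoff]) / total))
    return points

# Toy example: ten records, three of which truly belong to the category.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
actual = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

points = cumulative_gains(scores, actual, bins=5)
for examined, found in points:
    print(f"{examined:.0%} of records examined, {found:.0%} of positives found")
```

A model with no predictive power would find positives in proportion to the records examined, so the further the curve sits above that diagonal, the better the ranking.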
17Time and Effort
- Proved to be very time efficient
- A few hours of actuarial time
- Most of the time was the business expert classifying data
- Surprisingly good result given the effort
- More than good enough to solve the problem