Categorisation of Short Text Fields - PowerPoint PPT Presentation

Provided by: simonp2

Transcript and Presenter's Notes

1
Categorisation of Short Text Fields Using Text Mining
Mark Bentley and Peter Yeates
© Mark Bentley and Peter Yeates
2
Contents
  • Information from Data
  • Text Mining Primer
  • Case Study

3
The Problem - Information or Just Data?
  • Legacy computer systems
  • Store data as free-form text
  • Every imaginable description, spelling
  • Data becomes information only if the free-form text can be decoded

4
The Desire - Reconstruct the Information
  • Want
    • to get the information that should have been captured
    • to do this quickly, easily, and objectively

5
The Desire - Reconstruct the Information
  • Method
    • Manually classify?
    • Manually identify and apply rules?
    • Automate identification of rules and application?

6
Automated Rule Discovery
  • Can use data-mining techniques to discover rules automatically
  • Knowledge of a business expert can be used to classify the data
  • Easy, quick, and can do a better job than other methods
7
Decision Trees
  • Models that aim to allocate data points into homogeneous groups
  • Operate by successively partitioning data
  • The tree fitting process chooses the best partitions to use
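
As a rough illustration of what "choosing the best partition" can mean, the sketch below scores each candidate yes/no split by the weighted Gini impurity of the two groups it produces and keeps the split with the most homogeneous result. Gini impurity is an assumption here, since the slides do not say which criterion was used, and the feature names and toy data are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a perfectly homogeneous group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return (feature, weighted_impurity) for the best binary partition.

    rows: list of dicts mapping feature name -> 0/1.
    """
    best = None
    n = len(rows)
    for f in features:
        left = [y for r, y in zip(rows, labels) if r[f] == 1]
        right = [y for r, y in zip(rows, labels) if r[f] == 0]
        if not left or not right:
            continue  # this feature does not actually partition the data
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (f, score)
    return best

# Hypothetical toy data: word indicators and expert-assigned categories.
rows = [
    {"shed": 1, "tractor": 0, "old": 1},
    {"shed": 1, "tractor": 0, "old": 0},
    {"shed": 0, "tractor": 1, "old": 1},
    {"shed": 0, "tractor": 1, "old": 0},
]
labels = ["building", "building", "machinery", "machinery"]
split = best_split(rows, labels, ["shed", "tractor", "old"])
```

A full tree would apply `best_split` again to each resulting group, which is the "successive partitioning" described on the slide.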
8
Decision Trees

9
Pruning
  • Decision trees can be small or large depending on the amount of data and the degree of fit required
  • Various statistical tools are used for this pruning
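
The slides do not name the statistical tools used. One widely used option is cost-complexity pruning, which scores each candidate subtree by its error plus a penalty per leaf. The sketch below applies that idea to a hypothetical list of candidate subtrees; the error figures and the penalty `alpha` are made up for illustration.

```python
def penalised_cost(error, n_leaves, alpha):
    """Cost-complexity score: error rate plus a penalty of alpha per leaf."""
    return error + alpha * n_leaves

def choose_subtree(candidates, alpha):
    """Pick the (n_leaves, error) pair with the lowest penalised cost."""
    return min(candidates, key=lambda c: penalised_cost(c[1], c[0], alpha))

# Hypothetical candidates: (number of leaves, training error rate).
candidates = [(1, 0.40), (3, 0.22), (8, 0.15), (20, 0.14)]

small_penalty = choose_subtree(candidates, alpha=0.0)   # favours the largest tree
large_penalty = choose_subtree(candidates, alpha=0.01)  # favours a smaller tree
```

With no penalty the largest tree always wins on training error; raising `alpha` trades a little fit for a simpler tree.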
10
Categorisation of Farm Assets
  • Want to set robust premium rates for categories of farm assets
  • Asset data captured only as free-form text
  • 30K short-text descriptions
  • 6,000 unique words
  • Assets: buildings, tools, animals, etc.
11
Framing the Problem
  • Categorical variables
  • Classify by starting letter
  • Extra field for Shed
12
Framing the Problem
  • A common alternative
  • Indicator variables
  • Most frequent words
  • Stemming
  • Drop connector words
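
The steps listed above can be sketched as a small pipeline that turns each short description into a row of 0/1 indicator variables. This is only an illustration under stated assumptions: the connector-word list, the crude "strip a trailing s" stemmer, and the sample descriptions are all hypothetical, and the slides do not say which stemmer or word list was used.

```python
import re
from collections import Counter

# Hypothetical connector-word list; a real one would be longer.
STOP_WORDS = {"and", "the", "of", "with", "for", "a", "in"}

def stem(word):
    """Very crude stemmer: strips a trailing 's' (illustrative only)."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def tokenise(text):
    """Lower-case, split into words, drop connector words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def build_vocabulary(descriptions, top_k):
    """Keep the top_k stems that appear in the most descriptions."""
    counts = Counter(w for d in descriptions for w in set(tokenise(d)))
    return [w for w, _ in counts.most_common(top_k)]

def indicators(text, vocabulary):
    """One 0/1 indicator per vocabulary word."""
    present = set(tokenise(text))
    return [1 if w in present else 0 for w in vocabulary]

descriptions = [
    "Hay shed and tractor",
    "Tractors for the farm",
    "Old hay barn",
]
vocab = build_vocabulary(descriptions, top_k=2)
row = indicators("New tractor in the hay shed", vocab)
```

Restricting the vocabulary to the most frequent words keeps the number of indicator columns manageable when the raw text contains thousands of unique words.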

13
Training Data
  • Information is needed to train the model
  • This was created by a business expert manually classifying a 1% random sample of the data
  • Don't need a "why" for each classification
  • Categories made up as needed
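
Drawing the sample for the business expert to label could look like the sketch below. The function name, the seed, the sampling fraction, and the placeholder descriptions are assumptions for illustration; the slides do not describe how the sample was drawn beyond it being random.

```python
import random

def draw_training_sample(descriptions, fraction, seed=0):
    """Return a reproducible simple random sample for manual labelling."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    k = max(1, round(len(descriptions) * fraction))
    return rng.sample(descriptions, k)

# Hypothetical stand-in for the free-form asset descriptions.
descriptions = [f"asset description {i}" for i in range(1000)]
to_label = draw_training_sample(descriptions, fraction=0.05)
```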

14
Impression ?
  • First model
    • from inspection appeared to work well
    • some asset types often misclassified
  • Additional sample of problem assets was taken and manually classified
  • Model was rebuilt using the expanded training data
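
One simple way to spot the asset types that are often misclassified is to compare the model's output with the expert's labels category by category. The sketch below is a hypothetical illustration; the labels and the 50% threshold are invented, and the slides describe the first check as visual inspection rather than a calculation like this.

```python
from collections import Counter

def misclassification_rates(actual, predicted):
    """Share of items in each expert-assigned category that the model got wrong."""
    totals = Counter(actual)
    errors = Counter(a for a, p in zip(actual, predicted) if a != p)
    return {cat: errors[cat] / totals[cat] for cat in totals}

# Hypothetical labels for illustration.
actual = ["building", "building", "animal", "animal", "tool", "tool"]
predicted = ["building", "building", "tool", "animal", "tool", "building"]

rates = misclassification_rates(actual, predicted)
# Categories worth an extra manually classified sample (threshold is arbitrary).
problem_categories = [c for c, r in rates.items() if r >= 0.5]
```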

15
Second Cut
  • Substantially improved
  • Close to manual classification?
  • Classification errors
    • Tractor and Hay Barn
    • One category required
    • Two assets described → Unclassifiable
16
Testing Results
  • Testing models work is essential
  • Visual inspection → models worked well
  • Numerical test → Gains chart
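
A gains chart plots, for the cases ranked from highest to lowest model score, the share of true target cases captured so far. The sketch below computes those points for a single category; the scores, the labels, and the choice of five bins are hypothetical, and the slides do not say how the chart was built for this multi-category problem.

```python
def cumulative_gains(scores, is_target, n_bins=10):
    """Cumulative share of target cases captured in the top-scored bins.

    Returns a list of (share_of_population, share_of_targets_captured).
    Assumes at least one target case is present.
    """
    ranked = sorted(zip(scores, is_target), key=lambda p: p[0], reverse=True)
    total_targets = sum(is_target)
    n = len(ranked)
    points = []
    for b in range(1, n_bins + 1):
        cutoff = round(n * b / n_bins)
        captured = sum(t for _, t in ranked[:cutoff])
        points.append((b / n_bins, captured / total_targets))
    return points

# Hypothetical scores (model confidence that an asset is a building)
# and the expert's answer (1 = building, 0 = something else).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_target = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
gains = cumulative_gains(scores, is_target, n_bins=5)
```

A model no better than chance would capture targets in proportion to the population share, so the further the curve sits above that diagonal, the better the ranking.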
17
Time and Effort
  • Proved to be very time efficient
  • A few hours of actuarial time
  • Most was business expert classifying data
  • Surprisingly good result given the effort
  • More than good enough to solve the problem