Title: comp3776: Data Mining and Text Analytics
1 comp3776: Data Mining and Text Analytics
- Intro to Data Mining
- By Eric Atwell, School of Computing, University of Leeds
- (including re-use of teaching resources from
other sources, esp. Comp3740 Knowledge Management
and Adaptive Systems, School of Computing, University of Leeds)
2 What has Machine Learning got to do with
Computing / Information Systems?
- "Most international organizations produce more
information in a week than many people could read
in a lifetime" (Adriaans and Zantinge)
3 Objectives of knowledge discovery or machine
learning or data mining
- Data mining is about discovering patterns in data.
- For this we need:
- KD/DM techniques, algorithms, tools, e.g. BootCat, WEKA
- A methodological framework to guide us in
collecting data and applying the best algorithms: CRISP-DM
4 Data Mining, Machine Learning, Knowledge
Discovery, Text Mining
- Data Mining was originally about learning
patterns from databases: data structured as Records, Fields
- Knowledge Discovery is an exotic term for DM???
- Increasingly, data is unstructured text (WWW), so
- Text Mining is a new subfield of DM, focussing on
Knowledge Discovery from unstructured text data
5 define data mining
- Data mining, also known as knowledge-discovery in
databases (KDD), is the practice of automatically
searching large stores of data for patterns. To
do this, data mining uses computational
techniques from artificial intelligence,
statistics and pattern recognition.
en.wikipedia.org/wiki/Data_mining
6 define text mining
- Text mining, also known as intelligent text
analysis, text data mining or knowledge-discovery
in text (KDT), refers generally to the process of
extracting interesting and non-trivial
information and knowledge from unstructured text.
Text mining is a young interdisciplinary field
which draws on information retrieval, data
mining, machine learning, statistics and
computational linguistics. ... en.wikipedia.org/wiki/Text_mining
7 define knowledge discovery
- Knowledge discovery is the process of finding
novel, interesting, and useful patterns in data.
Data mining is a subset of knowledge discovery.
It lets the data suggest new hypotheses to test.
www.purpleinsight.com/downloads/docs/visualizer_tutorial/glossary/go01.html
- Data mining, also known as knowledge-discovery in
databases (KDD), is the practice of automatically
searching large stores of data for patterns. To
do this, data mining uses computational
techniques from AI, statistics and pattern
recognition. en.wikipedia.org/wiki/Knowledge_discovery
8 Data Mining Overview
- Diagram: Concepts, Instances or examples, Attributes -> Data Mining -> Concept Descriptions
- Each instance is an example of the concept to be
learned or described.
- The instance may be described by the values of its attributes.
9 Instances
- Input to a data mining algorithm is in the form
of a set of examples, or instances.
- Each instance is represented as a set of features
or attributes.
- Usually in DB data mining this set takes the form
of a flat file: each instance is a record in the
file, each attribute is a field in the record.
- In text mining, an instance may be a word/term in
context (surrounding words/document)
- The concepts to be learned are formed from
patterns discovered within the set of instances.
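As a minimal sketch of the flat-file view described above, each instance can be held as one record whose fields are its attributes. The attribute names and the three records are hypothetical, chosen only for illustration. The text-mining instance assumes a simple whitespace tokeniser and a context window of two words, neither of which is prescribed by the slides.

```python
# Hypothetical flat file: each line is one instance (record),
# each comma-separated value is one attribute (field).
FLAT_FILE = """outlook,windy,play
sunny,FALSE,no
overcast,TRUE,yes
rainy,FALSE,yes"""


def read_instances(text):
    """Return a list of dicts, one per record, keyed by attribute name."""
    lines = text.strip().splitlines()
    names = lines[0].split(",")
    return [dict(zip(names, line.split(","))) for line in lines[1:]]


def term_in_context(tokens, index, window=2):
    """A text-mining instance: a term plus the words around it."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return {"term": tokens[index], "left": left, "right": right}


instances = read_instances(FLAT_FILE)
tokens = "data mining is about discovering patterns in data".split()
context = term_in_context(tokens, 1)
print(instances[0])
print(context)
```

Either representation can then be handed to a learning algorithm; which attributes to include is a design decision, not something the file format decides for you.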
10 concepts
- The types of concepts we try to learn include:
- Key indicators: features or terms specific to our domain
- Clusters or natural partitions
- E.g. we might cluster customers according to their
shopping habits.
- E.g. is this web-page British or American English?
- Rules for classifying examples into pre-defined classes.
- E.g. mature students studying information systems
with a high grade for General Studies A level are
likely to get a 1st class degree
- General associations
- E.g. people who buy nappies are in general likely
also to buy beer
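The nappies-and-beer association can be made concrete with the usual support and confidence measures for association rules. The slide does not name these measures, so treat this as one possible reading; the baskets below are invented purely for illustration.

```python
# Hypothetical shopping baskets, invented for illustration only.
BASKETS = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction also containing consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"nappies", "beer"}, BASKETS)
rule_confidence = confidence({"nappies"}, {"beer"}, BASKETS)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data two of the three nappy baskets also contain beer, so the rule "nappies -> beer" has confidence of about two thirds.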
11 More concepts
- The types of concepts we try to learn include:
- Unexpected (suspicious?) associations or coincidences
- E.g. known suspects A, B, C all phoned D last week
- Numerical prediction
- E.g. look for rules to predict what salary a
graduate will get, given A level results, age,
gender, programme of study and degree result;
this may give us an equation:
- Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree
- (but are Gender, Programme really numbers???)
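One common answer to the question of whether nominal attributes are "really numbers" is to give each category its own weight (one-hot or dummy coding) rather than a single coefficient. The sketch below assumes that approach, which the slide does not prescribe. All coefficients are hypothetical and not learned from any data, and only programme and degree are shown as nominal attributes; gender would be handled the same way.

```python
# Hypothetical coefficients, invented for illustration; a real model
# would estimate them from data (e.g. by least-squares regression).
INTERCEPT = 15000.0
NUMERIC_WEIGHTS = {"a_level_points": 20.0, "age": 100.0}
# Nominal attributes get one weight per category (one-hot / dummy coding)
# instead of being forced onto a single numeric scale.
NOMINAL_WEIGHTS = {
    "programme": {"computing": 3000.0, "history": 0.0},
    "degree": {"1st": 4000.0, "2:1": 2000.0, "2:2": 0.0},
}


def predict_salary(graduate):
    """Linear prediction: intercept + weighted numerics + per-category offsets."""
    total = INTERCEPT
    for name, weight in NUMERIC_WEIGHTS.items():
        total += weight * graduate[name]
    for name, offsets in NOMINAL_WEIGHTS.items():
        total += offsets[graduate[name]]
    return total


example = {"a_level_points": 300, "age": 22, "programme": "computing", "degree": "2:1"}
predicted = predict_salary(example)
print(predicted)
```

The prediction is still a linear equation, but each nominal value contributes its own offset, so no artificial ordering of categories is implied.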
12 DB Example: weather to play?
14 /usr/local/weka-3-4-13/data/weather.nominal.arff
- @relation weather.symbolic
- @attribute outlook {sunny, overcast, rainy}
- @attribute temperature {hot, mild, cool}
- @attribute humidity {high, normal}
- @attribute windy {TRUE, FALSE}
- @attribute play {yes, no}
- @data
- sunny,hot,high,FALSE,no
- sunny,hot,high,TRUE,no
- overcast,hot,high,FALSE,yes
- rainy,mild,high,FALSE,yes
- rainy,cool,normal,FALSE,yes
- rainy,cool,normal,TRUE,no
- overcast,cool,normal,TRUE,yes
- sunny,mild,high,FALSE,no
- sunny,cool,normal,FALSE,yes
- rainy,mild,normal,FALSE,yes
- sunny,mild,normal,TRUE,yes
- overcast,mild,high,TRUE,yes
- overcast,hot,normal,FALSE,yes
- rainy,mild,high,TRUE,no
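To show what "learning a rule" from these 14 instances could look like, here is a minimal sketch of a single-attribute rule learner in the spirit of 1R: for each attribute, predict the majority class for each of its values and count the training errors. This is an illustrative choice of technique, not necessarily the one the lecture goes on to use, and the function name `one_rule` is hypothetical.

```python
from collections import Counter, defaultdict

# The 14 instances from the slide: outlook, temperature, humidity, windy, play.
DATA = """sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no"""

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]


def one_rule(rows, attribute_index, class_index=-1):
    """Map each attribute value to its majority class; return (rule, error count)."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute_index]][row[class_index]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    return rule, errors


rows = [line.split(",") for line in DATA.strip().splitlines()]
results = {name: one_rule(rows, i) for i, name in enumerate(ATTRIBUTES[:-1])}
for name, (rule, errors) in results.items():
    print(name, rule, f"{errors}/{len(rows)} errors")
```

On this data the rules based on outlook and on humidity each misclassify 4 of the 14 training instances, while temperature and windy each misclassify 5.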
15 /usr/local/weka-3-4-13/data/weather.arff
- @relation weather
- @attribute outlook {sunny, overcast, rainy}
- @attribute temperature real
- @attribute humidity real
- @attribute windy {TRUE, FALSE}
- @attribute play {yes, no}
- @data
- sunny,85,85,FALSE,no
- sunny,80,90,TRUE,no
- overcast,83,86,FALSE,yes
- rainy,70,96,FALSE,yes
- rainy,68,80,FALSE,yes
- rainy,65,70,TRUE,no
- overcast,64,65,TRUE,yes
- sunny,72,95,FALSE,no
- sunny,69,70,FALSE,yes
- rainy,75,80,FALSE,yes
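The difference between nominal and real attributes matters as soon as the file is read by a program. The sketch below parses a small subset of ARFF (written in standard ARFF syntax, with `@` markers and braces around nominal values) and converts `real` attributes to floats. It is a hand-rolled illustration using only the 10 data rows shown on the slide, not WEKA's own loader, and `parse_arff` is a hypothetical helper name.

```python
# The ARFF text from the slide (10 data rows are shown there).
ARFF = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
"""


def parse_arff(text):
    """Parse a small ARFF subset: nominal and real attributes, dense data rows."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            values = line.split(",")
            rows.append({
                name: float(v) if kind == "real" else v
                for (name, kind), v in zip(attributes, values)
            })
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, "real" if kind.lower() == "real" else "nominal"))
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


attributes, rows = parse_arff(ARFF)
humidity_yes = mean(r["humidity"] for r in rows if r["play"] == "yes")
humidity_no = mean(r["humidity"] for r in rows if r["play"] == "no")
print(humidity_yes, humidity_no)
```

Once humidity is numeric, summaries such as the mean per class become possible, which is exactly what the nominal version of the file cannot offer.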
16 Text mining example: Which English dominates the
WWW, UK or US?
- "First catch your rabbit" (Mrs Beeton's
cookbook): other tools are possible, but
WWW-BootCat was easier to use
- First sign up for Domain, SketchEngine account,
Google key; download seeds-en from
http://corpus.leeds.ac.uk/internet.html
- (see comp3740 specifications and lecture notes)
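Once a web corpus has been collected, one simple way to approach the UK-or-US question is to count known spelling variants. The sketch below assumes the corpus is already available as plain text and uses a tiny, hypothetical list of spelling pairs; it is one possible approach for illustration, not the method prescribed by the slides.

```python
import re
from collections import Counter

# Hypothetical (UK, US) spelling pairs; a real study would need a much larger list.
SPELLING_PAIRS = [("colour", "color"), ("centre", "center"), ("organise", "organize")]


def count_variants(text):
    """Count how many tokens match a UK or a US spelling from SPELLING_PAIRS."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    uk = sum(tokens[uk_form] for uk_form, _ in SPELLING_PAIRS)
    us = sum(tokens[us_form] for _, us_form in SPELLING_PAIRS)
    return {"uk": uk, "us": us}


# Made-up sample text standing in for a downloaded corpus.
sample = "The colour of the city centre. We organize by color; the center has color."
counts = count_variants(sample)
print(counts)
```

Spelling is only one indicator; vocabulary differences would need a different word list, and pages that mix both varieties would need a decision rule of their own.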
17 Example 2: Data Mining for an ontology
- Ontology: the concepts in a discipline, and
meaning-relationships between these concepts (01.ppt)
- "concepts" roughly equates to terminology:
specialist words and phrases in a discipline
- WordNet is freely available for general English
- What about other languages? EuroWordNet,
BalkaNet, (but not ALL languages!)
- What about specific domains? Domain-specific
ONTOLOGIES have to be devised (by experts)
- What about my own specific domain/language?
- Automatic extraction of key words / concepts from
example documents (machine learning / knowledge
discovery)
18 Automatic terminology extraction
- Terminology extraction / thesaurus construction
- based on documents (either retrieved set or the
whole collection) as corpus (training text set)
- define a measure of how close one index term is
to another in meaning-space, ?or literal distance?
- for each term, form a neighbourhood comprising
the nearest n terms
- treat these neighbourhoods like related
thesaurus classes
- terms with similar neighbourhoods are treated as
synonyms.
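The neighbourhood idea can be sketched as follows, assuming a closeness score between terms is already available. The scores here are hypothetical numbers invented for illustration, and using Jaccard overlap to compare two neighbourhoods is an assumption; the slide does not say how "similar neighbourhoods" should be measured.

```python
# Hypothetical closeness scores between index terms (higher = closer),
# invented for illustration; in practice they would come from a corpus.
CLOSENESS = {
    ("car", "engine"): 0.9, ("car", "wheel"): 0.8, ("car", "road"): 0.5,
    ("automobile", "engine"): 0.85, ("automobile", "wheel"): 0.7,
    ("automobile", "road"): 0.4, ("car", "automobile"): 0.3,
    ("engine", "wheel"): 0.6, ("engine", "road"): 0.2, ("wheel", "road"): 0.3,
}


def closeness(a, b):
    """Symmetric lookup; unseen pairs count as not close at all."""
    return CLOSENESS.get((a, b), CLOSENESS.get((b, a), 0.0))


def neighbourhood(term, terms, n):
    """The n terms closest to term (excluding the term itself)."""
    others = [t for t in terms if t != term]
    ranked = sorted(others, key=lambda t: closeness(term, t), reverse=True)
    return set(ranked[:n])


def overlap(a, b):
    """Jaccard overlap of two neighbourhoods (an assumed similarity measure)."""
    return len(a & b) / len(a | b)


TERMS = ["car", "automobile", "engine", "wheel", "road"]
hoods = {t: neighbourhood(t, TERMS, n=2) for t in TERMS}
similarity = overlap(hoods["car"], hoods["automobile"])
print(hoods["car"], hoods["automobile"], similarity)
```

Here "car" and "automobile" are not directly close to each other, yet they share the same two nearest neighbours, which is the signal the slide proposes for treating them as synonyms.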
19 Finding coordinate terms
- One attempt to define how close a term is to another:
- If two terms are both used to index the same
document many times in the collection, then they
are deemed to be close.
- From document-term matrix, compute
term-correlation matrix
- The term correlation matrix can be normalised so
that terms that index a lot of documents don't
have an unfair chance: reduce weight of common
words
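A minimal sketch of the two steps above: build the term-correlation matrix from a document-term matrix, then normalise it. The normalisation used here, s[i][j] = c[i][j] / (c[i][i] + c[j][j] - c[i][j]), is one common choice and is an assumption, since the slide does not say which normalisation is meant. The matrix itself is hypothetical.

```python
# Hypothetical document-term matrix: rows are documents, columns are terms,
# entries are 1 if the term indexes the document (invented for illustration).
TERMS = ["data", "mining", "text", "the"]
DOC_TERM = [
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]


def correlation(doc_term):
    """c[i][j] = number of documents indexed by both term i and term j."""
    n = len(doc_term[0])
    return [[sum(row[i] * row[j] for row in doc_term) for j in range(n)]
            for i in range(n)]


def normalise(c):
    """s[i][j] = c[i][j] / (c[i][i] + c[j][j] - c[i][j]).

    Assumes every term indexes at least one document, so no denominator is zero.
    """
    n = len(c)
    return [[c[i][j] / (c[i][i] + c[j][j] - c[i][j]) for j in range(n)]
            for i in range(n)]


raw = correlation(DOC_TERM)
norm = normalise(raw)
data, mining, the = TERMS.index("data"), TERMS.index("mining"), TERMS.index("the")
print(raw[data][mining], raw[data][the])
print(norm[data][mining], norm[data][the])
```

In the raw matrix "data" co-occurs equally often with "mining" and with "the"; after normalisation the common word "the" scores only half as high, which is the effect the last bullet describes.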
20 Other ways to find specialist terms
- Other ways to find domain-specific terms and relations:
- Collect a domain corpus, find terms different
from a generic "gold standard" corpus (British
National Corpus)
- Collocation-groups: for each term, collect its
collocations in the corpus: other words it
appears next to (or near to). If two terms have
similar collocation-sets, then they are deemed to
be close.
- Association matrix based on proximity: compute
average distance between pairs of terms (no. of
words between them, literally), use this as
closeness metric
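The collocation-group idea can be sketched as below. The mini-corpus is invented for illustration, the window of one word either side is an arbitrary choice, and Jaccard similarity is an assumed way of comparing collocation-sets; the slide only says the sets should be "similar".

```python
# Hypothetical mini-corpus, invented for illustration.
CORPUS = (
    "the doctor treated the patient . "
    "the physician treated the patient . "
    "the doctor examined the chart . "
    "the physician examined the chart ."
)


def collocations(tokens, term, window=1):
    """Set of words appearing within `window` positions of `term`."""
    found = set()
    for i, token in enumerate(tokens):
        if token == term:
            lo, hi = max(0, i - window), i + window + 1
            found.update(tokens[lo:i] + tokens[i + 1:hi])
    return found


def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0


tokens = CORPUS.split()
doctor = collocations(tokens, "doctor")
physician = collocations(tokens, "physician")
chart = collocations(tokens, "chart")
print(jaccard(doctor, physician), jaccard(doctor, chart))
```

"doctor" and "physician" never occur next to each other here, but they share all their collocates, so they come out as close; "doctor" and "chart" share only "the". Filtering out such function words would be a sensible refinement.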
21 Why build a thesaurus?
- a thesaurus or ontology can be used to normalise
a vocabulary and queries (?or documents?)
- it can be used (with some human intervention) to
increase recall and precision
- generic thesaurus/ontology may not be effective
in specialized collections and/or queries
- Semi-automatic construction of thesaurus/ontology
based on the retrieved set of documents has
produced some promising results, e.g. Semantic
Web
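As a small illustration of how a thesaurus might raise recall, the sketch below expands a query with synonyms before matching. The thesaurus entries, the documents and the any-term matching rule are all hypothetical simplifications, not a description of any particular retrieval system.

```python
# Hypothetical thesaurus and documents, invented for illustration.
THESAURUS = {"car": {"automobile", "vehicle"}}
DOCUMENTS = {
    1: "used car for sale",
    2: "automobile repair manual",
    3: "bicycle repair manual",
}


def expand(query_terms, thesaurus):
    """Add every thesaurus synonym of each query term."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= thesaurus.get(term, set())
    return expanded


def search(query_terms, documents):
    """Ids of documents containing at least one query term."""
    return {doc_id for doc_id, text in documents.items()
            if set(text.split()) & set(query_terms)}


plain_hits = search({"car"}, DOCUMENTS)
expanded_hits = search(expand({"car"}, THESAURUS), DOCUMENTS)
print(plain_hits, expanded_hits)
```

The expanded query also finds the "automobile" document. Whether precision holds up depends on how good the synonym sets are, which is why the slide mentions human intervention.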
22 Data Mining Key points
- Knowledge Discovery (Data Mining) tools
semi-automate the process of discovering patterns
in data.
- Tools differ in terms of what concepts they
discover (differences, key-terms, clusters,
decision-trees, rules)
- and in terms of the output they provide (e.g.
clustering algorithms provide a set of
subclasses)
- Selecting the right tools for the job is based on
business objectives: what is the USE for the
knowledge discovered?
23 A Data Mining consultant
- You should be able to:
- Decide which is the appropriate data mining
technique for a given problem defined in terms
of business objectives.
- Decide which is the most appropriate form of
input (which attributes/features will be useful
for learning) and output (what does your client
want to see?)