Title: DATA, TEXT, AND WEB MINING
1 Chapter 7
Pages 304-309, 311, Sections 7.3, 7.5, 7.6
- DATA, TEXT, AND WEB MINING
2 Data mining
- A process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify new knowledge from large databases
- Recognizes the untapped value of data in large databases
- You may unexpectedly strike it rich in understanding relationships among data
3 Example
Task: Find the best route to cover the territory
4 Challenge of finding relationships in large databases
5 Connect equal elevation points to make a contour map
The dark vertical line shows the best route to cross the territory without falling off a cliff.
6 Once relationships are discovered, they can be used for prediction
7 Uses of Data Mining-1
- Classification
  - Identify the attribute of interest (e.g., you want to classify who is likely to pay late)
  - Examine all other attribute values of the customer from the data warehouse and locate the one that is most related to the attribute of interest (e.g., monthly income level)
- Mining Algorithm
  - The most common algorithm used for classification is the decision tree
  - The Gini index, used in developing decision trees, helps determine where to place the split between two classes (e.g., at what income level)
  - (see example on page 316)
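To make the Gini-based split concrete, here is a minimal sketch that tries each candidate income threshold and keeps the one with the lowest weighted Gini impurity. The customer records, the function names `gini` and `best_split`, and the midpoint rule for candidate thresholds are all assumptions made for illustration; this is not the textbook's example from page 316.

```python
# Hypothetical customer records: (monthly_income, paid_late)
customers = [
    (1200, True), (1500, True), (1800, True), (2100, True),
    (2500, False), (3000, True), (3600, False), (4200, False),
]

def gini(labels):
    """Gini impurity of a list of boolean class labels (0.0 means pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # share of "paid late" cases
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_split(records):
    """Return (threshold, weighted_gini) with the lowest impurity after splitting."""
    values = sorted({income for income, _ in records})
    best = None
    # Candidate thresholds are midpoints between consecutive income values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [late for income, late in records if income <= threshold]
        right = [late for income, late in records if income > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best

threshold, impurity = best_split(customers)
print(f"Split at income <= {threshold}: weighted Gini = {impurity:.4f}")
```

A decision tree builder would repeat this search on each resulting group and across other attributes; the sketch covers only a single split on one attribute.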
8 Which product class is the best seller?
Conclusion: Clay products with a price below 25!
9 Uses of Data Mining-2
- Segmentation
  - Partitioning a database into groups in which the members of each group share similar characteristics
- Mining Algorithm
  - Clustering: The object is to sort cases into groups so that similarities are strong among members of the same cluster and weak between members of different clusters
  - E.g., companies with over 100 employees may share characteristics (e.g., revenue size) that set them apart from companies with fewer than 100 employees
  - This knowledge can help with developing different policies for dealing with different types of companies
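As a rough illustration of clustering, the sketch below runs a two-cluster k-means on a single attribute. The slide does not name a specific clustering algorithm, so k-means is an assumption here, as are the revenue figures and the function name `kmeans_1d`.

```python
def kmeans_1d(values, iterations=20):
    """Two-cluster k-means on a list of numbers; returns (centroids, labels)."""
    centroids = [min(values), max(values)]  # simple deterministic starting point
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centroids.append(sum(members) / len(members) if members else centroids[k])
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, labels

# Hypothetical annual revenue (in $ millions) for six companies.
revenues = [2.1, 3.4, 1.8, 45.0, 52.5, 38.0]
centroids, labels = kmeans_1d(revenues)
for revenue, label in zip(revenues, labels):
    print(f"revenue {revenue:>5} -> cluster {label}")
```

Real segmentation would normally use several attributes at once (employees, revenue, industry) and scale them first; one attribute keeps the two steps of the loop easy to follow.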
10 Uses of Data Mining-3
- Association
  - A category of data mining algorithm that establishes relationships about items that occur together in a given record
  - E.g., you may discover from data that senior students take elective courses together in the final semester
    - Can be helpful to schedule courses
  - People who buy a suit may also buy a dress shirt
  - People who buy swimwear may buy fins, goggles, cap, etc.
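A small sketch of how such item associations could be measured: count how often pairs of items appear in the same record, then report support (share of records containing the pair) and confidence (share of records with the first item that also contain the second). The baskets, the `min_support` value, and the function name `pair_rules` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase records, one set of items per transaction.
transactions = [
    {"suit", "dress shirt", "tie"},
    {"suit", "dress shirt"},
    {"swimwear", "goggles", "cap"},
    {"swimwear", "fins", "goggles"},
    {"suit", "tie"},
]

def pair_rules(baskets, min_support=0.3):
    """Return a list of (if_item, then_item, support, confidence) tuples."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be interesting
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

rules = pair_rules(transactions)
for if_item, then_item, support, confidence in rules:
    print(f"{if_item} -> {then_item}: support={support:.2f}, confidence={confidence:.2f}")
```

Algorithms such as Apriori extend the same counting idea to larger item sets while pruning rare combinations early.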
11 Uses of Data Mining-4
- Sequence discovery
  - The identification of associations over time. Discovering the order in which events occur.
  - The algorithm can examine data and predict what event is most likely to occur next.
  - Widely used in studying how visitors navigate a Web site. Helps to improve chances of making a sale.
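One simple way to predict the next event is to count which page most often follows each page in past visits. The sketch below does exactly that; the sessions and the function names `build_transitions` and `predict_next` are assumptions, and real sequence-discovery tools consider longer histories than a single previous page.

```python
from collections import Counter, defaultdict

# Hypothetical visitor sessions, each an ordered list of pages viewed.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "reviews", "cart"],
    ["home", "search", "products", "cart", "checkout"],
    ["home", "products", "cart"],
]

def build_transitions(sessions):
    """Count how often each page is immediately followed by each other page."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions

def predict_next(transitions, page):
    """Return the most frequent follower of `page`, or None if it was never followed."""
    followers = transitions.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

transitions = build_transitions(sessions)
print("After 'products', visitors most often go to:", predict_next(transitions, "products"))
```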
12 Uses of Data Mining-5
- Regression is a statistical technique that is used to map data to a prediction value
- Forecasting estimates future values based on patterns within large sets of data
- E.g., gasoline prices this month may predict next month's sales of SUVs
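The gasoline example can be sketched as a simple least-squares line. All figures below are invented for illustration, and `fit_line` is a hypothetical helper; the point is only to show how a fitted line maps this month's price to a predicted value for next month's sales.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: gasoline price this month vs. SUV sales the following month.
gas_price = [2.0, 2.5, 3.0, 3.5, 4.0]
suv_sales_next_month = [510, 440, 405, 345, 300]

slope, intercept = fit_line(gas_price, suv_sales_next_month)
forecast = slope * 3.2 + intercept  # prediction for a price of 3.20
print(f"slope={slope:.1f}, intercept={intercept:.1f}, forecast at 3.20: {forecast:.0f}")
```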
13 Data Mining Concepts and Applications
Data mining applications
- Marketing
- Banking
- Retailing and sales
- Manufacturing and production
- Brokerage and securities trading
- Insurance
- Computer hardware and software
- Government and defense
- Airlines
- Health care
- Broadcasting
- Police
- Homeland security
14 Text Mining
- Application of data mining to text files, typically free-form text material
- Discovers new knowledge that is not obvious
- Examples
  - Examine all news services, cluster similar topics, create a new summary for each topic
  - Find the hidden content of documents, including additional useful relationships, e.g., lies, deceptions, scams
- Not the same as a search engine on the Web
15 Text Mining: how is it done?
- It entails generating meaningful numerical indices/factors from the unstructured text and then processing these indices using various data mining algorithms
- Example
  - Extract each word from the document being text mined
  - Eliminate commonly used words (the, and, other, etc.)
  - Combine synonyms and phrases
  - Calculate weights for each term
    - tf factor (term frequency): actual number of times a word appears in a document
    - idf factor (inverse document frequency): how rare the term is across multiple documents
  - A high tf factor value for a given term indicates that the document topic is probably around the meaning of that term!
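The steps above can be sketched in a few lines. This version assumes a tiny stop-word list and uses tf × log(N / df) as the weight, which is one common tf-idf variant among several; it skips the synonym-combining step. The documents and the names `tokenize` and `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "other", "a", "of", "to", "is", "in"}

def tokenize(text):
    """Extract lowercase words and drop commonly used words."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]

def tf_idf(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in tf.items()})
    return weights

# Hypothetical documents.
documents = [
    "The warranty claim mentions a broken screen and a broken hinge.",
    "The invoice is overdue and the payment is pending.",
    "The screen of the laptop is cracked.",
]

weights = tf_idf(documents)
top_term = max(weights[0], key=weights[0].get)
print("Highest-weighted term in document 1:", top_term)
```

A term that appears in every document gets a weight of zero under this variant, which is how idf tempers a high raw count for words that do not distinguish one document from another.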
16 Text Mining - applications
- Automatic detection of e-mail spam or phishing through analysis of the document content
- Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message
- Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses
17 Web Mining
- The discovery and analysis of interesting and
useful information from the Web
18 Web content mining
- The extraction of useful information from Web pages
- E.g., search with the help of keywords in the meta tags of the web page
- You can analyze the document content of the first 10 links of Google in a search response
- You can generate a summary of the contents automatically in a new document!
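As a small illustration of reading keywords from meta tags, the sketch below parses a page with Python's standard `html.parser` module. The sample page and the names `MetaKeywordParser` and `extract_keywords` are hypothetical; fetching live pages and summarizing them is beyond this sketch.

```python
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated values of <meta name="keywords" ...> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )

def extract_keywords(html):
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords

# Hypothetical page source.
page = """<html><head>
<title>Handmade pottery</title>
<meta name="keywords" content="pottery, clay, Ceramics">
</head><body><p>Welcome</p></body></html>"""

keywords = extract_keywords(page)
print(keywords)
```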
19 Web structure mining
- The development of useful information from the links included in Web documents
- If a website's pages predominantly link to each other, you may consider the site to exist independently
- If a collection of websites link to each other heavily, it points to a web community or clan that shares common interests
- Example application: Web structure mining can lead to a better understanding of extremist groups
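One rough way to check whether a site's pages "predominantly link to each other" is to compute, per site, the share of its outgoing links that stay on the same host. The link list, host names, and the function name `internal_link_ratio` are assumptions for illustration; how high the ratio must be to call a site independent is a judgment call the slide does not specify.

```python
from collections import defaultdict
from urllib.parse import urlparse

def internal_link_ratio(links):
    """For each source host, return the fraction of its links that stay on that host."""
    totals = defaultdict(int)
    internal = defaultdict(int)
    for source, target in links:
        source_host = urlparse(source).netloc
        totals[source_host] += 1
        if urlparse(target).netloc == source_host:
            internal[source_host] += 1
    return {host: internal[host] / totals[host] for host in totals}

# Hypothetical (source page, linked page) pairs gathered by a crawler.
links = [
    ("http://shop.example/home", "http://shop.example/catalog"),
    ("http://shop.example/catalog", "http://shop.example/cart"),
    ("http://shop.example/cart", "http://shop.example/home"),
    ("http://shop.example/home", "http://club-a.example/"),
    ("http://club-a.example/", "http://club-b.example/"),
    ("http://club-a.example/", "http://club-b.example/events"),
    ("http://club-a.example/", "http://club-a.example/about"),
]

ratios = internal_link_ratio(links)
for host, ratio in ratios.items():
    print(f"{host}: {ratio:.2f} of links are internal")
```

A low internal ratio combined with many links between a small set of hosts would hint at the kind of web community described above.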
20 Web usage mining
- The extraction of useful information from the data being generated through webpage visits, transactions, etc.
- Clickstream analysis
  - Uses cookies, number of logs, time of log, etc.
  - Can help profile users
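A minimal sketch of profiling users from clickstream data, assuming each log entry carries a cookie ID, an ISO-format timestamp, and the requested path. The log format, the sample entries, and the function name `profile_users` are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

def profile_users(entries):
    """Summarize each cookie ID: hit count, most viewed page, and active time span."""
    pages = defaultdict(Counter)
    times = defaultdict(list)
    for cookie_id, timestamp, path in entries:
        pages[cookie_id][path] += 1
        times[cookie_id].append(datetime.fromisoformat(timestamp))
    profiles = {}
    for cookie_id, counts in pages.items():
        stamps = times[cookie_id]
        profiles[cookie_id] = {
            "hits": sum(counts.values()),
            "favorite_page": counts.most_common(1)[0][0],
            "active_seconds": (max(stamps) - min(stamps)).total_seconds(),
        }
    return profiles

# Hypothetical clickstream log: (cookie ID, timestamp, requested path).
log = [
    ("cookie-a", "2024-01-05T10:00:00", "/shoes"),
    ("cookie-a", "2024-01-05T10:02:00", "/shoes/boots"),
    ("cookie-a", "2024-01-05T10:05:00", "/shoes"),
    ("cookie-b", "2024-01-05T11:00:00", "/coupons"),
]

profiles = profile_users(log)
for cookie_id, profile in profiles.items():
    print(cookie_id, profile)
```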
21 Uses for Web mining
- Determine the lifetime value of clients
- Design cross-marketing strategies across products
- Evaluate promotional campaigns
- Target electronic ads and coupons at user groups
- Predict user behavior
- Present dynamic information to users
22 Data Mining Project Processes
23 Steps for Data Mining
- Problem definition: Decide the measure to study and the suitable mining algorithm (see Exercise 11)
- Data preparation: Design the cube and populate it with relevant data from the data warehouse
- Training: Run the mining algorithm on a subset of the data warehouse data for the system to learn to find segments, associations, etc. among data
- Validation: Apply the learned model from the previous step to the remaining subset of data and try to predict. Since you have historical data, you can verify whether the learned model is any good.
- Deploy: Implement the model to predict in a real environment where you do not know the actual results.
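The training, validation, and deployment steps can be sketched with a deliberately simple model. Everything here is an assumption for illustration: the historical records, the every-third-record holdout, and the "midpoint between class means" rule standing in for a real mining algorithm.

```python
def train(records):
    """Learn an income threshold: the midpoint between the two class means."""
    late = [income for income, paid_late in records if paid_late]
    on_time = [income for income, paid_late in records if not paid_late]
    return (sum(late) / len(late) + sum(on_time) / len(on_time)) / 2

def predict(threshold, income):
    """True means the customer is predicted to pay late."""
    return income < threshold

def accuracy(threshold, records):
    correct = sum(predict(threshold, income) == paid_late
                  for income, paid_late in records)
    return correct / len(records)

# Hypothetical historical data: (monthly_income, paid_late).
history = [
    (1200, True), (3900, False), (1500, True), (4200, False),
    (1800, True), (3000, False), (2000, True), (3500, False),
    (2600, False), (1700, True), (2900, False), (2300, True),
]

# Hold out every third record for validation; train on the rest.
validation = history[::3]
training = [record for i, record in enumerate(history) if i % 3 != 0]

threshold = train(training)                    # Training step
score = accuracy(threshold, validation)        # Validation step: outcomes are known
print(f"threshold={threshold:.0f}, validation accuracy={score:.2f}")

# Deployment step: predict for a new customer whose outcome is not yet known.
print("New customer with income 2100 likely to pay late:", predict(threshold, 2100))
```

Because the validation records were never seen during training, their accuracy gives a fairer estimate of how the model will behave once deployed.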