Title: Data mining
1. Data mining
2. What is Data Mining?
- Different perspectives: CS, Business, IT
- As a field of research in CS
  - Science of extracting useful information from large data sets or databases
- Also known as
  - Knowledge Discovery and Data Mining (KDD)
  - Knowledge Discovery in Databases (KDD)
3. Knowledge Discovery and Data Mining (KDD)
- KDD can be said to lie at the intersection of statistics, machine learning, databases, pattern recognition, information retrieval, and artificial intelligence.
4. Data Mining Definitions
- Analysis of datasets to find unsuspected relationships
  - Summarize data in novel ways that are understandable and useful to the data owner
- Extraction of knowledge from data
  - Non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data
- Process of discovering patterns
  - Automatically or semi-automatically, in large quantities of data
  - Patterns discovered must be useful and meaningful in that they lead to some advantage, usually economic
5. Why Data Mining?
- Large datasets are common due to advances in digital data acquisition and storage technology.
- Automatic data production leads to a need for automatic data consumption
- Large databases mean vast amounts of information
  - Difficulty lies in accessing it
- Business
  - Supermarket transactions
  - Credit card usage records
  - Telephone call details
  - Government statistics
- Scientific
  - Images of astronomical bodies
  - Molecular databases
  - Medical records
6. Why Data Mining?
- Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
  - Massive data collection
  - Powerful multiprocessor computers
  - Data mining algorithms
7. Example of Data Mining
- If a store tracks a customer's purchases and notices that the customer buys a lot of silk shirts, the data mining system will make a correlation between that customer and silk shirts.
- The store may begin direct mail marketing of silk shirts to that customer, or it may instead try to get the customer to buy a wider range of products.
- Another example: analysts found that beer and diapers were often bought together.
  - So place the high-profit diapers next to the high-profit beers.
- This technique is often referred to as "Market Basket Analysis".
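As a rough illustration of market basket analysis, the sketch below counts how often pairs of items appear in the same transaction and derives three commonly used measures: support, confidence, and lift. The transactions and the helper name `pair_metrics` are invented for this example and are not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]


def pair_metrics(transactions):
    """Return {(a, b): (support, confidence of a -> b, lift)} for co-occurring pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    # Sort each basket so every pair is stored in one canonical (alphabetical) order.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    metrics = {}
    for (a, b), count in pair_counts.items():
        support = count / n                       # share of baskets containing both items
        confidence = count / item_counts[a]       # share of baskets with a that also hold b
        lift = confidence / (item_counts[b] / n)  # above 1 means bought together more than chance
        metrics[(a, b)] = (support, confidence, lift)
    return metrics


metrics = pair_metrics(transactions)
support, confidence, lift = metrics[("beer", "diapers")]
print(f"beer -> diapers: support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests the two items occur together more often than independent purchases would predict. This is the kind of signal behind the shelf-placement decision described above.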
8. Steps in the Evolution of Data Mining
Evolutionary Step | Business Question | Enabling Technologies
Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks
Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Data Warehousing and Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses
Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases
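The gap between the data access question and the data mining question in the table above can be sketched in a few lines. The first function looks up a value that is already stored. The second fits a least-squares trend line and extrapolates one month ahead. The sales figures and function names are hypothetical, and a straight-line fit is only a stand-in for the "advanced algorithms" the table mentions.

```python
# Hypothetical monthly unit sales for one city (invented numbers).
monthly_units = {1: 100, 2: 110, 3: 120, 4: 130, 5: 140, 6: 150}


def query_units(sales, month):
    """Data access: return a value that is already stored."""
    return sales[month]


def forecast_next(sales):
    """Simplified prediction: fit a least-squares line and extrapolate one month."""
    months = sorted(sales)
    n = len(months)
    mean_x = sum(months) / n
    mean_y = sum(sales[m] for m in months) / n
    slope = sum((m - mean_x) * (sales[m] - mean_y) for m in months) / sum(
        (m - mean_x) ** 2 for m in months
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (months[-1] + 1)


march_units = query_units(monthly_units, 3)       # "What were unit sales last March?"
next_month_units = forecast_next(monthly_units)   # "What is likely to happen next month?"
print(march_units, round(next_month_units, 1))
```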
9. The Scope of Data Mining
- Automated prediction of trends and behaviors.
  - Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings.
- Automated discovery of previously unknown patterns.
  - An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.
- More columns.
  - High performance data mining allows users to explore the full depth of a database, without pre-selecting a subset of variables.
- More rows.
  - Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.
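The "more rows" point can be made concrete with the standard error of a sample mean, which shrinks in proportion to the square root of the sample size. The sketch below assumes independent random samples and an arbitrary, hypothetical sample standard deviation of 20. It illustrates the general statistical fact and is not a method from the slides.

```python
import math


def standard_error(sample_std, n):
    """Standard error of a sample mean: s / sqrt(n), assuming independent samples."""
    return sample_std / math.sqrt(n)


sample_std = 20.0  # assumed spread of the measured quantity (hypothetical)
for n in (100, 10_000, 1_000_000):
    # Each 100-fold increase in rows cuts the standard error by a factor of 10.
    print(n, standard_error(sample_std, n))
```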
10. Data Mining vs. Statistics
- The objective of the data mining exercise plays no role in the data collection strategy
  - In this way it differs from much of statistics
  - For this reason, data mining is referred to as secondary data analysis
- KDD is more complicated than initially thought
  - 80% preparing data
  - 20% mining data
11. Database Query vs. Data Mining
- Database query: when you know exactly what you are looking for.
- Data mining: when you only vaguely know what you are looking for.
12. Data Mining Tasks and Techniques
- Not so much a single technique
  - Idea that there is more knowledge hidden in the data than shows itself on the surface
- Any technique that helps to extract more out of data is useful
- Five major task types
  1. Exploratory Data Analysis (Visualization)
  2. Descriptive Modeling (Density estimation, Clustering)
  3. Predictive Modeling (Classification and Regression)
  4. Discovering Patterns and Rules (Association rules)
  5. Retrieval by Content (Retrieve items similar to pattern of interest)
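To make one of the task types above concrete, the following sketch shows predictive modeling (classification) with a k-nearest-neighbors vote. The records, feature names, and the function `knn_predict` are hypothetical. Real use would normally scale the features first, since the raw Euclidean distance here is dominated by the larger-valued column.

```python
import math
from collections import Counter

# Hypothetical labeled records: (monthly_spend, visits_per_month) -> responded to a mailing?
training = [
    ((20.0, 1.0), "no"),
    ((25.0, 2.0), "no"),
    ((30.0, 1.0), "no"),
    ((80.0, 6.0), "yes"),
    ((90.0, 8.0), "yes"),
    ((85.0, 7.0), "yes"),
]


def knn_predict(training, point, k=3):
    """Predict the majority label among the k training points nearest to `point`."""
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict(training, (75.0, 5.0)))
print(knn_predict(training, (22.0, 1.0)))
```

The same distance function could also serve retrieval by content: returning the nearest records themselves, instead of their majority label, retrieves the items most similar to a pattern of interest.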
13. Privacy concerns
- For example, if an employer has access to medical records, they may screen out people who have diabetes or have had a heart attack. Screening out such employees will cut insurance costs, but it creates ethical and legal problems.
- Essentially, data mining gives information that would not be available otherwise. It must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.
14. Notable Uses of Data Mining
- Data mining has been cited as the method by which the U.S. Army intelligence unit Able Danger supposedly identified the 9/11 attack leader, Mohamed Atta, and three other 9/11 hijackers as possible members of an al Qaeda cell operating in the U.S. more than a year before the attack.
15. References
- http://www.cedar.buffalo.edu/srihari/CSE626
- http://en.wikipedia.org/wiki/Data_Mining
- http://www.thearling.com/text/dmwhite/dmwhite.htm