1 Last update: 15 November 2007
Advanced databases
Inferring new knowledge from data(bases): Knowledge Discovery in Databases
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/berendt/teaching/2007w/adb/
2 Agenda
Motivation and application examples
The process of knowledge discovery
Origins and context
Major issues in knowledge discovery
A short overview of key techniques
3 What is the impact of genetically modified organisms?
4 Is our school system good for immigrants and/or children from poor backgrounds?
5 What are the effects of teaching in English at universities?
6 What makes people happy?
7 What do men and women like?
8 Is this a man or a woman?
9 Primary Tasks of Data Mining
- Classification: finding the description of several predefined classes and classifying a data item into one of them.
- Clustering: identifying a finite set of categories or clusters to describe the data.
- Dependency modeling: finding a model which describes significant dependencies between variables.
- Regression: mapping a data item to a real-valued prediction variable.
- Deviation and change detection: discovering the most significant changes in the data.
- Summarization: finding a compact description for a subset of data.
11 Data mining and knowledge discovery
- (informal definition)
- Data mining is about discovering knowledge in (huge amounts of) data.
- Therefore, it is clearer to speak about knowledge discovery in data(bases).
12 Recall: Data, information, and knowledge
- Data represents a fact or statement of event without relation to other things.
  - Ex.: It is raining.
- Information embodies the understanding of a relationship of some sort, possibly cause and effect.
  - Ex.: The temperature dropped 15 degrees and then it started raining.
- Knowledge represents a pattern that connects and generally provides a high level of predictability as to what is described or what will happen next.
  - Ex.: If the humidity is very high and the temperature drops substantially, the atmosphere is often unlikely to be able to hold the moisture, so it rains.
- (This is from knowledge-management theory. If you want to know about wisdom, check the Web page:
  G. Bellinger, D. Castro, A. Mills: Data, Information, Knowledge, and Wisdom. http://www.systems-thinking.org/dikw/dikw.htm )
13 Why Data Mining?
- The explosive growth of data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, ...
    - Science: remote sensing, bioinformatics, scientific simulation, ...
    - Society and everyone: news, digital cameras, ...
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets
14 Background: Evolution of Database Technology
- 1960s
  - Data collection, database creation, IMS and network DBMS
- 1970s
  - Relational data model, relational DBMS implementation
- 1980s
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s
  - Data mining, data warehousing, multimedia databases, and Web databases
- 2000s
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems
15 The KDD process
"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" - Fayyad, Piatetsky-Shapiro, Smyth (1996)
16 The process part of knowledge discovery
- CRISP-DM
  - CRoss Industry Standard Process for Data Mining
  - a data mining process model that describes the approaches expert data miners commonly use to tackle problems.
17 Knowledge discovery, machine learning, data mining
- Knowledge discovery
  - the whole process
- Machine learning
  - the application of induction algorithms and other algorithms that can be said to learn
  - the modeling phase
- Data mining
  - sometimes KD, sometimes ML
18 The KDD Process
[Figure: the KDD process as five numbered stages. Labels recoverable from the figure, in reading order:]
- Data organized by function; data warehousing; create/select target database
- Select sampling technique and sample data; supply missing values; eliminate noisy data
- Normalize values; transform values; create derived attributes; find important attributes and value ranges
- Select DM task(s); select DM method(s); extract knowledge; test knowledge; refine knowledge
- Query and report generation; aggregation and sequences; advanced methods; transform to different representation
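The cleaning and transformation steps named in the figure can be sketched in a few lines. This is only a minimal illustration in plain Python, not the procedure used in the lecture: the records, the attribute names (`age`, `income`), the choice of mean imputation, the plausibility bound and the derived attribute `senior` are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical target data; None marks a missing value.
records = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 42000},
    {"age": 41, "income": 58000},
    {"age": 37, "income": 9_000_000},  # implausible value, treated as noise
    {"age": 53, "income": 50000},
]

def supply_missing(rows, attr):
    """Replace missing values of attr with the mean of the known values."""
    known = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(known)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def eliminate_noise(rows, attr, upper):
    """Drop rows whose attr exceeds an assumed plausibility bound."""
    return [r for r in rows if r[attr] <= upper]

def normalize(rows, attr):
    """Min-max scale attr to [0, 1]; assumes at least two distinct values."""
    lo = min(r[attr] for r in rows)
    hi = max(r[attr] for r in rows)
    return [{**r, attr: (r[attr] - lo) / (hi - lo)} for r in rows]

def derive(rows):
    """Create a derived attribute from an existing one."""
    return [{**r, "senior": r["age"] >= 40} for r in rows]

cleaned = eliminate_noise(supply_missing(records, "age"), "income", 1_000_000)
prepared = normalize(derive(cleaned), "income")
```

Each function returns a new list rather than modifying its input, so the stages can be reordered or repeated, which matches the iterative character of the process.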
20 Main Contributing Areas of KDD
[Figure: KDD at the intersection of three contributing areas]
- Statistics: infer info from data (deduction and induction, mainly numeric data)
- Databases: store, access, search, update data (deduction)
  - Data warehouses: integrated data
  - OLAP: On-Line Analytical Processing
- Machine learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)
21 Data Mining: Classification Schemes
- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views lead to different classifications
  - Data view: kinds of data to be mined
  - Knowledge view: kinds of knowledge to be discovered
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted
22 Data Mining: Confluence of Multiple Disciplines
23 Why Not Traditional Data Analysis?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle, e.g., terabytes of data
- High dimensionality of data
  - A micro-array may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications
25 Data Mining: On What Kinds of Data?
- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structured data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web
26 Data Mining Functionalities
- Multidimensional concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Frequent patterns, association, correlation vs. causality
  - Diaper → Beer [0.5%, 75%] (Correlation or causality?)
- Classification and prediction
  - Construct models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  - Predict some unknown or missing numerical values
27 Data Mining Functionalities (2)
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Maximize intra-class similarity and minimize inter-class similarity
- Outlier analysis
  - Outlier: data object that does not comply with the general behavior of the data
  - Noise or exception? Useful in fraud detection, rare events analysis
- Trend and evolution analysis
  - Trend and deviation, e.g., regression analysis
  - Sequential pattern mining, e.g., digital camera → large SD memory
  - Periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses
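As a rough illustration of the outlier notion on this slide (an object that does not comply with the general behavior of the data), the sketch below flags values that lie far from the mean. The slide does not prescribe a method; the z-score test, the threshold of 2.0, the function name `z_score_outliers` and the sample amounts are all assumptions chosen for the example.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; 95 is the planted exception.
amounts = [10, 12, 11, 13, 12, 11, 95]
outliers = z_score_outliers(amounts)
```

Whether a flagged value is noise to discard or an exception worth investigating (as in fraud detection) is a question the test itself cannot answer.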
28 Are All the Discovered Patterns Interesting?
- Data mining may generate thousands of patterns: not all of them are interesting
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
29 Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  - Heuristic vs. exhaustive search
  - Association vs. classification vs. clustering
- Search for only interesting patterns: an optimization problem
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting patterns: mining query optimization
30 Other Pattern Mining Issues
- Precise patterns vs. approximate patterns
  - Association and correlation mining: possible to find sets of precise patterns
    - But approximate patterns can be more compact and sufficient
    - How to find high-quality approximate patterns?
  - Gene sequence mining: approximate patterns are inherent
    - How to derive efficient approximate pattern mining algorithms?
- Constrained vs. non-constrained patterns
  - Why constraint-based mining?
  - What are the possible kinds of constraints? How to push constraints into the mining process?
31 Data Mining Query Languages
- Automated vs. query-driven?
  - Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many but uninteresting
- Data mining should be an interactive process
  - The user directs what is to be mined
- Users must be provided with a set of primitives to be used to communicate with the data mining system
- Incorporating these primitives in a data mining query language
  - More flexible user interaction
  - Foundation for design of graphical user interface
  - Standardization of data mining industry and practice
32 Primitives that Define a Data Mining Task
- Task-relevant data
- Type of knowledge to be mined
- Background knowledge
- Pattern interestingness measurements
- Visualization/presentation of discovered patterns
33 Primitive 1: Task-Relevant Data
- Database or data warehouse name
- Database tables or data warehouse cubes
- Condition for data selection
- Relevant attributes or dimensions
- Data grouping criteria
34 Primitive 2: Types of Knowledge to Be Mined
- Characterization
- Discrimination
- Association
- Classification/prediction
- Clustering
- Outlier analysis
- Other data mining tasks
35 Primitive 3: Background Knowledge
- A typical kind of background knowledge: concept hierarchies
- Schema hierarchy
  - E.g., street < city < province_or_state < country
- Set-grouping hierarchy
  - E.g., {20-39} = young, {40-59} = middle_aged
- Operation-derived hierarchy
  - email address: hagonzal_at_cs.uiuc.edu
  - login-name < department < university < country
- Rule-based hierarchy
  - low_profit_margin(X) ← price(X, P1) and cost(X, P2) and (P1 - P2) < 50
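Three of the hierarchy kinds on this slide translate almost directly into code. The sketch below is one possible encoding, not a notation from the lecture; the names `AGE_GROUPS`, `age_concept`, `SCHEMA_HIERARCHY` and `generalize` are hypothetical, and the rule is read as "the margin P1 - P2 is below 50".

```python
# Set-grouping hierarchy: ranges of values are grouped under a named concept.
AGE_GROUPS = {"young": range(20, 40), "middle_aged": range(40, 60)}

def age_concept(age):
    """Return the concept an age falls under, or None if outside all groups."""
    for name, ages in AGE_GROUPS.items():
        if age in ages:
            return name
    return None

# Schema hierarchy: attributes ordered from most to least specific.
SCHEMA_HIERARCHY = ["street", "city", "province_or_state", "country"]

def generalize(attribute):
    """Return the next more general attribute, or None at the top level."""
    i = SCHEMA_HIERARCHY.index(attribute)
    return SCHEMA_HIERARCHY[i + 1] if i + 1 < len(SCHEMA_HIERARCHY) else None

# Rule-based hierarchy: an item has a low profit margin if price - cost < 50.
def low_profit_margin(price, cost):
    return (price - cost) < 50
```

For example, `age_concept(25)` yields `"young"` and `generalize("city")` yields `"province_or_state"`, which is the kind of roll-up a mining system can use to report patterns at a higher level of abstraction.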
36 Primitive 4: Pattern Interestingness Measure
- Simplicity
  - e.g., (association) rule length, (decision) tree size
- Certainty
  - e.g., confidence, P(A|B) = #(A and B) / #(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility
  - potential usefulness, e.g., support (association), noise threshold (description)
- Novelty
  - not previously known, surprising (used to remove redundant rules, e.g., Illinois vs. Champaign rule implication support ratio)
37 Primitive 5: Presentation of Discovered Patterns
- Different backgrounds/usages may require different forms of representation
  - E.g., rules, tables, crosstabs, pie/bar chart, etc.
- Concept hierarchy is also important
  - Discovered knowledge might be more understandable when represented at a high level of abstraction
  - Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
- Different kinds of knowledge require different representation: association, classification, clustering, etc.
38 Architecture: Typical Data Mining System
39 Major Issues in Data Mining
- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining and invisible data mining
  - Protection of data security, integrity, and privacy
41 Classification
What factors determine cancerous cells?
[Figure: examples (cancerous cell data) → classification algorithm (mining algorithm) → general patterns]
Classification algorithms:
- Rule induction
- Decision tree
- Neural network
42 Classification: Rule Induction
What factors determine a cell is cancerous?
- If Color = light and Tails = 1 and Nuclei = 2 then Healthy Cell (certainty = 92%)
- If Color = dark and Tails = 2 and Nuclei = 2 then Cancerous Cell (certainty = 87%)
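Once induced, rules like the two on this slide can be applied as a simple classifier. The sketch below is an assumption-laden illustration rather than the algorithm from the lecture: it encodes the two rules as data, reads the certainties 92 and 87 as percentages, and uses a first-match strategy; `RULES` and `classify` are hypothetical names.

```python
# Each rule: (conditions, predicted class, certainty as stated on the slide).
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]

def classify(cell):
    """Return (label, certainty) of the first rule whose conditions all hold,
    or (None, 0.0) if no rule covers the cell."""
    for conditions, label, certainty in RULES:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None, 0.0
```

A cell that no rule covers is left unclassified here; a real rule learner would typically add a default class or keep inducing rules until the training examples are covered.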
43 Classification: Decision Trees
[Figure: decision tree for the cell data. The root splits on Color (dark / light); lower nodes split on nuclei (1 / 2) and tails (1 / 2); leaves are labelled healthy or cancerous.]
44 Classification: Neural Networks
What factors determine a cell is cancerous?
[Figure: neural network with inputs Color = dark, nuclei = 1, tails = 2 and outputs Healthy / Cancerous]
45 Clustering
Are there clusters of similar cells?
Light color with 1 nucleus
Dark color with 2 tails and 2 nuclei
1 nucleus and 1 tail
Dark color with 1 tail and 2 nuclei
46 Association Rule Discovery
Task: Discovering association rules among items in a transaction database.
An association among two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.
In general: A1, A2, ... => B
47 Association Rule Discovery
Are there any associations between the characteristics of the cells?
- If color = light and nuclei = 1 then tails = 1 (support = 12.5%, confidence = 50%)
- If nuclei = 2 and Cell = Cancerous then tails = 2 (support = 25%, confidence = 100%)
- If tails = 1 then Color = light (support = 37.5%, confidence = 75%)
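Support and confidence for rules like these can be computed by simple counting: support is the fraction of all records that satisfy both sides of the rule, and confidence is the fraction of records satisfying the left-hand side that also satisfy the right-hand side. The sketch below assumes the slide's figures are percentages. The eight-cell table is hypothetical: it was constructed for this example so that the three rules yield the quoted numbers, and it is not the original data set.

```python
# Hypothetical cell table (not the lecture's data).
CELLS = [
    {"color": "light", "nuclei": 1, "tails": 1, "cell": "healthy"},
    {"color": "light", "nuclei": 1, "tails": 2, "cell": "healthy"},
    {"color": "light", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "light", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "dark", "nuclei": 1, "tails": 1, "cell": "healthy"},
    {"color": "dark", "nuclei": 2, "tails": 2, "cell": "cancerous"},
    {"color": "dark", "nuclei": 2, "tails": 2, "cell": "cancerous"},
    {"color": "dark", "nuclei": 1, "tails": 2, "cell": "healthy"},
]

def matches(record, conditions):
    """True if the record has the required value for every listed attribute."""
    return all(record[attr] == value for attr, value in conditions.items())

def support_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    lhs = [r for r in records if matches(r, antecedent)]
    both = [r for r in lhs if matches(r, consequent)]
    support = len(both) / len(records)
    confidence = len(both) / len(lhs) if lhs else 0.0
    return support, confidence

rule3 = support_confidence(CELLS, {"tails": 1}, {"color": "light"})
```

On this table, three of the eight cells have one tail and a light color, and four cells have one tail, which gives the support of 37.5% and confidence of 75% quoted for the third rule.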
48 Many Other Data Mining Techniques
Genetic Algorithms
Statistics
Bayesian Networks
Text Mining
Time Series
Rough Sets
49 A goal: From databases to deductive databases to inductive databases
- A deductive database system is a database system which can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database.
- Inductive databases
  - contain not only data, but also patterns.
  - In an IDB, inductive queries can be used to generate (mine), manipulate, and apply patterns.
  - The IDB framework supports the process of knowledge discovery in databases (KDD):
    - the results of one (inductive) query can be used as input for another
    - nontrivial multi-step KDD scenarios can be supported, rather than just single data mining operations.
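To make the phrase "conclude additional facts based on rules and facts" concrete, the sketch below derives new facts by naive forward chaining until nothing new appears. It is only an illustration of the idea, not how a deductive database system is implemented; the `parent` facts, the two `ancestor` rules and the function name `derive_ancestors` are hypothetical.

```python
# Stored facts as tuples (predicate, arg1, arg2). Hypothetical example data.
facts = {("parent", "ann", "bob"), ("parent", "bob", "cid")}

def derive_ancestors(facts):
    """Naive forward chaining for two rules:
         ancestor(X, Y) <- parent(X, Y)
         ancestor(X, Z) <- parent(X, Y) and ancestor(Y, Z)
    Repeats until no new fact is produced (a fixpoint)."""
    known = set(facts)
    while True:
        # Rule 1: every parent is an ancestor.
        new = {("ancestor", x, y) for (p, x, y) in known if p == "parent"}
        # Rule 2: a parent of an ancestor is an ancestor.
        new |= {
            ("ancestor", x, z)
            for (p, x, y) in known if p == "parent"
            for (q, y2, z) in known if q == "ancestor" and y2 == y
        }
        if new <= known:
            return known
        known |= new

closure = derive_ancestors(facts)
```

The fact `ancestor(ann, cid)` is never stored; it is concluded from the two stored `parent` facts and the rules.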
50 Next lecture
Motivation and application examples
The process of knowledge discovery
Origins and context
Major issues in knowledge discovery
A short overview of key techniques
Deductive databases
51 References / background reading; acknowledgements
- Knowledge discovery is now an established area with some excellent general textbooks. I recommend the following as examples of the 3 main perspectives:
  - a databases / data warehouses perspective: Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann. http://www.cs.sfu.ca/%7Ehan/dmbook
  - a machine learning perspective: Witten, I.H. & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
  - a statistics perspective: Hand, D.J., Mannila, H. & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2
- pp. 9, 15, 18, 20, 41-44 were taken from
  - Tzacheva, A.A. (2006). SIMS 422. Knowledge Inference Systems & Applications. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewI.ppt
- pp. 45-48 were taken from
  - Tzacheva, A.A. (2006). Knowledge Discovery and Data Mining. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewII.ppt
- pp. 13, 14, 22, 23, 25-39 were taken from
  - Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques. Chapter 1: Introduction. http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt
52 Picture credits; CRISP-DM reference
- p. 3: http://www.siu-weeds.com/publications/Wheat_field.jpg
- p. 4: http://www.dkimages.com/discover/previews/889/30039025.JPG
- p. 5: http://www.viebahnfinearts.com/website/Pages/Photos/Furniture/Mirror201005.jpg
- p. 6: http://charles.robinsontwins.org/twinsdays_96/john/smiley.jpg
- p. 16: http://www.palagems.com/Images/ceylon_mining.jpg, http://www.crisp-dm.org/Images/187343_CRISPart.jpg
- The CRISP-DM phase model can be found at http://www.crisp-dm.org