Knowledge discovery

About This Presentation

Title:

Knowledge discovery

Description:

Title: Integration of Deduction and Induction for Mining Supermarket Sales Data Author: Dino Pedreschi Last modified by: Fosca Giannotti Created Date – PowerPoint PPT presentation

Number of Views:268

Avg rating:3.0/5.0

Slides: 64

Provided by: DinoPed9

Category:

more less

Transcript and Presenter's Notes

Title: Knowledge discovery

1
Knowledge discovery data mining Towards KD
Support Environments

Fosca Giannotti and
Dino Pedreschi
Pisa KDD Lab
CNUCE-CNR Univ. Pisa
http//www-kdd.di.unipi.it/

A tutorial _at_ EDBT2000
2
Module outline

Data analysis and KD Support Environments
Data mining technology trends
from tools
to suites
to solutions
Towards data mining query languages
DATASIFT a logic-based KDSE
Future research challenges

3
Vertical applications

We outlined three classes of vertical data
analysis applications that can be tackled using
KDD DM techniques
Fraud detection
Market basket analysis
Customer segmentation

4
Why are these applications challenging?

Require manipulation and reasoning over knowledge
and data at different abstraction levels
conceptual
semantic integration of domain knowledge, expert
(business) rules and extracted knowledge
semantic integration of different analysis
paradigms
logical/physical
interoperability with external components
DBMSs, data mining tools, desktop tools
querying/mining optimization loose vs. tight
coupling between query language and specialized
mining tools

5
Why are these applications challenging?

The associated KDD processneeds to be carefully
specified, tuned and controlled

6
Why are these applications challenging?

Still not properly supported by available KDD
technology
what is offered horizontal, customizable
toolkits/suites of data mining primitives
what is needed KD support environments for
vertical applications

7
Datamining vs. traditional Sw development process

Traditional
Focus on knowledge transfer, design and coding
30 - analysis and design
70 - program design, coding and testing
Prototyping - expensive
Development process has few loops
Maintenance requires human analysis

Data mining
Focus on data selection, representation and
search
70 - data preparation
30 - model generation and testing
Prototyping - cheap
Development process is inherently iterative
Maintenance requires re-learning model

8
From R. Agrawals invited lecture _at_ KDD99
Chasm
Early Market
Mainstream Market
The greatest peril in the development of a
high-tech market lies in making the transition
from an early market dominated by a few
visionaries to a mainstream market dominated by
pragmatists.
9
Is data mining in the chasm?

Perceived to be sophisticated technology, usable
only by specialists
Long, expensive projects
Stand-alone, loosely-coupled with data
infrastructures
Difficult to infuse into existing
mission-critical applications

10
Module outline

Data analysis and KD Support Environments
Data mining technology trends
from tools
to suites
to solutions
Towards data mining query languages
DATASIFT a logic-based KDSE
Future research challenges

11
Generation 1 data mining tools

1980 first generation of DM systems
research-driven tools for single tasks, e.g.
build a decision tree - say C4.5
find clusters - say Autoclass (Cheeseman 88)
Difficult to use more than one tool on the same
data lots of data/metadata transformation
Intended user a specialist, technically
sophisticated.

12
Generation 2 data mining suites

1995 second generation of DM systems
toolkits for multiple tasks with support for data
preparation and interoperability with DBMS, e.g.
SPSS Clementine
IBM Intelligent Miner
SAS Enterprise Miner
SFU DBMiner
Intended user data analyst suites require
significant knowledge of statistics and databases

13
Growth of DM tools (source kdnuggets.com)

From G. Piatetsky-Shapiro. The data-mining
industry coming of age. IEEE Intelligent Systems,
Dec. 1999.

14
Generation 3 data mining solutions

Beginning end of 1990s
vertical data mining-based applications and
solutions oriented to solving one specific
business problem, e.g.
detecting credit card fraud
customer retention
Address entire KDD process, and push result into
a front-end application
Intended user business user the interfaces hid
the data mining complexity

15
Emerging short-term technology trends

Tighter interoperability by means of standards
which facilitate the integration of data mining
with other applications
KDD process, e.g. the Cross-Industry Standard
Process for Data Mining model (www.crisp-dm.org)
representation of mining models e.g., the PMML -
predictive modeling markup language (www.dmg.org)
DB interoperability the Microsoft OLE DB for
data mining interface

16
Approaches in data mining suites

Database-oriented approach
IBM Intelligent Miner
OLAP-based mining
DBMiner - Jiawei Hans group _at_ SFU
Machine learning
CART, ID3/C4.5/C5.0, Angoss Knowledge Studio
Statistical approaches
The SAS Institute Enterprise Miner.
Visualization approach
SGI MineSet, VisDB (Keim et al. 94).

17
Other approaches in data mining suites

Neural network approach
Cognos 4thoughts, NeuroRule (Lu et al.95).
Deductive DB integration
KnowlegeMiner (Shen et al.96)
Datasift (Pisa KDD Lab - see refs).
Rough sets, fuzzy sets
Datalogic/R, 49er
Multi-strategy mining
INLEN, KDW, Explora

18
SFU DBMiner OLAP-centric mining
Active Object Elements
Warehouse Workplace
Active Object
19
IBM Intelligent Miner DB-centric mining
Contents Container
Mining Base Container
Work Area
20
IBM IM architecture
21
Angoss Knowledge Studio ML-centric mining
Work Area
Project Outline
Additional Visualizations
22
KS project outline tool

(Limited) support to the KDD process

23
Support for data consolidation step

DBMiner
ODBC databases SQL SmartDrives
Single database multiple tables
Consolidation of heterogeneous sources
unsupported
Intelligent Miner
DB2 and text SQL without SmartDrives
Multiple databases
Consolidation of heterogeneous sources supported
Knowledge Studio
ODBC databases and text
Single table
Consolidation of heterogeneous sources
unsupported

24
Support for selection and preprocessing

DBMiner
SQL only
Intelligent Miner
SQL standard and advanced statistical
functionalities
Knowledge Studio
descriptive statistics

25
Support for data mining step

DBMiner
Association rules
Decision trees
Prediction
Intelligent Miner
Associations rules
Sequential patterns
Clustering
Classification
Prediction
Similar time series

Knowledge Studio
Decision trees
Clustering
Prediction

26
Support for interpretation and evaluation

Predefined interestingness measures
Emphasis on visualization
Limited export capability of analysis results
Gain charts for comparison of predictive models
(KS and IM)
Limited model combination capabilities (KS)

27
Module outline

Data analysis and KD Support Environments
Data mining technology trends
from tools
to suites
to solutions
Towards data mining query languages
DATASIFT a logic-based KDSE
Future research challenges

28
Data Mining Query Languages

A DMQL can provide the ability to support ad-hoc
and interactive data mining
Hope achieve the same effect that SQL had on
relational databases.
Various proposals
DMQL (Han et al 96)
mine operator (Meo et el 96)
M-SQL (Imielinski et al 99)
query flocks (Tsur et al 98)

29
MINE operator of (Meo et al 96)
30
References - DMQL

J. Han, Y. Fu, W. Wang, K. Koperski, and O. R.
Zaiane. DMQL A Data Mining Query Language for
Relational Databases. In Proc. 1996 SIGMOD'96
Workshop on Research Issues on Data Mining and
Knowledge Discovery (DMKD'96), pp. 27-33,
Montreal, Canada, June 1996.
R. Meo, G. Psaila, S. Ceri. A New SQL-like
Operator for Mining Association Rules. In Proc.
VLDB96, 1996 Int. Conf. Very Large Data Bases,
Bombay, India, pp. 122-133, Sept. 1996.
T. Imielinski and A. Virmani. MSQL a query
language for database mining. Data Mining and
Knowledge Discovery, 3373-408, 1999.
S. Tsur, J. Ulman, S. Abiteboul, C. Clifton, R.
Motwani, S. Nestorov. Query flocks a
generalization of association rule mining. In
Proc. 1998 ACM-SIGMOD, p. 1-12, 1998.

31
Module outline

Data analysis and KD Support Environments
Data mining technology trends
from tools
to suites
to solutions
Towards data mining query languages
DATASIFT a logic-based KDSE
Future research challenges

32
DATASIFT - towards a logic-based KDSE

DATASIFT is LDL (Logic Data Language, MCC
UCLA) extended with mining primitives (decision
trees association rules)
LDL syntax Prolog-like deductive rules
LDL semantics SQL extended with recursion (and
more)
Integration of deduction and induction
Employed to systematically develop the
methodology for MBA and audit planning
See Pisa KDD Lab references

33
Our position

A suitable integration of
deductive reasoning (logic database languages)
inductive reasoning (association rules decision
trees)
provides a viable solution to high-level problems
in knowledge-intensive data analysis applications

34
Our goal

Demonstrate how we support design and control of
the overall KDD process and the incorporation of
background knowledge
data preparation
knowledge extraction
post-processing and knowledge evaluation
business rules
autofocus datamining

35
With respect to other DMQLs

extending logic query languages yields extra
expressiveness, needed to bridge the gap between
data mining (e.g., association rule mining)
vertical applications (e.g., market basket
analysis)

36
Architecture - client agent

User interface
Access to business rules and visualization of
results through
web browser to control interaction
MS Excel objects (sheets and charts) to represent
output of analysis (association rules)

37
Architecture - server agent

A query engine (mediator)
record previous analyses
Metadata/meta knowledge
interaction with other components
LDL server
extended with external calls to DBMSs and to
Inductive modules
Apriori
classifiers (decision trees)
Coupling with DBMS using the Cache-mine approach
Performance comparable with SQL-based approaches
on same mining queries (Giannotti at el 2000)

38
Deductive rules in LDL
A small database of cash register
transactions basket(1,fish). basket(2,bread).
basket(3,bread). basket(1,bread). basket(2,milk).
basket(3,orange). basket(2,onions).
basket(3,milk). basket(2,fish).

E.g. select transactions involving milk
milk_basket(T,I) ? basket(T,I),basket(T,milk).
Querying ?- milk_basket(T,I)
milk_basket(2,bread). milk_basket(3,bread).
milk_ basket(2,milk). milk_basket(3,orange).
milk_ basket(2,onions). milk_basket(3,milk).
milk_ basket(2,fish).

39
Aggregates in LDL
A small database of cash register
transactions basket(1,fish). basket(2,bread).
basket(3,bread). basket(1,bread). basket(2,milk).
basket(3,orange). basket(2,onions).
basket(3,milk). basket(2,fish).

E.g. count occurrences of pairs of distinct
items in all transactions
pair(I1,I2,countltTgt)? basket(T,I1),basket(T,I2),I
1? I2.

aggregate

Querying ?- pair(fish,bread,N)
pair(fish,bread,2) (i.e., N2)
Aggregates are the logical interface between
deductive and inductive environment.

40
Association rules in LDL
basket(1,fish). basket(2,bread).
basket(3,bread). basket(1,bread). basket(2,milk).
basket(3,orange). basket(2,onions).
basket(3,milk). basket(2,fish).

E.g., compute one-to-one association rules with
at least 40 support
rules(patternslt0.4,0,I1,I2gt)?basket(T,I1),baske
t(T,I2).

patterns

is the aggregate interfacing the
computation of association rules
patternsltmin_supp, min_conf, trans_setgt

41
Association rules in LDL
basket(1,fish). basket(2,bread).
basket(3,bread). basket(1,bread). basket(2,milk).
basket(3,orange). basket(2,onions).
basket(3,milk). basket(2,fish).

Result of the query ?- rules(X,Y,S,C)
rules(milk,bread,0.66,1)
i.e. milk ? bread 0.66,1
rules(bread,milk,0.66,0.66)
rules(fish,bread,0.66,1)
rules(bread,fish,0.66,0.66)
Same status for data and induced rules

42
Reasoning on item hierarchies

Which rules survive/decay up/down the item
hierarchy?
rules_at_level(I,patternltS,C,Itemsetgt) ?
itemset_abstraction(I,Tid,Itemset).
preserved_rules(Left,Right)
?
rules_at_level(I,Left,Right,_,_),
rules_at_level(I1,Left,Right,_,_).

43
Business rules reasoning on promotions

Which rules are established by a promotion?
interval(before, -?, 3/7/1998).
interval(promotion, 3/8/1998, 3/30/1998).
interval(after, 3/31/1998, ?).
established_rules(Left, Right) ?
not rules_partition(before, Left, Right, _, _),
rules_partition(promotion, Left, Right, _, _),
rules_partition(after, Left, Right, _, _).

44
Business rules temporal reasoning

How does rule support change along time?

45
Decision tree construction in DATASIFT

construct training and test set using rules
training_set(P,Case_list) ? ...
test_tuple(ID,F1,...,F20,Rec,Act_rec,CAR)
? ...
construct classifier using external call to C5.0
tree_rules(Tree_name,P,PF,MC,BO,Rule_list) ?
training_set(P,Case_list),tree_induction(Case_li
st,PF,MC,BO,Rule_list).
parameters
pruning factor PF
misclassification costs MC
boosting BO

external call
induced classifier
46
Putting decision trees at work

prediction of target variable
prediction(Tree_name,ID,CAR,Predicted_CAR) ?
tree_rules(Tree_name, _ ,_ , _ ,
Rule_list),test_subject(ID, F1, , F20, _, _,
CAR),classify(Rule_list ,F1, , F20,
Predicted_CAR).
Model evaluation actual recovery of a classifier
(sum recovery of tuples classified as positive)
actual_recovery(Tree_name,sumltActual_Recoverygt)
?prediction(Tree_name, ID, _ ,
pos),test_subject(ID, F1, , F20,
_,Actual_Recovery, _).

aggregate
47
Combining decision trees

Model conjunction
tree_conjunction(T1,T2,ID,CAR,pos)
? prediction(T1, ID, CAR, pos), prediction(T2,
ID, CAR, pos).
tree_conjunction (T1, T2, ID, CAR, neg)
? test_subject(ID, F1, , F20, _, _, CAR),
tree_conjunction(T1, T2, ID, CAR, pos).
More interesting combinations readily
expressible
e.g. meta learning (Chan and Stolfo 93)

48
We proposed ...

a KDD methodology for audit planning
define an audit cost model
monitor training- and test-set construction
assess the quality of a classifier
tune classifier construction to specific policies
and its formalization in a prototype logic-based
KDSE, supporting
integration of deduction and induction
integration of domain and induced knowledge
separation of conceptual and implementation level

49
Module outline

Data analysis and KD Support Environments
Data mining technology trends
from tools
to suites
to solutions
Towards data mining query languages
DATASIFT a logic-based KDSE
Future research challenges

50
A data mining research agenda

Integration with data warehouse and relational DB
Scalable, parallel/distributed and incremental
mining
Data mining query language optimization
Multiple, integrated data mining methods
KDSE and methodological support for vertical
appl.
Interactive, exploratory data mining environments
Mining on other forms of data
spatio-temporal databases
text
multimedia
web

51
Scale up!

Scaling up existing algorithms (AI, ML, IR)
Association rules
Correlation rules
Causal relationship
Classification
Clustering
Bayesian networks

52
Background knowledge constraints

Incorporating background knowledge and
constraints into existing data mining techniques
Double benefit for DMQL semantics and
optimization!
traditional algorithms
Disproportionate computational cost for selective
users
Overwhelming volume of potentially useless
results
need user-controlled focus in mining process
Association rules containing certain items
Sequential patterns containing certain patterns
Classification?

53
Vertical applications of data mining

More success stories needed!
Current data mining systems lack a thick semantic
layer (similarly to the early relational database
systems)
Verticalized data mining systems, e.g.
Market analysis systems
Fraud detection systems
Automated mining and interactive mining how far
are they?

54
Autofocus data mining

policy options, business rules

selection of data mining function fine parameter
tuning of mining function
55
DBMS coupling

Tight-coupling with DBMS
Most data mining algorithms are based on flat
file data (i.e. loose-coupling with DBMS)
A set of standard data mining operators
(e.g. sampling operator)

56
Web mining why?

No standards on the web, enormous blob of
unstructured and heterogeneous info
Very dynamic
One new WWW server every 2 hours
5 million documents in 1995
320 million documents in 1998
Indices get obsolete very quickly
Better means needed for discovering resources and
extracting knowledge

57
Web mining challenges

Todays search engines are plagued by problems
the abundance problem 99 of info of no
interest to 99 of people!
limited coverage of the Web
limited query interface based on keyword-oriented
search
limited customization to individual users

58
Web mining

Web content mining
mining what Web search engines find
Web document classification (Chakrabarti et al
99)
warehousing a Meta-Web (Zaïane and Han 98)
intelligent query answering in Web search
Web usage mining
Web log mining find access patterns and trends
(Zaiane et al 98)
customized user tracking and adaptive sites
(Perkowitz et al 97)
Web structure mining
discover authoritative pages a page is important
if important pages point to it (Chakrabarti et
al 99, Kleinberg 98)

59
Warehousing a Meta-Web (Zaïane Han 98)

Meta-Web summarizes the contents and structure
of the Web, which evolves with the Web
Layer0 the Web itself
Layer1 the lowest layer of the Meta-Web
an entry a Web page summary, including class,
time, URL, contents, keywords, popularity,
weight, links, etc.
Layer2 and up summary/classification/clustering
Meta-Web is warehoused and incrementally updated
Querying and mining is performed on or assisted
by meta-Web
Is it feasible/sustainable? Is XML of any help?

60
Meta-Web from Jiawei Hans panel talk _at_ SIGMOD99
More Generalized Descriptions
Layern
...
Generalized Descriptions
Layer1
Layer0
61
Weblog mining

Web servers register a log entry for every single
access.
A huge number of accesses (hits) are registered
and collected in an ever-growing web log.
Why warehousing/mining web logs?
Enhance server performance by learning access
patterns of general or particular users (guess
what user will ask next and pre-cache!)
Improve system design of web applications
Identify potential prime advertisement locations
Greatest peril the privacy pitfall
See e.g. (Markoff 99) the rise of the Little
Brother.

62
Some web mining references

M. Perkowitz and O. Etzioni. Adaptive sites
Automatically learning from user access patterns.
In Proc. 6th Int. World Wide Web Conf., Santa
Clara, California, April 1997.
J. Pitkow. In search of reliable usage data on
the www. In Proc. 6th Int. World Wide Web Conf.,
Santa Clara, California, April 1997.
T. Sullivan. Reading reader reaction A proposal
for inferential analysis of web server log files.
In Proc. 3rd Conf. Human Factors the Web,
Denver, Colorado, June 1997.
O. R. Zaiane, M. Xin, and J. Han. Discovering Web
access patterns and trends by applying OLAP and
data mining technology on Web logs. In Proc.
Advances in Digital Libraries Conf. (ADL'98),
pages 19-29, Santa Barbara, CA, April 1998.
O. R. Zaiane, and J. Han. Resource and knowledge
discovery in global information systems a
preliminary design and experiment. In Proc.
KDD95, p.331-336, 1995.
O. R. Zaiane, and J. Han. WebML querying the
world-wide web for resources and knowledge. In
Proc. Int. Workshop on Web informtion and Data
management (WIDM98), p. 9-12, 1998.
S. Chakrabarti, B. E. Dom, S. R. Kumar, P.
Raghavan, et al. Mining the webs link structure.
COMPUTER, 3260-67, 1999.
S. Chakrabarti, B. E. Dom, P. Indik. Enhanced
hypertext classification using hyperlinks. In
Proc. 1998 ACM-SIGMOD, p. 307-318, 1999.
J. Kleinberg. Autohoritative sources in a
hyperlinked environment. In Proc. ACM-SIAM Symp.
on Discrete Algorithms, 1998.
J. Markoff. The Rise of Little Brother. Upside,
Apr. 1999 http//www.upside.com/texis/mvm/story?i
d36d4613c0

63
Pisa KDD Lab references

F. Giannotti and G. Manco. Making Knowledge
Extraction and Reasoning Closer. In Proc.
PAKDD'99, The Fourth Pacific-Asia Conference on
Knowledge Discovery and Data Mining, Kyoto, 2000.
F. Giannotti and G. Manco. Querying Inductive
Databases via Logic-Based User Defined
Aggregates. In Proc. PKDD'99, The Third Europ.
Conf. on Principles and Practice of Knowledge
Discovery in Databases. Prague, Sept. 1999.
F. Bonchi, F. Giannotti, G. Mainetto, D.
Pedreschi. Using Data Mining Techniques in Fiscal
Fraud Detection. In Proc. DaWak'99, First Int.
Conf. on Data Warehousing and Knowledge
Discovery. Florence, Italy, Sept. 1999.
F. Bonchi , F. Giannotti, G. Mainetto, D.
Pedreschi. A Classification-based Methodology for
Planning Audit Strategies in Fraud Detection. In
Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge
Discovery Data Mining, San Diego (CA), August
1999.
F. Giannotti, G. Manco, D. Pedreschi and F.
Turini. Experiences with a logic-based knowledge
discovery support environment. In Proc. 1999 ACM
SIGMOD Workshop on Research Issues in Data Mining
and Knowledge Discovery (SIGMOD'99 DMKD).
Philadelphia, May 1999.
F. Giannotti, M. Nanni, G. Manco, D. Pedreschi
and F. Turini. Integration of Deduction and
Induction for Mining Supermarket Sales Data. In
Proc. PADD'99, Practical Application of Data
Discovery, Int. Conference, London, April 1999.
F. Giannotti, G. Manco, M. Nanni, D. Pedreschi.
Nondeterministic, Nonmonotonic Logic Databases.
IEEE Trans. on Knowledge and Data Engineering.
2000.
F. Giannotti, M. Nanni, G. Manco, D. Pedreschi
and F. Turini. Using deduction for intelligent
data analysis. Submitted, 2000.
http//www-kdd.di.unipi.it/
P. Becuzzi, M. Coppola, S. Ruggieri and M.
Vanneschi. Parallelisation of C4.5 as a
particular divide and conquer computation.
Proc.3rd Workshop on High Performance Data
Mining, Springer-Verlag LNCS, 2000.

Write a Comment

User Comments (0)

About PowerShow.com

Knowledge discovery - PowerPoint PPT Presentation

Knowledge discovery

Title: Integration of Deduction and Induction for Mining Supermarket Sales Data Author: Dino Pedreschi Last modified by: Fosca Giannotti Created Date – PowerPoint PPT presentation