Title: Knowledge discovery
1Knowledge discovery data mining Towards KD
Support Environments
- Fosca Giannotti and
- Dino Pedreschi
- Pisa KDD Lab
- CNUCE-CNR Univ. Pisa
- http//www-kdd.di.unipi.it/
A tutorial _at_ EDBT2000
2Module outline
- Data analysis and KD Support Environments
- Data mining technology trends
- from tools
- to suites
- to solutions
- Towards data mining query languages
- DATASIFT a logic-based KDSE
- Future research challenges
3Vertical applications
- We outlined three classes of vertical data
analysis applications that can be tackled using
KDD DM techniques - Fraud detection
- Market basket analysis
- Customer segmentation
4Why are these applications challenging?
- Require manipulation and reasoning over knowledge
and data at different abstraction levels - conceptual
- semantic integration of domain knowledge, expert
(business) rules and extracted knowledge - semantic integration of different analysis
paradigms - logical/physical
- interoperability with external components
DBMSs, data mining tools, desktop tools - querying/mining optimization loose vs. tight
coupling between query language and specialized
mining tools
5Why are these applications challenging?
- The associated KDD processneeds to be carefully
specified, tuned and controlled
6Why are these applications challenging?
- Still not properly supported by available KDD
technology - what is offered horizontal, customizable
toolkits/suites of data mining primitives - what is needed KD support environments for
vertical applications
7Datamining vs. traditional Sw development process
- Traditional
- Focus on knowledge transfer, design and coding
- 30 - analysis and design
- 70 - program design, coding and testing
- Prototyping - expensive
- Development process has few loops
- Maintenance requires human analysis
- Data mining
- Focus on data selection, representation and
search - 70 - data preparation
- 30 - model generation and testing
- Prototyping - cheap
- Development process is inherently iterative
- Maintenance requires re-learning model
8From R. Agrawals invited lecture _at_ KDD99
Chasm
Early Market
Mainstream Market
The greatest peril in the development of a
high-tech market lies in making the transition
from an early market dominated by a few
visionaries to a mainstream market dominated by
pragmatists.
9Is data mining in the chasm?
- Perceived to be sophisticated technology, usable
only by specialists - Long, expensive projects
- Stand-alone, loosely-coupled with data
infrastructures - Difficult to infuse into existing
mission-critical applications
10Module outline
- Data analysis and KD Support Environments
- Data mining technology trends
- from tools
- to suites
- to solutions
- Towards data mining query languages
- DATASIFT a logic-based KDSE
- Future research challenges
11Generation 1 data mining tools
- 1980 first generation of DM systems
- research-driven tools for single tasks, e.g.
- build a decision tree - say C4.5
- find clusters - say Autoclass (Cheeseman 88)
-
- Difficult to use more than one tool on the same
data lots of data/metadata transformation - Intended user a specialist, technically
sophisticated.
12Generation 2 data mining suites
- 1995 second generation of DM systems
- toolkits for multiple tasks with support for data
preparation and interoperability with DBMS, e.g. - SPSS Clementine
- IBM Intelligent Miner
- SAS Enterprise Miner
- SFU DBMiner
- Intended user data analyst suites require
significant knowledge of statistics and databases
13Growth of DM tools (source kdnuggets.com)
- From G. Piatetsky-Shapiro. The data-mining
industry coming of age. IEEE Intelligent Systems,
Dec. 1999.
14Generation 3 data mining solutions
- Beginning end of 1990s
- vertical data mining-based applications and
solutions oriented to solving one specific
business problem, e.g. - detecting credit card fraud
- customer retention
-
- Address entire KDD process, and push result into
a front-end application - Intended user business user the interfaces hid
the data mining complexity
15Emerging short-term technology trends
- Tighter interoperability by means of standards
which facilitate the integration of data mining
with other applications - KDD process, e.g. the Cross-Industry Standard
Process for Data Mining model (www.crisp-dm.org) - representation of mining models e.g., the PMML -
predictive modeling markup language (www.dmg.org) - DB interoperability the Microsoft OLE DB for
data mining interface
16Approaches in data mining suites
- Database-oriented approach
- IBM Intelligent Miner
- OLAP-based mining
- DBMiner - Jiawei Hans group _at_ SFU
- Machine learning
- CART, ID3/C4.5/C5.0, Angoss Knowledge Studio
- Statistical approaches
- The SAS Institute Enterprise Miner.
- Visualization approach
- SGI MineSet, VisDB (Keim et al. 94).
17Other approaches in data mining suites
- Neural network approach
- Cognos 4thoughts, NeuroRule (Lu et al.95).
- Deductive DB integration
- KnowlegeMiner (Shen et al.96)
- Datasift (Pisa KDD Lab - see refs).
- Rough sets, fuzzy sets
- Datalogic/R, 49er
- Multi-strategy mining
- INLEN, KDW, Explora
18SFU DBMiner OLAP-centric mining
Active Object Elements
Warehouse Workplace
Active Object
19IBM Intelligent Miner DB-centric mining
Contents Container
Mining Base Container
Work Area
20IBM IM architecture
21Angoss Knowledge Studio ML-centric mining
Work Area
Project Outline
Additional Visualizations
22KS project outline tool
- (Limited) support to the KDD process
23Support for data consolidation step
- DBMiner
- ODBC databases SQL SmartDrives
- Single database multiple tables
- Consolidation of heterogeneous sources
unsupported - Intelligent Miner
- DB2 and text SQL without SmartDrives
- Multiple databases
- Consolidation of heterogeneous sources supported
- Knowledge Studio
- ODBC databases and text
- Single table
- Consolidation of heterogeneous sources
unsupported
24Support for selection and preprocessing
- DBMiner
- SQL only
- Intelligent Miner
- SQL standard and advanced statistical
functionalities - Knowledge Studio
- descriptive statistics
25Support for data mining step
- DBMiner
- Association rules
- Decision trees
- Prediction
- Intelligent Miner
- Associations rules
- Sequential patterns
- Clustering
- Classification
- Prediction
- Similar time series
- Knowledge Studio
- Decision trees
- Clustering
- Prediction
26Support for interpretation and evaluation
- Predefined interestingness measures
- Emphasis on visualization
- Limited export capability of analysis results
- Gain charts for comparison of predictive models
(KS and IM) - Limited model combination capabilities (KS)
27Module outline
- Data analysis and KD Support Environments
- Data mining technology trends
- from tools
- to suites
- to solutions
- Towards data mining query languages
- DATASIFT a logic-based KDSE
- Future research challenges
28Data Mining Query Languages
- A DMQL can provide the ability to support ad-hoc
and interactive data mining - Hope achieve the same effect that SQL had on
relational databases. - Various proposals
- DMQL (Han et al 96)
- mine operator (Meo et el 96)
- M-SQL (Imielinski et al 99)
- query flocks (Tsur et al 98)
29MINE operator of (Meo et al 96)
30References - DMQL
- J. Han, Y. Fu, W. Wang, K. Koperski, and O. R.
Zaiane. DMQL A Data Mining Query Language for
Relational Databases. In Proc. 1996 SIGMOD'96
Workshop on Research Issues on Data Mining and
Knowledge Discovery (DMKD'96), pp. 27-33,
Montreal, Canada, June 1996. - R. Meo, G. Psaila, S. Ceri. A New SQL-like
Operator for Mining Association Rules. In Proc.
VLDB96, 1996 Int. Conf. Very Large Data Bases,
Bombay, India, pp. 122-133, Sept. 1996. - T. Imielinski and A. Virmani. MSQL a query
language for database mining. Data Mining and
Knowledge Discovery, 3373-408, 1999. - S. Tsur, J. Ulman, S. Abiteboul, C. Clifton, R.
Motwani, S. Nestorov. Query flocks a
generalization of association rule mining. In
Proc. 1998 ACM-SIGMOD, p. 1-12, 1998.
31Module outline
- Data analysis and KD Support Environments
- Data mining technology trends
- from tools
- to suites
- to solutions
- Towards data mining query languages
- DATASIFT a logic-based KDSE
- Future research challenges
32DATASIFT - towards a logic-based KDSE
- DATASIFT is LDL (Logic Data Language, MCC
UCLA) extended with mining primitives (decision
trees association rules) - LDL syntax Prolog-like deductive rules
- LDL semantics SQL extended with recursion (and
more) - Integration of deduction and induction
- Employed to systematically develop the
methodology for MBA and audit planning - See Pisa KDD Lab references
33Our position
- A suitable integration of
- deductive reasoning (logic database languages)
- inductive reasoning (association rules decision
trees) - provides a viable solution to high-level problems
in knowledge-intensive data analysis applications
34Our goal
- Demonstrate how we support design and control of
the overall KDD process and the incorporation of
background knowledge - data preparation
- knowledge extraction
- post-processing and knowledge evaluation
- business rules
- autofocus datamining
35With respect to other DMQLs
- extending logic query languages yields extra
expressiveness, needed to bridge the gap between - data mining (e.g., association rule mining)
- vertical applications (e.g., market basket
analysis)
36Architecture - client agent
- User interface
- Access to business rules and visualization of
results through - web browser to control interaction
- MS Excel objects (sheets and charts) to represent
output of analysis (association rules)
37Architecture - server agent
- A query engine (mediator)
- record previous analyses
- Metadata/meta knowledge
- interaction with other components
- LDL server
- extended with external calls to DBMSs and to
- Inductive modules
- Apriori
- classifiers (decision trees)
- Coupling with DBMS using the Cache-mine approach
- Performance comparable with SQL-based approaches
on same mining queries (Giannotti at el 2000)
38Deductive rules in LDL
A small database of cash register
transactions basket(1,fish). basket(2,bread).
basket(3,bread). basket(1,bread). basket(2,milk).
basket(3,orange). basket(2,onions).
basket(3,milk). basket(2,fish).
- E.g. select transactions involving milk
- milk_basket(T,I) ? basket(T,I),basket(T,milk).
- Querying ?- milk_basket(T,I)
- milk_basket(2,bread). milk_basket(3,bread).
- milk_ basket(2,milk). milk_basket(3,orange).
- milk_ basket(2,onions). milk_basket(3,milk).
- milk_ basket(2,fish).
39Aggregates in LDL
A small database of cash register
transactions basket(1,fish). basket(2,bread).
basket(3,bread). basket(1,bread). basket(2,milk).
basket(3,orange). basket(2,onions).
basket(3,milk). basket(2,fish).
- E.g. count occurrences of pairs of distinct
items in all transactions - pair(I1,I2,countltTgt)? basket(T,I1),basket(T,I2),I
1? I2.
aggregate
- Querying ?- pair(fish,bread,N)
- pair(fish,bread,2) (i.e., N2)
- Aggregates are the logical interface between
deductive and inductive environment.
40Association rules in LDL
basket(1,fish). basket(2,bread).
basket(3,bread). basket(1,bread). basket(2,milk).
basket(3,orange). basket(2,onions).
basket(3,milk). basket(2,fish).
- E.g., compute one-to-one association rules with
at least 40 support - rules(patternslt0.4,0,I1,I2gt)?basket(T,I1),baske
t(T,I2).
patterns
- is the aggregate interfacing the
computation of association rules - patternsltmin_supp, min_conf, trans_setgt
41Association rules in LDL
basket(1,fish). basket(2,bread).
basket(3,bread). basket(1,bread). basket(2,milk).
basket(3,orange). basket(2,onions).
basket(3,milk). basket(2,fish).
- Result of the query ?- rules(X,Y,S,C)
- rules(milk,bread,0.66,1)
- i.e. milk ? bread 0.66,1
- rules(bread,milk,0.66,0.66)
- rules(fish,bread,0.66,1)
- rules(bread,fish,0.66,0.66)
- Same status for data and induced rules
42Reasoning on item hierarchies
- Which rules survive/decay up/down the item
hierarchy? - rules_at_level(I,patternltS,C,Itemsetgt) ?
- itemset_abstraction(I,Tid,Itemset).
-
- preserved_rules(Left,Right)
- ?
- rules_at_level(I,Left,Right,_,_),
- rules_at_level(I1,Left,Right,_,_).
43Business rules reasoning on promotions
- Which rules are established by a promotion?
- interval(before, -?, 3/7/1998).
- interval(promotion, 3/8/1998, 3/30/1998).
- interval(after, 3/31/1998, ?).
- established_rules(Left, Right) ?
- not rules_partition(before, Left, Right, _, _),
- rules_partition(promotion, Left, Right, _, _),
- rules_partition(after, Left, Right, _, _).
44Business rules temporal reasoning
- How does rule support change along time?
45Decision tree construction in DATASIFT
- construct training and test set using rules
- training_set(P,Case_list) ? ...
- test_tuple(ID,F1,...,F20,Rec,Act_rec,CAR)
- ? ...
- construct classifier using external call to C5.0
- tree_rules(Tree_name,P,PF,MC,BO,Rule_list) ?
training_set(P,Case_list),tree_induction(Case_li
st,PF,MC,BO,Rule_list). - parameters
- pruning factor PF
- misclassification costs MC
- boosting BO
external call
induced classifier
46Putting decision trees at work
- prediction of target variable
- prediction(Tree_name,ID,CAR,Predicted_CAR) ?
tree_rules(Tree_name, _ ,_ , _ ,
Rule_list),test_subject(ID, F1, , F20, _, _,
CAR),classify(Rule_list ,F1, , F20,
Predicted_CAR). - Model evaluation actual recovery of a classifier
(sum recovery of tuples classified as positive) - actual_recovery(Tree_name,sumltActual_Recoverygt)
?prediction(Tree_name, ID, _ ,
pos),test_subject(ID, F1, , F20,
_,Actual_Recovery, _).
aggregate
47Combining decision trees
- Model conjunction
- tree_conjunction(T1,T2,ID,CAR,pos)
? prediction(T1, ID, CAR, pos), prediction(T2,
ID, CAR, pos). - tree_conjunction (T1, T2, ID, CAR, neg)
? test_subject(ID, F1, , F20, _, _, CAR),
tree_conjunction(T1, T2, ID, CAR, pos). - More interesting combinations readily
expressible - e.g. meta learning (Chan and Stolfo 93)
48We proposed ...
- a KDD methodology for audit planning
- define an audit cost model
- monitor training- and test-set construction
- assess the quality of a classifier
- tune classifier construction to specific policies
- and its formalization in a prototype logic-based
KDSE, supporting - integration of deduction and induction
- integration of domain and induced knowledge
- separation of conceptual and implementation level
49Module outline
- Data analysis and KD Support Environments
- Data mining technology trends
- from tools
- to suites
- to solutions
- Towards data mining query languages
- DATASIFT a logic-based KDSE
- Future research challenges
50A data mining research agenda
- Integration with data warehouse and relational DB
- Scalable, parallel/distributed and incremental
mining - Data mining query language optimization
- Multiple, integrated data mining methods
- KDSE and methodological support for vertical
appl. - Interactive, exploratory data mining environments
- Mining on other forms of data
- spatio-temporal databases
- text
- multimedia
- web
51Scale up!
- Scaling up existing algorithms (AI, ML, IR)
- Association rules
- Correlation rules
- Causal relationship
- Classification
- Clustering
- Bayesian networks
52Background knowledge constraints
- Incorporating background knowledge and
constraints into existing data mining techniques - Double benefit for DMQL semantics and
optimization! - traditional algorithms
- Disproportionate computational cost for selective
users - Overwhelming volume of potentially useless
results - need user-controlled focus in mining process
- Association rules containing certain items
- Sequential patterns containing certain patterns
- Classification?
53Vertical applications of data mining
- More success stories needed!
- Current data mining systems lack a thick semantic
layer (similarly to the early relational database
systems) - Verticalized data mining systems, e.g.
- Market analysis systems
- Fraud detection systems
- Automated mining and interactive mining how far
are they?
54Autofocus data mining
- policy options, business rules
selection of data mining function fine parameter
tuning of mining function
55DBMS coupling
- Tight-coupling with DBMS
- Most data mining algorithms are based on flat
file data (i.e. loose-coupling with DBMS) - A set of standard data mining operators
- (e.g. sampling operator)
56Web mining why?
- No standards on the web, enormous blob of
unstructured and heterogeneous info - Very dynamic
- One new WWW server every 2 hours
- 5 million documents in 1995
- 320 million documents in 1998
- Indices get obsolete very quickly
- Better means needed for discovering resources and
extracting knowledge
57Web mining challenges
- Todays search engines are plagued by problems
- the abundance problem 99 of info of no
interest to 99 of people! - limited coverage of the Web
- limited query interface based on keyword-oriented
search - limited customization to individual users
58Web mining
- Web content mining
- mining what Web search engines find
- Web document classification (Chakrabarti et al
99) - warehousing a Meta-Web (Zaïane and Han 98)
- intelligent query answering in Web search
- Web usage mining
- Web log mining find access patterns and trends
(Zaiane et al 98) - customized user tracking and adaptive sites
(Perkowitz et al 97) - Web structure mining
- discover authoritative pages a page is important
if important pages point to it (Chakrabarti et
al 99, Kleinberg 98)
59Warehousing a Meta-Web (Zaïane Han 98)
- Meta-Web summarizes the contents and structure
of the Web, which evolves with the Web - Layer0 the Web itself
- Layer1 the lowest layer of the Meta-Web
- an entry a Web page summary, including class,
time, URL, contents, keywords, popularity,
weight, links, etc. - Layer2 and up summary/classification/clustering
- Meta-Web is warehoused and incrementally updated
- Querying and mining is performed on or assisted
by meta-Web - Is it feasible/sustainable? Is XML of any help?
60Meta-Web from Jiawei Hans panel talk _at_ SIGMOD99
More Generalized Descriptions
Layern
...
Generalized Descriptions
Layer1
Layer0
61Weblog mining
- Web servers register a log entry for every single
access. - A huge number of accesses (hits) are registered
and collected in an ever-growing web log. - Why warehousing/mining web logs?
- Enhance server performance by learning access
patterns of general or particular users (guess
what user will ask next and pre-cache!) - Improve system design of web applications
- Identify potential prime advertisement locations
- Greatest peril the privacy pitfall
- See e.g. (Markoff 99) the rise of the Little
Brother.
62Some web mining references
- M. Perkowitz and O. Etzioni. Adaptive sites
Automatically learning from user access patterns.
In Proc. 6th Int. World Wide Web Conf., Santa
Clara, California, April 1997. - J. Pitkow. In search of reliable usage data on
the www. In Proc. 6th Int. World Wide Web Conf.,
Santa Clara, California, April 1997. - T. Sullivan. Reading reader reaction A proposal
for inferential analysis of web server log files.
In Proc. 3rd Conf. Human Factors the Web,
Denver, Colorado, June 1997. - O. R. Zaiane, M. Xin, and J. Han. Discovering Web
access patterns and trends by applying OLAP and
data mining technology on Web logs. In Proc.
Advances in Digital Libraries Conf. (ADL'98),
pages 19-29, Santa Barbara, CA, April 1998. - O. R. Zaiane, and J. Han. Resource and knowledge
discovery in global information systems a
preliminary design and experiment. In Proc.
KDD95, p.331-336, 1995. - O. R. Zaiane, and J. Han. WebML querying the
world-wide web for resources and knowledge. In
Proc. Int. Workshop on Web informtion and Data
management (WIDM98), p. 9-12, 1998. - S. Chakrabarti, B. E. Dom, S. R. Kumar, P.
Raghavan, et al. Mining the webs link structure.
COMPUTER, 3260-67, 1999. - S. Chakrabarti, B. E. Dom, P. Indik. Enhanced
hypertext classification using hyperlinks. In
Proc. 1998 ACM-SIGMOD, p. 307-318, 1999. - J. Kleinberg. Autohoritative sources in a
hyperlinked environment. In Proc. ACM-SIAM Symp.
on Discrete Algorithms, 1998. - J. Markoff. The Rise of Little Brother. Upside,
Apr. 1999 http//www.upside.com/texis/mvm/story?i
d36d4613c0
63Pisa KDD Lab references
- F. Giannotti and G. Manco. Making Knowledge
Extraction and Reasoning Closer. In Proc.
PAKDD'99, The Fourth Pacific-Asia Conference on
Knowledge Discovery and Data Mining, Kyoto, 2000. - F. Giannotti and G. Manco. Querying Inductive
Databases via Logic-Based User Defined
Aggregates. In Proc. PKDD'99, The Third Europ.
Conf. on Principles and Practice of Knowledge
Discovery in Databases. Prague, Sept. 1999. - F. Bonchi, F. Giannotti, G. Mainetto, D.
Pedreschi. Using Data Mining Techniques in Fiscal
Fraud Detection. In Proc. DaWak'99, First Int.
Conf. on Data Warehousing and Knowledge
Discovery. Florence, Italy, Sept. 1999. - F. Bonchi , F. Giannotti, G. Mainetto, D.
Pedreschi. A Classification-based Methodology for
Planning Audit Strategies in Fraud Detection. In
Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge
Discovery Data Mining, San Diego (CA), August
1999. - F. Giannotti, G. Manco, D. Pedreschi and F.
Turini. Experiences with a logic-based knowledge
discovery support environment. In Proc. 1999 ACM
SIGMOD Workshop on Research Issues in Data Mining
and Knowledge Discovery (SIGMOD'99 DMKD).
Philadelphia, May 1999. - F. Giannotti, M. Nanni, G. Manco, D. Pedreschi
and F. Turini. Integration of Deduction and
Induction for Mining Supermarket Sales Data. In
Proc. PADD'99, Practical Application of Data
Discovery, Int. Conference, London, April 1999. - F. Giannotti, G. Manco, M. Nanni, D. Pedreschi.
Nondeterministic, Nonmonotonic Logic Databases.
IEEE Trans. on Knowledge and Data Engineering.
2000. - F. Giannotti, M. Nanni, G. Manco, D. Pedreschi
and F. Turini. Using deduction for intelligent
data analysis. Submitted, 2000.
http//www-kdd.di.unipi.it/ - P. Becuzzi, M. Coppola, S. Ruggieri and M.
Vanneschi. Parallelisation of C4.5 as a
particular divide and conquer computation.
Proc.3rd Workshop on High Performance Data
Mining, Springer-Verlag LNCS, 2000.