Knowledge discovery - PowerPoint PPT Presentation

Title: Integration of Deduction and Induction for Mining Supermarket Sales Data. Author: Dino Pedreschi. Last modified by: Fosca Giannotti.

1
Knowledge discovery & data mining: Towards KD
Support Environments
  • Fosca Giannotti and
  • Dino Pedreschi
  • Pisa KDD Lab
  • CNUCE-CNR & Univ. Pisa
  • http://www-kdd.di.unipi.it/

A tutorial @ EDBT2000
2
Module outline
  • Data analysis and KD Support Environments
  • Data mining technology trends
    • from tools
    • to suites
    • to solutions
  • Towards data mining query languages
  • DATASIFT: a logic-based KDSE
  • Future research challenges

3
Vertical applications
  • We outlined three classes of vertical data
    analysis applications that can be tackled using
    KDD & DM techniques:
    • Fraud detection
    • Market basket analysis
    • Customer segmentation

4
Why are these applications challenging?
  • Require manipulation and reasoning over knowledge
    and data at different abstraction levels
  • conceptual
    • semantic integration of domain knowledge, expert
      (business) rules and extracted knowledge
    • semantic integration of different analysis
      paradigms
  • logical/physical
    • interoperability with external components:
      DBMSs, data mining tools, desktop tools
    • querying/mining optimization: loose vs. tight
      coupling between query language and specialized
      mining tools

5
Why are these applications challenging?
  • The associated KDD process needs to be carefully
    specified, tuned and controlled

6
Why are these applications challenging?
  • Still not properly supported by available KDD
    technology
  • what is offered: horizontal, customizable
    toolkits/suites of data mining primitives
  • what is needed: KD support environments for
    vertical applications

7
Data mining vs. traditional SW development process
  • Traditional
    • Focus on knowledge transfer, design and coding
    • 30% - analysis and design
    • 70% - program design, coding and testing
    • Prototyping - expensive
    • Development process has few loops
    • Maintenance requires human analysis
  • Data mining
    • Focus on data selection, representation and
      search
    • 70% - data preparation
    • 30% - model generation and testing
    • Prototyping - cheap
    • Development process is inherently iterative
    • Maintenance requires re-learning model

8
From R. Agrawal's invited lecture @ KDD99
[Figure: the Chasm between the Early Market and the Mainstream Market]
The greatest peril in the development of a
high-tech market lies in making the transition
from an early market dominated by a few
visionaries to a mainstream market dominated by
pragmatists.
9
Is data mining in the chasm?
  • Perceived to be sophisticated technology, usable
    only by specialists
  • Long, expensive projects
  • Stand-alone, loosely-coupled with data
    infrastructures
  • Difficult to infuse into existing
    mission-critical applications

10
Module outline
  • Data analysis and KD Support Environments
  • Data mining technology trends
    • from tools
    • to suites
    • to solutions
  • Towards data mining query languages
  • DATASIFT: a logic-based KDSE
  • Future research challenges

11
Generation 1: data mining tools
  • 1980: first generation of DM systems
  • research-driven tools for single tasks, e.g.
    • build a decision tree - say C4.5
    • find clusters - say Autoclass (Cheeseman 88)
  • Difficult to use more than one tool on the same
    data: lots of data/metadata transformation
  • Intended user: a specialist, technically
    sophisticated.

12
Generation 2: data mining suites
  • 1995: second generation of DM systems
  • toolkits for multiple tasks with support for data
    preparation and interoperability with DBMS, e.g.
    • SPSS Clementine
    • IBM Intelligent Miner
    • SAS Enterprise Miner
    • SFU DBMiner
  • Intended user: data analyst; suites require
    significant knowledge of statistics and databases

13
Growth of DM tools (source: kdnuggets.com)
  • From G. Piatetsky-Shapiro. The data-mining
    industry coming of age. IEEE Intelligent Systems,
    Dec. 1999.

14
Generation 3: data mining solutions
  • Beginning: end of 1990s
  • vertical data mining-based applications and
    solutions oriented to solving one specific
    business problem, e.g.
    • detecting credit card fraud
    • customer retention
  • Address entire KDD process, and push result into
    a front-end application
  • Intended user: business user; the interfaces hide
    the data mining complexity

15
Emerging short-term technology trends
  • Tighter interoperability by means of standards
    which facilitate the integration of data mining
    with other applications
    • KDD process, e.g. the Cross-Industry Standard
      Process for Data Mining model (www.crisp-dm.org)
    • representation of mining models, e.g. PMML, the
      Predictive Model Markup Language (www.dmg.org)
    • DB interoperability: the Microsoft OLE DB for
      data mining interface

16
Approaches in data mining suites
  • Database-oriented approach
    • IBM Intelligent Miner
  • OLAP-based mining
    • DBMiner - Jiawei Han's group @ SFU
  • Machine learning
    • CART, ID3/C4.5/C5.0, Angoss Knowledge Studio
  • Statistical approaches
    • The SAS Institute Enterprise Miner.
  • Visualization approach
    • SGI MineSet, VisDB (Keim et al. 94).

17
Other approaches in data mining suites
  • Neural network approach
    • Cognos 4Thought, NeuroRule (Lu et al. 95).
  • Deductive DB integration
    • KnowledgeMiner (Shen et al. 96)
    • Datasift (Pisa KDD Lab - see refs).
  • Rough sets, fuzzy sets
    • Datalogic/R, 49er
  • Multi-strategy mining
    • INLEN, KDW, Explora

18
SFU DBMiner: OLAP-centric mining
[Screenshot; labeled areas: Active Object Elements, Warehouse Workplace, Active Object]
19
IBM Intelligent Miner: DB-centric mining
[Screenshot; labeled areas: Contents Container, Mining Base Container, Work Area]
20
IBM IM architecture
21
Angoss Knowledge Studio: ML-centric mining
[Screenshot; labeled areas: Work Area, Project Outline, Additional Visualizations]
22
KS project outline tool
  • (Limited) support to the KDD process

23
Support for data consolidation step
  • DBMiner
    • ODBC databases; SQL + SmartDrives
    • Single database, multiple tables
    • Consolidation of heterogeneous sources
      unsupported
  • Intelligent Miner
    • DB2 and text; SQL without SmartDrives
    • Multiple databases
    • Consolidation of heterogeneous sources supported
  • Knowledge Studio
    • ODBC databases and text
    • Single table
    • Consolidation of heterogeneous sources
      unsupported

24
Support for selection and preprocessing
  • DBMiner
    • SQL only
  • Intelligent Miner
    • SQL standard and advanced statistical
      functionalities
  • Knowledge Studio
    • descriptive statistics

25
Support for data mining step
  • DBMiner
    • Association rules
    • Decision trees
    • Prediction
  • Intelligent Miner
    • Association rules
    • Sequential patterns
    • Clustering
    • Classification
    • Prediction
    • Similar time series
  • Knowledge Studio
    • Decision trees
    • Clustering
    • Prediction

26
Support for interpretation and evaluation
  • Predefined interestingness measures
  • Emphasis on visualization
  • Limited export capability of analysis results
  • Gain charts for comparison of predictive models
    (KS and IM)
  • Limited model combination capabilities (KS)

27
Module outline
  • Data analysis and KD Support Environments
  • Data mining technology trends
    • from tools
    • to suites
    • to solutions
  • Towards data mining query languages
  • DATASIFT: a logic-based KDSE
  • Future research challenges

28
Data Mining Query Languages
  • A DMQL can provide the ability to support ad-hoc
    and interactive data mining
  • Hope: achieve the same effect that SQL had on
    relational databases.
  • Various proposals
    • DMQL (Han et al 96)
    • MINE operator (Meo et al 96)
    • M-SQL (Imielinski et al 99)
    • query flocks (Tsur et al 98)

29
MINE operator of (Meo et al 96)
30
References - DMQL
  • J. Han, Y. Fu, W. Wang, K. Koperski, and O. R.
    Zaiane. DMQL: A Data Mining Query Language for
    Relational Databases. In Proc. 1996 SIGMOD'96
    Workshop on Research Issues on Data Mining and
    Knowledge Discovery (DMKD'96), pp. 27-33,
    Montreal, Canada, June 1996.
  • R. Meo, G. Psaila, S. Ceri. A New SQL-like
    Operator for Mining Association Rules. In Proc.
    VLDB96, 1996 Int. Conf. Very Large Data Bases,
    Bombay, India, pp. 122-133, Sept. 1996.
  • T. Imielinski and A. Virmani. MSQL: a query
    language for database mining. Data Mining and
    Knowledge Discovery, 3:373-408, 1999.
  • S. Tsur, J. Ullman, S. Abiteboul, C. Clifton, R.
    Motwani, S. Nestorov. Query flocks: a
    generalization of association rule mining. In
    Proc. 1998 ACM-SIGMOD, pp. 1-12, 1998.

31
Module outline
  • Data analysis and KD Support Environments
  • Data mining technology trends
    • from tools
    • to suites
    • to solutions
  • Towards data mining query languages
  • DATASIFT: a logic-based KDSE
  • Future research challenges

32
DATASIFT - towards a logic-based KDSE
  • DATASIFT is LDL (Logic Data Language, MCC &
    UCLA) extended with mining primitives (decision
    trees & association rules)
  • LDL syntax: Prolog-like deductive rules
  • LDL semantics: SQL extended with recursion (and
    more)
  • Integration of deduction and induction
  • Employed to systematically develop the
    methodology for MBA and audit planning
  • See Pisa KDD Lab references

33
Our position
  • A suitable integration of
    • deductive reasoning (logic database languages)
    • inductive reasoning (association rules & decision
      trees)
  • provides a viable solution to high-level problems
    in knowledge-intensive data analysis applications

34
Our goal
  • Demonstrate how we support design and control of
    the overall KDD process and the incorporation of
    background knowledge
    • data preparation
    • knowledge extraction
    • post-processing and knowledge evaluation
    • business rules
    • autofocus data mining

35
With respect to other DMQLs
  • extending logic query languages yields extra
    expressiveness, needed to bridge the gap between
  • data mining (e.g., association rule mining)
  • vertical applications (e.g., market basket
    analysis)

36
Architecture - client agent
  • User interface
  • Access to business rules and visualization of
    results through
  • web browser to control interaction
  • MS Excel objects (sheets and charts) to represent
    output of analysis (association rules)

37
Architecture - server agent
  • A query engine (mediator)
  • record previous analyses
  • Metadata/meta knowledge
  • interaction with other components
  • LDL server
  • extended with external calls to DBMSs and to
  • Inductive modules
  • Apriori
  • classifiers (decision trees)
  • Coupling with DBMS using the Cache-mine approach
  • Performance comparable with SQL-based approaches
    on same mining queries (Giannotti et al 2000)

38
Deductive rules in LDL
A small database of cash register transactions:
basket(1,fish). basket(2,bread). basket(3,bread).
basket(1,bread). basket(2,milk). basket(3,orange).
basket(2,onions). basket(3,milk). basket(2,fish).
  • E.g. select transactions involving milk
  • milk_basket(T,I) ← basket(T,I), basket(T,milk).
  • Querying ?- milk_basket(T,I)
  • milk_basket(2,bread). milk_basket(3,bread).
  • milk_basket(2,milk). milk_basket(3,orange).
  • milk_basket(2,onions). milk_basket(3,milk).
  • milk_basket(2,fish).
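
The deductive rule above behaves like a self-join of the basket relation on the transaction id. A minimal Python sketch of that reading follows; the names BASKET and milk_basket mirror the slide, while the set-of-tuples representation is an assumption of this sketch and not how LDL evaluates the rule.

```python
# Facts from the slide: (transaction id, item) pairs.
BASKET = {
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
}


def milk_basket(basket):
    """Return every (T, I) fact whose transaction T also contains milk."""
    # Transactions that satisfy basket(T, milk).
    milk_tids = {tid for (tid, item) in basket if item == "milk"}
    # Join back on T to collect all items of those transactions.
    return {(tid, item) for (tid, item) in basket if tid in milk_tids}


result = milk_basket(BASKET)
for fact in sorted(result):
    print(fact)
```

Transaction 1 contains no milk, so none of its items appear in the result, which matches the seven answers listed on the slide.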

39
Aggregates in LDL
A small database of cash register transactions:
basket(1,fish). basket(2,bread). basket(3,bread).
basket(1,bread). basket(2,milk). basket(3,orange).
basket(2,onions). basket(3,milk). basket(2,fish).
  • E.g. count occurrences of pairs of distinct
    items in all transactions
  • pair(I1,I2,count<T>) ← basket(T,I1), basket(T,I2),
    I1 ≠ I2.
  • count<T> is the aggregate
  • Querying ?- pair(fish,bread,N)
  • pair(fish,bread,2) (i.e., N = 2)
  • Aggregates are the logical interface between
    deductive and inductive environment.
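
The count aggregate above groups by the pair of items and counts the transactions in which both occur. The following Python sketch reproduces that computation on the slide's data; pair_counts is a name chosen for this sketch, and the Counter-based grouping is an assumption, not the LDL evaluation strategy.

```python
from collections import Counter
from itertools import permutations

# Facts from the slide: (transaction id, item) pairs.
BASKET = [
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
]


def pair_counts(basket):
    """Count, for each ordered pair of distinct items, the transactions holding both."""
    items_by_tid = {}
    for tid, item in basket:
        items_by_tid.setdefault(tid, set()).add(item)
    counts = Counter()
    for items in items_by_tid.values():
        # permutations yields ordered pairs of distinct items, like I1 != I2.
        for i1, i2 in permutations(sorted(items), 2):
            counts[(i1, i2)] += 1
    return counts


counts = pair_counts(BASKET)
print(counts[("fish", "bread")])  # 2: transactions 1 and 2 contain both
```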

40
Association rules in LDL
basket(1,fish). basket(2,bread). basket(3,bread).
basket(1,bread). basket(2,milk). basket(3,orange).
basket(2,onions). basket(3,milk). basket(2,fish).
  • E.g., compute one-to-one association rules with
    at least 40% support
  • rules(patterns<0.4,0,I1,I2>) ← basket(T,I1),
    basket(T,I2).
  • patterns is the aggregate interfacing the
    computation of association rules
  • patterns<min_supp, min_conf, trans_set>

41
Association rules in LDL
basket(1,fish). basket(2,bread). basket(3,bread).
basket(1,bread). basket(2,milk). basket(3,orange).
basket(2,onions). basket(3,milk). basket(2,fish).
  • Result of the query ?- rules(X,Y,S,C)
  • rules(milk,bread,0.66,1)
  • i.e. milk ⇒ bread (support 0.66, confidence 1)
  • rules(bread,milk,0.66,0.66)
  • rules(fish,bread,0.66,1)
  • rules(bread,fish,0.66,0.66)
  • Same status for data and induced rules
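
The four rules above can be checked by hand: there are three transactions, so a pair occurring in two of them has support 2/3, and confidence is that pair count divided by the count of the left-hand item. The Python sketch below recomputes them by brute-force enumeration of item pairs. It is not Apriori, which the DATASIFT architecture slide names as the inductive module, and the function name one_to_one_rules is an assumption of this sketch.

```python
from itertools import permutations

# Facts from the slide: (transaction id, item) pairs.
BASKET = [
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
]


def one_to_one_rules(basket, min_supp, min_conf):
    """Return (left, right, support, confidence) for single-item rules."""
    items_by_tid = {}
    for tid, item in basket:
        items_by_tid.setdefault(tid, set()).add(item)
    n_transactions = len(items_by_tid)

    item_count = {}
    pair_count = {}
    for items in items_by_tid.values():
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in permutations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (left, right), count in pair_count.items():
        support = count / n_transactions      # fraction of all transactions
        confidence = count / item_count[left]  # fraction of those with `left`
        if support >= min_supp and confidence >= min_conf:
            rules.append((left, right, support, confidence))
    return sorted(rules)


rules = one_to_one_rules(BASKET, min_supp=0.4, min_conf=0.0)
for left, right, support, confidence in rules:
    print(f"{left} => {right}  support={support:.2f} confidence={confidence:.2f}")
```

Pairs such as milk and fish occur in only one of the three transactions (support 1/3), so they fall below the 40% threshold and are not reported.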

42
Reasoning on item hierarchies
  • Which rules survive/decay up/down the item
    hierarchy?
  • rules_at_level(I, pattern<S,C,Itemset>) ←
    itemset_abstraction(I,Tid,Itemset).
  • preserved_rules(Left,Right) ←
    rules_at_level(I,Left,Right,_,_),
    rules_at_level(I1,Left,Right,_,_).

43
Business rules: reasoning on promotions
  • Which rules are established by a promotion?
  • interval(before, -∞, 3/7/1998).
  • interval(promotion, 3/8/1998, 3/30/1998).
  • interval(after, 3/31/1998, +∞).
  • established_rules(Left, Right) ←
    not rules_partition(before, Left, Right, _, _),
    rules_partition(promotion, Left, Right, _, _),
    rules_partition(after, Left, Right, _, _).
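
Read as set operations, the established_rules definition keeps the rules that hold during and after the promotion but did not hold before it. A small Python sketch of that reading is given here; the rule sets in RULES_BY_INTERVAL are hypothetical illustration data, not results from the slides, and rules are reduced to (left, right) pairs with support and confidence dropped.

```python
# Hypothetical rules mined separately on each time interval.
RULES_BY_INTERVAL = {
    "before": {("bread", "milk")},
    "promotion": {("bread", "milk"), ("pasta", "sauce"), ("chips", "soda")},
    "after": {("bread", "milk"), ("pasta", "sauce")},
}


def established_rules(rules_by_interval):
    """Rules absent before the promotion that hold during and after it."""
    before = rules_by_interval["before"]
    promotion = rules_by_interval["promotion"]
    after = rules_by_interval["after"]
    return (promotion & after) - before


established = established_rules(RULES_BY_INTERVAL)
print(established)
```

In this made-up data, bread/milk already held before the promotion and chips/soda did not survive it, so only pasta/sauce counts as established.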

44
Business rules: temporal reasoning
  • How does rule support change over time?

45
Decision tree construction in DATASIFT
  • construct training and test set using rules
    • training_set(P,Case_list) ← ...
    • test_tuple(ID,F1,...,F20,Rec,Act_rec,CAR) ← ...
  • construct classifier using external call to C5.0
    • tree_rules(Tree_name,P,PF,MC,BO,Rule_list) ←
      training_set(P,Case_list),
      tree_induction(Case_list,PF,MC,BO,Rule_list).
  • parameters
    • pruning factor PF
    • misclassification costs MC
    • boosting BO
  • [Callouts on the tree_rules rule: external call,
    induced classifier]
46
Putting decision trees to work
  • prediction of target variable
  • prediction(Tree_name,ID,CAR,Predicted_CAR) ←
    tree_rules(Tree_name,_,_,_,Rule_list),
    test_subject(ID,F1,...,F20,_,_,CAR),
    classify(Rule_list,F1,...,F20,Predicted_CAR).
  • Model evaluation: actual recovery of a classifier
    (sum of the recovery of tuples classified as
    positive)
  • actual_recovery(Tree_name,sum<Actual_Recovery>) ←
    prediction(Tree_name,ID,_,pos),
    test_subject(ID,F1,...,F20,_,Actual_Recovery,_).
  • sum<Actual_Recovery> is the aggregate
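
The two rules on this slide first classify each test subject and then sum the actual recovery over the subjects predicted positive. The Python sketch below shows that flow under loud assumptions: the test subjects, the single turnover feature, the rule-list representation and the default class are all hypothetical and stand in for the C5.0 output, which the slides do not show.

```python
# Hypothetical test subjects: (id, features, actual_recovery, actual_class).
TEST_SUBJECTS = [
    (1, {"turnover": 900}, 120.0, "pos"),
    (2, {"turnover": 100}, 0.0, "neg"),
    (3, {"turnover": 700}, 80.0, "pos"),
    (4, {"turnover": 650}, 0.0, "neg"),
]

# Hypothetical induced classifier: ordered (condition, label) rules.
RULE_LIST = [(lambda features: features["turnover"] > 600, "pos")]


def classify(rule_list, features):
    """Return the label of the first rule whose condition holds."""
    for condition, label in rule_list:
        if condition(features):
            return label
    return "neg"  # default class, an assumption of this sketch


def prediction(rule_list, subjects):
    """Map each subject id to its predicted class."""
    return {sid: classify(rule_list, feats) for sid, feats, _, _ in subjects}


def actual_recovery(rule_list, subjects):
    """Sum the actual recovery of the subjects classified as positive."""
    predicted = prediction(rule_list, subjects)
    return sum(rec for sid, _, rec, _ in subjects if predicted[sid] == "pos")


total = actual_recovery(RULE_LIST, TEST_SUBJECTS)
print(total)  # 200.0: subjects 1, 3 and 4 are predicted positive
```

Subject 4 is a false positive here: it is selected by the classifier but contributes no recovery, which is the kind of effect this evaluation measure is meant to expose.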
47
Combining decision trees
  • Model conjunction
  • tree_conjunction(T1,T2,ID,CAR,pos) ←
    prediction(T1,ID,CAR,pos),
    prediction(T2,ID,CAR,pos).
  • tree_conjunction(T1,T2,ID,CAR,neg) ←
    test_subject(ID,F1,...,F20,_,_,CAR),
    not tree_conjunction(T1,T2,ID,CAR,pos).
  • More interesting combinations readily
    expressible
  • e.g. meta learning (Chan and Stolfo 93)
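
Model conjunction labels a subject positive only when both trees agree on positive, and negative otherwise. A minimal Python sketch of that combination follows; the two prediction dictionaries are hypothetical data, and representing a tree by its id-to-label predictions is an assumption of this sketch.

```python
# Hypothetical predictions of two classifiers over the same test subjects.
PREDICTIONS_T1 = {1: "pos", 2: "neg", 3: "pos", 4: "pos"}
PREDICTIONS_T2 = {1: "pos", 2: "pos", 3: "neg", 4: "pos"}


def tree_conjunction(pred1, pred2):
    """Label a subject pos only if both classifiers predict pos, else neg."""
    return {
        sid: "pos" if pred1[sid] == "pos" and pred2[sid] == "pos" else "neg"
        for sid in pred1
    }


combined = tree_conjunction(PREDICTIONS_T1, PREDICTIONS_T2)
print(combined)
```

The conjunction is more conservative than either tree alone, so it selects fewer subjects as positive.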

48
We proposed ...
  • a KDD methodology for audit planning
    • define an audit cost model
    • monitor training- and test-set construction
    • assess the quality of a classifier
    • tune classifier construction to specific policies
  • and its formalization in a prototype logic-based
    KDSE, supporting
    • integration of deduction and induction
    • integration of domain and induced knowledge
    • separation of conceptual and implementation level

49
Module outline
  • Data analysis and KD Support Environments
  • Data mining technology trends
    • from tools
    • to suites
    • to solutions
  • Towards data mining query languages
  • DATASIFT: a logic-based KDSE
  • Future research challenges

50
A data mining research agenda
  • Integration with data warehouse and relational DB
  • Scalable, parallel/distributed and incremental
    mining
  • Data mining query language optimization
  • Multiple, integrated data mining methods
  • KDSE and methodological support for vertical
    appl.
  • Interactive, exploratory data mining environments
  • Mining on other forms of data
    • spatio-temporal databases
    • text
    • multimedia
    • web

51
Scale up!
  • Scaling up existing algorithms (AI, ML, IR)
  • Association rules
  • Correlation rules
  • Causal relationship
  • Classification
  • Clustering
  • Bayesian networks

52
Background knowledge & constraints
  • Incorporating background knowledge and
    constraints into existing data mining techniques
  • Double benefit for DMQL semantics and
    optimization!
  • traditional algorithms
    • Disproportionate computational cost for selective
      users
    • Overwhelming volume of potentially useless
      results
  • need user-controlled focus in mining process
    • Association rules containing certain items
    • Sequential patterns containing certain patterns
    • Classification?

53
Vertical applications of data mining
  • More success stories needed!
  • Current data mining systems lack a thick semantic
    layer (similarly to the early relational database
    systems)
  • Verticalized data mining systems, e.g.
  • Market analysis systems
  • Fraud detection systems
  • Automated mining and interactive mining: how far
    are they?

54
Autofocus data mining
  • policy options, business rules
  • selection of data mining function
  • fine parameter tuning of mining function
55
DBMS coupling
  • Tight-coupling with DBMS
  • Most data mining algorithms are based on flat
    file data (i.e. loose-coupling with DBMS)
  • A set of standard data mining operators
  • (e.g. sampling operator)

56
Web mining: why?
  • No standards on the web, enormous blob of
    unstructured and heterogeneous info
  • Very dynamic
    • One new WWW server every 2 hours
    • 5 million documents in 1995
    • 320 million documents in 1998
  • Indices get obsolete very quickly
  • Better means needed for discovering resources and
    extracting knowledge

57
Web mining challenges
  • Today's search engines are plagued by problems
    • the abundance problem: 99% of info of no
      interest to 99% of people!
    • limited coverage of the Web
    • limited query interface based on keyword-oriented
      search
    • limited customization to individual users

58
Web mining
  • Web content mining
    • mining what Web search engines find
    • Web document classification (Chakrabarti et al
      99)
    • warehousing a Meta-Web (Zaïane and Han 98)
    • intelligent query answering in Web search
  • Web usage mining
    • Web log mining: find access patterns and trends
      (Zaiane et al 98)
    • customized user tracking and adaptive sites
      (Perkowitz et al 97)
  • Web structure mining
    • discover authoritative pages: a page is important
      if important pages point to it (Chakrabarti et
      al 99, Kleinberg 98)
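
The circular definition of authoritative pages can be turned into an iterative computation. The sketch below follows the hub/authority iteration usually associated with the Kleinberg reference: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to. The LINKS graph is hypothetical, and the fixed iteration count and Euclidean normalization are assumptions of this sketch rather than details taken from the slides.

```python
import math

# Hypothetical link graph: page -> pages it links to.
LINKS = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}


def _normalize(scores):
    """Scale scores in place to unit Euclidean length (no-op if all zero)."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    if norm > 0:
        for page in scores:
            scores[page] /= norm


def hits(links, iterations=20):
    """Return (hub, authority) score dictionaries after a fixed number of rounds."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 0.0 for page in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages pointing to this page.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        _normalize(auth)
        # Hub: sum of authority scores of the pages this page points to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        _normalize(hub)
    return hub, auth


hub, auth = hits(LINKS)
print(max(auth, key=auth.get))  # "c": it is pointed to by both hubs a and b
```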

59
Warehousing a Meta-Web (Zaïane & Han 98)
  • Meta-Web: summarizes the contents and structure
    of the Web, and evolves with the Web
  • Layer0: the Web itself
  • Layer1: the lowest layer of the Meta-Web
    • an entry: a Web page summary, including class,
      time, URL, contents, keywords, popularity,
      weight, links, etc.
  • Layer2 and up: summary/classification/clustering
  • Meta-Web is warehoused and incrementally updated
  • Querying and mining is performed on or assisted
    by meta-Web
  • Is it feasible/sustainable? Is XML of any help?

60
Meta-Web (from Jiawei Han's panel talk @ SIGMOD99)
[Figure: layered Meta-Web, top to bottom: Layern (more generalized descriptions), ..., Layer1 (generalized descriptions), Layer0]
61
Weblog mining
  • Web servers register a log entry for every single
    access.
  • A huge number of accesses (hits) are registered
    and collected in an ever-growing web log.
  • Why warehousing/mining web logs?
    • Enhance server performance by learning access
      patterns of general or particular users (guess
      what user will ask next and pre-cache!)
    • Improve system design of web applications
    • Identify potential prime advertisement locations
  • Greatest peril: the privacy pitfall
    • See e.g. (Markoff 99): The Rise of Little
      Brother.

62
Some web mining references
  • M. Perkowitz and O. Etzioni. Adaptive sites:
    Automatically learning from user access patterns.
    In Proc. 6th Int. World Wide Web Conf., Santa
    Clara, California, April 1997.
  • J. Pitkow. In search of reliable usage data on
    the www. In Proc. 6th Int. World Wide Web Conf.,
    Santa Clara, California, April 1997.
  • T. Sullivan. Reading reader reaction: A proposal
    for inferential analysis of web server log files.
    In Proc. 3rd Conf. Human Factors & the Web,
    Denver, Colorado, June 1997.
  • O. R. Zaiane, M. Xin, and J. Han. Discovering Web
    access patterns and trends by applying OLAP and
    data mining technology on Web logs. In Proc.
    Advances in Digital Libraries Conf. (ADL'98),
    pages 19-29, Santa Barbara, CA, April 1998.
  • O. R. Zaiane and J. Han. Resource and knowledge
    discovery in global information systems: a
    preliminary design and experiment. In Proc.
    KDD95, pp. 331-336, 1995.
  • O. R. Zaiane and J. Han. WebML: querying the
    world-wide web for resources and knowledge. In
    Proc. Int. Workshop on Web Information and Data
    Management (WIDM98), pp. 9-12, 1998.
  • S. Chakrabarti, B. E. Dom, S. R. Kumar, P.
    Raghavan, et al. Mining the Web's link structure.
    COMPUTER, 32:60-67, 1999.
  • S. Chakrabarti, B. E. Dom, P. Indyk. Enhanced
    hypertext classification using hyperlinks. In
    Proc. 1998 ACM-SIGMOD, pp. 307-318, 1998.
  • J. Kleinberg. Authoritative sources in a
    hyperlinked environment. In Proc. ACM-SIAM Symp.
    on Discrete Algorithms, 1998.
  • J. Markoff. The Rise of Little Brother. Upside,
    Apr. 1999. http://www.upside.com/texis/mvm/story?id=36d4613c0

63
Pisa KDD Lab references
  • F. Giannotti and G. Manco. Making Knowledge
    Extraction and Reasoning Closer. In Proc.
    PAKDD'00, The Fourth Pacific-Asia Conference on
    Knowledge Discovery and Data Mining, Kyoto, 2000.
  • F. Giannotti and G. Manco. Querying Inductive
    Databases via Logic-Based User Defined
    Aggregates. In Proc. PKDD'99, The Third Europ.
    Conf. on Principles and Practice of Knowledge
    Discovery in Databases. Prague, Sept. 1999.
  • F. Bonchi, F. Giannotti, G. Mainetto, D.
    Pedreschi. Using Data Mining Techniques in Fiscal
    Fraud Detection. In Proc. DaWaK'99, First Int.
    Conf. on Data Warehousing and Knowledge
    Discovery. Florence, Italy, Sept. 1999.
  • F. Bonchi , F. Giannotti, G. Mainetto, D.
    Pedreschi. A Classification-based Methodology for
    Planning Audit Strategies in Fraud Detection. In
    Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge
    Discovery & Data Mining, San Diego (CA), August
    1999.
  • F. Giannotti, G. Manco, D. Pedreschi and F.
    Turini. Experiences with a logic-based knowledge
    discovery support environment. In Proc. 1999 ACM
    SIGMOD Workshop on Research Issues in Data Mining
    and Knowledge Discovery (SIGMOD'99 DMKD).
    Philadelphia, May 1999.
  • F. Giannotti, M. Nanni, G. Manco, D. Pedreschi
    and F. Turini. Integration of Deduction and
    Induction for Mining Supermarket Sales Data. In
    Proc. PADD'99, Practical Application of Data
    Discovery, Int. Conference, London, April 1999.
  • F. Giannotti, G. Manco, M. Nanni, D. Pedreschi.
    Nondeterministic, Nonmonotonic Logic Databases.
    IEEE Trans. on Knowledge and Data Engineering.
    2000.
  • F. Giannotti, M. Nanni, G. Manco, D. Pedreschi
    and F. Turini. Using deduction for intelligent
    data analysis. Submitted, 2000.
    http://www-kdd.di.unipi.it/
  • P. Becuzzi, M. Coppola, S. Ruggieri and M.
    Vanneschi. Parallelisation of C4.5 as a
    particular divide and conquer computation.
    Proc. 3rd Workshop on High Performance Data
    Mining, Springer-Verlag LNCS, 2000.