Contributions to MiningMart

Transcript and Presenter's Notes


1
Contributions to MiningMart
  • Petr Berka
  • Laboratory for Intelligent Systems
  • University of Economics, Prague
  • berka_at_vse.cz

2
University of Economics, Prague
  • LISp - Laboratory for Intelligent Systems
  • SALOME - Laboratory for Multidisciplinary
    Approaches to Decision-making Support in
    Economics and Management

3
LISp research
  • probabilistic methods - decomposable probability
    models and bayesian networks
  • symbolic ML methods - 4FT association rules and
    decision rules
  • logical calculi for knowledge discovery in
    databases

4
LISp activities
  • Organized conferences
  • ECML97, PKDD99
  • Organized workshops
  • Discovery Challenge (PKDD99, PKDD2000,
    PKDD2001), WUPES97, WUPES2000
  • International Projects
  • MLNet, Sol-Eu-Net, EUNITE, MUM, MGT
  • KDNet

5
SALOME research
  • Quantitative and AI (pattern recognition, fuzzy,
    neural nets) approaches to support of decision
    making in economics and management

6
SALOME activities
  • Organized workshops
  • STIPR97, MME99
  • International Projects
  • Univ. Salzburg, Univ. Hokkaido, Univ. Cambridge

7
LISp software
  • LISp-Miner (data mining system)
  • DataSource (for data manipulation)
  • 4FT Miner (4FT association rules) and
  • KEX (decision rules)
  • experimental software for building graphical
    models
  • preprocessing procedures
  • related to KEX
  • based on information theoretic approach

8
LISp-Miner procedures
  • DataSource
  • creating new (virtual) attributes using SQL
  • equidistant and equifrequent discretization
  • grouping attribute values
  • computing attribute-value frequencies
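
The two discretization schemes and the frequency count listed above can be sketched as follows. This is a minimal illustration, not DataSource's actual implementation; the function names and the sample ages are assumptions made for the example.

```python
from collections import Counter

def equidistant_bins(values, k):
    """Cut points for k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equifrequent_bins(values, k):
    """Cut points for k intervals holding roughly equal numbers of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def discretize(value, cuts):
    """Index of the interval a value falls into (intervals closed on the left)."""
    return sum(value >= c for c in cuts)

# Hypothetical sample of client ages.
ages = [18, 21, 22, 25, 30, 31, 45, 60]
eq_dist = equidistant_bins(ages, 3)   # equal-width cut points
eq_freq = equifrequent_bins(ages, 3)  # equal-count cut points
# Attribute-value frequencies of the discretized attribute.
freq = Counter(discretize(a, eq_freq) for a in ages)
```

Equidistant cuts depend only on the minimum and maximum, so a single outlier can leave most intervals nearly empty; equifrequent cuts follow the data distribution instead.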

9
LISp-Miner procedures
  • 4FT-Miner (GUHA procedure)
  • 4FT association rules in the form
  • Ant ≈ Suc / Cond
  • KEX
  • weighted decision rules in the form
  • Ant ⇒ C (weight)

10
4FT-Miner basic idea
  • Generate a (potential) rule, e.g.
  • COLOUR(red) ∧ SIZE(small) ≈0.9,20 TEMP(high)
  • AGE(21-30) ∧ SALARY(low) ≈0.85,15 PAYMENTS
    (High) ∧ LOAN(bad)
  • Verify a rule using four-fold table

11
4FT-Miner
Data Matrix
         CLIENTS                                LOANS
Id       Age   Sex   Salary   District   Amount   Payment   Months   Quality
1        45    F     28 000   Prague     48 000   1 000     48       good
...      ...   ...   ...      ...        ...      ...       ...      ...
70 000   18    M     12 000   Brno       36 000   2 000     18       bad

Problem: Are there segments of clients SC and segments of loans SL such that
  • to be in SC is at 90% equivalent to having a loan from SL, and
  • there are at least 100 such clients?

Ant is at 90% equivalent to Suc:
Ant ⇔0.90,100 Suc is true iff a/(a+b+c) ≥ 0.9 ∧ a ≥ 100

Four-fold table
         Suc   ¬Suc
Ant      a     b
¬Ant     c     d

  • a - number of objects satisfying Ant and Suc
  • b - number of objects satisfying Ant and not satisfying Suc
  • c - number of objects not satisfying Ant and satisfying Suc
  • d - number of objects satisfying neither Ant nor Suc

12
4FT-Miner
  • Input
  • Data matrix,
  • quantifier ⇔0.90,100
  • Derived attributes for SC (possible Ant) Age (7
    values), Sex (2 values),
    Salary (3 values), District (77 values)
  • Derived attributes for SL (possible Suc)
    Amount (6 values), Duration (5 values),
    Quality (2 values)
  • Output
  • All Ant ⇔0.90,100 Suc true in data matrix
  • (5 equivalences from about 5 million possible
    relations)
  • an example
  • Age(20 - 30) ∧ Sex(F) ∧ Salary(low) ∧ District
    (Prague) ⇔0.90,100 Amount⟨20,50) ∧
    Quality(Bad)
  • four-fold table
             Suc     ¬Suc
    Ant      950     30
    ¬Ant     20      69000
  • a/(a+b+c) = 0.95 ≥ 0.9 ∧ 950 ≥ 100
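
The verification step can be sketched in a few lines: count the four-fold table from two Boolean columns, then test the quantifier condition a/(a+b+c) ≥ p ∧ a ≥ Base. The function names are assumptions for illustration, not part of 4FT-Miner; the counts 950, 30, 20 and 69000 are taken from the example on this slide.

```python
def four_fold_table(ant, suc):
    """Count a, b, c, d from two equally long lists of booleans."""
    a = sum(1 for x, y in zip(ant, suc) if x and y)
    b = sum(1 for x, y in zip(ant, suc) if x and not y)
    c = sum(1 for x, y in zip(ant, suc) if not x and y)
    d = sum(1 for x, y in zip(ant, suc) if not x and not y)
    return a, b, c, d

def equivalent(a, b, c, d, p=0.90, base=100):
    """Quantifier from the slides: a/(a+b+c) >= p and a >= base.

    d does not enter the condition; it is kept so the whole table can be passed.
    """
    covered = a + b + c
    return covered > 0 and a / covered >= p and a >= base

# The example rule from the slide.
example = equivalent(950, 30, 20, 69000)
ratio = 950 / (950 + 30 + 20)

# A toy table: one object in each cell.
tiny = four_fold_table([True, True, False, False],
                       [True, False, True, False])
```

Because d is ignored, the large number of clients satisfying neither Ant nor Suc cannot inflate the measure.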

13
KEX basic idea
  • Generate a (potential) rule, e.g.
  • YEARS-IN-COMPANY(0-3) ∧ AGE(0-25) ⇒ LOAN(GOOD)
  • If rule refines current set of rules
  • (validity a/(a+b) differs from weight
    inferred during consultation)
  • add into rule base with proper weight
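
The refinement loop above might look like the sketch below. Several parts are assumptions not stated on the slide: the weights are combined with a PROSPECTOR-style pseudo-Bayesian formula, a plain tolerance threshold stands in for whatever test KEX uses to decide that validity "differs", and the "proper weight" is taken to be the one that makes the composed weight equal to the validity. All names are hypothetical.

```python
def combine(w1, w2):
    """Pseudo-Bayesian combination of two weights in (0, 1); 0.5 is neutral.
    (Assumed combination function, not confirmed by the slides.)"""
    return (w1 * w2) / (w1 * w2 + (1 - w1) * (1 - w2))

def consult(rule_base, example):
    """Compose the weights of all rules whose antecedent the example satisfies."""
    weight = 0.5
    for antecedent, w in rule_base:
        if antecedent <= example:  # antecedent is a subset of the example's items
            weight = combine(weight, w)
    return weight

def refine(rule_base, antecedent, a, b, tolerance=0.05):
    """Add the rule if its validity a/(a+b) differs from the inferred weight.

    Assumes 0 < validity < 1 so that the odds below are defined.
    """
    validity = a / (a + b)
    inferred = consult(rule_base, antecedent)
    if abs(validity - inferred) <= tolerance:  # assumed stand-in for a statistical test
        return False
    # Choose w so that combine(inferred, w) == validity (odds multiply).
    odds = (validity / (1 - validity)) / (inferred / (1 - inferred))
    rule_base.append((antecedent, odds / (1 + odds)))
    return True

# Hypothetical rule base: the empty antecedent carries a prior weight for LOAN(GOOD).
rule_base = [(frozenset(), 0.6)]
candidate = frozenset({("YEARS-IN-COMPANY", "0-3"), ("AGE", "0-25")})
added = refine(rule_base, candidate, a=90, b=10)
```

After the call, consulting the rule base for an example matching the candidate antecedent reproduces the observed validity of 0.9.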

14
KEX - classification
15
KEX - learning
16
LISp-Miner architecture
MetaData (ODBC ACCESS)
Results
Data (ODBC ACCESS)
LM Windows
17
Preprocessing (LISp)
  • KEX-oriented
  • (fuzzy) discretization and grouping of values
  • computing the amount of noise in data
  • random sampling and balancing of data
  • handling missing values
  • Information theory
  • attribute selection
  • attribute grouping

18
fuzzy discretization
19
amount of noise
  • Amount of noise 20%
  • max. possible accuracy 80%
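
One way to read the relation between the two figures: if noise is taken to be the share of examples that contradict the majority class among examples with identical attribute values, then no classifier can exceed an accuracy of 1 minus that share. The slide does not define noise, so this definition, the function name, and the sample data are assumptions.

```python
from collections import Counter, defaultdict

def noise_and_max_accuracy(rows):
    """rows: list of (attribute_tuple, class_label) pairs."""
    by_attrs = defaultdict(Counter)
    for attrs, label in rows:
        by_attrs[attrs][label] += 1
    # Among examples with identical attribute values, only the
    # majority class can be predicted correctly.
    correct = sum(max(counts.values()) for counts in by_attrs.values())
    max_accuracy = correct / len(rows)
    return 1 - max_accuracy, max_accuracy

# Hypothetical data: two attribute combinations, each with one contradicting example.
rows = ([(("low", "young"), "good")] * 4 + [(("low", "young"), "bad")]
        + [(("high", "old"), "bad")] * 4 + [(("high", "old"), "good")])
noise, max_acc = noise_and_max_accuracy(rows)
```

With these ten rows, two examples contradict their group's majority, which gives the 20% / 80% split shown on the slide.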

20
data sampling
  • random split into training and testing set
  • select random stratified sample
  • balance unbalanced classes
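
The three sampling operations can be sketched with the standard library alone. Balancing is done here by undersampling every class to the size of the smallest one, which is only one possible choice; the function names and the toy loan data are assumptions.

```python
import random
from collections import defaultdict

def train_test_split(rows, test_fraction, rng):
    """Random split into a training and a testing set."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def by_class(rows, label_of):
    """Group rows by their class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups

def stratified_sample(rows, fraction, label_of, rng):
    """Random sample that keeps the class proportions of the input."""
    sample = []
    for group in by_class(rows, label_of).values():
        sample.extend(rng.sample(group, round(len(group) * fraction)))
    return sample

def balance(rows, label_of, rng):
    """Undersample every class to the size of the smallest class."""
    groups = by_class(rows, label_of)
    size = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, size))
    return balanced

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical data: 80 good loans and 20 bad loans.
loans = [(i, "good") for i in range(80)] + [(i, "bad") for i in range(80, 100)]
label = lambda row: row[1]
train, test = train_test_split(loans, 0.3, rng)
sample = stratified_sample(loans, 0.1, label, rng)
balanced = balance(loans, label, rng)
```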

21
handling missing values
  • remove example
  • substitute missing with new value
  • substitute missing with majority value
  • proportional substitution
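
The four strategies can be sketched as below, with rows stored as tuples and `None` marking a missing value. "Proportional substitution" is read here as drawing the replacement at random in proportion to the observed value frequencies; that reading, the function names, and the sample rows are assumptions.

```python
import random
from collections import Counter

MISSING = None

def _replace(row, col, value):
    """Copy of a row (tuple) with one column replaced."""
    return row[:col] + (value,) + row[col + 1:]

def remove_examples(rows, col):
    """Drop every example whose value in col is missing."""
    return [r for r in rows if r[col] is not MISSING]

def substitute_new_value(rows, col, new_value="unknown"):
    """Treat 'missing' as an additional attribute value."""
    return [_replace(r, col, new_value) if r[col] is MISSING else r for r in rows]

def substitute_majority(rows, col):
    """Replace missing values with the most frequent observed value."""
    counts = Counter(r[col] for r in rows if r[col] is not MISSING)
    return substitute_new_value(rows, col, counts.most_common(1)[0][0])

def substitute_proportional(rows, col, rng):
    """Replace missing values by draws proportional to observed frequencies."""
    counts = Counter(r[col] for r in rows if r[col] is not MISSING)
    values, weights = zip(*counts.items())
    return [_replace(r, col, rng.choices(values, weights)[0])
            if r[col] is MISSING else r for r in rows]

# Hypothetical rows: (Sex, Salary), with Salary missing twice.
rows = [("F", "low"), ("M", None), ("F", "high"), ("M", "low"), ("F", None)]
removed = remove_examples(rows, 1)
as_new = substitute_new_value(rows, 1)
as_majority = substitute_majority(rows, 1)
as_proportional = substitute_proportional(rows, 1, random.Random(0))
```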

22
information theory
  • Attribute selection - based on mutual
    information
  • Attribute grouping - based on information content
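
Attribute selection by mutual information can be sketched as ranking each attribute by I(A; C) = Σ p(a,c) log2( p(a,c) / (p(a) p(c)) ), estimated from value counts. The sketch covers selection only, not grouping; the function names and the four-row data set are assumptions.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X; Y) in bits, estimated from two equally long lists of values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # (c/n) * log2((c/n) / ((px/n) * (py/n))) simplifies to the term below.
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def rank_attributes(columns, target):
    """Attribute names sorted from most to least informative about the target."""
    scores = {name: mutual_information(vals, target)
              for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: Salary determines the class, Sex is independent of it.
target = ["good", "good", "bad", "bad"]
columns = {
    "Sex": ["F", "M", "F", "M"],
    "Salary": ["high", "high", "low", "low"],
}
ranking = rank_attributes(columns, target)
mi_salary = mutual_information(columns["Salary"], target)
mi_sex = mutual_information(columns["Sex"], target)
```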

23
Preprocessing architecture
Input data (ASCII)
Output data (ASCII)
procedure
Results
Data (ASCII)
procedure
24
SALOME software
  • Feature Selection Toolbox (Multi-Purpose Tool for
    Pattern Recognition)
  • feature selection
  • approximation-based modeling
  • classification
  • a consulting system helping to choose the
    most suitable method is being developed

25
Search strategies for FS
  • Search for a subset maximizing a criterion
    function (distance, divergence)
  • with a priori information
  • exhaustive search
  • branch and bound based algorithms
  • floating search algorithms
  • without a priori information
  • approximation method
  • divergence method
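
Of the strategies listed, exhaustive search is the simplest to sketch: evaluate the criterion function on every subset of the requested size and keep the best. The criterion below is a made-up stand-in (per-feature scores minus a redundancy penalty), not a distance or divergence measure from the Feature Selection Toolbox; all names and numbers are hypothetical.

```python
from itertools import combinations

def exhaustive_search(features, d, criterion):
    """Return the d-element subset with the highest criterion value."""
    return max(combinations(features, d), key=criterion)

# Hypothetical per-feature scores and pairwise redundancy penalties.
scores = {"age": 0.6, "salary": 0.9, "district": 0.2, "sex": 0.1}
redundancy = {frozenset({"age", "salary"}): 0.5}

def criterion(subset):
    """Toy criterion: summed scores minus penalties for redundant pairs."""
    gain = sum(scores[f] for f in subset)
    penalty = sum(redundancy.get(frozenset(pair), 0.0)
                  for pair in combinations(subset, 2))
    return gain - penalty

best = exhaustive_search(["age", "salary", "district", "sex"], 2, criterion)
```

The number of subsets grows combinatorially with the number of features, which is why branch and bound and floating search appear on the slide as alternatives.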

26
FST architecture
Data (ASCII)
Results
FST Windows
27
References
  • LISp-Miner
  • Berka,P. - Ivanek,J. Automated Knowledge
    Acquisition for PROSPECTOR-like Expert Systems.
    In (Bergadano, deRaedt eds.) Proc. ECML'94,
    Springer 1994, 339-342.
  • Berka,P. - Rauch,J. Data Mining using GUHA and
    KEX. In (Callaos, Yang, Aguilar eds.) 4th Int.
    Conf. on Information Systems, Analysis and
    Synthesis ISAS'98, 1998, Vol 2, 238-244.
  • Rauch,J. Classes of Four Fold Table Quantifiers.
    In (Zytkow, Quafafou eds.) Principles of Data
    Mining and Knowledge Discovery. Springer 1998,
    203 - 211.

28
References
  • Preprocessing
  • Bruha,I. - Berka,P. Discretization and
    Fuzzification of Numerical Attributes in
    Attribute-Based Learning. In (Szczepaniak, Lisboa,
    Kacprzyk eds.) Fuzzy Systems in Medicine,
    Physica Verlag, 2000, 112-138.
  • Pudil, P., Novovicová J. Novel Methods for
    Subset Selection with Respect to Problem
    Knowledge, IEEE Transactions on Intelligent
    Systems - Special Issue on Feature
    Transformation and Subset Selection 1998, 66-74
  • J. Zvarova and M. Studeny Information
    theoretical approach to constitution and
    reduction of medical data. International Journal
    of Medical Informatics 45 (1997), n. 1-2, pp.
    65-74.