Data Mining in the Multidimensional Parameter Space

1
Data Mining in the Multidimensional Parameter Space
  • Yanxia Zhang
  • National Astronomical Observatories, CAS
  • Nov. 27, 2003

2
Outline
  • Why and What
  • DM Technology
  • Future Directions

3
Why
  • Necessity is the mother of invention
  • Data avalanche
  • VO
  • DM / KDD

4
What
  • DM (KDD)
  • Extraction of interesting (non-trivial,
    implicit, previously unknown and potentially
    useful) information from data in large databases
  • Alternative names
  • Data mining: a misnomer?
  • Knowledge discovery in databases (KDD, SIGKDD),
    knowledge extraction, data archeology, data
    dredging, information harvesting, business
    intelligence, etc.

5
Taxonomy of DM
In large scientific databases, DM comes in two flavors:
  • Event-based mining
  • Relationship-based mining

6
Event-Based Mining for Science
  • Event-based mining is based upon events or trends
    in data.
  • Known events / known algorithms - use existing
    physical models (descriptive models) to locate
    known phenomena of interest either spatially or
    temporally within a large database.
  • Known events / unknown algorithms - use pattern
    recognition and clustering properties of data to
    discover new observational (in our case,
    astrophysical) relationships among known
    phenomena.
  • Unknown events / known algorithms - use expected
    physical relationships (predictive models) among
    observational parameters of astrophysical
    phenomena to predict the presence of previously
    unseen events within a large complex database.
  • Unknown events / unknown algorithms - use
    thresholds or trends to identify transient or
    otherwise unique ("one-of-a-kind") events and
    therefore to discover new phenomena.

7
Relationship-Based Mining for Science
  • Relationship-based mining is based on
    associations.
  • Spatial associations -- identify events
    (astronomical objects) at the same location in
    the sky.
  • Temporal associations -- identify events
    occurring during the same or related periods of
    time.
  • Coincidence associations -- use clustering
    techniques to identify events that are co-located
    within a multi-dimensional parameter space.

8
Science Requirements for DM
  • Cross-Identification - refers to the classical
    problem of associating the source list in one
    database to the source list in another.
  • Cross-Correlation - refers to the search for
    correlations, tendencies, and trends between
    physical parameters in multi-dimensional data,
    usually across databases.
  • Nearest-Neighbor Identification - refers to the
    general application of clustering algorithms in
    multi-dimensional parameter space, usually within
    a database.
  • Systematic Data Exploration - refers to the
    application of the broad range of event-based and
    relationship-based queries to a database in the
    hope of making a serendipitous discovery of new
    objects or a new class of objects.
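
As a rough illustration of cross-identification, the sketch below pairs each source in one list with the nearest source in another list by angular separation. The function names, the 2-arcsecond tolerance, and the toy coordinates are assumptions made for this example and do not come from the presentation; a real survey cross-match would use a spatial index instead of this brute-force loop.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))

def cross_identify(catalog_a, catalog_b, tolerance_arcsec=2.0):
    """Match each source in A to its nearest source in B within the tolerance.

    Catalogs are lists of (ra_deg, dec_deg); returns (index_a, index_b) pairs.
    """
    tolerance_deg = tolerance_arcsec / 3600.0
    matches = []
    for i, (ra_a, dec_a) in enumerate(catalog_a):
        best_j, best_sep = None, tolerance_deg
        for j, (ra_b, dec_b) in enumerate(catalog_b):
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_j, best_sep = j, sep
        if best_j is not None:
            matches.append((i, best_j))
    return matches

# Hypothetical source lists (RA, Dec in degrees) from two databases.
optical = [(10.0000, 20.0000), (150.5000, -5.2500)]
radio = [(150.5002, -5.2501), (10.0001, 20.0001), (200.0, 45.0)]
pairs = cross_identify(optical, radio)
```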

9
Outline
  • Why and What
  • DM Technology
  • Future Directions

10
DM: Confluence of Multiple Disciplines
  • Database systems, data warehouses, OLAP
  • Statistics
  • Machine learning and AI
  • Visualization
  • Information science
  • Other disciplines

11
DM: A KDD Process
  • Data mining: the core of the knowledge discovery
    process.
  • Process flow: Databases → Data Cleaning and Data
    Integration → Data Warehouse → Selection →
    Task-relevant Data → Data Mining → Pattern
    Evaluation → Knowledge

12
DM Functionality
  • Concept description: characterization and
    comparison
  • Generalize, summarize, and possibly contrast data
    characteristics, e.g., stars vs. galaxies.
  • Association
  • From association, correlation, to causality.
  • Finding rules like stars → point sources.
  • Classification and Prediction
  • Classify data based on the values in a
    classifying attribute, e.g., classify objects
    based on spectra, or classify galaxies and stars
    based on images.
  • Predict some unknown or missing attribute values
    based on other information.

13
DM Functionality (Cont.)
  • Clustering
  • Group data to form new classes, e.g., cluster
    spectra data to find distribution patterns.
  • Time-series analysis
  • Trend and deviation analysis: find and
    characterize evolution trends, sequential
    patterns, similar sequences, and deviation data,
    e.g., variable stars.
  • Similarity-based pattern-directed analysis: find
    and characterize user-specified patterns in
    large databases.
  • Cyclicity/periodicity analysis: find
    segment-wise or total cycles or periodic
    behaviours in time-related data.
  • Other pattern-directed or statistical analysis

14
DM: On What Kind of Data?
  • Relational databases
  • Data warehouses
  • Transactional databases
  • Advanced DB systems and information repositories
  • Object-oriented and object-relational databases
  • Spatial databases
  • Time-series data and temporal data
  • Text databases and multimedia databases
  • Heterogeneous and legacy databases
  • WWW

15
Challenges in DM
  • Mining methodology issues
  • Mining different kinds of knowledge in databases.
  • Interactive mining of knowledge at multiple
    levels of abstraction.
  • Incorporation of background knowledge
  • DM query languages and ad-hoc DM.
  • Expression and visualization of DM results.
  • Handling noise and incomplete data
  • Pattern evaluation: the interestingness problem.
  • Performance issues
  • Efficiency and scalability of DM algorithms.
  • Parallel, distributed and incremental mining
    methods.

16
Challenges in DM (Cont.)
  • Issues related to the variety of data types
  • Handling relational and complex types of data
  • Mining information from heterogeneous databases
    and global information systems.
  • Issues related to applications and social
    impacts
  • Application of discovered knowledge.
  • Domain-specific data mining tools
  • Intelligent query answering
  • Process control and decision making.
  • Integration of the discovered knowledge with
    existing knowledge: a knowledge fusion problem.
  • Protection of data security and integrity.

17
Mining Association Rules
  • Association rule mining
  • Finding associations or correlations among a set
    of items or objects in transaction databases,
    relational databases, and data warehouses.
  • Applications
  • Basket data analysis, cross-marketing, catalog
    design, loss-leader analysis, clustering, etc.
  • Examples.
  • Rule form: LHS → RHS [support, confidence].
  • buys(x, diapers) → buys(x, beers) [0.5%, 60%]
  • major(x, CS) ∧ takes(x, DB) → grade(x, A) [1%,
    75%]
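
To make the two rule measures concrete, here is a minimal sketch that computes support and confidence for a rule LHS → RHS over a handful of transactions. The basket contents and function names are made up for illustration. Support is taken as the fraction of transactions containing every item of the rule, and confidence as that support divided by the support of the LHS alone.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimated P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / lhs_support

# Hypothetical market baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]
rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
```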

18
Methods for Mining Associations
  • The Apriori principle: any subset of a frequent
    itemset must be frequent.
  • (Agrawal & Srikant 94; Mannila, Klementen, et
    al. 94)
  • Partition technique (Savasere, Omiecinski &
    Navathe 95)
  • Sampling technique (Toivonen 96)
  • Multi-level or generalized association (Agrawal &
    Srikant 95; Han & Fu 95)
  • Quantitative association rule mining (Srikant &
    Agrawal 96; Lent et al. 97; Miller 97).
  • Constraint-based or query-based association (Ng
    et al. 98; Tsur et al. 98)
  • From association to correlation (Brin et al. 97)
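
The Apriori principle can be sketched as a level-wise search: candidate k-itemsets are built only from frequent (k-1)-itemsets, and any candidate with an infrequent subset is pruned before counting. The code below is a compact teaching sketch under that reading, not the optimized algorithm from the cited papers; the function name and the toy transactions are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate against the data and keep the frequent ones.
        result = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                result[c] = s
        return result

    level = frequent({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent

# Hypothetical transactions over items A, B, C.
toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
frequent_itemsets = apriori(toy, min_support=0.5)
```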

19
Classification
  • Data categorization based on a set of training
    objects.
  • Applications: classification of stars, galaxies,
    AGN, etc.
  • Example: classify AGN and provide the
    characteristics which describe each class or subclass.
  • The classification task: based on the features
    present in the class-labeled training data,
    develop a description or model for each class.
    It is used for
  • classification of future test data,
  • better understanding of each class, and
  • prediction of certain properties and behaviors.

20
Three Schemes in Classification
  • Knowledge to be mined
  • Summarization (characterization), comparison,
    association, classification, clustering, trend,
    deviation and pattern analysis, etc.
  • Mining knowledge at different abstraction levels
  • primitive level, high level, multiple-level,
    etc.
  • Databases to be mined
  • Relational, transactional, object-oriented,
    object-relational, active, spatial, time-series,
    text, multi-media, heterogeneous, legacy, etc.
  • Techniques adopted
  • Database-oriented, data warehouse (OLAP), machine
    learning, statistics, visualization, neural
    network, etc.

21
Major Classification Methods
  • Decision tree-based classification
  • Training set vs test set or cross-validation
  • Overfitting problem and tree pruning
  • Boosting techniques.
  • Bayesian classification
  • Naïve Bayesian classification
  • Bayesian belief networks
  • Boosting techniques (e.g., AdaBoosting).
  • Neural network approach
  • Multi-layer networks and back-propagation.
  • Genetic algorithms
  • Genetic operators and fitness function selection.
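
As one small example from the list above, naive Bayesian classification can be sketched for discrete attributes as follows. The attribute values, labels, and function names are invented for illustration, and Laplace (add-one) smoothing is an assumption of this sketch that the slide does not mention.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Collect the class and per-attribute value counts needed for prediction."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict_naive_bayes(model, row):
    """Return the class with the highest log-posterior for the given row."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Laplace smoothing keeps unseen values from zeroing the product.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training objects: (image shape, colour) -> class.
rows = [("point", "blue"), ("point", "red"), ("point", "blue"),
        ("extended", "red"), ("extended", "red"), ("extended", "blue")]
labels = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]
model = train_naive_bayes(rows, labels)
```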

22
Predictive Modeling in Databases
  • Predictive modeling: predict data values or
    construct generalized linear models based on
    the database data.
  • One can only predict value ranges or category
    distributions.
  • Method outline
  • Minimal generalization
  • Attribute relevance analysis
  • Generalized linear model construction
  • Prediction.
  • Determine the major factors which influence the
    prediction.
  • Data relevance analysis: uncertainty measurement,
    entropy analysis, expert judgement, etc.
  • Multi-level prediction: drill-down and roll-up
    analysis.
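
One common way to carry out entropy-based attribute relevance analysis is information gain: the reduction in class entropy obtained by splitting the data on an attribute. The sketch below assumes that measure (the slide does not name a specific one) and uses invented toy data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy of the labels minus the weighted entropy after splitting."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical objects: (image shape, colour) -> class.
rows = [("point", "blue"), ("point", "red"),
        ("extended", "blue"), ("extended", "red")]
labels = ["star", "star", "galaxy", "galaxy"]
gain_shape = information_gain(rows, labels, 0)
gain_colour = information_gain(rows, labels, 1)
```

In this toy set the shape attribute separates the classes completely while colour carries no information, so shape would be ranked as the more relevant factor.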

23
Data Clustering Analysis
  • Clustering
  • Partitioning a set of data (or objects) into a
    set of classes, called clusters, such that
    members of each class share some interesting
    common properties.
  • High quality clusters
  • the intra-class similarity is high.
  • the inter-class similarity is low.
  • Measuring data clustering quality
  • Distance functions

24
Three Categories of Clustering Techniques
  • Partitioning-based
  • Basically enumerate various partitions and then
    score them by some criterion.
  • K-means, K-medoids, etc.
  • Hierarchy-based
  • Create a hierarchical decomposition of the set of
    data (or objects) using some criterion.
  • Model-based
  • A model is hypothesized for each of the clusters
  • Find the best fit of the data to that model.
  • E.g., Bayesian classification (AutoClass), Cobweb.
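
For the partitioning-based category, K-means can be sketched as alternating two steps: assign every point to its nearest centroid, then move each centroid to the mean of its members. This is a minimal version with caller-supplied starting centroids; the function name, the starting positions, and the toy points are assumptions for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, assignment) after iterating until assignments settle."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for each point.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Hypothetical 2-D data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = k_means(points, [(0, 0), (10, 10)])
```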

25
Database Clustering Methods
  • CLARANS (Ng & Han 94)
  • An extension to the k-medoid algorithm based on
    randomized search.
  • BIRCH (Zhang et al. 96)
  • CF tree (a balanced tree structure).
  • DBSCAN (EKXS 96)
  • Connects regions of sufficiently high density into
    clusters.
  • STING (WYM 97)
  • A hierarchical cell structure that stores
    statistical information.
  • CLIQUE (Agrawal et al. 98)
  • Cluster high dimensional data.
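
To show what "connecting regions of sufficiently high density" can look like in practice, here is a brute-force sketch in the spirit of DBSCAN. The parameter names `eps` and `min_points` and the toy data are assumptions for illustration; the published algorithm relies on a spatial index for the neighbourhood queries, which this sketch omits.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dbscan(points, eps, min_points):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_points:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
cluster_labels = dbscan(points, eps=1.5, min_points=3)
```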

26
Time-Series DM
  • Trend and deviation analysis
  • Find trend (data evolution regularity) and
    deviations.
  • Regression analysis, visualization techniques.
  • Subsequence analysis: similarity search
  • Subsequence matching: normalization, matching
  • Template specification: shape and macro
    specification.
  • Sequential pattern analysis
  • Sequential association rules
  • Periodicity analysis
  • full periods vs. partial periods, cyclic
    association rules.

27
Similarity Search in DM
  • Faloutsos et al. (1994)
  • Extract features from each window
  • Fourier transform and R-tree structure.
  • Agrawal et al. (1995)
  • Amplitude scaling, offset translation
  • Distance is determined from the sequence
    envelopes
  • Agrawal et al. (1995)
  • SDL pattern language to encode queries about
    shapes
  • Jagadish et al. (1997)
  • domain-independent framework
  • find all objects that are similar to some
    objects in class A and are not similar to any
    object in class B
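
The role of amplitude scaling and offset translation in similarity search can be illustrated with a plain sliding-window match: each window and the query are normalized to zero mean and unit variance before their Euclidean distance is taken. This is a simplified sketch, not the indexing schemes of the cited papers; the function names and the toy series are assumptions.

```python
import math

def z_normalize(seq):
    """Remove offset (mean) and amplitude (standard deviation) from a sequence."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)  # flat window: no shape information
    return [(x - mean) / std for x in seq]

def best_subsequence_match(series, query):
    """Return (start index, distance) of the window most similar to the query."""
    q = z_normalize(query)
    m = len(query)
    best_start, best_dist = None, math.inf
    for start in range(len(series) - m + 1):
        window = z_normalize(series[start:start + m])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, q)))
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

# Hypothetical light curve with one bump; the query has the same shape
# at a different offset and amplitude.
series = [5, 5, 5, 10, 20, 10, 5, 5]
start, distance = best_subsequence_match(series, [1, 2, 1])
```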

28
Periodic Pattern Search in Time-Related Data Sets
  • Full cycle analysis
  • Fourier transformation, other statistical
    analysis methods
  • Fragment-wise cyclic behavior analysis
  • Example: Jack reads the NY Times every day at 9:00am.
  • Given (natural) periods vs. arbitrary periods.
  • A data cube and OLAP-based technique (Han, Gong
    and Yin 98)
  • Cyclic association rules
  • Associations which form cycles.
  • Cyclic Association Rules (B. Özden, S. Ramaswamy,
    A. Silberschatz, 1998)
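
For full cycle analysis, a Fourier transformation can reveal the dominant period of an evenly sampled series. The sketch below uses a direct discrete Fourier transform written out in plain Python, which is adequate only for short series; the function name and the synthetic sine signal are assumptions for illustration.

```python
import cmath
import math

def dominant_period(series):
    """Period (in samples) of the strongest non-constant Fourier component."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # drop the constant term
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None

# Synthetic signal: a sine wave that repeats every 8 samples.
signal = [math.sin(2 * math.pi * t / 8) for t in range(64)]
period = dominant_period(signal)
```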

29
Conclusions
  • Data warehouse: an industry trend
  • DW stores a huge amount of subject-oriented,
    cleansed, integrated, consolidated, time-related
    data.
  • OLAP provides an interactive data analysis
    environment
  • OLAM: integration of mining with OLAP
  • Takes advantage of data warehouse infrastructure.
  • From batch mining to interactive,
    multi-dimensional mining
  • Many interesting research and implementation
    issues
  • Database mining and warehouse mining are both
    important directions to pursue.

30
Outline
  • Why and What
  • DM Technology
  • Future Directions

31
Future Work on DM Research
  • Integration with data warehouse, OLAP, and
    relational technology
  • Scalability: efficient algorithms,
    parallel/distributed and incremental mining
  • Ad-hoc mining query language and its optimization
  • Multiple, integrated DM functions and methods
  • Mining on new kinds of data: time-series data,
    text, multimedia, spatial and Web
  • Visual DM and knowledge visualization
  • Application exploration
  • Interactive, exploratory DM environment

32
Integration with Data Warehouse and OLAP Technology
  • Data warehouse: a strong industry trend
  • Huge amounts of subject-oriented, cleansed,
    integrated, consolidated, time-related data are
    stored in data warehouses
  • OLAP provides an interactive data analysis
    environment
  • Integrating mining with OLAP leads to
    multi-dimensional DM
  • On-Line Analytical Mining (OLAM).

33
Efficiency and Scalability in DM
  • Efficient algorithms in every DM function
  • Class description: summarization and comparison
  • Classification and prediction
  • Clustering
  • Time-series and trend analysis
  • Real-time, fast response in exploratory DM
  • Progressive, multiple precision data analysis
  • Parallel and distributed DM algorithms
  • Incremental DM methods

34
Why Parallel and Distributed DM?
  • Massive data sets
  • From mega-bytes to giga- and tera-bytes.
  • Costly data mining algorithms
  • Association, classification, clustering,
    prediction.
  • Data in applications are geographically
    distributed.
  • Parallel and networked computers are widely
    available.

35
DM Query Optimization
  • Ad-hoc DM query language: DM SQL.
  • DM query optimization
  • How to carve a DM view?
  • How to push user-specified rule constraints?
  • How to integrate interestingness measures in
    mining?
  • Interactivity of DM: a new challenge.

36
Multiple, Integrated DM Functions and Methods
  • Multiple mining functions
  • Concept description: characterization and
    discrimination
  • Classification and prediction
  • Clustering
  • Association and correlation analysis
  • Multiple mining methods
  • Statistical approaches
  • Machine learning approaches
  • Neural network approach
  • Other approaches: mathematical models, etc.

37
Mining Complex Types of Data
  • Text data mining
  • Library database, e-mails, book stores, Web
    pages.
  • Spatial data mining
  • geographic information systems, engineering
    databases, medical image database.
  • Multimedia data mining
  • image and video/audio databases.
  • Web mining
  • unstructured and semi-structured data
  • Web access pattern analysis: easily doable.

38
Visual DM
  • The power of visual comprehension
  • A picture is worth a thousand words
  • Pattern recognition and exploratory mining
  • Visual mining techniques
  • Data visualization
  • Integration with other mining methods
  • Visual representation of knowledge
  • Charts, graphs, trees, curves, cubes
  • Multi-dimensional representation: color, shape,
    texture, gray-level, etc.

39
Exploration of DM Applications
  • Need more success stories
  • Insurance and market analysis, NBA strategy
    analysis.
  • Most current DM systems lack a thick
    semantic layer
  • like the early relational database systems
    without application software.
  • Customized data mining systems
  • Market analysis DM systems
  • Insurance and customer analysis systems

40
Towards an Integrated, Exploratory DM Environment
  • Exploratory DM
  • Interactive, user-centered, exploratory mining
    process
  • High performance and fast response
  • Integrated multiple DM functions and methods
  • Try different approaches to see which one is
    better
  • Try different functions to see which patterns are
    more interesting
  • Automated mining and interactive mining: not too
    far apart!

41
Towards VO-based DM
Success of DM
Success of VO
Hope the day comes soon

42
  • Thank you !!!