Title: Data Mining in the Multidimensional Parameter Space
1 Data Mining in the Multidimensional Parameter Space
- Yanxia Zhang
- National Astronomical Observatories, CAS
- Nov. 27, 2003
2 Outline
- Why and What
- DM Technology
- Future Directions
3 Why
Necessity Is the Mother of Invention
Data avalanche
VO
DM / KDD
4 What
- DM (KDD)
- Extraction of interesting (non-trivial,
implicit, previously unknown and potentially
useful) information from data in large databases.
- Alternative names
- Data mining: a misnomer?
- Knowledge discovery in databases (KDD, SIGKDD),
knowledge extraction, data archeology, data
dredging, information harvesting, business
intelligence, etc.
5 Taxonomy of DM
In large scientific databases, DM comes in two flavors:
- Event-based mining
- Relationship-based mining
6 Event-Based Mining for Science
- Event-based mining is based upon events or trends
in data.
- Known events / known algorithms - use existing
physical models (descriptive models) to locate
known phenomena of interest either spatially or
temporally within a large database.
- Known events / unknown algorithms - use pattern
recognition and clustering properties of data to
discover new observational (in our case,
astrophysical) relationships among known
phenomena.
- Unknown events / known algorithms - use expected
physical relationships (predictive models) among
observational parameters of astrophysical
phenomena to predict the presence of previously
unseen events within a large complex database.
- Unknown events / unknown algorithms - use
thresholds or trends to identify transient or
otherwise unique ("one-of-a-kind") events and
therefore to discover new phenomena.
7 Relationship-Based Mining for Science
- Relationship-based mining is based on
associations.
- Spatial associations -- identify events
(astronomical objects) at the same location in
the sky.
- Temporal associations -- identify events
occurring during the same or related periods of
time.
- Coincidence associations -- use clustering
techniques to identify events that are co-located
within a multi-dimensional parameter space.
8 Science Requirements for DM
- Cross-Identification - refers to the classical
problem of associating the source list in one
database to the source list in another.
- Cross-Correlation - refers to the search for
correlations, tendencies, and trends between
physical parameters in multi-dimensional data,
usually across databases.
- Nearest-Neighbor Identification - refers to the
general application of clustering algorithms in
multi-dimensional parameter space, usually within
a database.
- Systematic Data Exploration - refers to the
application of the broad range of event-based and
relationship-based queries to a database in the
hope of making a serendipitous discovery of new
objects or a new class of objects.
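Cross-identification as described above can be sketched as a positional match between two source lists. This is a minimal brute-force illustration, not a production cross-matcher: the catalogue layout `(source_id, ra_deg, dec_deg)`, the 2 arcsec radius, and the sample coordinates are all assumptions made for the example.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_identify(cat_a, cat_b, radius_arcsec=2.0):
    """For each source in cat_a, keep the nearest cat_b source within the radius.

    Catalogues are lists of (source_id, ra_deg, dec_deg) tuples (assumed layout).
    Brute force: cost grows as len(cat_a) * len(cat_b).
    """
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for id_a, ra_a, dec_a in cat_a:
        best = None
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches[id_a] = best[0]
    return matches

# Made-up coordinates for illustration only.
optical = [("o1", 150.0000, 2.2000), ("o2", 150.1000, 2.3000)]
infrared = [("i1", 150.0002, 2.2001), ("i2", 151.0000, 2.0000)]
matches = cross_identify(optical, infrared)
```

For catalogues of realistic size, a spatial index (sky pixelization or a tree structure) would replace the double loop; the matching criterion stays the same.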
9Outline
- Why and What
- DM Technology
- Future Directions
10 DM: Confluence of Multiple Disciplines
[Diagram: DM at the center, drawing on database systems, data warehouses and OLAP; statistics; ML and AI; visualization; information science; other disciplines]
11 DM: A KDD Process
- Data mining: the core of the knowledge discovery
process.
[Diagram: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge]
12 DM Functionality
- Concept description: Characterization and
Comparison
- Generalize, summarize, and possibly contrast data
characteristics, e.g., stars vs. galaxies.
- Association
- From association, correlation, to causality.
- Finding rules like stars → point sources.
- Classification and Prediction
- Classify data based on the values in a
classifying attribute, e.g., classify objects
based on spectra, or classify galaxies and stars
based on images.
- Predict some unknown or missing attribute values
based on other information.
13 DM Functionality (Cont.)
- Clustering
- Group data to form new classes, e.g., cluster
spectra data to find distribution patterns.
- Time-series analysis
- Trend and deviation analysis: Find and
characterize evolution trend, sequential
patterns, similar sequences, and deviation data,
e.g., variable stars.
- Similarity-based pattern-directed analysis: Find
and characterize user-specified patterns in
large databases.
- Cyclicity/periodicity analysis: Find
segment-wise or total cycles or periodic
behaviours in time-related data.
- Other pattern-directed or statistical analysis
14 DM: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB systems and information repositories
- Object-oriented and object-relational databases
- Spatial databases
- Time-series data and temporal data
- Text databases and multimedia databases
- Heterogeneous and legacy databases
- WWW
15 Challenges in DM
- Mining methodology issues
- Mining different kinds of knowledge in databases.
- Interactive mining of knowledge at multiple
levels of abstraction.
- Incorporation of background knowledge
- DM query languages and ad-hoc DM.
- Expression and visualization of DM results.
- Handling noise and incomplete data
- Pattern evaluation: the interestingness problem.
- Performance issues
- Efficiency and scalability of DM algorithms.
- Parallel, distributed and incremental mining
methods.
16 Challenges in DM (Cont.)
- Issues related to the variety of data types
- Handling relational and complex types of data
- Mining information from heterogeneous databases
and global information systems.
- Issues related to applications and social
impacts
- Application of discovered knowledge.
- Domain-specific data mining tools
- Intelligent query answering
- Process control and decision making.
- Integration of the discovered knowledge with
existing knowledge: A knowledge fusion problem.
- Protection of data security and integrity.
17 Mining Association Rules
- Association rule mining
- Finding associations or correlations among a set
of items or objects in transaction databases,
relational databases, and data warehouses.
- Applications
- Basket data analysis, cross-marketing, catalog
design, loss-leader analysis, clustering, etc.
- Examples
- Rule form: LHS → RHS [support, confidence].
- buys(x, diapers) → buys(x, beers) [0.5%, 60%]
- major(x, CS) ∧ takes(x, DB) → grade(x, A) [1%,
75%]
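The support and confidence attached to each rule can be computed directly from a set of transactions. The sketch below assumes transactions are Python sets and reports both measures as fractions rather than percentages. The four baskets are invented for illustration and do not reproduce the figures quoted on the slide.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(LHS and RHS) / support(LHS); assumes LHS occurs at least once."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Toy basket data (hypothetical).
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

rule_support = support(transactions, {"diapers", "beer"})
rule_confidence = confidence(transactions, {"diapers"}, {"beer"})
```

Here the rule diapers → beer holds in 2 of 4 baskets (support 0.5) and in 2 of the 3 baskets that contain diapers (confidence 2/3).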
18 Methods for Mining Associations
- The Apriori principle: Any subset of a frequent
itemset must be frequent.
- (Agrawal & Srikant 94; Mannila, Klementen, et
al. 94)
- Partition technique (Savasere, Omiecinski,
Navathe 95)
- Sampling technique (Toivonen 96)
- Multi-level or generalized association (Agrawal &
Srikant 95, Han & Fu 95)
- Quantitative association rule mining (Srikant &
Agrawal 96, Lent et al. 97, Miller 97).
- Constraint-based or query-based association (Ng
et al. 98, Tsur et al. 98)
- From association to correlation (Brin et al. 97)
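The Apriori principle can be illustrated with a small level-wise search: candidates of size k are built only from frequent itemsets of size k-1, and any candidate with an infrequent subset is pruned before counting. This is a simplified sketch for small in-memory data, not the optimized algorithm from the cited papers; the function name and the toy baskets are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def frequent(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([i]) for i in items})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result

# Toy basket data (hypothetical).
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]
frequent_itemsets = apriori(baskets, min_support=0.5)
```

With a minimum support of 0.5, the four single items and the pair {diapers, beer} are frequent; no three-item candidate is generated, so the search stops.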
19 Classification
- Data categorization based on a set of training
objects.
- Applications: classification of stars, galaxies,
AGN, etc.
- Example: classify AGN and provide the symptoms
which describe each class or subclass.
- The classification task: Based on the features
present in the class-labeled training data,
develop a description or model for each class.
It is used for
- classification of future test data,
- better understanding of each class, and
- prediction of certain properties and behaviors.
20 Three Schemes in Classification
- Knowledge to be mined
- Summarization (characterization), comparison,
association, classification, clustering, trend,
deviation and pattern analysis, etc.
- Mining knowledge at different abstraction levels
- Primitive level, high level, multiple-level,
etc.
- Databases to be mined
- Relational, transactional, object-oriented,
object-relational, active, spatial, time-series,
text, multi-media, heterogeneous, legacy, etc.
- Techniques adopted
- Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, neural
network, etc.
21 Major Classification Methods
- Decision tree-based classification
- Training set vs test set or cross-validation
- Overfitting problem and tree pruning
- Boosting techniques.
- Bayesian classification
- Naïve Bayesian classification
- Bayesian belief networks
- Boosting techniques (e.g., AdaBoost).
- Neural network approach
- Multi-layer networks and back-propagation.
- Genetic algorithms
- Genetic operators and fitness function selection.
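As one concrete instance of the methods listed above, naive Bayesian classification can be sketched in a few lines. The version below assumes continuous features modelled as independent Gaussians per class. The two features (a concentration index and a colour) and all training values are hypothetical, chosen only to mimic a star/galaxy separation task.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small floor keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[y] = (n / len(samples), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log-posterior (features assumed independent)."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda y: log_posterior(*model[y]))

# Hypothetical features: (concentration index, colour). Values are made up.
train_x = [(0.90, 0.5), (0.95, 0.7), (0.88, 0.6),
           (0.40, 1.1), (0.35, 1.3), (0.45, 1.2)]
train_y = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]
model = fit_gaussian_nb(train_x, train_y)
```

In practice the model would be checked on a held-out test set or by cross-validation, as the decision-tree bullets above suggest.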
22 Predictive Modeling in Databases
- Predictive modeling: Predict data values or
construct generalized linear models based on
the database data.
- One can only predict value ranges or category
distributions.
- Method outline
- Minimal generalization
- Attribute relevance analysis
- Generalized linear model construction
- Prediction.
- Determine the major factors which influence the
prediction.
- Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up
analysis.
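The entropy analysis mentioned under data relevance analysis can be illustrated with information gain, one common entropy-based relevance measure. The slide does not say which measure is intended, so treating it as information gain is an assumption. The attribute names and the four records are invented for the example.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in the entropy of target after splitting rows on attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

# Hypothetical records: morphology separates the classes, band does not.
rows = [
    {"morphology": "point", "band": "optical", "class": "star"},
    {"morphology": "point", "band": "infrared", "class": "star"},
    {"morphology": "extended", "band": "optical", "class": "galaxy"},
    {"morphology": "extended", "band": "infrared", "class": "galaxy"},
]
gain_morphology = information_gain(rows, "morphology", "class")
gain_band = information_gain(rows, "band", "class")
```

An attribute with a gain near zero, like `band` here, carries little information about the target and is a candidate for removal before model construction.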
23 Data Clustering Analysis
- Clustering
- Partitioning a set of data (or objects) into a
set of classes, called clusters, such that
members of each class share some interesting
common properties.
- High quality clusters
- The intra-class similarity is high.
- The inter-class similarity is low.
- Measuring data clustering quality
- Distance functions
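One simple way to turn the quality criteria above into numbers is to compare the mean distance between members of the same cluster with the mean distance between members of different clusters. The sketch below uses Euclidean distance as the distance function, which is an assumption. It requires Python 3.8 or later for `math.dist`, at least two clusters, and at least one cluster with two or more members.

```python
import math
from itertools import combinations

def mean_intra_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    d = [math.dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(d) / len(d)

def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    d = [math.dist(a, b)
         for c1, c2 in combinations(clusters, 2) for a in c1 for b in c2]
    return sum(d) / len(d)

# Two well-separated toy clusters (made-up points).
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
```

High intra-class similarity corresponds to a small `intra` value, and low inter-class similarity to a large `inter` value, so a good partition has `inter` much larger than `intra`.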
24 Three Categories of Clustering Techniques
- Partitioning-based
- Basically enumerate various partitions and then
score them by some criterion.
- K-means, K-medoids, etc.
- Hierarchy-based
- Create a hierarchical decomposition of the set of
data (or objects) using some criterion.
- Model-based
- A model is hypothesized for each of the clusters
- Find the best fit of the data to that model.
- E.g., Bayesian classification (AutoClass), Cobweb.
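K-means, named above as a partitioning-based technique, can be sketched as alternating assignment and update steps (Lloyd's algorithm). In this minimal version the caller supplies the initial centroids, which keeps the example deterministic. Real implementations usually randomize or seed the initialization more carefully. The points are invented; Python 3.8 or later is assumed for `math.dist`.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0),
          (10.0, 10.0), (10.0, 12.0), (12.0, 10.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The criterion being minimized is the sum of squared distances from each point to its cluster centroid, which is the scoring function implied by "score them by some criterion" for this method.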
25 Database Clustering Methods
- CLARANS (Ng & Han 94)
- An extension to the k-medoid algorithm based on
randomized search.
- BIRCH (Zhang et al. 96)
- CF tree (a balanced tree structure).
- DBSCAN (EKXS 96)
- Connects regions of sufficiently high density into
clusters.
- STING (WYM 97)
- A hierarchical cell structure that stores
statistical information.
- CLIQUE (Agrawal et al. 98)
- Clusters high-dimensional data.
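The density-based idea behind DBSCAN can be sketched as follows: a point with at least `min_pts` neighbours within distance `eps` is a core point, and clusters grow by chaining core points together. This is a simplified brute-force illustration rather than the indexed algorithm of the cited paper. The parameter names, the noise label of -1, and the sample points are assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force range query; the point itself counts as a neighbour.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Two dense groups and one isolated point (made-up coordinates).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Unlike k-means, the number of clusters is not specified in advance, and isolated points are reported as noise instead of being forced into a cluster.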
26 Time-Series DM
- Trend and deviation analysis
- Find trend (data evolution regularity) and
deviations.
- Regression analysis, visualization techniques.
- Subsequence analysis: similarity search
- Subsequence matching: normalization, matching
- Template specification: shape and macro
specification.
- Sequential pattern analysis
- Sequential association rules
- Periodicity analysis
- Full periods vs. partial periods, cyclic
association rules.
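Trend and deviation analysis by regression can be sketched as a least-squares line fit followed by a residual cut. The evenly spaced time steps, the 3-sigma threshold, and the synthetic series with one injected outlier are assumptions made for this example.

```python
def linear_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean

def deviations(values, n_sigma=3.0):
    """Indices whose residual from the fitted trend exceeds n_sigma standard deviations."""
    slope, intercept = linear_trend(values)
    residuals = [v - (slope * t + intercept) for t, v in enumerate(values)]
    sigma = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return [t for t, r in enumerate(residuals) if abs(r) > n_sigma * sigma]

# Synthetic series: a straight line with one injected outlier at index 12.
clean = [2.0 * t + 1.0 for t in range(20)]
series = list(clean)
series[12] += 50.0
slope, intercept = linear_trend(clean)
outliers = deviations(series)
```

Because the outlier also inflates the fitted line and the residual scatter, robust fitting or iterative clipping is often preferred for real light curves; the plain fit is kept here for brevity.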
27 Similarity Search in DM
- Faloutsos et al. (1994)
- Extract features from each window
- Fourier Transform, R-tree structure.
- Agrawal et al. (1995)
- Amplitude scaling, offset translation
- Distance is determined from the sequence
envelopes
- Agrawal et al. (1995)
- SDL pattern language to encode queries about
shapes
- Jagadish et al. (1997)
- Domain-independent framework
- Find all objects that are similar to some
objects in class A and are not similar to any
object in class B
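Two of the ideas above, Fourier-transform features and invariance to amplitude scaling and offset translation, can be combined in a short sketch: each sequence is normalized to zero mean and unit variance, then compared using only its first few DFT coefficients. This is an illustration of the general approach, not a reproduction of any of the cited methods. The R-tree index is omitted, and the function names and the number of coefficients are assumptions.

```python
import cmath
import math

def normalise(seq):
    """Remove the offset (mean) and amplitude (standard deviation).

    Assumes the sequence is not constant, so the standard deviation is non-zero.
    """
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / n)
    return [(v - mean) / std for v in seq]

def dft_features(seq, n_coeffs=3):
    """First n_coeffs DFT coefficients after the zero-frequency term, scaled by 1/sqrt(n)."""
    n = len(seq)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(seq))
            / math.sqrt(n)
            for k in range(1, n_coeffs + 1)]

def feature_distance(a, b, n_coeffs=3):
    """Euclidean distance between the truncated DFT features of two sequences."""
    fa = dft_features(normalise(a), n_coeffs)
    fb = dft_features(normalise(b), n_coeffs)
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(fa, fb)))

# Synthetic sequences: the same shape rescaled and shifted, and a different shape.
base = [math.sin(2 * math.pi * t / 16) for t in range(16)]
scaled = [5.0 * v + 100.0 for v in base]
other = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
d_same = feature_distance(base, scaled)
d_diff = feature_distance(base, other)
```

Keeping only a few coefficients gives each window a low-dimensional feature vector, which is what makes a spatial index such as an R-tree practical for the search.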
28 Periodic Pattern Search in Time-Related Data Sets
- Full cycle analysis
- Fourier transformation, other statistical
analysis methods
- Fragment-wise cyclic behavior analysis
- Example: Jack reads the NY Times at 9:00 am every day.
- Given (natural) periods vs. arbitrary periods.
- A data cube and OLAP-based technique (Han, Gong
and Yin 98)
- Cyclic association rules
- Associations which form cycles.
- Cyclic Association Rules (B. Özden, S. Ramaswamy,
A. Silberschatz, 1998)
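Full cycle analysis by Fourier transformation can be sketched as a search for the strongest peak in the power spectrum. The version below assumes evenly sampled data and uses a direct (quadratic-time) DFT for clarity; the synthetic signal and the function name are hypothetical.

```python
import cmath
import math

def dominant_period(values):
    """Return the period, in samples, of the strongest non-zero DFT frequency."""
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]   # remove the constant term
    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centred))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k

# Synthetic signal: a sine with a period of 8 samples on a constant background.
signal = [3.0 + math.sin(2 * math.pi * t / 8) for t in range(64)]
period = dominant_period(signal)
```

For unevenly sampled astronomical time series, a method designed for irregular sampling would be needed instead; the plain DFT applies only to regular grids.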
29 Conclusions
- Data warehouse: An industry trend
- DW stores a huge amount of subject-oriented,
cleansed, integrated, consolidated, time-related
data.
- OLAP provides an interactive data analysis
environment
- OLAM: Integration of mining with OLAP
- Takes advantage of data warehouse infrastructure.
- From batch mining to interactive,
multi-dimensional mining
- Many interesting research and implementation
issues
- Database mining and warehouse mining are both
important directions to pursue.
30 Outline
- Why and What
- DM Technology
- Future Directions
31 Future Work on DM Research
- Integration with data warehouse, OLAP, and
relational technology
- Scalability: efficient algorithms,
parallel/distributed and incremental mining
- Ad-hoc mining query language and its optimization
- Multiple, integrated DM functions and methods
- Mining on new kinds of data: time-series data,
text, multimedia, spatial and Web
- Visual DM and knowledge visualization
- Application exploration
- Interactive, exploratory DM environment
32 Integration with Data Warehouse and OLAP Technology
- Data warehouse: A strong industry trend
- A huge amount of subject-oriented, cleansed,
integrated, consolidated, time-related data is
stored in data warehouses
- OLAP provides an interactive data analysis
environment
- Integrating mining with OLAP leads to
multi-dimensional DM
- On-Line Analytical Mining (OLAM).
33 Efficiency and Scalability in DM
- Efficient algorithms in every DM function
- Class description: Summarization and comparison
- Classification and prediction
- Clustering
- Time-series and trend analysis
- Real-time, fast response in exploratory DM
- Progressive, multiple precision data analysis
- Parallel and distributed DM algorithms
- Incremental DM methods
34 Why Parallel and Distributed DM?
- Massive data sets
- From megabytes to gigabytes and terabytes.
- Costly data mining algorithms
- Association, classification, clustering,
prediction.
- Data in applications are geographically
distributed.
- Parallel and networked computers are widely
available.
35 DM Query Optimization
- Ad-hoc DM query language: DM SQL.
- DM query optimization
- How to carve a DM view?
- How to push user-specified rule constraints?
- How to integrate interestingness measures in
mining?
- Interactivity of DM: New challenge.
36 Multiple, Integrated DM Functions and Methods
- Multiple mining functions
- Concept description: characterization and
discrimination
- Classification and prediction
- Clustering
- Association and correlation analysis
- Multiple mining methods
- Statistical approaches
- Machine learning approaches
- Neural network approach
- Other approaches: mathematical models, etc.
37 Mining Complex Types of Data
- Text data mining
- Library databases, e-mails, book stores, Web
pages.
- Spatial data mining
- Geographic information systems, engineering
databases, medical image databases.
- Multimedia data mining
- Image and video/audio databases.
- Web mining
- Unstructured and semi-structured data
- Web access pattern analysis: easily doable.
38 Visual DM
- The power of visual comprehension
- A picture is worth a thousand words
- Pattern recognition and exploratory mining
- Visual mining techniques
- Data visualization
- Integration with other mining methods
- Visual representation of knowledge
- Charts, graphs, trees, curves, cubes
- Multi-dimensional representation: color, shape,
texture, gray-level, etc.
39 Exploration of DM Applications
- Need more success stories
- Insurance and market analysis, NBA strategy
analysis.
- Most current DM systems lack a thick
semantic layer
- like the early relational database systems
without application software.
- Customized data mining systems
- Market analysis DM systems
- Insurance and customer analysis systems
40 Towards an Integrated, Exploratory DM Environment
- Exploratory DM
- Interactive, user-centered, exploratory mining
process
- High performance and fast response
- Integrated multiple DM functions and methods
- Try different approaches to see which one is
better
- Try different functions to see which patterns are
more interesting
- Automated mining and interactive mining: not too
far apart!
41 Towards VO-based DM
Success of DM
Success of VO
We hope that day comes soon.