Title: NVOSS08
1. THE US NATIONAL VIRTUAL OBSERVATORY
Basic Concepts in Data Mining
Kirk Borne, George Mason University
2. OUTLINE
- The New Face of Science
- Scientific Knowledge Discovery
- Data Mining Examples and Techniques
- Basic Concepts in Data Mining
- What's next?
4. The Scientific Data Flood
[Figure labels: Scientific Data Flood, Large Science Project, Pipeline]
5. The New Face of Science (1)
- Big Data (usually geographically distributed)
  - High-Energy Particle Physics
  - Astronomy and Space Physics
  - Earth Observing System (Remote Sensing)
  - Human Genome and Bioinformatics
  - Numerical Simulations of any kind
  - Digital Libraries (electronic publication repositories)
- e-Science
  - Built on Web Services (e-Gov, e-Biz) paradigm
  - Distributed heterogeneous data are the norm
  - Data integration across projects and institutions
  - One-stop shopping: the right data, right now.
6. The New Face of Science (2)
- Databases enable scientific discovery
  - Data Handling and Archiving (management of massive data resources)
  - Data Discovery (finding data wherever they exist)
  - Data Access (WWW-Database interfaces)
  - Data/Metadata Browsing (serendipity)
  - Data Sharing and Reuse (within project teams and by other scientists: scientific validation)
  - Data Integration (from multiple sources)
  - Data Fusion (across multiple modalities and domains)
  - Data Mining (KDD: Knowledge Discovery in Databases)
7. OUTLINE
- The New Face of Science
- Scientific Knowledge Discovery
- Data Mining Examples and Techniques
- Basic Concepts in Data Mining
- What's next?
8. So what is Data Mining?
- Data Mining is Knowledge Discovery in Databases (KDD).
- Data mining is defined as an information extraction activity whose goal is to discover hidden facts contained in (large) databases.
- Note: Machine Learning is the field of Computer Science research that focuses on algorithms that learn from data.
- Data Mining is the application of Machine Learning algorithms to large databases.
9. Scientific Data Mining
- Data Mining is the Killer App for Scientific Databases.
- Scientific Data Mining References:
  - http://voneural.na.infn.it/
  - http://astroweka.sourceforge.net/
  - http://www.itsc.uah.edu/f-mass/ (Framework for Mining and Analysis of Space Science data, F-MASS)
- Data mining is used to find patterns and relationships in data (EDA: Exploratory Data Analysis).
- Patterns can be analyzed via 2 types of models:
  - Descriptive: describe patterns and create meaningful subgroups or clusters (Unsupervised Learning, Clustering).
  - Predictive: forecast explicit values, based upon patterns in known results (Supervised Learning, Classification).
- How does this apply to Scientific Research?
  - Through KNOWLEDGE DISCOVERY
  - Data → Information → Knowledge → Understanding / Wisdom!
10. Astronomy Example
- Data
  - (a) Imaging data (ones and zeroes)
  - (b) Spectral data (ones and zeroes)
- Information (catalogs / databases)
  - Measure brightness of galaxies from image (e.g., 14.2 or 21.7)
  - Measure redshift of galaxies from spectrum (e.g., 0.0167 or 0.346)
- Knowledge: Hubble Diagram → Redshift-Brightness Correlation → Redshift-Distance relation
- Understanding: the Universe is expanding!!
11. Astronomers have been doing Data Mining for centuries
- "The data are mine, and you can't have them!"
- Seriously ...
- Astronomers love to classify things ... (Supervised Learning, e.g., classification)
- Astronomers love to characterize things ... (Unsupervised Learning, e.g., clustering)
- And we love to discover new things ... (Semi-supervised Learning, e.g., outlier detection)
12. This sums it up ...
- Characterize the new (clustering)
- Assign the known (classification)
- Discover the unknown (outlier detection)
Graphic from S. G. Djorgovski
- 2 benefits of very large data sets within a scientific domain:
  - best statistical analysis of typical events
  - automated search for rare events
13. OUTLINE
- The New Face of Science
- Scientific Knowledge Discovery
- Data Mining Examples and Techniques
- Basic Concepts in Data Mining
- What's next?
14. Database Systems and Data Mining
- Data mining brings novel non-traditional (Machine Learning) concepts to large DBMS (e.g., association mining; neural networks; decision trees; link analysis; pattern recognition; classification; regression; self-organizing maps). For example:
  - Clustering Analysis: group together similar items, and separate the dissimilar items
  - Classification: predict the class label
  - Regression: predict a numeric attribute value
  - Association Analysis: detect attribute-value conditions that occur frequently together
15. Data Mining Methods and Some Examples
- Clustering
- Classification
- Associations
- Neural Nets
- Decision Trees
- Pattern Recognition
- Correlation/Trend Analysis
- Principal Component Analysis
- Independent Component Analysis
- Regression Analysis
- Outlier/Glitch Identification
- Visualization
- Autonomous Agents
- Self-Organizing Maps (SOM)
- Link (Affinity) Analysis
Examples:
- Clustering: group together similar items and separate dissimilar items in the DB
- Classification: classify new data items using the known classes and groups
- Associations: find unusual co-occurring associations of attribute values among DB items
- Regression Analysis: predict a numeric attribute value
- Self-Organizing Maps: organize information in the database based on relationships among key data descriptors
- Link Analysis: identify linkages between data items based on features shared in common
16. Some Data Mining Techniques Graphically Represented
[Figure panels: Self-Organizing Map (SOM), Clustering, Neural Network, Outlier (Anomaly) Detection, Link Analysis, Decision Tree]
17. Categories of Machine Learning and some Examples
- Supervised Learning
  - Classification
- Unsupervised Learning
  - Clustering
  - Link Analysis
  - Association Analysis
- Semi-supervised Learning
  - Outlier Detection
  - Class Discovery
18. Some Classification Algorithms
Classification: the process of learning and then applying a function that classifies the data into a set of predefined classes.
- Bayes' Theorem
- Support Vector Machines (SVM)
- Decision Trees
- Regression
- Neural Networks
- Markov Modeling
- K-Nearest Neighbors
19. Classification - a 2-Step Process
- Model Construction (Description): describing a set of predetermined classes. Build the Model.
  - Each data element/tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented by classification rules, decision trees, or mathematical formulae
- Model Usage (Prediction): for classifying future or unknown objects, or for predicting missing values. Apply the Model.
  - It is important to estimate the accuracy of the model
  - The known labels of the test sample are compared with the classification results from the model
  - Accuracy rate is the percentage of test set samples that are correctly classified by the model
  - The test set is chosen completely independent of the training set; otherwise the accuracy estimate is inflated and overfitting goes undetected (overfitting is a bad thing!)
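
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the talk: it assumes a 1-nearest-neighbor classifier, a made-up two-class data set, and hypothetical helper names (`build_model`, `apply_model`, `accuracy`).

```python
import math

def build_model(training_set):
    """Step 1 (model construction): a 1-nearest-neighbor 'model' just stores the labeled tuples."""
    return list(training_set)

def apply_model(model, features):
    """Step 2 (model usage): predict the label of the closest training tuple."""
    nearest = min(model, key=lambda row: math.dist(row[0], features))
    return nearest[1]

def accuracy(model, test_set):
    """Percentage of test samples whose known label matches the prediction."""
    correct = sum(1 for features, label in test_set
                  if apply_model(model, features) == label)
    return 100.0 * correct / len(test_set)

# Hypothetical labeled data: (feature vector, class label).
training_set = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
                ((4.0, 4.2), "B"), ((3.8, 4.1), "B")]
# The test set is kept separate from the training set.
test_set = [((0.9, 1.1), "A"), ((4.1, 3.9), "B"), ((1.1, 0.9), "A")]

model = build_model(training_set)
print(apply_model(model, (3.9, 4.0)))
print(accuracy(model, test_set))
```

Any other classifier (decision tree, neural network, SVM) could be substituted for the nearest-neighbor rule; the build/apply/measure structure stays the same.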
20. Classification Methods: Decision Trees, Neural Networks, SVM (Support Vector Machines)
- There are 2 Classes!
- How do you ...
  - Separate them?
  - Distinguish them?
  - Learn the rules?
  - Classify them?
[Figure label: Apply Kernel (SVM)]
21. Some Clustering Algorithms
Clustering: the process of partitioning a set of data into subsets or clusters such that a data element belonging to a cluster is more similar to data elements belonging to that same cluster than to the data elements belonging to other clusters.
- Squared Error
- Nearest Neighbor
- K-Means (most popular)
- Mixture Models (statistical)
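
As a rough illustration of the K-Means entry in the list above, here is a minimal sketch of the standard assign-and-update loop. The data points, the starting centroids, and the function name `k_means` are made up for the example; real projects would normally use a tested library implementation.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain K-Means: alternate an assignment step and a centroid-update step."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical 2-D data with two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The result depends on the number of clusters and on the starting centroids, both of which are chosen by hand in this sketch.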
22. Clustering is used to discover the different unique groupings (classes) of attribute values. The case shown below is not obvious: one or two groups?
23. This case is easier: there are two groups. (In fact, this is the same set of data elements as shown on the previous slide, but plotted here using a different attribute.)
24. Semi-supervised Learning: Outlier Detection and Class Discovery
Figure: The clustering of data clouds (dc) within a multidimensional parameter space (p). Such a mapping can be used to search for and identify clusters, voids, outliers, one-of-kinds, relationships, and associations among arbitrary parameters in a database (or among various parameters in geographically distributed databases).
- statistical analysis of typical events
- automated search for rare events
25. Outlier Detection: Serendipitous Discovery of Rare or New Objects and Events
26. Principal Components Analysis and Independent Components Analysis
Cepheid Variables: Cosmic Yardsticks -- One Correlation -- Two Classes!
... Class Discovery!
27. Why use Data Mining? Here are 6 reasons...
- Most projects now collect massive quantities of data.
- Because of the enormous potential for new discoveries in existing huge databases.
- Data mining moves beyond the analysis of past events to predicting future trends and behaviors that may be missed because they lie outside experts' expectations.
- Data mining tools can answer complex questions that traditionally were too time-consuming to resolve.
- Data mining tools can explore the intricate interdependencies within databases in order to discover hidden patterns and relationships.
- Data mining allows decision-makers to make proactive, knowledge-driven decisions.
28. OUTLINE
- The New Face of Science
- Scientific Knowledge Discovery
- Data Mining Examples and Techniques
- Basic Concepts in Data Mining
- What's next?
29. Basic Concepts: Key Steps
- The key steps in a data mining project usually invoke and/or follow these basic concepts:
  - Data browse, preview, and selection
  - Data cleaning and preparation
  - Feature selection
  - Data normalization and transformation
  - Similarity/Distance metric selection
  - ... Select the data mining method
  - ... Apply the data mining method
  - ... Gather and analyze data mining results
  - Accuracy estimation
  - Avoiding overfitting
30. Key Concept for Data Mining: Data Previewing
- Data Previewing allows you to get a sense of the good, bad, and ugly parts of the database
- This includes:
  - Histograms of attribute distributions
  - Scatter plots of attribute combinations
  - Max-Min value checks (versus expectations)
  - Summarizations, aggregations (GROUP BY)
  - SELECT UNIQUE values (versus expectations)
  - Checking physical units (and scale factors)
  - External checks (cross-DB comparisons)
  - Verify with input DB
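
Several of these previewing checks can be tried on a small in-memory table before touching the real database. The sketch below is illustrative only: the rows, the expected magnitude range of 0 to 30, and the value 99.0 standing in for a bad entry are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical catalog rows: (object class, magnitude).
rows = [("galaxy", 14.2), ("galaxy", 21.7), ("star", 12.1),
        ("star", 99.0), ("quasar", 19.5)]
mags = [m for _, m in rows]

# Max-Min check against expectations (assumed valid range: 0 to 30).
print("min/max:", min(mags), max(mags))
suspect = [r for r in rows if not 0.0 <= r[1] <= 30.0]
print("out of expected range:", suspect)

# Distinct values of a categorical attribute (the SELECT UNIQUE idea).
distinct_classes = sorted({c for c, _ in rows})
print("distinct classes:", distinct_classes)

# Aggregation per class (the GROUP BY idea, with a count and a mean).
groups = defaultdict(list)
for c, m in rows:
    groups[c].append(m)
summary = {c: (len(v), sum(v) / len(v)) for c, v in groups.items()}
print("count/mean per class:", summary)

# Coarse text histogram of magnitudes in bins 5 magnitudes wide.
histogram = Counter(int(m // 5) * 5 for m in mags)
for left in sorted(histogram):
    print(f"{left:3d}-{left + 5:<3d} {'*' * histogram[left]}")
```

Here the max-min check immediately exposes the 99.0 entry, which would otherwise distort the per-class mean for stars.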
31. Key Concept for Data Mining: Data Preparation (Cleaning the Data)
- Data Preparation can take 40-80% (or more) of the effort in a data mining project
- This includes:
  - Dealing with NULL (missing) values
  - Dealing with errors
  - Dealing with noise
  - Dealing with outliers (unless that is your science!)
  - Transformations: units, scale, projections
  - Data normalization
  - Relevance analysis: Feature Selection
  - Remove redundant attributes
  - Dimensionality Reduction
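
As one small example of dealing with NULL (missing) values, the sketch below treats a sentinel number as missing and fills the gaps with the median of the observed values. The sentinel -999.0, the median-fill policy, and the name `clean_column` are assumptions for illustration; dropping incomplete records or flagging them is often just as reasonable.

```python
import statistics

SENTINEL = -999.0  # hypothetical "no data" marker used by the source catalog

def clean_column(values, sentinel=SENTINEL):
    """Treat sentinel values as NULL, then fill NULLs with the median of the rest."""
    with_nulls = [None if v is None or v == sentinel else v for v in values]
    observed = [v for v in with_nulls if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in with_nulls]

cleaned = clean_column([14.2, None, 21.7, -999.0, 18.0])
print(cleaned)
```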
32. Key Concept for Data Mining: Feature Selection (the Feature Vector)
- A feature vector is the attribute vector for a database record (tuple).
- The feature vector's components are database attributes: v = {w, x, y, z}
- It contains the set of database attributes that you have chosen to represent (describe) uniquely each data element (tuple).
  - This is only a subset of all possible attributes in the DB.
- Example: Sky Survey database object feature vector
  - Generic: {RA, Dec, mag, redshift, color, size}
  - Specific: {ra2000, dec2000, r, z, g-r, R_eff}
33. Key Concept for Data Mining: Data Types
- Different data types:
  - Continuous
    - Numeric (e.g., salaries, ages, temperatures, rainfall, sales)
  - Discrete
    - Binary (0 or 1; Yes/No; Male/Female)
    - Boolean (True/False)
    - Specific list of allowed values (e.g., zip codes; country names; chemical elements; amino acids; planets)
  - Categorical
    - Non-numeric (character/text data) (e.g., people's names)
    - Can be Ordinal (ordered) or Nominal (not ordered)
- Reference: http://www.twocrows.com/glossary.htm#anchor311516
- Examples of Data Mining Classification Techniques:
  - Regression for continuous numeric data
  - Logistic Regression for discrete data
  - Bayesian Classification for categorical data
34. Key Concept for Data Mining: Data Normalization and Data Transformation
- Data Normalization transforms data values for different database attributes into a uniform set of units or into a uniform scale (i.e., to a common min-max range).
- Data Normalization assigns the correct numerical weighting to the values of different attributes.
- For example:
  - Transform all numerical values from min to max on a 0 to 1 scale (or 0 to Weight, or -1 to 1, or 0 to 100).
  - Convert discrete or character (categorical) data into numeric values.
  - Transform ordinal data to a ranked list (numeric).
  - Discretize continuous data into bins.
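
The transformations listed above can be sketched as three small helper functions. The function names, the sample values, and the bin edges are hypothetical; the integer codes produced by `encode_categories` impose an arbitrary order, which is acceptable only if the downstream method does not interpret that order.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid dividing by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def encode_categories(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

def discretize(values, edges):
    """Return the bin index of each value, given ascending bin edges."""
    return [sum(v >= e for e in edges) for v in values]

print(min_max_scale([10.0, 15.0, 20.0]))             # 0 to 1 scale
print(min_max_scale([10.0, 15.0, 20.0], -1.0, 1.0))  # -1 to 1 scale
print(encode_categories(["spiral", "elliptical", "spiral"]))
print(discretize([0.05, 0.3, 1.2], edges=[0.1, 1.0]))
```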
35. Key Concept for Data Mining: Similarity and Distance Metrics
- Similarity between complex data objects is one of the central notions in data mining.
- The fundamental problem is to determine whether any selected pair of data objects exhibit similar characteristics.
- The problem is both interesting and difficult because the similarity measures should allow for imprecise matches.
- Similarity and its inverse, Distance, provide the basis for all of the fundamental data mining clustering techniques and for many data mining classification techniques.
36. Similarity and Distance Measures (metrics)
37. Similarity and Distance Measures
- Most clustering algorithms depend on a distance or similarity measure, to determine (a) the closeness or alikeness of cluster members, and (b) the distance or unlikeness of members from different clusters.
- General requirements for any similarity or distance metric:
  - Non-negative: dist(A,B) ≥ 0 and sim(A,B) ≥ 0
  - Symmetric: dist(A,B) = dist(B,A) and sim(A,B) = sim(B,A)
- In order to calculate the distance between different attribute values, those attributes must be transformed or normalized (either to the same units, or else normalized to a similar scale).
- The normalization of both categorical (non-numeric) data and numerical data with units generally requires domain expertise. This is part of the pre-processing (data preparation) step in any data mining activity.
38. Popular Similarity and Distance Measures
- General Lp distance: $\|x - y\|_p = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$
- Euclidean distance (p = 2):
  - $D_E = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$
- Manhattan distance (p = 1; number of city blocks walked):
  - $D_M = |x_1 - y_1| + |x_2 - y_2| + |x_3 - y_3|$
- Cosine distance (angle between two feature vectors):
  - $d(X,Y) = \arccos\left[ X \cdot Y / (\|X\| \, \|Y\|) \right]$
  - $d(X,Y) = \arccos\left[ (x_1 y_1 + x_2 y_2 + x_3 y_3) / (\|X\| \, \|Y\|) \right]$
- Similarity function: $s(x,y) = 1 / [1 + d(x,y)]$
  - s varies from 1 to 0, as distance d varies from 0 to $\infty$.
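
These measures translate directly into code. The sketch below is a minimal stdlib-only version; the function names and the two sample vectors are chosen for illustration, and the clamp before `acos` is an added safeguard against floating-point round-off rather than part of the definition.

```python
import math

def lp_distance(x, y, p):
    """General Lp distance: p=2 gives Euclidean, p=1 gives Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cosine_distance(x, y):
    """Angle (in radians) between two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] so round-off cannot push the argument outside acos's domain.
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def similarity(d):
    """Similarity s = 1 / (1 + d): 1 at zero distance, approaching 0 as d grows."""
    return 1.0 / (1.0 + d)

x, y = (0.0, 3.0, 0.0), (4.0, 0.0, 0.0)
print(lp_distance(x, y, 2))              # Euclidean
print(lp_distance(x, y, 1))              # Manhattan
print(cosine_distance(x, y))             # the vectors are perpendicular
print(similarity(lp_distance(x, y, 2)))
```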
39. Data Mining Clustering and Nearest Neighbor Algorithms: Issues
- Clustering algorithms and nearest neighbor algorithms (for classification) require a distance or similarity metric.
- You must be especially careful with categorical data, which can be a problem. For example:
  - What is the distance between blue and green? Is it larger than the distance from green to red?
  - How do you metrify different attributes (color, shape, text labels)? This is essential in order to calculate distance in multi-dimensions. Is the distance from blue to green larger or smaller than the distance from round to square? Which of these are most similar?
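
One simple convention, assumed here purely for illustration, is to score each nominal attribute as 0 (same value) or 1 (different value) and each numeric attribute as its absolute difference divided by the attribute's range, then average. The function name `mixed_distance`, the sample objects, and the size range of 10 are hypothetical. This choice answers the questions above by fiat: blue to green counts the same as green to red, and a color mismatch counts the same as a shape mismatch. Domain expertise may justify a better weighting.

```python
def mixed_distance(a, b, numeric_ranges):
    """Average per-attribute distance over records stored as dicts.

    Numeric attributes (those listed in numeric_ranges) contribute |a - b| / range;
    all other attributes are treated as nominal and contribute 0 or 1.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

obj1 = {"color": "blue", "shape": "round", "size": 2.0}
obj2 = {"color": "green", "shape": "round", "size": 4.0}
obj3 = {"color": "blue", "shape": "square", "size": 2.0}
ranges = {"size": 10.0}  # assumed full range of the size attribute

print(mixed_distance(obj1, obj2, ranges))
print(mixed_distance(obj1, obj3, ranges))
```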
40. Key Concept for Data Mining: Classification Accuracy
Typical Error Matrix (TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative)

                                     TRAINING DATA (actual classes)
                                     Class-A      Class-B      Totals
NEURAL NETWORK            Class-A    2834 (TP)    173 (FP)     3007
CLASSIFICATION (output)   Class-B    318 (FN)     3103 (TN)    3421
                          Totals     3152         3276         6428
41. Typical Measures of Accuracy
- Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)
- Producer's Accuracy (Class A) = TP/(TP+FN)
- Producer's Accuracy (Class B) = TN/(FP+TN)
- User's Accuracy (Class A) = TP/(TP+FP)
- User's Accuracy (Class B) = TN/(TN+FN)
Accuracy of our Classification on preceding slide:
- Overall Accuracy = 92.4%
- Producer's Accuracy (Class A) = 89.9%
- Producer's Accuracy (Class B) = 94.7%
- User's Accuracy (Class A) = 94.2%
- User's Accuracy (Class B) = 90.7%
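
The figures above can be reproduced from the four cells of the error matrix on the preceding slide (TP = 2834, FP = 173, FN = 318, TN = 3103). The function name and dictionary keys below are made up for the example. In other terminology, producer's accuracy corresponds to recall and user's accuracy to precision.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Return the five accuracy measures as percentages."""
    return {
        "overall": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "producers_A": 100.0 * tp / (tp + fn),
        "producers_B": 100.0 * tn / (fp + tn),
        "users_A": 100.0 * tp / (tp + fp),
        "users_B": 100.0 * tn / (tn + fn),
    }

measures = accuracy_measures(tp=2834, fp=173, fn=318, tn=3103)
for name, value in measures.items():
    print(f"{name}: {value:.1f}%")
# overall: 92.4%
# producers_A: 89.9%
# producers_B: 94.7%
# users_A: 94.2%
# users_B: 90.7%
```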
42. Key Concept for Data Mining: Overfitting
[Figure: curves g(x), h(x), and d(x) fitted to a set of data points]
- g(x) is a poor fit (fitting a straight line through the points)
- h(x) is a good fit
- d(x) is a very poor fit (fitting every point): Overfitting
43. How to Avoid Overfitting in Data Mining Models
- In Data Mining, the problem arises because you are training the model on a set of training data (i.e., a subset of the total database).
- That training data set is simply intended to be representative of the entire database, not a precise exact copy of the database.
- So, if you try to fit every nuance in the training data, then you will probably over-constrain the problem and produce a bad fit.
- This is where a TEST DATA SET comes in very handy. You can train the data mining model (Decision Tree or Neural Network) on the TRAINING DATA, and then measure its accuracy with the TEST DATA, prior to unleashing the model (e.g., Classifier) on some real new data.
- Different ways of subsetting the TRAINING and TEST data sets:
  - 50-50 (50% of data used to TRAIN, 50% used to TEST)
  - 10 different sets of 90-10 (90% for TRAINING, 10% for TESTING)
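
Both subsetting schemes can be sketched with the standard library. The sketch assumes that "10 different sets of 90-10" means 10-fold cross-validation, where each record lands in the test set exactly once; the function names and the fixed random seed are illustrative choices.

```python
import random

def split_half(records, seed=0):
    """50-50 split: shuffle, then train on the first half and test on the rest."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def k_fold_splits(records, folds=10, seed=0):
    """Yield `folds` (train, test) pairs; every record is tested exactly once."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    for k in range(folds):
        test = shuffled[k::folds]  # every folds-th record, starting at offset k
        train = [r for i, r in enumerate(shuffled) if i % folds != k]
        yield train, test

records = list(range(100))  # stand-in for 100 database records

train, test = split_half(records)
print(len(train), len(test))

for train, test in k_fold_splits(records):
    print(len(train), len(test))
```

With 10 folds, the model is trained and scored 10 times, and the 10 test accuracies are usually averaged.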
44. Schematic Approach to Avoiding Overfitting
To avoid overfitting, you need to know when to stop training the model. Although the Training Set error may continue to decrease, you may simply be overfitting the Training Data. Test this by applying the model to Test Data (not part of Training Set). If the Test Set error starts to increase, then you know that you are overfitting the Training Set and it is time to stop!
[Figure: Error versus Training Epoch, with curves for Training Set error and Test Set error; "STOP Training HERE!" marks the point where the Test Set error starts to increase]
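
The stopping rule described above can be expressed as a short function that scans the Test Set error per epoch. The error values below are made up to mimic the schematic, and the `patience` parameter (how many non-improving epochs to tolerate) is an added assumption, not something the slide specifies.

```python
def early_stopping_epoch(test_errors, patience=1):
    """Return the epoch with the lowest test error seen before the error
    fails to improve for `patience` consecutive epochs."""
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            best_epoch, best_error, waited = epoch, error, 0
        else:
            waited += 1
            if waited >= patience:
                break  # test error has started to rise: stop training
    return best_epoch

# Made-up curves: training error keeps falling, test error turns upward.
training_errors = [0.50, 0.35, 0.25, 0.18, 0.12, 0.08, 0.05]
test_errors     = [0.52, 0.40, 0.31, 0.27, 0.29, 0.33, 0.38]

stop = early_stopping_epoch(test_errors)
print("stop at epoch", stop)
print("training error there:", training_errors[stop])
print("test error there:", test_errors[stop])
```

In practice the model parameters from the best epoch are saved, so that training can run a little past the minimum and still return the best model.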
45. OUTLINE
- The New Face of Science
- Scientific Knowledge Discovery
- Data Mining Examples and Techniques
- Basic Concepts in Data Mining
- What's next?
46. Scientific Data Mining in Astronomy
2008 NVO Summer School