NVOSS08 - PowerPoint PPT Presentation

About This Presentation

Title:

NVOSS08

Description:

THE US NATIONAL VIRTUAL OBSERVATORY: Basic Concepts in Data Mining. Kirk Borne, George Mason University.

Slides: 47

Provided by: calt164

Learn more at: https://sites.astro.caltech.edu

Transcript and Presenter's Notes

Title: NVOSS08

1
THE US NATIONAL VIRTUAL OBSERVATORY
Basic Concepts in Data Mining
Kirk Borne George Mason University
2
OUTLINE

The New Face of Science
Scientific Knowledge Discovery
Data Mining Examples and Techniques
Basic Concepts in Data Mining
What's next?

3
OUTLINE

The New Face of Science
Scientific Knowledge Discovery
Data Mining Examples and Techniques
Basic Concepts in Data Mining
What's next?

4
The Scientific Data Flood
[Figure labels: Large Science Project; Pipeline; Scientific Data Flood]
5
The New Face of Science (1)

Big Data (usually geographically distributed)
High-Energy Particle Physics
Astronomy and Space Physics
Earth Observing System (Remote Sensing)
Human Genome and Bioinformatics
Numerical Simulations of any kind
Digital Libraries (electronic publication repositories)
e-Science
Built on Web Services (e-Gov, e-Biz) paradigm
Distributed heterogeneous data are the norm
Data integration across projects and institutions
One-stop shopping: "The right data, right now."

6
The New Face of Science (2)

Databases enable scientific discovery
Data Handling and Archiving (management of
massive data resources)
Data Discovery (finding data wherever they exist)
Data Access (WWW-Database interfaces)
Data/Metadata Browsing (serendipity)
Data Sharing and Reuse (within project teams and by other scientists: scientific validation)
Data Integration (from multiple sources)
Data Fusion (across multiple modalities and domains)
Data Mining (KDD: Knowledge Discovery in Databases)

7
OUTLINE

The New Face of Science
Scientific Knowledge Discovery
Data Mining Examples and Techniques
Basic Concepts in Data Mining
What's next?

8
So what is Data Mining?

Data Mining is Knowledge Discovery in Databases
(KDD)
Data mining is defined as an information
extraction activity whose goal is to discover
hidden facts contained in (large) databases.
Note: Machine Learning is the field of Computer Science research that focuses on algorithms that learn from data.
Data Mining is the application of Machine
Learning algorithms to large databases.

9
Scientific Data Mining

Data Mining is the Killer App for Scientific Databases.
Scientific Data Mining References:
http://voneural.na.infn.it/
http://astroweka.sourceforge.net/
http://www.itsc.uah.edu/f-mass/ (Framework for Mining and Analysis of Space Science data, F-MASS)
Data mining is used to find patterns and relationships in data. (EDA: Exploratory Data Analysis)
Patterns can be analyzed via 2 types of models:
Descriptive: Describe patterns and create meaningful subgroups or clusters. (Unsupervised Learning, Clustering)
Predictive: Forecast explicit values, based upon patterns in known results. (Supervised Learning, Classification)
How does this apply to Scientific Research? Through KNOWLEDGE DISCOVERY:
Data → Information → Knowledge → Understanding / Wisdom!

10
Astronomy Example
Data:
(a) Imaging data (ones and zeroes)
(b) Spectral data (ones and zeroes)

Information (catalogs / databases):
Measure brightness of galaxies from image (e.g., 14.2 or 21.7)
Measure redshift of galaxies from spectrum (e.g., 0.0167 or 0.346)

Knowledge: Hubble Diagram → Redshift-Brightness Correlation → Redshift = Distance
Understanding: the Universe is expanding!!
11
Astronomers have been doing Data Mining for
centuries

The data are mine, and you can't have them!

Seriously ...
Astronomers love to classify things ...
(Supervised Learning. e.g., classification)
Astronomers love to characterize things ...
(Unsupervised Learning. e.g., clustering)
And we love to discover new things ...
(Semi-supervised Learning. e.g., outlier
detection)

12
This sums it up ...

Characterize the new (clustering)
Assign the known (classification)
Discover the unknown (outlier detection)

Graphic from S. G. Djorgovski

2 benefits of very large data sets within a scientific domain:
best statistical analysis of typical events
automated search for rare events

13
OUTLINE

The New Face of Science
Scientific Knowledge Discovery
Data Mining Examples and Techniques
Basic Concepts in Data Mining
What's next?

14
Database Systems and Data Mining

Data mining brings novel non-traditional (Machine Learning) concepts to large DBMS (e.g., association mining, neural networks, decision trees, link analysis, pattern recognition, classification, regression, self-organizing maps). For example:
Clustering Analysis: group together similar items, and separate the dissimilar items
Classification: predict the class label
Regression: predict a numeric attribute value
Association Analysis: detect attribute-value conditions that occur frequently together

15
Data Mining Methods and Some Examples

Clustering
Classification
Associations
Neural Nets
Decision Trees
Pattern Recognition
Correlation/Trend Analysis
Principal Component Analysis
Independent Component Analysis
Regression Analysis
Outlier/Glitch Identification
Visualization
Autonomous Agents
Self-Organizing Maps (SOM)
Link (Affinity) Analysis

Some examples:
Group together similar items and separate dissimilar items in DB
Classify new data items using the known classes and groups
Find unusual co-occurring associations of attribute values among DB items
Predict a numeric attribute value
Organize information in the database based on relationships among key data descriptors
Identify linkages between data items based on features shared in common
16
Some Data Mining Techniques Graphically
Represented

Self-Organizing Map (SOM)

Clustering
Neural Network
Outlier (Anomaly) Detection
Link Analysis
Decision Tree
17
Categories of Machine Learning and some Examples

Supervised Learning: Classification
Unsupervised Learning: Clustering, Link Analysis, Association Analysis
Semi-supervised Learning: Outlier Detection, Class Discovery

18
Some Classification Algorithms
Classification: the process of learning and then applying a function that classifies the data into a set of predefined classes.

Bayes' Theorem
Support Vector Machines (SVM)
Decision Trees
Regression
Neural Networks
Markov Modeling
K-Nearest Neighbors

19
Classification - a 2-Step Process

Model Construction (Description): describing a set of predetermined classes. Build the Model.
Each data element/tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented by classification rules, decision trees, or mathematical formulae
Model Usage (Prediction): for classifying future or unknown objects, or for predicting missing values. Apply the Model.
It is important to estimate the accuracy of the model
The known labels of the test sample are compared with the classification results from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is chosen completely independently of the training set; otherwise the accuracy estimate is inflated and overfitting goes undetected. Overfitting is a bad thing!
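The two steps can be sketched in a few lines of Python. This is only an illustration, not the presenter's code: it uses K-Nearest Neighbors (one of the algorithms listed earlier), and the synthetic two-class data, the 70/30 split, and the function names are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def knn_predict(training_set, point, k=3):
    """Step 2 (model usage): label a point by majority vote of its k nearest training tuples."""
    nearest = sorted(training_set, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy_rate(training_set, test_set, k=3):
    """Percentage of test-set samples whose known label matches the predicted label."""
    correct = sum(1 for features, label in test_set
                  if knn_predict(training_set, features, k) == label)
    return 100.0 * correct / len(test_set)

# Hypothetical labelled tuples: two well-separated classes in a 2-D feature space.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "A") for _ in range(50)]
data += [((rng.gauss(10, 1), rng.gauss(10, 1)), "B") for _ in range(50)]
rng.shuffle(data)

# Step 1 (model construction): for k-NN the "model" is simply the stored training set.
# The test set shares no tuples with the training set.
training_set, test_set = data[:70], data[70:]
print(f"accuracy rate: {accuracy_rate(training_set, test_set):.1f}%")
```

Other classifiers (decision trees, neural networks) replace step 1 with an explicit fitting stage, but the accuracy check against held-out labels stays the same.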

20
Classification Methods: Decision Trees, Neural Networks, SVM (Support Vector Machines)

There are 2 Classes!
How do you ...
Separate them?
Distinguish them?
Learn the rules?
Classify them?

Apply Kernel
(SVM)
21
Some Clustering Algorithms
Clustering: the process of partitioning a set of data into subsets or clusters such that a data element belonging to a cluster is more similar to data elements belonging to that same cluster than to the data elements belonging to other clusters.

Squared Error
Nearest Neighbor
K-Means (most popular)
Mixture Models (statistical)

22
Clustering is used to discover the different unique groupings (classes) of attribute values. The case shown below is not obvious: one or two groups?
23
This case is easier: there are two groups. (In fact, this is the same set of data elements as shown on the previous slide, but plotted here using a different attribute.)
24
Semi-supervised Learning: Outlier Detection and Class Discovery
Figure: The clustering of data clouds (dc) within a multidimensional parameter space (p). Such a mapping can be used to search for and identify clusters, voids, outliers, one-of-kinds, relationships, and associations among arbitrary parameters in a database (or among various parameters in geographically distributed databases).

statistical analysis of typical events
automated search for rare events

25
Outlier Detection: Serendipitous Discovery of Rare or New Objects and Events
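One simple way to automate a search for rare objects is to flag points that sit far from their neighbors in parameter space. The sketch below is an assumption-laden example rather than the method used in the talk: it scores each point by the distance to its k-th nearest neighbor and flags scores above a multiple of the median score; the `factor` threshold and the sample data cloud are hypothetical.

```python
import math
import statistics

def knn_distance_outliers(points, k=2, factor=3.0):
    """Flag points whose k-th nearest-neighbor distance exceeds factor x the median such distance."""
    kth_distances = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        kth_distances.append(distances[k - 1])
    cutoff = factor * statistics.median(kth_distances)
    return [p for p, d in zip(points, kth_distances) if d > cutoff]

# Hypothetical data cloud with one isolated object.
cloud = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.0), (5.0, 5.0)]
print(knn_distance_outliers(cloud))
```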
26
Principal Components Analysis and Independent Components Analysis
Cepheid Variables: Cosmic Yardsticks -- One Correlation -- Two Classes!
... Class Discovery!
27
Why use Data Mining? Here are 6 reasons...

Most projects now collect massive quantities of
data.
Because of the enormous potential for new
discoveries in existing huge databases.
Data mining moves beyond the analysis of past events to predicting future trends and behaviors that may be missed because they lie outside experts' expectations.
Data mining tools can answer complex questions that traditionally were too time-consuming to resolve.
Data mining tools can explore the intricate
interdependencies within databases in order to
discover hidden patterns and relationships.
Data mining allows decision-makers to make
proactive, knowledge-driven decisions.

28
OUTLINE

The New Face of Science
Scientific Knowledge Discovery
Data Mining Examples and Techniques
Basic Concepts in Data Mining
What's next?

29
Basic Concepts: Key Steps

The key steps in a data mining project usually invoke and/or follow these basic concepts:
Data browse, preview, and selection
Data cleaning and preparation
Feature selection
Data normalization and transformation
Similarity/Distance metric selection
... Select the data mining method
... Apply the data mining method
... Gather and analyze data mining results
Accuracy estimation
Avoiding overfitting

30
Key Concept for Data Mining: Data Previewing

Data Previewing allows you to get a sense of the good, bad, and ugly parts of the database.
This includes:
Histograms of attribute distributions
Scatter plots of attribute combinations
Max-Min value checks (versus expectations)
Summarizations, aggregations (GROUP BY)
SELECT UNIQUE values (versus expectations)
Checking physical units (and scale factors)
External checks (cross-DB comparisons)
Verify with input DB
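Several of these previewing checks take only a few lines of Python. The sketch below is illustrative: the catalog rows, the column meanings, and the -99.0 placeholder value are invented, and the text histogram stands in for a real plot.

```python
from collections import Counter

# Hypothetical catalog rows: (object_class, magnitude). The -99.0 entry mimics a
# placeholder for a missing measurement, the kind of value a min-max check exposes.
rows = [("galaxy", 14.2), ("galaxy", 21.7), ("star", 12.1),
        ("star", -99.0), ("quasar", 19.4)]

mags = [m for _, m in rows]
print("min/max magnitude:", min(mags), max(mags))    # max-min check versus expectations

unique_classes = sorted({c for c, _ in rows})         # comparable to selecting unique values
print("unique classes:", unique_classes)

class_counts = Counter(c for c, _ in rows)            # comparable to GROUP BY with a count
print("rows per class:", dict(class_counts))

def histogram(values, lo, hi, bins):
    """Count values falling in equal-width bins over [lo, hi); out-of-range values are skipped."""
    width = (hi - lo) / bins
    bin_counts = [0] * bins
    for v in values:
        if lo <= v < hi:
            bin_counts[int((v - lo) / width)] += 1
    return bin_counts

print("magnitude histogram:", histogram(mags, 10, 25, 3))
```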

31
Key Concept for Data Mining: Data Preparation (Cleaning the Data)

Data Preparation can take 40-80% (or more) of the effort in a data mining project.
This includes:
Dealing with NULL (missing) values
Dealing with errors
Dealing with noise
Dealing with outliers (unless that is your
science!)
Transformations: units, scale, projections
Data normalization
Relevance analysis: Feature Selection
Remove redundant attributes
Dimensionality Reduction

32
Key Concept for Data Mining: Feature Selection and the Feature Vector

A feature vector is the attribute vector for a database record (tuple).
The feature vector's components are database attributes: v = {w, x, y, z}
It contains the set of database attributes that
you have chosen to represent (describe) uniquely
each data element (tuple).
This is only a subset of all possible attributes
in the DB.
Example: Sky Survey database object feature vector
Generic: {RA, Dec, mag, redshift, color, size}
Specific: {ra2000, dec2000, r, z, g-r, R_eff}
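In code, selecting features amounts to picking (or deriving) a few attributes from each record. The sketch below builds the sky-survey feature vector from a hypothetical database row; the row's keys and values, including the extra `objid`, `g`, and `run` columns, are invented for the illustration.

```python
# Hypothetical database record with more attributes than the feature vector needs.
record = {"objid": 42, "ra2000": 150.1, "dec2000": 2.2, "g": 18.3,
          "r": 17.6, "z": 0.0167, "R_eff": 3.4, "run": 756}

def feature_vector(rec):
    """Return (ra2000, dec2000, r, z, g-r, R_eff): a subset of the record's attributes.

    The color g-r is derived from two stored magnitudes; z is the redshift.
    """
    return (rec["ra2000"], rec["dec2000"], rec["r"], rec["z"],
            rec["g"] - rec["r"], rec["R_eff"])

fv = feature_vector(record)
print(fv)
```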
33
Key Concept for Data Mining: Data Types

Different data types:
Continuous:
Numeric (e.g., salaries, ages, temperatures, rainfall, sales)
Discrete:
Binary (0 or 1; Yes/No; Male/Female)
Boolean (True/False)
Specific list of allowed values (e.g., zip codes, country names, chemical elements, amino acids, planets)
Categorical:
Non-numeric (character/text data) (e.g., people's names)
Can be Ordinal (ordered) or Nominal (not ordered)
Reference: http://www.twocrows.com/glossary.htm#anchor311516
Examples of Data Mining Classification Techniques:
Regression for continuous numeric data
Logistic Regression for discrete data
Bayesian Classification for categorical data
34
Key Concept for Data Mining: Data Normalization and Data Transformation

Data Normalization transforms data values for different database attributes into a uniform set of units or into a uniform scale (i.e., to a common min-max range).
Data Normalization assigns the correct numerical weighting to the values of different attributes.
For example:
Transform all numerical values from min to max on a 0 to 1 scale (or 0 to Weight, or -1 to 1, or 0 to 100).
Convert discrete or character (categorical) data
into numeric values.
Transform ordinal data to a ranked list
(numeric).
Discretize continuous data into bins.

35
Key Concept for Data Mining: Similarity and Distance Metrics

Similarity between complex data objects is one of
the central notions in data mining.
The fundamental problem is to determine whether
any selected pair of data objects exhibit similar
characteristics.
The problem is both interesting and difficult
because the similarity measures should allow for
imprecise matches.
Similarity, and its inverse, Distance, provide the basis for all of the fundamental data mining clustering techniques and for many data mining classification techniques.

36
Similarity and Distance Measures (metrics)
37
Similarity and Distance Measures

Most clustering algorithms depend on a distance
or similarity measure, to determine (a) the
closeness or alikeness of cluster members, and
(b) the distance or unlikeness of members from
different clusters.
General requirements for any similarity or distance metric:
Non-negative: dist(A,B) ≥ 0 and sim(A,B) ≥ 0
Symmetric: dist(A,B) = dist(B,A) and sim(A,B) = sim(B,A)
In order to calculate the distance between
different attribute values, those attributes must
be transformed or normalized (either to the same
units, or else normalized to a similar scale).
The normalization of both categorical
(non-numeric) data and numerical data with units
generally requires domain expertise. This is
part of the pre-processing (data preparation)
step in any data mining activity.

38
Popular Similarity and Distance Measures

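General L_p distance: $\|x - y\|_p = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$
Euclidean distance ($p = 2$): $D_E = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$
Manhattan distance ($p = 1$; number of city blocks walked): $D_M = |x_1 - y_1| + |x_2 - y_2| + |x_3 - y_3|$
Cosine distance (angle between two feature vectors): $d(X,Y) = \arccos\left[ X \cdot Y / (\|X\| \, \|Y\|) \right] = \arccos\left[ (x_1 y_1 + x_2 y_2 + x_3 y_3) / (\|X\| \, \|Y\|) \right]$
Similarity function: $s(x,y) = 1 / [1 + d(x,y)]$
s varies from 1 to 0 as distance d varies from 0 to ∞.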
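These measures translate directly into code. The following is a minimal sketch with illustrative function names; the clamp inside `cosine_distance` is an added safeguard against floating-point rounding pushing the cosine slightly outside [-1, 1].

```python
import math

def lp_distance(x, y, p):
    """General L_p distance: (sum_i |x_i - y_i|^p)^(1/p). p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cosine_distance(x, y):
    """Angle (in radians) between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def similarity(d):
    """s = 1 / (1 + d): equals 1 at d = 0 and tends to 0 as d grows."""
    return 1.0 / (1.0 + d)

x, y = (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)
print("Euclidean:", lp_distance(x, y, 2))
print("Manhattan:", lp_distance(x, y, 1))
print("Cosine:", cosine_distance((1.0, 0.0), (0.0, 1.0)))
print("Similarity at d = 0:", similarity(0.0))
```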
39
Data Mining Clustering and Nearest Neighbor Algorithms: Issues

Clustering algorithms and nearest neighbor
algorithms (for classification) require a
distance or similarity metric.
You must be especially careful with categorical data, which can be a problem. For example:

What is the distance between blue and green? Is
it larger than the distance from green to red?
How do you metrify different attributes (color,
shape, text labels)? This is essential in order
to calculate distance in multi-dimensions. Is
the distance from blue to green larger or smaller
than the distance from round to square? Which of
these are most similar?
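There is no single right answer to these questions. One widely used convention for nominal attributes, offered here only as an example and not as the slide's recommendation, is the overlap (simple matching) distance: two values are at distance 0 if identical and 1 otherwise, averaged over attributes. Under that convention blue is exactly as far from green as green is from red, which may or may not suit the science case.

```python
def overlap_distance(a, b):
    """Fraction of attributes on which two records disagree (0 = identical, 1 = all differ)."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

# Hypothetical records described by (color, shape).
print(overlap_distance(("blue", "round"), ("green", "round")))
print(overlap_distance(("blue", "round"), ("red", "square")))
```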

40
Key Concept for Data Mining: Classification Accuracy
Typical Error Matrix
Cell types: True Positive (TP), False Positive (FP), False Negative (FN), True Negative (TN)

                                          TRAINING DATA (actual classes)
NEURAL NETWORK CLASSIFICATION (output)    Class-A      Class-B      Totals
Class-A                                   2834 (TP)    173 (FP)     3007
Class-B                                   318 (FN)     3103 (TN)    3421
Totals                                    3152         3276         6428
41
Typical Measures of Accuracy

Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)
Producer's Accuracy (Class A) = TP/(TP+FN)
Producer's Accuracy (Class B) = TN/(FP+TN)
User's Accuracy (Class A) = TP/(TP+FP)
User's Accuracy (Class B) = TN/(TN+FN)

Accuracy of our Classification on preceding slide:

Overall Accuracy = 92.4%
Producer's Accuracy (Class A) = 89.9%
Producer's Accuracy (Class B) = 94.7%
User's Accuracy (Class A) = 94.2%
User's Accuracy (Class B) = 90.7%
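The quoted percentages can be reproduced from the four cells of the error matrix (TP = 2834, FP = 173, FN = 318, TN = 3103). The short script below is a worked check; the helper name `percent` is just for the example.

```python
# Cell counts from the error matrix on the preceding slide.
tp, fp, fn, tn = 2834, 173, 318, 3103

def percent(numerator, denominator):
    """Ratio expressed as a percentage, rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)

measures = {
    "Overall Accuracy": percent(tp + tn, tp + tn + fp + fn),
    "Producer's Accuracy (Class A)": percent(tp, tp + fn),
    "Producer's Accuracy (Class B)": percent(tn, fp + tn),
    "User's Accuracy (Class A)": percent(tp, tp + fp),
    "User's Accuracy (Class B)": percent(tn, tn + fn),
}
for name, value in measures.items():
    print(f"{name}: {value}%")
```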

42
Key Concept for Data Mining: Overfitting

g(x) is a poor fit (fitting a straight line through the points)
h(x) is a good fit
d(x) is a very poor fit (fitting every point): Overfitting

43
How to Avoid Overfitting in Data Mining Models

In Data Mining, the problem arises because you
are training the model on a set of training data
(i.e., a subset of the total database).
That training data set is simply intended to be
representative of the entire database, not a
precise exact copy of the database.
So, if you try to fit every nuance in the
training data, then you will probably
over-constrain the problem and produce a bad fit.
This is where a TEST DATA SET comes in very
handy. You can train the data mining model
(Decision Tree or Neural Network) on the TRAINING
DATA, and then measure its accuracy with the TEST
DATA, prior to unleashing the model (e.g.,
Classifier) on some real new data.
Different ways of subsetting the TRAINING and TEST data sets:
50-50 (50% of data used to TRAIN, 50% used to TEST)
10 different sets of 90-10 (90% for TRAINING, 10% for TESTING)
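Both subsetting schemes are easy to sketch. The code below assumes that "10 different sets of 90-10" means 10-fold cross-validation, in which every record is used for testing exactly once; that reading, the function names, and the fixed random seed are assumptions of this example.

```python
import random

def holdout_split(items, train_fraction=0.5, seed=0):
    """Shuffle a copy of the data, then cut it into TRAINING and TEST subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_splits(items, k=10, seed=0):
    """Yield k (train, test) pairs; each fold serves as the TEST set exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

items = list(range(100))  # stand-in for 100 database records
train, test = holdout_split(items)
print(len(train), len(test))

splits = list(k_fold_splits(items, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```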

44
Schematic Approach to Avoiding Overfitting
To avoid overfitting, you need to know when to
stop training the model. Although the Training
Set error may continue to decrease, you may
simply be overfitting the Training Data. Test
this by applying the model to Test Data (not
part of Training Set). If the Test Set error
starts to increase, then you know that you
are overfitting the Training Set and it is time
to stop!
[Figure: Error versus Training Epoch, showing the Training Set error and Test Set error curves, with the point marked "STOP Training HERE!"]
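The stopping rule can be written as a small function that watches the Test Set error after each epoch. This is a sketch under assumptions: the `patience` parameter (how many consecutive non-improving epochs to tolerate) and the error curves are hypothetical, chosen to mimic the shape described above.

```python
def early_stopping_epoch(test_errors, patience=1):
    """Return the 0-based epoch with the lowest test error seen before the error
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error, stalled = 0, float("inf"), 0
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            best_epoch, best_error, stalled = epoch, error, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return best_epoch

# Hypothetical error curves: training error keeps falling, test error turns upward.
training_errors = [0.50, 0.40, 0.30, 0.22, 0.15, 0.10, 0.06]
test_errors = [0.55, 0.45, 0.38, 0.35, 0.37, 0.42, 0.50]

stop = early_stopping_epoch(test_errors)
print("stop training after epoch", stop)
```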
45
OUTLINE