NVOSS08 - PowerPoint PPT Presentation

About This Presentation
Title:

NVOSS08

Description:

THE US NATIONAL VIRTUAL OBSERVATORY: Basic Concepts in Data Mining. Kirk Borne, George Mason University

Slides: 47
Provided by: calt164
Transcript and Presenter's Notes


1
THE US NATIONAL VIRTUAL OBSERVATORY
Basic Concepts in Data Mining
Kirk Borne, George Mason University
2
OUTLINE
  • The New Face of Science
  • Scientific Knowledge Discovery
  • Data Mining Examples and Techniques
  • Basic Concepts in Data Mining
  • What's next?

3
OUTLINE
  • The New Face of Science
  • Scientific Knowledge Discovery
  • Data Mining Examples and Techniques
  • Basic Concepts in Data Mining
  • What's next?

4
The Scientific Data Flood
[Figure: Scientific Data Flood; Large Science Project; Pipeline]
5
The New Face of Science (1)
  • Big Data (usually geographically distributed):
      • High-Energy Particle Physics
      • Astronomy and Space Physics
      • Earth Observing System (Remote Sensing)
      • Human Genome and Bioinformatics
      • Numerical Simulations of any kind
      • Digital Libraries (electronic publication
        repositories)
  • e-Science:
      • Built on Web Services (e-Gov, e-Biz) paradigm
      • Distributed heterogeneous data are the norm
      • Data integration across projects & institutions
      • One-stop shopping: "The right data, right now."

6
The New Face of Science (2)
  • Databases enable scientific discovery:
      • Data Handling and Archiving (management of
        massive data resources)
      • Data Discovery (finding data wherever they exist)
      • Data Access (WWW-Database interfaces)
      • Data/Metadata Browsing (serendipity)
      • Data Sharing and Reuse (within project teams and
        by other scientists: scientific validation)
      • Data Integration (from multiple sources)
      • Data Fusion (across multiple modalities &
        domains)
      • Data Mining (KDD: Knowledge Discovery in
        Databases)

7
OUTLINE
  • The New Face of Science
  • Scientific Knowledge Discovery
  • Data Mining Examples and Techniques
  • Basic Concepts in Data Mining
  • What's next?

8
So what is Data Mining?
  • Data Mining is Knowledge Discovery in Databases
    (KDD)
  • Data mining is defined as an information
    extraction activity whose goal is to discover
    hidden facts contained in (large) databases.
  • Note: Machine Learning is the field of Computer
    Science research that focuses on algorithms that
    learn from data.
  • Data Mining is the application of Machine
    Learning algorithms to large databases.

9
Scientific Data Mining
  • Data Mining is the Killer App for Scientific
    Databases.
  • Scientific Data Mining References:
      • http://voneural.na.infn.it/
      • http://astroweka.sourceforge.net/
      • http://www.itsc.uah.edu/f-mass/
      • Framework for Mining and Analysis of Space
        Science data (F-MASS)
  • Data mining is used to find patterns and
    relationships in data. (EDA: Exploratory Data
    Analysis)
  • Patterns can be analyzed via 2 types of models:
      • Descriptive: Describe patterns and create
        meaningful subgroups or clusters. (Unsupervised
        Learning, Clustering)
      • Predictive: Forecast explicit values, based
        upon patterns in known results. (Supervised
        Learning, Classification)
  • How does this apply to Scientific Research?
      • ... through KNOWLEDGE DISCOVERY
      • Data → Information → Knowledge →
        Understanding / Wisdom!

10
Astronomy Example
Data:
(a) Imaging data (ones & zeroes)
(b) Spectral data (ones & zeroes)
  • Information (catalogs / databases):
  • Measure brightness of galaxies from image (e.g.,
    14.2 or 21.7)
  • Measure redshift of galaxies from spectrum (e.g.,
    0.0167 or 0.346)

Knowledge: Hubble Diagram → Redshift-Brightness
Correlation → Redshift = Distance
Understanding: the Universe is expanding!!
11
Astronomers have been doing Data Mining for
centuries
  • The data are mine, and you can't have them!
  • Seriously ...
  • Astronomers love to classify things ...
    (Supervised Learning. e.g., classification)
  • Astronomers love to characterize things ...
    (Unsupervised Learning. e.g., clustering)
  • And we love to discover new things ...
    (Semi-supervised Learning. e.g., outlier
    detection)

12
This sums it up ...
  • Characterize the new (clustering)
  • Assign the known (classification)
  • Discover the unknown (outlier detection)

Graphic from S. G. Djorgovski
  • 2 benefits of very large data sets within a
    scientific domain:
      • best statistical analysis of typical events
      • automated search for rare events

13
OUTLINE
  • The New Face of Science
  • Scientific Knowledge Discovery
  • Data Mining Examples and Techniques
  • Basic Concepts in Data Mining
  • What's next?

14
Database Systems and Data Mining
  • Data mining brings novel non-traditional (Machine
    Learning) concepts to large DBMS (e.g.,
    association mining; neural networks; decision
    trees; link analysis; pattern recognition;
    classification; regression; self-organizing
    maps). For example:
      • Clustering Analysis: group together similar
        items, and separate the dissimilar items
      • Classification: predict the class label
      • Regression: predict a numeric attribute value
      • Association Analysis: detect attribute-value
        conditions that occur frequently together

15
Data Mining Methods and Some Examples
  • Clustering
  • Classification
  • Associations
  • Neural Nets
  • Decision Trees
  • Pattern Recognition
  • Correlation/Trend Analysis
  • Principal Component Analysis
  • Independent Component Analysis
  • Regression Analysis
  • Outlier/Glitch Identification
  • Visualization
  • Autonomous Agents
  • Self-Organizing Maps (SOM)
  • Link (Affinity) Analysis

Group together similar items and separate dissimilar items in DB
Classify new data items using the known classes & groups
Find unusual co-occurring associations of attribute values among DB items
Predict a numeric attribute value
Organize information in the database based on relationships among key data descriptors
Identify linkages between data items based on features shared in common
16
Some Data Mining Techniques Graphically Represented
[Figure panels: Self-Organizing Map (SOM); Clustering; Neural Network; Outlier (Anomaly) Detection; Link Analysis; Decision Tree]
17
Categories of Machine Learning and some Examples
  • Supervised Learning
      • Classification
  • Unsupervised Learning
      • Clustering
      • Link Analysis
      • Association Analysis
  • Semi-supervised Learning
      • Outlier Detection
      • Class Discovery

18
Some Classification Algorithms
Classification: the process of learning and then applying a
function that classifies the data into a set of
predefined classes.
  • Bayes' Theorem
  • Support Vector Machines (SVM)
  • Decision Trees
  • Regression
  • Neural Networks
  • Markov Modeling
  • K-Nearest Neighbors

19
Classification - a 2-Step Process
  • Model Construction (Description): describing a
    set of predetermined classes. Build the Model.
      • Each data element/tuple/sample is assumed to
        belong to a predefined class, as determined by
        the class label attribute
      • The set of tuples used for model construction
        is the training set
      • The model is represented by classification rules,
        decision trees, or mathematical formulae
  • Model Usage (Prediction): for classifying future
    or unknown objects, or for predicting missing
    values. Apply the Model.
      • It is important to estimate the accuracy of the
        model
      • The known labels of the test sample are compared
        with the classification results from the model
      • Accuracy rate is the percentage of test set
        samples that are correctly classified by the
        model
      • Test set is chosen completely independent of the
        training set, otherwise overfitting will occur
        (overfitting is a bad thing!)
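
The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the nearest-centroid classifier, and the names build_model and apply_model are hypothetical choices made for brevity, not something taken from the presentation.

```python
import math

def build_model(training_set):
    """Step 1 (Model Construction): learn one centroid per class label."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def apply_model(model, features):
    """Step 2 (Model Usage): assign the class whose centroid is nearest."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical labeled tuples: (feature vector, class label).
training_set = [((1.0, 1.2), "A"), ((0.8, 1.0), "A"), ((1.1, 0.9), "A"),
                ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.9, 3.0), "B")]
# Independent test set: never used during model construction.
test_set = [((0.9, 1.1), "A"), ((3.2, 3.1), "B"),
            ((1.2, 1.0), "A"), ((2.8, 3.3), "B")]

model = build_model(training_set)
correct = sum(apply_model(model, x) == label for x, label in test_set)
accuracy = 100.0 * correct / len(test_set)
print(f"Accuracy rate on the test set: {accuracy:.1f}%")
```

Any classifier from the preceding slide could replace the nearest-centroid rule; the point is only that the model is built from the training set and scored on tuples it has not seen.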

20
Classification Methods: Decision Trees, Neural
Networks, SVM (Support Vector Machines)
  • There are 2 Classes!
  • How do you ...
  • Separate them?
  • Distinguish them?
  • Learn the rules?
  • Classify them?

Apply Kernel
(SVM)
21
Some Clustering Algorithms
Clustering: the process of partitioning a set of data into
subsets or clusters such that a data element
belonging to a cluster is more similar to data
elements belonging to that same cluster than to
the data elements belonging to other clusters.
  • Squared Error
  • Nearest Neighbor
  • K-Means (most popular)
  • Mixture Models (statistical)

22
Clustering is used to discover the different
unique groupings (classes) of attribute
values. The case shown below is not obvious: one
or two groups?
23
This case is easier: there are two groups. (In
fact, this is the same set of data elements as
shown on the previous slide, but plotted here
using a different attribute.)
24
Semi-supervised Learning: Outlier Detection and
Class Discovery
Figure: The clustering of data clouds (dc)
within a multidimensional parameter space
(p). Such a mapping can be used to search for
and identify clusters, voids, outliers,
one-of-kinds, relationships, and associations
among arbitrary parameters in a database (or
among various parameters in geographically
distributed databases).
  • statistical analysis of typical events
  • automated search for rare events

25
Outlier Detection: Serendipitous Discovery of
Rare or New Objects & Events
26
Principal Components Analysis & Independent
Components Analysis
Cepheid Variables: Cosmic Yardsticks -- One
Correlation -- Two Classes!
... Class Discovery!
27
Why use Data Mining? Here are 6 reasons...
  1. Most projects now collect massive quantities of
    data.
  2. Because of the enormous potential for new
    discoveries in existing huge databases.
  3. Data mining moves beyond the analysis of past
    events to predicting future trends and behaviors
    that may be missed because they lie outside
    experts' expectations.
  4. Data mining tools can answer complex questions
    that traditionally were too time-consuming to
    resolve.
  5. Data mining tools can explore the intricate
    interdependencies within databases in order to
    discover hidden patterns and relationships.
  6. Data mining allows decision-makers to make
    proactive, knowledge-driven decisions.

28
OUTLINE
  • The New Face of Science
  • Scientific Knowledge Discovery
  • Data Mining Examples and Techniques
  • Basic Concepts in Data Mining
  • What's next?

29
Basic Concepts: Key Steps
  • The key steps in a data mining project usually
    invoke and/or follow these basic concepts:
  • Data browse, preview, and selection
  • Data cleaning and preparation
  • Feature selection
  • Data normalization and transformation
  • Similarity/Distance metric selection
  • ... Select the data mining method
  • ... Apply the data mining method
  • ... Gather and analyze data mining results
  • Accuracy estimation
  • Avoiding overfitting

30
Key Concept for Data Mining: Data Previewing
  • Data Previewing allows you to get a sense of the
    good, bad, and ugly parts of the database
  • This includes:
      • Histograms of attribute distributions
      • Scatter plots of attribute combinations
      • Max-Min value checks (versus expectations)
      • Summarizations, aggregations (GROUP BY)
      • SELECT UNIQUE values (versus expectations)
      • Checking physical units (and scale factors)
      • External checks (cross-DB comparisons)
      • Verify with input DB
31
Key Concept for Data Mining: Data Preparation -
Cleaning the Data
  • Data Preparation can take 40-80% (or more) of the
    effort in a data mining project
  • This includes:
      • Dealing with NULL (missing) values
      • Dealing with errors
      • Dealing with noise
      • Dealing with outliers (unless that is your
        science!)
      • Transformations: units, scale, projections
      • Data normalization
      • Relevance analysis: Feature Selection
      • Remove redundant attributes
      • Dimensionality Reduction
32
Key Concept for Data Mining: Feature Selection -
the Feature Vector
  • A feature vector is the attribute vector for a
    database record (tuple).
  • The feature vector's components are database
    attributes: v = {w, x, y, z}
  • It contains the set of database attributes that
    you have chosen to represent (describe) uniquely
    each data element (tuple).
  • This is only a subset of all possible attributes
    in the DB.
  • Example: Sky Survey database object feature
    vector
      • Generic: {RA, Dec, mag, redshift, color, size}
      • Specific: {ra2000, dec2000, r, z, g-r, R_eff}
33
Key Concept for Data Mining: Data Types
  • Different data types:
      • Continuous
          • Numeric (e.g., salaries, ages, temperatures,
            rainfall, sales)
      • Discrete
          • Binary (0 or 1; Yes/No; Male/Female)
          • Boolean (True/False)
          • Specific list of allowed values (e.g., zip codes;
            country names; chemical elements; amino acids;
            planets)
      • Categorical
          • Non-numeric (character/text data) (e.g.,
            people's names)
          • Can be Ordinal (ordered) or Nominal (not ordered)
  • Reference: http://www.twocrows.com/glossary.htm#anchor311516
  • Examples of Data Mining Classification
    Techniques:
      • Regression for continuous numeric data
      • Logistic Regression for discrete data
      • Bayesian Classification for categorical data

34
Key Concept for Data Mining: Data Normalization
and Data Transformation
  • Data Normalization transforms data values for
    different database attributes into a uniform set
    of units or into a uniform scale (i.e., to a
    common min-max range).
  • Data Normalization assigns the correct numerical
    weighting to the values of different attributes.
  • For example:
      • Transform all numerical values from min to max on
        a 0 to 1 scale (or 0 to Weight, or -1 to 1, or 0
        to 100).
      • Convert discrete or character (categorical) data
        into numeric values.
      • Transform ordinal data to a ranked list
        (numeric).
      • Discretize continuous data into bins.
35
Key Concept for Data Mining: Similarity and
Distance Metrics
  • Similarity between complex data objects is one of
    the central notions in data mining.
  • The fundamental problem is to determine whether
    any selected pair of data objects exhibit similar
    characteristics.
  • The problem is both interesting and difficult
    because the similarity measures should allow for
    imprecise matches.
  • Similarity and its inverse (Distance) provide
    the basis for all of the fundamental data mining
    clustering techniques and for many data mining
    classification techniques.

36
Similarity and Distance Measures (metrics)
37
Similarity and Distance Measures
  • Most clustering algorithms depend on a distance
    or similarity measure, to determine (a) the
    closeness or alikeness of cluster members, and
    (b) the distance or unlikeness of members from
    different clusters.
  • General requirements for any similarity or
    distance metric:
      • Non-negative: dist(A,B) ≥ 0 and sim(A,B) ≥ 0
      • Symmetric: dist(A,B) = dist(B,A) and
        sim(A,B) = sim(B,A)
  • In order to calculate the distance between
    different attribute values, those attributes must
    be transformed or normalized (either to the same
    units, or else normalized to a similar scale).
  • The normalization of both categorical
    (non-numeric) data and numerical data with units
    generally requires domain expertise. This is
    part of the pre-processing (data preparation)
    step in any data mining activity.

38
Popular Similarity and Distance Measures
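  • General $L_p$ distance: $\|x-y\|_p = \left(\sum_i |x_i-y_i|^p\right)^{1/p}$
  • Euclidean distance: $p = 2$
      • $D_E = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + (x_3-y_3)^2}$
  • Manhattan distance: $p = 1$ (# of city blocks
    walked)
      • $D_M = |x_1-y_1| + |x_2-y_2| + |x_3-y_3|$
  • Cosine distance: angle between two feature
    vectors
      • $d(X,Y) = \arccos\left[ X \cdot Y / (\|X\| \, \|Y\|) \right]$
      • $d(X,Y) = \arccos\left[ (x_1 y_1 + x_2 y_2 + x_3 y_3) / (\|X\| \, \|Y\|) \right]$
  • Similarity function: $s(x,y) = 1 / [1 + d(x,y)]$
      • $s$ varies from 1 to 0, as distance $d$ varies from 0
        to $\infty$.
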
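
The measures on this slide translate directly into code. The sketch below uses only the Python standard library; the function names and the sample vectors are assumptions for illustration.

```python
import math

def lp_distance(x, y, p):
    """General L_p distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cosine_distance(x, y):
    """Angle (in radians) between two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.hypot(*x) * math.hypot(*y)
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def similarity(d):
    """s = 1 / (1 + d): equals 1 at zero distance and approaches 0 as d grows."""
    return 1.0 / (1.0 + d)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print("Euclidean (p = 2):", lp_distance(x, y, 2))
print("Manhattan (p = 1):", lp_distance(x, y, 1))
print("Cosine distance:  ", cosine_distance(x, y))
print("Similarity:       ", similarity(lp_distance(x, y, 2)))
```

Note that the Euclidean and Manhattan results depend on the units of each attribute, which is why normalization comes before distance calculation.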
39
Data Mining Clustering and Nearest Neighbor
Algorithms: Issues
  • Clustering algorithms and nearest neighbor
    algorithms (for classification) require a
    distance or similarity metric.
  • You must be especially careful with categorical
    data, which can be a problem. For example:
  • What is the distance between blue and green? Is
    it larger than the distance from green to red?
  • How do you metrify different attributes (color,
    shape, text labels)? This is essential in order
    to calculate distance in multi-dimensions. Is
    the distance from blue to green larger or smaller
    than the distance from round to square? Which of
    these are most similar?
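
There is no single correct answer to these questions, but two common conventions can be sketched. Both are assumptions for illustration: a 0/1 "overlap" distance for unordered categories, and a mapping of colors to rough wavelengths so that they acquire an order. The wavelength values are approximate and chosen only for the example.

```python
def overlap_distance(a, b):
    """Nominal attribute: 0 if the categories match, 1 otherwise."""
    return 0.0 if a == b else 1.0

# Alternative for colors: map to a physical scale (approximate wavelengths in nm),
# so that "blue to green" can be compared with "green to red".
WAVELENGTH_NM = {"blue": 470.0, "green": 530.0, "red": 680.0}

def color_distance(a, b):
    """Absolute wavelength difference, scaled to the 0-1 range of this palette."""
    span = max(WAVELENGTH_NM.values()) - min(WAVELENGTH_NM.values())
    return abs(WAVELENGTH_NM[a] - WAVELENGTH_NM[b]) / span

def mixed_distance(obj1, obj2):
    """Average of the per-attribute distances for (color, shape) objects."""
    return (color_distance(obj1[0], obj2[0])
            + overlap_distance(obj1[1], obj2[1])) / 2.0

print(color_distance("blue", "green"), color_distance("green", "red"))
print(mixed_distance(("blue", "round"), ("green", "square")))
```

Scaling every per-attribute distance to the same 0-1 range is what makes it meaningful to combine color and shape in one number; how to weight them against each other remains a domain decision.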

40
Key Concept for Data Mining: Classification Accuracy
Typical Error Matrix
(TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative)

                                     TRAINING DATA (actual classes)
NEURAL NETWORK                       Class-A      Class-B      Totals
CLASSIFICATION (output)   Class-A    2834 (TP)    173 (FP)     3007
                          Class-B    318 (FN)     3103 (TN)    3421
                          Totals     3152         3276         6428
41
Typical Measures of Accuracy
  • Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)
  • Producer's Accuracy (Class A) = TP/(TP+FN)
  • Producer's Accuracy (Class B) = TN/(FP+TN)
  • User's Accuracy (Class A) = TP/(TP+FP)
  • User's Accuracy (Class B) = TN/(TN+FN)

Accuracy of our Classification on preceding slide:
  • Overall Accuracy = 92.4%
  • Producer's Accuracy (Class A) = 89.9%
  • Producer's Accuracy (Class B) = 94.7%
  • User's Accuracy (Class A) = 94.2%
  • User's Accuracy (Class B) = 90.7%
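
The percentages can be checked against the counts in the error matrix with a short calculation. The helper name percent is an assumption; the four counts are the ones given on the preceding slide.

```python
# Counts from the error matrix on the preceding slide.
TP, FP, FN, TN = 2834, 173, 318, 3103

def percent(numerator, denominator):
    """Ratio expressed as a percentage, rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)

overall = percent(TP + TN, TP + TN + FP + FN)
producers_a = percent(TP, TP + FN)
producers_b = percent(TN, FP + TN)
users_a = percent(TP, TP + FP)
users_b = percent(TN, TN + FN)

print("Overall Accuracy:             ", overall)
print("Producer's Accuracy (Class A):", producers_a)
print("Producer's Accuracy (Class B):", producers_b)
print("User's Accuracy (Class A):    ", users_a)
print("User's Accuracy (Class B):    ", users_b)
```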

42
Key Concept for Data Mining: Overfitting
[Figure: data points with the fitted curves g(x), h(x), and d(x)]
  • g(x) is a poor fit (fitting a straight line
    through the points)
  • h(x) is a good fit
  • d(x) is a very poor fit (fitting every point):
    Overfitting

43
How to Avoid Overfitting in Data Mining Models
  • In Data Mining, the problem arises because you
    are training the model on a set of training data
    (i.e., a subset of the total database).
  • That training data set is simply intended to be
    representative of the entire database, not a
    precise exact copy of the database.
  • So, if you try to fit every nuance in the
    training data, then you will probably
    over-constrain the problem and produce a bad fit.
  • This is where a TEST DATA SET comes in very
    handy. You can train the data mining model
    (Decision Tree or Neural Network) on the TRAINING
    DATA, and then measure its accuracy with the TEST
    DATA, prior to unleashing the model (e.g.,
    Classifier) on some real new data.
  • Different ways of subsetting the TRAINING and
    TEST data sets:
      • 50-50 (50% of data used to TRAIN, 50% used to
        TEST)
      • 10 different sets of 90-10 (90% for TRAINING,
        10% for TESTING)
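
Both subsetting schemes can be sketched with the standard library. The function names, the fixed random seed, and the stand-in records are assumptions for illustration; the second scheme is commonly known as 10-fold cross-validation.

```python
import random

def holdout_split(records, train_fraction=0.5, seed=0):
    """Shuffle, then split once into (training, test) subsets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_splits(records, k=10, seed=0):
    """Yield k (training, test) pairs; each record is tested exactly once."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, folds[i]

records = list(range(100))  # stand-in for database tuples

train, test = holdout_split(records)  # 50-50
print(len(train), len(test))

splits = list(k_fold_splits(records))  # 10 different sets of 90-10
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With the 90-10 scheme the accuracy is typically reported as the average over the 10 test sets.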

44
Schematic Approach to Avoiding Overfitting
To avoid overfitting, you need to know when to
stop training the model. Although the Training
Set error may continue to decrease, you may
simply be overfitting the Training Data. Test
this by applying the model to Test Data (not
part of Training Set). If the Test Set error
starts to increase, then you know that you
are overfitting the Training Set and it is time
to stop!
[Figure: Error versus Training Epoch, showing the Training Set error and Test Set error curves, with the point marked "STOP Training HERE!"]
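
The stopping rule can be sketched as follows. The error values are made up to mimic the curves described above, and the patience parameter (how many non-improving epochs to tolerate before stopping) is an assumption added so that a single noisy epoch does not end training too early.

```python
def best_stopping_epoch(test_errors, patience=2):
    """Return the epoch with the lowest test error, scanning until the error
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error = 0, test_errors[0]
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # test error keeps rising: overfitting has set in
    return best_epoch

# Hypothetical error curves, one value per training epoch.
training_error = [0.50, 0.35, 0.25, 0.18, 0.13, 0.10, 0.08, 0.06]
test_error     = [0.52, 0.38, 0.29, 0.24, 0.23, 0.25, 0.28, 0.32]

stop = best_stopping_epoch(test_error)
print(f"Stop training at epoch {stop}: "
      f"training error {training_error[stop]}, test error {test_error[stop]}")
```

In this made-up example the training error keeps falling after the chosen epoch while the test error rises, which is the signature of overfitting.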
45
OUTLINE
  • The New Face of Science
  • Scientific Knowledge Discovery
  • Data Mining Examples and Techniques
  • Basic Concepts in Data Mining
  • What's next?

46
Scientific Data Mining in Astronomy
2008 NVO Summer School