Zhang Yanxia - PowerPoint PPT Presentation

1
Chinese Virtual Observatory
Data Mining in Astronomy
Zhang Yanxia, China-VO Group, 2006.11.30, Guilin
2
Outline
  • Why
  • What
  • How
  • Example
  • Challenge
  • Summary

3
Astronomy facing data avalanche
Necessity Is the Mother of Invention
DM and KDD
4
Issues in Astronomy
Ofer Lahav, 2006, astro-ph/0610703: summary of the
4th meeting on Statistical Challenges in Modern
Astronomy, held at Penn State University in June
2006
  • Compression (e.g. Galaxy images and spectra)
  • Classification (e.g. Stars, galaxies, or Gamma
    Ray Bursts)
  • Reconstruction (e.g. of blurred galaxy images,
    mass distribution from weak gravitational
    lensing)
  • Feature extraction (e.g. signature features of
    stars, galaxies and quasars)
  • Parameter estimation (e.g. star parameter
    measurement, photometric redshift prediction,
    orbital parameters of extra-solar planets, or
    cosmological parameters)
  • Model selection (e.g. are there 0, 1, 2, ... planets
    around a star, or is a cosmological model
    with non-zero neutrino mass more favorable)

5
Science Requirements for DM (Borne K D, 2001,
Proc. of the MPA/ESO/MPE Workshop, 671)
  • Cross-Identification - refers to the classical
    problem of associating the source list in one
    database to the source list in another.
  • Cross-Correlation - refers to the search for
    correlations, tendencies, and trends between
    physical parameters in multi-dimensional data,
    usually across databases.
  • Nearest-Neighbor Identification - refers to the
    general application of clustering algorithms in
    multi-dimensional parameter space, usually within
    a database.
  • Systematic Data Exploration - refers to the
    application of the broad range of event-based and
    relationship-based queries to a database in the
    hope of making a serendipitous discovery of new
    objects or a new class of objects.
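
The cross-identification task described above can be sketched as a positional match between two source lists. This is a minimal illustration, not the method used by China-VO: it assumes each catalogue is a dict from a source name to (RA, Dec) in degrees, and the function names, toy catalogues and 2-arcsecond radius are all hypothetical. A survey-scale match would use a spatial index rather than this brute-force loop.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_identify(cat_a, cat_b, radius_arcsec=2.0):
    """For each source in cat_a, return the nearest cat_b source within the radius, else None."""
    limit_deg = radius_arcsec / 3600.0
    matches = {}
    for name_a, (ra_a, dec_a) in cat_a.items():
        best_name, best_sep = None, limit_deg
        for name_b, (ra_b, dec_b) in cat_b.items():
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_name, best_sep = name_b, sep
        matches[name_a] = best_name
    return matches

# Hypothetical toy catalogues (RA, Dec in degrees).
optical = {"opt1": (10.0, 20.0), "opt2": (50.0, -5.0)}
infrared = {"ir1": (10.0001, 20.0001), "ir2": (120.0, 30.0)}
matches = cross_identify(optical, infrared)
print(matches)
```

Sources with no counterpart inside the radius map to None, which keeps unmatched objects visible for later inspection.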

6
KDD Opportunity and Challenges
[Diagram: KDD at the center, surrounded by the following]
  • Competitive Pressure
  • Data Rich, Knowledge Poor (the resource)
  • Data Mining Technology Mature
  • Enabling Technology (Interactive MIS, OLAP,
    parallel computing, Web, etc.)
7

KDD: A Definition
KDD is the automatic extraction of non-obvious,
hidden knowledge from large volumes of data.
10^6 to 10^12 bytes: one never sees the whole data set
or holds it in computer memory.
What knowledge? How to represent and use it?
Data mining algorithms?
8

Benefits of Knowledge Discovery
[Figure: diagram relating Value and Volume for EDP, MIS and DSS; other labels: Generate, Disseminate, Rapid Response]
EDP: Electronic Data Processing; MIS: Management
Information Systems; DSS: Decision Support Systems
9
DM: A KDD Process
  • Data mining: the core of the knowledge discovery
    process.

[Diagram: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge]
10
Work at each stage of DM
[Bar chart, axis from 0 to 60: work spent on each stage (DM object, Data preparation, Data processing, Analysis and Evaluation)]

11
Primary Tasks of Data Mining
  • Classification: finding the description of several
    predefined classes and classifying a data item into
    one of them.
  • Clustering: identifying a finite set of categories or
    clusters to describe the data.
  • Dependency Modeling: finding a model which describes
    significant dependencies between variables.
  • Regression: mapping a data item to a real-valued
    prediction variable.
  • Deviation and change detection: discovering the most
    significant changes in the data.
  • Summarization: finding a compact description for a
    subset of data.
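
To make the clustering task concrete (finding categories without predefined labels), here is a minimal one-dimensional k-means sketch. The function name, the evenly spread starting centres and the toy values are assumptions made for illustration; the talk does not name a specific clustering algorithm.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster 1-D values into k groups and return the sorted cluster centres."""
    lo, hi = min(values), max(values)
    if k > 1:
        # Deterministic start: spread the centres evenly over the data range.
        centres = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    else:
        centres = [sum(values) / len(values)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Recompute each centre as its group mean; keep the old centre if the group is empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)

# Hypothetical measurements forming two obvious groups.
measurements = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centres = kmeans_1d(measurements, k=2)
print(centres)
```

Classification would instead start from labelled examples; here the two groups emerge from the data alone.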
12
Feature selection
  • Filter method
  • Wrapper method
  • Embedded method
  • Feature weighted method
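
As an illustration of the filter approach listed above, the sketch below scores each feature independently of any classifier and keeps the top-ranked ones. Using the absolute Pearson correlation with the class label as the score is an assumption chosen for brevity, and the feature names and values are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)

def filter_select(features, labels, top_k):
    """Rank features by |correlation with the label| and return the top_k names plus all scores."""
    scores = {name: abs(pearson(column, labels)) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores

# Hypothetical feature table: one informative column, one noisy, one constant.
features = {
    "colour": [0.1, 0.2, 0.9, 1.0, 0.15, 0.95],
    "noise": [0.5, 0.1, 0.4, 0.2, 0.3, 0.6],
    "constant": [1.0] * 6,
}
labels = [0, 0, 1, 1, 0, 1]
selected, scores = filter_select(features, labels, top_k=1)
print(selected, scores)
```

A wrapper method would instead retrain a classifier on each candidate subset and score the subset by its validation accuracy, which is costlier but accounts for feature interactions.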

13
Feature extraction
  • PCA
  • Factor analysis (Principal FA/Maximum Likelihood
    FA)
  • Projection pursuit
  • ICA
  • Non-linear PCA/ICA
  • Random projection
  • Principal curves
  • MDS
  • LLE
  • ISOMAP
  • Topological continuous map
  • Neural network
  • Vector quantization
  • Kernel PCA/ICA
  • LDA (linear discriminant analysis)
  • QDA (quadratic discriminant analysis)
  • FDA (Fisher discriminant analysis)
  • GDA (Generalized discriminant analysis)
  • KDDA (kernel direct discriminant analysis)
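
PCA, the first method in the list above, can be sketched with nothing more than a covariance matrix and power iteration. This is an illustrative version that recovers only the leading component; the function name and toy data are assumptions, and production code would normally call a linear-algebra library.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, variance) of the leading principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Caveat: this start vector fails if it is exactly orthogonal to the leading component.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # the data have no variance along v
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance

# Toy data lying exactly on the line y = 2x.
rows = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, variance = first_principal_component(rows)
print(direction, variance)
```

Projecting each centred row onto the returned direction gives the one-dimensional feature that retains the most variance.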

14
Classification Methods
  • Based on statistical theory: SVMs, ML, LDA, FDA,
    QDA, KNN
  • Based on NN: LVQ, RBF, PNN, KSOM, BBN, SLP, MLP
  • Based on decision trees: REPTree, RandomTree,
    CART, C5.0, J48, DecisionStump, RandomForest,
    NBTree, AC2, Cal5, ADTree, KDTree
  • Based on decision rules: Decision Table, CN2,
    ITrule, AQ
  • Based on Bayesian theory: Naive Bayes classifier,
    NBTree
  • Based on meta learning: AdaBoost, boosting,
    bagging
  • Based on evolution theory: genetic algorithm
  • Based on fuzzy theory: fuzzy set, rough set
  • Ensembles of classifiers

[Diagram: Data Mining algorithm -> patterns]
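
Bagging, one of the meta-learning methods listed above, can be sketched with decision stumps as the base learner. This is a toy version under stated assumptions: one-dimensional inputs, binary labels 0/1, and invented data. The helper names are hypothetical and do not correspond to any particular toolkit.

```python
import random
from collections import Counter

def train_stump(xs, ys):
    """Fit a 1-D decision stump that predicts 1 when x > threshold, else 0."""
    points = sorted(set(xs))
    candidates = ([float("-inf")]
                  + [(a + b) / 2 for a, b in zip(points, points[1:])]
                  + [float("inf")])

    def errors(threshold):
        return sum((x > threshold) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap sample (n records drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical separable data: negative values are class 0, positive values class 1.
xs = [float(v) for v in range(-10, 0)] + [float(v) for v in range(1, 11)]
ys = [0] * 10 + [1] * 10
model = bagging_fit(xs, ys)
print(bagging_predict(model, -20.0), bagging_predict(model, 20.0))
```

Boosting differs in that the models are trained in sequence, with each round reweighting the records the previous models misclassified.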
15
Regression Methods
  • (penalized) logistic regression
  • Bayesian regression analysis
  • Additive regression
  • Locally weighted regression
  • Voted perceptron network
  • Projection pursuit regression
  • Recursive partitioning regression
  • Alternating conditional expectation
  • Stepwise regression
  • Recursive least squares
  • Fourier transform regression
  • Rule-based regression
  • Principal component regression
  • Instance-based regression
  • Multivariate adaptive regression splines
  • Regression trees (CART, RETIS, M5, random forest,
    KDtree)
  • Simple windowed regression
  • SVM
  • NN
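
Of the methods above, locally weighted regression is easy to sketch: for each query point, fit a weighted straight line in which nearby records count more. The Gaussian kernel, the bandwidth value and the function name below are illustrative assumptions.

```python
import math

def locally_weighted_predict(xs, ys, x0, bandwidth=1.0):
    """Predict y at x0 from a weighted straight-line fit with Gaussian weights centred on x0."""
    weights = [math.exp(-((x - x0) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    # Fall back to the weighted mean if all the weight sits on a single x value.
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)

# On exactly linear data (y = 2x + 1) the local fit reproduces the line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
prediction = locally_weighted_predict(xs, ys, 2.5)
print(prediction)
```

A smaller bandwidth follows local structure more closely at the price of noisier predictions.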

16
Methods to estimate errors
  • Train-test
  • Cross-validation
  • Bootstrap
  • Leave-one-out
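
The resampling schemes above differ only in how records are divided into training and test sets. A minimal sketch of the index bookkeeping, with hypothetical function names, might look like this:

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation over n records."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

def leave_one_out_splits(n):
    """Leave-one-out is k-fold cross-validation with k equal to the number of records."""
    return k_fold_splits(n, n)

def bootstrap_split(n, seed=0):
    """Draw n indices with replacement; records never drawn form the out-of-bag test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test

splits = list(k_fold_splits(10, 5))
boot_train, boot_test = bootstrap_split(10)
for train, test in splits:
    print(sorted(test))
```

The error estimate is then the average test error over the splits; a plain train-test split corresponds to using a single such pair.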

17
Evaluation of methods
  • Accuracy
  • Speed
  • Comprehensibility
  • Time to learn
  • Generalization

18
Model Selection for Classification
  • Accuracy
  • G-mean
  • F-measure
  • ROC (Receiver Operating Characteristic) curve
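
The criteria above can all be computed from a binary confusion matrix. The sketch below assumes one common set of definitions: G-mean as the geometric mean of sensitivity and specificity, and F-measure as the harmonic mean of precision and recall. The slides do not state which definitions were used, and the counts are invented.

```python
import math

def classification_scores(tp, fn, fp, tn):
    """Scores from confusion-matrix counts; assumes every denominator is non-zero."""
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "g_mean": math.sqrt(sensitivity * specificity),
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        # One point on the ROC curve: (false positive rate, true positive rate).
        "roc_point": (1 - specificity, sensitivity),
    }

# Hypothetical counts for one classifier at one decision threshold.
scores = classification_scores(tp=40, fn=10, fp=5, tn=45)
print(scores)
```

Sweeping the decision threshold and collecting the resulting ROC points traces out the full ROC curve. G-mean and F-measure are the more informative summaries when the classes are strongly imbalanced.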

19
Model Selection for Regression
  • AIC (Akaike information criterion)
  • BIC (Bayesian information criterion)
  • SRM (Structural Risk Minimization)
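
For a least-squares fit with Gaussian errors, AIC and BIC reduce to simple functions of the residual sum of squares. The sketch below uses that special case, dropping a constant shared by all models fitted to the same data; the function name and the two candidate models are hypothetical.

```python
import math

def aic_bic_least_squares(rss, n, k):
    """Return (AIC, BIC) for a least-squares fit: n records, k free parameters, residual sum rss."""
    log_term = n * math.log(rss / n)
    return log_term + 2 * k, log_term + k * math.log(n)

# Hypothetical fits to the same 100 records: a line (k=2) and a cubic (k=4).
aic_line, bic_line = aic_bic_least_squares(rss=50.0, n=100, k=2)
aic_cubic, bic_cubic = aic_bic_least_squares(rss=47.0, n=100, k=4)
print(aic_line, aic_cubic, bic_line, bic_cubic)
```

Lower is better for both criteria. In this example AIC favours the cubic while BIC favours the line, because BIC penalizes each extra parameter by ln(n) rather than 2.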

20
Example 1
  • Lim Tjen-Sien et al., Machine Learning, 40,
    203-229 (2000)

33 algorithms on 16 different samples
  • 22 decision trees: CART, S-Plus tree, C4.5, FACT,
    QUEST, IND, OC1, LMDT, CAL5, T1
  • 9 statistical methods: LDA, QDA, NN, LOG, FDA, PDA,
    MDA, POL
  • 2 neural networks: LVQ, RBF
21
Example 1
Lim Tjen-Sien et al., Machine Learning, 40,
203-229 (2000)
22
Example 2
23
Example 3
Zhao, Y., Zhang, Y., 2006, submitted to COSPAR
24
Example 3
Zhang, Y., Zhao, Y., 2006, submitted to ChJAA
For NB, ADTree and MLP, the overall accuracy is
97.5%, 98.5% and 98.1%, respectively.
25
Example 4
Zhang, Y., Luo, A., Zhao, Y., 2006, submitted to COSPAR
Using best-forward search, j-h, b-v, j 2.5lgFpeak
are selected as the optimal features from the 10
features. Decision Table is applied, with
10-fold cross-validation for training and testing.
Accuracy: 98.03%
26
Example 5
Li, Y., Zhang, Y., Zhao, Y., 2006, submitted to Chinese
Science
k-nearest neighbor classifier
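
A k-nearest neighbor classifier assigns a query object the majority label among its k closest training objects. The sketch below is a generic illustration with Euclidean distance and invented two-colour data; it is not the configuration used in the cited work.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to the query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical objects described by two colour indices.
train_points = [(0.2, 0.3), (0.3, 0.2), (0.25, 0.35),
                (1.1, 1.0), (1.2, 0.9), (1.0, 1.1)]
train_labels = ["star", "star", "star", "quasar", "quasar", "quasar"]
print(knn_classify(train_points, train_labels, (0.28, 0.30)))
print(knn_classify(train_points, train_labels, (1.05, 1.00)))
```

Choosing an odd k avoids tied votes in two-class problems, and features should be brought to comparable scales before distances are computed.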
27
Example 6
Zhang, Y., Zhao, Y., 2006, ADASS XV, 351, 173
28
Challenges and Influential Aspects
  • Handling different types of data with
    different degrees of supervision
  • Massive data sets, high dimensionality
    (efficiency, scalability)
  • Interactive, visual knowledge discovery
  • Different sources of data (distributed,
    heterogeneous databases; noisy, missing and
    irrelevant data, etc.)
  • Understandability of patterns, various kinds of
    requests and results (decision lists, inference
    networks, concept hierarchies, etc.)
  • Changing data and knowledge
29
Summary
  • Linear or non-linear
  • Gaussian or non-Gaussian
  • Continuous or discrete
  • Missing or not
  • Comparison of the number of attributes with that
    of records
  • Choose the appropriate method or ensemble
    algorithms according to the task and data
    characteristics

30
Prospect
With the wings of DM,
find more, better or best knowledge!
Thank you for your attention!
31
  • Thank you !!!