Title: Computational and Statistical Learning Group CASTLE
Computational and Statistical Learning Group
(CASTLE)
- Professor Edgar Acuna
- Mathematics Department
- University of Puerto Rico at Mayaguez
- Partially supported by ONR Grant N0014-03-0359
Our research focuses on the development of computational and statistical methods for knowledge discovery in databases related to Pattern Recognition, Machine Learning, Bioinformatics, and Data Mining.
CASTLE's Members 1
CASTLE's Members 2
CASTLE's Alumni
- Alex Rojas (MS 2001, now at CMU)
- Adriana Lopez (MS 2002, now at U. Pittsburgh)
- Jose Vega (PhD 2004, now at the UPR Medical School)
- Santiago Velasco (MS 2004, now at UPRM)
CASTLE's Collaborators
- Ana Patricia Ortiz, Puerto Rico Cancer Center.
- Idhaliz Flores, Ponce Medical School, Puerto Rico.
- Jose Vega, University of Puerto Rico School of Medicine.
Research topics
- Data preprocessing
- Treatment of missing values in Data Mining
- Normalization in Data Mining
- Feature selection procedures for high-dimensional data: wrappers, filters, and hybrid methods, with application to Bioinformatics. Feature selection based on rough sets.
- Feature extraction procedures for high-dimensional data, such as Partial Least Squares and supervised Principal Components, with application to Bioinformatics and Chemometrics.
- Instance selection procedures, such as progressive sampling, to make the application of statistical methods and machine learning techniques feasible.
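As an illustration of two of the preprocessing steps listed above, the following is a minimal Python sketch of mean imputation for missing values followed by range (min-max) normalization. The function names `impute_mean` and `range_normalize` and the sample column are hypothetical, chosen only for this example. It shows one simple variant of each step and is not the group's own implementation.

```python
from statistics import mean


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = mean(observed)
    return [fill if x is None else x for x in column]


def range_normalize(column):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical feature column with one missing value.
ages = [20, None, 40, 60]
filled = impute_mean(ages)        # the missing entry becomes the mean, 40
scaled = range_normalize(filled)  # values rescaled to the unit interval
```

Mean imputation is only one of several possible treatments (others include median or nearest-neighbor imputation). It is used here because it keeps the sketch short.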
Research topics 2
- Regularization methods. Use of shrunken estimators for logistic regression, in order to work with classifiers that use all the features without performing feature selection or feature extraction. Application in Bioinformatics and Chemometrics.
- Outlier detection procedures, such as distance-based outliers and local density-based outliers. We will apply these procedures to network intrusion detection.
- Visualization procedures for Data Mining. We are enhancing graphics such as parallel coordinate plots, survey plots, and star coordinates that will allow us to perform exploratory data analysis prior to the application of a knowledge discovery technique.
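To make the idea of a distance-based outlier concrete, here is a minimal Python sketch under one common formulation: a point is flagged when at least a given fraction of the other points lie farther away than a given radius. The function name `distance_based_outliers`, its parameters, and the sample data are assumptions made for this example. The brute-force loop is quadratic in the number of points, so it is a sketch for small data rather than a method for large datasets.

```python
from math import dist


def distance_based_outliers(points, radius, min_fraction):
    """Return indices of points for which at least `min_fraction` of the
    other points lie farther away than `radius` (Euclidean distance)."""
    n = len(points)
    outliers = []
    for i, p in enumerate(points):
        # Count the other points that are farther than `radius` from p.
        far = sum(
            1 for j, q in enumerate(points)
            if j != i and dist(p, q) > radius
        )
        if n > 1 and far / (n - 1) >= min_fraction:
            outliers.append(i)
    return outliers


# Hypothetical data: four points clustered near the origin and one far away.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
flagged = distance_based_outliers(data, radius=1.0, min_fraction=0.9)
```

Here only the last point (index 4) is flagged, since all four other points are farther than the radius from it, while each clustered point has just one distant neighbor.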
Research topics 3
- Parallel Data Mining. The use of parallel computation will allow us to deal with very large datasets. In particular, we are developing parallel algorithms to perform data preprocessing tasks on very large datasets. Visualization and computation of meta-classifiers using parallelism are also being considered.
- Unsupervised learning to find features that behave similarly in various conditions and to find subgroups of instances that are similar to each other. In particular, we are interested in validation of clustering algorithms.
- We are investigating extensions of data mining tasks to the multi-relational case (multiple tables).
- Interface between the R statistical computing language and SQL in order to manipulate large datasets.
- Building a visual event programming environment on R, mainly to perform data preprocessing tasks.
Lab Equipment
- A 155-square-foot room.
- A cluster of five Dell workstations with dual Pentium Xeon processors, each running at 3.06 GHz and with 3MB of RAM.
- An HP Color LaserJet 4050 printer.
Recent Publications 1
- [1] Lozano, E. and Acuna, E. (2005). A 3D extension of the Star coordinates display.
- [2] Lozano, E. and Acuna, E. (2005). Parallel algorithms for distance-based and density local-based outliers. To appear in the Proceedings of the ICDM05 organized by the IEEE.
- [3] Vega, J. and Acuna, E. (2005). Generalizations of PLS for dimensionality reduction in supervised classification. Proceedings of the Fourth International Conference in Statistics and Related Fields. Hawaii.
- [4] Acuna, E. and Rodriguez, C. (2004). The effect of outliers on the misclassification error rate. Submitted to the IEEE Transactions on Knowledge and Data Engineering.
- [5] Acuna, E. and Rodriguez, C. (2004). The treatment of missing values and its effect in the classifier accuracy. In D. Banks, L. House, F.R. McMorris, P. Arabie, W. Gaul (Eds.). Classification, Clustering and Data Mining Applications. Springer-Verlag Berlin-Heidelberg, 639-648.
Recent Publications 2
- [6] Acuña, E. and Coaquira, F. (2003). A comparison of feature selection procedures for classifiers based on kernel density estimation. Proceedings of the International Conference on Computer, Communication and Control Technologies, CCCT03. Vol I, p. 462-467. Orlando, Florida.
- [7] Daza, L. and Acuña, E. (2003). Combining classifiers based on Gaussian Mixtures. Proceedings of the International Conference on Computer, Communication and Control Technologies, CCCT03. Vol I, p. 473-478. Orlando, Florida.
- [8] Lozano, E. and Acuña, E. (2002). Parallel computation of kernel density estimates classifiers and their ensembles. Proceedings of the International Conference on Computer, Communication and Control Technologies, CCCT03. Vol I, p. 473-478. Orlando, Florida.
Recent Publications 3
- [9] Acuña, E. (2003). A comparison of filters and wrappers for feature selection in supervised classification. Proceedings of the Interface 2003 Computing Science and Statistics. Vol 34.
- [10] Acuña, E., Rojas, A., and Coaquira, F. (2002). The effect of feature selection on combining classifiers based on kernel density estimates. In K. Jajuga, A. Sokołowski, H.-H. Bock (Eds.). Classification, Clustering and Data Analysis. Springer, Heidelberg, 161-168.
- [11] Acuña, E. (2002). Combining Classifiers based on Kernel density classifiers and Gaussian mixtures. Proceedings of the Interface 2002 Computing Science and Statistics. Vol 33.
Software
- The Dprep package: a library of 68 R functions that mainly perform data preprocessing tasks, including range normalization, discretization, handling of missing values, outlier detection, feature selection, and visualization.
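As a rough illustration of one of the preprocessing tasks named above, the following Python sketch performs equal-width discretization of a numeric feature. It is not taken from Dprep and does not reproduce its R interface. The function name `equal_width_bins`, its parameters, and the sample values are hypothetical, and equal-width binning is just one of several possible discretization schemes.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 using intervals of
    equal width spanning the observed range."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature falls entirely into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical feature values split into three bins of width 3.
bins = equal_width_bins([1.0, 2.0, 5.0, 9.0, 10.0], n_bins=3)
```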
CASTLE Members
- Elio Lozano: Parallel Data Mining
- Caroline Rodriguez (PR): Data preprocessing, Visualization
- Frida Coaquira: Rough sets for KDD
- Luis Daza: Instance Selection
- Trilce Encarnation: Databases
- Marggie Gonzalez: Cluster Validation
- Jaime Porras: Supervised Principal Components for classification
- Karen Prieto: Shrunken estimators for logistic regression
- Carlos Lopez: Bayesian network classifiers
- Carmen Saldana: Statistics
- Roxana Aparicio: Databases
- Sindy Diaz: Statistics