Title: Data Exploration with DAVIS
1Data Exploration with DAVIS
- Moon HUH (1), KwangRyeol SONG (2), YoungSuk PARK (1), KyungWook Shim
- (1) Sungkyunkwan University, Seoul, Korea
- (2) Kwansei Research Institute, Seoul, Korea
2Purpose of DAVIS
- To visually explore the structure or patterns of data
3Components of DAVIS
- Data Manipulation
- Statistical Tools
- Plots
- Graphic Controllers
4Data Manipulation
- Observation/variable selection
- Focusing/deleting a subset of data set
- Missing value process
- Discretization
5Plots - Univariate
- Bar Charts
- Histogram
- QQ Plot
- FEDF
- BoxPlot
- Parallel Coordinates
6BoxPlot Features
- Standardization
- Identification
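The slide does not say which standardization DAVIS applies before drawing box plots. As a minimal sketch, assuming the usual z-score (subtract the mean, divide by the sample standard deviation), variables measured on different scales can be brought onto a common axis as follows. The class and method names are hypothetical and are not part of DAVIS.

```java
// Sketch only: z-score standardization, assumed here because the slide
// does not specify the method. Names are illustrative, not DAVIS API.
final class StandardizeSketch {

    /** Returns (x[i] - mean) / sd, with sd the sample standard deviation (divisor n - 1). */
    static double[] zScores(double[] x) {
        int n = x.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least 2 values");
        }
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        double mean = sum / n;

        double sumSquares = 0.0;
        for (double v : x) {
            sumSquares += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(sumSquares / (n - 1));
        if (sd == 0.0) {
            throw new IllegalArgumentException("constant variable cannot be standardized");
        }

        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            z[i] = (x[i] - mean) / sd;
        }
        return z;
    }

    public static void main(String[] args) {
        // Toy values, not taken from the presentation.
        double[] z = zScores(new double[] {1.0, 2.0, 3.0});
        for (double v : z) {
            System.out.println(v);
        }
    }
}
```

After this transformation every variable has mean 0 and unit standard deviation, so side-by-side box plots compare shape and outliers rather than measurement units.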
7Parallel Coordinates Features
- Direction of plotting: Horizontal / Vertical
- Ordering of the variables: Component / Permutation
- Jittering
8Parallel Coordinates -options
9Plots-multivariate
- Scatterplot
- Loess curve fitting
- Touring
- Dendrogram
- Line Mosaic Plot
- PCA plot
10Scatterplot-options
11Touring - GrandTour/Tracking
12Dendrogram - Agglomeration/Distance options
13Line Mosaic Plot for discrete data
14PCA plot
15Real time grouping with DAVIS - hiliting
- Manually group the data set into 2 subsets by brushing a subset of the data with the mouse
- The original data set can always be restored
16Real time grouping with DAVIS - deleting/focusing
17Interactive Clustering with DAVIS - linking
18Clustering with DAVIS - EM with 3 groups
19Coloring a subset - outlier detection
20Touring with DAVIS - Tracking
- Can investigate the multidimensional structure of the data
21Data exploration with Decision Trees-Titanic data
22Decision Trees-2
23Variable selection with DAVIS
- Target (class) variable
  - discrete (nominal) type
- Candidate variables
  - nominal, numerical, and complex type
24Variable subset selection methods
- MDI (Lee and Huh, 2003): uses p-values for the test statistics between the 2 variables
- log (p-value) is suggested
- ReliefF (Kira and Rendell, 1992)
- $\mathrm{Relief}(X) = P(\text{different value of } X \mid \text{different class}) - P(\text{different value of } X \mid \text{same class})$
- Mutual Information (originated by Shannon, 1948, and used as a measure of dependence by Perez, 1957, Russian)
- Darbellay (1999, CSDA) gives a good survey of measures of statistical dependence using MI
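The Relief score listed above is a difference of two conditional probabilities, and a small example may make it concrete. The sketch below estimates both probabilities by counting over all pairs of observations for one discrete variable. This all-pairs estimate is an assumption made for illustration: Relief-type algorithms normally sample instances and compare them with their nearest neighbours, and nothing here is taken from the DAVIS implementation. The class and method names are hypothetical.

```java
// Sketch only: all-pairs estimate of
//   P(different value of X | different class) - P(different value of X | same class)
// for one discrete variable. Not the DAVIS or ReliefF implementation.
final class ReliefSketch {

    /** x[i] is the category code of variable X, cls[i] the class code of observation i. */
    static double score(int[] x, int[] cls) {
        if (x.length != cls.length) {
            throw new IllegalArgumentException("x and cls must have the same length");
        }
        long diffClassPairs = 0;
        long diffClassDiffValue = 0;
        long sameClassPairs = 0;
        long sameClassDiffValue = 0;

        for (int i = 0; i < x.length; i++) {
            for (int j = i + 1; j < x.length; j++) {
                boolean diffValue = x[i] != x[j];
                if (cls[i] != cls[j]) {
                    diffClassPairs++;
                    if (diffValue) {
                        diffClassDiffValue++;
                    }
                } else {
                    sameClassPairs++;
                    if (diffValue) {
                        sameClassDiffValue++;
                    }
                }
            }
        }
        if (diffClassPairs == 0 || sameClassPairs == 0) {
            throw new IllegalArgumentException(
                "need at least one same-class pair and one different-class pair");
        }
        return (double) diffClassDiffValue / diffClassPairs
             - (double) sameClassDiffValue / sameClassPairs;
    }

    public static void main(String[] args) {
        // Toy data, not taken from the presentation.
        int[] cls = {0, 0, 1, 1};
        int[] informative = {0, 0, 1, 1}; // X determines the class
        int[] constant = {0, 0, 0, 0};    // X carries no information
        System.out.println(score(informative, cls)); // prints 1.0
        System.out.println(score(constant, cls));    // prints 0.0
    }
}
```

A variable whose values differ mostly between classes scores near 1, while a variable that varies equally within and between classes scores near 0, which is what makes the score usable for ranking candidate variables.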
25Subset selection with DAVIS - ranking variables
- MDI (measure of departure from independence)
- ReliefF
- MI (measure of information)
26Subset selection with DAVIS-decision trees
27Subset Selection with DAVIS - stepwise discriminant analysis
- Continuous variables only
- Good under normality
28Subset Selection with DAVIS- Mutual Information
- Conventional approach
  - Discretization required
- Normal mixture approach
  - Good for continuous variables
- Incremental algorithm
  - Good for complex data
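The conventional approach can be sketched in two steps: discretize the continuous variable, then compute a plug-in estimate of mutual information from the resulting frequency table. The example below assumes equal-width binning and natural logarithms; the slides do not state which discretization or which estimator DAVIS uses, and the class and method names are hypothetical.

```java
// Sketch only: equal-width discretization followed by a plug-in estimate of
// mutual information I(X;Y) in nats. Binning rule and names are assumptions.
final class MutualInformationSketch {

    /** Maps each value to a bin index in [0, bins) using equal-width bins over [min, max]. */
    static int[] discretize(double[] values, int bins) {
        if (bins < 1 || values.length == 0) {
            throw new IllegalArgumentException("need bins >= 1 and at least one value");
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int b = (width == 0.0) ? 0 : (int) ((values[i] - min) / width);
            codes[i] = Math.min(b, bins - 1); // the maximum falls into the last bin
        }
        return codes;
    }

    /** Plug-in estimate of I(X;Y) from category codes x in [0, kx) and y in [0, ky). */
    static double mutualInformation(int[] x, int kx, int[] y, int ky) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("x and y must be non-empty and of equal length");
        }
        int n = x.length;
        double[][] joint = new double[kx][ky];
        double[] px = new double[kx];
        double[] py = new double[ky];
        for (int i = 0; i < n; i++) {
            joint[x[i]][y[i]] += 1.0 / n;
            px[x[i]] += 1.0 / n;
            py[y[i]] += 1.0 / n;
        }
        double mi = 0.0;
        for (int a = 0; a < kx; a++) {
            for (int b = 0; b < ky; b++) {
                if (joint[a][b] > 0.0) { // empty cells contribute nothing
                    mi += joint[a][b] * Math.log(joint[a][b] / (px[a] * py[b]));
                }
            }
        }
        return mi;
    }

    public static void main(String[] args) {
        // Toy values, not taken from the presentation.
        double[] x = {1.0, 2.0, 9.0, 10.0};
        int[] cls = {0, 0, 1, 1};
        int[] codes = discretize(x, 2);
        System.out.println(mutualInformation(codes, 2, cls, 2));
    }
}
```

The estimate depends on the number of bins, which is one reason the discretization requirement is listed as a limitation of the conventional approach.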
29Variable Selection with DAVIS-design layout
30Variable selection - Titanic data
- Variable ranking: sex, class, age
- Subset selection: age, class
31Concluding remarks
- DAVIS is a Java-based system
- Any statistical model can be added to the system as a visual component if it follows certain rules
- Need a more efficient design layout for the various strategies of variable selection
- Need to coin easier-to-understand terminology for the various elements of the component