Title: From D2K to SEASR
1 From D2K to SEASR
- Overview
- September 27, 2007
- Loretta Auvil
- Automated Learning Group, NCSA
- University of Illinois, Urbana-Champaign
2 ALG Mission
- The specific mission of the Automated Learning Group is:
  - To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making
  - To work closely with industrial, government, and academic partners to explore new application areas for such methods
  - To transfer the resulting software technology into real-world applications
3 Knowledge Discovery Process
4 Required Effort for Each KDD Step
- Arrows indicate the direction we want the effort to go.
5 Three Primary Paradigms
- Predictive Modeling: supervised learning approach where classification or prediction of one of the attributes is desired.
  - Classification is the prediction of predefined classes
    - e.g. Naive Bayesian, Decision Trees, and Neural Networks
  - Regression is the prediction of continuous data
    - e.g. Neural Networks and Decision (Regression) Trees
- Discovery: unsupervised learning approach for exploratory data analysis.
  - e.g. Association Rules, Link Analysis, Clustering, and Self Organizing Maps
- Deviation Detection: identifying outliers in the data.
  - e.g. Visualization
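Two of the paradigms above can be illustrated with a minimal Python sketch. It is not D2K code: the function names are hypothetical, the classifier is a one-split decision tree (the simplest member of the decision-tree family named on the slide), and deviation detection is done here with a z-score cutoff rather than the visualization the slide mentions.

```python
from collections import Counter
from statistics import mean, stdev


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(values, labels):
    """Predictive modeling: fit a one-split decision tree on one numeric attribute."""
    best = None
    # The largest value is skipped as a threshold because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        raise ValueError("need at least two distinct attribute values")
    _, threshold, left_label, right_label = best
    return lambda v: left_label if v <= threshold else right_label


def find_outliers(values, z_cutoff=2.0):
    """Deviation detection: flag values more than z_cutoff sample std devs from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > z_cutoff * sigma]


# Made-up illustrative data: customer age versus a yes/no purchase label.
classify = fit_stump([22, 25, 31, 38, 45, 52], ["no", "no", "no", "yes", "yes", "yes"])
print(classify(28), classify(60))
print(find_outliers([10, 11, 9, 10, 12, 10, 11, 50]))
```

The classifier needs labeled examples (supervised), while the outlier check needs only the values themselves.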
6 D2K - Framework for Data Analysis
- Provides scalable environment from the Desktop to Web Services
- Employs a visual programming system for data/work flow paradigm
- Provides capability to build custom applications
- Provides capability to access data management tools
- Contains data mining algorithms for prediction and discovery
- Provides data transformations for standard operations
- Integrated environment for models and visualization
- Supports an extensible interface for creating one's own algorithms
- Provides access to distributed computing capabilities
7 D2K Components
- D2K Infrastructure
  - Itinerary execution engine
- D2K-Driven Applications
  - Applications that make use of the D2K Infrastructure
  - Toolkit is a D2K-Driven app
- D2K Server
  - Special kind of D2K-Driven app
  - Wraps the infrastructure to provide remote itinerary and module execution
  - Used by the Toolkit to distribute module execution
- D2K Web Service
  - Provides a generic programmatic interface for executing itineraries
  - Communicates with D2K Servers over socket connections using D2K-specific protocols
8 D2K Streamline (D2K SL)
- Provides step-by-step interface to guide user in data analysis
- Supports return to earlier steps to run different parameters
- Uses the D2K infrastructure transparently
- Uses same D2K modules
- Provides way to capture different experiments
- Define templates that can be reused in different experiments
9 D2K Web Service Architecture
- Any web-enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP.
- Itineraries and modules are stored on the web service machine and loaded over the network by the D2K Servers.
- Job results are also stored in the web service tier.
- Results are returned to clients upon request.
- A relational database is used by the web service to look up accounts, itineraries, servers, and jobs.
- Remote D2K Servers handle itinerary processing. If possible, modules should load any data from remote locations.
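As a rough illustration of "sending SOAP messages over HTTP", the sketch below builds a SOAP 1.1 envelope with Python's standard library. Only the envelope namespace is standard; the service namespace, the `runItinerary` operation, and the `account`, `itinerary`, and `parameter` elements are assumptions made for this example and are not taken from the actual D2K Web Service interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:d2k-ws"  # hypothetical namespace, not the real service's


def build_run_request(account, itinerary, parameters):
    """Build a SOAP request asking a (hypothetical) service to run an itinerary."""
    ET.register_namespace("soap", SOAP_NS)
    ET.register_namespace("d2k", SERVICE_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}runItinerary")  # hypothetical operation
    ET.SubElement(call, f"{{{SERVICE_NS}}}account").text = account
    ET.SubElement(call, f"{{{SERVICE_NS}}}itinerary").text = itinerary
    for name, value in parameters.items():
        param = ET.SubElement(call, f"{{{SERVICE_NS}}}parameter", name=name)
        param.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Made-up account, itinerary name, and parameter for illustration.
print(build_run_request("demo", "naive-bayes.itn", {"bins": 10}))
```

Under SOAP 1.1 a client would send such a message as an HTTP POST with a `text/xml` content type and a `SOAPAction` header; the request is only built here, not sent.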
10 Creating Customer Value
- Prediction: Industrial manufacturer
  - Computed customer buying propensities
  - Achieved 25% conquest customer sales lift by executing directed cross/upsell, resulting in $65 million in incremental revenue
- Discovery: Automotive manufacturer
  - Identified patterns of inappropriate warranty work in dealer channel
  - Targeted $200M of potentially unnecessary annual expense
- Monitoring: Department store retailer
  - Watched POS transaction flow for unusual variations
  - Deterred inappropriate behavior and fraudulent transactions
  - Resulted in savings of over $125 million
11 Applications Examples
Comparative Genomics
- Harris A. Lewin explains that Evolution Highway allows one to look ". . . at the whole genome at once - multiple chromosomes across multiple species. The insights wouldn't have come so quickly if we couldn't throw the data at this framework from NCSA."
- Science, Vol. 309, Issue 5734, Pages 613-617, 22 July 2005
Astronomy
- Nicholas M. Ball, Robert J. Brunner, Adam D. Myers, and David Tcheng, Robust Machine Learning Applied to Astronomical Data Sets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees, The Astrophysical Journal, Vol. 650, Part 1, Pages 497-509, 2006
Music Analysis
- J. Stephen Downie, The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future, Computer Music Journal, Vol. 28, No. 2, Pages 12-23, Summer 2004
12 D2K - Lineage
[Figure: lineage timeline, 1996-2009. Organizations shown: NCSA, RiverGlass, Inc., and One Llama. A final band is labeled "Future Research, Technology, Applications". Technology areas and the products shown with them:]
- Data Mining: D2K / Data to Knowledge, D2K Streamline
- Text Mining: T2K / ThemeWeaver, Full Multi-language
- Image Mining: I2K / Image to Knowledge
- Audio Mining: M2K / Music to Knowledge
- Stream Mining: MAIDS / Mining Alarming Incidents from Data Streams
- Web Acquire: RiverGlass Recon
- Inference Eng.: RiverGlass Detect
- Fed. Query: RiverGlass Detect
- Motion Mining
- Music Analysis: One Llama Media
- GeoSpatial
- Sensors/RFID
- Multimedia
- Interface
- Visualization
13 D2K ToolKit
- Workspace
- Resource Panel
- Modules
- Models
- Itineraries
- Visualizations
- Generated Visualizations
- Generated Models
- Component Information
- Toolbar
- Console
14 D2K Basic
- Set of D2K Modules to perform data mining techniques
- Prediction
  - Decision Trees
    - C4.5 Decision Tree, Continuous Decision Tree, SQL Rain Forest Decision Tree
  - Naïve Bayesian Classification and SQL Naïve Bayesian Classification
  - Neural Networks
- Discovery
  - Rule Association
    - Apriori, FP Growth, Htree
  - Clustering
    - Hierarchical Agglomerative, Kmeans, Coverage, etc.
- Includes visualizations for many of the modeling approaches
- Includes a set of data transformations
  - Attribute selection, binning, filtering, attribute construction
- Includes optimization strategy for searching parameter space
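To make the Rule Association entry concrete, here is a minimal sketch of Apriori frequent-itemset mining in Python. It follows the textbook algorithm (join, prune, count) and makes no claim about how the D2K module is implemented; the function name and the sample transactions are made up for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count the transactions containing each candidate and keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    frequent = count({frozenset([i]) for i in items})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: merge pairs of frequent (k-1)-itemsets that differ by one item.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
for itemset, n in sorted(apriori(baskets, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Association rules would then be derived from these frequent itemsets by comparing the support of an itemset with that of its subsets, a step omitted here.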
15 D2K Modules
- Input Module: Loads data from the outside world.
  - Flat files, databases, etc.
- Data Prep Module: Performs functions to select, clean, or transform the data.
  - Binning, Normalizing, Feature Selection, etc.
- Compute Module: Performs main algorithmic computations.
  - Naïve Bayesian, Decision Tree, Apriori, FP Growth, etc.
- User Input Module: Requires interaction with the user.
  - Data Selection, Input and Output selection, etc.
- Output Module: Saves data to the outside world.
  - Flat files, databases, etc.
- Visualization Module: Provides visual feedback to the user.
  - Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot
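The module roles above can be sketched as a small data-flow pipeline. The Python below is only an analogy: D2K modules are wired together visually in an itinerary, and none of these function names come from the D2K API. It chains an input step, a data-prep step (equal-width binning), a compute step, and an output step.

```python
import csv
import io
from collections import Counter


def input_module(csv_text):
    """Input: load rows from a flat file (here, CSV text held in memory)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def binning_module(rows, column, n_bins):
    """Data prep: replace a numeric column with an equal-width bin index."""
    values = [float(r[column]) for r in rows]
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    for row, v in zip(rows, values):
        # The maximum value is clamped into the last bin.
        row[column] = min(int((v - low) / width), n_bins - 1)
    return rows


def count_module(rows, column, target):
    """Compute: count how often each target class occurs in each bin."""
    return Counter((r[column], r[target]) for r in rows)


def output_module(counts):
    """Output: serialize the result as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["bin", "class", "count"])
    for (bin_index, label), n in sorted(counts.items()):
        writer.writerow([bin_index, label, n])
    return out.getvalue()


def run_itinerary(csv_text):
    """Wire the modules together in a fixed order, as an itinerary would."""
    rows = input_module(csv_text)
    rows = binning_module(rows, "age", n_bins=2)
    return output_module(count_module(rows, "age", "buys"))


# Made-up sample data for illustration.
print(run_itinerary("age,buys\n20,no\n30,no\n40,yes\n60,yes\n"))
```

Each step consumes only what the previous step produced, so steps can be swapped or reused, which is the property a module-and-itinerary design relies on.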
16 D2K Module Icon Description
- Module Progress Bar
  - Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not.
- Input Port
  - Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.
- Properties Symbol
  - If a "P" is shown in the lower left corner of the module, then the module has properties that can be set before execution.
- Output Port
  - Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.
17 D2K Demo
18 SEASR: Research, Development, Technology Transfer Model
19 SEASR: The Data Problem - Structured vs. Unstructured
- 20% Structured Data
- 80% Unstructured Data
- Today, 80% of business is conducted on unstructured information (Gartner Group)
- 80% of the information needed is in the Open Source (NIA)
- Workers spend 80% of the time gathering information (STIC, EMF)
www.fastsearch.com
20 SEASR
- Software Environment for the Advancement of Scholarly Research (SEASR)
  - addresses the challenges of transforming information into knowledge by constructing the software bridges that are required to move from the unstructured and semi-structured data world to the structured data world.
  - aims to make collections more useful by integrating two well-known research and development frameworks, NCSA's Data-To-Knowledge (D2K) and IBM's Unstructured Information Management Architecture (UIMA), into an easily usable environment that researchers in any discipline can learn and adapt for their own unstructured data analysis.
21 SEASR Architecture
- SEASR's advanced informatics tools will expand the technical capabilities of what is now available in the field by
  - connecting data sources that are currently incompatible, whether due to different formats or protocols
  - offering all project components as open source, to enable users to modify and add to tools
  - allowing users to write analytic engines in their programming language of choice
  - installing on all hardware footprints, so that the tools can be brought to data sets where they are housed
  - creating a repository for components that will support sharing and publishing among users
  - enabling scalability so that components may run on a large variety of hardware footprints, including shared memory processors and clusters
22 SEASR Applications
NoraVis OpenLaszlo
DISCUS
SEASR
FeatureLens
M2K
23 NoraVis OpenLaszlo
24 FeatureLens n-gram patterns
Created by Anthony Don at http://www.cs.umd.edu/hcil/textvis/featurelens/.
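For readers unfamiliar with n-grams, the following Python sketch counts word n-grams in a text. It is not FeatureLens code and says nothing about how FeatureLens finds its patterns; the tokenizer (lowercased runs of letters and apostrophes) and the function name are assumptions made for this example.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count each run of n consecutive words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Zipping n shifted copies of the word list yields every window of n words.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


counts = ngram_counts("The cat sat on the mat and the cat slept", 2)
print(counts.most_common(3))
```

Turning free text into a table of counts like this is one small example of moving from unstructured to structured data.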
25 Getting the Band Together
- June 2007: Band formation
  - Project start date
  - More use ideas and framework discussions
- December: First gig
  - Framework and data app demonstration
- Vocals - Research & Technology
  - John Unsworth, Stephen Downie, Tim Wentling
  - Dan Roth, Jiawei Han, Kevin Chang, ChengXiang Zhai
- Percussions & Bass - SEASR Development
  - Loretta Auvil, Tara Bazler, Duane Searsmith, Andrew Shirk, Students
- Lead Designers/Developer/Applications Areas
  - Humanities: M2K, Nora/Monk and others (we heard about yesterday/today)
- Need Groupies! (Advisors, Researchers, Developers, and Application Drivers)
Loretta Auvil
26 SEASR: How can I participate?
- Collaborate on application development or ontology creation
- Contribute to component development for analytics or data access
- Participate in visualization and UI design
- Serve as an advisor
- Contact Loretta Auvil (lauvil_at_uiuc.edu)
27 SEASR: Engineering Knowledge for the Humanities