From D2K to SEASR - PowerPoint PPT Presentation

Provided by: gocPrag
Tags: d2k | seasr | pjk


Transcript and Presenter's Notes

Title: From D2K to SEASR


1
  • From D2K to SEASR
  • Overview
  • September 27, 2007
  • Loretta Auvil
  • Automated Learning Group, NCSA
  • University of Illinois, Urbana-Champaign

2
ALG Mission
  • The specific mission of the Automated Learning
    Group is:
  • To collaborate with researchers to develop novel
    computer methods and the scientific foundation
    for using historical data to improve future
    decision making
  • To work closely with industrial, government, and
    academic partners to explore new application
    areas for such methods, and
  • To transfer the resulting software technology
    into real world applications

3
Knowledge Discovery Process
4
Required Effort for each KDD Step
  • Arrows indicate the direction we want the effort
    to go.

5
Three Primary Paradigms
  • Predictive Modeling: a supervised learning
    approach where classification or prediction of
    one of the attributes is desired.
      • Classification is the prediction of predefined
        classes, e.g. Naive Bayesian, Decision Trees,
        and Neural Networks
      • Regression is the prediction of continuous data,
        e.g. Neural Networks and Decision (Regression)
        Trees
  • Discovery: an unsupervised learning approach for
    exploratory data analysis, e.g. Association Rules,
    Link Analysis, Clustering, and Self Organizing Maps
  • Deviation Detection: identifying outliers in the
    data, e.g. Visualization
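
As an illustration of the predictive modeling paradigm, here is a minimal sketch of a Naive Bayesian classifier over categorical attributes. It is not D2K code: the function names, the Laplace smoothing, and the toy weather data are assumptions made for this example only.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def classify(model, row):
    """Return the class with the highest Laplace-smoothed log posterior."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            score += math.log((count + 1) / (n + len(values_seen[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
        ("rainy", "no"), ("overcast", "no"), ("overcast", "yes")]
labels = ["yes", "no", "no", "yes", "yes", "yes"]

model = train_naive_bayes(rows, labels)
prediction = classify(model, ("sunny", "no"))
```

The class attribute ("play?") is the predefined class being predicted, which is what makes this supervised; a discovery method such as clustering would receive the rows without the labels.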

6
D2K - Framework for Data Analysis
  • Provides scalable environment from the Desktop to
    Web Services
  • Employs a visual programming system for data/work
    flow paradigm
  • Provides capability to build custom applications
  • Provides capability to access data management
    tools
  • Contains data mining algorithms for prediction
    and discovery
  • Provides data transformations for standard
    operations
  • Integrated environment for models and
    visualization
  • Supports an extensible interface for creating
    one's own algorithms
  • Provides access to distributed computing
    capabilities

7
D2K Components
  • D2K Infrastructure
      • Itinerary Execution engine
  • D2K-Driven Applications
      • Applications that make use of the D2K
        Infrastructure
      • Toolkit is a D2K-Driven app
  • D2K Server
      • Special kind of D2K-Driven app
      • Wraps the infrastructure to provide remote
        itinerary and module execution
      • Used by the Toolkit to distribute module
        execution
  • D2K Web Service
      • Provides a generic programmatic interface for
        executing itineraries
      • Communicates with D2K Servers over socket
        connections using D2K-specific protocols

8
D2K Streamline (D2K SL)
  • Provides a step-by-step interface to guide the
    user in data analysis
  • Supports returning to earlier steps to rerun with
    different parameters
  • Uses the D2K infrastructure transparently
  • Uses the same D2K modules
  • Provides a way to capture different experiments
  • Defines templates that can be reused in different
    experiments

9
D2K Web Service Architecture
  • Any web enabled client can connect to and use the
    D2K Web Service by sending SOAP messages over
    HTTP.
  • Itineraries and modules are stored on the web
    service machine and loaded over the network by
    the D2K Servers.
  • Job results are also stored in the web service
    tier.
  • Results are returned to clients upon request.
  • A relational database is used by the web service
    to look up accounts, itineraries, servers, and
    jobs.
  • Remote D2K Servers handle itinerary processing.
    If possible, modules should load any data from
    remote locations.
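
To make "sending SOAP messages over HTTP" concrete, here is a minimal sketch that builds a SOAP 1.1 request envelope with Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the operation name runItinerary, and its parameters are hypothetical placeholders, since the slides do not describe the actual D2K Web Service interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:d2k"  # hypothetical; not the real D2K service namespace

def build_request(operation, params):
    """Build a SOAP 1.1 envelope whose Body holds a single operation element."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")

# Hypothetical operation and parameter names, for illustration only.
request_xml = build_request(
    "runItinerary",
    {"account": "demo", "itinerary": "NaiveBayes.itn"},
)
```

A client would send request_xml as the body of an HTTP POST to the service endpoint and later ask for the stored job results, matching the request/response flow in the bullets above.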

10
Creating Customer Value
  • Prediction: Industrial manufacturer
      • Computed customer buying propensities
      • Achieved 25% conquest customer sales lift by
        executing directed cross/upsell, resulting in
        $65 million in incremental revenue
  • Discovery: Automotive manufacturer
      • Identified patterns of inappropriate warranty
        work in dealer channel
      • Targeted $200M of potentially unnecessary annual
        expense
  • Monitoring: Department store retailer
      • Watched POS transaction flow for unusual
        variations
      • Deterred inappropriate behavior and fraudulent
        transactions
      • Resulted in savings of over $125 million
11
Application Examples
  • Comparative Genomics
      • Harris A. Lewin explains that Evolution Highway
        allows one to look ". . . at the whole genome at
        once - multiple chromosomes across multiple
        species. The insights wouldn't have come so
        quickly if we couldn't throw the data at this
        framework from NCSA."
      • Science, Vol. 309, Issue 5734, Pages 613-617,
        22 July 2005
  • Astronomy
      • Nicholas M. Ball, Robert J. Brunner, Adam D.
        Myers, and David Tcheng, Robust Machine Learning
        Applied to Astronomical Data Sets. I. Star-Galaxy
        Classification of the Sloan Digital Sky Survey
        DR3 Using Decision Trees, The Astrophysical
        Journal, Vol. 650, Part 1, Pages 497-509, 2006
  • Music Analysis
      • J. Stephen Downie, The Scientific Evaluation of
        Music Information Retrieval Systems: Foundations
        and Future, Computer Music Journal, Vol. 28, No.
        2, Pages 12-23, Summer 2004
12
D2K - Lineage
[Figure: lineage timeline, 1996 to 2009. Organizations
shown: NCSA, RiverGlass, Inc., One Llama. Technology areas
and the products labelled under them:]
  • Data Mining: D2K / Data to Knowledge, D2K Streamline
  • Text Mining: T2K / ThemeWeaver, Full Multi-language
  • Image Mining: I2K / Image to Knowledge
  • Audio Mining: M2K / Music to Knowledge
  • Stream Mining: MAIDS / Mining Alarming Incidents
    from Data Streams
  • Web Acquire: RiverGlass Recon
  • Inference Eng.: RiverGlass Detect
  • Fed. Query: RiverGlass Detect
  • Music Analysis: One Llama Media
  • Motion Mining, GeoSpatial, Sensors/RFID, Multimedia
  • Future Research, Technology, Applications
  • Interface, Visualization
13
D2K ToolKit
  1. Workspace
  2. Resource Panel
  3. Modules
  4. Models
  5. Itineraries
  6. Visualizations
  7. Generated Visualizations
  8. Generated Models
  9. Component Information
  10. Toolbar
  11. Console

14
D2K Basic
  • Set of D2K Modules to perform data mining
    techniques
  • Prediction
      • Decision Trees: C4.5 Decision Tree, Continuous
        Decision Tree, SQL Rain Forest Decision Tree
      • Naïve Bayesian Classification and SQL Naïve
        Bayesian Classification
      • Neural Networks
  • Discovery
      • Rule Association: Apriori, FP Growth, Htree
      • Clustering: Hierarchical Agglomerative, K-means,
        Coverage, etc.
  • Includes visualizations for many of the modeling
    approaches
  • Includes a set of data transformations: attribute
    selection, binning, filtering, attribute
    construction
  • Includes an optimization strategy for searching
    parameter space
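
For readers unfamiliar with the rule association algorithms named above, here is a minimal sketch of the frequent-itemset stage of Apriori. It is a textbook-style illustration, not the D2K module: the function name, the use of an absolute support count, and the basket data are assumptions made for this example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets seen in at least
    min_support transactions (an absolute count, not a fraction)."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset of a candidate must be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical market-basket data.
baskets = [{"bread", "milk"}, {"bread", "diapers", "beer"},
           {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"},
           {"bread", "milk", "beer"}]
frequent = apriori(baskets, min_support=3)
```

With this data every single item is frequent, but {bread, milk} is the only pair that reaches the threshold, so the search stops after the second level. Association rules would then be derived from the frequent itemsets.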

15
D2K Modules
  • Input Module: Loads data from the outside world.
      • Flat files, databases, etc.
  • Data Prep Module: Performs functions to select,
    clean, or transform the data.
      • Binning, Normalizing, Feature Selection, etc.
  • Compute Module: Performs main algorithmic
    computations.
      • Naïve Bayesian, Decision Tree, Apriori, FP
        Growth, etc.
  • User Input Module: Requires interaction with the
    user.
      • Data Selection, Input and Output selection, etc.
  • Output Module: Saves data to the outside world.
      • Flat files, databases, etc.
  • Visualization Module: Provides visual feedback to
    the user.
      • Naïve Bayesian, Rule Association, Decision Tree,
        Parallel Coordinates, 2D Scatterplot, 3D Surface
        Plot
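
The module types above can be pictured as stages in a data flow. The following is a minimal sketch of that idea, with each module modelled as a plain function and an itinerary as an ordered list of modules. Actual D2K modules are components with typed input and output ports wired together visually; every name here (run_itinerary, the module functions, the age data) is a hypothetical stand-in.

```python
import csv
import io
from collections import Counter

def input_module(text):
    """Input: load rows from a flat file (here, CSV text held in memory)."""
    return list(csv.DictReader(io.StringIO(text)))

def data_prep_module(rows):
    """Data prep: bin age into 0 (under 30), 1 (30 to 59), 2 (60 and over)."""
    edges = [30, 60]
    for row in rows:
        age = float(row["age"])
        row["age_bin"] = sum(age >= edge for edge in edges)
    return rows

def compute_module(rows):
    """Compute: count the rows that fall into each bin."""
    return Counter(row["age_bin"] for row in rows)

def output_module(counts):
    """Output: serialize the result as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["age_bin", "count"])
    for bin_index in sorted(counts):
        writer.writerow([bin_index, counts[bin_index]])
    return out.getvalue()

def run_itinerary(data, modules):
    """Pass the output of each module to the next one in order."""
    for module in modules:
        data = module(data)
    return data

raw = "name,age\nann,25\nbob,41\ncy,67\ndee,33\n"
report = run_itinerary(
    raw, [input_module, data_prep_module, compute_module, output_module]
)
```

A linear chain is the simplest case; a visual programming system also allows branching, for example feeding the same prepared data to both a compute module and a visualization module.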

16
D2K Module Icon Description
  • Module Progress Bar
      • Appears during execution to show the percentage
        of the entire execution time that this module
        has spent executing. It is green when the module
        is executing and red when not.
  • Input Port
      • Rectangular shapes on the left side of the
        module represent the inputs for the module. They
        are colored according to the data type that they
        represent.
  • Output Port
      • Rectangular shapes on the right side of the
        module represent the outputs for the module. They
        are colored according to the data type that they
        represent.
  • Properties Symbol
      • If a "P" is shown in the lower left corner of
        the module, then the module has properties that
        can be set before execution.

17
D2K Demo
18
SEASR: Research, Development, Technology
Transfer Model
19
SEASR: The Data Problem - Structured vs.
Unstructured
  • 20% Structured Data, 80% Unstructured Data
  • Today, 80% of business is conducted on
    unstructured information (Gartner Group)
  • 80% of the information needed is in the Open
    Source (NIA)
  • Workers spend 80% of the time gathering
    information (STIC, EMF)
  • www.fastsearch.com
20
SEASR
  • Software Environment for the Advancement of
    Scholarly Research (SEASR)
  • addresses the challenges of transforming
    information into knowledge by constructing the
    software bridges that are required to move from
    the unstructured and semi-structured data world
    to the structured data world.
  • aims to make collections more useful by
    integrating two well-known research and
    development frameworks, NCSA's Data-To-Knowledge
    (D2K) and IBM's Unstructured Information
    Management Architecture (UIMA), into an easily
    usable environment that researchers in any
    discipline can learn and adapt for their
    own unstructured data analysis.

21
SEASR Architecture
  • SEASR's advanced informatics tools will
    expand the technical capabilities of what is now
    available in the field by:
  • connecting data sources that are currently
    incompatible, whether due to different formats or
    protocols
  • offering all project components as open source,
    to enable users to modify and add to tools
  • allowing users to write analytic engines in their
    programming language of choice
  • installing on all hardware footprints, so that
    the tools can be brought to data sets where they
    are housed
  • creating a repository for components that will
    support sharing and publishing among users
  • enabling scalability so that components may run
    on a large variety of hardware footprints,
    including shared memory processors and clusters

22
SEASR Applications
NoraVis OpenLaszlo
DISCUS
SEASR
FeatureLens
M2K
23
NoraVis OpenLaszlo
24
FeatureLens n-gram patterns
Created by Anthony Don at
http://www.cs.umd.edu/hcil/textvis/featurelens/
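
As a rough illustration of what an n-gram pattern is, the sketch below counts word n-grams in a text and keeps those that repeat. It is only a simple stand-in for the kind of pattern FeatureLens visualizes, not its actual algorithm; the function names, whitespace tokenization, and sample sentence are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequent_ngrams(text, n, min_count):
    """Count word n-grams and keep those occurring at least min_count times."""
    tokens = text.lower().split()
    counts = Counter(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Hypothetical sample text.
text = "the cat sat on the mat and the cat sat on the rug"
patterns = frequent_ngrams(text, n=3, min_count=2)
```

Here three trigrams repeat: ("the", "cat", "sat"), ("cat", "sat", "on"), and ("sat", "on", "the"). Overlapping repeats like these are what make longer repeated passages visible in a text collection.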
25
Getting the Band Together
  • June 2007: Band formation
      • Project start date
      • More use ideas and framework discussions
  • December: First gig
      • Framework and data app demonstration
  • Vocals - Research & Technology
      • John Unsworth, Stephen Downie, Tim Wentling
      • Dan Roth, Jiawei Han, Kevin Chang, ChengXiang
        Zhai
  • Percussion & Bass - SEASR Development
      • Loretta Auvil, Tara Bazler, Duane Searsmith,
        Andrew Shirk, Students
  • Lead Designers/Developers/Application Areas
      • Humanities: M2K, Nora/Monk, and others (we heard
        about yesterday/today)
  • Need Groupies! (Advisors, Researchers,
    Developers, and Application Drivers) Loretta
    Auvil

26
SEASR: How can I participate?
  • Collaborate on application development or
    ontology creation
  • Contribute to component development for analytics
    or data access
  • Participate in visualization and UI design
  • Serve as an advisor
  • Contact Loretta Auvil (lauvil_at_uiuc.edu)

27
SEASR: Engineering Knowledge for the Humanities
  • Thank You