Automated Learning Group presentation

1
Automated Learning Group

November 29, 2004

2
ALG Mission

The specific mission of the Automated Learning
Group is
To collaborate with researchers to develop novel
computer methods and the scientific foundation
for using historical data to improve future
decision making
To work closely with industrial, government, and
academic partners to explore new application
areas for such methods, and
To transfer the resulting software technology
into real world applications

3
ALG Research, Development, Technology Transfer
Model
4
What is It?
Overview of Knowledge Discovery

Knowledge Discovery in Databases is the
non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data
The understandable patterns are used to
Make predictions about or classifications of new
data
Explain existing data
Summarize the contents of a large database to
support decision making
Create graphical data visualization to aid humans
in discovering complex patterns

5
Why Do We Need Data Mining?
Overview of Knowledge Discovery

Data volumes are too large for classical analysis
approaches
Large number of records (10^8 to 10^12 bytes)
High dimensional data (10^2 to 10^4 attributes)
How do you explore millions of records, tens or
hundreds or thousands of fields, and find
patterns?
As databases grow, the ability to use traditional
query languages for the decision support process
becomes infeasible
Many queries of interest are difficult to state
in a query language (query formulation problem)
Find all cases of fraud
Find all individuals likely to buy a Ford Explorer
Find all documents that are similar to this
customer's problem

6
Computational Knowledge Discovery
7
Knowledge Discovery Process
Overview of Knowledge Discovery
8
Required Effort for each KDD Step
Overview of Knowledge Discovery

Arrows indicate the direction we want the effort
to go

9
Three Primary Paradigms
Overview of Knowledge Discovery

Predictive Modeling: supervised learning
approach where classification or prediction of
one of the attributes is desired
Classification is the prediction of predefined
classes
Naive Bayesian, Decision Trees, and Neural
Networks
Regression is the prediction of continuous data
Neural Networks, and Decision (Regression) Trees
Discovery: unsupervised learning approach for
exploratory data analysis
Association Rules and Link Analysis
Clustering and Self Organizing Maps
Deviation Detection: identifying outliers in the
data
Information Visualization
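
To make the deviation detection paradigm above concrete, here is a minimal sketch that flags values lying far from the mean. The z-score rule, the threshold of 2.0, and the function name find_outliers are assumptions chosen for illustration; the slides do not say which outlier method ALG uses.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Illustrative data only: seven similar readings and one extreme value.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = find_outliers(readings)
```

A single extreme value inflates both the mean and the standard deviation, so production systems often prefer more robust statistics; this sketch only shows the basic idea.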

10
Advantages of a Framework for Analytics

Provides scalable environment from the Desktop to
Web Services to Grid Services
Employs a visual programming system for data/work
flow paradigm
Provides capability to build custom applications
Provides capability to access data management
tools
Contains data mining algorithms for prediction
and discovery
Provides data transformations for standard
operations
Integrated environment for models and
visualization
Supports an extensible interface for creating
one's own algorithms
Provides access to distributed computing
capabilities

11
D2K - Data To Knowledge
D2K Overview

D2K is a flexible data mining system that
integrates effective analytical data mining
methods for prediction, discovery, and anomaly
detection with data management and information
visualization

12
D2K and Its Many Components
D2K Overview

D2K Infrastructure
D2K API, data flow environment, distributed
computing framework and runtime system
D2K Modules
Computational units written in Java that follow
the D2K API
D2K Itineraries
Modules that are connected to form an application
D2K Toolkit
User interface for specification of itineraries
and execution that provides the rapid application
development environment
D2K-Driven Applications
Applications that use D2K modules with a custom
user interface
D2K Streamline (SL)
Task driven system that uses D2K modules
D2K Web/Grid Services
Enables web deployment
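
The relationship between modules and itineraries described above can be sketched as a chain of small computational units. This is a hypothetical illustration in Python, not the D2K API: real D2K modules are written in Java, and the class names Module and Itinerary, their methods, and the linear chaining are all assumptions (an actual itinerary is a data-flow graph, not necessarily a straight line).

```python
class Module:
    """Hypothetical computational unit: one function from input to output."""
    def __init__(self, name, func):
        self.name = name
        self.func = func

    def run(self, data):
        return self.func(data)

class Itinerary:
    """Hypothetical chain of modules; each output feeds the next module."""
    def __init__(self, modules):
        self.modules = list(modules)

    def execute(self, data):
        for module in self.modules:
            data = module.run(data)
        return data

# Three illustrative modules connected to form a tiny application.
itinerary = Itinerary([
    Module("parse", lambda text: [float(x) for x in text.split(",")]),
    Module("drop_negatives", lambda xs: [x for x in xs if x >= 0]),
    Module("average", lambda xs: sum(xs) / len(xs)),
])
result = itinerary.execute("4,-1,8,6")
```

Because every unit exposes the same run interface, the same modules can be reused behind different front ends, which is the property the Toolkit, Streamline, and Web/Grid Services entries above rely on.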

13
D2K Streamline (D2K SL)
D2K SL

Provides step-by-step interface to guide user in
data analysis
Supports return to earlier steps to run with
different parameters
Uses the D2K infrastructure transparently
Uses same D2K modules
Provides way to capture different experiments

14
Success Story: Predictive Analytics

The Problem
Predict number of products a customer will
purchase to enable the increase of conquest,
cross, and upsell sales
The Solution
Built data models to predict what customers were
ready to buy and how many
Computed customer buying propensities
The Results
Increased conquest customer sales lift by
accurately predicting the optimal groups for
directed cross-sell and upsell activity
The number of sales calls on potential customers
increased by more than 50 percent.
The average number of engines sold to truck fleet
customers rose 67 percent.
In 1998 the number of promising sales targets
identified jumped to 35 percent, and the number
of engines sold grew to 6.75 per truck fleet
customer
Why it worked
It worked because we added analytics to a process
that had been based on the estimation of experts.
The shift from a process that was built on
professional experience to one that was data
driven reduced opportunities for
misinterpretation of market dynamics.

15
Success Story: Predictive Analytics

The Problem
Predict the length of time a customer will keep a
product (subscribe for a service)
The Solution
Built data models to predict customers' behavior
Why it worked
It worked because we added analytics to a process
that had been based on the estimation of experts.
The shift from a process that was built on
professional experience to one that was data
driven reduced opportunities for
misinterpretation of market dynamics.

16
Earth, Space, and Environmental Sciences

Grids are being built to work with distributed
earth, space and environmental science data
stores. A next step is to undertake distributed
data analysis utilizing remote data.

EMO Analysis Environment

EMO: Evolutionary-based Multiobjective
Optimization for Hazard Management
Barbara Minsker, Civil and Environmental
Engineering
MAEViz: Multi-modal Data Integration and
Information Visualization
Dan Abrams, Civil Engineering
MUSTSIM: Real-time Data Stream Fusion and
Information Visualization
Amr Elnashai, Dan Kuchma, and Bill Spencer, Civil
Engineering

17
Bioinformatics

Now that the human genome has been sequenced,
attention is turning to the mining of proteomic
and structural biological data and looking for
patterns that arise when examining data from a
wide variety of different "omic" data sets.

Phylomat

Phylomat
Rex Gaskins, Cell and Structural Biology
Disease Susceptibility
Larry Schook, Animal Sciences
Constructing Biological Networks
David Rivier, Cell and Structural Biology

18
Social Sciences and Humanities

Although science is leading the way, the
exploring, analyzing, and mining of social
science data stores is beginning to change these
fields, too.

DISCUS Collaboration

Distributed Innovation and Scalable Collaboration
in Uncertain Settings (DISCUS)
David Goldberg, General Engineering
Music Information Retrieval (MIR)
Stephen Downie, Graduate School LIS
Ticket To Work, Job Demands
Tanya Gallagher, College of Applied Life Studies
Mining Bugzilla
Les Gasser, Graduate School LIS
Multi-modal Global Economic Modeling
Gerald Nelson, AG and Consumer Economics
Concept Modeling in War Periodical
Bruce Rosenstock, LAS-Religion

19
Homeland Defense

Mining homeland defense data is difficult because
the data is massive, distributed, complex and
heterogeneous.

MAIDS Analysis

Mining Alarming Incidents in Data Streams
Jiawei Han, Computer Science
Distributed Innovation and Scalable Collaboration
in Uncertain Settings
David Goldberg, General Engineering
NIBRS: Mining the National Incident-Based
Reporting System
Tracy McGee, Illinois State Police
Intelligence Gathering from API News Feeds

20
Knowledge Extraction from Streaming Text

Information extraction is the process of using
advanced automated machine learning approaches
To identify entities in text documents
To extract this information along with the
relationships these entities may have in the text
documents
This project demonstrates information extraction
of names, places and organizations from real-time
news feeds. As news articles arrive, the
information is extracted and displayed.

21
D2K Web Service Architecture
D2K Web Service

Any web enabled client can connect to and use the
D2K Web Service by sending SOAP messages over
HTTP.
Itineraries and modules are stored on the web
service machine and loaded over the network by
the D2K Servers.
Job results are also stored in the web service
tier.
Results are returned to clients upon request.
A relational database is used by the web service
to lookup accounts, itineraries, servers, and
jobs.
Remote D2K Servers handle itinerary processing.
If possible, modules should load any data from
remote locations.
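
As a rough illustration of the "SOAP messages over HTTP" point above, the sketch below builds a SOAP 1.1 envelope using only the Python standard library. The operation name runItinerary, the element name itinerary, and the urn:example:d2k namespace are hypothetical; the slides do not document the actual D2K Web Service operations, and nothing is sent over the network here.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:d2k"  # hypothetical, not the real service namespace

def build_request(itinerary_name):
    """Build a SOAP envelope for a hypothetical runItinerary operation."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}runItinerary")
    ET.SubElement(call, f"{{{SERVICE_NS}}}itinerary").text = itinerary_name
    return ET.tostring(envelope, encoding="unicode")

message = build_request("example-itinerary")
```

A client would then POST this string to the service endpoint as XML; the endpoint URL and the shape of the response are not given in the slides, so they are left out.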

22
MAIDS Stream Mining Architecture

MAIDS aims to
Discover changes, trends and evolution
characteristics in data streams
Construct clusters and classification models from
data streams
Explore frequent patterns and similarities among
data streams

23
MAIDS Stream Characteristics
Current ALG Projects

Huge volumes of continuous data, possibly
infinite
Fast changing and requires fast, real-time
response
The data stream model captures today's data
processing needs well
Random access is expensive: single linear scan
algorithms (can only have one look)
Store only the summary of the data seen thus far
Most stream data are low-level or
multi-dimensional in nature and need multi-level,
multi-dimensional processing
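
The "one look" and "store only the summary" constraints above can be sketched with a running summary that is updated once per arriving value and never stores the stream itself. The class name RunningSummary is an assumption, and the mean and variance update uses Welford's online algorithm as a stand-in; the slides do not say how MAIDS maintains its summaries.

```python
class RunningSummary:
    """Keeps count, mean, variance, min, and max of a stream in O(1) memory."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, x):
        """Fold one value into the summary (Welford's online update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.count if self.count else 0.0

summary = RunningSummary()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # illustrative stream
    summary.update(value)
```

Each value is touched exactly once, so the same code works whether the stream has eight values or never ends.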

24
Features of MAIDS

General purpose tool for data stream analysis
Processes high-rate and multi-dimensional data
Adopts a flexible tilted time window framework
Facilitates multi-dimensional analysis using a
stream cube architecture
Integrates multiple data mining functions
Provides user-friendly interface: automatic
analysis and on-demand analysis
Facilitates setting alarms for monitoring
Built in D2K as D2K modules and leveraged in the
D2K Streamline tool

25
Statistics Query Engine

Answers user queries on data statistics, such as
count, max, min, average, regression, etc.
Uses tilted time window
Uses an efficient data structure, the H-tree, for
partial computation of data cubes
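
The tilted time window mentioned above keeps recent history at fine granularity and older history at coarser granularity. The sketch below is a simplified roll-up under stated assumptions: the level names and sizes (4 quarters, 24 hours, 31 days), the class name, and the choice to clear a level once it rolls up are all illustrative, and this is not the MAIDS implementation.

```python
from collections import deque

class TiltedTimeWindow:
    """Stores counts per time unit, merging full levels into coarser units."""
    def __init__(self, levels=(("quarter", 4), ("hour", 24), ("day", 31))):
        self.names = [name for name, _ in levels]
        self.capacity = [size for _, size in levels]
        self.slots = [deque() for _ in levels]

    def add(self, count, level=0):
        """Record one unit at the given level, rolling up when it fills."""
        slots = self.slots[level]
        slots.append(count)
        is_last = level == len(self.slots) - 1
        if is_last:
            if len(slots) > self.capacity[level]:
                slots.popleft()  # coarsest level: forget the oldest unit
        elif len(slots) == self.capacity[level]:
            total = sum(slots)
            slots.clear()
            self.add(total, level + 1)  # e.g. 4 quarters become 1 hour

    def snapshot(self):
        return {name: list(s) for name, s in zip(self.names, self.slots)}

window = TiltedTimeWindow()
for _ in range(10):  # ten quarter-hour units, one event in each
    window.add(1)
```

After ten quarter-hour units the window holds two hourly totals and two recent quarters, so memory grows with the number of levels rather than with the length of the stream.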

26
Stream Data Classifier

Builds models to make predictions
Uses Naïve Bayesian Classifier with boosting
Uses Tilted Time Window to track time related
info
Sets alarm to monitor events

27
Stream Pattern Finder

Finds frequent patterns with multiple time
granularities
Keeps precise/compressed history in tilted time
window
Mines only the item sets of interest using the
FP-tree algorithm
Mines evolution and dramatic changes of frequent
patterns
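
The idea of mining only the item sets of interest can be sketched with a single pass that counts how often each watched item set appears. This is plain brute-force counting, swapped in for brevity; it is not the FP-tree algorithm named above, and the function name and sample data are hypothetical.

```python
from collections import Counter

def count_watched_itemsets(transactions, watched):
    """Count how many transactions contain each watched item set."""
    watched = [frozenset(items) for items in watched]
    support = Counter()
    for transaction in transactions:  # single pass over the stream
        basket = set(transaction)
        for itemset in watched:
            if itemset <= basket:  # every watched item is in the basket
                support[itemset] += 1
    return support

transactions = [["a", "b", "c"], ["a", "c"], ["b", "d"], ["a", "b", "c", "d"]]
support = count_watched_itemsets(transactions, [("a", "c"), ("b", "d")])
```

Restricting the search to a watch list keeps the cost per transaction proportional to the number of watched item sets, at the price of not discovering patterns nobody asked about.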

28
Stream Data Clustering

Two stages: micro-clustering and macro-clustering
Uses micro-clustering to do incremental, online
processing and maintenance
Uses tilted time frame
Detects outliers when new clusters are formed

29
The ALG Team

Staff
Loretta Auvil
Peter Bajcsy
Colleen Bushell
Dora Cai
David Clutter
Lisa Gatzke
Vered Goren
Chris Navarro
Greg Pape
Tom Redman
Barry Sanders
Duane Searsmith
Andrew Shirk
Anca Suvaiala
David Tcheng
Michael Welge

Students
John Cassel
Sang-Chul Lee
Xiaolei Li
Martin Urban
Bei Yu
