Automated Learning Group - PowerPoint PPT Presentation

Provided by: lisag5

1
Automated Learning Group
  • November 29, 2004

2
ALG Mission
  • The specific mission of the Automated Learning
    Group is:
  • To collaborate with researchers to develop novel
    computer methods and the scientific foundation
    for using historical data to improve future
    decision making,
  • To work closely with industrial, government, and
    academic partners to explore new application
    areas for such methods, and
  • To transfer the resulting software technology
    into real world applications

3
ALG Research, Development, Technology Transfer Model
4
What is It?
Overview of Knowledge Discovery
  • Knowledge Discovery in Databases is the
    non-trivial process of identifying valid, novel,
    potentially useful, and ultimately understandable
    patterns in data
  • The understandable patterns are used to
  • Make predictions about or classifications of new
    data
  • Explain existing data
  • Summarize the contents of a large database to
    support decision making
  • Create graphical data visualization to aid humans
    in discovering complex patterns

5
Why Do We Need Data Mining?
Overview of Knowledge Discovery
  • Data volumes are too large for classical analysis
    approaches
  • Large number of records (10^8 to 10^12 bytes)
  • High dimensional data (10^2 to 10^4 attributes)
  • How do you explore millions of records, tens or
    hundreds or thousands of fields, and find
    patterns?
  • As databases grow, the ability to use traditional
    query languages for the decision support process
    becomes infeasible
  • Many queries of interest are difficult to state
    in a query language (query formulation problem)
  • Find all cases of fraud
  • Find all individuals likely to buy a Ford Explorer
  • Find all documents that are similar to this
    customer's problem

6
Computational Knowledge Discovery
7
Knowledge Discovery Process
Overview of Knowledge Discovery
8
Required Effort for each KDD Step
Overview of Knowledge Discovery
  • Arrows indicate the direction we want the effort
    to go

9
Three Primary Paradigms
Overview of Knowledge Discovery
  • Predictive Modeling: supervised learning
    approach where classification or prediction of
    one of the attributes is desired
  • Classification is the prediction of predefined
    classes
  • Naive Bayesian, Decision Trees, and Neural
    Networks
  • Regression is the prediction of continuous data
  • Neural Networks, and Decision (Regression) Trees
  • Discovery: unsupervised learning approach for
    exploratory data analysis
  • Association Rules and Link Analysis
  • Clustering and Self Organizing Maps
  • Deviation Detection: identifying outliers in the
    data
  • Information Visualization

10
Advantages of a Framework for Analytics
  • Provides scalable environment from the Desktop to
    Web Services to Grid Services
  • Employs a visual programming system for data/work
    flow paradigm
  • Provides capability to build custom applications
  • Provides capability to access data management
    tools
  • Contains data mining algorithms for prediction
    and discovery
  • Provides data transformations for standard
    operations
  • Integrated environment for models and
    visualization
  • Supports an extensible interface for creating
    one's own algorithms
  • Provides access to distributed computing
    capabilities

11
D2K - Data To Knowledge
D2K Overview
  • D2K is a flexible data mining system that
    integrates effective analytical data mining
    methods for prediction, discovery, and anomaly
    detection with data management and information
    visualization

12
D2K and Its Many Components
D2K Overview
  • D2K Infrastructure
  • D2K API, data flow environment, distributed
    computing framework and runtime system
  • D2K Modules
  • Computational units written in Java that follow
    the D2K API
  • D2K Itineraries
  • Modules that are connected to form an application
  • D2K Toolkit
  • User interface for specification of itineraries
    and execution that provides the rapid application
    development environment
  • D2K-Driven Applications
  • Applications that use D2K modules with a custom
    user interface
  • D2K Streamline (SL)
  • Task driven system that uses D2K modules
  • D2K Web/Grid Services
  • Enables web deployment

13
D2K Streamline (D2K SL)
D2K SL
  • Provides step by step interface to guide user in
    data analysis
  • Supports return to earlier steps to run with
    different parameters
  • Uses the D2K infrastructure transparently
  • Uses same D2K modules
  • Provides way to capture different experiments

14
Success Story: Predictive Analytics
  • The Problem
  • Predict the number of products a customer will
    purchase in order to increase conquest,
    cross-sell, and upsell sales
  • The Solution
  • Built data models to predict what customers were
    ready to buy and how many
  • Computed customer buying propensities
  • The Results
  • Achieved increase of conquest customer sales lift
    by accurately predicting optimal groups for
    directed cross/upsell activity
  • Increase of more than 50 percent in the number of
    sales calls on potential customers.
  • The average number of engines sold to truck fleet
    customers rose 67 percent.
  • In 1998 the number of promising sales targets
    identified jumped to 35 percent, and the number
    of engines sold grew to 6.75 per truck fleet
    customer
  • Why it worked
  • It worked because we added analytics to a process
    that had been based on the estimation of experts.
    The shift from a process that was built on
    professional experience to one that was data
    driven reduced opportunities for
    misinterpretation of market dynamics.

15
Success Story: Predictive Analytics
  • The Problem
  • Predict the length of time a customer will keep a
    product (subscribe for a service)
  • The Solution
  • Built data models to predict customer behavior
  • Why it worked
  • It worked because we added analytics to a process
    that had been based on the estimation of experts.
    The shift from a process that was built on
    professional experience to one that was data
    driven reduced opportunities for
    misinterpretation of market dynamics.

16
Earth, Space, and Environmental Sciences
  • Grids are being built to work with distributed
    earth, space and environmental science data
    stores. A next step is to undertake distributed
    data analysis utilizing remote data.

EMO Analysis Environment
  • EMO: Evolutionary-based Multiobjective
    Optimization for Hazard Management
  • Barbara Minsker, Civil and Environmental
    Engineering
  • MAEViz: Multi-modal Data Integration and
    Information Visualization
  • Dan Abrams, Civil Engineering
  • MUSTSIM: Real-time Data Stream Fusion and
    Information Visualization
  • Amr Elnashi, Dan Kuchman, and Bill Spencer, Civil
    Engineering

17
Bioinformatics
  • Now that the human genome has been sequenced,
    attention is turning to the mining of proteomic
    and structural biological data and looking for
    patterns that arise when examining data from a
    wide variety of different omic data sets.

Phylomat
  • Phylomat
  • Rex Gaskins, Cell and Structural Biology
  • Disease Susceptibility
  • Larry Schook, Animal Sciences
  • Constructing Biological Networks
  • David Rivier, Cell and Structural Biology

18
Social Sciences and Humanities
  • Although science is leading the way, the
    exploring, analyzing, and mining of social
    science data stores is beginning to change these
    fields, too.

DISCUS Collaboration
  • Distributed Innovation and Scalable Collaboration
    in Uncertain Settings (DISCUS)
  • David Goldberg, General Engineering
  • Music Information Retrieval (MIR)
  • Stephen Downey, Graduate School LIS
  • Ticket To Work, Job Demands
  • Tayna Gallagher, College of Applied Life Studies
  • Mining Bugzilla
  • Les Gasser, Graduate School LIS
  • Multi-modal Global Economic Modeling
  • Gerald Nelson, AG and Consumer Economics
  • Concept Modeling in War Periodical
  • Bruce Rosenstock, LAS-Religion

19
Homeland Defense
  • Mining homeland defense data is difficult because
    the data is massive, distributed, complex and
    heterogeneous.

MAIDS Analysis
  • Mining Alarming Incidents in Data Streams
  • Jiawei Han, Computer Science
  • Distributed Innovation and Scalable Collaboration
    in Uncertain Settings
  • David Goldberg, General Engineering
  • NIBRS: Mining the National Incident-Based
    Reporting System
  • Tracy McGee, Illinois State Police
  • Intelligence Gathering from API News Feeds

20
Knowledge Extraction from Streaming Text
  • Information extraction
  • The process of using advanced automated machine
    learning approaches
  • to identify entities in text documents, and
  • to extract those entities along with the
    relationships they have in the text documents
  • This project demonstrates information extraction
    of names, places and organizations from real-time
    news feeds. As news articles arrive, the
    information is extracted and displayed.

21
D2K Web Service Architecture
D2K Web Service
  • Any web enabled client can connect to and use the
    D2K Web Service by sending SOAP messages over
    HTTP.
  • Itineraries and modules are stored on the web
    service machine and loaded over the network by
    the D2K Servers.
  • Job results are also stored in the web service
    tier.
  • Results are returned to clients upon request.
  • A relational database is used by the web service
    to lookup accounts, itineraries, servers, and
    jobs.
  • Remote D2K Servers handle itinerary processing.
    If possible, modules should load any data from
    remote locations.
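
A client request of the kind described above is an XML envelope sent in the body of an HTTP POST. The sketch below builds such an envelope with the Python standard library. The slides do not give the service's interface, so the operation name `runItinerary`, the element names `account` and `itinerary`, and the namespace `urn:example:d2k-service` are assumptions made purely for illustration; only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:d2k-service"  # hypothetical service namespace


def build_run_request(account: str, itinerary: str) -> bytes:
    """Build a SOAP 1.1 envelope for a hypothetical 'runItinerary' operation."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation and its parameters are illustrative, not the real interface.
    request = ET.SubElement(body, f"{{{SERVICE_NS}}}runItinerary")
    ET.SubElement(request, f"{{{SERVICE_NS}}}account").text = account
    ET.SubElement(request, f"{{{SERVICE_NS}}}itinerary").text = itinerary
    return ET.tostring(envelope, encoding="utf-8")


message = build_run_request("demo-user", "example-itinerary")
```

The resulting bytes would be posted to the service endpoint with a `Content-Type: text/xml` header; the service would then look up the account and itinerary and hand the job to a remote server.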

22
MAIDS Stream Mining Architecture
  • MAIDS aims to
  • Discover changes, trends and evolution
    characteristics in data streams
  • Construct clusters and classification models from
    data streams
  • Explore frequent patterns and similarities among
    data streams

23
MAIDS Stream Characteristics
Current ALG Projects
  • Huge volumes of continuous data, possibly
    infinite
  • Fast changing and requires fast, real-time
    response
  • Data streams capture today's data processing
    needs well
  • Random access is expensive: single linear scan
    algorithm (can only have one look)
  • Store only the summary of the data seen thus far
  • Most stream data are low-level or
    multi-dimensional in nature and need multi-level
    and multi-dimensional processing
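
The "one look" and "store only the summary" constraints above can be made concrete with a constant-size running summary. The sketch below uses Welford's online algorithm for the mean and variance; the class name `RunningSummary` is hypothetical, and this is a generic illustration rather than code from MAIDS.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Constant-size summary of all values seen so far."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Welford's online update: each value is examined exactly once.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 before any value has arrived.
        return self.m2 / self.count if self.count else 0.0


summary = RunningSummary()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    summary.update(value)
```

Memory use stays fixed no matter how long the stream runs, since no individual value is retained after `update` returns.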

24
Features of MAIDS
  • General purpose tool for data stream analysis
  • Processes high-rate and multi-dimensional data
  • Adopts a flexible tilted time window framework
  • Facilitates multi-dimensional analysis using a
    stream cube architecture
  • Integrates multiple data mining functions
  • Provides user-friendly interface: automatic
    analysis and on-demand analysis
  • Facilitates setting alarms for monitoring
  • Built in D2K as D2K modules and leveraged in the
    D2K Streamline tool

25
Statistics Query Engine
  • Answers user queries on data statistics, such as
    count, max, min, average, regression, etc.
  • Uses tilted time window
  • Uses an efficient data structure, the H-tree, for
    partial computation of data cubes
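
A tilted time window keeps recent history at fine granularity and older history at progressively coarser granularity, so the storage needed grows far more slowly than the stream. The sketch below is a simplified version under stated assumptions: the class name `TiltedTimeWindow` is hypothetical, slots are merged by summation (which suits counts and sums, while max or min would need a different merge), and a full level is rolled up all at once rather than sliding. It is not the MAIDS implementation.

```python
from typing import List


class TiltedTimeWindow:
    """Recent values at fine granularity, older values at coarse granularity."""

    def __init__(self, capacities: List[int]) -> None:
        # capacities[i]: how many level-i slots roll up into one level-(i+1)
        # slot; for the last level, how many slots are retained.
        self.capacities = capacities
        self.levels: List[List[float]] = [[] for _ in capacities]

    def add(self, value: float, level: int = 0) -> None:
        slots = self.levels[level]
        slots.append(value)
        if level == len(self.levels) - 1:
            # Coarsest level: forget the oldest slot once over capacity.
            if len(slots) > self.capacities[level]:
                slots.pop(0)
        elif len(slots) == self.capacities[level]:
            # Level is full: merge its slots into one coarser slot.
            merged = sum(slots)
            slots.clear()
            self.add(merged, level + 1)

    def total(self) -> float:
        return sum(sum(slots) for slots in self.levels)


# Assumed granularities: 4 quarter-hours -> 1 hour, 24 hours -> 1 day, 31 days kept.
window = TiltedTimeWindow([4, 24, 31])
for _ in range(10):
    window.add(1.0)  # one event counted in each quarter-hour
```

After ten quarter-hours the window holds two hourly slots and two quarter-hour slots, so a count query over the whole period can still be answered from four stored numbers.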

26
Stream Data Classifier
  • Builds models to make predictions
  • Uses Naïve Bayesian Classifier with boosting
  • Uses Tilted Time Window to track time related
    info
  • Sets alarm to monitor events
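
A Naive Bayesian classifier fits stream processing because training only updates counts, so each record can be discarded once it has been seen. The sketch below handles categorical features with add-one smoothing. It omits boosting and the tilted time window mentioned above, and the class name `StreamingNaiveBayes` and the toy weather-style records are assumptions for illustration only.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Set, Tuple


class StreamingNaiveBayes:
    """Naive Bayes over categorical features, trained one record at a time."""

    def __init__(self) -> None:
        self.class_counts: Dict[Hashable, int] = defaultdict(int)
        # (feature index, feature value, label) -> count
        self.feature_counts: Dict[Tuple[int, Hashable, Hashable], int] = defaultdict(int)
        self.values_seen: Dict[int, Set[Hashable]] = defaultdict(set)

    def learn(self, features: Sequence[Hashable], label: Hashable) -> None:
        # Only counts are stored, so the record itself is not kept.
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[(i, value, label)] += 1
            self.values_seen[i].add(value)

    def predict(self, features: Sequence[Hashable]) -> Hashable:
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / total)  # log prior
            for i, value in enumerate(features):
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = self.feature_counts.get((i, value, label), 0)
                n_values = len(self.values_seen[i]) or 1
                score += math.log((count + 1) / (n_label + n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = StreamingNaiveBayes()
stream = [
    (("sunny", "high"), "no"),
    (("sunny", "normal"), "yes"),
    (("rain", "high"), "no"),
    (("overcast", "normal"), "yes"),
    (("rain", "normal"), "yes"),
]
for features, label in stream:
    model.learn(features, label)
```

A monitoring alarm could be layered on top by checking each prediction against a class of interest as records arrive.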

27
Stream Pattern Finder
  • Find frequent patterns with multiple time
    granularities
  • Keep precise/compressed history in tilted time
    window
  • Mine only the item sets of interest using the
    FP-tree algorithm
  • Mine the evolution and dramatic changes of
    frequent patterns
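
An FP-tree compresses transactions into a prefix tree so that the support of a chosen item set can be read from the tree instead of rescanning the data. The sketch below builds the tree in the classic two-pass way over a small batch and looks up one item set of interest. It does not show the streaming or tilted-time-window extensions, and the function names `build_fp_tree` and `pattern_support` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


class FPNode:
    def __init__(self, item: Optional[str], parent: Optional["FPNode"]) -> None:
        self.item = item
        self.parent = parent
        self.count = 0
        self.children: Dict[str, "FPNode"] = {}


def build_fp_tree(transactions: List[List[str]],
                  min_support: int) -> Tuple[FPNode, Dict[str, List[FPNode]]]:
    """Return (root, header); header maps each frequent item to its nodes."""
    # Pass 1: count item supports and keep only the frequent items.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: n for i, n in support.items() if n >= min_support}
    root = FPNode(None, None)
    header: Dict[str, List[FPNode]] = defaultdict(list)
    # Pass 2: insert items in descending support order so prefixes are shared.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, dict(header)


def pattern_support(header: Dict[str, List[FPNode]], items: List[str]) -> int:
    """Support of one item set of interest, read from the tree."""
    if any(i not in header for i in items):
        return 0  # the set contains an infrequent item
    totals = {i: sum(n.count for n in header[i]) for i in items}
    # The item ordered last lies deepest on every path containing the set.
    deepest = max(items, key=lambda i: (-totals[i], i))
    others = set(items) - {deepest}
    found = 0
    for node in header[deepest]:
        path, ancestor = set(), node.parent
        while ancestor is not None and ancestor.item is not None:
            path.add(ancestor.item)
            ancestor = ancestor.parent
        if others <= path:
            found += node.count
    return found


data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
root, header = build_fp_tree(data, min_support=2)
```

Here `pattern_support(header, ["a", "c"])` walks only the paths that end in `c`, which is why restricting mining to item sets of interest can be much cheaper than enumerating every frequent pattern.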

28
Stream Data Clustering
  • Two stages: micro-clustering and macro-clustering
  • Uses micro-clustering to do incremental, online
    processing and maintenance
  • Uses tilted time frame
  • Detects outliers when new clusters are formed
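
The online micro-clustering stage can be sketched with cluster feature vectors: each micro-cluster stores only a count, a linear sum, and a squared sum, all of which can be updated incrementally. The sketch below assumes a fixed distance threshold and treats any point that opens a new micro-cluster as a potential outlier. The names `MicroCluster` and `process` and the threshold value are hypothetical, and the offline macro-clustering stage and the tilted time frame are not shown.

```python
import math
from typing import List, Sequence


class MicroCluster:
    """Cluster feature: count, linear sum, and squared sum per dimension."""

    def __init__(self, point: Sequence[float]) -> None:
        self.n = 1
        self.linear_sum = list(point)
        # Kept so that a radius could be derived later; unused in this sketch.
        self.squared_sum = [x * x for x in point]

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point: Sequence[float]) -> None:
        # Incremental update: the point itself is not stored.
        self.n += 1
        for d, x in enumerate(point):
            self.linear_sum[d] += x
            self.squared_sum[d] += x * x


def process(point: Sequence[float], clusters: List[MicroCluster],
            max_distance: float) -> bool:
    """Assign point to the nearest micro-cluster or open a new one.

    Returns True when a new micro-cluster was created, which this sketch
    treats as a potential outlier.
    """
    nearest, best = None, math.inf
    for cluster in clusters:
        dist = math.dist(point, cluster.centroid())
        if dist < best:
            nearest, best = cluster, dist
    if nearest is not None and best <= max_distance:
        nearest.absorb(point)
        return False
    clusters.append(MicroCluster(point))
    return True


clusters: List[MicroCluster] = []
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (20.0, 20.0)]
flags = [process(p, clusters, max_distance=1.0) for p in points]
```

A macro-clustering step would then run a conventional clustering algorithm over the micro-cluster centroids, weighted by their counts, whenever a user requests a result.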

29
The ALG Team
  • Staff
  • Loretta Auvil
  • Peter Bajcsy
  • Colleen Bushell
  • Dora Cai
  • David Clutter
  • Lisa Gatzke
  • Vered Goren
  • Chris Navarro
  • Greg Pape
  • Tom Redman
  • Barry Sanders
  • Duane Searsmith
  • Andrew Shirk
  • Anca Suvaiala
  • David Tcheng
  • Michael Welge
  • Students
  • John Cassel
  • Sang-Chul Lee
  • Xiaolei Li
  • Martin Urban
  • Bei Yu