Title: Integrating Data Mining with SQL Databases: OLE DB for Data Mining
1. Integrating Data Mining with SQL Databases: OLE DB for Data Mining
2. Overview
- Background knowledge
- Overview and design philosophy
- Basic components
- Operations on data model
- Concluding remarks
3. What is Data Mining?
- Definition: the discovery of useful summaries of data.
- Some example applications:
  - Decision trees constructed from bank-loan histories to produce algorithms that decide whether to grant a loan.
  - Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.
  - "Diapers and beer" associations in supermarkets, used to increase sales.
  - Comparison of genotypes with and without a condition allowed the discovery of genes that account for diseases.
4. Why Do We Need OLE DB?
- Microsoft created OLE DB to address the following needs:
- A vast amount of critical information is found outside traditional corporate databases: in file systems, in personal tools such as Microsoft Access, in spreadsheets and e-mail, and even on the World Wide Web.
- To take advantage of the benefits of database technology, such as queries, transactions, and security, businesses traditionally had to move the data from its original container into a DBMS. This process is expensive and redundant.
- Furthermore, businesses need to exploit the advantages of database technology not only when accessing data within a DBMS but also when accessing data from any other type of information container.
5. What is OLE DB?
- Object Linking and Embedding for Databases (OLE DB) is a means Microsoft uses for accessing different types of data stores in a uniform manner. OLE DB is now an open technology, available royalty-free on many operating systems.
- OLE DB is a set of Component Object Model (COM) interfaces that provide applications with uniform access to data stored in diverse information sources, and that also provide the ability to implement additional database services.
- OLE DB is the way to access data in a Microsoft COM environment.
- Reference: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/oledb/htm/oledboverview_of_ole_db.asp
6. Overview
- Background knowledge
- Overview and design philosophy
- Basic components
- Operations on data model
- Concluding remarks
7. Why OLE DB for Data Mining?
- An industry standard is critical for data mining development, usage, interoperability, and exchange.
- The objects, data types, properties, and programming interfaces of OLE DB naturally cater to these needs.
- Building mining applications over relational databases is nontrivial:
  - It requires different customized data mining algorithms and methods.
  - It demands significant work on the part of application builders.
- Data providers: all structured data sources, i.e., data that support OLE DB.
- Data consumers: all development tools or languages requiring access to a broad range of data.
8. Motivation for OLE DB for DM
- Facilitate deployment of data mining models
- Generate data mining models
- Store, maintain, and refresh models as data are updated
- Programmatically use a model on other data sets
- Browse models
- Enable enterprise application developers to participate in building data mining solutions
9. Features of OLE DB for DM
- Independent of any particular provider or software
- Not specialized to any specific mining model
- Structured to cater to all well-known mining models
- Part of the upcoming release of Microsoft SQL Server 2000
- Does not propose new data mining algorithms; instead, it suggests an infrastructure into which any algorithm can be plugged
10. OLE DB for DM Overview
- The core relational engine exposes OLE DB as a language-based API.
- The analysis server exposes OLE DB for OLAP and OLE DB for DM.
- Maintains the SQL metaphor.
- Reuses existing notions.
- Architecture stack (top to bottom): data mining applications -> OLE DB for DM -> Analysis Server -> OLE DB -> relational database engine.
11. Key Operations to Support Data Mining Models
- Define a mining model:
  - the attributes to be predicted,
  - the attributes to be used for prediction,
  - the algorithm used to build the model.
- Populate a mining model from training data.
- Predict attributes for new data.
- Browse a mining model for reporting and visualization.
12. A Data Mining Model Is Analogous to a Table in SQL
- Create a data mining model object:
  - CREATE MINING MODEL model_name ...
- Insert training data into the model and train it:
  - INSERT INTO model_name ...
- Use the data mining model:
  - SELECT relation_name.id, model_name.predict_attr ...
  - Consult the DMM content to make predictions and to browse the statistics obtained by the model.
- Use DELETE to empty/reset a model.
- Predictions on datasets: a PREDICTION JOIN between a model and a data set (tables).
- Deploy a DMM by just writing SQL queries! (A minimal lifecycle sketch follows below.)
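As a rough end-to-end sketch of that lifecycle, strung together in the OLE DB for DM syntax (the model and table names are the running example from later slides; the elided parts are spelled out in full on slides 18, 23, and 25):

-- Define the model: schema plus mining algorithm (full form on slide 18)
CREATE MINING MODEL [Age Prediction] ( ... ) USING [Decision_Trees_101]

-- Train the model from a caseset (full form on slide 23)
INSERT INTO [Age Prediction] ( ... ) SHAPE ...

-- Apply the model to new data (full form on slide 25)
SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction] PREDICTION JOIN ( ... ) AS t ON ...

-- Empty the model so it can be retrained
DELETE FROM [Age Prediction]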
13. Overview
- Background knowledge
- Overview and design philosophy
- Basic components
- Operations on data model
- Concluding remarks
14. Two Basic Components Beyond Traditional OLE DB
- Cases/caseset: the input data
  - A table, or nested tables (for hierarchical data)
- Data mining model (DMM): a special type of table
  - A caseset is associated with a DMM, along with meta-information, when the DMM is created
  - The DMM saves the mining algorithm and the resulting abstraction instead of the data itself
- Fundamental operations: CREATE, INSERT INTO, PREDICTION JOIN, SELECT, DELETE FROM, and DROP (see the sketch below)
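A minimal sketch of the two teardown operations on the running [Age Prediction] model. DELETE FROM appears in the slides themselves; the DROP MINING MODEL form is my reading of the specification's drop operation, so treat it as an assumption:

-- Discard the learned patterns but keep the model definition,
-- so the model can be retrained with a fresh INSERT INTO
DELETE FROM [Age Prediction]

-- Remove the model object entirely (assumed syntax)
DROP MINING MODEL [Age Prediction]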
15. Flattened Representation of a Caseset
Three source tables:
- Customers (Customer ID, Gender, Hair Color, Age, Age Prob)
- Product Purchases (Customer ID, Product Name, Quantity, Product Type)
- Car Ownership (Customer ID, Car, Car Prob)

Flattened (joined) rowset:

CID  Gend  Hair   Age  Age prob  Prod  Quan  Type  Car  Car prob
1    Male  Black  35   100       TV    1     Elec  Car  100
1    Male  Black  35   100       VCR   1     Elec  Car  100
1    Male  Black  35   100       Ham   6     Food  Car  100
1    Male  Black  35   100       TV    1     Elec  Van  50
1    Male  Black  35   100       VCR   1     Elec  Van  50
1    Male  Black  35   100       Ham   6     Food  Van  50

Problem: lots of replication!
16. Logical Nested Table Representation of a Caseset
- Caseset: a set of cases.
- Use the Data Shaping Service to generate a hierarchical rowset: the Product Purchases and Car Ownership columns hold nested tables.

CID  Gend  Hair   Age  Age prob | Product Purchases   | Car Ownership
                                | Prod  Quan  Type    | Car   Car prob
1    Male  Black  35   100      | TV    1     Elec    | Car   100
                                | VCR   1     Elec    | Van   50
                                | Ham   6     Food    |
17. Defining a Data Mining Model
- The name of the model
- The algorithm and its parameters
- The algorithm for prediction using this model
- The columns of the caseset and the relationships among columns
- Source columns and prediction columns
18. Example

CREATE MINING MODEL [Age Prediction]                 -- name of the model
(
    [Customer ID]       LONG   KEY,                  -- source column
    [Gender]            TEXT   DISCRETE,             -- source column
    [Age]               DOUBLE DISCRETIZED() PREDICT,            -- prediction column
    [Product Purchases] TABLE                        -- source column
    (
        [Product Name]  TEXT   KEY,                  -- source column
        [Quantity]      DOUBLE NORMAL CONTINUOUS,    -- source column
        [Product Type]  TEXT   DISCRETE RELATED TO [Product Name]    -- source column
    )
)
USING [Decision_Trees_101]                           -- mining algorithm used
19. Content Types of Columns: Column Specifiers
- KEY
- ATTRIBUTE
- RELATION (RELATED TO clause)
- QUALIFIER (OF clause):
  - PROBABILITY (a value in [0, 1])
  - VARIANCE
  - SUPPORT
  - PROBABILITY-VARIANCE
  - ORDER
- TABLE
(A sketch of these specifiers in use follows below.)
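A hedged sketch of how these specifiers might appear together in one model definition; the model name and the [Age Prob] qualifier column are hypothetical, modeled on the OLE DB for DM specification:

CREATE MINING MODEL [Age Prediction Ex]           -- hypothetical model name
(
    [Customer ID] LONG   KEY,                     -- KEY specifier
    [Age]         DOUBLE DISCRETIZED() PREDICT,   -- predictable ATTRIBUTE
    [Age Prob]    DOUBLE PROBABILITY OF [Age],    -- QUALIFIER (OF clause)
    [Product Purchases] TABLE                     -- TABLE specifier (nested table)
    (
        [Product Name] TEXT KEY,
        [Product Type] TEXT DISCRETE RELATED TO [Product Name]    -- RELATION
    )
)
USING [Decision_Trees_101]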
20. Content Types of Columns: Attribute Types
- DISCRETE: e.g., area code
- ORDERED: e.g., a ranking of skill level (say, one to five)
- CYCLICAL: e.g., day of week
- CONTINUOUS: e.g., salary
- DISCRETIZED: e.g., age
- SEQUENCE_TIME: e.g., a time measurement
(Column definitions using these types are sketched below.)
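These attribute types appear as modifiers in the column definitions. A minimal sketch with hypothetical columns (whether a given algorithm accepts each type is provider-dependent):

CREATE MINING MODEL [Attribute Type Demo]         -- hypothetical model name
(
    [Customer ID] LONG   KEY,
    [Area Code]   TEXT   DISCRETE,                -- unordered categories
    [Skill Level] LONG   ORDERED,                 -- ranked values, e.g. one to five
    [Day Of Week] LONG   CYCLICAL,                -- ordered values that wrap around
    [Salary]      DOUBLE CONTINUOUS,              -- numeric measurements
    [Age]         DOUBLE DISCRETIZED() PREDICT    -- continuous values bucketed into ranges
)
USING [Decision_Trees_101]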
21. Overview
- Background knowledge
- Overview and design philosophy
- Basic components
- Operations on data model
- Concluding remarks
22. Populating a DMM
- Once a mining model is defined, the next step is to populate it by consuming a caseset that satisfies its specification. In other words, pull the information into a single rowset and use the INSERT INTO statement.
- Consume a case using the data mining model.
- Use the SHAPE statement to create the nested table from the input data.
- Train the model using the data and the algorithm specified in the CREATE syntax.
23. Example: Populating a DMM

INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age],
    [Product Purchases]([Product Name], [Quantity], [Product Type])
)
SHAPE
    { SELECT [Customer ID], [Gender], [Age] FROM Customers ORDER BY [Customer ID] }
APPEND
    ( { SELECT [CustID], [Product Name], [Quantity], [Product Type]
        FROM Sales ORDER BY [CustID] }
      RELATE [Customer ID] TO [CustID] )
    AS [Product Purchases]
24. Using the Data Mining Model to Predict
- Prediction join:
  - prediction on a dataset D using a DMM M;
  - different from the join statement in SQL;
  - the DMM acts as a truth table.
- The SELECT statement associated with PREDICTION JOIN specifies the values extracted from the DMM.
25. Example: Using a DMM in Prediction

SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
PREDICTION JOIN
    ( SHAPE
          { SELECT [Customer ID], [Gender] FROM Customers ORDER BY [Customer ID] }
      APPEND
          ( { SELECT [CustID], [Product Name], [Quantity] FROM Sales ORDER BY [CustID] }
            RELATE [Customer ID] TO [CustID] )
          AS [Product Purchases] ) AS t
ON  [Age Prediction].[Gender] = t.[Gender] AND
    [Age Prediction].[Product Purchases].[Product Name] = t.[Product Purchases].[Product Name] AND
    [Age Prediction].[Product Purchases].[Quantity] = t.[Product Purchases].[Quantity]
26. Browsing a DMM
- What is in a DMM?
  - Rules, formulas, trees, etc.
- Browsing a DMM:
  - Visualization of the model content (a query sketch follows below).
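One way to browse a trained model is to query its content like a table. A minimal sketch; the .CONTENT addressing is my reading of the specification's content rowset, so treat it as an assumption:

-- Dump the learned patterns (tree nodes, rules, distributions) of the model
SELECT * FROM [Age Prediction].CONTENT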
27. Concluding Remarks
- This paper focuses on the problem of integrating data mining with relational databases, rather than on KDD algorithms.
- OLE DB for DM integrates data mining and database systems.
- It is a good standard for mining application builders.
- Possible improvements:
  - Extend the functionality of OLE DB for DM?
  - Design more concrete language primitives?
28. Finally: Why OLE DB for Data Mining?
- The disadvantage of other data mining languages is that they cannot deal with arbitrary mining models or with integrating a relational database API with mining applications, so Microsoft proposed OLE DB for DM. Examples:
  - DMQL: a data mining query language for relational databases
  - MSQL (Imielinski & Virmani, 1999)
  - MineRule (Meo, Psaila & Ceri, 1996)
  - Query flocks based on Datalog syntax (Tsur et al., 1998)
  - CRISP-DM (CRoss-Industry Standard Process for Data Mining)
29. Questions and comments?
30. OLE DB (sometimes written as OLEDB or OLE-DB)
- OLE DB, Object Linking and Embedding for Databases, is a means Microsoft uses for accessing different types of data stores in a uniform manner. Through this technology, Microsoft has separated the data store from the application that needs access to it; this was done because different applications need access to different types and sources of data and do not necessarily want to know how to access functionality with technology-specific methods. OLE DB is conceptually divided into consumers and providers. The consumers are the applications that need access to the data, and the provider is the software component that exposes an OLE DB interface through the use of the Component Object Model (COM).
- OLE DB is part of the MDAC (Microsoft Data Access Components) stack and is its database access interface technology. MDAC is a group of Microsoft technologies that interact together as a framework that allows programmers a uniform and comprehensive way of developing applications for accessing almost any data store. OLE DB providers can be created to access data stores as simple as a text file or spreadsheet, through to databases as complex as Oracle and SQL Server. However, because different data store technologies can have different capabilities, OLE DB providers may not implement every possible interface available to OLE DB. The capabilities that are available are implemented through the use of COM objects: an OLE DB provider maps the data store technology's functionality to a particular COM interface. Microsoft describes the availability of an interface as "provider-specific," since it may not be applicable depending on the database technology involved. Additionally, providers may also augment the capabilities of a data store; these capabilities are known as services in Microsoft parlance.
31. ADO
- Microsoft ADO (ActiveX Data Objects) is a Component Object Model object for accessing data sources. It provides a layer between programming languages and OLE DB (a means of accessing data stores, whether they be databases or otherwise, in a uniform manner), which allows a developer to write programs that access data without knowing how the database is implemented; awareness of the database is required only for the connection. No knowledge of SQL is required to access a database when using ADO, although one can use ADO to execute arbitrary SQL commands. The disadvantage of doing so is that it introduces a dependency on the particular database.
32. Component Object Model
- Component Object Model, or COM, is a Microsoft platform for software componentry. It is used to enable cross-application communication and dynamic object creation in any programming language that supports the technology. COM is often used in the software development world as an umbrella term that encompasses the OLE, ActiveX, COM, and DCOM technologies. COM has been around since 1993; however, Microsoft only really started emphasizing the name around 1997.
- Although it has been implemented on several platforms, it is primarily used with Microsoft Windows. COM is expected to be replaced, at least to some extent, by the Microsoft .NET framework.
34. Component Object Model (cont.)
- The object model for all Windows software
- A proven, de facto industry standard
- Becoming available on non-Windows platforms:
  - Unix, via Bristol and Mainsoft products
  - Software AG has 18 ports in progress for Unix and mainframe platforms
35. Data Warehouse
- A data warehouse is, primarily, a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP). Data warehousing is not meant for current "live" data.
- Data warehouses often hold large amounts of information, which are sometimes subdivided into smaller logical units called dependent data marts.
- Usually, two basic ideas guide the creation of a data warehouse:
  - Integration of data from distributed and differently structured databases, which facilitates a global overview and comprehensive analysis in the data warehouse.
  - Separation of data used in daily operations from data used in the data warehouse for purposes of reporting, decision support, analysis, and controlling.
- Periodically, one imports data from enterprise resource planning (ERP) systems and other related business software systems into the data warehouse for further processing. It is common practice to "stage" data prior to merging it into a data warehouse. In this sense, to "stage" data means to queue it for preprocessing, usually with an ETL tool. The preprocessing program reads the staged data (often from a business's primary OLTP databases), performs qualitative preprocessing or filtering (including denormalization, if deemed necessary), and writes it into the warehouse.
36. ODBC
- (Pronounced as separate letters.) Short for Open DataBase Connectivity, a standard database access method developed by the SQL Access Group in 1992. The goal of ODBC is to make it possible to access any data from any application, regardless of which database management system (DBMS) is handling the data. ODBC manages this by inserting a middle layer, called a database driver, between an application and the DBMS. The purpose of this layer is to translate the application's data queries into commands that the DBMS understands. For this to work, both the application and the DBMS must be ODBC-compliant; that is, the application must be capable of issuing ODBC commands and the DBMS must be capable of responding to them. Since version 2.0, the standard supports SAG SQL.