Integrating Data Mining with SQL Databases: OLE DB for Data Mining PowerPoint PPT Presentation

presentation player overlay
1 / 37
About This Presentation
Transcript and Presenter's Notes

Title: Integrating Data Mining with SQL Databases: OLE DB for Data Mining


1
Integrating Data Mining with SQL Databases OLE
DB for Data Mining
  • Presenter Lei Chen

2
Overview
  • Background knowledge
  • Overview and design philosophy
  • Basic components
  • Operations on data model
  • Concluding remarks

3
What is Data Ming?
  • Definition Discovery of useful summaries of
    data.
  • Some Examples of Applications
  • Decision Trees constructed from bank-loan
    histories to produce algorithms to decide whether
    to grant a loan.
  • Patterns of traveler behavior mined to manage the
    sale of discounted seat on the planes, rooms in
    hotels, etc.
  • Diapers and Beer in supermarkets for increasing
    sales.
  • Comparison of genotype with/without a condition
    allowed the discovery of genes that account for
    diseases.

4
Why need OLE DB
  • To address the following needs, Microsoft created
    OLE DB.
  • A vast amount of the critical information is
    found outside of traditional corporate databases,
    instead, found in file systems, or personal
    systems such as Microsoft Access , spreadsheets,
    E-mails, and even on the World Wide Web.
  • To take advantage of the benefits of database
    technology, such as queries, transactions, and
    security, businesses have traditionally had to
    move the data from its original containing system
    into (DBMS). This process is expensive and
    redundant.
  • Furthermore, businesses need to be able to
    exploit the advantages of database technology not
    only when accessing data within a DBMS but also
    when accessing data from any other type of
    information container.

5
What is OLE DB?
  • Object Linking and Embedding for DataBases, is a
    means Microsoft use for accessing different types
    of data stores in a uniform manner. Now, OLE DB
    is an open technology available royalty free in
    many operating systems.
  • OLE DB is a set of Component Object Model (COM)
    interfaces that provide applications with uniform
    access to data stored in diverse information
    sources and that also provide the ability to
    implement additional database services.
  • OLE DB is the way to access data in a MS COM
    environment.
  • Referenceshttp//msdn.microsoft.com/library/defau
    lt.asp?url/library/en-us/oledb/htm/oledboverview_
    of_ole_db.asp

6
Overview
  • Background knowledge
  • Overview and design philosophy
  • Basic components
  • Operations on data model
  • Concluding remarks

7
Why OLE DB for Data Mining?
  • Industry standard is critical for data mining
    development, usage, interoperability, and
    exchange.
  • Objects, data types, properties, and important
    programming in OLE DB naturally cater to the
    needs
  • Building mining applications over relational
    databases is nontrivial.
  • Need different customized data mining algorithms
    and methods.
  • Significant work on the part of application
    builders.
  • Data providers All structured data, i.e. data
    that support OLE DB
  • Data consumers All development tools or
    languages requiring access to a broad range of
    data

8
Motivation of OLE DB for DM
  • Facilitate deployment of data mining models
  • Generate data mining models
  • Store, maintain and refresh models as data are
    updated
  • Programmatically use the model on other data sets
  • Browse models
  • Enable enterprise application developers to
    participate in building data mining solutions

9
Features of OLE DB for DM
  • Independent of provider or software
  • Not specialized to any specific mining model
  • Structured to cater to all well-known mining
    models
  • Part of upcoming release of Microsoft SQL Server
    2000
  • Not propose new Data mining algorithm, but to
    suggest an infrastructure to plug in any
    algorithms.

10
OLE DB for DM Overview
  • Core relational engine exposes OLE DB in a
    language-based API
  • Analysis server exposes OLE DB OLAP and OLE DB DM
  • Maintain SQL metaphor
  • Reuse existing notions

Data mining applications
OLE DB DM
Analysis Server
OLE DB
RDB engine
11
Key Operations to Support Data Mining Models
  • Define a mining model
  • Attributes to be predicted
  • Attributes to be used for prediction
  • Algorithm used to build the model
  • Populate a mining model from training data
  • Predict attributes for new data
  • Browse a mining model fro reporting and
    visualization

12
Data Mining Model is Analogous to A Table in SQL
  • Create a data mining module object
  • CREATE MINING MODEL model_name
  • Insert training data into the model and train it
  • INSERT INTO model_name
  • Use the data mining model
  • SELECT relation_name.id, model_name.predict_a
    ttr
  • consult DMM content in order to make predictions
    and browse statistics obtained by the model
  • Use DELETE to empty/reset
  • Predictions on datasets prediction join between
    a model and a data set (tables)
  • Deploy DMM by just writing SQL queries!

13
Overview
  • Background knowledge
  • Overview and design philosophy
  • Basic components
  • Operations on data model
  • Concluding remarks

14
Two Basic Components Beyond Traditional OLE DB
  • Cases/caseset input data
  • A table or nested tables (for hierarchical data)
  • Data mining model (DMM) a special type of table
  • A caseset is associated with a DMM and meta-info
    while creating a DMM
  • Save mining algorithm and resulting abstraction
    instead of data itself
  • Fundamental operations CREATE, INSERT INTO,
    PREDICTION JOIN, SELECT, DELETE FROM, and DROP

15
Flatterned Representation of Caseset
Customers
Customer ID
Gender
Hair Color
Age
Age Prob
Product Purchases
Customer ID
Product Name
Quantity
Product Type
Problem Lots of replication!
CID Gend Hair Age Age prob Prod Quan Type Car Car prob
1 Male Black 35 100 TV 1 Elec Car 100
1 Male Black 35 100 VCR 1 Elec Car 100
1 Male Black 35 100 Ham 6 Food Car 100
1 Male Black 35 100 TV 1 Elec Van 50
1 Male Black 35 100 VCR 1 Elec Van 50
1 Male Black 35 100 Ham 6 Food Van 50
Car Owernership
Customer ID
Car
Car Prob
16
Logical Nested Table Representation of Caseset
  • Caseset a set of cases.
  • Use Data Shaping Service to generate a
    hierarchical rowset.

CID Gend Hair Age Age prob Product Purchases Product Purchases Product Purchases Car Ownership Car Ownership
CID Gend Hair Age Age prob Prod Quan Type Car Car prob
1 Male Black 35 100 TV 1 Elec Car 100
1 Male Black 35 100 VCR 1 Elec Car 100
1 Male Black 35 100 VCR 1 Elec Van 50
1 Male Black 35 100 Ham 6 Food Van 50
17
Defining A Data Mining Model
  • The name of the model
  • The algorithm and parameters
  • The algorithm for prediction using this model
  • The columns of caseset and the relationships
    among columns
  • Source columns and prediction columns

18
Example
CREATE MINING MODEL Age Prediction Name of
Model ( Customer ID LONG KEY, source
column Gender TEXT DISCRETE, source
column Age Double DISCRETIZED()
PREDICT, prediction column Product
Purchases TABLE source column ( Product
Name TEXT KEY, source column Quantity DOUBLE
NORMAL CONTINUOUS, source column Product
Type TEXT DISCRETE RELATED TO Product
Name source column )) USING
Decision_Trees_101 Mining algorithm used
19
Content Type of Columns Column Specifiers
  • KEY
  • ATTRIBUTE
  • RELATION (RELATED TO clause)
  • QUALIFIER (OF clause)
  • PROBABILITY 0, 1
  • VARIANCE
  • SUPPORT
  • PROBABILITY-VARIANCE
  • ORDER
  • TABLE

20
Content Type of Columns Attribute Types
  • DISCRETE Area Code
  • ORDERED a ranking of skill level (say one to
    five)
  • CYCLICAL day of week
  • CONTINOUS salary
  • DISCRETIZED Age
  • SEQUENCE_TIME a measurement for time

21
Overview
  • Background knowledge
  • Overview and design philosophy
  • Basic components
  • Operations on data model
  • Concluding remarks

22
Populating a DMM
  • Once a mining model is defined, the next step is
    to populate a mining model by consuming a caseset
    that satisfies specification In other works,
    pull the information into a single rowset, and
    use INSERT INTO statement
  • Consume a case using the data mining model
  • Use SHAPE statement to create the nested table
    from the input data.
  • Train the model using the data and algorithm
    specified in create syntax.

23
Example Populating a DMM
INSERT INTO Age Prediction ( Customer ID,
Gender, Age, Product Purchases( Product
Name, Quantity, Product Type) )
SHAPE SELECT Customer ID, Gender, Age FROM
Customers ORDER BY Customer ID APPEND SELECT
CustID, product Name, Quantity, Product
Type FROM Sales ORDER BY CustID RELATE
Customer ID TO CustID ) AS Product Purchases
24
Using Data Model to Predict
  • Prediction join
  • Prediction on dataset D using DMM M
  • Different to join statement in SQL
  • DMM a truth table
  • SELECT statement associated with PREDICTION JOIN
    specifies values extracted from DMM

25
Example Using a DMM in Prediction
SELECT t.Customer ID, Age Prediction.Age FRO
M Age Prediction PREDICTION JOIN (SHAPE SELECT
Customer ID, Gender FROM Customers ORDER BY
Customer ID APPEND ( SELECT CustID,
Product Name, Quantity FROM Sales ORDER BY
CustID RELATE Customer ID TO
CustID ) AS Product Purchases) AS t ON Age
Prediction.Gendert.Gender AND Age
Prediction.Product Purchases.Product
Namet.Product Purchases.Product Name
AND Age Prediction.Product Purchases.Quantity
t.Product Purchases.Quantity
26
Browsing DMM
  • What is in a DMM?
  • Rules, formulas, trees, , etc
  • Browsing DMM
  • Visualization

27
Concluding Remarks
  • This paper focus on the problem of integration of
    data mining with relational databases, rather
    than KDD algorithms.
  • OLE DB for DM integrates data mining and database
    systems
  • A good standard for mining application builders
  • Possible improvement
  • Extend the functionalities of the OLE DB DM?
  • Design more concrete language primitives?

28
Finally See why OLEDB for Data Mining?
  • Disadvantages of other Data Mining Languages
    They can not deal with either arbitrary mining
    models or integration of relational database API
    with mining applications. So Microsoft proposes
    OLEDB for DM.
  • DMQL A Data Mining Query Language for Relational
    Databases
  • MSQL (Imielinski Virmani99)
  • MineRule (Meo Psaila and Ceri96)
  • Query flocks based on Datalog syntax (Tsur et
    al98)
  • CRISP-DM (CRoss-Industry Standard Process for
    Data Mining)

29
Questions and comments?
  • Thank You!

30
OLE DB (sometimes written as OLEDB or OLE-DB),
  • OLE DB (sometimes written as OLEDB or OLE-DB),
    Object Linking and Embedding for Databases, is a
    means Microsoft use for accessing different types
    of data stores in a uniform manner. Microsoft has
    separated the data store from the application
    that needs access to it through the use of this
    technology this was done because different
    applications need access to different types and
    sources of data and do not necessarily want to
    know how to access functionality with
    technology-specific methods. OLE DB is
    conceptually divided into consumers and
    providers. The consumers are the applications
    that need access to the data, and the provider is
    the software component that exposes an OLE DB
    interface through the use of the Component Object
    Model (or COM).
  • OLE DB is part of the MDAC (Microsoft Data Access
    Components) stack and is the database access
    interface technology. MDAC is a group of
    Microsoft technologies that interact together as
    a framework that allows programmers a uniform and
    comprehensive way of developing applications for
    accessing almost any data store. OLE DB providers
    can be created to access such simple data stores
    as a text file or spreadsheet, through to such
    complex databases as Oracle and SQL Server.
    However, because different data store technology
    can have different capabilities, OLE DB providers
    may not implement every possible interface
    available to OLE DB. The capabilities that are
    available are implemented through the use of COM
    objects - an OLE DB provider will map the data
    store technologies functionality to a particular
    COM interface. Microsoft calls the availability
    of an interface to be "provider-specific" as it
    may not be applicable depending on the database
    technology involved. Additionally, however,
    providers may also augment the capabilities of a
    data store - these capabilities are known as
    services in Microsoft parlance.

31
ADO
  • Microsoft ADO(ActiveX Data Objects) is a
    Component object model object for accessing data
    sources. It provides a layer between programming
    languages and OLE DB (a means of accessing data
    stores, whether they be databases or otherwise,
    in a uniform manner), which allows a developer to
    write programs which access data, without knowing
    how the database is implemented. You must be
    aware of your database for connection only. No
    knowledge of SQL is required to access a database
    when using ADO, although one can use ADO to
    execute arbitrary SQL commands. The disadvantage
    of this is that this introduces a dependency upon
    the database.

32
Component Object Model
  • Component Object Model, or COM, is a Microsoft
    platform for software componentry. It is used to
    enable cross-application communication and
    dynamic object creation in any programming
    language that supports the technology. COM is
    often used in the software development world as
    an umbrella term that encompasses the OLE,
    ActiveX, COM and DCOM technologies. COM has been
    around since 1993, however Microsoft only really
    started emphasizing the name around 1997.
  • Although it has been implemented on several
    platforms, it is primarily used with Microsoft
    Windows. COM is expected to be replaced to at
    least some extent by the Microsoft .NET framework.

33
Component Object Model,
  • Component Object Model, or COM, is a Microsoft
    platform for software componentry. It is used to
    enable cross-application communication and
    dynamic object creation in any programming
    language that supports the technology. COM is
    often used in the software development world as
    an umbrella term that encompasses the OLE,
    ActiveX, COM and DCOM technologies.

34
Component Object Model,
  • The object model for all Windows software
  • Proven, de facto industry-standard
  • Becoming available on non-Windows platforms
  • Unix via Bristol and Mainsoft products
  • Software AG has 18 ports in progress for Unix and
    Mainframe platforms

35
data warehouse
  • A data warehouse is, primarily, a record of an
    enterprise's past transactional and operational
    information, stored in a database designed to
    favour efficient data analysis and reporting
    (especially OLAP). Data warehousing is not meant
    for current "live" data.
  • Data warehouses often hold large amounts of
    information which are sometimes subdivided into
    smaller logical units called dependent data
    marts.
  • Usually, two basic ideas guide the creation of a
    data warehouse
  • Integration of data from distributed and
    differently structured databases, which
    facilitates a global overview and comprehensive
    analysis in the data warehouse.
  • Separation of data used in daily operations from
    data used in the data warehouse for purposes of
    reporting, decision support, analysis and
    controlling.
  • Periodically, one imports data from enterprise
    resource planning (ERP) systems and other related
    business software systems into the data warehouse
    for further processing. It is common practice to
    "stage" data prior to merging it into a data
    warehouse. In this sense, to "stage data" means
    to queue it for preprocessing, usually with an
    ETL tool. The preprocessing program reads the
    staged data (often a business's primary OLTP
    databases), performs qualitative preprocessing or
    filtering (including denormalization, if deemed
    necessary), and writes it into the warehouse.

36
ODBC
  • (pronounced as separate letters) Short for Open
    DataBase Connectivity, a standard database access
    method developed by the SQL Access group in 1992.
    The goal of ODBC is to make it possible to access
    any data from any application, regardless of
    which database management system (DBMS) is
    handling the data. ODBC manages this by inserting
    a middle layer, called a database driver ,
    between an application and the DBMS. The purpose
    of this layer is to translate the application's
    data queries into commands that the DBMS
    understands. For this to work, both the
    application and the DBMS must be ODBC-compliant
    -- that is, the application must be capable of
    issuing ODBC commands and the DBMS must be
    capable of responding to them. Since version 2.0,
    the standard supports SAG SQL.

37
OLE DB
  • Object Linking and Embedding for DataBases, is a
    means Microsoft use for accessing different types
    of data stores in a uniform manner. Microsoft has
    separated the data store from the application
    that needs access to it through the use of this
    technology this was done because different
    applications need access to different types and
    sources of data and do not necessarily want to
    know how to access functionality with
    technology-specific methods. OLE DB is
    conceptually divided into consumers and
    providers. The consumers are the applications
    that need access to the data, and the provider is
    the software component that exposes an OLE DB
    interface through the use of the Component Object
    Model (or COM).
Write a Comment
User Comments (0)