ISQS 6339, Business Intelligence: Extraction, Transformation, and Loading

1
ISQS 6339, Business Intelligence: Extraction,
Transformation, and Loading
  • Zhangxi Lin
  • Texas Tech University

2
Outline
  • Introduction to SSIS
  • Learn by doing: Exercise 4
  • More about SSIS features - Package development
    tools

3
YouTube Videos
  • Introduction to SQL Server Integration Services
  • Part 1 (10:19), Part 2 (8:12)
  • Create A Basic SSIS Package with SSIS (7:55)
  • How to create a simple SSIS package
  • Part 1 (3:19), Part 2 (7:54), Part 3 (6:41)
  • More videos
  • An Overview of SSIS
  • Part 1 (6:12), Part 2 (6:13), Part 3 (8:20)
  • ETL Demo (10:40)
  • ETL Tools (4:56)
  • SSIS 2008 Package Deployment
  • Part I (8:04), Part II (5:09)
  • Introduction to SSIS 2008
  • Part I (9:57), Part II (9:57), Part III (9:55),
    Part IV (9:59), Part V (6:11)
  • ETL Strategies with SSIS (21:56)

4
Introduction to SSIS
5
ETL Topics
  • Dimension Processing
  • Extract changed rows from the operational database
  • Handling slowly changing dimensions
  • De-duplication and fuzzy transforms
  • Fact Processing
  • Extract fact data from the operational database
  • Extract fact updates and deletes
  • Cleaning fact data
  • Checking data quality and halting package
    execution
  • Transform fact data
  • Surrogate key pipeline
  • Loading fact data
  • Analysis services processing
  • Integrating all tasks
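The surrogate key pipeline in the list above can be illustrated with a minimal sketch in plain Python (not SSIS; all names such as assign_surrogate_keys are invented for illustration): incoming fact rows carry business keys, and each is replaced by the warehouse surrogate key looked up from the dimension, with a new key assigned when no match exists.

```python
# Hypothetical surrogate key pipeline sketch: replace business (natural)
# keys in fact rows with warehouse surrogate keys from the dimension table.

def build_key_map(dimension_rows):
    """Map business key -> surrogate key from existing dimension rows."""
    return {row["business_key"]: row["surrogate_key"] for row in dimension_rows}

def assign_surrogate_keys(fact_rows, key_map):
    """Replace business keys in fact rows with surrogate keys."""
    next_key = max(key_map.values(), default=0) + 1
    out = []
    for row in fact_rows:
        bk = row["product_code"]
        if bk not in key_map:        # unseen member: late-arriving dimension row
            key_map[bk] = next_key
            next_key += 1
        out.append({**row, "product_key": key_map[bk]})
    return out

dim = [{"business_key": "P-001", "surrogate_key": 1}]
facts = [{"product_code": "P-001", "qty": 5},
         {"product_code": "P-002", "qty": 3}]
result = assign_surrogate_keys(facts, build_key_map(dim))
```

In SSIS this step is usually done with a Lookup transform against the dimension table inside the data flow.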

6
Automating your routine information processing
tasks
  • Your routine information processing tasks
  • Read online news at 8:00 a.m. and collect a few of
    the most important pieces
  • Retrieve data from a database to draft a short
    daily report at 10:00 a.m.
  • View and reply to emails, and take some notes that
    are saved in a database
  • View 10 companies' webpages to see the updates.
    Input the summaries into a database
  • Browse three popular magazines twice a week.
    Input the summaries into a database
  • Generate a few one-way frequency and two-way
    frequency tables and put them on the web
  • Merge datasets collected by other people into a
    main database
  • Prepare a weekly report using the database at
    4:00 p.m. every Monday, and publish it to the
    internal portal site
  • Prepare a monthly report at 11:00 a.m. on the
    first day of each month, which must be converted
    into a PDF file and uploaded to the website
  • It seems there are many things going on. How can
    they all be handled properly at the right time?
  • An organizer? Yes.
  • How about regular data processing tasks?

7
SQL Server Integration Services (SSIS)
  • The data in data warehouses and data marts is
    usually updated frequently, and the data loads
    are typically very large.
  • Integration Services includes a task that bulk
    loads data directly from a flat file into SQL
    Server tables and views, and a destination
    component that bulk loads data into a SQL Server
    database as the last step in a data
    transformation process.
  • An SSIS package can be configured to be
    restartable. This means you can rerun the package
    from a predetermined checkpoint, either a task or
    container in the package. The ability to restart
    a package can save a lot of time, especially if
    the package processes data from a large number of
    sources.

8
What can you do with SSIS?
  • To load the dimension and fact tables in the
    database. If the source data for a dimension
    table is stored in multiple data sources, the
    package can merge the data into one dataset and
    load the dimension table in a single process,
    instead of using a separate process for each data
    source.
  • To update data in data warehouses and data marts.
    The Slowly Changing Dimension Wizard automates
    support for slowly changing dimensions by
    dynamically creating the SQL statements that
    insert and update records, update related
    records, and add new columns to tables.
  • To process Analysis Services cubes and
    dimensions. When the package updates tables in
    the database that a cube is built on, you can use
    Integration Services tasks and transformations to
    automatically process the cube and to process
    dimensions as well.
  • To compute functions before the data is loaded
    into its destination. If your data warehouses and
    data marts store aggregated information, the SSIS
    package can compute functions such as SUM,
    AVERAGE, and COUNT. An SSIS transformation can
    also pivot relational data and transform it into
    a less-normalized format that is more compatible
    with the table structure in the data warehouse.
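The pre-load aggregation described above can be sketched in plain Python (an illustrative analogue, not the SSIS Aggregate transform; column names are invented):

```python
# Illustrative sketch of computing SUM, AVERAGE, and COUNT per group
# before loading aggregated rows into a data mart.
from collections import defaultdict

def aggregate(rows, group_col, value_col):
    """Group rows by group_col and summarize value_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    return {k: {"sum": sum(v), "avg": sum(v) / len(v), "count": len(v)}
            for k, v in groups.items()}

sales = [{"region": "TX", "amount": 100.0},
         {"region": "TX", "amount": 300.0},
         {"region": "OK", "amount": 50.0}]
summary = aggregate(sales, "region", "amount")
```

In a real package the same result comes from an Aggregate transform (or a GROUP BY in the source query) placed before the destination.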

9
SQL Server Integration Services
  • The hierarchy of SSIS
  • Project → Package → Control flow → Data flow
  • Package structure
  • Control flow
  • Data flow
  • Event handler
  • Package explorer
  • Connection tray
  • Features
  • Event driven
  • Layered
  • Drag-and-drop programming
  • Data I/O definitions are done using Connection
    Managers

10
  • SSIS Architecture

11
Three Layer Structure of Integration Services
  • Touch down to the playground

  • Control Flow
  • Data Flow
  • Event Handler
12
Control Flow
  • Bulk Insert task: performs a fast load of data from
    flat files into a target table. Good for loading
    clean data.
  • Execute SQL task: performs database operations,
    such as creating views, tables, or even databases;
    queries data or metadata.
  • File Transfer Protocol and File System tasks:
    transfer files or sets of files.
  • Execute Package, Execute DTS 2000 Package, and
    Execute Process tasks: break a complex workflow
    into smaller ones, and define a parent or master
    package to execute them.
  • Send Mail task: sends an email message.

13
Control Flow (contd)
  • Script and ActiveX Script tasks: perform an
    endless array of operations that are beyond the
    scope of the standard tasks.
  • Data Mining and Analysis Services Processing
    tasks: launch processing on SSAS dimensions and
    databases. Use the SSAS DDL task to create new
    Analysis Services partitions, or perform any data
    definition language operation.
  • XML and Web Services tasks
  • Message Queue, WMI Data Reader, and WMI Event
    Watcher tasks: help to build an automatic ETL
    system.
  • ForEach Loop, For Loop, and Sequence containers:
    execute a set of tasks multiple times
  • Data Flow tasks

14
Data Flow Task
  • The Data Flow task is a pipeline in which data is
    picked up, processed, and written to a
    destination.
  • Avoids intermediate disk I/O, which provides
    excellent performance
  • Concepts
  • Data sources
  • Data destinations
  • Data transformations
  • Error flows
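The pipeline idea behind the Data Flow task can be sketched with Python generators (a conceptual analogue, not SSIS itself): rows stream from a source through transformations to a destination, with bad rows diverted, and no intermediate files are written.

```python
# Minimal streaming-pipeline sketch: source -> transform -> error flow ->
# destination, with rows flowing through without landing on disk.

def source(rows):
    yield from rows                      # data source

def clean(rows):
    for r in rows:                       # transformation: trim + uppercase
        yield {**r, "name": r["name"].strip().upper()}

def valid(rows):
    for r in rows:                       # error flow: drop rows with bad qty
        if r["qty"] >= 0:
            yield r

def destination(rows):
    return list(rows)                    # destination: collect loaded rows

loaded = destination(valid(clean(source(
    [{"name": " widget ", "qty": 2}, {"name": "gear", "qty": -1}]))))
```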

15
Frequently Used Data Transformation Steps
  • Sort and Aggregate transforms
  • Conditional Split and Multicast transforms
  • Union All, Merge Join, and Lookup transforms
  • Slowly Changing Dimension transform
  • OLE DB Command transform
  • Row Count and Audit transforms
  • Pivot and Unpivot transforms
  • Data Mining Model Training and Data Mining Query
    transforms
  • Term extraction and Term Lookup transforms
  • File Extractor and File Injector transforms

16
Dynamic Packaging
  • Modifying the actions that a package takes while
    it is executing.
  • SSIS implements a rich expression language that
    is used in control flow and also in data flow
    transforms.
  • Concepts
  • Expressions. Use a simple expression language.
  • Variables. Can be defined within a package. Can
    be scoped to any object: package-wide, within a
    container, a single task, etc.
  • Configurations. Can overwrite most of the
    settings for SSIS objects by supplying a
    configuration file at runtime.
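The configuration idea can be sketched as follows (a hedged analogue in Python, not the SSIS configuration mechanism, which uses XML or environment-based configurations; the setting names here are invented): package variables get defaults, and a configuration supplied at run time overwrites matching settings.

```python
# Sketch of runtime configuration overlay: a config file's values
# overwrite a package's default settings, key by key.
import json

DEFAULTS = {"ServerName": "localhost", "BatchSize": 1000, "RetryCount": 3}

def effective_settings(defaults, config_text):
    """Overlay a JSON configuration's values on package defaults."""
    overrides = json.loads(config_text) if config_text else {}
    # only known settings may be overwritten
    return {**defaults, **{k: v for k, v in overrides.items() if k in defaults}}

settings = effective_settings(
    DEFAULTS, '{"ServerName": "prod-db", "BatchSize": 5000}')
```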

17
Decision Issues in ETL System Design
  • Source-to-target mapping
  • Load frequency
  • How much history is needed

ISQS 6339, Data Management Business Intelligence
18
Strategies for Extracting Data
  • Extracting data from packaged source systems
    (self-contained data sources)
  • It may not be good to use their APIs
  • It may not be good to use their add-on analytic
    systems
  • Extracting directly from the source databases
  • Strategies vary depending on the nature of the
    source database
  • Extracting data for incremental loads
  • How the source database records the changes to
    its rows
  • Extracting historical data

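One common incremental-load strategy, when the source records a modified timestamp per row, is a high-water mark. A minimal sketch in Python (illustrative only; column names are invented):

```python
# Timestamp-based incremental extraction sketch: keep the highest
# modified-time already loaded (the "high-water mark") and pull only
# rows changed since then.

def extract_incremental(rows, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    changed = [r for r in rows if r["modified"] > last_watermark]
    new_mark = max((r["modified"] for r in changed), default=last_watermark)
    return changed, new_mark

src = [{"id": 1, "modified": "2024-01-01"},
       {"id": 2, "modified": "2024-02-15"},
       {"id": 3, "modified": "2024-03-01"}]
changed, mark = extract_incremental(src, "2024-01-31")
```

The new watermark is persisted after a successful load so the next run extracts only rows modified after it.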
19
De-Duplication
  • Two common situations: person records and
    organization records
  • SSIS provides two general-purpose transforms that
    help address data quality and de-duplication
  • Fuzzy Lookup
  • Fuzzy Grouping
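A toy analogue of Fuzzy Grouping can be written with difflib string similarity (this is not the SSIS algorithm; the threshold and field choices are illustrative): names close enough to an already-seen name are collapsed onto that canonical name.

```python
# Fuzzy-grouping sketch: collapse near-duplicate names onto one
# canonical name using a string-similarity threshold.
from difflib import SequenceMatcher

def fuzzy_group(names, threshold=0.85):
    """Return the list of canonical (distinct) names."""
    canonical = []
    for name in names:
        for c in canonical:
            if SequenceMatcher(None, name.lower(), c.lower()).ratio() >= threshold:
                break                     # near-duplicate of an existing name
        else:
            canonical.append(name)        # new distinct name
    return canonical

people = ["Jon Smith", "John Smith", "Mary Jones", "Jon  Smith"]
deduped = fuzzy_group(people)
```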

20
Learn by doing: Exercise 4
21
Exercise 4: Populate Maximum Miniatures
Manufacturing Data Mart Dimensions
  • Preparation: data source and destination
    definition
  • Source database: AccountingSystemDatabase
  • Loading dimensions
  • ProductType
  • ProductSubType
  • Product
  • Country
  • Plant (using SQL Command)
  • Material (using SQL Command, Aggregate item)
  • MachineType (copied from the Material loading
    task)
  • Machine (copied from the MachineType loading
    task)
  • Note: DimBatch and the fact table will be loaded
    in the next exercise.
  • Debugging
  • Step by step
  • Understand the error messages
  • Watch the database loading status
  • See the more detailed guidelines for this exercise
  • Submit screenshots of the green results of the
    ETL flow to isqs6347@gmail.com by February 21,
    before 5:00 p.m.

22
Snowflake Schema of the Data Mart
(Diagram) The ManufacturingFact table sits at the
center, surrounded by DimBatch, DimMachine,
DimMachineType, DimPlant, DimProduct,
DimProductSubType, DimProductType, DimMaterial, and
DimCountry. Numbers 1-10 in the diagram mark the
loading order; "Aggregate" and "SQL Coding" labels
mark the steps that use those items.
23
Codes for data flows
  • The following SQL statements are used to
    selectively retrieve data from the source for the
    destination database
  • Code for DimPlant loading
  • SELECT LocationCode, LocationName, CountryCode
  • FROM Locations
  • WHERE LocationType = 'Plant Site'
  • Code for DimMaterial loading
  • SELECT AssetCode, AssetName, AssetClass,
    LocationCode,
  • Manufacturer, DateOfPurchase, RawMaterial
  • FROM CapitalAssets
  • WHERE AssetType = 'Molding Machine'
24
Package Items
  • Data Flow Task: the main task
  • Control Flow Items
  • For Loop Container, Foreach Loop Container,
    Sequence Container
  • Data Preparation Tasks
  • File System Task, FTP Task, Web Service Task, XML
    Task
  • Work Flow Tasks
  • Execute Package Task, Execute DTS 2000 Package
    Task, Execute Process Task, Message Queue Task,
    Send Mail Task, WMI Data Reader Task, WMI Event
    Watcher Task
  • SQL Server Tasks
  • Bulk Insert Task, Execute SQL Task
  • Scripting Tasks
  • ActiveX Script Task, Script Task
  • Analysis Services Tasks
  • Analysis Services Processing Task, Analysis
    Services Execute DDL Task, Data Mining Query Task
  • Transfer Tasks
  • Transfer Database Task, Transfer Error Messages
    Task, Transfer Logins Task, Transfer Objects Task,
    Transfer Stored Procedures Task
  • Maintenance Tasks
  • Custom Tasks

25
Exploring Features of SQL Server ETL System
  • Data sources and data destinations
  • SQL Server file (OLE DB file)
  • Flat file
  • Excel file
  • Data flow transformation
  • Aggregate
  • Derived Column
  • Data Conversion
  • Sort

26
More About SSIS Features
27
ETL System Debugging
  • Most frequently encountered errors
  • Data format error: the database table's data type
    does not match the input data's format
  • Reason 1: a flat text file uses varchar(50), or
    string [DT_STR], format, while an Excel file uses
    nvarchar format
  • Reason 2: you defined the database using
    different formats, which could be caused by the
    imported data set
  • Solution: a Data Conversion transformation node
    can be used to change the format
  • SQL Server system error: even though you did
    things correctly, you could not get through
  • Solution: the easiest way to solve this problem
    is to redo the ETL flow
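What a Data Conversion step does can be sketched in plain Python (a conceptual analogue, not the SSIS transform; the column names are invented): cast incoming string columns to the destination's types, diverting rows that fail to an error output.

```python
# Type-conversion sketch: cast string columns to target types, routing
# rows that fail conversion to an error output instead of failing the load.

def convert_rows(rows, casts):
    """casts maps column -> conversion function; bad rows go to errors."""
    good, errors = [], []
    for row in rows:
        try:
            good.append({col: casts.get(col, lambda v: v)(val)
                         for col, val in row.items()})
        except ValueError:
            errors.append(row)            # error flow: keep the raw row
    return good, errors

raw = [{"UserID": "101", "Amount": "19.95"},
       {"UserID": "oops", "Amount": "3.50"}]
good, errors = convert_rows(raw, {"UserID": int, "Amount": float})
```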

28
ETL How-to Problems
  • How to use the Merge function of Data
    Transformation to join datasets from two tables
    into one
  • How to split a dataset into two tables
  • How to remove duplicated rows in a table
  • How to detect changes to the rows in the data
    sources and extract the updated rows into a table
    in the data warehouse
  • How to load multiple datasets with a similar
    structure into one table
  • Reference: SQL Server 2005 Integration Services,
    McGraw-Hill Osborne, 2007
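The first two how-to items can be sketched with plain-Python analogues (not the SSIS transforms themselves; data and field names are invented): a join of two inputs on a shared key, and a conditional split into two outputs.

```python
# Plain-Python analogues of two how-to items: join two row sets on a
# shared key, then split the result into two outputs by a condition.

def merge_join(left, right, key):
    """Inner-join two row lists on `key`."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def conditional_split(rows, predicate):
    """Route each row to one of two outputs based on a condition."""
    true_out = [r for r in rows if predicate(r)]
    false_out = [r for r in rows if not predicate(r)]
    return true_out, false_out

orders = [{"cust": 1, "total": 80}, {"cust": 2, "total": 20}]
custs = [{"cust": 1, "name": "Acme"}, {"cust": 2, "name": "Bolt"}]
joined = merge_join(orders, custs, "cust")
big, small = conditional_split(joined, lambda r: r["total"] >= 50)
```

In SSIS the same shapes correspond to the Merge Join and Conditional Split transforms (Merge Join additionally requires both inputs sorted on the join key).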

29
Connection managers
  • Excel Connection Manager
  • File Connection Manager
  • Flat File Connection Manager
  • FTP Connection Manager
  • HTTP Connection Manager
  • ODBC Connection Manager
  • OLE DB Connection Manager
  • ADO Connection Manager: for legacy applications
    using earlier versions of programming languages,
    such as VB 6.0
  • ADO.NET Connection Manager: access to Microsoft
    SQL Server and data sources exposed through OLE
    DB and XML by using a .NET provider
  • Microsoft .NET Data Provider for mySAP Business
    Suite: access to an SAP server; enables executing
    RFC/BAPI commands and SELECT queries against SAP
    tables
  • Design-time data source objects can be created in
    SSIS, SSAS, and SSRS projects

30
Container Managers
  • Foreach Loop Container
  • For Loop Container
  • Sequence Container

31
Different Types of ETL Control Flows
  • With data flows, e.g.
  • Import data
  • Database updates
  • Loading SCD
  • Database cleansing
  • Aggregating data
  • Without data flows, e.g.
  • Downloading zipped files
  • Archiving downloaded files
  • Reading application log
  • Mailing opportunities
  • Consolidating workflow package

32
Data Flow for Updating Database
33
Data Flow for Loading Slowly Changing Dimension
34
Control Flow for Importing Expanded Files
35
Exploring Features of SQL Server ETL System (TBD)
  • Data Set
  • Source: Commrex_2011, D5.txt (in the shared
    directory under \OtherDatasets)
  • Destination: Flat file, Excel file, OLE DB file
  • Data flow transformation
  • Aggregate (use D5.txt, and aggregate the data
    with regard to UserID)
  • Derived Column (use Commrex_2011, and create a
    new column, NewID)
  • Data Conversion (use Commrex_2011, and convert
    the data type of some columns, such as UserID,
    Prop_ID)
  • Sort (use D5.txt; sort ascending by ID, Date,
    Time)
