ISQS 6339, Business Intelligence Supplemental Notes on the Term Project - PowerPoint PPT Presentation

1 / 20
About This Presentation
Title:

ISQS 6339, Business Intelligence Supplemental Notes on the Term Project

Description:

ISQS 6339, Business Intelligence Supplemental Notes on the Term Project Zhangxi Lin Texas Tech University * Term project 2-3 students form a team to fulfill a data ... – PowerPoint PPT presentation

Number of Views:262
Avg rating:3.0/5.0
Slides: 21
Provided by: admin1259
Category:

less

Transcript and Presenter's Notes

Title: ISQS 6339, Business Intelligence Supplemental Notes on the Term Project


1
ISQS 6339, Business IntelligenceSupplemental
Notes on the Term Project
  • Zhangxi Lin
  • Texas Tech University

2
Term project
  • 3-6 students form a team to fulfill a data mart
    development project.
  • Stage 1 (10) SQL Server Project proposal. March
    4
  • Stage 2 (25) Data mart Implementation. March 27
  • Stage 3 (10) Hadoop Project proposal. Due April
    15
  • Stage 4 (25) Hadoop Project completed. Due
    April 29
  • Stage 5 (10) Presentation. May 1 6
  • Stage 6 (30) Final report. Due May 13
  • Detailed instructions http//zlin.ba.ttu.edu/6339
    /Projects14.html

3
Merits of the project outcomes
  • Carefully developed project proposal
    demonstrating the understanding of the business
    requirements, attractive analytics themes, and
    clearly defined project goal and objectives
  • Comprehensive data mart design, such as multiple
    fact tables, with supporting analytic themes
  • Applications of advanced ETL model or techniques,
    such as slowly changing dimensions, the use of
    containers, etc.
  • Advanced OLAP cube design, and/or optional MDX
    scripting by self-taught
  • Rich data analysis outcomes
  • Well-presented final report
  • Demonstrating the creative ideas and skillful
    data warehousing ability

4
Project Datasets
  • ISQS 6339
  • 2014-01

5
Qiyi (???) Data of Online Purchases
  • Website http//www.6695.com
  • Data format MySQL
  • Data size about 3GB
  • Data structure 11 tables
  • This data is to be installed to MySQL. It can be
    applied for two purposes
  • Data warehousing with Hadoop/HBase, MySQL, or SQL
    Server 2008
  • Data mining for credit assessment or purchase
    preference (ISQS 6347)
  • Note some contents are in Chinese

6
Table 8 order_refund_log
order_sn refund_id user_id
Table 3 order_air_user
order_sn
Table 2 order_air
order_sn
Table 7 order_refund
order_sn refund_id user_id
Table 9 order_rights
order_sn user_id
Table 4 order_be-selled
order_sn
Table 6 order_data
order_sn user_id
Table 10 order_table
order_sn user_id
Table 5 order_caigou
order_sn
Table 11 order_ship
order_sn
Table 1 data_affix
order_sn
7
Vehicle locations provided by Beijing 1039
traffic radio
  • Website http//www.fm1039.com/index/index.htm
  • Data format MongoDB
  • Data Size about 3BG for traffic status in a week
  • Data structure (x, y) of vehicles locations
  • This data is to be installed on MongoDB
  • This data can be installed either at Hadoop or
    SQL Serve 2008 for traffic condition analyses

8
Lending Club Data
  • Website https//www.lendingclub.com/
  • This is a famous P2P lending company to go IPO in
    2014. Google bought its 7 share in 2013 in 125
    million.
  • Data size 5 compressed tables, total about 80
    MB

9
Propsper.com dataset
  • Website http//www.prosper.com
  • Propsper.com is another famous P2P lending
    company with a differentiated business model from
    that used by Lending Club.
  • Data size is to be determined.

10
Helsinki cell phone user data
  • 15 users activity data, 20-22 tables, about 30
    GB

11
Hadoop projects
12
Components
  • Load Balancer
  • Oozie
  • Solr, SolrCloud, SolrJ, HA
  • NewSQL
  • Kafka, Storm, Impala
  • REST
  • ZK
  • MySQL
  • Nginx/HA-Proxy
  • Flume
  • Sqoop
  • Ganglia
  • Technology stack
  • Tomcat, Jetty 
  • Avro

13
Data Warehousing Methodology
  • - Implementing data warehouse systematically

13
14
Dimensional Modeling Process
  • Preparation
  • Identify roles and participants
  • Understanding the data architecture strategy
  • Setting up the modeling environment
  • Establishing naming conventions
  • Data profiling and research
  • Data profiling and source system exploration
  • Interacting with source system experts
  • Identifying core business users
  • Studying existing reporting systems
  • Building Dimensional models
  • High-level dimensional model design
  • Identifying dimension and fact attributes
  • Developing the detailed dimensional model
  • Testing the model
  • Reviewing and validating the model

15
Business Dimensional Lifecycle
Business Reqts definition
Technical Arch. Design
Product Selection Installation
Growth
Dimensional Modeling
Physical Design
ETL design Development
Deployment
Project Planning
BI Appl. Specification
BI Application Development
Maintenance
Project Management
15
16
Data Profiling
  • Data profiling is a methodology for learning
    about he characteristics of the data
  • It is a hierarchical process that attempt to
    build an assessment of the metadata associated
    with a collection of data sets.
  • Three levels
  • Bottom characterizing the values associated
    with individual attributes
  • Middle the assessment looking at relationships
    between multiple columns within a single table.
  • Highest level the profile describing
    relationships that exist between data attributes
    across different tables.
  • Can run a program against the sandbox source
    system to obtain the needed information.

16
17
ETL Methodology
  • Develop a high-level map
  • Build a sandbox source system (optional)
  • Detailed data profiling
  • Make decisions
  • The source-to-target mapping
  • How often loading tables
  • The strategy for partitioning the relational and
    Analysis Services fact table
  • The strategy for extracting data from each source
    system
  • De-duplicate key data from each source system
    (optional)
  • Develop a strategy for distributing dimension
    tables across multiple database servers
    (optional)

17
18
Sandbox Source System
  • Sandbox
  • A protected, limited environment where
    applications are allowed to "play" without
    risking damage to the rest of the system.
  • A term for the RD department at many software
    and computer companies. The term is
    half-derisive, but reflects the truth that
    research is a form of creative play.
  • In the DW/BI context, sandbox source system is a
    subset of source database for analytic
    exploration tasks
  • How to create
  • Set up a static snapshot of the database
  • By sampling

18
19
Decision Issues in ETL System Design
  • Source-to-target mapping
  • Load frequency
  • How much history is needed

19
20
Strategies for Extracting Data
  • Extracting data from packaged source systems
    self-contained data sources
  • May not be good to use their APIs
  • May not be good to use their add-on analytic
    system
  • Extracting directly from the source databases
  • Strategies vary depending on the nature of the
    source database
  • Extracting data from incremental loads
  • How the source database records the changes of
    the rows
  • Extracting historical data

20
Write a Comment
User Comments (0)
About PowerShow.com