Metadata Management in Data Warehousing - PowerPoint PPT Presentation

1 / 25
About This Presentation
Title:

Metadata Management in Data Warehousing

Description:

Ex) an actual chair is a pretty concrete item. Any specific chair has a host of ... it has two cushions, and so on. one step in the chain of abstraction ... – PowerPoint PPT presentation

Number of Views:1373
Avg rating:3.0/5.0
Slides: 26
Provided by: dblabE
Category:

less

Transcript and Presenter's Notes

Title: Metadata Management in Data Warehousing


1
Metadata Management in Data Warehousing
  • Prof. Hwan-Seung Yong
  • Dept. of CSE, Ewha Womans Univ.
  • http//dblab.ewha.ac.kr/hsyong

2
Data Environment
  • Data is becoming an abundant commodity
  • we can get it anywhere and everywhere
  • ???? ???? ??? ??? ????? ?? ??/?? ??
  • ??? ???? ?? ??? ?!
  • like trying to find one specific grain of sand on
    a stretch of beach
  • Good business practice
  • First, acquiring quality raw data
  • second, combining and integrating the data to
    make useful information, and then
  • analyzing the information and making high quality
    decisions
  • The torrent of raw data has added more choice,
    and therefore complexity, to the process.
  • What we need to do is put the data in context,
    give the data meaning, relevance, and purpose,
    and make it complete and accurate

3
Data From Everywhere
  • Corporations are coupling together in webs of
    suppliers, partners, and customers, exchanging a
    myriad of information through a spectrum of
    technologies such as
  • Electronic Document Interchange (EDI) systems,
  • Electronic Funds Transfer (EFT) systems,
  • email,
  • a host of other data acquisition and networking
    applications
  • existing legacy systems within enterprises
    continue to generate
  • data on orders, sales, revenues, employee
    information, manufacturing schedules, inventory,
    fleet status, and every other parameter
    imaginable.

4
What we know about the data generated by these
systems
  • Each department, division, group, branch, section
    or any other subdivision is today capable of
    generating its own unique caches of data.
  • data in one group can have a different meaning in
    another group in the same organization
  • This disparate data is exacerbated by
  • readily available CASE tools,
  • rapid application development tools,
  • application and code generators,
  • underutilized data models and definitions,
  • database products,
  • spreadsheets,
  • other client friendly products,
  • a lack of leadership in management.

5
Approach to solve this disparity
  • The concept of a data warehouse has emerged as a
    technology by which management can get a single
    comprehensive view of the state of the
    organization.
  • a requirement to under-stand and manage the
    properties of the data
  • Metadata attributes about the data, or data
    about data.
  • data that sits behind the operational data,
  • describes its origin, meaning, derivation, etc
  • ?) What is gross sales?
  • Dollars or French Francs,
  • quarterly or annualized,
  • what system does it come from,
  • when is it extracted, etc.?)
  • ??? ???? ?? ?? (? ??? ?? ?? ??) ?? ??? ????? ????
    ?????? ?!
  • range from a conceptual overview of the real
    world to detailed physical specifications for a
    particular database management system.

6
What is Metadata
  • 'chain of abstraction concept
  • Ex) an actual chair is a pretty concrete item.
  • Any specific chair has a host of characteristics
  • it is made of wood,
  • it has two arms,
  • it is a particular shape,
  • it has two cushions, and so on
  • one step in the chain of abstraction
  • Software is abstraction piled on abstraction
  • LET BALANCE BALANCE CREDIT
  • The source code statement in turn, is an
    abstraction of a banking transaction.
  • Metadata is an abstraction from data.
  • It is high-level data that describes lower-level
    data.

7
What is Metadata
  • Software is full of metadata for example,
  • Record descriptions in a COBOL program.
  • Logical view descriptions in a data server's
    catalog.
  • SQL Create statements.
  • Entity-relationship diagrams in a CASE tool's
    repository.
  • METADATA IS THE KEY TO DATA
  • Metadata is instrumental in transforming raw data
    into knowledge.
  • ?1) a given stream of bits can be interpreted as
  • a customer's address,
  • part of a photographic image,
  • a code fragment in a given computer's machine
    language.
  • ?2) In Windows 9x, file name extension is another
    metadata
  • Should the metadata get mixed up or out of
    alignment, none of the data will make sense.

8
Metadata Status in an Organization
  • rushed implementation
  • little thought given to coordinating data
    elements with other groups
  • so the problem continues
  • the metadata situation is worse than the data
    situation
  • the disparate data arriving from multiple sources
    from within and outside the corporation,
  • multiple tools creating metadata in a variety of
    formats.
  • Most companies
  • want to rapidly implement data warehousing
    systems
  • then discover themselves in a metadata dilemma.

9
Metadata Management Historical Approaches
  • Early in the days of Information Technology
  • all data was defined and maintained within the
    computer program itself.
  • Because the files' metadata was embedded in the
    application's data definitions, no application
    could make sense of another application's files
  • There was no need to share data between systems
  • In the late 1960's and early '70s, when databases
    were introduced
  • one of the greatest advantages they brought was
    that the metadata was stored in the database
    catalog
  • data "dictionaries" that were shared by programs
  • early implementers of data warehouses
  • the idea of a metadata repository was created,
    similar to a data dictionary
  • The awareness of the need to manage metadata has
    been an offshoot of the growth of data
    warehousing.

10
Growth of Data Warehousing
  • Growth across platforms.
  • spans from Microsoft NT, through the UNIX domain,
    and on up into the giants such as NCR, DEC, and
    IBM.
  • Growth across tools.
  • Success has many parents, and failure is an
    orphan.
  • From legacy data extraction tools to maintenance
    and scheduling tools, and yes, even tools that
    purported to handle metadata, vendors and
    products proliferated.
  • Growth across departments.
  • Success also breeds demand, which is exactly what
    happens when department A gets a new data mart
    and starts showing colleagues in department B how
    easily they can now access consistent data. Soon,
    department B has its own data mart, and
    departments C, D, and E are not far behind.

11
Metadata Management Problem in DW
  • the subject of metadata management remained
    fractured and dispersed
  • metadata is spread across different components of
    the warehouse,
  • in the data extraction/cleansing tools
  • in the loading tools,
  • in the OLAP tools, which need to present metadata
    to users in order to navigate.
  • Business rules are separated from technical
    metadata,
  • they are kept by different systems in different
    formats with different user interfaces.
  • In the case of multiple data marts spread across
    the enterprise,
  • if there is a need for a user in department A to
    use a data mart created by department B, then in
    most cases, that user has to relearn the metadata
    navigation for that system.

12
METADATA USED IN THE DATA WAREHOUSE
  • Results from the Data Warehouse
  • 1. 739516 13238350 426615 800441 4912313
  • 2. "An Inverness Group report dated 4/7/95 states
    that the European market for repository tools
    expanded by 33 in 1994".
  • 3. "Leading gadget vendors Protz Group 48,
    Harris Goods 29, Zymurgy Inc. 13"
  • How we can interpret this?
  • In the first case, The numbers may be
  • a firm's sales for different regions,
  • population of towns,
  • the number of hairs on certain individuals'
    heads.
  • a sequence of machine code for some computer.
  • In the second case, seems to be self-describing
  • the date is ambiguous. Does it mean '4th July
    1995' (British convention) or '7th April 1995'
    (US convention)?

13
METADATA USED IN THE DATA WAREHOUSE
  • Results from the Data Warehouse
  • 3. "Leading gadget vendors Protz Group 48,
    Harris Goods 29, Zymurgy Inc. 13"
  • The third example contains some metadata, but not
    enough.
  • Stipulating that 'gadget' has a precise meaning
    to someone in the gadget industry, we still do
    not know whether the market referred to is US,
    world, European,
  • what time period is referred to
  • how the data was collected or even the source of
    the information.
  • Under the circumstances, this example should be
    filed in the data warehouse with a metadata
  • note that says 'source unknown, reliability
    unknown, no further details'.

14
METADATA USEDIN THE DATA WAREHOUSE
  • In the operational environment,
  • metadata is mostly valuable to software
    developers and database administrators.
  • Operational databases are only accessed by
    transaction processing applications.
  • The end-users - people like bank clerks, travel
    agents and hospital staff -
  • do not need to know how information is held in
    the database.
  • simply interact with the forms and screens
    provided by the applications they use in the
    course of their everyday work.
  • The decision support environment
  • data analysts and executives are looking for
    useful facts and correlations
  • they need to understand its structure and
    meaning.
  • ?) A driver with a complete set of large-scale
    and small-scale maps
  • A data warehouse without adequate metadata is
    like a filing cabinet stuffed with papers, but
    without any folders or labels.

15
METADATA REQUIREMENTS
  • What tables, attributes, and keys does the data
    warehouse contain?
  • Where did each set of data come from?
  • What transformation logic was applied in loading
    the data?
  • How has the metadata changed over time?
  • What aliases exist and how are they related to
    each other?
  • What are the cross-references between technical
    and business terms?
  • For instance, the field name XVT-351J presumably
    meant something to a COBOL programmer in 1965,
    but what does it mean today?)
  • How often does the data get reloaded?
  • How much data is there?
  • This helps end-users to avoid submitting
    unrealistic queries. Given some means of
    determining the size of tables, staff can tell
    the end-users 'You can do anything you like with
    15,000 rows, but if it turns out to be 15 million
    rows - back off and ask for help!'

16
METADATA COMPONENTS
  • The mapping information records how data from
    operational sources is transformed on its way
    into the warehouse.
  • Identification of source fields.
  • Simple attribute-to-attribute mapping.
  • Attribute conversions.
  • Physical characteristic conversions.
  • Encoding/reference table conversions.
  • Naming changes.
  • Key changes.
  • Defaults.
  • Logic to choose from among multiple sources.
  • Algorithmic changes.

17
METADATA COMPONENTS (Contd)
  • EXTRACT HISTORY
  • Whenever historical information is analysed,
    meticulous update records have to be kept.
  • The metadata history is a good place to start any
    time-based report.
  • MISCELLANEOUS
  • Aliases can make the warehouse much more
    use-friendly
  • They also come in useful when different
    departments wish to use their own names to refer
    to the same underlying data.
  • Status information can be used to keep
    information for instance, tables might be
    classified 'in-design', 'in-test', 'inactive' or
    'active'.
  • Often, parts of the same data warehouse may be in
    different stages of development
  • Volumetrics could usefully include such
    information as number of rows, growth rate, usage
    characteristics, indexing, and byte
    specifications.
  • T he criteria and timescales for purging old
    data.

18
METADATA COMPONENTS (Contd)
  • SUMMARISATION ALGORITHMS
  • The algorithms for summarising the detail data
  • RELATIONSHIP ARTIFACTS AND HISTORY
  • Data warehouses implement relationships in a
    different way from production databases.
  • Metadata pertaining to related tables,
    constraints, and cardinality with text
    descriptions and ownership records.
  • OWNERSHIP/STEWARDSHIP
  • to identify the originator of each set of data,
    so that inquiries and corrections can be made to
    the proper group.
  • It is useful to distinguish between 'ownership'
    of data in the operational environment and
    'stewardship' in the data warehouse.
  • ACCESS PATTERNS
  • patterns of access to the warehouse to optimize
    and tune performance.
  • Less frequently used data can be migrated to
    cheaper storage media,

19
METADATA COMPONENTS (Contd)
  • REFERENCE TABLES/ENCODED DATA
  • Reference data is data that is stored in an
    external table and contains commonly used
    translations of values.
  • To recover the original unencoded data, together
    with 'effective from' and 'effective to' dates.
  • DATA MODEL - DESIGN REFERENCE
  • Metadata describing the mapping between the data
    model and the physical design
  • This allows an ambiguities or uncertainties to be
    resolved.

20
A Repository
  • the vehicle of metadata where information
    (metadata) about an organization's information
    systems components (objects, table definitions,
    fields, business rules and so on) is held.
  • also contains tools to facilitate the
    manipulation and query of the metadata.
  • a repository can
  • aid in the integration of the views of disparate
    systems by helping understand how the data used
    by those systems are related
  • support rapid change and assistance in building
    systems quickly by impact analysis and provision
    of standardized data
  • facilitate reuse by using object concepts and
    central accessibility
  • assist in implementation for data warehousing
  • support software development teams.

21
Requirements for Enterprise Repository
  • Nonproprietary Relational Database Management
    System
  • should use an industry standard DBMS which
    provides significant advantages over
    vendor-developed DBMSs.
  • advanced tools and utilities for database
    management (such as backups and performance
    tuning)
  • maintainablity and accessibility are enhanced by
    an "open" system.
  • Fully Extensible Meta Model
  • a complete self-defining, extensible repository
    based on a common entity/relationship diagram.
  • The repository should support the following meta
    model extensions
  • adding or modifying an entity type,
  • adding or modifying a linkage between entity
    types (associations or relationships),
  • adding user views to entities or relationships,
  • adding, deleting, or modifying attributes of
    relationships or entities,
  • modifying the list of allowable values for an
    attribute type,
  • adding or modifying commands or user exits,

22
Requirements for Enterprise Repository(Contd)
  • An API access to the repository
  • can provide with the flexibility needed to create
    a metadata management system which suits their
    unique needs.
  • Central Point of Metadata Control
  • The repository serves as a central point of
    control for data
  • a single place of record about information assets
    across the enterprise.
  • documents where the data is located,
  • who created and maintains the data,
  • what application processes it drives,
  • what relationship it has with other data,
  • how it should be translated and transformed.
  • Impact Analysis Capability
  • to provide the total impact of any change.

23
Requirements for Enterprise Repository (Contd)
  • Naming Standards Flexibility
  • should provide a detailed map of data definitions
    and elements,
  • allowing an organization to evaluate redundant
    definitions and elements
  • decide which ones should be eliminated,
    translated, or converted.
  • By enforcing naming standards,
  • assists in reducing data redundancies and
    increasing data sharing,
  • making the application development process more
    efficient and therefore less costly.
  • Versioning Capabilities
  • "versioning" can have many different definitions.
  • version control as in test vs. production
    (lifecycle phasing)
  • versions as unique occurrences
  • versioning by department or business unit and
  • version by aggregate or workstation ID.
  • facilitate the application lifecycle development
    process by allowing developers to work with the
    same object concurrently.

24
Requirements for Enterprise Repository (Contd)
  • Robust Query and Reporting
  • The end user tool should seamlessly pass queries
    to its own tool or third party products for
    automatic query generation and execution.
  • business users should be able to create detailed
    reports from these tools,
  • Data Warehousing Support
  • The repository provides information about the
    location and nature of operational data which is
    critical in the construction of a data warehouse.
  • It acts as a guide to the warehouse data, storing
    information necessary to define the migration
    environment, mappings of sources to targets,
    translation requirements, business rules, and
    selection criteria to build the warehouse

25
References
  • Putting Metadata to Work in the Warehouse,
    http//www.cai.com/products/platinum/wp/wp_meta.ht
    m
  • ComputerWire Data Warehousing Tools Bulletin
    Briefing Paper What is Metadata,
    http//www.computerwire.com/bulletinsuk/212e_1a6.h
    tm
Write a Comment
User Comments (0)
About PowerShow.com