High Performance Data Management - PowerPoint PPT Presentation

1
High Performance Data Management
  • Omer Rana
  • Cardiff University and
  • Welsh E-Science Centre
  • o.f.rana_at_cs.cf.ac.uk

2
From Michael Lesk
Burned manuscript from British Library
3
From Michael Lesk
4
From Michael Lesk
George Washington manuscript
5
From Michael Lesk
Keio University - Gutenberg Bible
6
From Michael Lesk
International Dunhuang project cave painting
7
From Michael Lesk
Seismic activity: IRIS database
8
From Michael Lesk
9
From Michael Lesk
Alcohol dehydrogenase
Molecule of the month, Protein Data Bank
10
From Michael Lesk
CalFlora database rhododendron
11
From Michael Lesk
Sky: Sloan Digital Sky Survey
12
From Michael Lesk
MRI scan, UCLA
13
From Michael Lesk
East coast dolphin database, Eckerd College
(Kelly Debure)
14
From multimap.com
15
Synthetic Aperture Radar
Uses: vegetation type, extent and deforestation;
soil erosion; soil moisture content;
archaeological investigations; ocean dynamics,
wave and surface wind speeds and direction;
volcanic and tectonic activity
Roy Williams (Caltech)
16
Applications: Detection and monitoring of oil
spills
Roy Williams (Caltech)
17
Archaeology
SAR sees new archaeology beneath forest cover
Angkor Wat, Cambodia: vast complex of more than
60 temples; spiritual center for the Khmer
people; 9th century
Roy Williams (Caltech)
18
From Michael Lesk
Human motion digital library, Jezekiel Ben-Arie
19
From Michael Lesk
CT scans of crocodile skulls, Tim Rowe (U Texas)
20
From Michael Lesk
Beauvais cathedral, Peter Allen, Columbia
21
From Michael Lesk
Forma Urbis Romae, Marc Levoy, Stanford
22
From Michael Lesk
Image searching: Jitendra Malik, David Forsyth
23
From Michael Lesk
3-D search: Tom Funkhouser, Princeton
25
Drivers
  • Survey of needs - led by some questions
  • Grid based infrastructure has become an important
    environment -- how do we utilise this?
  • What are specific manufacturers offering?
  • BlueArc's FPGA-based storage
  • What is available now - and what is (ideally)
    needed at all layers of data management
    hierarchy?
  • Currently available tools
  • Application demands

26
Server Storage Capacity Requirements
28
Technologies
  • Three core technologies
  • Object based representation
  • XML for metadata representation and management
  • Object distribution and Peer-2-Peer systems
    (JXTA, Gnutella, Freenet)
  • Related areas
  • Digital Libraries
  • Persistent Archives
  • Data Mining environments

29
Wider Perspective: Data Management
  • Data Preparation, Format, Fusion: Data
    Interoperability and Integration (Quality)
    Problem (Meta(n) Data)
  • Data Storage: I/O Problem (hierarchic)
  • Data Mining and Knowledge Discovery:
    Intelligence Problem (Neural nets, Genetic
    Algorithms, Rules)
  • Query Estimation and Optimisation: estimate cost
    of query based on data gathered from user
  • Data Exploration and Visualisation: Navigation
    and Interaction Problem
  • Role of Standards

30
Services
  • Requirement for
  • Local services
  • directly supported on resource or via a proxy
  • Must be implemented on all Grid-enabled resources
  • Resource dependent
  • Global services
  • Shared across resources
  • May be subscribed to
  • May be integrated with other such services

31
Global Services
  • Data Access
  • Primitive operations (read, write) supported with
    location management
  • Service should provide hooks to such locally
    supported functions
  • Storage management to support multiple access (eg
    maintaining a container in cache if multiple
    accesses are being requested)
  • Location Transparency
  • Support data set discovery
  • Keep location of data source independent of
    access method
  • Support for a global namespace
  • Support logical view of data set -- and maintain
    independence from physical location of data
    source
  • Support logical factoring of data sets into
    containers - and subsequently, federations
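
The separation of logical names from physical locations described above can be sketched as a small catalogue. This is only an illustration under stated assumptions: the `ReplicaCatalog` class, its method names, and the example URLs are hypothetical and do not come from any Grid toolkit named in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaCatalog:
    """Maps logical data set names to physical replica locations (hypothetical)."""
    _replicas: dict = field(default_factory=dict)

    def register(self, logical_name: str, physical_url: str) -> None:
        # One logical name may have several physical copies.
        self._replicas.setdefault(logical_name, []).append(physical_url)

    def resolve(self, logical_name: str) -> list:
        # Clients ask by logical name only; the location stays behind the catalogue.
        if logical_name not in self._replicas:
            raise KeyError(f"unknown data set: {logical_name}")
        return list(self._replicas[logical_name])

    def migrate(self, logical_name: str, old_url: str, new_url: str) -> None:
        # Moving the data changes the mapping, not the name clients use.
        urls = self._replicas[logical_name]
        urls[urls.index(old_url)] = new_url


catalog = ReplicaCatalog()
catalog.register("/climate/run42", "gridftp://site-a.example.org/d1/run42.dat")
catalog.register("/climate/run42", "file:///archive/run42.dat")
catalog.migrate("/climate/run42", "file:///archive/run42.dat", "file:///tape/run42.dat")
print(catalog.resolve("/climate/run42"))
```

After the migration the client still asks for `/climate/run42`; only the catalogue entry has changed.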

32
Global Services 2
  • Security
  • System/User access without account on remote
    source
  • Access control shared across
    collections/aggregations
  • Ownership by collection objects, rather than
    individual users
  • Access control catalogues to separate logical
    access control mechanisms, independent of any
    particular resource
  • Persistence and Replication
  • Access via unique identifier independent of
    physical location of data set/collection (not
    possible with URIs or PURLs)
  • Must be automatically supported via an event
    service
  • Requires a global namespace
  • Error Handling
  • Distinguish between errors from individual
    resources and those from data
    collections/aggregations
  • Error handling supported via global namespace and
    event service
  • Error reporting to resource and client requesting
    service
  • Management Services

33
Global Services 3
  • Additional management services provided globally
    can include
  • Check-pointing and state management service
  • Data migration service to facilitate logical
    collections
  • Container management
  • Support for collection aggregation
  • Transformation between data supported through
    languages such as XSLT
  • XML useful for encoding data
  • Must be able to cover both ASCII and Binary data
    (BinX (NeSC), VOTable (Astronomy))

34
XPath and XSLT
  • XSL
  • Extensible Stylesheet Language
  • XSLT
  • Extensible Stylesheet Language Transformations

Output (HTML, LaTeX, Excel, ...)
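
As a rough illustration of selecting nodes with XPath and turning them into HTML output, the sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset and has no XSLT engine. The transformation step is therefore hand-written here; a real XSLT stylesheet would need a separate XSLT processor. The element and attribute names in the sample document are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; element and attribute names are invented.
SOURCE = """
<collection name="survey">
  <dataset id="d1" format="binary"><title>Sky tiles</title></dataset>
  <dataset id="d2" format="ascii"><title>Object catalogue</title></dataset>
  <dataset id="d3" format="ascii"><title>Calibration log</title></dataset>
</collection>
"""

root = ET.fromstring(SOURCE)

# XPath-style selection: every <dataset> child whose format attribute is "ascii".
ascii_sets = root.findall("./dataset[@format='ascii']")

# Hand-written transformation of the selected nodes into an HTML list.
items = "".join(f"<li>{d.get('id')}: {d.findtext('title')}</li>" for d in ascii_sets)
html = f"<ul>{items}</ul>"
print(html)
```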
35
Data Storage and Access
  • Policy (division between services)
  • automated (resource supported) vs. user
    implemented.
  • Must be exposed to the user
  • Operations (minimal set supported)
  • Access operations (read, write)
  • Discovery support (address lookup, access
    properties)
  • Exception handling
  • State Management (transaction support)
  • State recording within resource or via a
    checkpoint service
  • Mechanism
  • Actual implementation of local services
  • Direct access to such mechanisms via resource
    metadata (by global services)
  • Ability to support multiple mechanisms in same
    resource
  • Structure
  • physical organisation of resource or its contents
    (disks, number of heads, access/transfer rates
    etc)
  • Support for external manipulation

36
Minimum Unit of transfer
  • Unit of transfer based on access patterns and
    data structures
  • Array based access in Scientific Computing
  • Random access in Business Computing
  • Support for block, cyclic, or irregular array
    strides (eg from Fortran programs)
  • Resource type also determines minimum unit
  • File system (NFS or AFS)
  • HPSS or DPSS or network caches
  • Structured databases (Objectivity, Oracle)
  • Minimum unit of transfer must be made explicit
  • Support for collections/aggregations of minimal
    units
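
To make the block and cyclic access patterns mentioned above concrete, the following sketch computes which array indices each of several readers would request under either scheme. The function names are hypothetical, and the even-as-possible split used for the block case is one common convention, not something specified in the slides.

```python
def block_indices(n: int, parts: int, rank: int) -> list:
    """Contiguous block of a length-n array owned by `rank` out of `parts`."""
    base, extra = divmod(n, parts)
    # The first `extra` ranks take one additional element each.
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return list(range(start, start + size))


def cyclic_indices(n: int, parts: int, rank: int) -> list:
    """Every `parts`-th element starting at `rank` (a stride of `parts`)."""
    return list(range(rank, n, parts))


n, parts = 10, 3
for rank in range(parts):
    print(rank, block_indices(n, parts, rank), cyclic_indices(n, parts, rank))
```

A block request maps to one contiguous transfer, while a cyclic request touches many small strided pieces, which is why the minimum unit of transfer matters for the second pattern.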

37
Data Formats
  • Need for common data formats to support
    cross-domain analysis and data sharing
  • Automated annotation
  • of experimental results for analysis
  • of stages of analysis for management tools
  • Support for data fusion and quality management
  • However,
  • Unlikely to happen
  • Unlikely to be ratified through standards
  • Unlikely to be accepted by everyone
  • Compromise?
  • Define points of sharing rather than actual
    data formats
  • Support ease of exchange between formats, rather
    than agree on specific formats
  • Tried before: Ontolingua (Stanford), DAML (DARPA,
    Maryland)
  • Can we have a Grid Ontology for Data Management?
    (OWL, OIL)

38
Metadata Management
  • Separate Content from Structure
  • Re-purpose data (supports sharing) - via
    catalogues
  • Used for data integration
  • DataCutter Project
  • Could represent
  • Scheme for locating data
  • Properties of data resource externally visible
  • security privileges (access rights)
  • Content types and structure (relationships)
  • Content Structure (Relationships) to support
    semantic interoperability
  • semantic/functional
  • spatial/structural
  • temporal/procedural

39
Data Processing
  • Processing Characteristics
  • Well defined work flow
  • Correction, calibration, transformation,
    filtering, merging
  • Relatively static reference data
  • Stable processing functions (audited changes)
  • Periodic reprocessing from archive

40
Analysis and Interpretation
  • Analysis Characteristics
  • Variable workflow
  • Standard functions
  • Standard and personal filtering and
    summarisation
  • Retain drill down capability

41
Analysis and Interpretation
  • Conclusions/Inferences
  • Descriptions
  • Trends
  • Correlations
  • Relationships
  • Analysis and Interpretation Characteristics
  • Highly dynamic work flow
  • Multiple data types
  • Volatile data
  • Annotations, inferences, conclusions
  • Evidential reasoning
  • Shared multiple versions of truth
  • Periodic version consolidation

42
Metadata Requirements
  • Technical Metadata
  • Direct referencing - Physical location and data
    schema/structure
  • Data currency/status: version, time stamping
  • Accreditation/Access permissions - Ownership
    (Dublin Core)
  • Query time/Governance - data volume, no. of
    records, access paths
  • Contextual Metadata
  • Logical referencing: physical data,
    semantic/syntactic ontologies
  • Lexical translation: thesaurus, ontological
    mapping
  • Named derivations (summarisations)
  • Scope of Requirements
  • All science communities
  • Related to provenance

43
Metadata Requirements
  • Data Versioning
  • Distinguish latest/agreed version of data
  • Maintain history record of change
  • Synchronise and mirror replicated data
  • Distinguish shared personal interpretations
    and/or annotations
  • Provenance
  • Record of data processing: calibration,
    filtering, transformation
  • Record of workflow: methods, standards and
    protocols
  • Reasoning: evidential justification for
    inferences and conclusions
  • Scope of Requirements
  • All science communities
  • Includes Technical and Contextual Metadata
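
One way to picture the versioning and provenance requirements above is a chain of immutable version records, each pointing at the version it was derived from. This is a minimal sketch under assumptions: the `Version` class and its fields are a hypothetical schema, not a format proposed in the slides.

```python
from dataclasses import dataclass
from typing import Optional
import hashlib


@dataclass(frozen=True)
class Version:
    """One version of a data set plus the step that produced it (hypothetical schema)."""
    content: bytes
    operation: str               # e.g. "calibration", "filtering"
    parent: Optional["Version"] = None

    @property
    def digest(self) -> str:
        # A content hash lets replicas be checked against the recorded version.
        return hashlib.sha256(self.content).hexdigest()

    def history(self) -> list:
        # Walk back through parents to recover the full processing record.
        steps, node = [], self
        while node is not None:
            steps.append(node.operation)
            node = node.parent
        return list(reversed(steps))


raw = Version(b"1,2,3,999", "acquisition")
calibrated = Version(b"1.1,2.1,3.1,999.1", "calibration", parent=raw)
filtered = Version(b"1.1,2.1,3.1", "filtering", parent=calibrated)
print(filtered.history())
```

Because each record keeps its parent, the latest version can always be traced back to the raw data, and mirrored copies can be compared by digest.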

44
Provenance
  • When you see some data on the Web, do you know
  • where it came from?
  • why it is there?
  • This information (provenance) is typically lost
    in the process of
    copying/transcribing/transforming databases
  • Loss of provenance is an acute problem in some
    scientific databases
  • Especially relevant when combining data from
    multiple sources
  • The Web has lots of stuff (50 TB or so). Digital
    libraries are introducing new kinds of
    collections and often try to maintain metadata
    control, but still have little control over the
    ultimate content.

45
Standards at different levels
  • IEEE Open Storage Systems Interconnection (OSSI)
  • derived from IEEE Mass Storage Reference Model
    (1980s)
  • Low level detail of storage systems: media, drive
    technology, structure of media
  • Meant to facilitate mechanism sharing between
    storage resources
  • ISO Open Archival Information System (OAIS)
  • Support for managing persistent archives
  • Support for data submission, recording,
    ingestion, encapsulation of data objects with
    attributes and data export
  • Involves the human in the loop (this is seen as
    an important part)
  • ISO 13250 Topic Maps
  • Standard notation for structure of information
    sources (topics) and relationships between topics
  • Interrelated documents supporting this
    representation: Topic Maps
  • Can relate topics via associations, groups,
    similarities
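
The idea of topics linked by typed associations can be sketched with a small in-memory structure. This is not the ISO 13250 notation or any Topic Maps interchange syntax. It is a simplified illustration in which associations are symmetric and have no roles, and the topic and association names are taken loosely from earlier slides purely as sample data.

```python
from collections import defaultdict

# topic -> set of (association type, other topic)
associations = defaultdict(set)


def associate(topic_a: str, kind: str, topic_b: str) -> None:
    """Record a typed association in both directions."""
    associations[topic_a].add((kind, topic_b))
    associations[topic_b].add((kind, topic_a))


def related(topic: str, kind: str) -> list:
    """Topics linked to `topic` by associations of the given type."""
    return sorted(other for k, other in associations[topic] if k == kind)


associate("Angkor Wat", "imaged-by", "Synthetic Aperture Radar")
associate("Oil spills", "imaged-by", "Synthetic Aperture Radar")
associate("Angkor Wat", "located-in", "Cambodia")
print(related("Synthetic Aperture Radar", "imaged-by"))
```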

46
A Common Infrastructure
[Layered diagram; labels: Application; Data Cutter; Local Services; Collect./Aggreg.; Storage Resource Broker; Global Services; Globus (Legion); P2P Infrastructure; Client/Server Infras.; Resources]
47
Data Cutter
  • Provides support for managing distributed,
    archived data sets
  • Extends the Active Data Repository to
  • support range queries
  • user defined filters on data
  • data processing at source
  • Uses the abstraction of a filter to support
    distributed processing and management of data via
    common services
  • Also supports data division based on range
    queries
  • uses a hierarchical indexing scheme
  • Primary aim is to support
  • distributed data sets organised as collections
  • where data sets can be generated by distributed
    platforms
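
The filter abstraction and range queries described above can be pictured as chained stream stages. The sketch below is not the DataCutter API: the function names, the `(x, y, value)` record layout, and the sample data are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple[float, float, float]  # (x, y, value); this layout is an assumption


def range_query(records: Iterable[Record], x_range, y_range) -> Iterator[Record]:
    """Yield only records whose coordinates fall inside the requested box."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    for x, y, value in records:
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            yield (x, y, value)


def apply_filter(records: Iterable[Record], fn: Callable[[Record], Record]) -> Iterator[Record]:
    """User-defined filter stage: transform each record as it streams past."""
    for record in records:
        yield fn(record)


def scale(record: Record) -> Record:
    # Example user-defined filter: rescale the value field.
    x, y, value = record
    return (x, y, value * 10)


data = [(0.0, 0.0, 1.0), (2.0, 2.0, 2.0), (5.0, 5.0, 3.0)]
# Stages are chained lazily, so the range query can run close to the data
# and only matching records reach the later stages.
result = list(apply_filter(range_query(data, (1.0, 6.0), (1.0, 6.0)), scale))
print(result)
```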

48
From Reagan Moore
SDSC Storage Resource Broker Meta-data
Catalog Common APIs
Application
Linux I/O
OAI
Access APIs
DLL / Python
Java, NT Browsers
GridFTP

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
49
Spitfire (CERN)
Uniform access to RDBMS
  • Intended as middleware for access to
    relational databases
  • Three SOAP services
  • Base service for standard operations,
  • Admin service for administrative access,
  • Info service for information on the database and
    its tables

50
Grid Database Service Specification Principles
From Amy Krause et al. (GGF6)
  • Provide service-based access to existing database
    systems in the context of Grid Computing.
  • Be orthogonal to the Grid authentication and
    authorization mechanisms
  • Accommodate diverse database paradigms within a
    consistent framework.
  • Accommodate diverse database metadata
  • Support higher-level information-integration and
    federation services.
  • Adopt the document approach to service
    description
  • Defined semi-formally (WSDL and English text)

51
Grid Database Service Port Types
Grid Data Service
GridService Port
  • Mandatory
  • GridDataService.
  • GridService.
  • Optional
  • GridDataTransport.
  • NotificationSource.

Find Service Data

Client (Application or Federator)
GridDataService Port
Query Specification

Transport Specification
GridDataTransport Port

52
Grid Data Service Initialization
Grid Service Registry
Grid Data Service Factory
Create Grid Data Service
Create Grid Data Service
Find Factory

Client (Application or Federator)

Grid Data Service
Database
Data Request

53
  • Montage - Custom Image Mosaics
  • http://montage.ipac.caltech.edu
  • User specified size, WCS projection,
    coordinates, spatial sampling, rotation
  • Rectification of backgrounds in images
  • Supports drizzle algorithm
  • Delivery
  • Semi-annual deliveries from Feb 2003
  • Final Delivery Jan 2005
  • Available for download
  • Science Drivers
  • Science Grade Images
  • Analyze diverse images as if part of same
    multi-wavelength image (radio, infrared,
    optical etc)

54
Montage
  • Combine data from different astronomers (using
    different instruments)

57
http://www.neurogrid.net/
58
http://www.neurogrid.net/
59
OceanStore (Berkeley)
data source
data plane
Web content server
network plane
60
OceanStore Goal and Challenges
Provide content distribution to clients with good
Quality of Service (QoS) while retaining
efficient and balanced resource consumption of
the underlying infrastructure
  • Dynamic choice of number and location of replicas
  • Clients' QoS constraints
  • Servers' capacity constraints
  • Efficient update dissemination
  • Delay
  • Bandwidth consumption
  • Scalability: millions of objects, clients and
    servers
  • No global network topology knowledge
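
The interplay of the client QoS and server capacity constraints listed above can be illustrated with a simple selection rule. This is not the Tapestry-based placement algorithm shown on the following slides. It is a hedged sketch in which the `Replica` fields, the latency bound, and the least-loaded tie-break are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    latency_ms: float   # measured delay from this replica to the client
    load: int           # clients currently served
    capacity: int       # maximum clients this server accepts


def choose_replica(candidates: List[Replica], max_latency_ms: float) -> Optional[Replica]:
    """Return the least-loaded replica meeting both constraints, or None."""
    feasible = [r for r in candidates
                if r.latency_ms <= max_latency_ms and r.load < r.capacity]
    if not feasible:
        # Only in this case would the caller consider creating a new replica.
        return None
    return min(feasible, key=lambda r: r.load)


replicas = [Replica("s1", 20.0, 5, 5), Replica("s2", 35.0, 1, 4), Replica("s3", 80.0, 0, 4)]
picked = choose_replica(replicas, max_latency_ms=50.0)
print(picked.name if picked else "place new replica")
```

Here `s1` is close but full and `s3` is idle but too far away, so the client is attached to `s2`; a new replica is only needed when no existing one satisfies both constraints.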

61
Dynamic Replica Placement: naïve
[Diagram labels: data plane; s; c; network plane; Tapestry mesh]
62
Dynamic Replica Placement: naïve
[Diagram labels: data plane; parent candidate; s; proxy; c; network plane; Tapestry mesh; Tapestry overlay path]
63
Dynamic Replica Placement: smart
[Diagram labels: data plane; client child; s; parent; proxy; sibling; c; server child; network plane]
64
Dynamic Replica Placement: smart
  • Aggressive search
  • Lazy placement
  • Greedy load distribution

[Diagram labels: data plane; parent candidates; client child; s; parent; proxy; sibling; c; server child; network plane]
65
HiveCache from MojoNation
  • Utilises free disk space to undertake backup --
    builds a dynamic RAID network
  • Disk space monitored to determine how much to
    allocate (via a local agent)
  • File broken into pieces, along with error
    correction information and encryption
  • Data highly replicated
  • Little unique info in a company PC
  • common apps (Word, Excel etc.)
  • operating system files, utilities
  • Personal data, however, is not very large
  • Uses SHA1 hash algorithm to map each file
    fragment
  • acts as a unique key to locate the data fragment
  • Request to retrieve a file is mapped to machines
    -- which retrieve file in parts in parallel
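
The use of a SHA-1 hash as the key for each file fragment can be sketched as follows. This is a toy illustration, not HiveCache's implementation: a plain dictionary stands in for the network of peers, the fragment size is deliberately tiny, and error correction, encryption, and parallel retrieval are omitted.

```python
import hashlib

FRAGMENT_SIZE = 4  # bytes; tiny on purpose, a real system would use far larger pieces


def fragment(data: bytes, size: int = FRAGMENT_SIZE) -> list:
    """Split data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store(data: bytes, blocks: dict) -> list:
    """Store fragments under their SHA-1 key and return the ordered key list."""
    keys = []
    for piece in fragment(data):
        key = hashlib.sha1(piece).hexdigest()
        blocks[key] = piece      # identical fragments collapse onto one key
        keys.append(key)
    return keys


def retrieve(keys: list, blocks: dict) -> bytes:
    """Reassemble a file from its fragment keys."""
    return b"".join(blocks[k] for k in keys)


blocks = {}
keys_a = store(b"AAAABBBBCCCC", blocks)
keys_b = store(b"AAAABBBBDDDD", blocks)
print(len(blocks))
```

The two files share their first two fragments, so only four distinct fragments are stored for six fragment references. This mirrors the observation above that most content on company PCs is common to many machines.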

66
Peer-Oriented systems
From Wolfgang Hoschek
67
Final Thoughts
  • Aim to identify themes of interest for Grid
    environments
  • Identify Local and Global services
  • Some of these are available in existing software
    - some must be implemented by a user
  • Standards play an important role
  • Development of Common Open Grid Data Services for
    Science