GEDDM: Comparisons of OGSA-DAI and GridFTP for access to and conversion of remote unstructured data in legal data mining

1
GEDDM: Comparisons of OGSA-DAI and GridFTP for
access to and conversion of remote unstructured
data in legal data mining
  • Karen Loughran

2
Introduction
  • Grid Enabled Distributed Data Mining
  • Industrial partner
  • Overview of GEDDM
  • GEDDM Common Semantic Model (CSM) objectives
  • Grid enabled solution

3
Industrial Partner - Datactics
  • Northern Ireland based (formed 1999)
  • Specialising in grid enabled data-centric
    matching across multiple sectors
  • Datactics technology is fully parallelised
  • Computationally intensive - need to compare every
    record with every other record
  • Improve data quality by applying fuzzy matching
    techniques
  • Data mining software being used in the real world
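
The "compare every record with every other record" step can be sketched as below. This is only an illustration of all-pairs fuzzy matching, not Datactics' technology: the use of Python's `difflib.SequenceMatcher`, the `fuzzy_pairs` helper and the 0.85 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_pairs(records, threshold=0.85):
    """Compare every record with every other record: n*(n-1)/2 comparisons.

    The threshold is an illustrative value, not one taken from the slides.
    """
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            matches.append((i, j, round(score, 2)))
    return matches


records = ["Annie Saunders", "Anne Saunders", "Dede H Smith", "60 Newhaven St"]
for i, j, score in fuzzy_pairs(records):
    print(records[i], "~", records[j], score)
```

The quadratic number of comparisons is what makes the workload computationally intensive, and because each pair is scored independently the loop is a natural candidate for parallel execution.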

4
GEDDM Business Driver
  • Data sources: numerous structures, formats, locations and
    administrative domains
  • Client: US County Court insider trading litigation case
  • 45Tb
  • Variety of formats: email, PDF, weblogs, DBMS, report text dumps
  • How to interface to large volumes of data in a
    common, structured, parallel approach

5
Common Semantic Model (CSM) Objectives
  • Representation of unstructured data such as
    email, weblog, report dumps.
  • Conversion to structured format.
  • Evaluation of Grid technologies for access and
    conversion.
  • Secure, reliable and scalable.
  • Exploit high bandwidth.

6
CSM Grid Enabled Solution
  • Two Stages
  • Represent and convert unstructured Flat File
    Formats (FFF) to structured Common Output Format
    File (COFF).
  • Investigate Grid technologies for the remote
    access and conversion of unstructured data.

7
CSM Representation and Conversion
  • Data Description Language (DDL) - XSD
  • Data Description File (DDF)
  • Parser

8
Sample FFF data source and DDF
    App      Account       Address                  Balance
    IMP      343818        Dede H Smith             8600.76
                           181 Glen Rd
                           Earls Court, London
    IMP      565777        Annie Saunders           9905.50
                           60 Newhaven St
                           Edinburgh, Scotland
    ___________________________________________________________________
    <datasource>
    <database>
    <header><headertext>App Account Address Balance
    </headertext></header>
    <rectype eorecord="\n">
    <pfield name="App" pos="1" length="3"/>
    <pfield name="Account" pos="10" length="6"/>
    <pfield name="Address" pos="24" length="23" multiline="yes"/>
    <pfield name="Balance" pos="49" length="8"/>
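
A minimal sketch of how the position and length attributes in the sample DDF could drive conversion of the flat file is given below. The field layout is copied from the slide; the rule that a line with an empty first field continues the previous record, the `", "` separator and the helper names (`slice_field`, `parse_records`, `make_line`) are assumptions for illustration, not the project's actual parser.

```python
# Field layout copied from the sample DDF: (name, 1-based pos, length, multiline)
FIELDS = [
    ("App", 1, 3, False),
    ("Account", 10, 6, False),
    ("Address", 24, 23, True),
    ("Balance", 49, 8, False),
]


def slice_field(line, pos, length):
    """Cut one fixed-position field out of a line (pos is 1-based)."""
    return line[pos - 1:pos - 1 + length].strip()


def parse_records(lines):
    """Assumption: a line whose first field is empty continues the previous record."""
    records = []
    for line in lines:
        values = {name: slice_field(line, pos, length)
                  for name, pos, length, _ in FIELDS}
        if values[FIELDS[0][0]]:      # first field present: a new record starts
            records.append(values)
        elif records:                 # continuation line: extend multiline fields
            for name, _, _, multiline in FIELDS:
                if multiline and values[name]:
                    records[-1][name] += ", " + values[name]
    return records


def make_line(app="", account="", address="", balance=""):
    """Demo helper: build a fixed-width line matching the sample layout."""
    return f"{app:<9}{account:<14}{address:<25}{balance}"


sample = [
    make_line("IMP", "343818", "Dede H Smith", "8600.76"),
    make_line(address="181 Glen Rd"),
    make_line(address="Earls Court, London"),
    make_line("IMP", "565777", "Annie Saunders", "9905.50"),
    make_line(address="60 Newhaven St"),
    make_line(address="Edinburgh, Scotland"),
]
for record in parse_records(sample):
    print(record)
```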

9
Parser Design
  • Object oriented component hierarchy
  • Each object represents an XML element
  • Encapsulates data relating to the flat file
    component it describes
  • Encapsulates the all-important parse method
  • SAX parse performed on DDF to build up internal
    OO representation of FFF
  • Parse called on top level object.
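
The design above might look roughly like the following sketch, in which a SAX pass over the DDF builds one object per XML element and `parse` is then called on the top-level object. The class names (`PField`, `RecType`, `DDFHandler`) are hypothetical, and the DDF here is a trimmed version of the sample with closing tags added so that it is well-formed; the slides do not show the real class hierarchy.

```python
import xml.sax
from dataclasses import dataclass, field


@dataclass
class PField:
    """Represents one <pfield> element: a fixed-position field."""
    name: str
    pos: int            # 1-based start column
    length: int
    multiline: bool = False

    def parse(self, line):
        return line[self.pos - 1:self.pos - 1 + self.length].strip()


@dataclass
class RecType:
    """Represents one <rectype> element; delegates parsing to its fields."""
    fields: list = field(default_factory=list)

    def parse(self, line):
        return {f.name: f.parse(line) for f in self.fields}


class DDFHandler(xml.sax.ContentHandler):
    """SAX handler that builds the object hierarchy as elements are opened."""

    def __init__(self):
        super().__init__()
        self.rectype = None

    def startElement(self, name, attrs):
        if name == "rectype":
            self.rectype = RecType()
        elif name == "pfield" and self.rectype is not None:
            self.rectype.fields.append(PField(
                name=attrs["name"],
                pos=int(attrs["pos"]),
                length=int(attrs["length"]),
                multiline=attrs.get("multiline", "no") == "yes",
            ))


# Trimmed DDF; closing tags are an assumption added for well-formedness.
DDF = b"""<datasource><database><rectype>
<pfield name="App" pos="1" length="3"/>
<pfield name="Account" pos="10" length="6"/>
</rectype></database></datasource>"""

handler = DDFHandler()
xml.sax.parseString(DDF, handler)
print(handler.rectype.parse("IMP      343818"))
```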

10
CSM Grid technologies
  • Transfer and conversion tools
  • OGSA-DAI (Version 4)
  • GridFTP (GT4.0.0)
  • GUI interfacing to both of these technologies.

11
GUI interface: access and conversion

GUI interface to sample remote FFF, DDF creation
and conversion.

12
Implementation under OGSA-DAI
  • OGSA-DAI 4.0.0
  • Globus Toolkit 3.2.1
  • New conversion activity designed and implemented
  • Calls out to Python scripts to perform conversion

13
Implementation under GridFTP
  • Globus Toolkit 4.0.0
  • Data Storage Interface (DSI) creation to perform
    conversion processing at server
  • Instead of original unstructured FFF, send the
    COFF file back to client
  • Set up striped server architecture: multiple
    nodes working together in parallel.

14
GridFTP Striped Architecture

15
GridFTP Machine Specifications
  • BELFAST
  • AMD4400 Dual Processor
  • 4Gig RAM
  • 1 Terabyte hard disk, serial ATA2
  • 1 Gigabit ethernet
  • LONDON
  • Dual Opteron Processor
  • 4Gig RAM
  • 1 Terabyte hard disk
  • 1 Gigabit ethernet

16
GridFTP Evaluation Tests
  • Attempted conversion and access to large files
    across the network.
  • File sizes
  • 13Mb, 26Mb, 52Mb, 103Mb, 205Mb, 409Mb, 817Mb,
    1634Mb
  • Buffer sizes
  • Default, 4915, 409150, 785408
  • MTU 1400 - 8000

17
OGSA-DAI Benchmark Results
  • Currently no results available
  • Socket Timeout Error and Engine receives a
    terminate signal when Activity takes longer than
    approximately 10 minutes to run.
  • DeliverToGridFTP activity would not work in
    version 4. Patches required. So far, unable to
    get working with these patches.
  • Security setup issues.

18
GridFTP Network Topology

19
Results: GridFTP transfer
  • Throughput hindered by:
  • Physical infrastructure/service provider - 80 Mb/s
  • Routers/switches/NICs
  • 808 Mb/s CPU to CPU (London to Belfast)
  • 688 Mb/s disk to disk (BBC NI)
  • Striping with 2 BE servers - 60% improvement
  • Local 100 Mb/s switch
  • Disk to disk 82 Mb/s
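
To put the throughput figures in perspective, the short calculation below estimates how long the largest test file would take at two of the reported rates. It assumes the file sizes are in megabytes and the rates in megabits per second; the slides do not spell out either unit, and protocol overhead is ignored.

```python
def transfer_seconds(size_mbytes: float, rate_mbits_per_s: float) -> float:
    """Idealised transfer time: megabytes * 8 bits / (megabits per second)."""
    return size_mbytes * 8 / rate_mbits_per_s


largest_file_mb = 1634  # largest file in the evaluation tests (assumed megabytes)

for label, rate in (("disk to disk, 688 Mb/s", 688), ("local switch, 82 Mb/s", 82)):
    print(f"{label}: {transfer_seconds(largest_file_mb, rate):.1f} s")
```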

20
OGSA-DAI Evaluation
  • DeliverToGridFTP not working in 4.0.0
  • Configuring GridFTP not possible (buffer sizes,
    no. of streams, striped transfer etc.)
  • Some way to go in efficient transfer of large
    files.
  • Installation/runtime overheads
  • Design/code the conversion activity and design perform
    documents for access/conversion
  • Timeouts converting large files. Threads may be
    a solution.
  • Clear documentation

21
GridFTP Evaluation
  • Secure, reliable, fast and scalable
  • Lightweight installation
  • Optimum use of high bandwidth networks
  • Extra ERET/ESTO processing allows tighter
    integration of the conversion operation through the
    definition of a DSI
  • Striping for much improved efficiency

22
GridFTP Evaluation
  • Extensive tuning required
  • No clear documentation for writing a DSI.
  • gridftp-mpd_at_globus.org useful source of info
  • Poor performance on NFS.
  • PVFS-like filesystem recommended for striping.
  • 1 Gbit bandwidth is difficult to achieve in practice
    due to problems with:
  • Router
  • NIC
  • Physical Infrastructure

23
Conclusions
  • Investigated grid technologies for remote access
    and conversion
  • OGSA-DAI disappointing due to lack of support for
    large file transfer
  • GridFTP involved extensive configuration and, due
    to network infrastructure problems, it was difficult to
    get optimum performance in remote transfer

24
Future work
  • Tighter integration of conversion services within
    GridFTP DSI server module.
  • Extend the services under GridFTP to cope with
    Distributed Query Processing.
  • COFF produced as XML, ready for XPath queries.
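
As a rough illustration of what querying an XML COFF might look like, the sketch below runs an XPath-style query with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The COFF layout shown (`coff`, `record` and one child element per field) is entirely hypothetical; the presentation does not define the COFF schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical COFF document; element names are assumptions for illustration.
COFF = """<coff>
  <record><App>IMP</App><Account>343818</Account><Balance>8600.76</Balance></record>
  <record><App>IMP</App><Account>565777</Account><Balance>9905.50</Balance></record>
</coff>"""

root = ET.fromstring(COFF)

# Select the Balance of the record whose Account child has the given text.
balances = [float(e.text)
            for e in root.findall("./record[Account='565777']/Balance")]
print(balances)

# Select every account number in the file.
accounts = [e.text for e in root.findall("./record/Account")]
print(accounts)
```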

25
  • Questions?