Title: GEDDM: Comparisons of OGSA-DAI and GridFTP for access to and conversion of remote unstructured data in legal data mining
1 GEDDM: Comparisons of OGSA-DAI and GridFTP for access to and conversion of remote unstructured data in legal data mining
2 Introduction
- Grid Enabled Distributed Data Mining
- Industrial partner
- Overview of GEDDM
- GEDDM Common Semantic Model (CSM) objectives
- Grid enabled solution
3 Industrial Partner - Datactics
- Northern Ireland based (formed 1999)
- Specialising in grid enabled data-centric matching across multiple sectors
- Datactics technology is fully parallelised
- Computationally intensive - need to compare every record with every other record
- Improve data quality by applying fuzzy matching techniques
- Data mining software being used in the real world
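The "compare every record with every other record" step can be sketched as below. This is only an illustration: the similarity metric (Python's standard `difflib`), the 0.85 threshold and the sample names are assumptions, not the Datactics matching technology.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; difflib is a stand-in metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_pairs(records, threshold=0.85):
    """Compare every record with every other one: n*(n-1)/2 comparisons."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:  # threshold is an assumed cut-off
            matches.append((i, j, score))
    return matches


# Hypothetical records containing two near-duplicate pairs.
records = ["Dede H Smith", "Dede H. Smith", "Annie Saunders", "Anne Saunders"]
matches = fuzzy_pairs(records)
for i, j, score in matches:
    print(records[i], "~", records[j], round(score, 2))
```

The number of comparisons grows quadratically with the number of records, which is why a parallel approach matters at this scale.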
4 GEDDM Business Driver
- Data sources
  - numerous structures, formats, locations, administrative domains
- Client
  - US County Court insider trading litigation case
  - 45 TB
  - Variety of formats
  - Email, PDF, weblogs, DBMS, report text dumps
- How to interface to large volumes of data in a common, structured, parallel approach
5 Common Semantic Model (CSM) Objectives
- Representation of unstructured data such as email, weblog, report dumps.
- Conversion to structured format.
- Evaluation of Grid technologies for access and conversion.
- Secure, reliable and scalable.
- Exploit high bandwidth.
6 CSM Grid Enabled Solution
- Two Stages
  - Represent and convert unstructured Flat File Formats (FFF) to structured Common Output Format File (COFF).
  - Investigate Grid technologies for the remote access and conversion of unstructured data.
7 CSM Representation and Conversion
- Data Description Language (DDL) - XSD
- Data Description File (DDF)
- Parser
8 Sample FFF data source and DDF
App      Account       Address                  Balance
IMP      343818        Dede H Smith             8600.76
                       181 Glen Rd
                       Earls Court, London
IMP      565777        Annie Saunders           9905.50
                       60 Newhaven St
                       Edinburgh, Scotland

<datasource>
<database>
<header><headertext>App Account Address Balance</headertext></header>
<rectype eorecord="\n">
<pfield name="App" pos="1" length="3"/>
<pfield name="Account" pos="10" length="6"/>
<pfield name="Address" pos="24" length="23" multiline="yes"/>
<pfield name="Balance" pos="49" length="8"/>
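To show how the `pos` and `length` attributes map onto the flat file, here is a minimal sketch that cuts the four fields out of fixed-width lines. The column layout of the sample lines, the `row` helper and the rule that a line with an empty App column continues the previous record's multiline field are all assumptions made for illustration; the slide does not spell them out.

```python
# Field layout copied from the sample DDF: (name, 1-based pos, length, multiline)
FIELDS = [
    ("App", 1, 3, False),
    ("Account", 10, 6, False),
    ("Address", 24, 23, True),
    ("Balance", 49, 8, False),
]


def cut(line, pos, length):
    """Slice a 1-based, fixed-width column out of a line."""
    return line[pos - 1:pos - 1 + length].strip()


def parse_records(lines):
    """Group physical lines into records (assumed continuation rule)."""
    records = []
    for line in lines:
        if cut(line, 1, 3):  # App column filled: a new record starts here
            records.append({name: cut(line, pos, n) for name, pos, n, _ in FIELDS})
        elif records:  # App column empty: extend multiline fields only
            for name, pos, n, multiline in FIELDS:
                extra = cut(line, pos, n)
                if multiline and extra:
                    records[-1][name] += ", " + extra
    return records


def row(app, account, address, balance=""):
    """Hypothetical helper: lay a line out at columns 1, 10, 24 and 49."""
    return f"{app:<9}{account:<14}{address:<25}{balance}".rstrip()


sample = [
    row("IMP", "343818", "Dede H Smith", "8600.76"),
    row("", "", "181 Glen Rd"),
    row("", "", "Earls Court, London"),
    row("IMP", "565777", "Annie Saunders", "9905.50"),
    row("", "", "60 Newhaven St"),
    row("", "", "Edinburgh, Scotland"),
]
records = parse_records(sample)
```

Each resulting dictionary holds one logical record, with the three address lines joined into a single `Address` value.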
9 Parser Design
- Object oriented component hierarchy
- Each object represents an XML element
  - Encapsulates data relating to the flat file component it describes
  - Encapsulates all import and parse logic
- SAX parse performed on DDF to build up internal OO representation of FFF
- Parse called on top level object.
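The design above might look roughly like the following sketch, which uses Python's standard `xml.sax` module. The class names, the shortened DDF (with closing tags added so that it is well-formed) and the single-line records are assumptions for illustration, not the GEDDM parser itself.

```python
import xml.sax


class PField:
    """One <pfield> element: knows how to cut its own column from a line."""

    def __init__(self, attrs):
        self.name = attrs["name"]
        self.pos = int(attrs["pos"])
        self.length = int(attrs["length"])

    def parse(self, line):
        start = self.pos - 1  # DDF positions are treated as 1-based
        return line[start:start + self.length].strip()


class RecType:
    """One <rectype> element: owns its fields and parses a whole line."""

    def __init__(self):
        self.fields = []

    def parse(self, line):
        return {f.name: f.parse(line) for f in self.fields}


class DataSource:
    """Top-level object: parse() is called here and delegates downwards."""

    def __init__(self):
        self.rectypes = []

    def parse(self, lines):
        return [self.rectypes[0].parse(line) for line in lines]


class DDFHandler(xml.sax.ContentHandler):
    """Builds the object hierarchy as SAX reports each start tag."""

    def __init__(self):
        super().__init__()
        self.source = DataSource()

    def startElement(self, name, attrs):
        if name == "rectype":
            self.source.rectypes.append(RecType())
        elif name == "pfield":
            self.source.rectypes[-1].fields.append(PField(attrs))


# Shortened, hypothetical DDF; closing tags are assumed.
DDF = b"""<datasource><database>
<rectype eorecord="\\n">
<pfield name="App" pos="1" length="3"/>
<pfield name="Account" pos="10" length="6"/>
</rectype></database></datasource>"""

handler = DDFHandler()
xml.sax.parseString(DDF, handler)
rows = handler.source.parse(["IMP".ljust(9) + "343818"])
```

Because each object parses only its own component, supporting a new DDF element would mean adding one class and one branch in the handler.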
10 CSM Grid technologies
- Transfer and conversion tools
- OGSA-DAI (Version 4)
- GridFTP (GT4.0.0)
- GUI interfacing to both of these technologies.
11 GUI interface: access and conversion
GUI interface to sample remote FFF, DDF creation and conversion.
12 Implementation under OGSA-DAI
- OGSA-DAI 4.0.0
- Globus Toolkit 3.2.1
- New conversion activity designed and implemented
- Calls out to Python scripts to perform conversion
13 Implementation under GridFTP
- Globus Toolkit 4.0.0
- Data Storage Interface (DSI) creation to perform conversion processing at server
- Instead of original unstructured FFF, send the COFF file back to client
- Setup striped server architecture: multiple nodes working together in parallel.
14 GridFTP Striped Architecture
15 GridFTP Machine Specifications
- BELFAST
- AMD4400 Dual Processor
- 4Gig RAM
- 1 Terabyte hard disk, serial ATA2
- 1 Gigabit ethernet
- LONDON
- Dual Opteron Processor
- 4Gig RAM
- 1 Terabyte hard disk
- 1 Gigabit ethernet
16 GridFTP Evaluation Tests
- Attempted conversion and access to large files across the network.
- File sizes
  - 13Mb, 26Mb, 52Mb, 103Mb, 205Mb, 409Mb, 817Mb, 1634Mb
- Buffer sizes
  - Default, 4915, 409150, 785408
- MTU 1400 - 8000
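One way to enumerate such a test matrix is sketched below. It only builds `globus-url-copy` command lines and does not run them. The host name and file paths are hypothetical, and treating the listed buffer sizes as TCP buffer sizes in bytes (the `-tcp-bs` option) is an assumption. MTU is a network interface setting and is not part of the command.

```python
from itertools import product

FILE_SIZES_MB = [13, 26, 52, 103, 205, 409, 817, 1634]
BUFFER_SIZES = [None, 4915, 409150, 785408]  # None = leave the default


def build_command(size_mb, tcp_buffer, host="gridftp.example.org"):
    """Build one transfer command; host and paths are placeholders."""
    cmd = ["globus-url-copy", "-vb"]  # -vb reports transfer performance
    if tcp_buffer is not None:
        cmd += ["-tcp-bs", str(tcp_buffer)]  # assumed to be the tuned buffer
    cmd += [
        f"gsiftp://{host}/data/sample_{size_mb}.fff",
        f"file:///tmp/sample_{size_mb}.coff",
    ]
    return cmd


# Every file size paired with every buffer size.
commands = [build_command(s, b) for s, b in product(FILE_SIZES_MB, BUFFER_SIZES)]
for cmd in commands[:4]:
    print(" ".join(cmd))
```

Each command list could then be handed to `subprocess.run` on a machine where the Globus client tools are installed.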
17 OGSA-DAI Benchmark Results
- Currently no results available
- Socket Timeout Error and Engine receives a terminate signal when Activity takes longer than approximately 10 minutes to run.
- DeliverToGridFTP activity would not work in version 4. Patches required. So far, unable to get working with these patches.
- Security setup issues.
18 GridFTP Network Topology
19 Results: GridFTP transfer
- Throughput hindered by
  - Physical Infrastructure/Service Provider - 80 Mb/s
  - Router/switches/NIC
- 808 Mb/s CPU to CPU (London to Belfast)
- 688 Mb/s Disk to Disk (BBC NI)
- Striping with 2 BE servers - 60% improvement
- Local 100 Mb/s switch
  - Disk to disk 82 Mb/s
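To put these rates in context, the arithmetic below estimates how long the 45 TB mentioned for the client case would take at each measured rate. It is purely illustrative: it assumes decimal terabytes, rates in megabits per second, and a sustained transfer with no protocol overhead.

```python
DATA_BYTES = 45e12  # 45 TB, decimal terabytes assumed

RATES_MBIT = {
    "service provider limit": 80,
    "local 100 Mb/s switch, disk to disk": 82,
    "disk to disk": 688,
    "CPU to CPU": 808,
}


def transfer_days(n_bytes, rate_mbit_per_s):
    """Days needed to move n_bytes at a sustained rate in Mbit/s."""
    seconds = n_bytes * 8 / (rate_mbit_per_s * 1e6)
    return seconds / 86400


estimates = {label: transfer_days(DATA_BYTES, rate) for label, rate in RATES_MBIT.items()}
for label, days in estimates.items():
    print(f"{label}: {days:.1f} days")
```

Under these assumptions the gap between the slowest and fastest rates is roughly an order of magnitude, from weeks down to days.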
20 OGSA-DAI Evaluation
- DeliverToGridFTP not working in 4.0.0
- Configuring GridFTP not possible (buffer sizes, no. of streams, striped transfer etc.)
- Some way to go in efficient transfer of large files.
- Installation/runtime overheads
- Design/code conversion activity and design perform documents for access/conversion
- Timeouts converting large files. Threads may be solution.
- Clear documentation
21 GridFTP Evaluation
- Secure, reliable, fast and scalable
- Lightweight installation
- Optimum use of high bandwidth networks
- Extra ERET/ESTO processing allows tighter integration of conversion operations through the definition of a DSI
- Striping for much improved efficiency
22 GridFTP Evaluation
- Extensive tuning required
- No clear documentation for writing a DSI.
- gridftp-mpd_at_globus.org useful source of info
- Poor performance on NFS.
- PVFS like filesystem recommended for striping.
- 1 Gbit bandwidth in practice difficult to achieve due to problems with
  - Router
  - NIC
  - Physical Infrastructure
23 Conclusions
- Investigated grid technologies for remote access and conversion
- OGSA-DAI disappointing due to lack of support for large file transfer
- GridFTP involved extensive configuration and, due to network infrastructure problems, it was difficult to get optimum performance in remote transfer
24 Future work
- Tighter integration of conversion services within GridFTP DSI server module.
- Extend the services under GridFTP to cope with Distributed Query Processing.
- COFF produced as XML, ready for XPath queries.
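As an illustration of the last point, the sketch below runs simple XPath-style queries over a COFF document with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The element names and overall COFF layout here are hypothetical; the slides do not show the real COFF schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical COFF layout: one <record> per flat-file record.
COFF = """<coff>
  <record><App>IMP</App><Account>343818</Account><Balance>8600.76</Balance></record>
  <record><App>IMP</App><Account>565777</Account><Balance>9905.50</Balance></record>
</coff>"""

root = ET.fromstring(COFF)

# Select the record whose <Account> child has a given text value.
hit = root.find("./record[Account='565777']")
balance = hit.findtext("Balance")

# Select all records for one application code.
accounts = [r.findtext("Account") for r in root.findall("./record[App='IMP']")]

print(balance, accounts)
```

Queries that need full XPath 1.0 (functions, axes, numeric comparisons) would need a fuller XPath engine than the standard library provides.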