Multithreaded ingestion of BUFR messages from the IDD

1
Multithreaded ingestion of BUFR messages from the
IDD
  • John Caron
  • Oct 8, 2008

2
Overview
  • BUFR format
  • IDD HRS BUFR data stream
  • Multithreaded processing of IDD messages
  • Indexing data

3
BUFR data format
  • WMO standard for observational met data
  • circa 1988, one of the WMO Table Driven Code Forms (TDCF)
  • Improvement over character oriented codes (e.g.
    METARs)
  • Migration from previous forms is still a large WMO
    focus
  • Today: Edition 4 format, Version 13 of the tables
  • Table driven (12000 entries in global tables)
  • Each record contains a set of data descriptors
    (dds)
  • Global WMO and local tables
  • Simple Compressed binary
  • Packed bits, scale/offset convert to float
  • Fixed precision, no dynamic range
  • Difference from reference value
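
The packing rule in the bullets above can be sketched in Java. This is a minimal illustration, not the presenter's decoder: the class and method names are hypothetical, the rule used is the standard BUFR one (value = (packed integer + reference value) / 10^scale), and the all-ones "missing" convention is assumed to apply to the field being read.

```java
final class BufrUnpack {
    /** Reads nbits starting at bitOffset (most significant bit first) as an unsigned integer. */
    static long readBits(byte[] data, int bitOffset, int nbits) {
        long value = 0;
        for (int i = 0; i < nbits; i++) {
            int bit = bitOffset + i;
            int b = (data[bit >> 3] >> (7 - (bit & 7))) & 1;
            value = (value << 1) | b;
        }
        return value;
    }

    /** Assumption: a field with every bit set means "missing". */
    static boolean isMissing(long raw, int nbits) {
        return raw == (1L << nbits) - 1;
    }

    /** Applies the table entry: add the reference value, then divide by 10^scale. */
    static double decode(long raw, int scale, int refVal) {
        return (raw + refVal) / Math.pow(10, scale);
    }
}
```

For example, with the Latitude entry 0-5-2 (scale 2, reference value -9000, 15 bits), a packed value of 13002 decodes to (13002 - 9000) / 100 = 40.02 degrees. The precision is fixed at 0.01 degree by the scale, which is the "fixed precision, no dynamic range" point.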

4
3-1-32 tableD
3-1-1 tableD
0-1-1 WMO_block_number units=Numeric scale=0 refVal=0 nbits=7
0-1-2 WMO_station_number units=Numeric scale=0 refVal=0 nbits=10
0-2-1 Type_of_station units=Code table scale=0 refVal=0 nbits=2
3-1-11 tableD
0-4-1 Year units=Year scale=0 refVal=0 nbits=12
0-4-2 Month units=Month scale=0 refVal=0 nbits=4
0-4-3 Day units=Day scale=0 refVal=0 nbits=6
3-1-12 tableD
0-4-4 Hour units=Hour scale=0 refVal=0 nbits=5
0-4-5 Minute units=Minute scale=0 refVal=0 nbits=6
3-1-24 tableD
0-5-2 Latitude units=Degree scale=2 refVal=-9000 nbits=15
0-6-2 Longitude units=Degree scale=2 refVal=-18000 nbits=16
0-7-1 Height_of_station units=m scale=0 refVal=-400 nbits=15
0-1-18 Short_station_or_site_name units=CCITT IA5 nchars=5
0-2-3 Type_of_measuring_equipment_used units=Code table scale=0 refVal=0
2-1-132 tableC-operators
2-2-130 tableC-operators
0-2-121 Mean_frequency units=Hz scale=-8 refVal=0 nbits=7
2-2-0 tableC-operators
2-1-0 tableC-operators
0-8-21 Time_significance units=Code table scale=0 refVal=0 nbits=5
0-4-26 Time_period_or_displacement units=Second scale=0 refVal=-4096 nbits=13
1-9-0 replication
0-31-1 Delayed_descriptor_replication_factor units=Numeric scale=0 refVal=0
0-7-6 Height_above_station units=m scale=0 refVal=0 nbits=15
0-25-34 Wind_profiler_quality_control_test_results units=Flag table scale=0
0-11-1 Wind_direction units=Degree true scale=0 refVal=0 nbits=9
0-11-2 Wind_speed units=m s-1 scale=1 refVal=0 nbits=12
2-1-127 tableC-operators
0-11-50 Standard_deviation_of_horizontal_wind_speed units=m s-1 scale=1 refVal=0 nbits=12
2-1-0 tableC-operators
0-11-6 w-component units=m s-1 scale=2 refVal=-4096 nbits=13
0-11-51 Standard_deviation_of_vertical_wind_speed units=m s-1 scale=1 refVal=0 nbits=8

5
BUFR problems (1)
  • BUFR format is too complex:
  • Looks like design by committee
  • Specification not exact
  • No coding/decoding reference implementation
  • Mixture of data model / data encoding / standard
    quantities
  • BUFR format is too simple:
  • Fixed length tables (64 categories, 256 entries)
    eventually run out
  • Fixed dynamic range (no exponents)

6
BUFR problems (2)
  • Table-driven parsing is brittle
  • No authoritative registry of local Tables
  • WMO global table is not machine-readable
  • Past versions are not available
  • It seems that:
  • Each provider has their own set of software and
    tables
  • Often legacy Fortran

7
BUFR Table mismatch
  • No way to be sure if coder/decoder use the same
    table
  • If a table entry is missing, can't decode
  • If the wrong table entry is used:
  • Bit size wrong: usually can detect with bit
    counting
  • Scale/Factor/Name/Units wrong: silent failure
    (expert/human may detect)
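
The bit-counting check mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: the class, method names, and the `maxPadBits` parameter are hypothetical, the message is uncompressed with a fixed field list (no replication, no Table C operators), and `dataBytes` is the length of the packed data payload only.

```java
final class BitCountCheck {
    /** Bits needed for nObs uncompressed observations, given each field's width from the table. */
    static long expectedBits(int[] nbitsPerField, int nObs) {
        long perObs = 0;
        for (int nbits : nbitsPerField) {
            perObs += nbits;
        }
        return perObs * nObs;
    }

    /**
     * True if the table widths are consistent with the payload size: the
     * expected bits must fit, with fewer than maxPadBits of padding left over.
     */
    static boolean sizeMatches(int[] nbitsPerField, int nObs, int dataBytes, int maxPadBits) {
        long expected = expectedBits(nbitsPerField, nObs);
        long available = 8L * dataBytes;
        return expected <= available && available - expected < maxPadBits;
    }
}
```

With Year, Month, Day, Hour, and Minute at 12, 4, 6, 5, and 6 bits, ten observations need 330 bits, which fits a 42-byte payload with 6 bits of padding. A decoder whose table gives Day 8 bits would expect 350 bits and fail the check. A wrong scale or reference value leaves the bit count unchanged, which is why that case fails silently.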

8
Table mismatches
  • Each archive center probably has solved this
    coder/decoder matching internally
  • NCEP encodes the tables in BUFR messages, and
    stores in the archive files
  • Others???

9
BUFR progress
  • As of 9/2008, WMO decided:
  • Will make tables available in Microsoft Access
    format
  • Clarified versioning (sort of)
  • Progress in detecting/fixing encoding errors
  • Unidata nudge email group, validation web site
  • BritMet effort to map BUFR to ISO, define XML
    version of tables

10
BUFR data on IDD
  • 177 K messages / day
  • 6.7 M observations / day
  • 1.2 Gbytes / day
  • Avg message size: 7227 bytes
  • Avg obs/message: 37
  • Unique WMO headers: 555
  • Unique dds: 125
  • WMO headers with multiple dds: 61

11
Originating Stations
  • CWAO Montreal
  • EDZW Offenbach (RSMC) (78.0)
  • EGRR UK Meteorological Office Bracknell (RSMC)
    (74.0)
  • EKMI Copenhagen (94.0)
  • EUMG EUMETSAT Operation Centre (254.0)
  • EUSR
  • KBOU The NOAA Forecast Systems Laboratory (59.0)
  • KKCI US National Weather Service (NCEP) (7.0)
  • KNES US NOAA/NESDIS (160.0)
  • KWBC US National Weather Service (NCEP) (7.0)
  • KWNH US National Weather Service (NCEP)
  • KWNO NCEP / Central Operations (7.3)
  • LFPW Toulouse (RSMC) (85.0)
  • RJTD Tokyo (RSMC), Japan Meteorological Agency
    (34.0)
  • RKSL Seoul (40.0)
  • SBBR Brazilian Space Agency - INPE (46.0)
  • VHHH Hong-Kong (110.0)

12
Data heterogeneity
  • Each BUFR record could in principle have its own
    data schema: 2M database schemas!
  • In reality, there is a much smaller number of
    groups of homogeneous records
  • WMO headers are not sufficient
  • Can't use pqact FILE by matching the header
  • Only the dds itself is reliable
  • So must crack the message to reliably group the
    records
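
Grouping by dds rather than by WMO header can be sketched as below. The `Msg` class, the method names, and the header strings are hypothetical. The only BUFR fact relied on is that an F-X-Y descriptor packs into 16 bits (2 bits F, 6 bits X, 8 bits Y). The group key here is simply the ordered descriptor list. A production system might hash it instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DdsGrouping {
    /** Hypothetical cracked message: its WMO header plus its data descriptors. */
    static final class Msg {
        final String wmoHeader;
        final short[] dds;
        Msg(String wmoHeader, short[] dds) {
            this.wmoHeader = wmoHeader;
            this.dds = dds;
        }
    }

    /** Packs F-X-Y into the 16-bit descriptor form: 2 bits F, 6 bits X, 8 bits Y. */
    static short fxy(int f, int x, int y) {
        return (short) ((f << 14) | (x << 8) | y);
    }

    /** Two messages share a schema only if their descriptor lists are identical. */
    static String schemaKey(short[] dds) {
        return Arrays.toString(dds);
    }

    /** Groups messages by descriptor list, ignoring the WMO header entirely. */
    static Map<String, List<Msg>> groupBySchema(List<Msg> msgs) {
        Map<String, List<Msg>> groups = new LinkedHashMap<>();
        for (Msg m : msgs) {
            groups.computeIfAbsent(schemaKey(m.dds), k -> new ArrayList<>()).add(m);
        }
        return groups;
    }
}
```

Two messages with the same header but different dds land in different groups, and two messages with different headers but the same dds land in the same group. A header match alone cannot guarantee that.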

14
Multithreaded Processing of IDD Messages
15
Overview
  • Get messages from LDM pipe
  • Process in memory, write out to disk
  • Must be very fast, no blocking I/O
  • Use java.util.concurrent library for
    multithreading

16
LDM pqact
  • Get all BUFR messages from HRS
  • HRS IJ
  • PIPE -metadata java -jar ldm.jar

17
Step 1 and 2: Extract and dispatch
  • 1. extract: pipeReadingThread (1) (io)
  • Reads the LDM stream from the pipe
  • Breaks it into separate messages
  • Puts them on the Message Queue:
    ArrayBlockingQueue<MessageTask>
  • 2. dispatch: messageThread (1?) (cpu)
  • Blocking take from the Message Queue
  • Reads contents, classifies type by dds
  • Dispatches to the matching MessType processor

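
Steps 1 and 2 can be sketched as a producer/consumer pair around an `ArrayBlockingQueue`. This is an illustration, not the presenter's code: `MessageTask`, `split`, `run`, the `END` sentinel, and the queue capacity are assumptions, and the splitter assumes the BUFR edition 2 or later layout in which the 4-byte "BUFR" marker is followed by a 3-byte total message length. Classification by dds is replaced by a stand-in that records each message's length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class ExtractDispatch {
    /** Hypothetical task type: one raw BUFR message taken off the pipe. */
    static final class MessageTask {
        final byte[] bytes;
        MessageTask(byte[] bytes) { this.bytes = bytes; }
    }

    /** Sentinel telling the consumer that the stream has ended. */
    static final MessageTask END = new MessageTask(new byte[0]);

    /** Scans for "BUFR", reads the 24-bit total length, and cuts out each message. */
    static List<byte[]> split(byte[] stream) {
        List<byte[]> out = new ArrayList<>();
        int i = 0;
        while (i + 8 <= stream.length) {
            if (stream[i] == 'B' && stream[i + 1] == 'U' && stream[i + 2] == 'F' && stream[i + 3] == 'R') {
                int len = ((stream[i + 4] & 0xFF) << 16) | ((stream[i + 5] & 0xFF) << 8) | (stream[i + 6] & 0xFF);
                if (len >= 8 && i + len <= stream.length) {
                    out.add(Arrays.copyOfRange(stream, i, i + len));
                    i += len;
                    continue;
                }
            }
            i++; // not a message start: skip metadata or garbage
        }
        return out;
    }

    /** Runs both threads over an in-memory stream; returns the dispatched message lengths. */
    static List<Integer> run(byte[] stream) {
        BlockingQueue<MessageTask> queue = new ArrayBlockingQueue<>(1000);
        List<Integer> dispatched = new ArrayList<>(); // written only by messageThread

        Thread pipeReadingThread = new Thread(() -> {
            try {
                for (byte[] raw : split(stream)) {
                    queue.put(new MessageTask(raw)); // blocks while the queue is full
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "pipeReadingThread");

        Thread messageThread = new Thread(() -> {
            try {
                for (MessageTask t = queue.take(); t != END; t = queue.take()) { // blocking take
                    dispatched.add(t.bytes.length); // stand-in for "classify by dds, dispatch"
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "messageThread");

        pipeReadingThread.start();
        messageThread.start();
        try {
            pipeReadingThread.join();
            messageThread.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return dispatched;
    }
}
```

The bounded queue gives back-pressure: if the cpu-bound dispatcher falls behind, `put` blocks the reader instead of letting memory grow without limit.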
18
Step 3: Write message
  • 3. write: messageThread (1) (cpu) dispatches to a
    MessType processor, which dispatches to a
    MessageWriter
  • MessageWriter implements Callable<Result>
  • Holds a ConcurrentLinkedQueue<Message>
  • Owns a file, e.g. 2008-09-11.bufr
  • Result call(): write message(s)
  • MessageWriter is submitted to an Executor:
    CompletionService<Result>
  • Runs on threadPool (n) (io)

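
A minimal sketch of the write step, assuming the shapes named on the slide (`MessageWriter implements Callable<Result>`, a `ConcurrentLinkedQueue` of pending messages, submission to a `CompletionService`). The `Result` fields, the `add` and `demo` methods, and the pool size are hypothetical, and a `ByteArrayOutputStream` stands in for the file the writer would own.

```java
import java.io.ByteArrayOutputStream;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class WriteStep {
    /** Hypothetical result: what was written, and to which file. */
    static final class Result {
        final String fileName;
        final int messages;
        final int bytes;
        Result(String fileName, int messages, int bytes) {
            this.fileName = fileName;
            this.messages = messages;
            this.bytes = bytes;
        }
    }

    static final class MessageWriter implements Callable<Result> {
        private final String fileName;
        private final Queue<byte[]> pending = new ConcurrentLinkedQueue<>();
        // Stand-in for the owned file; a real writer would hold a file stream.
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();

        MessageWriter(String fileName) { this.fileName = fileName; }

        /** Called by the cpu-bound messageThread; never blocks. */
        void add(byte[] message) { pending.add(message); }

        /** Runs on the io thread pool: drains whatever is queued and writes it. */
        @Override
        public synchronized Result call() {
            int n = 0;
            for (byte[] m = pending.poll(); m != null; m = pending.poll()) {
                out.write(m, 0, m.length);
                n++;
            }
            return new Result(fileName, n, out.size());
        }
    }

    static Result demo() {
        ExecutorService threadPool = Executors.newFixedThreadPool(2);
        CompletionService<Result> completion = new ExecutorCompletionService<>(threadPool);
        try {
            MessageWriter writer = new MessageWriter("2008-09-11.bufr");
            writer.add(new byte[] {1, 2, 3});
            writer.add(new byte[] {4, 5});
            completion.submit(writer);
            return completion.take().get(); // wait for the write to finish
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            threadPool.shutdown();
        }
    }
}
```

Queuing is lock-free and the disk write happens on a pool thread, so the dispatching thread never waits on I/O. `call()` is synchronized here so that two submissions of the same writer cannot interleave their output.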
19
Step 4: Index
  • MessageWriter implements Callable<IndexerTask>
  • IndexerTask call(): write message(s), return an
    IndexerTask
  • Executor: Queue<Future<IndexerTask>>
  • indexThread (1?) (io): blocking take, add to Index

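
The hand-off from writers to the index thread can be sketched with an `ExecutorCompletionService`, whose `take()` is the blocking take over completed futures. The `IndexerTask` fields, the key strings, and the `TreeMap` standing in for the index are all assumptions. The write itself is reduced to returning the task.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class IndexStep {
    /** Hypothetical: where a written message ended up. */
    static final class IndexerTask {
        final String key;
        final String fileName;
        final long offset;
        IndexerTask(String key, String fileName, long offset) {
            this.key = key;
            this.fileName = fileName;
            this.offset = offset;
        }
    }

    static Map<String, Long> demo() {
        ExecutorService threadPool = Executors.newFixedThreadPool(2);
        CompletionService<IndexerTask> completion = new ExecutorCompletionService<>(threadPool);
        Map<String, Long> index = new TreeMap<>(); // stand-in index, touched only by indexThread
        int submitted = 3;

        for (int i = 0; i < submitted; i++) {
            final String key = "station-" + i;
            final long offset = i * 100L;
            // Stand-in for a writer's call(): "write", then report where the data went.
            completion.submit(() -> new IndexerTask(key, "2008-09-11.bufr", offset));
        }

        Thread indexThread = new Thread(() -> {
            try {
                for (int i = 0; i < submitted; i++) {
                    IndexerTask t = completion.take().get(); // blocks until some write completes
                    index.put(t.key, t.offset);
                }
            } catch (InterruptedException | ExecutionException e) {
                Thread.currentThread().interrupt();
            }
        }, "indexThread");

        indexThread.start();
        try {
            indexThread.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            threadPool.shutdown();
        }
        return index;
    }
}
```

Because only one thread updates the index, the index structure itself needs no locking, and an entry is added only after its write has completed.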
20
Step 5: Cleanup
  • messageThread (1) (cpu) dispatches to a MessType
    processor, which dispatches to a MessageWriter
  • MessageWriter implements Callable<Result>, holds a
    ConcurrentLinkedQueue<Message>, owns file
    2008-09-11.bufr
  • MessageWriter is submitted to an Executor:
    CompletionService<Result>
  • cleanupThread (1) (io): close files
  • Concurrent hashMap?

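
One way the cleanup step might use a concurrent hash map is sketched below. The slide leaves this as an open question, so everything here is an assumption: the `OpenWriter` class, the idle-time policy, and the method names. Time is passed in explicitly to keep the example deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class WriterCleanup {
    /** Hypothetical open writer; a real one would flush and close its file. */
    static final class OpenWriter {
        volatile long lastUsedMillis;
        volatile boolean closed;
        void close() { closed = true; }
    }

    private final ConcurrentHashMap<String, OpenWriter> writers = new ConcurrentHashMap<>();

    /** Called by messageThread: finds or creates the writer that owns fileName. */
    OpenWriter writerFor(String fileName, long nowMillis) {
        OpenWriter w = writers.computeIfAbsent(fileName, k -> new OpenWriter());
        w.lastUsedMillis = nowMillis;
        return w;
    }

    /** Called by cleanupThread: closes writers idle for at least maxIdleMillis. */
    int closeIdle(long nowMillis, long maxIdleMillis) {
        int closedCount = 0;
        for (Map.Entry<String, OpenWriter> e : writers.entrySet()) {
            OpenWriter w = e.getValue();
            boolean idle = nowMillis - w.lastUsedMillis >= maxIdleMillis;
            // remove(key, value) succeeds only if this exact writer is still mapped
            if (idle && writers.remove(e.getKey(), w)) {
                w.close();
                closedCount++;
            }
        }
        return closedCount;
    }

    int openCount() { return writers.size(); }
}
```

This sketch does not close the race in which messageThread obtains a writer just before the cleanup thread removes it. A real implementation would need the writer to reject or re-route messages that arrive after it has been closed.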
21
Step 6: Scour
  • scourThread (1) (io): remove from Index, delete
    file
  • Executor: Queue<Future<IndexerTask>>
  • indexThread (1?) (io): blocking take, add to Index

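
Scouring whole files by date can be sketched as below, assuming files are named by day as in the 2008-09-11.bufr example. The `scour` and `demoDir` methods, the `keepDays` parameter, and the `Set<String>` standing in for the index are hypothetical. Index entries are removed before the file is deleted so the index never points at a missing file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class Scour {
    /** Deletes yyyy-MM-dd.bufr files dated before (today - keepDays); returns their names. */
    static List<String> scour(Path dir, Set<String> index, LocalDate today, int keepDays) {
        LocalDate cutoff = today.minusDays(keepDays);
        List<String> removed = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.bufr")) {
            for (Path f : files) {
                String name = f.getFileName().toString();
                LocalDate day;
                try {
                    day = LocalDate.parse(name.substring(0, name.length() - ".bufr".length()));
                } catch (DateTimeParseException e) {
                    continue; // not a dated data file: leave it alone
                }
                if (day.isBefore(cutoff)) {
                    index.remove(name); // remove from index first
                    Files.delete(f);    // then delete the whole file
                    removed.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(removed);
        return removed;
    }

    /** Test helper: creates a temporary directory holding empty files with the given names. */
    static Path demoDir(String... names) {
        try {
            Path dir = Files.createTempDirectory("bufr-scour");
            for (String n : names) {
                Files.createFile(dir.resolve(n));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```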
22
Why isn't scouring part of LDM?
  • LDM is message oriented; it doesn't know the contents
  • Decoders know about the contents of the messages
  • Put scouring into the decoders

23
Threads
  1. Read from LDM pipe
  2. Read message content and dispatch
  3. Write Messages to files
  4. Index
  5. Cleanup / close MessageWriters
  6. Scour

24
(Thought) Experiments with Indexing
25
Design prejudices
  • Keep data in original format
  • Data reliability
  • Aggregate homogeneous data into files
  • Data locality
  • Create external indices, with pointers into the
    files
  • Data recovery
  • Scour entire files, not parts of a file

26
Indexing
  • Need 1D indexes (B-trees)
  • Want 2D indices for spatial data
  • Rtree (areas)
  • Quadtree (points)
  • Index selectivity: seek vs. scan
  • Sequential access 100x faster than random access
  • Index must select < 1% of the data to be useful
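
The selectivity threshold follows from the 100x figure by simple arithmetic. As a rough model (an assumption, ignoring the cost of reading the index itself), let N be the number of records, t_s the sequential read time per record, t_r = 100 t_s the random read time per record, and f the fraction of records the index selects:

```latex
\begin{aligned}
T_{\text{scan}}  &= N\,t_s \\
T_{\text{index}} &\approx f\,N\,t_r = 100\,f\,N\,t_s \\
T_{\text{index}} < T_{\text{scan}} &\iff f < \tfrac{1}{100} = 1\%
\end{aligned}
```

An index that selects more than about 1 in 100 records is slower than reading the whole file sequentially.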

27
Possible Open Source Indexers
  • Berkeley DB Java edition
  • Btree, very fast, no SQL
  • Dual GPL/commercial license
  • Relational databases: SQL on Btrees
  • Java (Derby, H2, many others)
  • C (MySQL, Postgres)
  • Object databases
  • Db4o (dual GPL/commercial license)

28
High performance
  • Embeddable in the decoder
  • Same process space
  • Not client/server
  • Access from server answering queries
  • Multiprocess access or client/server
  • Bdb must sync periodically (perf?)
  • Transactions probably too slow
  • Need recovery strategy

29
Test Assumptions
  • Process IDD messages in memory (vs) write to file
    then postprocess
  • Store in files add external indexing (vs) store
    data in database
  • One database vs many?
  • Embedded vs client/server
  • SQL vs specific queries
  • SQL allows ad-hoc queries - performance?
  • 2D indexing

30
Conclusions
  • Test/time various indexing strategies and
    technologies
  • Production
  • scouring
  • Eventually part of IDD/TDS
  • Must be easy to maintain (Java)
  • Scale to large archives / data volumes