Title: Multithreaded ingestion of BUFR messages from the IDD
1 Multithreaded ingestion of BUFR messages from the IDD
2 Overview
- BUFR format
- IDD HRS BUFR data stream
- Multithreaded processing of IDD messages
- Indexing data
3 BUFR data format
- WMO standard for observational met data
- circa 1988 Table Driven Forms (TDF)
- Improvement over character oriented codes (e.g. METARs)
- Migration from previous forms still large WMO focus
- Today Edition 4 format, Version 13 of the tables
- Table driven (12000 entries in global tables)
- Each record contains a set of data descriptors (dds)
- Global WMO and local tables
- Simple Compressed binary
- Packed bits, scale/offset convert to float
- Fixed precision, no dynamic range
- Difference from reference value
4
3-1-32 tableD
3-1-1 tableD
0-1-1 WMO_block_number units=Numeric scale=0 refVal=0 nbits=7
0-1-2 WMO_station_number units=Numeric scale=0 refVal=0 nbits=10
0-2-1 Type_of_station units=Code table scale=0 refVal=0 nbits=2
3-1-11 tableD
0-4-1 Year units=Year scale=0 refVal=0 nbits=12
0-4-2 Month units=Month scale=0 refVal=0 nbits=4
0-4-3 Day units=Day scale=0 refVal=0 nbits=6
3-1-12 tableD
0-4-4 Hour units=Hour scale=0 refVal=0 nbits=5
0-4-5 Minute units=Minute scale=0 refVal=0 nbits=6
3-1-24 tableD
0-5-2 Latitude units=Degree scale=2 refVal=-9000 nbits=15
0-6-2 Longitude units=Degree scale=2 refVal=-18000 nbits=16
0-7-1 Height_of_station units=m scale=0 refVal=-400 nbits=15
0-1-18 Short_station_or_site_name units=CCITT IA5 nchars=5
0-2-3 Type_of_measuring_equipment_used units=Code table scale=0 refVal=0
2-1-132 tableC-operators
2-2-130 tableC-operators
0-2-121 Mean_frequency units=Hz scale=-8 refVal=0 nbits=7
2-2-0 tableC-operators
2-1-0 tableC-operators
0-8-21 Time_significance units=Code table scale=0 refVal=0 nbits=5
0-4-26 Time_period_or_displacement units=Second scale=0 refVal=-4096 nbits=13
1-9-0 replication
0-31-1 Delayed_descriptor_replication_factor units=Numeric scale=0 refVal=0
0-7-6 Height_above_station units=m scale=0 refVal=0 nbits=15
0-25-34 Wind_profiler_quality_control_test_results units=Flag table scale=0
0-11-1 Wind_direction units=Degree true scale=0 refVal=0 nbits=9
0-11-2 Wind_speed units=m s-1 scale=1 refVal=0 nbits=12
2-1-127 tableC-operators
0-11-50 Standard_deviation_of_horizontal_wind_speed units=m s-1 scale=1 refVal=0 nbits=12
2-1-0 tableC-operators
0-11-6 w-component units=m s-1 scale=2 refVal=-4096 nbits=13
0-11-51 Standard_deviation_of_vertical_wind_speed units=m s-1 scale=1 refVal=0 nbits=8
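To make the "packed bits, scale/offset" idea concrete, here is a minimal Java sketch of decoding one numeric field. It uses the Latitude entry from the listing (scale 2, refVal -9000, nbits 15). The class and helper names (BufrValueSketch, BitReader) are made up for illustration and are not part of any existing decoder. It covers only plain numeric elements, not the tableC operators or replication.

```java
/** Minimal sketch: read an unsigned bit field, then apply BUFR scale and reference value. */
final class BufrValueSketch {

    /** Hypothetical helper: reads bit fields MSB-first from a byte array. */
    static final class BitReader {
        private final byte[] data;
        private int bitPos = 0;

        BitReader(byte[] data) { this.data = data; }

        /** Reads nbits (1..32) as an unsigned integer. */
        long read(int nbits) {
            long value = 0;
            for (int i = 0; i < nbits; i++) {
                int octet = data[bitPos >> 3] & 0xFF;
                int bit = (octet >> (7 - (bitPos & 7))) & 1;
                value = (value << 1) | bit;
                bitPos++;
            }
            return value;
        }
    }

    /** All bits set is the usual BUFR convention for a missing value. */
    static boolean isMissing(long raw, int nbits) {
        return raw == (1L << nbits) - 1;
    }

    /** value = (raw + refVal) / 10^scale */
    static double decode(long raw, int scale, int refVal) {
        return (raw + refVal) / Math.pow(10, scale);
    }

    public static void main(String[] args) {
        // A 15-bit latitude field holding raw = 13000, followed by one pad bit.
        BitReader reader = new BitReader(new byte[] {0x65, (byte) 0x90});
        long raw = reader.read(15);
        System.out.println(raw + " -> " + decode(raw, 2, -9000)); // 13000 -> 40.0
    }
}
```

Because the scale is a fixed power of ten per table entry, precision is fixed at 0.01 degree here, which is the "no dynamic range" point made on the format slide.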
5 BUFR problems (1)
- BUFR format is too complex
- Looks like design by committee
- Specification not exact
- No coding/decoding reference implementation
- Mixture of data model / data encoding / standard quantities
- BUFR format is too simple
- Fixed length tables (64 categories, 256 entries) eventually run out
- Fixed dynamic range (no exponents)
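The "64 categories, 256 entries" limit follows from how a descriptor F-X-Y is stored: 2 bits for F, 6 bits for X (the class) and 8 bits for Y (the entry), 16 bits in all. A small Java sketch of that packing is shown here; the class name DescriptorSketch is invented for illustration.

```java
/** Minimal sketch: pack and unpack a BUFR descriptor F-X-Y in 16 bits. */
final class DescriptorSketch {

    /** F uses 2 bits, X uses 6 bits (64 classes), Y uses 8 bits (256 entries). */
    static int pack(int f, int x, int y) {
        if (f < 0 || f > 3 || x < 0 || x > 63 || y < 0 || y > 255) {
            throw new IllegalArgumentException("out of range: " + f + "-" + x + "-" + y);
        }
        return (f << 14) | (x << 8) | y;
    }

    /** Returns the descriptor in the F-X-Y notation used in the listing. */
    static String unpack(int fxy) {
        int f = (fxy >> 14) & 0x3;
        int x = (fxy >> 8) & 0x3F;
        int y = fxy & 0xFF;
        return f + "-" + x + "-" + y;
    }

    public static void main(String[] args) {
        int packed = pack(3, 1, 32);
        System.out.println(packed + " = " + unpack(packed)); // 49440 = 3-1-32
    }
}
```

There is no room to grow a class beyond 256 entries without changing the descriptor width, which is why the tables eventually run out.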
6 BUFR problems (2)
- Table-driven parsing is brittle
- No authoritative registry of local Tables
- WMO global table is not machine-readable
- Past versions are not available
- It seems that
- Each provider has their own set of software and tables
- Often legacy Fortran
7 BUFR Table mismatch
- No way to be sure if coder/decoder use the same table
- If table entry missing, can't decode
- If wrong table entry is used
- Bit size wrong, usually can detect with bit counting
- Scale/Factor/Name/Units wrong: silent failure (expert/human may detect)
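The bit counting check can be sketched as follows: add up the bit widths the local table claims for one observation, multiply by the number of observations, and compare with the size of the data payload. This is a rough Java illustration for uncompressed messages only. The class name, the method names and the padding tolerance (maxPadBits) are assumptions, not taken from any decoder.

```java
/** Minimal sketch: does the table's bit count agree with the data section size? */
final class BitCountSketch {

    /** Total bits the descriptors would consume for nObs uncompressed observations. */
    static long bitsNeeded(int[] nbitsPerField, int nObs) {
        long perObs = 0;
        for (int n : nbitsPerField) {
            perObs += n;
        }
        return perObs * nObs;
    }

    /**
     * True if the payload is large enough and the unused tail is no more than
     * maxPadBits (an assumed tolerance for padding at the end of the section).
     */
    static boolean plausible(int[] nbitsPerField, int nObs, int payloadBytes, int maxPadBits) {
        long needed = bitsNeeded(nbitsPerField, nObs);
        long available = 8L * payloadBytes;
        return needed <= available && available - needed <= maxPadBits;
    }

    public static void main(String[] args) {
        int[] rightTable = {7, 10, 2};  // widths from 0-1-1, 0-1-2, 0-2-1
        int[] wrongTable = {7, 10, 5};  // same fields, one width wrong
        System.out.println(plausible(rightTable, 3, 8, 15)); // true
        System.out.println(plausible(wrongTable, 3, 8, 15)); // false
    }
}
```

A check like this catches a wrong bit width but, as the slide notes, says nothing when only the scale, name or units differ.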
8 Table mismatches
- Each archive center probably has solved this coder/decoder matching internally
- NCEP encodes the tables in BUFR messages, and stores in the archive files
- Others???
9 BUFR progress
- As of 9/2008, WMO decided
- Will make tables available in Microsoft Access format
- Clarified versioning (sort of)
- Progress in detecting/fixing encoding errors
- Unidata nudge email group, validation web site
- BritMet effort to map BUFR to ISO, define XML version of tables
10 BUFR data on IDD
- 177 K messages / day
- 6.7 M observations / day
- 1.2 Gbytes / day
- Avg message size 7227 bytes
- Avg obs/message 37
- Unique WMO headers 555
- Unique dds 125
- WMO headers with multiple dds 61
11 Originating Stations
- CWAO Montreal
- EDZW Offenbach (RSMC) (78.0)
- EGRR UK Meteorological Office Bracknell (RSMC) (74.0)
- EKMI Copenhagen (94.0)
- EUMG EUMETSAT Operation Centre (254.0)
- EUSR
- KBOU The NOAA Forecast Systems Laboratory (59.0)
- KKCI US National Weather Service (NCEP) (7.0)
- KNES US NOAA/NESDIS (160.0)
- KWBC US National Weather Service (NCEP) (7.0)
- KWNH US National Weather Service (NCEP)
- KWNO NCEP / Central Operations (7.3)
- LFPW Toulouse (RSMC) (85.0)
- RJTD Tokyo (RSMC), Japan Meteorological Agency (34.0)
- RKSL Seoul (40.0)
- SBBR Brazilian Space Agency - INPE (46.0)
- VHHH Hong-Kong (110.0)
12 Data heterogeneity
- Each BUFR record in principle could have its own data schema: 2M database schemas!
- In reality, there is a much smaller number of groups of homogeneous records
- WMO headers are not sufficient
- Can't use pqact FILE by matching the header
- Only the dds itself is reliable
- So must crack the message to reliably group the records
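Grouping by dds rather than by header can be sketched in Java as below. It assumes each cracked message exposes its descriptor list as packed integers. The Message class, the header strings and the descriptor values are placeholders invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch: group messages by their descriptor list, not by WMO header. */
final class DdsGroupingSketch {

    /** Hypothetical cracked message: header plus its list of packed descriptors. */
    static final class Message {
        final String wmoHeader;
        final List<Integer> dds;

        Message(String wmoHeader, List<Integer> dds) {
            this.wmoHeader = wmoHeader;
            this.dds = dds;
        }
    }

    /** List equality is element-wise, so identical dds lists land in the same group. */
    static Map<List<Integer>, List<Message>> groupByDds(List<Message> messages) {
        Map<List<Integer>, List<Message>> groups = new LinkedHashMap<>();
        for (Message m : messages) {
            groups.computeIfAbsent(m.dds, k -> new ArrayList<>()).add(m);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Message> messages = Arrays.asList(
            new Message("HDR_A", Arrays.asList(49440, 1282)),
            new Message("HDR_A", Arrays.asList(49441)),        // same header, different dds
            new Message("HDR_B", Arrays.asList(49440, 1282))); // different header, same dds
        System.out.println(groupByDds(messages).size()); // 2
    }
}
```

In practice a hash of the descriptor list would probably serve as the group key and file name prefix, but that choice is not specified in the slides.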
14 Multithreaded Processing of IDD Messages
15 Overview
- Get messages from LDM pipe
- Process in memory, write out to disk
- Must be very fast, no blocking I/O
- Use java.util.concurrent library for multithreading
16 LDM pqact
- Get all BUFR messages from HRS
- HRS IJ
- PIPE -metadata java -jar ldm.jar
17 Step 1 and 2: Extract and dispatch
- LDM stream -> pipe
- 1. extract: pipeReadingThread (1) (io) breaks the stream into separate messages
- Message Queue: ArrayBlockingQueue<MessageTask>
- 2. dispatch: messageThread (1?) (cpu) does a blocking take, reads contents, classifies type by dds
- Dispatches to one MessType processor per message type
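Steps 1 and 2 amount to a producer/consumer pair around a bounded queue. The Java sketch below shows that shape with java.util.concurrent. The thread names follow the slide, while MessageTask's contents, the queue capacity of 100 and the end-of-stream marker (POISON) are assumptions. Classification by dds is replaced by a stand-in that records the message length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Minimal sketch: one thread extracts messages, another takes and dispatches them. */
final class ExtractDispatchSketch {

    /** Hypothetical task: the raw bytes of one message. */
    static final class MessageTask {
        final byte[] raw;
        MessageTask(byte[] raw) { this.raw = raw; }
    }

    private static final MessageTask POISON = new MessageTask(new byte[0]); // end-of-stream marker

    static List<Integer> run(List<byte[]> incoming) {
        BlockingQueue<MessageTask> queue = new ArrayBlockingQueue<>(100);
        List<Integer> dispatched = new ArrayList<>();

        Thread pipeReadingThread = new Thread(() -> {
            try {
                for (byte[] raw : incoming) {
                    queue.put(new MessageTask(raw)); // blocks only if the queue is full
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "pipeReadingThread");

        Thread messageThread = new Thread(() -> {
            try {
                while (true) {
                    MessageTask task = queue.take(); // blocking take
                    if (task == POISON) break;
                    dispatched.add(task.raw.length); // stand-in for "classify type by dds"
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "messageThread");

        pipeReadingThread.start();
        messageThread.start();
        try {
            pipeReadingThread.join();
            messageThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return dispatched;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(new byte[3], new byte[5]))); // [3, 5]
    }
}
```

The bounded queue gives backpressure: if the dispatcher falls behind, the reader blocks on put() instead of growing memory without limit. Whether that is acceptable for the LDM pipe is a tuning question the slides leave open.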
18 Step 3: Write message
- messageThread (1) (cpu): dispatch to MessType processor
- MessageWriter implements Callable<Result>
- ConcurrentLinkedQueue<Message>
- Owns file, e.g. 2008-09-11.bufr
- Result call(): write message(s)
- submit to Executor CompletionService<Result>
- 3. write: threadPool (n) (io)
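A rough Java sketch of step 3: a MessageWriter owns one file and a ConcurrentLinkedQueue of pending messages, and is submitted to a thread pool through a CompletionService. The fields of Result, the pool size of 2 and the use of a temporary file are assumptions made so the example is self-contained.

```java
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Minimal sketch: a writer task that drains its queue into the file it owns. */
final class WriteStepSketch {

    /** Hypothetical result: how many messages one call() wrote. */
    static final class Result {
        final int messagesWritten;
        Result(int messagesWritten) { this.messagesWritten = messagesWritten; }
    }

    static final class MessageWriter implements Callable<Result> {
        private final Path file;
        private final Queue<byte[]> pending = new ConcurrentLinkedQueue<>();

        MessageWriter(Path file) { this.file = file; }

        /** Called by the dispatching thread; never blocks. */
        void add(byte[] message) { pending.add(message); }

        /** Runs on an I/O pool thread: append everything queued so far. */
        @Override
        public Result call() throws Exception {
            int n = 0;
            try (OutputStream out = Files.newOutputStream(
                    file, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                byte[] m;
                while ((m = pending.poll()) != null) {
                    out.write(m);
                    n++;
                }
            }
            return new Result(n);
        }
    }

    /** Writes two small messages to a temp file; returns {messagesWritten, fileSize}. */
    static long[] demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Path file = Files.createTempFile("sketch-", ".bufr");
            MessageWriter writer = new MessageWriter(file);
            writer.add(new byte[] {1, 2, 3});
            writer.add(new byte[] {4, 5});
            CompletionService<Result> completion = new ExecutorCompletionService<>(pool);
            completion.submit(writer);
            Result result = completion.take().get();
            long size = Files.size(file);
            Files.delete(file);
            return new long[] {result.messagesWritten, size};
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] r = demo();
        System.out.println(r[0] + " messages, " + r[1] + " bytes"); // 2 messages, 5 bytes
    }
}
```

One caveat with this shape: if the same writer is submitted again while a previous call() is still running, two pool threads would append to one file at once, so a real design needs to allow only one in-flight task per writer.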
19 Step 4: Index
- MessageWriter implements Callable<IndexerTask>
- IndexerTask call(): write message(s), return IndexerTask
- Executor Queue<Future<IndexerTask>>
- indexThread (1?) (io): blocking take, add to Index
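Step 4 can be sketched as a queue of futures: each write returns an IndexerTask, and a single index thread takes futures in submission order and adds the results to the index. In this Java sketch the IndexerTask fields (file name and byte offset), the pool size and the in-memory list standing in for the index are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

/** Minimal sketch: writers return IndexerTasks; one thread consumes them in order. */
final class IndexStepSketch {

    /** Hypothetical index entry: which file a message went to, and where. */
    static final class IndexerTask {
        final String file;
        final long offset;
        IndexerTask(String file, long offset) { this.file = file; this.offset = offset; }
    }

    static List<String> run(int nMessages) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        BlockingQueue<Future<IndexerTask>> pending = new LinkedBlockingQueue<>();
        List<String> index = new ArrayList<>(); // stand-in for a real index

        Thread indexThread = new Thread(() -> {
            try {
                for (int i = 0; i < nMessages; i++) {
                    IndexerTask task = pending.take().get(); // blocking take, then wait for the write
                    index.add(task.file + "@" + task.offset);
                }
            } catch (InterruptedException | ExecutionException e) {
                Thread.currentThread().interrupt();
            }
        }, "indexThread");
        indexThread.start();

        try {
            for (int i = 0; i < nMessages; i++) {
                final long offset = 100L * i;
                // Stand-in for MessageWriter.call(): pretend the message was written at this offset.
                Callable<IndexerTask> writer = () -> new IndexerTask("2008-09-11.bufr", offset);
                pending.put(pool.submit(writer));
            }
            indexThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(run(3));
    }
}
```

Queueing the futures, rather than the finished results, keeps index entries in submission order even when the writes finish out of order.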
20 Step 5: cleanup
- messageThread (1) (cpu): dispatch to MessType processor
- MessageWriter implements Callable<Result>, ConcurrentLinkedQueue<Message>, owns file 2008-09-11.bufr
- submit to Executor CompletionService<Result>
- cleanupThread (1) (io): close files
- Concurrent hashMap ?
21 Step 6: Scour
- scourThread (1) (io): remove from Index, delete file
- Executor Queue<Future<IndexerTask>>
- indexThread (1?) (io): blocking take, add to Index
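Since files are named by date (e.g. 2008-09-11.bufr), the scour step can pick whole files to drop by parsing the file name. The Java sketch below only selects the expired files; removing their index entries and deleting them is left out. The retention period (keepDays) and the class name are assumptions.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Minimal sketch: choose whole date-named files that are older than the retention period. */
final class ScourSketch {

    private static final String SUFFIX = ".bufr";

    static List<String> filesToScour(List<String> fileNames, LocalDate today, int keepDays) {
        LocalDate cutoff = today.minusDays(keepDays);
        List<String> expired = new ArrayList<>();
        for (String name : fileNames) {
            if (!name.endsWith(SUFFIX)) {
                continue; // not one of our data files
            }
            String datePart = name.substring(0, name.length() - SUFFIX.length());
            try {
                LocalDate fileDate = LocalDate.parse(datePart); // expects yyyy-MM-dd
                if (fileDate.isBefore(cutoff)) {
                    expired.add(name);
                }
            } catch (DateTimeParseException e) {
                // Name does not start with a date: leave the file alone.
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("2008-09-01.bufr", "2008-09-11.bufr", "notes.txt");
        System.out.println(filesToScour(names, LocalDate.of(2008, 9, 12), 7)); // [2008-09-01.bufr]
    }
}
```

A scour thread would remove the index entries for each returned file first and delete the file afterwards, so that the index never points into a file that is gone.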
22 Why isn't Scouring part of LDM?
- LDM is message oriented; doesn't know contents
- Decoders know about the contents of the messages
- Put scouring into the decoders
23 Threads
- Read from LDM pipe
- Read message content and dispatch
- Write Messages to files
- Index
- Cleanup / close MessageWriters
- Scour
24 (Thought) Experiments with Indexing
25 Design prejudices
- Keep data in original format
- Data reliability
- Aggregate homogeneous data into files
- Data locality
- Create external indices, with pointers into the files
- Data recovery
- Scour entire files, not parts of a file
26 Indexing
- Need 1D indexes (B-trees)
- Want 2D indices for spatial data
- Rtree (areas)
- Quadtree (points)
- Index selectivity: seek vs. scan
- Sequential access 100x faster than random access
- Index must select < 1% of data to be useful
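The 1% figure follows from the 100x ratio under a simple cost model. A full scan reads all N records sequentially at unit cost each. An index lookup reads a fraction s of them by random access at 100 units each. The index wins when s * N * 100 < N, that is when s < 1/100. A small Java sketch of that arithmetic follows; the cost model is an assumed simplification, not a measurement.

```java
/** Minimal sketch: break-even selectivity for index seeks versus a sequential scan. */
final class SelectivitySketch {

    /** Fraction of records below which random-access lookups beat a full scan. */
    static double breakEvenSelectivity(double seqToRandomSpeedRatio) {
        return 1.0 / seqToRandomSpeedRatio;
    }

    /** Cost model: the scan costs 1 per record, an index hit costs `ratio` per record. */
    static boolean indexIsFaster(double selectivity, double seqToRandomSpeedRatio) {
        return selectivity * seqToRandomSpeedRatio < 1.0;
    }

    public static void main(String[] args) {
        System.out.println(breakEvenSelectivity(100)); // 0.01
        System.out.println(indexIsFaster(0.005, 100)); // true
        System.out.println(indexIsFaster(0.02, 100));  // false
    }
}
```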
27 Possible Open Source Indexers
- Berkeley DB Java edition
- Btree, very fast, no SQL
- Dual GPL/commercial license
- Relational databases: SQL on Btrees
- Java (Derby, H2, many others)
- C (MySQL, Postgres)
- Object databases
- Db4o (dual GPL/commercial license)
28 High performance
- Embeddable in the decoder
- Same process space
- Not client/server
- Access from server answering queries
- Multiprocess access or client/server
- Bdb must sync periodically (perf?)
- Transactions probably too slow
- Need recovery strategy
29 Test Assumptions
- Process IDD messages in memory (vs) write to file then postprocess
- Store in files, add external indexing (vs) store data in database
- One database vs many?
- Embedded vs client/server
- SQL vs specific queries
- SQL allows ad-hoc queries - performance?
- 2D indexing
30 Conclusions
- Test/time various indexing strategies and technologies
- Production
- scouring
- Eventually part of IDD/TDS
- Must be easy to maintain (Java)
- Scale to large archives / data volumes