Multithreaded ingestion of BUFR messages from the IDD

About This Presentation

Slides: 31

Provided by: car97

Learn more at: https://www.unidata.ucar.edu

Transcript and Presenter's Notes

1
Multithreaded ingestion of BUFR messages from the IDD

John Caron
Oct 8, 2008

2
Overview

BUFR format
IDD HRS BUFR data stream
Multithreaded processing of IDD messages
Indexing data

3
BUFR data format

WMO standard for observational met data
Circa 1988: Table Driven Forms (TDF)
Improvement over character-oriented codes (e.g. METARs)
Migration from previous forms is still a large WMO focus
Today: Edition 4 format, Version 13 of the tables
Table driven (12000 entries in global tables)
Each record contains a set of data descriptors (dds)
Global WMO and local tables
Simple compressed binary
Packed bits, scale/offset convert to float
Fixed precision, no dynamic range
Difference from reference value
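
The packing described above can be sketched in Java. A value is stored as an unsigned integer of nbits bits; the decoder adds the table's reference value and divides by 10 to the power of the scale. The class and method names are illustrative only and do not come from any BUFR library. The numbers in the usage note are the Latitude entry (scale 2, refVal -9000, nbits 15) from the descriptor listing on the next slide.

```java
/** Illustrative sketch of BUFR value unpacking; class and method names are hypothetical. */
final class BufrValue {
    private BufrValue() {}

    /** Convert a raw packed integer to a physical value: (raw + refVal) / 10^scale. */
    static double decode(long raw, int scale, int refVal) {
        return (raw + refVal) / Math.pow(10.0, scale);
    }

    /** Inverse of decode; rounds to the fixed precision implied by scale. */
    static long encode(double value, int scale, int refVal) {
        return Math.round(value * Math.pow(10.0, scale)) - refVal;
    }

    /** By BUFR convention a field with all bits set means "missing". */
    static boolean isMissing(long raw, int nbits) {
        return raw == (1L << nbits) - 1;
    }
}
```

With these parameters, BufrValue.decode(13000, 2, -9000) gives a latitude of 40.0 degrees. Because the scale is fixed per table entry, the precision (here 0.01 degree) is the same over the whole range, which is what "fixed precision, no dynamic range" refers to.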

4
3-1-32 tableD
3-1-1 tableD
0-1-1 WMO_block_number units=Numeric scale=0 refVal=0 nbits=7
0-1-2 WMO_station_number units=Numeric scale=0 refVal=0 nbits=10
0-2-1 Type_of_station units=Code table scale=0 refVal=0 nbits=2
3-1-11 tableD
0-4-1 Year units=Year scale=0 refVal=0 nbits=12
0-4-2 Month units=Month scale=0 refVal=0 nbits=4
0-4-3 Day units=Day scale=0 refVal=0 nbits=6
3-1-12 tableD
0-4-4 Hour units=Hour scale=0 refVal=0 nbits=5
0-4-5 Minute units=Minute scale=0 refVal=0 nbits=6
3-1-24 tableD
0-5-2 Latitude units=Degree scale=2 refVal=-9000 nbits=15
0-6-2 Longitude units=Degree scale=2 refVal=-18000 nbits=16
0-7-1 Height_of_station units=m scale=0 refVal=-400 nbits=15
0-1-18 Short_station_or_site_name units=CCITT IA5 nchars=5
0-2-3 Type_of_measuring_equipment_used units=Code table scale=0 refVal=0
2-1-132 tableC-operators
2-2-130 tableC-operators
0-2-121 Mean_frequency units=Hz scale=-8 refVal=0 nbits=7
2-2-0 tableC-operators
2-1-0 tableC-operators
0-8-21 Time_significance units=Code table scale=0 refVal=0 nbits=5
0-4-26 Time_period_or_displacement units=Second scale=0 refVal=-4096 nbits=13
1-9-0 replication
0-31-1 Delayed_descriptor_replication_factor units=Numeric scale=0 refVal=0
0-7-6 Height_above_station units=m scale=0 refVal=0 nbits=15
0-25-34 Wind_profiler_quality_control_test_results units=Flag table scale=0
0-11-1 Wind_direction units=Degree true scale=0 refVal=0 nbits=9
0-11-2 Wind_speed units=m s-1 scale=1 refVal=0 nbits=12
2-1-127 tableC-operators
0-11-50 Standard_deviation_of_horizontal_wind_speed units=m s-1 scale=1 refVal=0 nbits=12
2-1-0 tableC-operators
0-11-6 w-component units=m s-1 scale=2 refVal=-4096 nbits=13
0-11-51 Standard_deviation_of_vertical_wind_speed units=m s-1 scale=1 refVal=0 nbits=8
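
Each entry in the listing above is identified by an F-X-Y triple. In a BUFR message a descriptor occupies 16 bits: F in the top 2 bits, X in the next 6, Y in the low 8. The following is a minimal sketch of that packing; the class name and the kind() labels are illustrative choices, not part of any existing decoder.

```java
/** Illustrative sketch: split a 16-bit BUFR descriptor into F-X-Y. Names are hypothetical. */
final class Fxy {
    final int f;
    final int x;
    final int y;

    Fxy(int f, int x, int y) {
        this.f = f;
        this.x = x;
        this.y = y;
    }

    /** F is the top 2 bits, X the next 6 bits, Y the low 8 bits. */
    static Fxy unpack(int descriptor) {
        return new Fxy((descriptor >> 14) & 0x3, (descriptor >> 8) & 0x3F, descriptor & 0xFF);
    }

    int pack() {
        return (f << 14) | (x << 8) | y;
    }

    /** F selects the kind of descriptor, matching the labels in the listing. */
    String kind() {
        switch (f) {
            case 0: return "element (Table B)";
            case 1: return "replication";
            case 2: return "operator (Table C)";
            default: return "sequence (Table D)";
        }
    }

    @Override public String toString() {
        return f + "-" + x + "-" + y;
    }
}
```

Six bits for X and eight bits for Y are where the limits of 64 categories and 256 entries per category come from.
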
5
BUFR problems (1)

BUFR format is too complex
Looks like design by committee
Specification not exact
No coding/decoding reference implementation
Mixture of data model / data encoding / standard quantities
BUFR format is too simple
Fixed length tables (64 categories, 256 entries) eventually run out
Fixed dynamic range (no exponents)

6
BUFR problems (2)

Table-driven parsing is brittle
No authoritative registry of local tables
WMO global table is not machine-readable
Past versions are not available
It seems that each provider has its own set of software and tables, often legacy Fortran

7
BUFR Table mismatch

No way to be sure that coder and decoder use the same table
If a table entry is missing, can't decode
If the wrong table entry is used:
Bit size wrong: can usually detect with bit counting
Scale/factor/name/units wrong: silent failure (expert/human may detect)
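
Bit counting can be sketched as follows: add up the bit widths the decoder's tables assign to each field, multiply by the number of observations, and compare with the size of the data section. This is a simplified sketch that assumes uncompressed data with no delayed replication; the class name and the maxPadBits parameter are assumptions of the example, and a real decoder would have to walk the data to handle replication.

```java
/** Illustrative sketch of a bit-counting sanity check; names and the padding limit are assumptions. */
final class BitCountCheck {
    private BitCountCheck() {}

    /** Total bits needed for nObs uncompressed observations with fixed field widths. */
    static long expectedBits(int[] nbitsPerField, int nObs) {
        long perObs = 0;
        for (int nbits : nbitsPerField) {
            perObs += nbits;
        }
        return perObs * nObs;
    }

    /**
     * True if the data section is long enough for the expected bits and the
     * leftover padding is smaller than maxPadBits.
     */
    static boolean plausible(int[] nbitsPerField, int nObs, int dataBytes, int maxPadBits) {
        long expected = expectedBits(nbitsPerField, nObs);
        long available = 8L * dataBytes;
        return expected <= available && available - expected < maxPadBits;
    }
}
```

For example, with field widths 7, 10 and 2 bits, four observations need 76 bits and fit a 10-byte data section with 4 bits of padding. If the decoder's table says 12 bits where the coder used 10, the count becomes 84 bits and the check fails. A wrong scale or reference value leaves the bit count unchanged, which is why that case fails silently.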

8
Table mismatches

Each archive center has probably solved this coder/decoder matching internally
NCEP encodes the tables in BUFR messages and stores them in the archive files
Others???

9
BUFR progress

As of 9/2008, WMO decided:
Will make tables available in Microsoft Access format
Clarified versioning (sort of)
Progress in detecting/fixing encoding errors
Unidata nudge email group, validation web site
BritMet effort to map BUFR to ISO, define XML version of tables

10
BUFR data on IDD

177 K messages / day
6.7 M observations / day
1.2 Gbytes / day
Avg message size: 7227 bytes
Avg obs/message: 37
Unique WMO headers: 555
Unique dds: 125
WMO headers with multiple dds: 61

11
Originating Stations

CWAO Montreal
EDZW Offenbach (RSMC) (78.0)
EGRR UK Meteorological Office Bracknell (RSMC) (74.0)
EKMI Copenhagen (94.0)
EUMG EUMETSAT Operation Centre (254.0)
EUSR
KBOU The NOAA Forecast Systems Laboratory (59.0)
KKCI US National Weather Service (NCEP) (7.0)
KNES US NOAA/NESDIS (160.0)
KWBC US National Weather Service (NCEP) (7.0)
KWNH US National Weather Service (NCEP)
KWNO NCEP / Central Operations (7.3)
LFPW Toulouse (RSMC) (85.0)
RJTD Tokyo (RSMC), Japan Meteorological Agency (34.0)
RKSL Seoul (40.0)
SBBR Brazilian Space Agency - INPE (46.0)
VHHH Hong-Kong (110.0)

12
Data heterogeneity

Each BUFR record could in principle have its own data schema: 2M database schemas!
In reality, there is a much smaller number of groups of homogeneous records
WMO headers are not sufficient
Can't use pqact FILE by matching the header
Only the dds themselves are reliable
So must crack the message to reliably group the records
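
Grouping by dds rather than by header can be sketched with a map keyed on the ordered descriptor list. This is a minimal illustration, not the presenter's code: DdsGrouper and its methods are hypothetical names, descriptors are assumed to be available as packed integers after cracking the message, and the header strings in the usage note are placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: group messages by their descriptor list. Names are hypothetical. */
final class DdsGrouper {
    // key: the ordered descriptor list; value: the headers seen with that list
    private final Map<List<Integer>, List<String>> groups = new LinkedHashMap<>();

    private static List<Integer> keyOf(int[] dds) {
        List<Integer> key = new ArrayList<>();
        for (int d : dds) {
            key.add(d);
        }
        return key;
    }

    /** Record one message: its WMO header and the descriptors found inside it. */
    void add(String wmoHeader, int[] dds) {
        groups.computeIfAbsent(keyOf(dds), k -> new ArrayList<>()).add(wmoHeader);
    }

    int groupCount() {
        return groups.size();
    }

    List<String> headersFor(int... dds) {
        return groups.getOrDefault(keyOf(dds), new ArrayList<>());
    }
}
```

Adding two messages with header "HEADER-A" but different descriptor lists yields two groups, which is the situation a header-only pqact pattern cannot distinguish.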

13
(No Transcript)

14
Multithreaded Processing of IDD Messages

15
Overview

Get messages from LDM pipe
Process in memory, write out to disk
Must be very fast, no blocking I/O
Use java.util.concurrent library for multithreading

16
LDM pqact

Get all BUFR messages from HRS
HRS IJ
PIPE metadata java jar ldm.jar

17
Step 1 and 2: Extract and dispatch

LDM stream -> pipe
1. extract: pipeReadingThread (1) (io)
Break into separate messages
Message queue: ArrayBlockingQueue<MessageTask>
2. dispatch: messageThread (1?) (cpu)
Blocking take from the message queue
Read contents, classify type by dds
Dispatch to the matching MessType processor
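
Steps 1 and 2 can be sketched with an ArrayBlockingQueue between a reader thread and a message thread. This is an illustration under stated assumptions, not the presenter's implementation: the queue carries raw byte arrays instead of MessageTask objects, message boundaries are found by scanning for "BUFR" and reading the 3-byte total length that follows it in section 0, any LDM metadata in the pipe is ignored, and all class names plus the END marker are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Step 1 (io): reads the pipe and breaks the byte stream into separate BUFR messages. */
final class PipeReader implements Runnable {
    static final byte[] END = new byte[0]; // end-of-stream marker, an assumption of this sketch
    private static final byte[] MAGIC = {'B', 'U', 'F', 'R'};
    private final InputStream in;
    private final BlockingQueue<byte[]> queue;

    PipeReader(InputStream in, BlockingQueue<byte[]> queue) { this.in = in; this.queue = queue; }

    @Override public void run() {
        try {
            try {
                byte[] msg;
                while ((msg = nextMessage(in)) != null) queue.put(msg);
            } catch (IOException e) {
                // truncated or unreadable stream: stop reading
            } finally {
                queue.put(END);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Scan forward to "BUFR", then read the 3-byte total message length that follows it. */
    static byte[] nextMessage(InputStream in) throws IOException {
        int matched = 0;
        while (matched < MAGIC.length) {
            int b = in.read();
            if (b < 0) return null; // end of stream
            matched = (b == MAGIC[matched]) ? matched + 1 : (b == 'B' ? 1 : 0);
        }
        DataInputStream din = new DataInputStream(in);
        int len = (din.readUnsignedByte() << 16) | (din.readUnsignedByte() << 8) | din.readUnsignedByte();
        if (len < 8) throw new IOException("implausible BUFR length " + len);
        byte[] msg = new byte[len];
        System.arraycopy(MAGIC, 0, msg, 0, 4);
        msg[4] = (byte) (len >> 16);
        msg[5] = (byte) (len >> 8);
        msg[6] = (byte) len;
        din.readFully(msg, 7, len - 7);
        return msg;
    }
}

/** Step 2 (cpu): blocking take from the queue, then dispatch each message. */
final class MessageThread implements Runnable {
    private final BlockingQueue<byte[]> queue;
    private final Consumer<byte[]> dispatch; // stands in for "classify by dds, hand to a MessType processor"

    MessageThread(BlockingQueue<byte[]> queue, Consumer<byte[]> dispatch) { this.queue = queue; this.dispatch = dispatch; }

    @Override public void run() {
        try {
            for (byte[] msg = queue.take(); msg != PipeReader.END; msg = queue.take()) {
                dispatch.accept(msg);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

final class IngestDemo {
    /** Runs both threads over an in-memory stream and returns the length of each message found. */
    static List<Integer> messageLengths(byte[] stream) {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1000);
        List<Integer> lengths = Collections.synchronizedList(new ArrayList<>());
        Thread reader = new Thread(new PipeReader(new ByteArrayInputStream(stream), queue));
        Thread worker = new Thread(new MessageThread(queue, m -> lengths.add(m.length)));
        reader.start();
        worker.start();
        try {
            reader.join();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return lengths;
    }
}
```

The bounded queue decouples the two threads: the reader only does I/O, the message thread only does CPU work, and a slow consumer shows up as a full queue rather than as unbounded memory growth.
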
18
Step 3: Write message

messageThread (1) (cpu)
Dispatch to MessType processor
MessageWriter implements Callable<Result>
ConcurrentLinkedQueue<Message>
Owns file, e.g. 2008-09-11.bufr
Submit to Executor: CompletionService<Result>
3. write: threadPool (n) (io)
MessageWriter: Result call() writes message(s)
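
Step 3 can be sketched with java.util.concurrent's ExecutorCompletionService. The sketch below keeps the slide's MessageWriter name but simplifies the rest: the Result type is replaced by an Integer count, messages are plain byte arrays, output goes to a temporary directory, and the file names and pool size of 4 are arbitrary choices for the example. Each writer is submitted once, so no two pool threads append to the same file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Owns one output file and a queue of messages waiting to be appended to it. */
final class MessageWriter implements Callable<Integer> {
    private final Path file;
    private final Queue<byte[]> pending = new ConcurrentLinkedQueue<>();

    MessageWriter(Path file) {
        this.file = file;
    }

    /** Called by the message thread; never blocks. */
    void enqueue(byte[] message) {
        pending.add(message);
    }

    /** Runs on a pool thread: drains the queue to disk and returns the number of messages written. */
    @Override public Integer call() throws IOException {
        int written = 0;
        try (OutputStream out = Files.newOutputStream(file,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (byte[] m = pending.poll(); m != null; m = pending.poll()) {
                out.write(m);
                written++;
            }
        }
        return written;
    }
}

final class WriteDemo {
    /** Pushes messagesEach 4-byte dummy messages through each of nWriters writers; returns bytes on disk. */
    static long run(int nWriters, int messagesEach) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // threadPool (n) (io)
        CompletionService<Integer> completed = new ExecutorCompletionService<>(pool);
        try {
            Path dir = Files.createTempDirectory("bufr-write");
            for (int i = 0; i < nWriters; i++) {
                MessageWriter writer = new MessageWriter(dir.resolve("type" + i + ".bufr"));
                for (int j = 0; j < messagesEach; j++) {
                    writer.enqueue(new byte[] {'B', 'U', 'F', 'R'});
                }
                completed.submit(writer);
            }
            for (int i = 0; i < nWriters; i++) {
                completed.take().get(); // wait for any writer to finish; rethrows its failure
            }
            long bytes = 0;
            for (int i = 0; i < nWriters; i++) {
                bytes += Files.size(dir.resolve("type" + i + ".bufr"));
            }
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The message thread only touches the in-memory queue, so it never waits on disk; the blocking file I/O happens on the pool threads.
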
19
Step 4: Index

MessageWriter implements Callable<IndexerTask>
IndexerTask call(): write message(s), return IndexerTask
Executor: Queue<Future<IndexerTask>>
indexThread (1?) (io)
Blocking take
Add to index
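
Step 4 can be sketched as a single index thread that does a blocking take on a queue of futures. The sketch is illustrative only: the fields of IndexerTask, the use of a ConcurrentSkipListMap as a stand-in for a real index, the lambda that pretends to write a message, and the offsets and keys are all assumptions of the example.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

/** What a writer reports back: where a message landed and the key to index it under. */
final class IndexerTask {
    final String file;
    final long offset;
    final long timeKey; // hypothetical index key, e.g. an observation time

    IndexerTask(String file, long offset, long timeKey) {
        this.file = file;
        this.offset = offset;
        this.timeKey = timeKey;
    }
}

final class IndexDemo {
    /** Simulates nMessages writes and returns the index built by the index thread. */
    static Map<Long, String> run(int nMessages) {
        ExecutorService writers = Executors.newFixedThreadPool(2);
        BlockingQueue<Future<IndexerTask>> written = new LinkedBlockingQueue<>();
        Map<Long, String> index = new ConcurrentSkipListMap<>(); // in-memory stand-in for a real index

        Thread indexThread = new Thread(() -> {
            try {
                for (int i = 0; i < nMessages; i++) {
                    IndexerTask task = written.take().get(); // blocking take, then wait for the write
                    index.put(task.timeKey, task.file + "@" + task.offset);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            }
        });
        indexThread.start();
        try {
            for (int i = 0; i < nMessages; i++) {
                long n = i;
                // stands in for MessageWriter.call(): "write", then report where the message went
                Callable<IndexerTask> write = () -> new IndexerTask("2008-09-11.bufr", n * 100, 1000 + n);
                written.put(writers.submit(write));
            }
            indexThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            writers.shutdown();
        }
        return index;
    }
}
```

Because only one thread updates the index, the index itself does not need to support concurrent writers; the queue of futures serializes the updates.
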
20
Step 5: Cleanup

messageThread (1) (cpu)
Dispatch to MessType processor
MessageWriter implements Callable<Result>
ConcurrentLinkedQueue<Message>
Owns file 2008-09-11.bufr
Submit to Executor: CompletionService<Result>
cleanupThread (1) (io)
Close files
Concurrent hashMap ?
21
Step 6: Scour

scourThread (1) (io)
Remove from index
Delete file
Executor: Queue<Future<IndexerTask>>
indexThread (1?) (io)
Blocking take
Add to index
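
Scouring whole daily files can be sketched as below. The sketch assumes files are named by date as in the 2008-09-11.bufr example, and that index values have the hypothetical form "file@offset"; the Scourer class, the cutoff parameter and the demo data are invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of scouring whole daily files; names and index layout are assumptions. */
final class Scourer {
    private Scourer() {}

    /** Deletes daily files dated before cutoff and drops index entries that point into them. */
    static List<String> scour(Path dir, LocalDate cutoff, Map<Long, String> index) {
        List<String> deleted = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.bufr")) {
            for (Path f : files) {
                String name = f.getFileName().toString();
                LocalDate day;
                try {
                    day = LocalDate.parse(name.substring(0, name.length() - ".bufr".length()));
                } catch (DateTimeParseException e) {
                    continue; // not a daily file; leave it alone
                }
                if (day.isBefore(cutoff)) {
                    index.values().removeIf(v -> v.startsWith(name + "@")); // remove from index first
                    Files.delete(f);                                         // then delete the whole file
                    deleted.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(deleted);
        return deleted;
    }

    /** Small self-contained demonstration in a temporary directory. */
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("bufr-scour");
            Path old = Files.write(dir.resolve("2008-09-01.bufr"), new byte[] {1});
            Path recent = Files.write(dir.resolve("2008-09-11.bufr"), new byte[] {1});
            Map<Long, String> index = new HashMap<>();
            index.put(1L, "2008-09-01.bufr@0");
            index.put(2L, "2008-09-11.bufr@0");
            List<String> deleted = scour(dir, LocalDate.of(2008, 9, 8), index);
            return deleted + " " + Files.exists(old) + " " + Files.exists(recent) + " " + index.keySet();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Removing the index entries before deleting the file means a query never follows a pointer into a file that is already gone.
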
22
Why isn't scouring part of LDM?

LDM is message oriented; it doesn't know the contents
Decoders know about the contents of the messages
Put scouring into the decoders

23
Threads

Read from LDM pipe
Read message content and dispatch
Write Messages to files
Index
Cleanup / close MessageWriters
Scour

24
(Thought) Experiments with Indexing

25
Design prejudices

Keep data in original format
Data reliability
Aggregate homogeneous data into files
Data locality
Create external indices, with pointers into the files
Data recovery
Scour entire files, not parts of a file

26
Indexing

Need 1D indexes (B-trees)
Want 2D indices for spatial data
R-tree (areas)
Quadtree (points)
Index selectivity: seek vs. scan
Sequential access 100x faster than random access
Index must select < 1% of the data to be useful
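
The 1% figure follows from the 100x ratio. A simple cost model, sketched below, charges one random access per record selected through the index and one sequential access per record for a full scan; the index wins only when the selected fraction is below the ratio of sequential to random cost. The model and the names are assumptions of this example, not measurements.

```java
/** Illustrative cost model for "seek vs. scan"; names and unit costs are assumptions. */
final class IndexSelectivity {
    private IndexSelectivity() {}

    /** Cost of using the index: one random access per selected record. */
    static double indexCost(long nRecords, double selectivity, double randomCost) {
        return nRecords * selectivity * randomCost;
    }

    /** Cost of ignoring the index: read every record sequentially. */
    static double scanCost(long nRecords, double seqCost) {
        return nRecords * seqCost;
    }

    /** Selected fraction at which both strategies cost the same. */
    static double breakEven(double seqCost, double randomCost) {
        return seqCost / randomCost;
    }
}
```

With a sequential cost of 1 and a random cost of 100, the break-even fraction is 0.01. For 6.7 million observations, selecting 0.5% through the index costs half as much as a scan, while selecting 2% costs twice as much.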

27
Possible Open Source Indexers

Berkeley DB Java Edition
B-tree, very fast, no SQL
Dual GPL/commercial license
Relational databases: SQL on B-trees
Java (Derby, H2, many others)
C (MySQL, Postgres)
Object databases
db4o (dual GPL/commercial license)

28
High performance

Embeddable in the decoder
Same process space
Not client/server
Access from server answering queries
Multiprocess access or client/server
Bdb must sync periodically (perf?)
Transactions probably too slow
Need recovery strategy

29
Test Assumptions

Process IDD messages in memory (vs) write to file then postprocess
Store in files, add external indexing (vs) store data in database
One database vs many?
Embedded vs client/server
SQL vs specific queries
SQL allows ad-hoc queries - performance?
2D indexing

30
Conclusions

Test/time various indexing strategies and technologies
Production
scouring
Eventually part of IDD/TDS
Must be easy to maintain (Java)
Scale to large archives / data volumes
