Title: HBase at Xiaomi
Liang Xie / Honghua Feng
xieliang@xiaomi.com, fenghonghua@xiaomi.com
About Us
Honghua Feng
Liang Xie
Outline
- Introduction
- Latency practice
- Some patches we contributed
- Some ongoing patches
- Q&A
About Xiaomi
- Mobile internet company founded in 2010
- Sold 18.7 million phones in 2013
- Over $5 billion in revenue in 2013
- Sold 11 million phones in Q1 2014
Hardware
Software
Internet Services
About Our HBase Team
- 5 members
- Liang Xie
- Shaohui Liu
- Jianwei Cui
- Liangliang He
- Honghua Feng
- Resolved 130 JIRAs so far
Our Clusters and Scenarios
- 15 clusters: 9 online / 2 processing / 4 test
- Scenarios
- MiCloud
- MiPush
- MiTalk
- Perf Counter
Our Latency Pain Points
- Java GC
- Stable page write in OS layer
- Slow buffered IO (FS journal IO)
- Read/Write IO contention
HBase GC Practice
- Bucket cache with off-heap mode
- Xmn / SurvivorRatio / MaxTenuringThreshold
- PretenureSizeThreshold vs. replication source size
- GC concurrent thread number
GC time per day: 2500-3000 s -> 300-600 s !!!
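As a rough illustration of these knobs, the sketch below shows the off-heap bucket cache settings programmatically and the JVM flags as comments; the configuration keys are standard HBase settings, while every value here is an assumed example rather than our production setting.

    // Illustrative sketch only: the values are assumptions, not production settings.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class GcTuningSketch {
      public static Configuration offHeapBucketCacheConf() {
        Configuration conf = HBaseConfiguration.create();
        // An off-heap bucket cache keeps most cached blocks outside the Java heap,
        // shrinking the old generation and shortening GC pauses.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        conf.setInt("hbase.bucketcache.size", 4096);   // MB of off-heap cache (example value)
        return conf;
      }
      // Example RegionServer JVM flags (e.g. in hbase-env.sh), again only illustrative:
      //   -Xmn1g -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=3
      //   -XX:PretenureSizeThreshold=<bytes>   // pretenure big allocations such as
      //                                        // replication source entries
      //   -XX:ConcGCThreads=<n> -XX:ParallelGCThreads=<n>
    }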
Write Latency Spikes
- HBase client put
  -> HRegion.batchMutate
  -> HLog.sync
  -> SequenceFileLogWriter.sync
  -> DFSOutputStream.flushOrSync
  -> DFSOutputStream.waitForAckedSeqno  <-- stuck here often!
- DataNode pipeline write, in BlockReceiver.receivePacket()
  -> receiveNextPacket
  -> mirrorPacketTo(mirrorOut)  // write the packet to the mirror
  -> out.write/flush            // write data to the local disk <-- buffered IO
- Added instrumentation (HDFS-6110) showed the stalled write was the culprit; strace results also confirmed it
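These spikes can also be observed from the application side with a simple probe like the hypothetical sketch below, which times individual puts through the classic HTable client API (table and family names are made up); the real root-cause data came from the HDFS-6110 instrumentation and strace, not from a probe like this.

    // Hypothetical client-side probe: times each put and prints the slow ones.
    // Uses the classic HTable API (HBase 0.94/0.98 era); table/family names are made up.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutLatencyProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "latency_test");   // hypothetical table
        table.setAutoFlush(true);                          // one RPC (and HLog sync) per put
        byte[] cf = Bytes.toBytes("f");
        for (int i = 0; i < 100000; i++) {
          Put put = new Put(Bytes.toBytes("row-" + i));
          put.add(cf, Bytes.toBytes("q"), new byte[3200]); // 3200-byte value as in our test
          long start = System.nanoTime();
          table.put(put);                                  // client put -> batchMutate -> HLog.sync
          long ms = (System.nanoTime() - start) / 1_000_000;
          if (ms > 100) {                                  // spikes stuck in waitForAckedSeqno
            System.out.println("slow put #" + i + ": " + ms + " ms");
          }
        }
        table.close();
      }
    }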
Root Cause of Write Latency Spikes
- write() is expected to be fast
- But blocked by write-back sometimes!
Stable Page Write Issue Workaround
- Workaround: kernel 2.6.32-279 (RHEL 6.3) -> 2.6.32-220 (6.2), or 2.6.32-279 (6.3) -> 2.6.32-358 (6.4)
- Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
Root Cause of Write Latency Spikes
- ...
- 0xffffffffa00dc09d do_get_write_access+0x29d/0x520 jbd2
- 0xffffffffa00dc471 jbd2_journal_get_write_access+0x31/0x50 jbd2
- 0xffffffffa011eb78 __ext4_journal_get_write_access+0x38/0x80 ext4
- 0xffffffffa00fa253 ext4_reserve_inode_write+0x73/0xa0 ext4
- 0xffffffffa00fa2cc ext4_mark_inode_dirty+0x4c/0x1d0 ext4
- 0xffffffffa00fa6c4 ext4_generic_write_end+0xe4/0xf0 ext4
- 0xffffffffa00fdf74 ext4_writeback_write_end+0x74/0x160 ext4
- 0xffffffff81111474 generic_file_buffered_write+0x174/0x2a0 kernel
- 0xffffffff81112d60 __generic_file_aio_write+0x250/0x480 kernel
- 0xffffffff81112fff generic_file_aio_write+0x6f/0xe0 kernel
- 0xffffffffa00f3de1 ext4_file_write+0x61/0x1e0 ext4
- 0xffffffff811762da do_sync_write+0xfa/0x140 kernel
- 0xffffffff811765d8 vfs_write+0xb8/0x1a0 kernel
- 0xffffffff81176fe1 sys_write+0x51/0x90 kernel
XFS on the latest kernels can relieve the journal IO blocking issue and is more friendly to metadata-heavy scenarios like HBase/HDFS
Write Latency Spikes Testing
- 8 YCSB threads write 20 million rows, each 3200 bytes; 3 DataNodes, kernel 3.12.17
- Count the stalled write() calls that cost > 100 ms
The largest write() latency on ext4: 600 ms!
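The stall counting can be approximated outside HDFS with a stand-alone probe such as the sketch below, which issues 3200-byte buffered writes and counts those slower than 100 ms; the numbers quoted above came from the DataNode-side instrumentation, so treat this only as an illustration.

    // Stand-alone sketch (not the HDFS-6110 instrumentation): measures how often a
    // plain buffered write() stalls for more than 100 ms on the local filesystem.
    import java.io.FileOutputStream;

    public class WriteStallProbe {
      public static void main(String[] args) throws Exception {
        byte[] payload = new byte[3200];                 // same record size as the YCSB test
        int stalls = 0;
        long worstMs = 0;
        try (FileOutputStream out = new FileOutputStream("/tmp/write_stall_probe.dat")) {
          for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            out.write(payload);                          // goes to the page cache; may block on writeback/journal
            long ms = (System.nanoTime() - start) / 1_000_000;
            if (ms > 100) stalls++;                      // count stalled write() calls (> 100 ms)
            worstMs = Math.max(worstMs, ms);
          }
        }
        System.out.println("stalled writes: " + stalls + ", worst: " + worstMs + " ms");
      }
    }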
Hedged Read (HDFS-5776)
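Hedged reads let the DFSClient issue a second read to a different replica when the first one is slow. HDFS-5776 exposes this through client configuration; the sketch below enables it, with example values for the thread pool size and the trigger threshold.

    // Enabling hedged reads in the DFSClient (HDFS-5776); the numbers are examples.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class HedgedReadConfig {
      public static Configuration withHedgedReads() {
        Configuration conf = HBaseConfiguration.create();
        // Thread pool used to issue the extra (hedged) reads; 0 disables the feature.
        conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
        // If the first replica has not responded within this many ms, also read another replica.
        conf.setLong("dfs.client.hedged.read.threshold.millis", 100);
        return conf;
      }
    }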
Other Meaningful Latency Work
- Long first put issue (HBASE-10010)
- Token invalid (HDFS-5637)
- Retry/timeout setting in DFSClient
- Reduce write traffic? (HLog compression)
- HDFS IO Priority (HADOOP-10410)
Wish List
- Real-time HDFS, especially priority-related work
- GC-friendly core data structures
- More off-heap; Shenandoah GC
- TCP / disk IO characteristic analysis
- Need more eyes on the OS
- Stay tuned
Some Patches Xiaomi Contributed
- New write thread model (HBASE-8755)
- Reverse scan (HBASE-4811)
- Per table/CF replication (HBASE-8751)
- Block index key optimization (HBASE-7845)
1. New Write Thread Model
Old model:
- 256 WriteHandler threads, each appending its edits to the local buffer
- Each WriteHandler writes to HDFS itself
- Each WriteHandler syncs to HDFS itself
Problem: every WriteHandler does everything, causing severe lock contention!
New Write Thread Model
New model:
- 256 WriteHandler threads append edits to the local buffer
- 1 AsyncWriter thread writes the buffer to HDFS
- 4 AsyncSyncer threads sync to HDFS
- 1 AsyncNotifier thread notifies the waiting WriteHandlers
New Write Thread Model
- Heavy load: huge improvement (3.5x)
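The sketch below condenses the idea behind this model (it is not the actual HBASE-8755 code): handlers only enqueue their edits and wait, while dedicated AsyncWriter/AsyncSyncer threads do the HDFS work and release the waiters.

    // Condensed illustration of the new write model (not the real HBASE-8755 code):
    // many handlers enqueue edits, one writer batches them, a syncer makes them
    // durable, and waiters are released once their edit is synced.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class WriteModelSketch {
      static class Edit {
        final byte[] payload;
        final CountDownLatch synced = new CountDownLatch(1);
        Edit(byte[] payload) { this.payload = payload; }
      }

      private final BlockingQueue<Edit> buffer = new LinkedBlockingQueue<>();     // "local buffer"
      private final BlockingQueue<List<Edit>> toSync = new LinkedBlockingQueue<>();

      void start() {
        Thread asyncWriter = new Thread(() -> {                                   // 1 AsyncWriter
          try {
            while (true) {
              List<Edit> batch = new ArrayList<>();
              batch.add(buffer.take());
              buffer.drainTo(batch);          // batch whatever is queued ("write to HDFS")
              toSync.put(batch);
            }
          } catch (InterruptedException ignored) { }
        });
        Thread asyncSyncer = new Thread(() -> {                                   // 1..4 AsyncSyncers
          try {
            while (true) {
              List<Edit> batch = toSync.take();                                   // "sync to HDFS"
              for (Edit e : batch) e.synced.countDown();                          // AsyncNotifier role, merged here
            }
          } catch (InterruptedException ignored) { }
        });
        asyncWriter.setDaemon(true); asyncSyncer.setDaemon(true);
        asyncWriter.start(); asyncSyncer.start();
      }

      // Called by each WriteHandler: enqueue the edit, then block until it is synced.
      void append(byte[] payload) throws InterruptedException {
        Edit e = new Edit(payload);
        buffer.put(e);
        e.synced.await();
      }
    }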
2. Reverse Scan
1. All scanners seek to their previous rows (SeekBefore)
2. Figure out the next row: the max of those previous rows
3. All scanners seek to the first KV of that next row (SeekTo)
(Diagram: example KVs of Row1-Row6 spread across store files, illustrating the three seek steps.)
Performance: ~70% of forward scan
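On the client side, reverse scan is just a flag on Scan (available since HBase 0.98); a minimal usage sketch with made-up table and row names:

    // Minimal reverse-scan usage (HBase 0.98+ client API); table and keys are examples.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReverseScanExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test_table");
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("Row6"));   // with a reversed scan, start from the larger row
        scan.setReversed(true);                    // HBASE-4811: walk rows in descending order
        try (ResultScanner scanner = table.getScanner(scan)) {
          for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
          }
        }
        table.close();
      }
    }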
3. Per Table/CF Replication
Source cluster: T1 {cfA, cfB}, T2 {cfX, cfY}; PeerA is a full backup, PeerB only wants T2:cfX
- If PeerB creates T2 only: replication can't work!
- If PeerB creates both T1 and T2: all data is replicated!
Need a way to specify which data to replicate!
Per Table/CF Replication
- add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
- The source (T1 {cfA, cfB}, T2 {cfX, cfY}) still replicates everything to PeerA, while PeerB receives only T2:cfX
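The same scope can be set from Java; the sketch below assumes the 0.98-era ReplicationAdmin.addPeer(id, clusterKey, tableCFs) overload that came with per-table/CF replication, and the peer id, ZooKeeper quorum and table spec are all examples.

    // Sketch, assuming the 0.98-era ReplicationAdmin.addPeer(id, clusterKey, tableCFs)
    // overload added with per-table/CF replication; names and ZK quorum are examples.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.replication.ReplicationAdmin;

    public class PerCfReplicationSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        ReplicationAdmin admin = new ReplicationAdmin(conf);
        // Replicate only column family cfX of table T2 to PeerB;
        // PeerA would be added without a tableCFs argument to receive everything.
        admin.addPeer("PeerB", "peerb-zk1,peerb-zk2,peerb-zk3:2181:/hbase", "T2:cfX");
        admin.close();
      }
    }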
4. Block Index Key Optimization
k1 = "ab" is the last key of Block 1; k2 = "ah, hello world" is the first key of Block 2
- Before: Block 2's block index key is "ah, hello world" (the full first key)
- Now: Block 2's block index key is "ac", a fake key chosen so that k1 < key <= k2
- Saves seeking into the previous block when the search key falls in ["ac", "ah, hello world")
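The optimization amounts to indexing Block 2 by the shortest byte string s with k1 < s <= k2 instead of the full first key. The toy sketch below shows that "shortest separator" idea; it is not the actual HFile writer/comparator code and ignores the 0xff overflow corner case.

    // Toy version of the "fake" block index key idea: find a short separator s with
    // lastKeyOfPrevBlock < s <= firstKeyOfCurBlock. Not the real HFile writer code.
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ShortSeparatorSketch {
      static byte[] shortestSeparator(byte[] left, byte[] right) {
        // Find the first byte position where the two keys differ.
        int i = 0;
        while (i < left.length && i < right.length && left[i] == right[i]) i++;
        if (i == left.length) {
          return right;                  // left is a prefix of right: cannot shorten safely
        }
        // Shared prefix plus left's next byte incremented by one: the result is > left
        // and, because it is smaller at position i, still <= right.
        // (The 0xff overflow corner case is ignored for brevity.)
        byte[] sep = Arrays.copyOf(left, i + 1);
        sep[i] = (byte) (sep[i] + 1);
        return sep;
      }

      public static void main(String[] args) {
        byte[] k1 = "ab".getBytes(StandardCharsets.UTF_8);                 // last key of Block 1
        byte[] k2 = "ah, hello world".getBytes(StandardCharsets.UTF_8);    // first key of Block 2
        // Prints "ac": the short index key used for Block 2 instead of the full first key.
        System.out.println(new String(shortestSeparator(k1, k2), StandardCharsets.UTF_8));
      }
    }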
Some Ongoing Patches
- Cross-table/cross-row transaction (HBASE-10999)
- HLog compactor (HBASE-9873)
- Adjusted delete semantic (HBASE-8721)
- Coordinated compaction (HBASE-9528)
- Quorum master (HBASE-10296)
1. Cross-Row Transaction: Themis
http://github.com/xiaomi/themis
- Based on Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
- Two-phase commit: strong cross-table/cross-row consistency
- Global timestamp server: globally, strictly incremental timestamps
- No touching of HBase internals: built on the HBase client and coprocessors
- Performance: read ~90%, write ~23% of raw HBase (the same downgrade as Google Percolator)
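The client-side flow is Percolator-style: take a start timestamp from the timestamp server, buffer mutations, then run a two-phase commit (prewrite locks, then commit with a later timestamp). The sketch below only illustrates that flow; TimestampOracle, TransactionFactory and ThemisTxn are hypothetical names, not the actual API in the repository above.

    // Percolator-style flow only; TimestampOracle, TransactionFactory and ThemisTxn
    // are hypothetical names, not the real API from github.com/xiaomi/themis.
    public class ThemisFlowSketch {
      interface TimestampOracle { long next(); }        // global, strictly increasing timestamps

      interface ThemisTxn {
        void put(String table, byte[] row, byte[] family, byte[] qualifier, byte[] value);
        void prewrite() throws Exception;               // phase 1: write data and lock columns
        void commit(long commitTs) throws Exception;    // phase 2: replace locks with write records
      }

      interface TransactionFactory { ThemisTxn begin(long startTs); }

      static void transferExample(TimestampOracle oracle, TransactionFactory factory) throws Exception {
        ThemisTxn txn = factory.begin(oracle.next());   // snapshot (start) timestamp
        // Mutations are buffered on the client; nothing reaches HBase until prewrite().
        txn.put("Account", b("alice"), b("f"), b("balance"), b("90"));
        txn.put("Account", b("bob"),   b("f"), b("balance"), b("110"));
        txn.prewrite();                                 // all-or-nothing, checked by coprocessors
        txn.commit(oracle.next());                      // commit timestamp > start timestamp
      }

      private static byte[] b(String s) { return s.getBytes(); }
    }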
2. HLog Compactor
Problem: Region x gets few writes, but they are scattered across many HLogs (HLog 1, 2, 3), so those HLogs cannot be archived.
- PeriodicMemstoreFlusher flushes old memstores forcefully
- flushCheckInterval / flushPerChanges are hard to configure (see the config sketch below)
- HBASE-10499: a problematic region can't be flushed!
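For reference, the two knobs above correspond to RegionServer settings along the lines of the sketch below; the key names reflect my reading of recent HBase versions and the values are only examples, not recommendations.

    // The two flush knobs as configuration keys (hedged: key names per recent HBase
    // versions; values are examples only).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class PeriodicFlushConfig {
      public static Configuration example() {
        Configuration conf = HBaseConfiguration.create();
        // flushCheckInterval: flush a memstore whose oldest edit is older than this (ms).
        conf.setInt("hbase.regionserver.optionalcacheflushinterval", 3600000);
        // flushPerChanges: flush once a memstore has accumulated this many edits.
        conf.setLong("hbase.regionserver.flush.per.changes", 30000000L);
        return conf;
      }
    }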
HLog Compactor
- Compact HLog 1, 2, 3, 4 into a new HLog x
- Archive HLog 1, 2, 3, 4
3. Adjusted Delete Semantic
Scenario 1:
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can't be read out
Scenario 2:
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out
Fix: a delete can't mask KVs with a larger MVCC (i.e., put later)
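Scenario 1 can be reproduced with the plain client plus an admin-triggered flush; a sketch using the 0.94/0.98-era API with made-up table, family and timestamp values:

    // Repro sketch for scenario 1 (made-up table/family names, 0.94/0.98-era API):
    // a delete flushed to an HFile masks a later put that reuses the same timestamp.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SameTsDeleteRepro {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "t");
        HBaseAdmin admin = new HBaseAdmin(conf);
        byte[] row = Bytes.toBytes("r"), cf = Bytes.toBytes("f"), q = Bytes.toBytes("q");
        long t0 = 100L;

        Put put = new Put(row);
        put.add(cf, q, t0, Bytes.toBytes("v1"));        // 1. write kvA at t0
        table.put(put);

        Delete del = new Delete(row);
        del.deleteColumn(cf, q, t0);                    // 2. delete kvA at t0 ...
        table.delete(del);
        admin.flush("t");                               //    ... and flush to an HFile
        Thread.sleep(3000);                             // flush is asynchronous; wait briefly

        table.put(put);                                 // 3. write kvA at t0 again

        Result r = table.get(new Get(row));             // 4. read kvA
        System.out.println("value: " + (r.isEmpty() ? "<masked by delete>" :
            Bytes.toString(r.getValue(cf, q))));
        table.close();
        admin.close();
      }
    }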
4. Coordinated Compaction
- Multiple RegionServers compact against HDFS at the same time: compaction storm!
- Compaction uses a global resource (HDFS), but whether to compact is decided locally by each RegionServer!
34Coordinated Compaction
RS
RS
RS
Can I ?
OK
Master
Can I ?
NO
Can I ?
OK
HDFS (global resource)
- Compact is scheduled by master, no compact storm
any longer
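A deliberately simplified, hypothetical sketch of the idea (HBASE-9528 defines its own master/RegionServer protocol): the master hands out a bounded number of compaction permits, so the shared HDFS never sees more than a configured number of concurrent compactions.

    // Hypothetical sketch only; HBASE-9528 defines its own master/RS protocol.
    // The master hands out a bounded number of compaction permits.
    import java.util.concurrent.Semaphore;

    public class CompactionCoordinatorSketch {
      private final Semaphore permits;                    // global budget for HDFS IO

      public CompactionCoordinatorSketch(int maxConcurrentCompactions) {
        this.permits = new Semaphore(maxConcurrentCompactions);
      }

      // Called (conceptually over RPC) by a RegionServer before it starts a compaction.
      public boolean requestPermit() {
        return permits.tryAcquire();                      // "Can I?" -> OK or NO
      }

      // Called when the RegionServer finishes (or abandons) the compaction.
      public void releasePermit() {
        permits.release();
      }
    }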
5. Quorum Master
(Diagram: one active and one standby master, with cluster info/states read from a 3-node ZooKeeper ensemble; RegionServers below.)
- While the active master serves, the standby master stays really idle
- When the standby master becomes active, it needs to rebuild the in-memory state
Quorum Master
(Diagram: a quorum of three masters, Master 1/2/3, one active, serving the RegionServers directly.)
- Better master failover performance: no phase to rebuild the in-memory state
- Better restart performance for a BIG cluster (10K regions)
- No external (ZooKeeper) dependency
- No potential consistency issues
Acknowledgements
Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li, Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen
Thank You!
xieliang@xiaomi.com
fenghonghua@xiaomi.com
www.mi.com