Title: HBase at Xiaomi
Liang Xie / Honghua Feng
xieliang@xiaomi.com, fenghonghua@xiaomi.com
About Us
Honghua Feng
Liang Xie
Outline
- Introduction
- Latency practice
- Some patches we contributed
- Some ongoing patches
- Q&A
About Xiaomi
- Mobile internet company founded in 2010
- Sold 18.7 million phones in 2013
- Over $5 billion in revenue in 2013
- Sold 11 million phones in Q1 2014
Hardware
Software
Internet Services
About Our HBase Team
- 5 members
- Liang Xie
- Shaohui Liu
- Jianwei Cui
- Liangliang He
- Honghua Feng
- Resolved 130 JIRAs so far
Our Clusters and Scenarios
- 15 clusters: 9 online / 2 processing / 4 test
- Scenarios
- MiCloud
- MiPush
- MiTalk
- Perf Counter
Our Latency Pain Points
- Java GC
- Stable page write in OS layer
- Slow buffered IO (FS journal IO)
- Read/Write IO contention
HBase GC Practice
- Bucket cache with off-heap mode
- Xmn / SurvivorRatio / MaxTenuringThreshold
- PretenureSizeThreshold vs. replication source size
- GC concurrent thread number
GC time per day: 2500-3000 s -> 300-600 s !!!
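As a rough illustration of these knobs, the sketch below shows the off-heap bucket cache settings programmatically and the JVM flags as comments; the configuration keys are standard HBase settings, while every value here is an assumed example rather than our production setting.

    // Illustrative sketch only: the values are assumptions, not production settings.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class GcTuningSketch {
      public static Configuration offHeapBucketCacheConf() {
        Configuration conf = HBaseConfiguration.create();
        // An off-heap bucket cache keeps most cached blocks outside the Java heap,
        // shrinking the old generation and shortening GC pauses.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        conf.setInt("hbase.bucketcache.size", 4096);   // MB of off-heap cache (example value)
        return conf;
      }
      // Example RegionServer JVM flags (e.g. in hbase-env.sh), again only illustrative:
      //   -Xmn1g -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=3
      //   -XX:PretenureSizeThreshold=<bytes>   // pretenure big allocations such as
      //                                        // replication source entries
      //   -XX:ConcGCThreads=<n> -XX:ParallelGCThreads=<n>
    }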
Write Latency Spikes
- HBase client put
  -> HRegion.batchMutate
  -> HLog.sync
  -> SequenceFileLogWriter.sync
  -> DFSOutputStream.flushOrSync
  -> DFSOutputStream.waitForAckedSeqno  <-- stuck here often!
- DataNode pipeline write, in BlockReceiver.receivePacket()
  -> receiveNextPacket
  -> mirrorPacketTo(mirrorOut)  // write the packet to the mirror
  -> out.write/flush            // write data to the local disk <-- buffered IO
- Added instrumentation (HDFS-6110) showed the stalled write was the culprit; strace results also confirmed it
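These spikes can also be observed from the application side with a simple probe like the hypothetical sketch below, which times individual puts through the classic HTable client API (table and family names are made up); the real root-cause data came from the HDFS-6110 instrumentation and strace, not from a probe like this.

    // Hypothetical client-side probe: times each put and prints the slow ones.
    // Uses the classic HTable API (HBase 0.94/0.98 era); table/family names are made up.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutLatencyProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "latency_test");   // hypothetical table
        table.setAutoFlush(true);                          // one RPC (and HLog sync) per put
        byte[] cf = Bytes.toBytes("f");
        for (int i = 0; i < 100000; i++) {
          Put put = new Put(Bytes.toBytes("row-" + i));
          put.add(cf, Bytes.toBytes("q"), new byte[3200]); // 3200-byte value as in our test
          long start = System.nanoTime();
          table.put(put);                                  // client put -> batchMutate -> HLog.sync
          long ms = (System.nanoTime() - start) / 1_000_000;
          if (ms > 100) {                                  // spikes stuck in waitForAckedSeqno
            System.out.println("slow put #" + i + ": " + ms + " ms");
          }
        }
        table.close();
      }
    }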
Root Cause of Write Latency Spikes
- write() is expected to be fast
- But blocked by write-back sometimes!
Stable Page Write Issue Workaround
- Workaround: kernel 2.6.32-279 (RHEL 6.3) -> 2.6.32-220 (6.2), or 2.6.32-279 (6.3) -> 2.6.32-358 (6.4)
- Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
Root Cause of Write Latency Spikes
- ...
- 0xffffffffa00dc09d do_get_write_access+0x29d/0x520 jbd2
- 0xffffffffa00dc471 jbd2_journal_get_write_access+0x31/0x50 jbd2
- 0xffffffffa011eb78 __ext4_journal_get_write_access+0x38/0x80 ext4
- 0xffffffffa00fa253 ext4_reserve_inode_write+0x73/0xa0 ext4
- 0xffffffffa00fa2cc ext4_mark_inode_dirty+0x4c/0x1d0 ext4
- 0xffffffffa00fa6c4 ext4_generic_write_end+0xe4/0xf0 ext4
- 0xffffffffa00fdf74 ext4_writeback_write_end+0x74/0x160 ext4
- 0xffffffff81111474 generic_file_buffered_write+0x174/0x2a0 kernel
- 0xffffffff81112d60 __generic_file_aio_write+0x250/0x480 kernel
- 0xffffffff81112fff generic_file_aio_write+0x6f/0xe0 kernel
- 0xffffffffa00f3de1 ext4_file_write+0x61/0x1e0 ext4
- 0xffffffff811762da do_sync_write+0xfa/0x140 kernel
- 0xffffffff811765d8 vfs_write+0xb8/0x1a0 kernel
- 0xffffffff81176fe1 sys_write+0x51/0x90 kernel
XFS on the latest kernels can relieve the journal IO blocking issue and is more friendly to metadata-heavy scenarios like HBase/HDFS
Write Latency Spikes Testing
- 8 YCSB threads write 20 million rows, each 3200 bytes; 3 DataNodes, kernel 3.12.17
- Count the stalled write() calls that cost > 100 ms
The largest write() latency on ext4: 600 ms!
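The stall counting can be approximated outside HDFS with a stand-alone probe such as the sketch below, which issues 3200-byte buffered writes and counts those slower than 100 ms; the numbers quoted above came from the DataNode-side instrumentation, so treat this only as an illustration.

    // Stand-alone sketch (not the HDFS-6110 instrumentation): measures how often a
    // plain buffered write() stalls for more than 100 ms on the local filesystem.
    import java.io.FileOutputStream;

    public class WriteStallProbe {
      public static void main(String[] args) throws Exception {
        byte[] payload = new byte[3200];                 // same record size as the YCSB test
        int stalls = 0;
        long worstMs = 0;
        try (FileOutputStream out = new FileOutputStream("/tmp/write_stall_probe.dat")) {
          for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            out.write(payload);                          // goes to the page cache; may block on writeback/journal
            long ms = (System.nanoTime() - start) / 1_000_000;
            if (ms > 100) stalls++;                      // count stalled write() calls (> 100 ms)
            worstMs = Math.max(worstMs, ms);
          }
        }
        System.out.println("stalled writes: " + stalls + ", worst: " + worstMs + " ms");
      }
    }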
Hedged Read (HDFS-5776)
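Hedged reads let the DFSClient issue a second read to a different replica when the first one is slow. HDFS-5776 exposes this through client configuration; the sketch below enables it, with example values for the thread pool size and the trigger threshold.

    // Enabling hedged reads in the DFSClient (HDFS-5776); the numbers are examples.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class HedgedReadConfig {
      public static Configuration withHedgedReads() {
        Configuration conf = HBaseConfiguration.create();
        // Thread pool used to issue the extra (hedged) reads; 0 disables the feature.
        conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
        // If the first replica has not responded within this many ms, also read another replica.
        conf.setLong("dfs.client.hedged.read.threshold.millis", 100);
        return conf;
      }
    }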
Other Meaningful Latency Work
- Long first put issue (HBASE-10010)
- Token invalid (HDFS-5637)
- Retry/timeout setting in DFSClient
- Reduce write traffic? (HLog compression)
- HDFS IO Priority (HADOOP-10410)
Wish List
- Real-time HDFS, especially priority-related work
- GC-friendly core data structures
- More off-heap; Shenandoah GC
- TCP / disk IO characteristic analysis
- Need more eyes on the OS
- Stay tuned
Some Patches Xiaomi Contributed
- New write thread model (HBASE-8755)
- Reverse scan (HBASE-4811)
- Per table/CF replication (HBASE-8751)
- Block index key optimization (HBASE-7845)
1. New Write Thread Model
Old model:
- 256 WriteHandler threads, each appending its edits to the local buffer
- Each WriteHandler writes to HDFS itself
- Each WriteHandler syncs to HDFS itself
Problem: every WriteHandler does everything, causing severe lock contention!
New Write Thread Model
New model:
- 256 WriteHandler threads append edits to the local buffer
- 1 AsyncWriter thread writes the buffer to HDFS
- 4 AsyncSyncer threads sync to HDFS
- 1 AsyncNotifier thread notifies the waiting WriteHandlers
New Write Thread Model
- Heavy load: huge improvement (3.5x)
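The sketch below condenses the idea behind this model (it is not the actual HBASE-8755 code): handlers only enqueue their edits and wait, while dedicated AsyncWriter/AsyncSyncer threads do the HDFS work and release the waiters.

    // Condensed illustration of the new write model (not the real HBASE-8755 code):
    // many handlers enqueue edits, one writer batches them, a syncer makes them
    // durable, and waiters are released once their edit is synced.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class WriteModelSketch {
      static class Edit {
        final byte[] payload;
        final CountDownLatch synced = new CountDownLatch(1);
        Edit(byte[] payload) { this.payload = payload; }
      }

      private final BlockingQueue<Edit> buffer = new LinkedBlockingQueue<>();     // "local buffer"
      private final BlockingQueue<List<Edit>> toSync = new LinkedBlockingQueue<>();

      void start() {
        Thread asyncWriter = new Thread(() -> {                                   // 1 AsyncWriter
          try {
            while (true) {
              List<Edit> batch = new ArrayList<>();
              batch.add(buffer.take());
              buffer.drainTo(batch);          // batch whatever is queued ("write to HDFS")
              toSync.put(batch);
            }
          } catch (InterruptedException ignored) { }
        });
        Thread asyncSyncer = new Thread(() -> {                                   // 1..4 AsyncSyncers
          try {
            while (true) {
              List<Edit> batch = toSync.take();                                   // "sync to HDFS"
              for (Edit e : batch) e.synced.countDown();                          // AsyncNotifier role, merged here
            }
          } catch (InterruptedException ignored) { }
        });
        asyncWriter.setDaemon(true); asyncSyncer.setDaemon(true);
        asyncWriter.start(); asyncSyncer.start();
      }

      // Called by each WriteHandler: enqueue the edit, then block until it is synced.
      void append(byte[] payload) throws InterruptedException {
        Edit e = new Edit(payload);
        buffer.put(e);
        e.synced.await();
      }
    }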
2. Reverse Scan
1. All scanners seek to their previous rows (SeekBefore)
2. Figure out the next row: the max of those previous rows
3. All scanners seek to the first KV of that next row (SeekTo)
(Diagram: example KVs of Row1-Row6 spread across store files, illustrating the three seek steps.)
Performance: ~70% of forward scan
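On the client side, reverse scan is just a flag on Scan (available since HBase 0.98); a minimal usage sketch with made-up table and row names:

    // Minimal reverse-scan usage (HBase 0.98+ client API); table and keys are examples.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReverseScanExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test_table");
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("Row6"));   // with a reversed scan, start from the larger row
        scan.setReversed(true);                    // HBASE-4811: walk rows in descending order
        try (ResultScanner scanner = table.getScanner(scan)) {
          for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
          }
        }
        table.close();
      }
    }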
3. Per Table/CF Replication
Source cluster: T1 {cfA, cfB}, T2 {cfX, cfY}; PeerA is a full backup, PeerB only wants T2:cfX
- If PeerB creates T2 only: replication can't work!
- If PeerB creates both T1 and T2: all data is replicated!
Need a way to specify which data to replicate!
Per Table/CF Replication
- add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
- The source (T1 {cfA, cfB}, T2 {cfX, cfY}) still replicates everything to PeerA, while PeerB receives only T2:cfX
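The same scope can be set from Java; the sketch below assumes the 0.98-era ReplicationAdmin.addPeer(id, clusterKey, tableCFs) overload that came with per-table/CF replication, and the peer id, ZooKeeper quorum and table spec are all examples.

    // Sketch, assuming the 0.98-era ReplicationAdmin.addPeer(id, clusterKey, tableCFs)
    // overload added with per-table/CF replication; names and ZK quorum are examples.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.replication.ReplicationAdmin;

    public class PerCfReplicationSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        ReplicationAdmin admin = new ReplicationAdmin(conf);
        // Replicate only column family cfX of table T2 to PeerB;
        // PeerA would be added without a tableCFs argument to receive everything.
        admin.addPeer("PeerB", "peerb-zk1,peerb-zk2,peerb-zk3:2181:/hbase", "T2:cfX");
        admin.close();
      }
    }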
4. Block Index Key Optimization
k1 = "ab" is the last key of Block 1; k2 = "ah, hello world" is the first key of Block 2
- Before: Block 2's block index key is "ah, hello world" (the full first key)
- Now: Block 2's block index key is "ac", a fake key chosen so that k1 < key <= k2
- Saves seeking into the previous block when the search key falls in ["ac", "ah, hello world")
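The optimization amounts to indexing Block 2 by the shortest byte string s with k1 < s <= k2 instead of the full first key. The toy sketch below shows that "shortest separator" idea; it is not the actual HFile writer/comparator code and ignores the 0xff overflow corner case.

    // Toy version of the "fake" block index key idea: find a short separator s with
    // lastKeyOfPrevBlock < s <= firstKeyOfCurBlock. Not the real HFile writer code.
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ShortSeparatorSketch {
      static byte[] shortestSeparator(byte[] left, byte[] right) {
        // Find the first byte position where the two keys differ.
        int i = 0;
        while (i < left.length && i < right.length && left[i] == right[i]) i++;
        if (i == left.length) {
          return right;                  // left is a prefix of right: cannot shorten safely
        }
        // Shared prefix plus left's next byte incremented by one: the result is > left
        // and, because it is smaller at position i, still <= right.
        // (The 0xff overflow corner case is ignored for brevity.)
        byte[] sep = Arrays.copyOf(left, i + 1);
        sep[i] = (byte) (sep[i] + 1);
        return sep;
      }

      public static void main(String[] args) {
        byte[] k1 = "ab".getBytes(StandardCharsets.UTF_8);                 // last key of Block 1
        byte[] k2 = "ah, hello world".getBytes(StandardCharsets.UTF_8);    // first key of Block 2
        // Prints "ac": the short index key used for Block 2 instead of the full first key.
        System.out.println(new String(shortestSeparator(k1, k2), StandardCharsets.UTF_8));
      }
    }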
Some Ongoing Patches
- Cross-table/cross-row transaction (HBASE-10999)
- HLog compactor (HBASE-9873)
- Adjusted delete semantic (HBASE-8721)
- Coordinated compaction (HBASE-9528)
- Quorum master (HBASE-10296)
1. Cross-Row Transaction: Themis
http://github.com/xiaomi/themis
- Based on Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
- Two-phase commit: strong cross-table/cross-row consistency
- Global timestamp server: globally, strictly incremental timestamps
- No touching of HBase internals: built on the HBase client and coprocessors
- Performance: read ~90%, write ~23% of raw HBase (the same downgrade as Google Percolator)
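The client-side flow is Percolator-style: take a start timestamp from the timestamp server, buffer mutations, then run a two-phase commit (prewrite locks, then commit with a later timestamp). The sketch below only illustrates that flow; TimestampOracle, TransactionFactory and ThemisTxn are hypothetical names, not the actual API in the repository above.

    // Percolator-style flow only; TimestampOracle, TransactionFactory and ThemisTxn
    // are hypothetical names, not the real API from github.com/xiaomi/themis.
    public class ThemisFlowSketch {
      interface TimestampOracle { long next(); }        // global, strictly increasing timestamps

      interface ThemisTxn {
        void put(String table, byte[] row, byte[] family, byte[] qualifier, byte[] value);
        void prewrite() throws Exception;               // phase 1: write data and lock columns
        void commit(long commitTs) throws Exception;    // phase 2: replace locks with write records
      }

      interface TransactionFactory { ThemisTxn begin(long startTs); }

      static void transferExample(TimestampOracle oracle, TransactionFactory factory) throws Exception {
        ThemisTxn txn = factory.begin(oracle.next());   // snapshot (start) timestamp
        // Mutations are buffered on the client; nothing reaches HBase until prewrite().
        txn.put("Account", b("alice"), b("f"), b("balance"), b("90"));
        txn.put("Account", b("bob"),   b("f"), b("balance"), b("110"));
        txn.prewrite();                                 // all-or-nothing, checked by coprocessors
        txn.commit(oracle.next());                      // commit timestamp > start timestamp
      }

      private static byte[] b(String s) { return s.getBytes(); }
    }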
2. HLog Compactor
Problem: Region x gets few writes, but they are scattered across many HLogs (HLog 1, 2, 3), so those HLogs cannot be archived.
- PeriodicMemstoreFlusher flushes old memstores forcefully
- flushCheckInterval / flushPerChanges are hard to configure (see the config sketch below)
- HBASE-10499: a problematic region can't be flushed!
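For reference, the two knobs above correspond to RegionServer settings along the lines of the sketch below; the key names reflect my reading of recent HBase versions and the values are only examples, not recommendations.

    // The two flush knobs as configuration keys (hedged: key names per recent HBase
    // versions; values are examples only).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class PeriodicFlushConfig {
      public static Configuration example() {
        Configuration conf = HBaseConfiguration.create();
        // flushCheckInterval: flush a memstore whose oldest edit is older than this (ms).
        conf.setInt("hbase.regionserver.optionalcacheflushinterval", 3600000);
        // flushPerChanges: flush once a memstore has accumulated this many edits.
        conf.setLong("hbase.regionserver.flush.per.changes", 30000000L);
        return conf;
      }
    }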
HLog Compactor
- Compact HLog 1, 2, 3, 4 into a new HLog x
- Archive HLog 1, 2, 3, 4
3. Adjusted Delete Semantic
Scenario 1:
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can't be read out
Scenario 2:
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out
Fix: a delete can't mask KVs with a larger MVCC (i.e., put later)
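Scenario 1 can be reproduced with the plain client plus an admin-triggered flush; a sketch using the 0.94/0.98-era API with made-up table, family and timestamp values:

    // Repro sketch for scenario 1 (made-up table/family names, 0.94/0.98-era API):
    // a delete flushed to an HFile masks a later put that reuses the same timestamp.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SameTsDeleteRepro {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "t");
        HBaseAdmin admin = new HBaseAdmin(conf);
        byte[] row = Bytes.toBytes("r"), cf = Bytes.toBytes("f"), q = Bytes.toBytes("q");
        long t0 = 100L;

        Put put = new Put(row);
        put.add(cf, q, t0, Bytes.toBytes("v1"));        // 1. write kvA at t0
        table.put(put);

        Delete del = new Delete(row);
        del.deleteColumn(cf, q, t0);                    // 2. delete kvA at t0 ...
        table.delete(del);
        admin.flush("t");                               //    ... and flush to an HFile
        Thread.sleep(3000);                             // flush is asynchronous; wait briefly

        table.put(put);                                 // 3. write kvA at t0 again

        Result r = table.get(new Get(row));             // 4. read kvA
        System.out.println("value: " + (r.isEmpty() ? "<masked by delete>" :
            Bytes.toString(r.getValue(cf, q))));
        table.close();
        admin.close();
      }
    }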
4. Coordinated Compaction
- Multiple RegionServers compact against HDFS at the same time: compaction storm!
- Compaction uses a global resource (HDFS), but whether to compact is decided locally by each RegionServer!
34Coordinated Compaction
RS
RS
RS
Can I ?
OK
Master
Can I ?
NO
Can I ?
OK
HDFS (global resource)
- Compact is scheduled by master, no compact storm
any longer
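A deliberately simplified, hypothetical sketch of the idea (HBASE-9528 defines its own master/RegionServer protocol): the master hands out a bounded number of compaction permits, so the shared HDFS never sees more than a configured number of concurrent compactions.

    // Hypothetical sketch only; HBASE-9528 defines its own master/RS protocol.
    // The master hands out a bounded number of compaction permits.
    import java.util.concurrent.Semaphore;

    public class CompactionCoordinatorSketch {
      private final Semaphore permits;                    // global budget for HDFS IO

      public CompactionCoordinatorSketch(int maxConcurrentCompactions) {
        this.permits = new Semaphore(maxConcurrentCompactions);
      }

      // Called (conceptually over RPC) by a RegionServer before it starts a compaction.
      public boolean requestPermit() {
        return permits.tryAcquire();                      // "Can I?" -> OK or NO
      }

      // Called when the RegionServer finishes (or abandons) the compaction.
      public void releasePermit() {
        permits.release();
      }
    }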
5. Quorum Master
(Diagram: one active and one standby master, with cluster info/states read from a 3-node ZooKeeper ensemble; RegionServers below.)
- While the active master serves, the standby master stays really idle
- When the standby master becomes active, it needs to rebuild the in-memory state
Quorum Master
(Diagram: a quorum of three masters, Master 1/2/3, one active, serving the RegionServers directly.)
- Better master failover performance: no phase to rebuild the in-memory state
- Better restart performance for a BIG cluster (10K regions)
- No external (ZooKeeper) dependency
- No potential consistency issues
Acknowledgements
Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li, Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen
Thank You!
xieliang@xiaomi.com
fenghonghua@xiaomi.com
www.mi.com