HBase架构的重要模块与动作

写一下个人看了hbase文章后的一些看法,


Overview

hbase通过row key来查找相应的列,即正排,如下图,

image

hbase数据结构

es可以通过对每个field做倒排(基于Lucene),从而快速查找对应field(field可以是全匹配,可以是切词后的全匹配,可以是模糊匹配等)的所属docSet。类似于hbase,如果使用docId在es中直接查找的话,就是正排,即docId = row key,如下,

doc1(doc1id) -> field1, field2, field3, ...
doc2(doc2id) -> field1, field2, field3, ...

其中hbase与es有不少相似之处,如WAL, failover(reply), flush, merge等。


HBase Architecture

以下内容主要copy自mapr,加入了个人的理解和组织整理,

https://mapr.com/blog/in-depth-look-hbase-architecture

hbase主要有三个组件,

  • region server(rs), serve data for reads and writes. When accessing data, clients communicate with RS directly. One RS can serve about 1,000 regions by default
    • region, a region contains all rows in the table between the region’s start key and end key
  • hbase master(HMaster), handle region assignment, DDL (create, delete tables) operations
    • coordinating the region servers
      • 在启动或者故障恢复时分配region
      • 监控hbase集群中所有rs实例的状态
    • admin functions
      • 处理client发送过来的CRUD
  • Zookeeper(zk), maintains a live cluster state,检测node服务的存活性与可得性,用以保证公共状态的有效传输与共享

image

overview1

image

overview2

image

overview3

上面这三个组件,zk用于协调分布式系统的状态共享信息。rs和hmaster通过一个session会话连接到zk,rs和HMaster通过heartbeats心跳保持与zk的通信。

client读写hbase的过程,

  • client gets the rs that hosts the meta table from zk
  • client will query the meta server to get the rs corresponding to the row key it wants to access. The client caches this information along with the meta table location
  • It will get the Rows from the corresponding rs

Important Module

下面分别介绍hbase的重要模块,

meta table

  • The meta table is an HBase table that keeps a list of all regions in the system
  • The meta table is like a B tree
  • The meta table structure is as follows:
    • Key: region start key,region id
    • Values: RegionServer

image

meta table

region server

A Region Server runs on an HDFS data node and has the following components:

  • WAL: Write Ahead Log is a file on the distributed file system(disk). The WAL is used to store new data that hasn’t yet been persisted to permanent storage; it is used for recovery in the case of failure
  • BlockCache: is the read cache. It stores frequently read data in memory. Least Recently Used(LRU) data is evicted when full
  • MemStore: is the write cache. It stores new data which has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family(CF) per region
  • Hfiles: stores the rows as sorted KeyValues on disk

image

region server

region

HBase Tables are divided horizontally by row key range into “Regions”. A region contains all rows in the table between the region’s start key and end key. Regions are assigned to the nodes in the cluster, called rs, and these serve data for reads and writes.

  • A table can be divided horizontally into one or more regions. A region contains a contiguous, sorted range of rows between a start key and an end key
  • Each region is 1GB in size (default)
  • A region of a table is served to the client by a rs
  • A rs can serve about 1,000 regions

image

region1

image

region2

memStore

The memStore stores updates(writing ops) in memory as sorted KeyValues, the same as it would be stored in an HFile. There is one memStore per column family. The updates are sorted per column family.

image

memStore

HFile

Data is stored in an HFile which contains sorted key/values. When the memStore accumulates enough data, the entire sorted KeyValue set is written to a new HFile in HDFS. This is a sequential write. It is very fast, as it avoids moving the disk drive head. 顺序写disk,避免磁盘的寻道耗时。

image

HFile

HFile structure

A HFile contains a multi-layered index which allows HBase to seek to the data without having to read the whole file. The multi-level index is like a B+ tree. HFile分为数据块,索引块,bloom过滤器以及trailer,

  • data block stores increasing order keyValues with size in 64KB
  • each block has it own leaf-index
  • bloom filters help to skip files that do not contain a certain row key
  • Trailer主要记录了HFile的基本信息,各个部分的偏移和地址信息

image

HFile structure

HFile Index

The index is loaded when the HFile is opened and kept in memory. This allows lookups to be performed with a single disk seek. (类似于es/lucene里面的底层里的打开着的倒排)

image

HFile index(blockCache allover the rs)


Important Action

下面介绍hbase里面的重要操作action,

hbase写过程

hbase的写过程类似es的写过程

  1. When the client issues a Put request, the first step is to write the data to the WAL,
    • edits are appended to the end of the WAL file that is stored on disk
    • WAL is used to recover not-yet-persisted data in case a server crashes
  2. Once the data is written to the WAL, it is placed in the MemStore(write). Then, the put request acknowledgement returns to the client

image

hbase写过程

hbase读过程

  1. Scanner looks for the Row cells in the blockCache(read) module. Recently Read Key Values are cached here, and LRU are evicted when memory is needed, return when found
  2. looks in the memStore(write) module, the write cache in memory containing the most recent writes
  3. If the scanner does not find all of the row cells in the memStore and blockCache, then HBase will use the blockCache indexes and bloom filters to load HFiles into memory, which may contain the target row cells

image

hbase读过程

region flush刷新过程

Writing key-value data into the memStore, i.e., when the memStore accumulates enough data, the entire sorted set is written to a new HFile in HDFS. These files are created over time as KeyValue edits sorted in the memStores are flushed as files to disk. (类似es的flush, i.e., refresh_interval )

image

memStore flush data into disk

HFile merge合并过程

One region flush would create one small disk file, 这样hbase读的第三步可能需要打开很多小文件来查找所需的row-keyValues,每个open file都需要占用一定资源,希望尽量将文件操作描述符减少,即把多而小的文件合并为少而大的文件。整体类似es的merge过程index.store.throttle.max_bytes_per_sec,不同的是hbase merge分了两类,

  1. minor compaction, hbase will automatically pick some smaller HFiles and rewrite them into fewer bigger Hfiles
  2. major compaction, merges and rewrites all the HFiles in a region to one HFile per column family, and in the process, drops deleted or expired cells (There is one memStore per column family)

image

minor compaction

image

major compaction

region split

默认情况下,一张表只有一个region。但是如果数据一直写入该表,导致该region放不下一张表了(一个region默认大小是1GB),那么需要将这张表进行水平split切割,分成2个region。在HMaster触发load balance负载均衡之前,这2个split region会在同一个rs里,但是在load balance之后,可能其中一个region会被转移到其他的rs中去(rs类似于街区block,整个hbase类似于城市city)。

image

region split

读负载均衡

上节提及的HMaster会将同一个rs里面的多个regions里面的某些regions迁移到别的rs中。如果一个region被迁移出原生所在的rs,那么在新rs中读取这个迁出region的数据的时候,还是要回到旧rs中获取数据(数据本地性问题,类似spark的locality )。直到一个major compaction触发,新rs即可把迁出region的数据move到其本地node。

image

本地性locality


Failover

下面介绍hbase的容灾策略,

数据副本

  • WAL, HFile都是写hdfs disk的,利用了hdfs的原生replica功能
  • memStore, blockCache都是内存级别的,没有副本备份功能

数据恢复

  1. zk通过heartbeat检查rs/node的crash
  2. zk通知HMaster不要再把新连接请求发送到那个crash rs/node了
  3. HMaster检测到rs crash之后,会将crash rs里面的region数据迁移到某个active rs里去
  4. crash rs里面的memStore和blockCache都丢失了(内存),但是WAL和HFile有副本在(disk),而blockCache是cache read请求的,即便丢了,在hbase读过程中可以利用下一层来look up。而memStore是记录着写过程的,如果memStore在crash时刻还有未flush的data,那么这些data就会丢失,fortunately写过程是一个双写过程,既写到memStore了,也写到了WAL,所以这部分丢失的data可以通过WAL replay回放来恢复
  5. 可能需要关心WAL的数据同步速率可能会落后于memStore crash前的同步速度,那么回放时就会有duplication数据,即HFile里面已经有了WAL的数据了,但是还是将WAL的数据replay到memStore,然后memStore再flush到新的HFile,那么在HFile merge的时候就需要加入一些deduplication处理,即WAL -> memStore -> HFile

image

rs crash

image

memStore data recovery

最后原文中还提及了mapr对hbase的改进,有兴趣可以去看看。


Reference