# Overview

HBase looks up the corresponding columns by row key, i.e., a forward index, as illustrated below.

ES builds an inverted index on every field (via Lucene), so it can quickly find the set of documents matching a given field value (exact match, token match after analysis, fuzzy match, etc.). Analogous to HBase, looking a document up in ES directly by docId is a forward lookup, i.e., docId = row key, as follows:

doc1(doc1id) -> field1, field2, field3, ...
doc2(doc2id) -> field1, field2, field3, ...
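The two lookup directions can be sketched in plain Python (a toy model with made-up field names and docIds, not the ES/Lucene API):

```python
# A toy forward index (docId -> fields) and the inverted index built from it.
# All names here are illustrative, not part of any HBase/ES API.
forward = {
    "doc1": {"title": "hbase", "tag": "db"},
    "doc2": {"title": "lucene", "tag": "db"},
}

# Invert it: (field, value) -> set of docIds, which is what ES queries consult.
inverted = {}
for doc_id, fields in forward.items():
    for field, value in fields.items():
        inverted.setdefault((field, value), set()).add(doc_id)

print(sorted(inverted[("tag", "db")]))  # inverted lookup: which docs match tag=db
print(forward["doc1"]["title"])         # forward lookup by docId
```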


# HBase Architecture

https://mapr.com/blog/in-depth-look-hbase-architecture

HBase has three main components:

• region server (RS): serves data for reads and writes; when accessing data, clients communicate with an RS directly. One RS can serve about 1,000 regions by default. Each region contains all rows in the table between the region's start key and end key
• hbase master (HMaster): handles region assignment and DDL (create, delete tables) operations
  • coordinates the region servers
  • assigns regions at startup and during failure recovery
  • monitors the state of every RS instance in the cluster
  • handles DDL requests sent by clients
• Zookeeper (ZK): maintains live cluster state; it detects the liveness and availability of node services and ensures that shared state is propagated and shared reliably

The flow of a client reading from or writing to HBase:

• the client gets from ZK the RS that hosts the META table
• the client queries that META RS to find the RS serving the row key it wants to access, and caches this information along with the META table location
• the client gets the rows from the corresponding RS
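The three steps above can be sketched as a toy lookup with client-side caching (all server names and region boundaries here are made up):

```python
import bisect

# Step 1: ZK tells the client which RS hosts the META table (illustrative name).
ZK_META_LOCATION = "rs-meta"

# META content: sorted region start keys -> serving RS (step 2 queries this).
META_START_KEYS = ["a", "m", "t"]
META_RS = ["rs1", "rs2", "rs3"]

cache = {}  # the client caches row-key -> RS answers

def locate_rs(row_key):
    """Return the RS serving row_key, caching the answer like the client does."""
    if row_key not in cache:
        # Rightmost region whose start key <= row_key.
        i = bisect.bisect_right(META_START_KEYS, row_key) - 1
        cache[row_key] = META_RS[max(i, 0)]
    return cache[row_key]

print(locate_rs("n"))  # falls in the region starting at "m" -> rs2
```

The second lookup for the same row key is served from `cache` without touching META, which is why the text notes that the client caches this information.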

# Important Module

### meta table

• The meta table is an HBase table that keeps a list of all regions in the system
• The meta table is like a B tree
• The meta table structure is as follows:
  • Key: region start key, region id
  • Values: RegionServer

### region server

A Region Server runs on an HDFS data node and has the following components:

• WAL: the Write Ahead Log is a file on the distributed file system (disk). The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure
• BlockCache: the read cache. It stores frequently read data in memory. Least Recently Used (LRU) data is evicted when it is full
• MemStore: the write cache. It stores new data that has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family (CF) per region
• HFiles: store the rows as sorted KeyValues on disk
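Since the BlockCache evicts least-recently-used data when full, its eviction behavior can be sketched with a minimal LRU cache (illustrative only, not HBase's actual BlockCache implementation):

```python
from collections import OrderedDict

# Minimal LRU read cache in the spirit of the BlockCache (toy model).
class LruCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

cache = LruCache(2)
cache.put("blk1", "a")
cache.put("blk2", "b")
cache.get("blk1")         # touch blk1, so blk2 becomes the LRU entry
cache.put("blk3", "c")    # over capacity: evicts blk2
print(cache.get("blk2"))  # None, it was evicted
```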

### region

HBase tables are divided horizontally by row key range into "regions". A region contains all rows in the table between the region's start key and end key. Regions are assigned to nodes in the cluster, called region servers (RS), and these serve data for reads and writes.

• A table can be divided horizontally into one or more regions; a region contains a contiguous, sorted range of rows between a start key and an end key
• Each region is 1 GB in size by default
• A region of a table is served to the client by a single RS
• An RS can serve about 1,000 regions

### memStore

The memStore stores updates (write ops) in memory as sorted KeyValues, the same way they would be stored in an HFile. There is one memStore per column family, and the updates are sorted per column family.
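A minimal sketch of this per-column-family sorted structure (names are illustrative, not HBase's API):

```python
from collections import defaultdict

# Toy MemStore: one KeyValue map per column family.
memstore = defaultdict(dict)  # cf -> {row_key: value}

def put(cf, row_key, value):
    memstore[cf][row_key] = value

def flush_snapshot(cf):
    """Return the CF's KeyValues sorted by key, as they would land in an HFile."""
    return sorted(memstore[cf].items())

put("cf1", "row3", "c")
put("cf1", "row1", "a")
put("cf1", "row2", "b")
print(flush_snapshot("cf1"))  # [('row1', 'a'), ('row2', 'b'), ('row3', 'c')]
```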

### HFile

Data is stored in an HFile, which contains sorted key/values. When the memStore accumulates enough data, the entire sorted KeyValue set is written to a new HFile in HDFS. This is a sequential write and is very fast, since writing sequentially avoids moving the disk drive head (no seek latency).

#### HFile structure

An HFile contains a multi-layered index, which allows HBase to seek to the data without reading the whole file. The multi-level index is like a B+ tree. An HFile consists of data blocks, index blocks, Bloom filters, and a trailer:

• data blocks store KeyValues in increasing key order, in blocks of 64 KB by default
• each block has its own leaf index
• Bloom filters help skip files that do not contain a certain row key
• the trailer records the HFile's basic metadata and the offsets/addresses of the other sections
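The Bloom filter's role, answering "definitely absent" vs. "maybe present" so that whole HFiles can be skipped, can be sketched as follows (the hashing scheme and sizes here are assumptions for illustration, not HBase's implementation):

```python
# Toy Bloom filter used to decide whether an HFile can be skipped for a row key.
M, K = 64, 3  # bit-array size and number of hash probes (assumed values)

def bloom_bits(key):
    # Double hashing to derive K probe positions (illustrative scheme).
    h1 = hash(key) & 0xFFFFFFFF
    h2 = hash(key + "#salt") & 0xFFFFFFFF
    return {(h1 + i * h2) % M for i in range(K)}

def make_filter(keys):
    bits = set()
    for k in keys:
        bits |= bloom_bits(k)
    return bits

def might_contain(bits, key):
    # False means "definitely absent" -> the HFile can be skipped entirely.
    return bloom_bits(key) <= bits

hfile_keys = ["row1", "row2"]
bits = make_filter(hfile_keys)
print(might_contain(bits, "row1"))  # True
```

Note that a Bloom filter can return false positives (a "maybe" for an absent key) but never false negatives, which is exactly what makes skipping safe.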

#### HFile Index

The index is loaded when the HFile is opened and kept in memory. This allows lookups to be performed with a single disk seek. (Analogous to the inverted index that ES/Lucene keeps open in memory at the segment level.)

# Important Action

### hbase write path

HBase's write path is similar to that of ES.

1. When the client issues a Put request, the first step is to write the data to the WAL:
  • edits are appended to the end of the WAL file stored on disk
  • the WAL is used to recover not-yet-persisted data if a server crashes
2. Once the data is written to the WAL, it is placed in the MemStore (the write cache). Then the Put acknowledgement is returned to the client
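The WAL-first ordering of the two steps can be sketched as a toy model (real WAL appends go to files on HDFS; names here are illustrative):

```python
# Sketch of the write path: append to the WAL first, then apply to the MemStore.
wal = []        # stand-in for the on-disk write-ahead log
memstore = {}   # stand-in for the in-memory write cache

def put(row_key, value):
    wal.append((row_key, value))   # step 1: durable append to the WAL
    memstore[row_key] = value      # step 2: apply the edit in memory
    return "ack"                   # only now is the client acknowledged

print(put("row1", "v1"))
print(wal, memstore)
```

The ordering matters: if the server crashes between the two steps, the edit is still in the WAL and can be replayed, so no acknowledged write is ever lost.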

### hbase read path

1. The scanner first looks for the row cells in the BlockCache (the read cache). Recently read KeyValues are cached here, and least recently used entries are evicted when memory is needed; if found, the value is returned
2. Next it looks in the MemStore (the write cache), the in-memory cache containing the most recent writes
3. If the scanner does not find all of the row cells in the MemStore and BlockCache, HBase uses the BlockCache indexes and Bloom filters to load into memory the HFiles that may contain the target row cells
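The three-step fallthrough can be sketched as a toy model (indexes and Bloom filters are elided; all names are illustrative):

```python
# Sketch of the read path: BlockCache, then MemStore, then HFiles.
block_cache = {"row1": "cached"}
memstore = {"row2": "recent"}
hfiles = [{"row3": "on-disk"}]  # flushed files (index/Bloom-filter pruning elided)

def get(row_key):
    if row_key in block_cache:    # step 1: read cache
        return block_cache[row_key]
    if row_key in memstore:       # step 2: write cache (most recent writes)
        return memstore[row_key]
    for hfile in hfiles:          # step 3: consult the HFiles on disk
        if row_key in hfile:
            block_cache[row_key] = hfile[row_key]  # populate the read cache
            return hfile[row_key]
    return None

print(get("row3"))  # found in an HFile, then cached for next time
```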

### region flush

When the MemStore accumulates enough KeyValue data, its entire sorted set is written to a new HFile in HDFS. These files accumulate over time, as the sorted KeyValue edits in the MemStores are flushed to disk as files (similar to ES's flush, cf. refresh_interval).
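A minimal sketch of the flush trigger (the threshold here is invented; real HBase flushes when the MemStore reaches a configured size):

```python
# Toy region flush: when the MemStore is "big enough", write its sorted
# contents out as one new immutable file (threshold value is made up).
FLUSH_THRESHOLD = 3

memstore = {}
hfiles = []  # each flush appends one new sorted file

def put(row_key, value):
    memstore[row_key] = value
    if len(memstore) >= FLUSH_THRESHOLD:
        hfiles.append(sorted(memstore.items()))  # one sequential write to disk
        memstore.clear()

for k, v in [("r2", "b"), ("r1", "a"), ("r3", "c"), ("r4", "d")]:
    put(k, v)

print(len(hfiles), hfiles[0])  # one flushed file so far, sorted by key
```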

### HFile merge

Each region flush creates one small file on disk, so step 3 of the HBase read path may have to open many small files to find the required row KeyValues. Every open file consumes resources, so we want to keep the number of open file descriptors down, i.e., merge many small files into fewer large ones. Overall this resembles ES's merge process (cf. index.store.throttle.max_bytes_per_sec); the difference is that HBase compaction comes in two kinds:

1. minor compaction: HBase automatically picks some smaller HFiles and rewrites them into fewer, bigger HFiles
2. major compaction: merges and rewrites all the HFiles in a region into one HFile per column family, and in the process drops deleted or expired cells (recall there is one MemStore, and thus one set of HFiles, per column family)
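A major compaction's merge-and-drop behavior can be sketched as follows (a toy model; `DELETE` here stands in for HBase's tombstone markers):

```python
# Toy major compaction: merge all HFiles of a column family into one,
# keeping the newest version of each key and dropping delete markers.
DELETE = object()  # stand-in for a tombstone

def major_compact(hfiles):
    merged = {}
    for hfile in hfiles:   # oldest file first, so newer files overwrite
        merged.update(hfile)
    # Drop deleted cells and emit one sorted file.
    return sorted((k, v) for k, v in merged.items() if v is not DELETE)

old = {"row1": "a", "row2": "b"}
new = {"row2": DELETE, "row3": "c"}   # row2 was deleted after the first flush
print(major_compact([old, new]))      # [('row1', 'a'), ('row3', 'c')]
```

This also shows why deletes can only be physically dropped at major compaction: until all files are merged, an older file may still hold a value that the tombstone must shadow.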

# Failover

### data replicas

• the WAL and HFiles are both written to HDFS (disk) and rely on HDFS's native replication
• the MemStore and BlockCache are memory-only and have no replica backup

### data recovery

1. ZK detects an RS/node crash through heartbeats
2. ZK notifies the HMaster so that new connection requests are no longer sent to the crashed RS/node
3. Once the HMaster detects the RS crash, it reassigns the regions of the crashed RS to active RSes
4. The crashed RS's MemStore and BlockCache are lost (both are in-memory), but the WAL and HFiles have replicas on disk. The BlockCache only caches read requests, so even if it is lost, the read path can simply fall back to the next layer to look up the data. The MemStore, however, records writes: any data not yet flushed at crash time is lost. Fortunately the write path is a dual write (data goes to both the MemStore and the WAL), so this lost data can be recovered by replaying the WAL
5. One caveat: the WAL may contain edits that were already flushed before the crash, so replay can introduce duplicate data, i.e., data that already lives in an HFile is replayed from the WAL into the MemStore and then flushed again to a new HFile. HFile compaction therefore needs some deduplication handling along the WAL -> MemStore -> HFile path
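One deduplication strategy can be sketched with per-edit sequence ids (an illustrative assumption here; the idea is simply to skip WAL edits already covered by a flushed HFile):

```python
# Toy WAL replay after a crash, with dedup against already-flushed edits.
# Each edit carries a sequence id; edits at or below the last flushed id are
# assumed to already live in some HFile and are skipped (values are made up).
wal = [(1, "row1", "a"), (2, "row2", "b"), (3, "row3", "c")]
last_flushed_seq = 2  # edits 1-2 reached an HFile before the crash

def replay(wal, last_flushed_seq):
    memstore = {}
    for seq, row_key, value in wal:
        if seq <= last_flushed_seq:
            continue  # deduplicate: this edit is already in an HFile
        memstore[row_key] = value
    return memstore

print(replay(wal, last_flushed_seq))  # {'row3': 'c'}
```

Only the unflushed tail of the WAL is rebuilt into the MemStore, so the subsequent flush introduces no duplicates into the HFiles.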