A Summary of Big Data File Formats and Compression Algorithms

A quick summary of the file formats and compression algorithms used in Hadoop/Hive.


Overview

File formats and compression algorithms are a major tuning point in big data systems, and the two are usually chosen and tuned together.


1. File Formats

A file format is the way in which information is stored or encoded in a computer file. In Hive it refers to how records are stored inside a file. Since we are dealing with structured data, each record follows a known structure, and how those records are encoded in the file is what defines the file format.

| file format  | characteristics                                   | Hive storage option   |
|--------------|---------------------------------------------------|-----------------------|
| TextFile     | plain text, default format                        | STORED AS TEXTFILE    |
| SequenceFile | row-based, binary key-value pairs, splittable     | STORED AS SEQUENCEFILE|
| Avro         | row-based, binary or JSON encoding, splittable    | STORED AS AVRO        |
| RCFile       | columnar, run-length encoding (RLE)               | STORED AS RCFILE      |
| ORCFile      | optimized RCFile, flattens complex types          | STORED AS ORC         |
| Parquet      | column-oriented binary format, supports nested data | STORED AS PARQUET   |
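
As a rough illustration of how the `STORED AS` clauses above are used (the table and column names here are made up), a plain-text staging table can be rewritten into a columnar table in Hive DDL:

```sql
-- Hypothetical plain-text staging table loaded from CSV
CREATE TABLE logs_text (
  event_time STRING,
  user_id    BIGINT,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Hypothetical columnar copy for analytics; ORC could be swapped for PARQUET
CREATE TABLE logs_orc (
  event_time STRING,
  user_id    BIGINT,
  url        STRING
)
STORED AS ORC;

-- Convert by rewriting the data into the columnar table
INSERT OVERWRITE TABLE logs_orc
SELECT event_time, user_id, url FROM logs_text;
```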

2. Compression Algorithms

Choosing a compression codec is a balance between the CPU required to compress and uncompress the data, the disk I/O required to read and write it, and the network bandwidth required to send it across the network.

Compression is not recommended if your data is already compressed (such as images in JPEG format); the resulting file can actually be larger than the original.

| compression format | characteristics                                                                     | splittable |
|--------------------|-------------------------------------------------------------------------------------|------------|
| DEFLATE            | DefaultCodec in Hadoop                                                              | no         |
| GZip               | uses more CPU than Snappy or LZO but gives a higher compression ratio; a good choice for cold data | no |
| BZip2              | higher compression ratio than GZip, but slower                                      | yes        |
| LZO                | fast; a good choice for hot data                                                    | yes, if indexed |
| LZ4                | significantly faster than LZO                                                       | no         |
| Snappy             | generally performs better than LZO; a good choice for hot data                      | not by itself; splittable when used inside container formats such as SequenceFile, Avro, ORC, or Parquet |
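
In Hive, format and codec are typically combined by setting the codec on the table, or by enabling compression for intermediate and final job output. A minimal sketch, assuming Hive on MapReduce; the table name is made up and exact property names can vary slightly across versions:

```sql
-- Columnar table compressed with Snappy (use ZLIB instead to trade CPU for a higher ratio)
CREATE TABLE logs_orc_snappy (
  event_time STRING,
  user_id    BIGINT,
  url        STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Compress intermediate (shuffle) data with Snappy: hot data, favor speed
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress final job output with GZip: cold data, favor compression ratio
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
```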

Others