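A Summary of Big Data File Formats and Compression Algorithms

A brief summary of the file formats and compression algorithms used in Hadoop/Hive.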

Overview

File formats and compression algorithms are a major optimization focus in big data systems, and the two are usually tuned together.

1. File Formats

A file format is the way in which information is stored or encoded in a computer file. In Hive, it refers to how records are stored inside the file. Because we are dealing with structured data, each record has its own structure, and the way those records are encoded in the file is what defines the file format.

File format	Characteristics	Hive storage option
TextFile	plain text, Hive's default format	`STORED AS TEXTFILE`
SequenceFile	row-based, binary key-value, splittable	`STORED AS SEQUENCEFILE`
Avro	row-based, binary or JSON, splittable	`STORED AS AVRO`
RCFile	columnar, run-length encoding (RLE)	`STORED AS RCFILE`
ORCFile	optimized RCFile, flattened	`STORED AS ORC`
Parquet	column-oriented binary file, supports nested data	`STORED AS PARQUET`
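
To make the row-based versus columnar distinction in the table more concrete, here is a minimal Python sketch. It does not reproduce how RCFile or ORC are actually laid out on disk; the sample records and the `to_columns` and `rle_encode` helpers are hypothetical names chosen for illustration. It only shows why grouping values by column tends to produce long runs that run-length encoding can collapse.

```python
from itertools import groupby

# Hypothetical sample records: (country, status, user_id).
records = [
    ("CN", "ok", 1001),
    ("CN", "ok", 1002),
    ("CN", "ok", 1003),
    ("US", "ok", 1004),
    ("US", "error", 1005),
]

def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return [list(col) for col in zip(*rows)]

def rle_encode(values):
    """Run-length encode a sequence as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

country, status, user_id = to_columns(records)

print(rle_encode(country))  # [('CN', 3), ('US', 2)]
print(rle_encode(status))   # [('ok', 4), ('error', 1)]
print(rle_encode(user_id))  # every run has length 1, so RLE does not help here
```

Low-cardinality columns such as `country` shrink to a handful of pairs, while a unique column such as `user_id` gains nothing, which is one reason columnar formats usually pick an encoding per column.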

2. Compression Algorithms

Choosing a compression algorithm means balancing three costs: the CPU time required to compress and decompress the data, the disk I/O required to read and write the data, and the network bandwidth required to send the data across the network.

Compression is not recommended if your data is already compressed (such as images in JPEG format). In that case, the resulting file can end up larger than the original.
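
The point about already-compressed data can be checked with a small experiment. The sketch below uses Python's standard `zlib` module (DEFLATE) and, as an assumption, seeded pseudo-random bytes standing in for already-compressed content such as a JPEG, since both look like high-entropy noise to a compressor. The sizes, the seed, and the `compressed_size` helper are arbitrary choices for illustration.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Return the size of `data` after DEFLATE compression at the default level."""
    return len(zlib.compress(data))

# Stand-in for already-compressed content: seeded pseudo-random bytes (assumption).
rng = random.Random(42)
noise = bytes(rng.getrandbits(8) for _ in range(100_000))

# Plain repetitive text, for contrast.
text = b"2020-01-01 INFO request served in 12 ms\n" * 2_500

print(len(noise), compressed_size(noise))  # compressed size is slightly larger
print(len(text), compressed_size(text))    # compressed size is far smaller
```

The high-entropy input grows by a few bytes of framing overhead, while the repetitive text shrinks dramatically.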

Compression format	Characteristics	Splittable
DEFLATE	Hadoop's DefaultCodec	no
GZip	uses more CPU than Snappy or LZO; provides a higher compression ratio; a good choice for cold data	no
BZip2	more compression than GZip	yes
LZO	better choice for hot data	yes if indexed
LZ4	significantly faster than LZO	no
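Snappy	performs better than LZO, better choice for hot data	no on its own; yes when used inside a splittable container format such as SequenceFile, Avro, ORC, or Parquet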
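
The trade-off between CPU cost and compression ratio in the table can be measured directly. The following is a rough sketch that uses only the codecs available in the Python standard library (`zlib` for DEFLATE, `gzip`, and `bz2`); LZO, LZ4, and Snappy need third-party packages and are left out. The sample data and the `make_sample` and `benchmark` helpers are assumptions for illustration, and timings depend heavily on the machine and the data, so no numbers are claimed here.

```python
import bz2
import gzip
import time
import zlib

# Codecs from the Python standard library: name -> (compress, decompress).
CODECS = {
    "DEFLATE": (zlib.compress, zlib.decompress),
    "GZip": (gzip.compress, gzip.decompress),
    "BZip2": (bz2.compress, bz2.decompress),
}

def make_sample(n_lines: int = 20_000) -> bytes:
    """Build hypothetical log-like text to serve as benchmark input."""
    lines = (
        f"2020-01-01 12:00:{i % 60:02d} INFO user={i % 997} status=ok\n"
        for i in range(n_lines)
    )
    return "".join(lines).encode("utf-8")

def benchmark(data: bytes) -> dict:
    """Return {codec: (compression ratio, compression seconds)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        if decompress(packed) != data:  # round trip must be lossless
            raise ValueError(f"{name} round trip failed")
        results[name] = (len(data) / len(packed), elapsed)
    return results

for name, (ratio, seconds) in benchmark(make_sample()).items():
    print(f"{name:8s} ratio={ratio:6.2f} time={seconds * 1000:8.2f} ms")
```

On text like this, BZip2 would usually be expected to give the highest ratio at the highest CPU cost, in line with the table, but it is worth running the comparison on your own data before choosing a codec.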