Hadoop-specific file formats include columnar formats such as Parquet and RCFile, serialization formats like Avro, and file-based data structures such as sequence files.
Splittability and compression are the key considerations when storing data in Hadoop. Splittability allows large files to be split into chunks for input to MapReduce and other types of jobs, and is therefore a fundamental part of parallel processing.
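As a minimal sketch of how these two concerns interact in practice (the paths and job wiring here are hypothetical), a MapReduce job can write block-compressed SequenceFile output; because compression is applied per block of records rather than to the whole file, the output remains splittable for downstream jobs:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CompressedOutputJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "compressed-output");
            // Mapper/reducer setup omitted; this identity job just copies input.
            FileInputFormat.addInputPath(job, new Path("/tmp/raw-input"));
            FileOutputFormat.setOutputPath(job, new Path("/tmp/compressed-out"));

            // Write output as block-compressed SequenceFiles: compressing whole
            // blocks of records keeps the output splittable for later jobs.
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }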
SequenceFiles
SequenceFiles store data as binary key-value pairs and can be uncompressed or compressed. SequenceFiles are well supported within the Hadoop ecosystem; however, support outside of the ecosystem is limited, and they are only supported in Java.
Storing a large number of small files in Hadoop can cause problems, such as memory pressure on the NameNode and many small, inefficient map tasks. A common use case for SequenceFiles is therefore as a container for smaller files.
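A minimal sketch of that container pattern (the local directory and output path are hypothetical) packs a set of small files into one SequenceFile, using each file's name as the key and its raw bytes as the value:

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/tmp/small-files.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            try {
                // One key-value record per small file: key = file name, value = raw bytes.
                for (File f : new File("/tmp/small-files").listFiles()) {
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }

Reading the container back is symmetric via SequenceFile.Reader, and the single large file avoids burdening the NameNode with thousands of tiny entries.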
Serialization Formats
Serialization is the process of turning data structures into byte streams, either for storage or for transmission over a network. The main serialization format used by Hadoop is Writables. Writables are compact and fast, but limited to Java.
However, other serialization frameworks are seeing increasing use within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro. Of these, Avro is the best fit, as it was created specifically to address the limitations of Hadoop Writables.
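For comparison with those frameworks, the sketch below (a hypothetical StockTick type) shows the Writable contract itself: a class serializes its fields to a DataOutput and reads them back from a DataInput in the same order, which is compact and fast but ties the data format to Java:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Hypothetical record type used only to illustrate the Writable contract.
    public class StockTick implements Writable {
        private Text symbol = new Text();
        private double price;

        public void set(String sym, double p) {
            symbol.set(sym);
            price = p;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            // Serialize fields in a fixed order to a compact binary stream.
            symbol.write(out);
            out.writeDouble(price);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Deserialize fields in exactly the same order they were written.
            symbol.readFields(in);
            price = in.readDouble();
        }
    }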
Thrift
Thrift was designed at Facebook as a framework for developing cross-language interfaces to services. Using Thrift allowed Facebook to implement a single interface that could be used with different languages to access different underlying systems.
However, Thrift does not support internal compression of records, is not splittable, and has no native MapReduce support.
Protocol Buffers
Protocol Buffers (protobuf) were developed at Google to facilitate data exchange between services written in different languages. Like Thrift, Protocol Buffers are not splittable, do not support internal compression of records, and have no native MapReduce support.
Avro
Avro is a language-neutral data serialization system designed to address the main downside of Hadoop Writables: lack of language portability. Since Avro stores the schema in the header of each file, it is self-describing, and Avro files can easily be read in a language other than the one used to write them. Avro files are also splittable.
Avro stores the data definition (schema) in JSON format, making it easy to read and interpret, while the data itself is stored in a binary format, making it compact and efficient.
Avro supports native MapReduce and schema evolution: the schema used to read a file does not need to match the schema used to write it, which provides great flexibility as requirements change.
Avro supports a number of primitive data types, such as boolean, int, float, and string, as well as complex types such as array, map, and enum.
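As a concrete sketch (the user schema and file name are hypothetical), the generic Avro Java API can define a schema in JSON, including an array complex type, and write a record to an Avro data file; the schema ends up in the file header, so any Avro reader can interpret the data later:

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWriteExample {
        public static void main(String[] args) throws Exception {
            // Schema is plain JSON: readable by people, stored in the file header.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"user\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"},"
                + "{\"name\":\"emails\",\"type\":{\"type\":\"array\",\"items\":\"string\"}}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("age", 30);
            user.put("emails", java.util.Arrays.asList("alice@example.com"));

            // The data itself is written in a compact binary encoding.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }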
Columnar Formats
Most RDBMSs store data in a row-oriented format. This is efficient when many columns of a record need to be fetched, and it can also be more efficient for writes, particularly if all columns of the record are available at write time, because the record can be written with a single disk seek.
More recently, a number of databases have introduced columnar data storage, which is well suited to data warehousing and to queries that access only a small subset of columns. Columnar storage also provides more efficient compression.
Columnar file formats supported on Hadoop include RCFile, Optimized Row Columnar (ORC), and Parquet.
RCFile
The RCFile format was developed to provide fast data loading, fast query processing, and efficient processing for MapReduce applications, although in practice it has only seen use as a Hive storage format. RCFile breaks files into row splits and then uses column-oriented storage within each split. The format has some deficiencies that prevent optimal query and compression performance, but it is still a fairly common Hive storage format.
ORC
The ORC format was created to address some of the shortcomings of the RCFile format, specifically around query performance and storage efficiency. ORC:
• Provides lightweight, always-on compression via type-specific readers and writers.
• Supports the Hive type model, including new primitives such as decimal as well as complex types.
• Is a splittable storage format.
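A minimal sketch using the ORC Core Java API (the file path and column names are hypothetical) shows the type-described schema and the vectorized, batch-oriented writer:

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class OrcWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hive-style schema using ORC's type model.
            TypeDescription schema =
                TypeDescription.fromString("struct<name:string,age:int>");

            Writer writer = OrcFile.createWriter(new Path("people.orc"),
                OrcFile.writerOptions(conf).setSchema(schema));

            // Rows are appended in column batches; each column has its own vector.
            VectorizedRowBatch batch = schema.createRowBatch();
            BytesColumnVector name = (BytesColumnVector) batch.cols[0];
            LongColumnVector age = (LongColumnVector) batch.cols[1];

            int row = batch.size++;
            name.setVal(row, "Alice".getBytes(StandardCharsets.UTF_8));
            age.vector[row] = 30;

            writer.addRowBatch(batch);
            writer.close();
        }
    }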
Parquet
The Parquet documentation describes it as follows: “Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”
Parquet shares many of the same design goals as ORC, but is intended to be a general-purpose storage format for Hadoop. The goal is to create a format that’s suitable for different MapReduce interfaces such as Java, Hive, and Pig, and also suitable for other processing engines
such as Impala and Spark. Parquet provides the following benefits, many of which it shares with ORC:
• Similar to ORC files, Parquet allows for returning only required data fields, thereby reducing I/O and increasing performance.
• Is designed to support complex nested data structures.
• Compression can be specified on a per-column level.
• Fully supports reading and writing with the Avro and Thrift APIs (see the sketch following this list).
• Stores full metadata at the end of files, so Parquet files are self-documenting.
• Uses efficient and extensible encoding schemas, such as bit-packing and run-length encoding (RLE).
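As a sketch of that Avro interoperability (the schema and paths are hypothetical), the parquet-avro bindings let the same Avro schema and GenericRecord objects shown earlier be written to a Parquet file:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetViaAvroExample {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"user\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("age", 30);

            // The Avro schema drives the Parquet file schema; each column chunk
            // is compressed with the chosen codec (Snappy here).
            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("users.parquet"))
                         .withSchema(schema)
                         .withCompressionCodec(CompressionCodecName.SNAPPY)
                         .build()) {
                writer.write(user);
            }
        }
    }

Any Parquet reader (including Hive, Impala, or Spark) sees ordinary Parquet columns; the Avro schema is only the authoring interface.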
Conclusion
Having a single interface to all the files in your Hadoop cluster is valuable. When it comes to picking a file format, you will want to pick one with a schema, because in the end most data in Hadoop will be structured or semistructured.
If you need a schema, Avro and Parquet are both great options. However, we don’t want to have to maintain an Avro version of the schema and a Parquet version.
Thankfully, this isn’t an issue, because Parquet can be read and written with Avro APIs and Avro schemas.
We can meet our goal of having one interface to interact with our Avro and Parquet files, and we get both a block-based option (Avro) and a columnar option (Parquet) for storing our data.