HDFS Schema Design
6 Mar
Hadoop’s Schema-on-Read model does not impose any requirements when loading data into Hadoop.
Data can simply be loaded into HDFS without associating a schema or preprocessing it. Nevertheless, creating a carefully structured and organized repository of your data provides many benefits: for example, it allows you to enforce access and quota controls that prevent accidental deletion or corruption.
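As a minimal sketch of what such controls might look like, the following Java snippet creates an illustrative top-level directory layout and applies permissions and quotas through the HDFS API. The directory names and quota values are assumptions for illustration, not prescriptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class HdfsLayout {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Illustrative top-level layout; the names are a convention, not a rule.
            fs.mkdirs(new Path("/data"), new FsPermission((short) 0750)); // curated datasets
            fs.mkdirs(new Path("/etl"),  new FsPermission((short) 0700)); // in-flight ETL output
            fs.mkdirs(new Path("/tmp"),  new FsPermission((short) 0777)); // scratch space

            // Quotas cap runaway jobs; setQuota requires HDFS superuser rights.
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;
                // Hypothetical limits: at most 1M names and 10 TB of storage under /etl.
                dfs.setQuota(new Path("/etl"), 1_000_000L, 10L * 1024 * 1024 * 1024 * 1024);
            }
            fs.close();
        }
    }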
The data model will be highly dependent on the specific use case. For example, data warehouse implementations and other event stores are likely to use a schema similar to the traditional star schema, with structured fact and dimension tables. Unstructured and semi-structured data, on the other hand, are likely to rely more on directory placement and metadata management.
Develop standard practices and enforce them, especially when multiple teams are sharing the data.
Make sure your design will work well with the tools you are planning to use. The schema design is highly dependent on the way the data will be queried.
Keep usage patterns in mind when designing a schema. Different data processing and querying patterns work better with different schema designs. Understanding the main use cases and data retrieval requirements results in a schema that is easier to maintain and support in the long term and that performs better during data processing.
Optimize the organization of data with partitioning, bucketing, and denormalization strategies. Keep in mind that storing a large number of small files in Hadoop can lead to excessive memory use on the NameNode, which holds the metadata for every file and block in memory.
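For instance, a common partitioning convention is one directory per day, encoded Hive-style as column=value. A rough sketch, where the base path, dataset name, and "dt" partition column are all hypothetical:

    import org.apache.hadoop.fs.Path;
    import java.time.LocalDate;

    public class PartitionPaths {
        // Hive-style partition directory: /<base>/<dataset>/dt=<day>.
        static Path partitionDir(String dataset, LocalDate day) {
            return new Path(String.format("/data/%s/dt=%s", dataset, day));
        }

        public static void main(String[] args) {
            // A query filtered on dt reads only the matching directories,
            // and writing a few sizable files per partition (rather than
            // many tiny ones) keeps NameNode memory use in check.
            System.out.println(partitionDir("clickstream", LocalDate.of(2016, 3, 6)));
            // prints: /data/clickstream/dt=2016-03-06
        }
    }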
A good average bucket size is a few multiples of the HDFS block size. An even distribution of data when hashed on the bucketing column is important because it yields buckets of roughly equal size; a skewed distribution makes the largest bucket a bottleneck for any query that reads it. Also, having the number of buckets as a power of two is quite common.
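To see why an even hash distribution matters, here is a small standalone sketch that assigns synthetic keys to buckets with the common hash-modulo rule and reports the spread. The key format and bucket count are made up for illustration:

    import java.util.Random;

    public class BucketSpread {
        // Mirrors the common bucketing rule: bucket = hash(key) mod numBuckets.
        static int bucketFor(Object key, int numBuckets) {
            return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
        }

        public static void main(String[] args) {
            int numBuckets = 64; // a power of two, as is common
            long[] counts = new long[numBuckets];
            Random rnd = new Random(42);
            for (int i = 0; i < 1_000_000; i++) {
                // Synthetic keys standing in for a real bucketing column.
                counts[bucketFor("user-" + rnd.nextInt(500_000), numBuckets)]++;
            }
            long min = Long.MAX_VALUE, max = 0;
            for (long c : counts) { min = Math.min(min, c); max = Math.max(max, c); }
            System.out.printf("rows per bucket: min=%d max=%d%n", min, max);
            // A wide min/max gap signals skew: the largest bucket becomes
            // the slowest task in every query that touches it.
        }
    }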
A Hadoop schema often consolidates many small dimension tables into a few larger dimension tables by joining them during the ETL process; the duplication this introduces is cheap on HDFS and saves join work at query time.
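A toy sketch of that idea: a small dimension lookup is folded into each fact record at ETL time, so downstream queries never need the join. All table contents and field names here are invented:

    import java.util.HashMap;
    import java.util.Map;

    public class DenormalizeAtEtl {
        public static void main(String[] args) {
            // A small dimension table, loaded into memory during ETL.
            Map<Integer, String> regionById = new HashMap<>();
            regionById.put(1, "EMEA");
            regionById.put(2, "APAC");

            // Fact rows as {orderId, regionId, amount}.
            int[][] facts = {{100, 1, 250}, {101, 2, 75}};
            for (int[] f : facts) {
                // Emit a widened record with the dimension value inlined.
                System.out.printf("order=%d region=%s amount=%d%n",
                        f[0], regionById.get(f[1]), f[2]);
            }
        }
    }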