When Hadoop jobs need to share data, they can use any database. Avro is a serialization system that bundles the data together with a schema for understanding it. Each packet comes with a JSON data structure explaining how the data can be parsed. This header specifies the structure for the data up top avoiding the need to write out extra tags in the data to mark the fields. The result can be considerably more compact than traditional formats, such as XML or JSON, when the data is regular.
The illustration shows an Avro schema for a file with three different fields: name, favorite number, and favorite color.
Avro is another Apache project with APIs and code in Java, C++, Python, and other languages at http://avro.apache.org.