Google Protobuf vs. Apache Avro

Background
Serialization and deserialization are used constantly in everyday data persistence and network transmission, yet the selection of serialization frameworks is dazzling, and it is often unclear which framework suits which scenario. This article compares Google Protobuf and Apache Avro, two frameworks that support cross-language, cross-platform use.

Google Protobuf
Introduction

Google Protocol Buffers (Protobuf for short) is the mixed-language data standard used inside Google. It is a lightweight, efficient format for storing and serializing structured data, well suited to data storage and to use as an RPC data interchange format: a language-neutral, platform-neutral, extensible way to serialize structured data for communication protocols, data storage, and other fields. Officially, APIs are provided in three languages: C++, Java, and Python.

Features

Advantages

Binary messages with good performance (good space and time efficiency)
Object code is generated from the .proto file, making it easy to use
Serialization and deserialization map directly onto the program's data classes, with no extra mapping step after parsing (which XML and JSON both require)
Supports forward compatibility (old code ignores fields added later) and backward compatibility (new code reads old data, with missing fields taking their default values), which simplifies upgrades
Supports multiple languages (a .proto file can be thought of as an IDL file)
Integrates with some frameworks, such as Netty

Shortcomings

Official language bindings only for C++, Java, and Python
The binary format is not human-readable (though a TextFormat facility is provided)
The binary encoding is not self-describing
No dynamic behavior by default (though message types can be created via dynamic definition or dynamic compilation support)
Covers only serialization and deserialization; RPC is out of scope (like an XML or JSON parser, it handles only the data format)

Data types

ProtoBuf has two language versions, proto2 and proto3; proto3 must be declared with syntax = "proto3" on the first line of the *.proto file. There are some syntactic differences between the two. For example, proto3 removes the optional and required field labels, making the syntax more concise. This article focuses on proto3, so proto2 is not covered further. Protobuf is deliberately lightweight, so it supports only a modest set of data types. The scalar types it supports, and their mapping to C++ types, generally meet common needs; for several of them the encoded size is not fixed, but depends on the value or length of the data.
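
To make this concrete, here is a hedged sketch of a proto3 definition; the file name, message, and fields are invented for this article, with the Java mapping of each scalar type noted in comments. The snippet at the end assumes a User class generated from the file by protoc.

```java
// Hypothetical user.proto (proto3), invented for this article:
//
//   syntax = "proto3";
//
//   message User {
//     string name   = 1;  // -> java.lang.String
//     int32  id     = 2;  // -> int (varint: encoded size depends on the value)
//     bool   active = 3;  // -> boolean
//     repeated string email = 4;  // -> java.util.List<String> (length-prefixed)
//   }
//
// Assuming protoc has generated a User class from the file above, every
// proto3 field is optional-with-a-default; unset fields read back as the
// type's default value:
User u = User.newBuilder().setId(42).build();
System.out.println(u.getName().isEmpty()); // true: unset string -> ""
System.out.println(u.getActive());         // false: unset bool -> false
```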

Coding

Protocol buffers come with a code generation tool that produces friendly data-access classes, so they are convenient for developers to program against. For example, to read a user’s name and email in C++ you simply call the corresponding get methods (the get and set methods for every attribute are generated automatically; you only call them). Protobuf also has clearer semantics and needs nothing like an XML parser, because the protoc compiler turns the .proto file into data-access classes that handle serializing and deserializing Protobuf data.
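
As a minimal sketch of what using the generated code looks like in Java (still based on the hypothetical user.proto above; the package and class name are assumptions, since they depend on the .proto options):

```java
import com.example.UserProtos.User; // assumed protoc-generated class

public class ProtobufRoundTrip {
    public static void main(String[] args) throws Exception {
        // Build a message with the generated builder; setters come for free.
        User user = User.newBuilder()
                .setName("alice")
                .setId(42)
                .addEmail("alice@example.com")
                .build();

        // Serialize to the compact binary wire format...
        byte[] bytes = user.toByteArray();

        // ...and parse it back; no hand-written mapping layer is needed.
        User parsed = User.parseFrom(bytes);
        System.out.println(parsed.getName() + " <" + parsed.getEmail(0) + ">");
    }
}
```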

There is no need to learn a complex document object model to use Protobuf: its programming model is friendly and easy to learn, and it has good documentation and examples. For people who like simple things, Protobuf is more attractive than other techniques. A final nice feature is that protocol buffers are backward compatible: data structures can be upgraded without breaking deployed programs that rely on the “old” format, so a change in message structure never forces large-scale refactoring or migration. Adding a field to a new version of a message does not disturb published programs at all, because the storage format is inherently an unordered set of key-value pairs.
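
A hedged sketch of what that compatibility means in practice, still using the invented user.proto (the nickname field and the version numbers are hypothetical):

```java
// Suppose v1 of user.proto defined only name = 1 and id = 2, and a later
// v2 adds `string nickname = 5;`. Because every field travels as a
// tag-value pair keyed by its field number, bytes written by a v1 binary
// still parse cleanly with the v2 class:
byte[] v1Bytes = v1User.toByteArray();  // v1User built by a v1 program
User v2View = User.parseFrom(v1Bytes);  // parsed with the v2-generated class
String nick = v2View.getNickname();     // missing field -> "" (proto3 default)

// The reverse also holds: a v1 binary reading v2 bytes simply skips the
// unknown tag 5 rather than failing.
```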

Apache Avro
Introduction

Avro began as a subproject of Hadoop and is now an independent Apache project. It is high-performance middleware based on binary data transmission, and it is used in other Hadoop projects such as HBase and Hive for data transfer between client and server. Avro is a data serialization system: it converts data structures or objects into a format convenient for storage or transmission. It is designed to support data-intensive applications and suits large-scale data storage and exchange, whether remote or local.

Features

Advantages

Binary messages with good performance/efficiency
Uses JSON to describe schemas
Schema and data are stored together, so messages are self-describing; no stub code needs to be generated (generation from an IDL is supported)
RPC calls exchange schema definitions during the handshake phase
Contains a complete client/server stack for quickly implementing RPC
Supports synchronous and asynchronous communication
Supports dynamic messages (see the sketch after this list)
A schema can define a sort order for the data, which is honored when serializing
Provides server implementations based on both the Jetty kernel and Netty
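
As a minimal sketch of that dynamic behavior (the schema below is invented for illustration): a record can be built and read against a schema parsed at runtime, with no generated classes involved.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class DynamicAvroRecord {
    public static void main(String[] args) {
        // Parse a schema from its JSON representation at runtime.
        String json = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"id\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(json);

        // Build and read a record driven purely by the schema - no stub code.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("id", 42);
        System.out.println(user.get("name") + " / " + user.get("id"));
    }
}
```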

Shortcomings

Supports only Avro’s own serialization format
Language bindings are relatively sparse

Data types

Apache Avro schemas are represented as JSON objects (an IDL can also be used). A schema defines simple data types and complex data types, where the complex types carry various attributes; by combining these types, users can model rich data structures.
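
For illustration, a hedged sketch of a schema that combines several complex types (a record containing an array, an enum, and a union used as a nullable field); the names are invented:

```java
import org.apache.avro.Schema;

public class ComplexSchemaExample {
    public static void main(String[] args) {
        String json =
              "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
            + "  {\"name\":\"name\",\"type\":\"string\"},"
            + "  {\"name\":\"emails\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},"
            + "  {\"name\":\"status\",\"type\":{\"type\":\"enum\",\"name\":\"Status\","
            + "                                 \"symbols\":[\"ACTIVE\",\"LEFT\"]}},"
            + "  {\"name\":\"manager\",\"type\":[\"null\",\"string\"],\"default\":null}"
            + "]}";
        // Parsing validates the definition; complex types nest freely.
        Schema schema = new Schema.Parser().parse(json);
        System.out.println(schema.getField("manager").schema().getType()); // UNION
    }
}
```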

Coding

Avro supports two serialization encodings: binary and JSON. Binary encoding serializes efficiently and produces compact output, while JSON encoding is generally used for debugging or for web-based applications. When serializing or deserializing Avro data, the schema is traversed in depth-first, left-to-right order. Serializing the simple types is straightforward; the complex types each have their own rules. The binary encoding of both simple and complex types is specified in the documentation, with the bytes laid out in the order in which the schema is parsed. Under the JSON encoding, the union type behaves differently from the other complex types.

Avro also defines a container file format to make MapReduce processing easier. Such a file can hold only one schema, and every object stored in it must be written in binary encoding according to that schema. Objects are organized into blocks within the file, and these blocks can be compressed; a synchronization marker sits between blocks so that MapReduce can split the file cheaply.
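
A minimal sketch of writing such a container file (the file name, schema, and record contents are invented for this article):

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ContainerFileExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"LogLine\",\"fields\":["
              + "{\"name\":\"msg\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("msg", "hello avro");

        // The writer stores the schema once in the file header, groups
        // records into (optionally compressed) blocks, and separates the
        // blocks with sync markers so MapReduce can split the file.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.deflateCodec(6)); // block compression
            writer.create(schema, new File("loglines.avro"));
            writer.append(rec);
        }
    }
}
```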

Summary
Protobuf is cross-platform, fast to parse, compact on the wire, highly extensible, and simple to use, but it does not itself provide RPC communication. Avro’s explicit schema design and dynamic schemas (no code generation, yet great performance) make it better suited to building general tools and platforms for data exchange and storage, especially on the backend.

Protobuf suits scenarios where you need to exchange messages with other systems and are sensitive to message size: it is language-independent, and for small payloads its messages save a great deal of space compared to XML and JSON (for big data it is less suitable). It works best when the project language is C++, Java, or Python, because those can use Google’s own libraries and serialize and deserialize very efficiently; for other languages the bindings are third-party or home-grown, and their performance is not guaranteed. Overall, protobuf is still very easy to use: it is adopted by many open-source systems as a data communication tool, and it is a core basic library inside Google.

Avro is best combined with the Hadoop ecosystem. A Hive table definition can be declared directly with an Avro schema, which Hive uses when serializing log files; the advantage is that the Avro schema can replace Hive’s own table-structure definition, making schema evolution much easier to handle. Avro also appears widely in Kafka and Flume: Flume’s main RPC source is the Avro source, and together with the Avro sink, the Flume SDK, and so on, it forms Flume’s internal communication.