data encoding

At its most basic, encoding is the act of translating in-memory data into a sequence of bytes that can be written to disk or sent over the network. Conversely, decoding is translating bytes read from disk or received from the network back into in-memory data.
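A minimal sketch of that round trip, assuming JSON as the format and a made-up `record` as the in-memory data:

```python
import json

# In-memory representation: a Python dict holding native types.
record = {"user_id": 42, "name": "Ada", "interests": ["chess", "math"]}

# Encoding: turn the in-memory object into a self-contained byte sequence
# that could be written to a file or sent over a socket.
encoded = json.dumps(record).encode("utf-8")

# Decoding: turn the bytes back into an in-memory object.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__, decoded == record)
```

The bytes in `encoded` carry no pointers or language-specific structures, which is what makes them safe to store or transmit.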

Encoding formats can be split into two broad categories.

human-readable formats

Human-readable encoding formats tend to be popular because they can be understood without a decoding schema. The main formats in this category are:

JSON
- No strict schema, so it’s flexible. Any structure the data must follow has to be agreed upon and validated outside the format itself.
- Widely supported in the web browser space.
XML
- Less popular in newer technologies.
- Text-based rather than binary, but still hard to read because of its verbosity.
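Because JSON itself enforces no schema, producer and consumer have to check structure themselves. A rough sketch of what that can look like; `EXPECTED_FIELDS` and `validate` are hypothetical names invented for this example, not part of any library:

```python
import json

# Hypothetical contract agreed on outside the format: field name -> expected type.
EXPECTED_FIELDS = {"user_id": int, "name": str}

def validate(record):
    """Return a list of problems; an empty list means the record matches the contract."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems

# Both strings are valid JSON; only the out-of-band check tells them apart.
good = json.loads('{"user_id": 42, "name": "Ada"}')
bad = json.loads('{"user_id": "42"}')

print(validate(good))
print(validate(bad))
```

The JSON parser accepts both inputs without complaint, which is the flexibility and the risk in one place.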

binary formats

Binary formats are not readable without a schema, but they encode data more compactly. Because both sides must share that schema, they are most useful for company-internal use cases.
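To illustrate the size difference, here is a sketch that uses Python's `struct` module as a stand-in for a schema-driven binary format. This is not how Thrift, Protocol Buffers, or Avro lay out bytes; the `SCHEMA` format string and the sample record are assumptions for the example.

```python
import json
import struct

record = {"user_id": 42, "score": 3.5, "active": True}

# Text encoding: field names travel with every record.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: the "schema" is a format string that both sides share out of band.
# "<" = little-endian with no padding, I = uint32, d = float64, ? = bool.
SCHEMA = "<Id?"
as_binary = struct.pack(SCHEMA, record["user_id"], record["score"], record["active"])

# Decoding only works if the reader knows the same schema.
user_id, score, active = struct.unpack(SCHEMA, as_binary)

print(len(as_json), len(as_binary))
```

The binary form is 13 bytes (4 + 8 + 1) because it omits field names entirely; without `SCHEMA`, those bytes are meaningless to a reader.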

Major technologies in this space include Thrift, Protocol Buffers, and Avro.

protocol buffers (protobufs)

developed by Google
TODO: ddia pdf page 142

thrift

developed by Facebook
binary encoding that requires an explicit schema to encode/decode
TODO: ddia pdf page 139

avro

encodes with the writer's schema and decodes with the reader's schema; the two only need to be compatible, not identical
TODO: ddia pdf page 144

importance of evolvability

Business requirements and logic are always changing, and a change to business requirements typically requires schema changes.

Rolling out schema changes across a large distributed system without downtime requires rolling upgrades, meaning that machines may be running different versions of the schema and code at the same time.

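Because of this, when considering encoding formats, we need to ensure that our encoding/decoding is evolvable, meaning that it’s both backward compatible (newer code can read data written by older code) and forward compatible (older code can read data written by newer code) in the face of schema changes.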
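A small sketch of both directions of compatibility, assuming a JSON record that gains an optional `email` field between versions. The `encode_v1`/`decode_v2` style names are hypothetical and stand for old and new code running side by side during a rolling upgrade:

```python
import json

def encode_v1(user):
    # Old writer: knows only "name".
    return json.dumps({"name": user["name"]}).encode("utf-8")

def encode_v2(user):
    # New writer: adds an optional "email" field.
    return json.dumps({"name": user["name"], "email": user.get("email")}).encode("utf-8")

def decode_v1(data):
    # Old reader: forward compatible because it ignores fields it does not know.
    raw = json.loads(data)
    return {"name": raw["name"]}

def decode_v2(data):
    # New reader: backward compatible because the new field falls back to a default.
    raw = json.loads(data)
    return {"name": raw["name"], "email": raw.get("email")}

user = {"name": "Ada", "email": "ada@example.com"}

# Backward compatibility: new code reads data written by old code.
print(decode_v2(encode_v1(user)))

# Forward compatibility: old code reads data written by new code.
print(decode_v1(encode_v2(user)))
```

The pattern that makes this work is that every added field is optional with a default, and readers skip anything they do not recognize.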
