Serialization of data is usually needed for storage and for communication (data transport) purposes. Here, the algorithmic overhead of performing such serialization is mostly a welcome trade-off, as long as it helps to reasonably compress the data compared with its usual representation in random access memory.
With the types found in this module, serialization is performed on bit-level instead of with fixed-size integral types. While this implies a tremendous amount of bit-shifting operations, the two immediate advantages are:

1. the encoded data is as small as possible, because each value occupies only as many bits as needed, and
2. the encoding is independent of platform-specific type sizes (for example, the C++ type int is of undefined length).

Encoding data in bit buffers usually omits any redundancy caused by attribute names, at least when the classes provided here are used in a straightforward fashion. Instead of attribute naming (as with JSON or XML encoding) or attribute indexing (as done with the Google Protobuf family of APIs and tools), attribute information is purely defined by the position of the data in the stream.
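To illustrate the principle of such positional bit-level encoding, the following minimal sketch in plain C++ may help. Note that this is not ALib code; the type TinyBitWriter is invented purely for this illustration. Values are appended back-to-back with agreed-upon bit widths, so that a decoder recovers each attribute solely from its position:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Illustration only (not ALib code): values are packed back-to-back on
// bit-level; a decoder must know the width and the order of the fields.
struct TinyBitWriter
{
    std::vector<uint8_t> bytes;
    int                  bitPos = 0;            // next free bit in the last byte

    void write( uint32_t value, int width )     // append the 'width' low bits of 'value'
    {
        for( int i = 0; i < width; ++i )
        {
            if( bitPos == 0 )      bytes.push_back( 0 );
            if( value >> i & 1u )  bytes.back() |= uint8_t( 1u << bitPos );
            bitPos= (bitPos + 1) % 8;
        }
    }
};

int main()
{
    TinyBitWriter bw;
    bw.write(   5, 3 );     // first  attribute: 3 bits
    bw.write( 200, 8 );     // second attribute: 8 bits
    bw.write(   1, 1 );     // third  attribute: 1 bit   -> 12 bits in total
    std::cout << bw.bytes.size() << " bytes used\n";    // prints "2 bytes used"
}
```

Stored as three int values, the same data would occupy twelve or more bytes; here it fits into two, at the price of the per-bit shifting mentioned above.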
To encode and decode data into a stream of bits, three classes are provided. Class BitBuffer provides the memory allocation for the bit data, as well as mechanics for writing the encoded byte streams to std::ostream and, likewise, for retrieving them from std::istream.
Variants of BitBuffer are classes BitBufferMA and BitBufferLocal, which provide alternative memory management.
Class BitWriter is used to fill a BitBuffer, and class BitReader reads the data back (either data just written by a BitWriter or data loaded from std::istream).
The interface of these classes is rather simple and straightforward; for that reason, this manual refers the reader to the reference documentation for further information.
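Nevertheless, a small round-trip sketch may serve as orientation. This sketch is assumption-laden: the classes are those introduced above, but the header path, the capacity parameter of BitBufferLocal, and the method signatures (Write<N>, Flush, Read<N>) are assumptions that need to be verified against the reference documentation:

```cpp
#include "alib/bitbuffer/bitbuffer.hpp"     // header path assumed

// Hedged sketch of a write/read round trip. Method names Write, Flush, and
// Read as well as their template parameters are assumptions; consult the
// reference documentation for the exact interface.
void roundTrip()
{
    alib::BitBufferLocal<1024>  bb;         // buffer with local storage (capacity parameter assumed)
    alib::BitWriter             bw( bb );

    bw.Write<3>( 5 );                       // attribute 1: 3 bits
    bw.Write<8>( 200 );                     // attribute 2: 8 bits
    bw.Flush();                             // complete the last storage word

    alib::BitReader             br( bb );
    auto v1= br.Read<3>();                  // read back in the same order:
    auto v2= br.Read<8>();                  // the position defines the attribute
    (void) v1; (void) v2;
}
```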
When writing bigger amounts of integral data, it might be reasonable to compress such data. Several different compression algorithms are available. While the most prominent among them is Huffman compression, real-life data is often better compressible with different approaches. The term "real-life data" here means arrays of values which are not randomly generated but originate from an application and have special characteristics. For example, if a machine provides a certain "signal" which is logged regularly, the value of such a signal often remains constant for longer periods or changes only in small increments. The algorithms incorporated with ALib address such specialties.
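The following plain C++ sketch, independent of ALib, illustrates why such signals compress well: for a value that hovers within a narrow band, the deltas between neighboring samples need only a few bits, while the absolute values would occupy the full word width:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>

// Illustration only (not ALib code): a slowly changing "signal" needs far
// fewer bits per value once neighbor-to-neighbor deltas are stored.
int main()
{
    std::vector<int32_t> signal;                    // a signal logged once per second
    for( int i = 0; i < 3600; ++i )
        signal.push_back( 90000 + (i % 7) - 3 );    // hovers around 90000

    int maxDelta = 0;
    for( size_t i = 1; i < signal.size(); ++i )
        maxDelta = std::max( maxDelta, std::abs( int(signal[i] - signal[i-1]) ) );

    int bits = 1;                                   // bits needed for the biggest delta
    while( (1 << bits) <= maxDelta )
        ++bits;
    std::cout << "bits per delta: " << bits + 1     // +1 for a sign bit
              << " instead of 32\n";                // prints "bits per delta: 4 instead of 32"
}
```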
All mechanics in respect to array compression are interfaced by class alib::bitbuffer::ac_v1::ArrayCompressor, which, as usual, is aliased to alib::ArrayCompressor. The rationale for introducing the inner namespace ac_v1 is that future versions of ALib may incorporate changes in the binary format of one or more of the algorithms. In this case, a new namespace ac_v2 will be introduced, while the old version remains available to support the deserialization of existing datasets. However, it is up to the programmer of the software to detect the version of compression used with data stored in files and to add mechanics for using the right version of the class for decompression.
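One conceivable approach, given here as a purely hypothetical sketch (none of the functions below are part of ALib), is to prepend an application-defined version tag to the stored data and dispatch on it when reading:

```cpp
#include <cstdint>
#include <istream>
#include <stdexcept>

// Hypothetical sketch: decompressV1/decompressV2 are placeholders for code
// that uses the types of namespace ac_v1 and, once existing, ac_v2.
void decompressV1( std::istream& );
void decompressV2( std::istream& );

void load( std::istream& is )
{
    uint8_t version;
    is.read( reinterpret_cast<char*>( &version ), 1 );  // tag written by the application on saving
    switch( version )
    {
        case 1:  decompressV1( is ); break;
        case 2:  decompressV2( is ); break;
        default: throw std::runtime_error( "unknown compression format version" );
    }
}
```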
The list of algorithms is given with the bitwise enumeration ArrayCompressor::Algorithm. For details about them, please consult the reference documentation of this type.
If parameter algorithmsToTry is set to Algorithm::ALL when invoking method ArrayCompressor::Compress, each built-in algorithm is executed and finally the one with the best compression ratio is chosen. With that, an algorithm identifier (currently 3 bits) is written upfront, which enables method ArrayCompressor::Decompress to choose the right algorithm for decoding the data.
While this incorporates a certain overhead in respect to execution performance, this overhead applies exclusively to encoding the data. Given that data is usually encoded only once but decoded (very) often, such overhead should not weigh in too much. If it does, the set of algorithms may be narrowed by passing a different set with parameter algorithmsToTry.
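As a hedged sketch of how such narrowing might look: the enumeration Algorithm, its element Huffman, and parameter algorithmsToTry are named in this manual, but the header path and the exact signature of Compress are assumptions to be verified against the reference documentation:

```cpp
#include "alib/bitbuffer/arraycompressor.hpp"   // header path assumed

#include <cstddef>
#include <cstdint>

// Hedged sketch: restrict compression to a single algorithm. The exact
// signature of ArrayCompressor::Compress is an assumption.
void compressHuffmanOnly( alib::BitWriter& bw, uint32_t* data, std::size_t length )
{
    alib::ArrayCompressor ac;
    ac.Compress( bw, data, length,
                 alib::ArrayCompressor::Algorithm::Huffman );   // do not try the other algorithms
}
```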
Most of the algorithms perform best if no unusual, disruptive change in the data occurs. However, in real-life applications, disruptive values may appear every once in a while. Therefore, it might be efficient to constrain the arrays to a certain length and store the data in smaller packages, rather than compressing all available data at once. To explain what is meant here, consider the following sample:
Some software captures three different signal values from a vehicle, each taken once per second.
Now, if this vehicle takes a five-hour tour during which the three values are captured every second, it might be more efficient to split the data into hourly packages of 3600 values for each of the three signals. After giving this some thought, the rationale should be quite obvious: a disruptive value then affects the compression of only one package instead of the whole dataset.
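In code, such packaging could look like the following ALib-independent sketch, where the hypothetical function compressPackage stands for whatever compression is applied per package. With five hours of data, each of the three signals yields five packages of 3600 values:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch (plain C++): compress a long capture in hourly packages instead of
// one single block. compressPackage() is a hypothetical placeholder, e.g.
// for code invoking ArrayCompressor on a BitWriter.
void compressPackage( const int* values, std::size_t count );

void compressTour( const std::vector<int>& signal )     // 5h * 3600 = 18000 values
{
    constexpr std::size_t packageSize = 3600;           // one package per hour
    for( std::size_t i = 0; i < signal.size(); i += packageSize )
        compressPackage( signal.data() + i,
                         std::min( packageSize, signal.size() - i ) );
}
```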