Struct bytekey::Encoder
[−]
[src]
pub struct Encoder<W> where W: Write {
// some fields omitted
}
An encoder for serializing data to a byte format that preserves lexicographic sort order.
The byte format is designed with a few goals:
- Order must be preserved
- Serialized representations should be as compact as possible
- Type information is not serialized with values
Supported Data Types
Unsigned Integers
u8
, u16
, u32
, and u64
are encoded into 1, 2, 4, and 8 bytes of output, respectively.
Order is preserved by encoding the bytes in big-endian (most-significant bytes first) format.
usize
is variable-length encoded into between 1 and 9 bytes. Smaller magnitude values (closer
to 0) will encode into fewer bytes. See emit_var_u64
for details on serialization
size and format.
Signed Integers
i8
, i16
, i32
, and i64
are encoded into 1, 2, 4, and 8 bytes of output, respectively.
Order is preserved by taking the bitwise complement of the value, and encoding the resulting
bytes in big-endian format.
isize
is variable-length encoded into between 1 and 9 bytes. Smaller magnitude values (closer
to 0) will encode into fewer bytes. See emit_var_i64
for details on serialization
size and format.
Floating Point Numbers
f32
and f64
are encoded into 4 and 8 bytes of output, respectively. Order is preserved
by encoding the value, or the bitwise complement of the value if negative, into bytes in
big-endian format. NAN
values will sort after all other values. In general, it is
unwise to use IEEE 754 floating point values in keys, because rounding errors are pervasive.
It is typically hard or impossible to use an approximate 'epsilon' approach when using keys for
lookup.
Characters
Characters are serialized into between 1 and 4 bytes of output.
Booleans
Booleans are serialized into a single byte of output. false
values will sort before true
values.
Strings
Strings are encoded into their natural UTF8 representation plus a single null byte suffix. In general, strings should not contain null bytes. The encoder will not check for null bytes, however their presence will break lexicographic sorting. The only exception to this rule is the case where the string is the final (or only) component of the key. If the string field is the final component of a tuple, enum-struct, or struct, then it may contain null bytes without breaking sort order.
Options
An optional wrapper type adds a 1 byte overhead to the wrapped data type. None
values will
sort before Some
values.
Structs & Tuples
Structs and tuples are encoded by serializing their consituent fields in order with no prefix, suffix, or padding bytes.
Enums
Enums are encoded with a variable-length unsigned-integer variant tag, plus the consituent fields in the case of an enum-struct. The tag adds an overhead of between 1 and 9 bytes (it will be a single byte for up to 16 variants). This encoding allows more enum variants to be added in a backwards-compatible manner, as long as variants are not removed and the variant order does not change.
Unsupported Data Types
Sequences and maps are unsupported at this time. Sequences and maps could probably be implemented with a single byte overhead per item, key, and value, but these types are not typically useful in keys.
Raw byte arrays are unsupported. The Rust Encoder
/Decoder
mechanism makes no distinction
between byte arrays and sequences, and thus the overhead for encoding a raw byte array would be
1 byte per input byte. The theoretical best-case overhead for serializing a raw (null
containing) byte array in order-preserving format is 1 bit per byte, or 9 bytes of output for
every 8 bytes of input.
Methods
impl<W> Encoder<W> where W: Write
fn new(writer: W) -> Encoder<W>
Creates a new ordered bytes encoder whose output will be written to the provided writer.
fn emit_var_u64(&mut self, val: u64) -> Result<()>
Encode a u64
into a variable number of bytes.
The variable-length encoding scheme uses between 1 and 9 bytes depending on the value.
Smaller magnitude (closer to 0) u64
s will encode to fewer bytes.
Encoding
The encoding uses the first 4 bits to store the number of trailing bytes, between 0 and 8. Subsequent bits are the input value in big-endian format with leading 0 bytes removed.
Encoded Size
range | size (bytes) |
---|---|
[0, 24) | 1 |
[24, 212) | 2 |
[212, 220) | 3 |
[220, 228) | 4 |
[228, 236) | 5 |
[236, 244) | 6 |
[244, 252) | 7 |
[252, 260) | 8 |
[260, 264) | 9 |
fn emit_var_i64(&mut self, v: i64) -> Result<()>
Encode an i64
into a variable number of bytes.
The variable-length encoding scheme uses between 1 and 9 bytes depending on the value.
Smaller magnitude (closer to 0) i64
s will encode to fewer bytes.
Encoding
The encoding uses the first bit to encode the sign: 0
for negative values and 1
for
positive values. The following 4 bits store the number of trailing bytes, between 0 and 8.
Subsequent bits are the absolute value of the input value in big-endian format with leading
0 bytes removed. If the original value was negative, than 1 is subtracted from the absolute
value before encoding. Finally, if the value is negative, all bits except the sign bit are
flipped (1s become 0s and 0s become 1s).
Encoded Size
negative range | positive range | size (bytes) |
---|---|---|
[-23, 0) | [0, 23) | 1 |
[-211, -23) | [23, 211) | 2 |
[-219, -211) | [211, 219) | 3 |
[-227, -219) | [219, 227) | 4 |
[-235, -227) | [227, 235) | 5 |
[-243, -235) | [235, 243) | 6 |
[-251, -243) | [243, 251) | 7 |
[-259, -251) | [251, 259) | 8 |
[-263, -259) | [259, 263) | 9 |