Sparse-table data is stored in its own data file, "st_data.1". The file consists of the superblock followed by an arbitrary number of packets. Each packet has a byte address and a size, both of which remain fixed during the lifetime of the packet. In addition, packets have an address field which is used by the packing algorithm. Packets may refer to other packets (their children) and may be referred to by other packets (their parents). These relations among packets may change during a packet's lifetime.
Packets are structured data records consisting of a header and several fields. Fields are arrays of primitive data types supported by MemCom (see Description of MemCom data types), such as NUL-terminated character strings or 8/16/32/64-bit unsigned/signed integer numbers. The header and all fields are stored in little-endian format in the data file. On big-endian computers, they must be byte-swapped when reading from or writing to the data file.
The packet header consists of two unsigned 64-bit integer numbers, stored in little-endian format. The first number contains the packet type and size. The packet type is stored in the most significant 8 bits, while the remaining 56 bits contain the total size of the packet (including the header) in bytes. The second number is normally 0 and is reserved for use by the packing algorithm.
+----------------------+
|type|packet_size      | (8 bytes)
+----------------------+
|reserved              | (8 bytes)
+----------------------+
Following this header comes the packet's internal data whose structure depends on the packet's type.
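The header layout described above can be illustrated with a minimal Python sketch based on the standard struct module. The names pack_header and unpack_header are hypothetical helpers chosen for this illustration; they are not part of MemCom.

```python
import struct

HEADER_SIZE = 16               # two unsigned 64-bit words
SIZE_MASK = (1 << 56) - 1      # the lower 56 bits hold the packet size


def pack_header(packet_type, packet_size, reserved=0):
    """Encode a packet header as 16 little-endian bytes."""
    if not 0 <= packet_type <= 0xFF:
        raise ValueError("packet type must fit in 8 bits")
    if not HEADER_SIZE <= packet_size <= SIZE_MASK:
        raise ValueError("packet size must include the header and fit in 56 bits")
    # Type in the most significant 8 bits, size in the remaining 56 bits.
    word0 = (packet_type << 56) | packet_size
    return struct.pack("<QQ", word0, reserved)


def unpack_header(buf):
    """Decode (packet_type, packet_size, reserved) from the first 16 bytes."""
    word0, reserved = struct.unpack_from("<QQ", buf, 0)
    return word0 >> 56, word0 & SIZE_MASK, reserved
```

Because the "<" prefix forces little-endian byte order regardless of the host, the same code should behave identically on big-endian machines.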
The first data packet in the data file does not begin at byte address 0, but after the superblock. This ensures that the byte address of any valid packet other than the superblock is positive.
The sparse-table data file is thus structured as follows:
+----------------------+
|superblock            | (64 bytes)
+----------------------+
|packet                | (16 + x bytes)
|                      |
+----------------------+
|packet                | (16 + x bytes)
+----------------------+
|...                   |
|                      |
+----------------------+
The superblock is 64 bytes long (of which the first 16 bytes constitute the packet header) and consists of the following fields:
+----------------------+
|header                | (16 bytes)
+----------------------+
|format_header         | (8 bytes)
+----------------------+
|format_version        | (8 bytes)
+----------------------+
|total_size            | (8 bytes)
+----------------------+
|empty                 | (24 bytes)
+----------------------+
The fields of the superblock are described below:
header | The packet header. The packet type has a value of 1. |
superblock_size | Size in bytes of the superblock (7 bytes). This is an unsigned 64-bit integer number having a value of at least 32. This can be set to a larger value to accommodate additional fields in the superblock. |
reserved | For use by the packing algorithm. |
format_header | This is a NUL-terminated and NUL-padded character string of 8 bytes length, used to identify the file format. Its contents are "ST_DATA". |
format_version | Version number of the format. This is an 8-byte NUL-terminated and NUL-padded character field. The current version is identified with "2". |
total_size | Total size in bytes of all packets. This is an unsigned 64-bit integer number. New packets can be allocated at address total_size. Whenever the total size changes, the superblock needs to be re-written. |
empty | Not used. |
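As an illustration of the superblock layout, the following sketch builds and validates the 64-byte superblock with Python's struct module. The function names are hypothetical, and the sketch assumes the fixed 64-byte layout shown in the diagram above.

```python
import struct

SUPERBLOCK_TYPE = 1
SUPERBLOCK_SIZE = 64
# Header word 0, header word 1 (reserved), format_header, format_version,
# total_size, then 24 unused bytes.
SUPERBLOCK_LAYOUT = struct.Struct("<QQ8s8sQ24x")


def build_superblock(total_size):
    """Return the 64 bytes of a superblock recording `total_size`."""
    word0 = (SUPERBLOCK_TYPE << 56) | SUPERBLOCK_SIZE
    # The "8s" fields are NUL-padded automatically by struct.
    return SUPERBLOCK_LAYOUT.pack(word0, 0, b"ST_DATA", b"2", total_size)


def parse_superblock(buf):
    """Validate a superblock and return (format_version, total_size)."""
    word0, _reserved, fmt, version, total_size = SUPERBLOCK_LAYOUT.unpack_from(buf, 0)
    if word0 >> 56 != SUPERBLOCK_TYPE:
        raise ValueError("first packet is not a superblock")
    if fmt.rstrip(b"\0") != b"ST_DATA":
        raise ValueError("unexpected format identifier")
    return version.rstrip(b"\0").decode("ascii"), total_size
```

A writer following this sketch would call build_superblock again and rewrite the first 64 bytes of the file each time total_size changes.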
A sparse-table dataset having nrow rows and ncol columns consists of the root packet, the colindex packet, the rowindex packet, and up to nrow packets containing row data, the so-called row packets. The root packet is referred to by the spos entry of the dbsetsdb structure in the index file. The colindex and rowindex packets are referred to by the root packet. The row packets are referred to by the rowindex packet.
The root packet is structured as follows:
+----------------------+
|header                | (16 bytes)
+----------------------+
|colindex_address      | (8 bytes)
+----------------------+
|rowindex_address      | (8 bytes)
+----------------------+
|empty                 | (32 bytes)
+----------------------+
The fields of the dataset root packet are described below:
header | The packet header. The packet type has a value of 2. |
colindex_address | An unsigned 64-bit integer number containing the byte address of the colindex packet. |
rowindex_address | An unsigned 64-bit integer number containing the byte address of the rowindex packet. |
empty | Not used. |
The root packet's size does not change during the lifetime of the sparse-table dataset.
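The following sketch shows how a reader might locate the colindex and rowindex packets from a root packet address. The function read_root is a hypothetical helper; the address passed to it would come from the spos entry in the index file, which is not modelled here.

```python
import io
import struct

ROOT_TYPE = 2
ROOT_SIZE = 64
# Header (two words), colindex_address, rowindex_address, 32 unused bytes.
ROOT_LAYOUT = struct.Struct("<QQQQ32x")


def read_root(f, address):
    """Read the root packet at `address`; return (colindex_address, rowindex_address)."""
    f.seek(address)
    buf = f.read(ROOT_SIZE)
    if len(buf) != ROOT_SIZE:
        raise ValueError("truncated root packet")
    word0, _reserved, colindex_address, rowindex_address = ROOT_LAYOUT.unpack(buf)
    if word0 >> 56 != ROOT_TYPE:
        raise ValueError("packet at this address is not a root packet")
    return colindex_address, rowindex_address


# Usage with an in-memory file: 64 placeholder bytes standing in for the
# superblock, followed by a root packet with made-up child addresses.
root = ROOT_LAYOUT.pack((ROOT_TYPE << 56) | ROOT_SIZE, 0, 128, 512)
data = io.BytesIO(bytes(64) + root)
print(read_root(data, 64))  # prints (128, 512)
```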
The colindex packet is structured as follows, shown here for a very small sparse-table dataset having 3 columns:
+----------------------+
|header                | (16 bytes)
+----------------------+
|empty                 | (8 bytes)
+----------------------+
|num                   | (8 bytes)
+----------------------+
|column_name           | (64 bytes)+
+----------------------+           |
|column_type           | (4 bytes) +-- 76 bytes
+----------------------+           |
|column_num_elements   | (8 bytes) +
+----------------------+
|column_name           | (64 bytes)+
+----------------------+           |
|column_type           | (4 bytes) +-- 76 bytes
+----------------------+           |
|column_num_elements   | (8 bytes) +
+----------------------+
|column_name           | (64 bytes)+
+----------------------+           |
|column_type           | (4 bytes) +-- 76 bytes
+----------------------+           |
|column_num_elements   | (8 bytes) +
+----------------------+
|empty                 |
|                      |
|                      |
+----------------------+
The fields of the colindex packet are described below:
header | The packet header. The packet type has a value of 3. |
empty | Not used. |
num | An unsigned 64-bit integer number containing the number of columns. |
column_name | Name of the column. NUL-terminated string of 64 bytes. The remaining bytes are padded with NUL characters. Empty columns have this field completely padded with NUL characters. |
column_type | The element data type of the column. NUL-terminated string of 4 bytes. The remaining bytes are padded with NUL characters. Empty columns have this field completely padded with NUL characters. Valid choices for non-empty columns are "I", "J", "E", "F", "C", "Z", and "K". |
column_num_elements | The number of elements for the column (8 bytes). This is an unsigned 64-bit integer number. A positive value indicates a fixed-sized column. A value of 0 indicates an empty column. A value of -1, that is, all 64 bits set, indicates a variable-sized column. |
empty | An arbitrary non-negative number of bytes. The colindex packet may be larger than necessary. |
Thus the minimum packet size for the colindex packet is
32 + 76 * ncol
The colindex packet's size may change during the lifetime of the sparse-table dataset. When new columns are appended, the packet's size may be too small and it may be necessary to reallocate the packet.
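The colindex layout and its minimum size can be sketched as follows. The helper names are hypothetical. As an assumption of this sketch, column_num_elements is read and written as a signed 64-bit value, so that the variable-size marker appears as -1 in Python.

```python
import struct

COLINDEX_TYPE = 3
# column_name (64 bytes), column_type (4 bytes), column_num_elements (8 bytes).
# Assumption: the last field is handled as signed so that -1 round-trips.
ENTRY = struct.Struct("<64s4sq")


def min_colindex_size(ncol):
    """Header (16) + empty (8) + num (8) + 76 bytes per column."""
    return 32 + ENTRY.size * ncol


def build_colindex(columns, packet_size=None):
    """columns: list of (name, type, num_elements); -1 marks a variable-sized column."""
    needed = min_colindex_size(len(columns))
    packet_size = needed if packet_size is None else packet_size
    if packet_size < needed:
        raise ValueError("packet too small for the given columns")
    out = bytearray(packet_size)  # unused trailing bytes stay zero
    struct.pack_into("<QQQQ", out, 0,
                     (COLINDEX_TYPE << 56) | packet_size, 0, 0, len(columns))
    for i, (name, ctype, n) in enumerate(columns):
        if len(name) > 63 or len(ctype) > 3:
            raise ValueError("string too long to stay NUL-terminated")
        ENTRY.pack_into(out, 32 + ENTRY.size * i,
                        name.encode("ascii"), ctype.encode("ascii"), n)
    return bytes(out)


def parse_colindex(buf):
    """Return the list of (name, type, num_elements) stored in a colindex packet."""
    word0, _reserved, _empty, num = struct.unpack_from("<QQQQ", buf, 0)
    if word0 >> 56 != COLINDEX_TYPE:
        raise ValueError("not a colindex packet")
    columns = []
    for i in range(num):
        name, ctype, n = ENTRY.unpack_from(buf, 32 + ENTRY.size * i)
        columns.append((name.rstrip(b"\0").decode("ascii"),
                        ctype.rstrip(b"\0").decode("ascii"), n))
    return columns
```

Passing a packet_size larger than the minimum leaves room for columns to be appended later without reallocating the packet.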
The rowindex packet is structured as follows, shown here for a very small sparse-table dataset having 2 rows:
+----------------------+
|header                | (16 bytes)
+----------------------+
|empty                 | (8 bytes)
+----------------------+
|num                   | (8 bytes)
+----------------------+
|row_address           | (8 bytes) +
+----------------------+           +-- 8 bytes * nrow
|row_address           | (8 bytes) +
+----------------------+
|empty                 |
|                      |
|                      |
+----------------------+
The fields of the rowindex packet are described below:
header | The packet header. The packet type has a value of 4. |
empty | Not used. |
num | An unsigned 64-bit integer number containing the number of rows. |
row_address | The byte start address of the row packet. This is an unsigned 64-bit integer number. A value of 0 means there is no row packet (default). A positive value indicates a valid packet address. |
empty | An arbitrary non-negative number of bytes. The rowindex packet may be larger than necessary. |
Thus the minimum packet size for the rowindex packet is
32 + 8 * nrow
The rowindex packet's size may change during the lifetime of the sparse-table dataset. When new rows are appended, the packet's size may be too small and it may be necessary to reallocate the packet.
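A rowindex packet can be built and read in the same way. The sketch below uses hypothetical helper names and allows the packet to be allocated larger than the minimum, which is one way to postpone reallocation when rows are appended.

```python
import struct

ROWINDEX_TYPE = 4


def min_rowindex_size(nrow):
    """Header (16) + empty (8) + num (8) + one 8-byte address per row."""
    return 32 + 8 * nrow


def build_rowindex(row_addresses, packet_size=None):
    """row_addresses: one byte address per row; 0 means the row has no packet."""
    nrow = len(row_addresses)
    needed = min_rowindex_size(nrow)
    packet_size = needed if packet_size is None else packet_size
    if packet_size < needed:
        raise ValueError("packet too small for the given number of rows")
    out = bytearray(packet_size)  # unused trailing bytes stay zero
    struct.pack_into("<QQQQ", out, 0,
                     (ROWINDEX_TYPE << 56) | packet_size, 0, 0, nrow)
    struct.pack_into("<%dQ" % nrow, out, 32, *row_addresses)
    return bytes(out)


def parse_rowindex(buf):
    """Return the list of row packet addresses stored in a rowindex packet."""
    word0, _reserved, _empty, num = struct.unpack_from("<QQQQ", buf, 0)
    if word0 >> 56 != ROWINDEX_TYPE:
        raise ValueError("not a rowindex packet")
    return list(struct.unpack_from("<%dQ" % num, buf, 32))
```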
To read or write row packets, it is necessary to have access to the contents of the colindex and the rowindex packets. A row packet is structured as follows, shown here for a row packet with 3 columns (num = 3), of which two columns are of variable size:
+----------------------+
|header                | (16 bytes)
+----------------------+
|num                   | (2 bytes)
+----------------------+
|column_index          | (2 bytes) +
+----------------------+           |
|column_index          | (2 bytes) +-- 2 bytes * num
+----------------------+           |
|column_index          | (2 bytes) +
+----------------------+
|cell_num_elements     | (4 bytes) +
+----------------------+           +-- Only for variable-sized cells
|cell_num_elements     | (4 bytes) +
+----------------------+
|cell_data             | (mcSizeof(column_type) * cell_num_elements bytes
|                      |
|                      |  or
|                      |
|                      |  mcSizeof(column_type) * column_num_elements bytes)
+----------------------+
|cell_data             |
|                      |
+----------------------+
|cell_data             |
|                      |
|                      |
+----------------------+
|empty                 |
|                      |
|                      |
+----------------------+
The fields of the row packet are described below:
header | The packet header. The packet type has a value of 5. |
num | An unsigned 16-bit integer number containing the number of non-empty cells for this row. Valid values for this field range from 0 up to and including the current number of columns of the dataset. |
column_index | An unsigned 16-bit integer number containing the column index (starting from 0) for each cell. |
cell_num_elements | For variable-sized cells, the number of data elements (4 bytes). This is an unsigned 32-bit integer number. For fixed-sized cells, this field does not occur. |
cell_data | Region containing the cell data. The size of the cell data in bytes is mcSizeof(column_type) * cell_num_elements for variable-sized columns and mcSizeof(column_type) * column_num_elements for fixed-sized columns. |
empty | An arbitrary non-negative number of bytes. The row packet may be larger than necessary (for instance, if cells were cleared but the packet address of the row packet was kept). |
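To make the row layout concrete, the sketch below splits a row packet into raw cell byte strings. It rests on several assumptions that the text does not confirm: the cell_num_elements entries are taken to follow the order of the variable-sized cells in column_index, the element sizes are supplied by the caller in a dictionary standing in for mcSizeof, and the sizes used in the usage example are purely illustrative. The name parse_row is hypothetical.

```python
import struct

ROW_TYPE = 5


def parse_row(buf, columns, elem_size):
    """Split a row packet into {column_index: raw cell bytes}.

    columns   -- list of (column_type, column_num_elements) from the colindex
                 packet, with -1 marking a variable-sized column
    elem_size -- dict mapping column_type to bytes per element
                 (caller-supplied stand-in for mcSizeof)
    """
    word0, _reserved = struct.unpack_from("<QQ", buf, 0)
    if word0 >> 56 != ROW_TYPE:
        raise ValueError("not a row packet")
    (num,) = struct.unpack_from("<H", buf, 16)
    pos = 18
    indices = struct.unpack_from("<%dH" % num, buf, pos)
    pos += 2 * num
    # One 32-bit element count per variable-sized cell, in cell order (assumed).
    nvar = sum(1 for i in indices if columns[i][1] == -1)
    counts = iter(struct.unpack_from("<%dI" % nvar, buf, pos))
    pos += 4 * nvar
    cells = {}
    for i in indices:
        ctype, n = columns[i]
        if n == -1:
            n = next(counts)
        nbytes = elem_size[ctype] * n
        cells[i] = bytes(buf[pos:pos + nbytes])
        pos += nbytes
    return cells


# Usage: a row with a fixed-sized cell in column 0 and a 5-element
# variable-sized cell in column 2, padded to a packet size of 40 bytes.
columns = [("I", 1), ("F", 3), ("K", -1)]
elem_size = {"I": 4, "F": 8, "K": 1}   # illustrative sizes, not taken from MemCom
packet = (struct.pack("<QQ", (ROW_TYPE << 56) | 40, 0)
          + struct.pack("<H2HI", 2, 0, 2, 5)
          + struct.pack("<i", 7) + b"hello" + bytes(5))
cells = parse_row(packet, columns, elem_size)
```

The decoded cells are returned as raw bytes; interpreting them according to the column type is left to the caller.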
The required size in bytes of a row packet cannot be computed by a direct formula. It is computed as follows (Python pseudo-code):
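    size = 16 + 2                     # header plus the num field
    j = 0                             # index into cell_num_elements
    for i in range(num):
        size += 2                     # column_index entry
        t = column_type[column_index[i]]
        s = mcSizeof(t)
        c = column_num_elements[column_index[i]]
        if c > 0:                     # fixed-sized cell
            size += s * c
        elif c == -1:                 # variable-sized cell
            size += 4 + s * cell_num_elements[j]
            j += 1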
The structure of the row packet limits the number of columns to 2^16, and, for variable-sized columns, the number of elements per cell to 2^32 - 1 (the largest value of an unsigned 32-bit integer).
During the lifetime of the sparse-table dataset, the location (byte address) of a row packet may change. This happens when the size of the row packet is no longer sufficient to hold all data of the row (because new cells have been added or variable-sized cells have been assigned a larger size). A new row packet with sufficient size must then be allocated and referred to by the row_address field in the rowindex packet.
The total size of a row packet may be greater than the size actually used by the packet fields. This may be the case, for instance, when previously non-empty cells are cleared, or when variable-sized cells are assigned a smaller size. In these situations it might be more efficient to keep the same packet rather than allocating a new, smaller packet. The sparse-table data format does not enforce any such decision, however; this is left to the implementation.
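One possible placement policy is sketched below: keep the existing row packet whenever it is large enough, and otherwise append a new packet at address total_size. This is only an illustration of the trade-off described above, not a policy prescribed by the format, and place_row is a hypothetical name.

```python
def place_row(current_address, current_size, required_size, total_size):
    """Decide where a row needing `required_size` bytes should be stored.

    Returns (address, packet_size, new_total_size).
    """
    if current_address != 0 and required_size <= current_size:
        # The existing packet is large enough: reuse it and leave the
        # surplus as unused bytes at the end of the packet.
        return current_address, current_size, total_size
    # No packet yet, or the packet is too small: append a new one.
    return total_size, required_size, total_size + required_size
```

When the returned address differs from current_address, the caller would still need to update the row_address entry in the rowindex packet and rewrite the superblock with the new total size.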