Commit f3dbaed7 authored by Morten Ofstad's avatar Morten Ofstad

Merge branch feature/johan.seland/VDSSpecification with refs/heads/master into refs/merge-requests/536/train
parents 4f0298f1 2add46af
Pipeline #88347 passed with stages
in 28 minutes
......@@ -101,6 +101,7 @@ if (BUILD_DOCS)
add_custom_command(OUTPUT ${SPHINX_INDEX_FILE}
......@@ -108,7 +109,7 @@ if (BUILD_DOCS)
COMMAND ${CMAKE_COMMAND} -E copy_if_different ${CMAKE_CURRENT_SOURCE_DIR}/_static/css/custom.css ${CMAKE_CURRENT_BINARY_DIR}/_static/css/custom.css
- ${Python3_EXECUTABLE} -m sphinx -b html
+ ${Python3_EXECUTABLE} -m sphinx -b html -j 4
......@@ -12,6 +12,7 @@ Welcome to OpenVDS's documentation!
.. _vds_cloud_format:
Cloud Storage Format
When VDS data are stored in a cloud object store, such as AWS S3 or Azure Blob Storage, each chunk is represented by a
separate object. A VDS will therefore consist of numerous separate objects, rather than a single file.
URL Encoding
All VDS objects will be accessed from a base URL, usually formed from a bucket URL and a GUID for the dataset. From this
URL you can get the VolumeDataLayout and LayerStatus JSON objects by appending ``/VolumeDataLayout`` and
``/LayerStatus``. The binary chunk data are available through appending the layer name (found in the LayerStatus object,
e.g. Dimensions_012LOD0) and chunk index to the base URL. The binary chunk metadata is available through the layer name
with ``/ChunkMetadata`` and the metadata page index appended.
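
As a rough illustration of the naming scheme above, the following Python sketch assembles the object URLs. The helper names and the example base URL are hypothetical, and the use of ``/`` as the separator before the layer name, chunk index and page index is an assumption; the text only states that these parts are appended to the base URL.

```python
def layout_url(base_url: str) -> str:
    """URL of the VolumeDataLayout JSON object."""
    return base_url.rstrip("/") + "/VolumeDataLayout"


def layer_status_url(base_url: str) -> str:
    """URL of the LayerStatus JSON object."""
    return base_url.rstrip("/") + "/LayerStatus"


def chunk_url(base_url: str, layer_name: str, chunk_index: int) -> str:
    """URL of one binary chunk (assumed '/' separators)."""
    return f"{base_url.rstrip('/')}/{layer_name}/{chunk_index}"


def chunk_metadata_url(base_url: str, layer_name: str, page_index: int) -> str:
    """URL of one chunk-metadata page (assumed '/' separators)."""
    return f"{base_url.rstrip('/')}/{layer_name}/ChunkMetadata/{page_index}"
```

For example, under these assumptions ``chunk_url(base, "Dimensions_012LOD0", 7)`` yields the base URL followed by ``/Dimensions_012LOD0/7``.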
\ No newline at end of file
.. _vds_disk_storage_format:
Disk Storage Format
This document describes the file format used to store VDS files on disk. The file format is a generic container that can
be used to store a variety of "objects" within a single file on disk. As such, it is also possible to store non-VDS data
using this file format. For example, the Bluware HueSpace engine uses this on-disk format for properties and shape
The format is chunk-based, and allows for sparse files. It is intended to be only a container format with minimal
knowledge of the semantics of the stored data.
High-level Requirements
- The container file can only be modified from a single process at a time.
  Concurrent read-only access by multiple processes is supported.
- Atomic updates: a crash, full disk, etc. must never leave behind a corrupt file.
- It must be possible to group multiple updates into a transaction. Nested transactions are not supported.
- Fast direct reads/writes from/to any position within the file.
- Metadata stored for each chunk must be extensible (e.g. VDSs need min/max values in the chunk).
- Individual chunks within the same file may be of different sizes.
- Support for file versioning and delta compression.
File Structure
The overall organization is a container that has a number of named files consisting of a number of chunks. For each file
there is an index table which contains the file offset and size of each chunk. Additional metadata can be stored per
chunk (e.g. hash values for de-duplication of data, min/max values of the data in a chunk). The index table is divided
into pages so it is not necessary to read or write the entire index when updating the file.
A valid VDS file starts with a 12-byte string which reads ``HueDataStore``.
.. code-block:: c++

   struct DataStoreHeader
   {
     char    magic[12];               /* "HueDataStore" */
     int32_t version;                 /* (major << 16) + minor */
     int64_t file_table_offset;
     int32_t file_table_num_entries;
     int32_t file_table_name_length;
   };
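
A minimal sketch of reading this header in Python follows. It assumes little-endian byte order and no padding between fields (the fields are naturally aligned, giving 32 bytes); neither is stated in the text above. The ``parse_header`` function and the ``DataStoreHeader`` tuple are illustrative names, not part of OpenVDS.

```python
import struct
from typing import NamedTuple

# Assumption: little-endian, packed layout. 12 + 4 + 8 + 4 + 4 = 32 bytes.
HEADER_FORMAT = "<12siqii"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


class DataStoreHeader(NamedTuple):
    version_major: int
    version_minor: int
    file_table_offset: int
    file_table_num_entries: int
    file_table_name_length: int


def parse_header(data: bytes) -> DataStoreHeader:
    """Decode the container header from the first bytes of a file."""
    magic, version, offset, num_entries, name_length = struct.unpack_from(
        HEADER_FORMAT, data
    )
    if magic != b"HueDataStore":
        raise ValueError("not a HueDataStore container")
    # The version field packs (major << 16) + minor.
    return DataStoreHeader(
        version >> 16, version & 0xFFFF, offset, num_entries, name_length
    )
```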
File Table Entry
Each file table entry describes the number of chunks in the file, the layout of the chunk index table and where to find
the page table for the file. A UTF-8 encoded file name with the length described in the file header follows the
information about the file. The file table name length must be a multiple of 8 to ensure alignment of the file table
entries. The file names are assumed to be padded with NUL characters up to the file table name length, but need not be
NUL terminated.
.. code-block:: c++

   struct FileTableEntry
   {
     int64_t head_page_directory_offset;
     int32_t head_num_chunks;
     int32_t head_revision_number;
     int32_t index_page_num_entries;
     int32_t file_type;               // file type four-CC
                                      // (four ASCII characters)
     int32_t chunk_metadata_length;
     int16_t file_metadata_length;
     char    file_name[/* file_table_name_length */];
   };
All file table entries have the same number of bytes reserved for the file name.
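
The name-field rules above (UTF-8, NUL-padded to a fixed length that is a multiple of 8, not necessarily NUL-terminated) can be sketched as follows. The function names are hypothetical and only the name field is handled here.

```python
def encode_file_name(name: str, name_length: int) -> bytes:
    """Pad a UTF-8 file name with NUL bytes to the reserved field length."""
    if name_length % 8 != 0:
        raise ValueError("file table name length must be a multiple of 8")
    raw = name.encode("utf-8")
    if len(raw) > name_length:
        raise ValueError("name does not fit in the reserved field")
    return raw.ljust(name_length, b"\0")


def decode_file_name(field: bytes) -> str:
    """Strip NUL padding; a name that fills the field has no terminator."""
    return field.split(b"\0", 1)[0].decode("utf-8")
```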
Index Entry
Each chunk in the file has an index entry associated with it. This ensures that basic information about the chunk can be
kept in memory. The chunk index makes it possible to do partial updates of files where each chunk may change size as a
result of compression.
.. code-block:: c++

   struct IndexEntry
   {
     int64_t offset;
     int32_t size;
     int32_t reserved;
     char    metadata[/* chunk_metadata_length */];
   };
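
The following sketch reads one index entry out of an index page held in memory. It assumes little-endian fields and entries packed back to back with no padding; ``parse_index_entry`` is an illustrative name.

```python
import struct

# Fixed part of an entry: offset (int64), size (int32), reserved (int32).
# Assumption: little-endian byte order.
INDEX_ENTRY_FIXED = struct.Struct("<qii")


def parse_index_entry(page: bytes, slot: int, chunk_metadata_length: int):
    """Return (offset, size, metadata) for the entry at `slot` in `page`."""
    entry_size = INDEX_ENTRY_FIXED.size + chunk_metadata_length
    start = slot * entry_size
    offset, size, _reserved = INDEX_ENTRY_FIXED.unpack_from(page, start)
    metadata = page[start + INDEX_ENTRY_FIXED.size : start + entry_size]
    return offset, size, metadata
```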
Index Page
In order to support atomic updates of files, the index entries are grouped into pages. This makes atomic updates of the
index entries possible without rewriting the entire index. The number of index entries per page is found in the file
table entry field ``index_page_num_entries``. Because the index pages do not have a header, it is possible to write a
contiguous index and only decide how to break it up into pages when writing the page directory. If the number of chunks
in the file is not a multiple of the number of entries per page, the last index page must be zero-padded.
Page Directory
At the ``page_directory_offset`` we find a small header and an array of ``int64_t`` file offsets where each index page
starts. The number of index pages is the number of chunks divided by the number of index entries per page (rounded up to the
nearest integer). The last index page only has as many entries as there are remaining chunks in the file. If the offset
for a page is zero, the page has not been written to the file yet; this makes it reasonably efficient to store a
sparse file.
In addition to this, the page directory provides a basic versioning scheme by being able to link to previous versions of
the page directory. The revision number must be decreasing for previous versions, and cannot be negative.
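
The page arithmetic described above can be sketched as follows; the function names are illustrative only.

```python
def num_index_pages(num_chunks: int, entries_per_page: int) -> int:
    """Number of index pages: chunks / entries per page, rounded up."""
    return -(-num_chunks // entries_per_page)  # ceiling division


def locate_chunk(chunk_index: int, entries_per_page: int):
    """Return (page index, slot within the page) for a chunk."""
    return divmod(chunk_index, entries_per_page)


def entries_in_page(page_index: int, num_chunks: int, entries_per_page: int) -> int:
    """Number of valid entries in a page; only the last page can be short."""
    remaining = num_chunks - page_index * entries_per_page
    return max(0, min(entries_per_page, remaining))
```

With 1000 chunks and 256 entries per page there are 4 index pages, chunk 600 lives in slot 88 of page 2, and the last page holds 232 valid entries.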
.. code-block:: c++

   struct PageDirectory
   {
     int64_t previous_page_directory_offset;
     int32_t previous_num_chunks;
     int32_t previous_revision_number;
     char    metadata[/* file_metadata_length */];
     int64_t index_page_offsets[/* num_index_pages */];
   };
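
A reader that follows the ``previous_page_directory_offset`` links could check the revision-number rule with a sketch like the one below. It takes the revision numbers already collected from head to oldest, and it assumes "decreasing" means strictly decreasing; the function name is hypothetical.

```python
def is_valid_revision_chain(revisions) -> bool:
    """Check revision numbers listed from the head version to the oldest.

    Assumption: each previous version must have a strictly smaller,
    non-negative revision number.
    """
    if any(r < 0 for r in revisions):
        return False
    return all(older < newer for newer, older in zip(revisions, revisions[1:]))
```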
.. _vds_specification_examples:
VolumeDataLayout JSON Object
Below is an example of a VolumeDataLayout JSON object.
.. literalinclude:: example_VolumeDataLayout.json
:language: JSON
LayerStatus JSON Object
Below is an example of a LayerStatus JSON object.
.. literalinclude:: example_LayerStatus.json
:language: JSON
\ No newline at end of file
.. _vds_storage_format:
Storage Format
A VDS dataset is defined by a set of `axes`, each having a name, unit and number of samples, that determines the
dimensionality (up to 6D) of the VDS, and a set of data `channels`, each having a name, unit and data format, that
determine which values are stored for each position in the VDS. A VDS also has a base brick size (see below) and a
setting for how many level-of-detail (LOD) downsampled versions of the data there are.
There is always at least one channel, the `primary channel`, and it always has the same dimensionality as the number of
axes of the VDS. The VDS might also have additional channels. These channels can optionally ignore the first axis of the
VDS and instead store a fixed number of values (most commonly a single value). Additional channels can also
specify that they do not have LODs, because not all data types are meaningful to downsample.
The data in a VDS is organized into `layers`, which is the data for a specific channel with a specific LOD and a
specific partitioning into `chunks`. Chunks are 3D bricks or 2D tiles that can be serialized using different compression
methods (including no compression) and are stored as individual objects in the cloud, or in a container file. The VDS
formats support multiple partitionings of the same data into chunks, e.g. the data can be stored as both 3D bricks and
2D tiles to allow for faster access to slices of the data at the cost of increased storage requirements.
.. figure:: figure_channels.png
An illustration of a VDS with two channels. The primary channel contains a 3D volume of seismic amplitudes. The
samples are stored as 32-bit floating point values, and the data are partitioned into chunks (thick gridlines).
There is also a 2D auxiliary channel that indicates whether each trace is present.
The channels may have different data format and compression method.
The format also specifies how `metadata`, key-value pairs pertaining to the VDS as a whole, is stored. There is a set
of `known metadata` that applications using VDS for a specific purpose (e.g. to store seismic data) are expected to
follow, both required metadata and additional optional metadata that can be used to store information that allows re-creating the original
data (e.g. SEG-Y file or other proprietary formats) exactly.
The description of the partitioning of the data and all related metadata is encoded in the JSON format (The JSON Data
Interchange Format, 2017), thus it can easily be interpreted using a variety of programming languages and technologies.
Each chunk of data is serialized in one of several available binary serialization methods, all of which have open source
deserialization code available.
.. _figure_layers:
.. figure:: figure_layers.png
A 2D example of how bricks are laid out in a layer. In this example the base brick size is 128 voxels, while both
PositiveMargin and NegativeMargin are 4 voxels.
Each multi-dimensional array of data is called a `layer`. There is one layer for each partitioning of each LOD of
each data channel in the dataset. The partitioning of a layer into 3D bricks or 2D tiles is done with respect to a
dimension group, which defines which dimensions of the multi-dimensional array are the 3 dimensions of the bricks. For
example, a 4D array can be partitioned into 3D bricks that span the 012 dimensions of the 4D array, the 013 dimensions,
the 023 dimensions, or the 123 dimensions.
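
The possible dimension groups are simply the combinations of array dimensions taken three at a time (or two at a time for 2D tiles). A small sketch, with a hypothetical function name:

```python
from itertools import combinations


def brick_dimension_groups(dimensionality: int, brick_dims: int = 3):
    """List the dimension groups available for an N-dimensional array."""
    return [
        "".join(str(d) for d in group)
        for group in combinations(range(dimensionality), brick_dims)
    ]
```

For a 4D array, ``brick_dimension_groups(4)`` returns ``['012', '013', '023', '123']``, matching the example above.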
The name of a layer is formed by concatenating the channel name, dimension group and LOD; for the primary channel of the
dataset the channel name is omitted from the layer name. An example layer name is ``Dimensions_012LOD0`` for the 012
dimension group of the primary channel at LOD 0. See :numref:`figure_layers` for an illustration of how a seismic
poststack dataset is organized.
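
The naming rule can be sketched as below. The ``Dimensions_`` and ``LOD`` spellings are inferred from the single example name in the text, and joining a non-primary channel name directly in front with no separator is an assumption; ``layer_name`` is an illustrative helper.

```python
def layer_name(dimension_group: str, lod: int, channel_name: str = "") -> str:
    """Build a layer name; leave `channel_name` empty for the primary channel.

    Assumption: the channel name is prepended without a separator.
    """
    return f"{channel_name}Dimensions_{dimension_group}LOD{lod}"
```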
Chunk `data formats` supported include 32- and 64-bit floating point values, 8- and 16-bit unsigned integers with a
scale and offset (which can be used to represent quantized floating-point values), 32- and 64-bit unsigned integers, and
1-bit Boolean values. Null/no-values are fully supported.
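
To illustrate how an integer format with a scale and offset can represent quantized floating-point values, the sketch below uses a linear mapping, ``value = stored * scale + offset``. The text does not give the exact formula, so treat this mapping and the function names as assumptions.

```python
def dequantize(stored: int, scale: float, offset: float) -> float:
    """Recover an approximate float from a stored unsigned integer."""
    return stored * scale + offset


def quantize(value: float, scale: float, offset: float, bits: int = 8) -> int:
    """Map a float to the nearest representable unsigned integer code."""
    code = round((value - offset) / scale)
    # Clamp to the range of an unsigned integer with `bits` bits.
    return max(0, min((1 << bits) - 1, code))
```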
Chunks can be uncompressed or compressed with a range of compression options, including wavelet compression (lossy or
lossless), zip compression, and run-length encoding. Constant-value chunks are marked as such in the index of the dataset and do not need to be
stored explicitly, so sparse volumes are represented efficiently.
The `base brick size` of a 3D brick is always a power-of-two (64, 128, 256 etc.), and bricks always have the same size
in each dimension. 2D tiles have a size that is a multiple (usually 4 times) of the base brick size in order to reduce
the overhead of having a lot of separate objects.
To enable sparse datasets to be efficiently represented, and to support chunk compression methods that use
adaptive/progressive compression (i.e. use a prefix of the serialized chunk data to produce a lower-quality version of
the chunk), each layer can carry a small amount of extra binary data per chunk (typically 8-40 bytes). This data is available as
a 1D array divided into pages of a fixed number of chunk entries. The interpretation of this chunk metadata is dictated by the
serialization method for the layer in question.
.. _vds_specification_introduction:
`Volume Data Store` (VDS) is a storage format for fast random access to multi-dimensional (up to 6D) volumetric data. The
VDS format has been developed by Bluware Inc. and has seen extensive industrial deployments over the last two decades.
The format is contributed by Bluware Inc. to the Open Group Open Subsurface Data Universe Forum (OSDU) (The Open Group,
An open-source reference software implementation which can read and write VDS has been contributed by Bluware Inc. to
the OSDU. This implementation is named `OpenVDS` and supports several programming languages. It can be included in
software products with no encumbrances towards Bluware Inc.
VDS has been designed to handle extremely large volumes, up to petabytes in size, with variable sized compressed bricks.
The VDS format is very flexible and can store any kind of data representable as arrays with key/value-pair metadata. In
particular, data commonly used in seismic processing can be stored along with all necessary metadata. This makes it
possible to go from legacy formats to VDS and back, while retaining all metadata.
VDS files may be used to represent E&P data types such as regularized single-Z horizons/height-maps (2D), seismic lines
(2D), pre-stack volumes (3D-5D), post-stack volumes (3D), geobody volumes (3D-5D), and attribute volumes of any
dimensionality up to 6D.
The format has been designed primarily to support random access and on-demand fetching of data. This enables
applications that are responsive and interactive as well as efficient I/O for high-performance computing or machine
learning workloads.
.. note::
While the VDS specification supports user-defined named metadata, :ref:`the Metadata chapter <vds_metadata>` describes
the set of predefined required and optional metadata, as defined and managed by the OSDU Data Definitions
subcommittee, for VDS data to be used in seismic applications. Additional named metadata types should be brought up
with the OSDU Data Definitions subcommittee for inclusion in the predefined list.
There are two different storage backends for VDSs:
1. Single-file Container
The single-file version of VDS contains all data (main data, auxiliary channels, metadata, indexes) in a
single file and is suitable for storage and retrieval on a filesystem.
2. Cloud Object Store
The object store version is an adaptation of the single-file container for storage and retrieval on cloud-based object
stores. Utilizing a similar structure in the object store version allows applications a trivial transition from local
legacy type access to cloud-based object store access.
For the remainder of this document, a VDS dataset will refer to a single dataset organized as described by this
specification. The actual realization of the dataset will not be a single object when the data is organized in a
cloud-based object store.
- Metadata
The term `metadata` used in this document refers to what is commonly called `header data` or `name-value pair data`
embedded within the OpenVDS format or any legacy format, e.g. nDI vt, SEG-Y etc. This is distinct and separate from
the Open Subsurface Data Universe (OSDU) work product, work product component and file `metadata`. Clearly, specific,
individual OSDU metadata values may be obtained from corresponding OpenVDS metadata values.
.. _vds_metadata:
The VDS object annotates each dimension with a name, a unit, and starting and ending coordinates. For example, a seismic
dataset with a certain number of samples in the time domain will annotate the trace dimension (typically dimension 0)
with "Sample", "ms", start time, and end time.
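
One way to use the start and end annotation of an axis is to convert a sample index to an annotation coordinate. The sketch below assumes samples are evenly spaced, with the first sample at the start coordinate and the last at the end coordinate; the text does not spell this out, and the function name is hypothetical.

```python
def sample_to_coordinate(index: int, num_samples: int, start: float, end: float) -> float:
    """Linearly map a sample index to its annotation coordinate."""
    if num_samples == 1:
        return start
    return start + index * (end - start) / (num_samples - 1)
```

For an axis annotated with "Sample", "ms", 0 and 4000 that has 1001 samples, index 250 maps to 1000 ms under this assumption.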
Each value channel is annotated with a name, a unit and an estimated value range (e.g. to use a transfer function to
show the value as a color).
The VDS system does not deal directly with spatial coordinate systems; it only defines an N-dimensional array of voxels.
Spatial coordinates are added through key/value-pair metadata that define the transformation from annotation coordinates
to a coordinate reference system.
Although VDS is a general volumetric format, for applications to properly understand the contents of a VDS dataset, the
OpenVDS subcommittee has defined a set of predefined metadata categories, and properties within each category. All OSDU
ingestion/delivery tools will be required to adhere to the list of predefined OpenVDS metadata names and types, to
ensure compatibility across OSDU implementations/instances/applications.
In the URL encoding scheme described above, the metadata described in this section are part of the VolumeDataLayout JSON
object.
Metadata Types
Axis descriptors
Each axis of a volume has a name, unit and annotation start/stop coordinates. The set of recognized axis descriptors is
given in the name[unit] format. Where multiple options are accepted, the different options are separated by the ‘|’
character. The recognized axis descriptors are:
=============== ==================
Descriptor      Unit
=============== ==================
Inline          [unitless]
Crossline       [unitless]
CDP             [unitless]
Gather          [unitless]
Trace           [unitless]
Trace (offset)  [m|ft|ussft]
Trace (angle)   [deg|rad]
Trace (azimuth) [deg|rad]
Sample          [ms|s|m|ft|ussft]
Shot            [unitless]
Receiver        [unitless]
Frequency       [Hz]
Time            [ms|s]
Depth           [m|ft|ussft]
Velocity        [m/s|ft/s|ussft/s]
Easting         [m|ft|ussft]
Northing        [m|ft|ussft]
=============== ==================
Key/Value pairs
“Key/Value pairs” are how you typically store single-instance values, or simple arrays of data, that apply to the
whole dataset and all of the channels. Examples include Survey Name, Survey Coordinate Systems, etc. In OpenVDS
key/value pairs are also associated with a category, which gives a structure to key/value pairs that relate to each
other.
These key/value pairs can be of the following types:
============= =========================================================
Type          Description
============= =========================================================
Int           An integer type
IntVector2    A 2-component integer vector type
IntVector3    A 3-component integer vector type
IntVector4    A 4-component integer vector type
Float         A floating point type
FloatVector2  A 2-component floating point vector type
FloatVector3  A 3-component floating point vector type
FloatVector4  A 4-component floating point vector type
Double        A double precision floating point type
DoubleVector2 A 2-component double precision vector type
DoubleVector3 A 3-component double precision vector type
DoubleVector4 A 4-component double precision vector type
String        A string type (UTF-8)
BLOB          A base-64 encoded binary large object type
============= =========================================================
Channel Descriptor
Named channels are useful for storing additional information for individual voxel locations and/or individual trace
locations. Examples include Angles and/or Offsets, Trace Header, Trace number, Trace Coordinate, Mute, etc. Each named
channel is defined by the name, unit and one of two mapping types:
1. Volume Data
This means that said named channel has the same dimensionality as the primary channel, so each value in the primary channel has a corresponding value in the named channel.
This is useful for things like dip/azimuth for the corresponding seismic voxel.
2. Per Trace
This means that said named channel has one dimension fewer than the primary channel, so a set of values (each entry
can be an array) is valid for a whole trace in the primary channel. This is useful for trace headers in SEG-Y, trace
mute, etc.
Recognized Volume Types
The set of axis descriptors defines the volume type. In the following table the axis descriptors are listed as
Name[unit] dimension 0/…/Name[unit] dimension N, where dimension 0 has the fastest-running index (i.e. consecutive
values in memory). Where multiple options are accepted, the different options are separated by the ‘|’ character. The
recognized volume types are:
============ =====================================================================================================
Volume Type  Axis Descriptors
============ =====================================================================================================
3D Poststack Sample|Time|Depth/Crossline/Inline
3D Prestack  Sample|Time|Depth/Trace(offset)|Velocity|Frequency/Crossline/Inline
2D Poststack Sample|Time|Depth/Gather|CDP|Shot|Receiver
2D Prestack  Sample|Time|Depth/Trace(offset)|Velocity|Frequency/Gather|CDP|Shot|Receiver
3D Poststack Horizon: Crossline/Inline, primary data channel is Sample|Time|Depth
3D Prestack  Horizon: Trace(offset)|Velocity|Frequency/Crossline/Inline, primary data channel is Sample|Time|Depth
============ =====================================================================================================
Recognized Key/Value Pairs
The following are the recognized key/value categories and properties:
Category - ``SurveyCoordinateSystem``
This category of OpenVDS metadata contains two families that provide information to position a dataset in a coordinate
reference system.
In the absence of any of these families, a default setting is considered. In the following, these metadata families are
1. The Inline/Crossline system
In this system a DoubleVector2 defines the origin, and two more DoubleVector2 define the inline and crossline spacing.
These are applied to transform the OpenVDS dimensions with name Inline/Crossline to Easting/Northing coordinates. The
Z coordinate is defined by the axis descriptor for the first dimension of the volume.
2. The 3D IJK System
The Inline/Crossline system has flexibility for only two dimensions. In order to have more freedom, the 3DIJK metadata
is defined. A DoubleVector3 is used to represent the origin and each of the three step vectors that correspond to the dimensions
named "I", "J" and "K" respectively.
3. The Default System
Other dimension names that are recognized and transformed to XYZ coordinates are X, Y and Z, which will be mapped
directly to the corresponding XYZ coordinate.
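
As an illustration of the Inline/Crossline system, the sketch below computes Easting/Northing from inline and crossline annotation values. It assumes the position is the origin plus each annotation value multiplied by its spacing vector; this is one reading of the key descriptions and not a confirmed formula. The function name and the example numbers are made up.

```python
def inline_crossline_to_xy(inline, crossline, origin, inline_spacing, crossline_spacing):
    """Transform (inline, crossline) annotation values to (Easting, Northing).

    `origin`, `inline_spacing` and `crossline_spacing` are (x, y) pairs,
    mirroring the DoubleVector2 metadata keys of the same names.
    """
    x = origin[0] + inline * inline_spacing[0] + crossline * crossline_spacing[0]
    y = origin[1] + inline * inline_spacing[1] + crossline * crossline_spacing[1]
    return (x, y)
```

With an origin of (1000, 2000), an inline spacing of (0, 25) and a crossline spacing of (12.5, 0), inline 10 and crossline 4 map to (1050, 2250) under this assumption.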
.. table:: Keys in the SurveyCoordinateSystem category
================ ============= ==================================================================================================
Key              Type          Description
================ ============= ==================================================================================================
CrosslineSpacing DoubleVector2 The XY spacing between units in the Crossline annotation dimension.
CRSWkt           String        The appropriate OpenGIS Well Known Text description of the coordinate system used.
InlineSegments   BLOB          An array of IntVector3 defining the inline, and the crossline start and end, for each inline segment.
InlineSpacing    DoubleVector2 The XY spacing between units in the Inline annotation dimension.
IStepVector      DoubleVector3 The step vector corresponding to the dimension named 'I'.
JStepVector      DoubleVector3 The step vector corresponding to the dimension named 'J'.
KStepVector      DoubleVector3 The step vector corresponding to the dimension named 'K'.
LatticeScale     Int           Scaling factor from SEG-Y import used on X/Y coordinates.
Origin           DoubleVector2 The XY position of the origin of the annotation (Inline/Crossline/Time) coordinate system.
Origin3D         DoubleVector3 The XYZ position of the origin of the annotation (I/J/K) coordinate system.
Unit             String        The lattice unit used, typically meter, decimeter, centimeter, kilometer, foot, or mile.
================ ============= ==================================================================================================
Category - ``SEGY``
The metadata in this category is only meant for round-tripping original SEG-Y data, and not for application parsing. The
SEG-Y specific metadata allow for exporting a SEG-Y identical to the original SEG-Y. Note that bitwise identity cannot
be achieved with wavelet-compressed files unless the WaveletLossless method is applied; otherwise a signal loss
corresponding to the compression tolerance is to be expected.
.. table:: Keys in the SEGY category
================ ============= ==========================================================================================================================================================================================================================================
Key              Type          Description
================ ============= ==========================================================================================================================================================================================================================================
BinaryHeader     BLOB          The original SEG-Y binary header (400 bytes).
TextHeader       BLOB          The original SEG-Y textual header (3200 bytes x binary header record count).
PrimaryKey       String        The primary sort key of the original SEG-Y. For crossline-sorted files, the PrimaryKey will be ‘Crossline’ to indicate that this was the original order of the file even if the VDS has been normalized to a Time/Crossline/Inline volume.
================ ============= ==========================================================================================================================================================================================================================================
Category - ``TraceCoordinates``
A seismic line may populate the PositionProperty, VerticalOffsetProperty, EnergySourcePointNumberProperty and
EnsembleNumberProperty from metadata BLOBs found in the "TraceCoordinates" category.
The PositionProperty and VerticalOffsetProperty define the position of a seismic line.
.. table:: Keys in the TraceCoordinates category
======================== ============= ==========================================================================================================================================
Key                      Type          Description
======================== ============= ==========================================================================================================================================
TracePositions           BLOB          An array of DoubleVector2 defining the position for each trace, where (0, 0) is treated as an undefined position.
TraceVerticalOffsets     BLOB          An array of doubles defining the offset for each trace from the vertical start position in the Time/Depth/Sample dimension of the OpenVDS.
EnergySourcePointNumbers BLOB          An array of scalar int32 values defining the energy source point number for each trace.
EnsembleNumbers          BLOB          An array of scalar int32 values defining the ensemble number for each trace.
======================== ============= ==========================================================================================================================================
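
A sketch of decoding the ``TracePositions`` BLOB is shown below. It relies on the base-64 encoding named in the type table and assumes the payload is a packed array of little-endian 64-bit floats, two per trace; the byte order is not specified here. ``decode_trace_positions`` is an illustrative name.

```python
import base64
import struct


def decode_trace_positions(blob_b64: str):
    """Decode a base-64 BLOB into a list of (x, y) tuples.

    Positions equal to (0, 0) are undefined and returned as None.
    Assumption: little-endian doubles, two per trace, no padding.
    """
    raw = base64.b64decode(blob_b64)
    if len(raw) % 16 != 0:
        raise ValueError("BLOB length is not a whole number of DoubleVector2")
    positions = []
    for x, y in struct.iter_unpack("<2d", raw):
        positions.append(None if (x, y) == (0.0, 0.0) else (x, y))
    return positions
```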
Category - ``ImportInformation``
This category of VDS metadata contains information about the initial import to VDS. That is, information about the
original file (file name, last modification time etc.) and when/how it was imported. The intended use is e.g., to give a
default file name for an export operation or to inform the user about whether the VDS was imported from some particular
.. table:: Keys in the ImportInformation category
======================== ============= =====================================================================================================================================================================================
Key                      Type          Description
======================== ============= =====================================================================================================================================================================================
DisplayName              String        An informative name (e.g. the survey name) that can be displayed to a user but is not necessarily a valid file name.
InputFileName            String        The original input file name. In cases where the input is not a simple file this should still be a valid file name that can be used as the default for a subsequent export operation.
InputFileSize            Double        The total size (in bytes) of the input file(s), which is an integer stored as a double because there is no 64-bit integer metadata type.
InputTimeStamp           String        The last modified time of the input in ISO 8601 format.
ImportTimeStamp          String        The time in ISO 8601 format when the data was imported to VDS.
======================== ============= =====================================================================================================================================================================================
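
Storing ``InputFileSize`` as a double is lossless for any realistic file size, because an IEEE 754 double represents every integer up to 2^53 exactly. A small sketch of the round trip, with hypothetical helper names:

```python
MAX_EXACT_INTEGER = 2 ** 53  # largest range of integers a double holds exactly


def encode_file_size(size_bytes: int) -> float:
    """Convert a byte count to the Double metadata value."""
    if not 0 <= size_bytes <= MAX_EXACT_INTEGER:
        raise ValueError("size cannot be stored exactly as a double")
    return float(size_bytes)


def decode_file_size(value: float) -> int:
    """Convert the Double metadata value back to an integer byte count."""
    if value != int(value):
        raise ValueError("stored size is not an integer")
    return int(value)
```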
Named Channels
The following are known volume data channels:
================== =====================================================
Name               Description
================== =====================================================
Amplitude          Amplitude values are stored as float values.
Semblance          Semblance values are stored as float values.
Frequency          Frequency values are stored as float values.
Vrms/Vint/Vavg     Velocity values are stored as float values.
Intercept/Gradient Intercept/Gradient values are stored as float values.
================== =====================================================
The following are known Per Trace channels:
================== ================================================================================================
Name               Description
================== ================================================================================================
Mute               Mute values are stored as 2-component 16-bit values, representing mute start time and end time.
Offset             Offset values are stored as float values.
Trace              Trace is a Boolean mask indicating whether each trace was present during conversion.
Azimuth            Azimuth values are stored as float values.
Angle              Angle values are stored as float values.
SEGYTraceHeader    SEG-Y trace headers are stored as 240 byte values per trace.
================== ================================================================================================
\ No newline at end of file
# Configuration file for building the PDF version of the VDS specification.
# It requires the rinohtype pip package (pip install rinohtype)
# The PDF can be built by executing the following in this directory:
# $ python -m sphinx -b rinoh . <outdir>
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import sys
docroot = os.path.abspath('.')
import rinoh.frontend.sphinx
# -- Project information -----------------------------------------------------