Commit e15343e1 authored by Paal Kvamme

Merge branch 'kvamme62/uuids' into 'master'

Expose the ZGY file's uuid.

See merge request !51
parents fc1e5de7 6c61d952
Pipeline #30551 passed with stages
in 6 minutes and 39 seconds
......@@ -12,13 +12,13 @@
</head>
<body bgcolor="#ffffff">
<p class="version">This is version 0.5 of the document, last updated 2020-12-20.</p>
<p class="version">This is version 0.6 of the document, last updated 2021-03-06.</p>
<!-- <h1 style="color: red">DRAFT DOCUMENT</h1> -->
<h1>Copyright</h1>
<pre>
Copyright 2017-2020, Schlumberger
Copyright 2017-2021, Schlumberger
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
......@@ -309,6 +309,16 @@ limitations under the License.
level of bricking is used. The brick size is explicitly specified.
But currently only bricks of 64*64*64 samples will work.
</p>
<p>
The uuids in a ZGY file on disk are not stored big-endian as
<a href="https://tools.ietf.org/html/rfc4122">RFC 4122</a>
requires. Instead they are stored piecewise little endian.
This affects how the raw bytes are converted to a canonical
string. To avoid confusion an access library should prevent
application code from seeing the raw bytes. Only the canonical
string representation should be accessible. See doc/uuid.md
for more details.
</p>
<table border="1" style="border-collapse: collapse">
<tr>
......
# Universally Unique Identifiers
uuids and guids (the names are usually used interchangeably) are
essentially just 128 random bits but are notoriously difficult to deal
with. I need a function to generate a new uuid that can be stored on a
ZGY file as 16 consecutive bytes. I also need a function to convert
those 16 consecutive bytes to a human readable string. That can't be
too difficult, right?
See [RFC 4122](https://tools.ietf.org/html/rfc4122)
and [Wikipedia](https://en.wikipedia.org/wiki/Universally_unique_identifier).
## Non-standard representation in ZGY files
The uuids in a ZGY file on disk are not stored big-endian as
[RFC 4122](https://tools.ietf.org/html/rfc4122)
requires. Instead they are stored piecewise little endian.
This affects how the raw bytes are converted to a canonical string.
In more detail: uuids can be stored as:
1. A canonical 36-character string xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.
2. An array uint8[16] corresponding 1:1 to the canonical string.
3. A struct {uint32,uint16[2],uint8[8]} of size 16 bytes.
Since the representation in (3) contains multi-byte integers the
layout will differ between little-endian and big-endian machines.
[RFC 4122](https://tools.ietf.org/html/rfc4122)
says the canonical representation is big-endian so (2) and (3) will be
byte-by-byte identical only on a big-endian architecture. ZGY stores
uuids as little-endian instead.
This cannot easily be changed because that would break code that reads
a uuid from a ZGY file and converts it to a string. Existing ZGY
files would then appear to have suddenly changed their uuids.
Note that code that tries to "convert" between (2) and (3) by casting
is wrong in any case. It will only work on one (little-endian or
big-endian) architecture.
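
As a minimal sketch of the difference between representations (2) and (3), the following Python snippet packs the same three integer fields both ways using the standard `struct` module. The example value and the variable names are made up for illustration and do not come from any ZGY file.

```python
import struct

# Hypothetical example value, chosen so each byte is easy to recognize.
canonical = "00112233-4455-6677-8899-aabbccddeeff"

# Representation (3): the fields of struct {uint32, uint16[2], uint8[8]}.
time_low, time_mid, time_hi = 0x00112233, 0x4455, 0x6677
tail = bytes.fromhex("8899aabbccddeeff")

# Representation (2): the fields written big-endian, as RFC 4122 requires.
rfc_bytes = struct.pack(">IHH8s", time_low, time_mid, time_hi, tail)

# The same fields written little-endian, i.e. what a little-endian
# machine holds in memory for the struct and what ZGY stores on disk.
zgy_bytes = struct.pack("<IHH8s", time_low, time_mid, time_hi, tail)

print(rfc_bytes.hex())  # maps 1:1 to the canonical string
print(zgy_bytes.hex())  # the first three fields are byteswapped
```

Only the first 8 bytes differ between the two layouts. The last 8 bytes are plain bytes and are never swapped.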
To avoid confusion in ZGY an access library should prevent application
code from seeing the raw bytes. Only the canonical string
representation should be accessible. This isn't quite the case for the
old ZGY-Public accessor but the issue is manageable:
* ZGY-Public hides the uuids completely from the application code
* OpenZGY/C++ will hide the raw bytes in the API and only expose the
stringized version that is computed in native/src/impl/guid.cpp.
* OpenZGY/Python will try to do the same, or possibly just hide the
uuids completely. In Python the data hiding is implemented just as a
naming convention so it isn't quite as well protected as in
OpenZGY/C++.
* ZGY-Internal, used only by Petrel via Salmon, exposes raw bytes but
they are quickly converted to strings by
Salmon/Shared/PublicTypes/UniqueId.cpp.
To my knowledge this is the only place in Salmon that converts.
This one is more difficult to verify since the raw bytes are in fact
allowed to escape from the ZGY-Internal API. Note that the code in
UniqueId.cpp is technically incorrect and will give wrong results if
compiled on a big-endian machine. This is not a problem for Petrel.
But please don't copy/paste that code anywhere.
# API for dealing with uuids
On Linux there are two libraries, libuuid.so and libossp-uuid.so, that
can convert from raw 16 bytes to a string and on Windows there is at
least one. These algorithms **DO NOT MATCH**. This is where the
problems start. I can roll my own library that tries to mimic either of
those algorithms but there is a risk that I end up implementing yet
another incompatible version.
I believe the cause of the problem is this:
A uuid stored in memory is traditionally laid out as
uint32,uint16[2],uint8[8]. On disk it just looks like 16 consecutive
bytes. Files on disk clearly need to be interoperable between
little-endian and big-endian machines. It is less obvious that the
long and the two shorts need to be accessible as such, but this has to
do with uuids based on a timestamp instead of random bits. The
standard says to use big-endian for interop. So, on the x86
architecture there is some non-obvious byte swapping going on to
convert between a uuid struct and a 16-byte array that can be written
to disk.
If the APIs you are working on expect uuids to be passed as uint8[16]
buffers that are already in big endian order then no byte swapping is
needed.
If you are on a little endian machine and are using an api that
expects or returns a struct uuid and you simply cast that to an
uint8[16] then you are doing something very wrong. This is what Petrel
(or actually Salmon) is doing today in UniqueId.cpp.
So the root cause is that the API on Windows expects a struct GUID while
those on Linux expect a uint8[16].
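
To illustrate the mismatch, here is a small sketch using Python's standard `uuid` module, which accepts both conventions: `bytes=` treats the buffer as big-endian and `bytes_le=` treats it as a little-endian struct. The 16 input bytes are a made-up example, and mapping the two constructors onto "Linux style" and "Windows style" is my reading of the text above, not something verified against those libraries.

```python
import uuid

# Hypothetical 16 raw bytes, e.g. as read from a file header.
raw = bytes(range(16))

# Treat the buffer as RFC 4122 big-endian (a plain uint8[16]).
as_big_endian = str(uuid.UUID(bytes=raw))

# Treat the buffer as a struct that was cast to bytes on a
# little-endian machine (piecewise little-endian).
as_struct_le = str(uuid.UUID(bytes_le=raw))

print(as_big_endian)  # 00010203-0405-0607-0809-0a0b0c0d0e0f
print(as_struct_le)   # 03020100-0504-0706-0809-0a0b0c0d0e0f
```

The same 16 bytes yield two different strings, which is why mixing the two conventions makes a file appear to have changed its uuid.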
To complicate the issue, the
[Wikipedia page](https://en.wikipedia.org/wiki/Universally_unique_identifier)
very clearly explains that the older type of guids created by
Microsoft, tagged as variant 2, have the opposite byte ordering.
Testing suggests that this is incorrect.
[RFC 4122](https://tools.ietf.org/html/rfc4122)
is less clear on that subject. I choose to trust my testing and assume
Wikipedia is wrong or at least that what they say doesn't apply to the
particular API that I am using.
## UUID types
The simple way of generating a new uuid is to use a good random number
generator to create ~128 bits of random data. The assumption is that
the random number generator has enough entropy to virtually guarantee
there will never be a collision, in spite of the birthday paradox.
Unfortunately, sufficient entropy **CANNOT BE GUARANTEED**. The
quality depends on the implementation of the random generator used.
Which might be OS dependent. It can even depend on hardware. The
alternative is to use a much more complex algorithm using time,
ethernet MAC address, and other information. This has its own set of
problems. Such as identifying the machine on which the uuid was
created (privacy issue) and what to do on a machine where there is no
ethernet address or (for virtual machines) where the ethernet MAC
address is not unique.
## OpenZGY choices
Both the generate and the format algorithms should be implemented
locally in the OpenZGY code, because interoperability between Linux
and Windows is the most important criterion.
## Generate
When generating a new uuid I trust that the random number generator my
code uses has sufficient entropy. It helps that the user typically
doesn't deal with very many ZGY files. They might be numbered in the
thousands but hardly in the millions. RISK: Bad number generators and
the birthday paradox. MITIGATION: Be as specific as possible when
choosing the random generator to use. MITIGATION: Share the random
number generator between threads.
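
To put rough numbers on the birthday paradox, the sketch below evaluates the usual approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n uuids drawn from N = 2^122 possible values (a version 4 uuid has 128 bits minus 4 version bits and 2 variant bits). It assumes an ideal, uniform random generator, which is exactly the assumption that cannot be guaranteed in practice. The function name is made up for this example.

```python
import math

RANDOM_BITS = 122  # 128 bits minus 4 version bits and 2 variant bits

def collision_probability(n: int) -> float:
    """Birthday approximation for n uuids drawn uniformly at random."""
    space = 2.0 ** RANDOM_BITS
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n * (n - 1) / (2.0 * space))

for n in (10**3, 10**6, 10**9):
    print(f"{n:>13,d} files: p = {collision_probability(n):.3e}")
```

With an ideal generator the probability stays negligible even for a billion files, so the real risk lies in the quality of the generator and not in the size of the uuid.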
I will generate uuids tagged as variant 1 (DCE) and version 4
(random). This also appears to be the default on windows when calling
UuidCreate().
This also mitigates the uncertainty about byteswapping of variant 2 uuids.
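
A minimal sketch of this generation step in Python follows. It uses the standard `secrets` module as one way of being specific about the random source. That choice, and the function name `make_zgy_uuid`, are assumptions made for this example and not what OpenZGY itself does. The result is cross-checked with `uuid.UUID(bytes_le=...)`, which reads the piecewise little-endian layout.

```python
import secrets
import uuid

def make_zgy_uuid() -> bytes:
    """Return 16 random bytes tagged as variant 1, version 4, in the
    piecewise little-endian layout described in this document."""
    raw = bytearray(secrets.token_bytes(16))
    # Byte 8 is in the part that is not byteswapped.
    raw[8] = (raw[8] & 0x3F) | 0x80   # variant 1 (DCE / RFC 4122)
    # The version is the high nibble of the third field. Stored
    # little-endian, that field's most significant byte is byte 7.
    raw[7] = (raw[7] & 0x0F) | 0x40   # version 4 (random)
    return bytes(raw)

raw = make_zgy_uuid()
parsed = uuid.UUID(bytes_le=raw)
print(parsed, parsed.version, parsed.variant)
```

If the version nibble were written to byte 6 instead, a reader using the little-endian convention would see a random version number.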
## Format
For backwards compatibility try to make the storage to string
conversion match what is done in Salmon today. This means treating the
uuid stored on file as piecewise little-endian in violation of RFC
4122. The reason is that only the ZGY-Internal API exposes the uuids
from the ZGY file, and only Petrel uses ZGY-Internal. So only Petrel /
Salmon will have done an explicit raw to string conversion of uuids.
And Petrel, last time I checked, only runs on Windows. ZGY-Public
which is built on top of ZGY-Internal hides the uuids on read. RISK: I
might not manage to replicate the Windows algorithm correctly. RISK:
There might be a mismatch between OpenZGY/Python and OpenZGY/C++.
OpenZGY should only expose string uuids in its api. This makes the
conversion from bytes stored on file to string an implementation
detail. Just like the choice of using mostly little-endian storage for
integers. RISK: This in itself won't help with files created by the
old ZGY-Public or ZGY-Internal libraries. Hence the previous bullet.
Mitigation: Identify the place or places in Salmon where the
conversion is done. It looks like this is just one place: class
UniqueId in Salmon/Shared/PublicTypes/UniqueId.cpp is a candidate.
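
The storage to string conversion described above can be sketched in a few lines of Python. The names `_ORDER` and `format_zgy_uuid` are hypothetical, and the snippet only illustrates the byte order. It is not a copy of the code in Salmon or OpenZGY.

```python
# Index of the stored byte that supplies each byte of the canonical
# string: uint32 swapped, two uint16 swapped, last 8 bytes unchanged.
_ORDER = (3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15)

def format_zgy_uuid(raw: bytes) -> str:
    """Convert 16 bytes as stored in a ZGY file to a canonical string."""
    if len(raw) != 16:
        raise ValueError("a uuid must be exactly 16 bytes")
    h = bytes(raw[i] for i in _ORDER).hex()  # lower case, per RFC 4122
    return f"{h[0:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:32]}"

print(format_zgy_uuid(bytes(range(16))))
# 03020100-0504-0706-0809-0a0b0c0d0e0f
```

Keeping the permutation in a single table makes it easier to compare a C++ and a Python implementation byte for byte, which addresses the mismatch risk listed above.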
## Deja vu
It is not surprising that this issue has shown up before. See the mail
thread with subject gv34oban_218install.bin from 2009. The solution
being discussed was to fix it on the seismic server side. RISK: It is
possible that a kludge was put into Salmon or Petrel instead. The last
entry in that mail thread indicates that the plan was to fix it inside
the seismic server. But, that mail was sent while the Trondheim office
was being shut down and several of the developers were leaving the
company.
Mitigation: If we are confident that
Salmon/Shared/PublicTypes/UniqueId.cpp is the only place where
conversion is done then whatever happened with that issue is
irrelevant.
## TL;DR
All of these steps taken together should end up with a very minor residual risk.
## APIs and packaging on Linux
The following is only relevant if we do not use a private
implementation for guids. So this is about understanding the old ZGY
accessor. It is probably a bad idea to change anything there. Except
possibly moving away from libossp-uuid in CentOS or at least linking
it statically.
### APIs on Linux
On Linux systems there are two different uuid libraries to choose
from. Note that the naming convention is really really bad making it
easy to confuse the two.
The situation on Debian and Ubuntu is as follows:
- libuuid1, uuid-dev, uuid-runtime:
  - installs libuuid.so, /usr/bin/uuidgen, <uuid/uuid.h>, etc.
- libossp-uuid16, libossp-uuid-dev, uuid:
  - installs libossp-uuid.so.16.*, /usr/bin/uuid, <ossp/uuid.h>, etc.
- Note that the "uuid" command line tool is actually an "ossp-uuid".
- Note that the "uuid-dev" is NOT the dev package (headers) for "uuid".
- Note that in CentOS the OSSP headers go in <uuid.h> not <ossp/uuid.h>
The situation on CentOS and Fedora is as follows:
- libuuid, libuuid-devel, util-linux:
  - installs libuuid.so, /usr/bin/uuidgen, <uuid/uuid.h>, etc.
- uuid, uuid-devel:
  - installs libossp-uuid.so.16.*, /usr/bin/uuid, <uuid.h>, etc.
- On CentOS 8 you need to enable powertools to get uuid-devel.
It looks like libuuid1 is always installed by default on all distros.
In the Salmon build, which uuid library is used depends partly on which
packages have been installed in the build environment and partly on how
the tests in two Makefiles work. Thanks to Docker the behavior is now
deterministic but it takes some time to understand what is going on.
On Debian and Ubuntu: Salmon and ZGY choose to use libuuid1.
Installing the OSSP version in the build environment will not change
this because the package ends up in a different place than where the
Makefile looks for it.
On CentOS and Fedora: Salmon and ZGY choose to use the ossp version.
Removing that package will automatically switch to use libuuid1.
Nowadays when all official builds use Docker the ossp libraries will
always be used for CentOS and Fedora. Prior to that it was a bit
arbitrary which library was used since it depended on the build
machine. Ouch.
All platforms link dynamically with the chosen library, libuuid.so.1
or libossp-uuid.so.16.
With 20/20 hindsight Salmon ought to have used libuuid1
unconditionally. The reason it doesn't is probably the confusing
naming convention, which makes it less than obvious that libuuid1
would always be available at runtime. The only tricky bit is to figure
out how to install the needed headers.
The situation today is that users of ZGY-Public on CentOS / RedHat
will need to install the "uuid" package. If we really need to fix this
then either switch to using libuuid1 for all distros or switch to
linking libossp-uuid statically. Probably not worth the trouble.
### APIs on Windows
There is a function UuidToStringA() in <rpc.h> and rpcrt4.lib. Beware
that the Windows API expects struct UUID as input while the Linux APIs
deal with 16-byte arrays. There is also the System.Guid type in .NET
but that is not relevant for OpenZGY.
......@@ -27,6 +27,7 @@
#include "impl/histogramdata.h"
#include "impl/genlod.h"
#include "impl/compression.h"
#include "impl/guid.h"
#ifdef _MSC_VER
#pragma warning(push)
......@@ -278,6 +279,23 @@ public:
return _meta->ih().nlods();
}
// Currently not needed by any client.
//virtual std::string dataid() const override
//{
// return InternalZGY::GUID(_meta->ih().dataid()).toString();
//}
virtual std::string verid() const override
{
return InternalZGY::GUID(_meta->ih().verid()).toString();
}
// Currently not needed by any client.
//virtual std::string previd() const override
//{
// return InternalZGY::GUID(_meta->ih().previd()).toString();
//}
virtual void
dump(std::ostream& os) const override
{
......
......@@ -898,6 +898,13 @@ public:
virtual size3i_t bricksize() const = 0; /**< \brief Size of one brick. Almost always (64,64,64), change at your own peril. */
virtual std::vector<size3i_t> brickcount() const = 0; /**< \brief Number of bricks at each resolution (LOD) level. */
virtual int32_t nlods() const = 0; /**< \brief Number of resolution (LOD) levels. */
// Only expose the guids we think the application will need.
// Note that OpenZGY doesn't really support updates,
// so dataid() and previd() are not very useful.
//virtual std::string dataid() const = 0; /**< GUID set on file creation. */
virtual std::string verid() const = 0; /**< GUID set each time the file is changed. */
//virtual std::string previd() const = 0; /**< GUID before last change. */
// The Python version has meta() as a dict holding all the meta data,
// this isn't really useful in C++ and just makes it harder to see which
// metadata is being used. [set_]numthreads is N/A in this accessor.
......
// Copyright 2017-2020, Schlumberger
// Copyright 2017-2021, Schlumberger
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
......@@ -58,14 +58,20 @@ GUID::copyTo(std::uint8_t *ptr, std::int64_t len)
* Generate a random GUID/UUID, according to the instructions at
* https://en.wikipedia.org/wiki/Universally_unique_identifier.
*
* I hope I got it right. But consequences of failure aren't severe.
* The result is returned as an array of bytes but *not* in
* big-endian order as specified by RFC 4122. Instead use the
* piecewise little-endian format that would allow casting the
* buffer to a struct UUID on a little-endian machine.
*
* See doc/uuid.md.
*
* Implementing general support for all GUID versions and variants is
* ridiculously complicated. So, go the easy route and generate a
* random one. Then the only trick is to tell any reader that the guid
* is stored big-endian (variant 1) and is in fact a random number
* (version 4).
*
* TODO-Worry: Is the granularity of the random seed good enough?
* TODO-Worry: Is the entropy of the random seed good enough?
*/
GUID::guid_bytes_t
GUID::generate()
......@@ -82,24 +88,77 @@ GUID::generate()
int one = distribution(generator);
result[ii] = static_cast<std::uint8_t>(one);
}
result[8] = (result[8] & 0x3f) | 0x80; // variant 1 (big-endian)
result[6] = (result[6] & 0x0f) | 0x40; // version 4 (random) in msb nibble.
// The variant is encoded into byte 8 i.e. the non-byteswapped part.
// The version is encoded into the highest nibble in the int16 field
// in bytes 6 and 7. This is where I need to take the odd
// byteswapping into account. The highest nibble is now in
// the 7th byte, not the 6th.
result[8] = (result[8] & 0x3f) | 0x80; // variant 1 (DCE)
result[7] = (result[7] & 0x0f) | 0x40; // version 4 (random)
return result;
}
/**
* Convert a guid to a user readable string.
* Only version 1 (big-endian) guids are supported; version 2 will be
* displayed but won't match the string the standard requires.
* \brief Convert a ZGY stored guid to a user readable string.
*
* \details
* The expected input is a 16-byte array that matches what is stored
* in the ZGY file i.e. a little-endian struct GUID. This is
* not standard. RFC 4122 states that big-endian is expected.
* This method should only be used in code that reads ZGY files.
*
* If you copy the code to use it in a different context you need
* to remove the explicit byteswapping. I still cannot guarantee
* that the logic will then match the standard apis in Linux and
* Windows, or even whether those two are compatible with each
* other. But I think all three will match.
* See doc/uuid.md for more details.
*
* Note that I have not defined a specific type for the guid;
* it is just a std::array. So I cannot overload operator<< to
* automatically invoke GUID::format().
*
* The contents of byte 8 high nibble i.e. (result[8]>>4):
* - 0..7 -- NCS, a very old format.
* - 8,9,a,b -- variant 1, RFC 4122/DCE 1.1.
* - c,d -- variant 2, old Microsoft.
* - e,f -- reserved.
*
* Most systems today use variant 1.
* Variant 1 should be converted from a struct UUID to big-endian if a
* binary format is desired. Variant 2 should according to the wiki page
* be byteswapped differently but that information appears to be incorrect
* or for a different context. Note that byte 8 is in the not byteswapped
* part. Byte 6, indicating the version, is in the byteswapped part and
* will be found in byte 7 if the guid struct was not converted to big-endian.
* The remaining variants are according to the RFC not expected to be
* interoperable between systems. So the binary to string conversion
* should not matter. Treating them as variant 1 should be good enough.
*/
std::string
GUID::format(const GUID::guid_bytes_t& guid)
GUID::format(const GUID::guid_bytes_t& guid_in)
{
const char hex[17] {"0123456789abcdef"};
GUID::guid_bytes_t guid = guid_in;
//bool variant_2 = (guid_in[8] & 0xE0) == 0xC0;
// Mimic the behavior in rpcrt4.dll on little-endian machines,
// always convert as if the input was cast from a struct UUID
// to a byte array.
if (true) {
// First part byteswaps as an uint32_t
guid[0] = guid_in[3];
guid[1] = guid_in[2];
guid[2] = guid_in[1];
guid[3] = guid_in[0];
// Second and third part byteswaps as two uint16_t
guid[4] = guid_in[5];
guid[5] = guid_in[4];
guid[6] = guid_in[7];
guid[7] = guid_in[6];
// Remaining two parts are not byteswapped.
}
// RFC 4122 requires lower case output, case-insensitive input.
const char hex[17]{"0123456789abcdef"};
char result[37]{0};
char *cp = &result[0];
char *end = &result[sizeof(result)-1];
......@@ -121,4 +180,3 @@ Formatters::operator<<(std::ostream& os, const GUID& guid)
}
} // namespace
// Copyright 2017-2020, Schlumberger
// Copyright 2017-2021, Schlumberger
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
......@@ -28,15 +28,13 @@ namespace InternalZGY {
/**
* \file guid.h
* \brief Simplified %GUID handling. Only big endian, random number guids.
* \details
* See doc/uuid.md
*/
/**
* \brief Simplified %GUID handling. Only big endian, random number guids.
*
* The old ZGY accessor relies on an external dependency to an "uuid"
* package which tends to change between linux distros. This has caused
* a number of headaches. So, don't do that.
*
* Thread safety:
* Modification may lead to a data race. This should not be an issue,
* because instances are only meant to be modified when created or
......
......@@ -194,6 +194,29 @@ public:
return writer_->nlods();
}
// Currently not needed by any client.
//virtual std::string dataid() const override
//{
// // Not allowed to change after file is opened.
// // std::lock_guard<std::mutex> lk(mutex_);
// return writer_->dataid();
//}
virtual std::string verid() const override
{
// Not allowed to change after file is opened.
// std::lock_guard<std::mutex> lk(mutex_);
return writer_->verid();
}
// Currently not needed by any client.
//virtual std::string previd() const override
//{
// // Not allowed to change after file is opened.
// // std::lock_guard<std::mutex> lk(mutex_);
// return writer_->previd();
//}
virtual void dump(std::ostream& os) const override
{
std::lock_guard<std::mutex> lk(mutex_);
......
......@@ -61,6 +61,9 @@ public:
virtual size3i_t bricksize() const {return size3i_t{64,64,64};}
virtual std::vector<size3i_t> brickcount() const {throw std::runtime_error("brickcount() has not been mocked");}
virtual int32_t nlods() const {throw std::runtime_error("nlods() has not been mocked");}
virtual std::string dataid() const {return "0000000a-000b-000c-000d-333333333333";}
virtual std::string verid() const {return "0000000a-000b-000c-000d-222222222222";}
virtual std::string previd() const {return "0000000a-000b-000c-000d-111111111111";}
virtual void meta() const {throw std::runtime_error("meta() has not been mocked");}
virtual int32_t numthreads() const {throw std::runtime_error("numthreads() has not been mocked");}
virtual void set_numthreads(int32_t) {throw std::runtime_error("set_numthreads() has not been mocked");}
......
......@@ -284,6 +284,9 @@ dump_basic(std::shared_ptr<OpenZGY::IZgyReader> r, const std::string& filename,
os << "File name = '" << filename << "'\n"
<< "File size (bytes) = " << filestats.fileSize() << "\n"
<< "File format and version = " << r->datatype() << " ZGY version " << filestats.fileVersion() << "\n"
//<< "Data identifier = " << r->dataid() << "\n"
<< "Current data Version = " << r->verid() << "\n"
//<< "Previous data version = " << r->previd() << "\n"
<< "Brick size I,J,K = " << "(" << r->bricksize()[0] << ", " << r->bricksize()[1] << ", " << r->bricksize()[2] << ")\n"
<< "Number of bricks I,J,K = " << "(" << r->brickcount()[0][0] << ", " << r->brickcount()[0][1] << ", " << r->brickcount()[0][2] << ")\n"
<< "Number of LODs = " << r->nlods() << "\n"
......
......@@ -360,6 +360,43 @@ class ZgyMeta:
"""
return self._meta._ih._nlods
@staticmethod
def _formatUUID(uuid):
"""
Convert a little-endian binary UUID to a big-endian string version.
See the C++ version for details.
First part byteswaps as an uint32_t.
Second and third part byteswaps as two uint16_t.
Remaining two parts are not byteswapped.
Hyphens added between parts.
"""
return ("{3:02x}{2:02x}{1:02x}{0:02x}-" +
"{5:02x}{4:02x}-{7:02x}{6:02x}-" +
"{8:02x}{9:02x}-" +
"{10:02x}{11:02x}{12:02x}{13:02x}{14:02x}{15:02x}").format(*uuid)
#@property
#def dataid(self):
# """
# GUID set on file creation.
# """
# return self._formatUUID(self._meta._ih._dataid)
@property
def verid(self):
"""
GUID set each time the file is changed.
"""
return self._formatUUID(self._meta._ih._verid)
#@property
#def previd(self):
# """
# GUID before last change.
# """
# return self._formatUUID(self._meta._ih._previd)
@property
def meta(self):
"""
......
......@@ -1439,8 +1439,15 @@ class ZgyInternalMeta:
# Meta information that might be updated after creation.
# Except for dataid.
ih._dataid = bytes([random.randint(0,255) for i in range(16)])
ih._verid = bytes([random.randint(0,255) for i in range(16)])
def makeUUID():
# See the C++ version for details.
# TODO-Worry: Is the entropy of the random seed good enough?
uuid = bytearray([random.randint(0,255) for i in range(16)])
uuid[8] = (uuid[8] & 0x3f) | 0x80 # variant 1 (DCE)
uuid[7] = (uuid[7] & 0x0f) | 0x40 # version 4 (random)
return uuid
ih._dataid = makeUUID()
ih._verid = makeUUID()
ih._previd = bytes(16)
ih._srcname = ""
ih._srcdesc = ""
......
......@@ -10,6 +10,7 @@ _brief_info = """
File name = '{name}'
File size (bytes) = {r._fd.xx_eof:,d}
File format and version = {r.datatype.name} ZGY version {r._accessor._metadata._fh._version}
Current data Version = {r.verid}
Brick size I,J,K = {r.bricksize}
Number of bricks I,J,K = {r.brickcount[0]}
Number of LODs = {r.nlods}
......
......@@ -1878,6 +1878,24 @@ ZgyReader_getnlods(ZgyClass *self, void *closure)
return Py_BuildValue("i", self->pimpl_->meta_->nlods());
}
//static PyObject *
//ZgyReader_getdataid(ZgyClass *self, void *closure)
//{
// return Py_BuildValue("s", self->pimpl_->reader_->dataid().c_str());
//}
static PyObject *
ZgyReader_getverid(ZgyClass *self, void *closure)
{
return Py_BuildValue("s", self->pimpl_->reader_->verid().c_str());
}
//static PyObject *
//ZgyReader_getprevid(ZgyClass *self, void *closure)