[System/Storage] Relax id validation to support OSDU relationship definitions/constraints
OSDU defines entity-types as a compound reference <group-type>/<individual-type>. These OSDU entity-type specifications are used to constrain relationships, e.g. to identify a relationship target type via a pattern.
The Storage service constrains the id using this regular expression in ValidationDoc.java:
"[\\w-\\.]+:[\\w-\\.]+:[\\w-\\.]+"
describing the following parts:
<data-partition-id>:<entity-type>:<unique-instance-id>
where entity-type means group-type/individual-type.
The corresponding JSON schema pattern regex using ECMAScript style is
"^[\\w-\\.]+:[\\w-\\.]+:[\\w-\\.]+$"
It should be changed to at least (see the revision below)
"^[\\w-\\.]+:[\\w-\\.\\/]+:[\\w-\\.]+$"
to support <data-partition-id>:<group-type>/<individual-type>:<unique-instance-id>.
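To illustrate the effect of the proposed change, here is a minimal Java sketch that compares the current pattern with the relaxed one. The class name `IdPatternSketch`, the helper methods and the sample id `opendes:master-data/Well:1234` are assumptions made for illustration and are not taken from ValidationDoc.java.

```java
import java.util.regex.Pattern;

class IdPatternSketch {
    // Current id pattern in its anchored (JSON schema) form.
    static final Pattern CURRENT_ID =
        Pattern.compile("^[\\w-\\.]+:[\\w-\\.]+:[\\w-\\.]+$");
    // Proposed pattern: '/' is additionally allowed in the entity-type part.
    static final Pattern RELAXED_ID =
        Pattern.compile("^[\\w-\\.]+:[\\w-\\.\\/]+:[\\w-\\.]+$");

    static boolean matchesCurrent(String id) {
        return CURRENT_ID.matcher(id).matches();
    }

    static boolean matchesRelaxed(String id) {
        return RELAXED_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        String id = "opendes:master-data/Well:1234"; // hypothetical sample id
        System.out.println(matchesCurrent(id)); // false: '/' is not allowed
        System.out.println(matchesRelaxed(id)); // true
    }
}
```

An id without a `/` in the entity-type part, such as `opendes:well:1234`, is accepted by both patterns, so the change only widens what is allowed.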
Furthermore, it should be decided which other characters to allow in the <unique-instance-id>. My suggestion is to relax this to support GUIDs (already supported) and url-encoded strings. There are a number of use cases for a deterministic <unique-instance-id> for reference data.
Decision as per November 3rd
The regex expression for id will change to:
"^[\\w-\\.]+:[\\w-\\.\\/]+:.+$"
The actual validation regex must be published with the Storage service. In turn, OSDU data definitions must adopt the constraints in their schema definitions. At the moment, the validation patterns for id are entirely unconstrained except for the : separator, i.e. [^:]+ for each of the id parts.
Addition December 6th:
The regex for the kind in ValidationDoc.java line 27 seems to be incorrect as well. It lacks the ^ and $ symbols at the beginning and end (otherwise any invalid characters can be added at the beginning and end). The condition for the semantic version number also doesn't filter invalid separators. Instead this expression should work:
^[\w\-\.]+:[\w\-\.]+:[\w\-\.]+:[0-9]+.[0-9]+.[0-9]+$
or as string:
"^[\\w\\-\\.]+:[\\w\\-\\.]+:[\\w\\-\\.]+:[0-9]+.[0-9]+.[0-9]+$"
Summary January 6, 2021
The following regex expressions have been tested in https://regex101.com/ using the ECMAScript option (JSON standard):
RECORD_ID_REGEX = "^[\\w\\-\\.]+:[\\w-\\.\\/]+:.+$"
as used in regex101: ^[\w\-\.]+:[\w-\.\/]+:.+$
RECORD_ID_WITH_VERSION_REGEX = "^[\\w\\-\\.]+:[\\w-\\.\\/]+:.+:[0-9]+$"
as used in regex101: ^[\w\-\.]+:[\w-\.\/]+:.+:[0-9]+$
KIND_REGEX = "^[\\w\\-\\.]+:[\\w\\-\\.]+:[\\w\\-\\.\\/]+:[0-9]+.[0-9]+.[0-9]+$"
as used in regex101 ^[\w\-\.]+:[\w\-\.]+:[\w\-\.\/]+:[0-9]+.[0-9]+.[0-9]+$
If we eventually support 'optionally versioned' id references in the Storage API, there is another regex required:
RECORD_ID_WITH_OPTIONAL_VERSION_REGEX = "^[\\w\\-\\.]+:[\\w-\\.\\/]+:.+:[0-9]*$"
as used in regex101 ^[\w\-\.]+:[\w-\.\/]+:.+:[0-9]*$
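The two points raised for the kind expression (missing anchors and unfiltered version separators) can be checked with a small Java sketch. It uses `find()`, which has the same search semantics as a JSON schema pattern; `Matcher.matches()` would require the whole input to match even without anchors. The class name `KindPatternSketch`, the sample kinds, the unanchored pattern (the KIND_REGEX above with `^` and `$` stripped) and the escaped-dot variant are all assumptions for illustration, not expressions from the Storage service.

```java
import java.util.regex.Pattern;

class KindPatternSketch {
    // KIND_REGEX from the summary with the anchors stripped (assumed stand-in).
    static final Pattern UNANCHORED = Pattern.compile(
        "[\\w\\-\\.]+:[\\w\\-\\.]+:[\\w\\-\\.\\/]+:[0-9]+.[0-9]+.[0-9]+");
    // KIND_REGEX as listed in the summary.
    static final Pattern ANCHORED = Pattern.compile(
        "^[\\w\\-\\.]+:[\\w\\-\\.]+:[\\w\\-\\.\\/]+:[0-9]+.[0-9]+.[0-9]+$");
    // Hypothetical variant with escaped dots in the version part.
    static final Pattern ESCAPED_DOTS = Pattern.compile(
        "^[\\w\\-\\.]+:[\\w\\-\\.]+:[\\w\\-\\.\\/]+:[0-9]+\\.[0-9]+\\.[0-9]+$");

    // Search semantics: true if the pattern occurs anywhere in the input.
    static boolean found(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        String padded = "##osdu:wks:master-data/Well:1.0.0##"; // junk at both ends
        System.out.println(found(UNANCHORED, padded)); // true
        System.out.println(found(ANCHORED, padded));   // false

        String oddVersion = "osdu:wks:master-data/Well:1x0y0"; // invalid separators
        System.out.println(found(ANCHORED, oddVersion));     // true
        System.out.println(found(ESCAPED_DOTS, oddVersion)); // false
    }
}
```

An unescaped `.` matches any character, so the version part of the anchored expression accepts `1x0y0`. Whether the dots should be escaped in the published expression is a decision for the Storage service.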
It turned out that all these 'wishes' were made without (seriously) checking the implementations. / is a reserved character in at least one implementation. Therefore, change of plans, again.
Summary January 26, 2021
To preserve 'business' ids, like unit symbols, it is required to url-encode the desired IDs, e.g. in reference-data. This avoids the otherwise reserved characters. : is already used as a separator in kind and id, but it is also a desired symbol for certain business ids. This means the last part of the id should use this regex: [\w\-\.\:\%]+
i.e. alpha-numeric characters, underscore, dash, dot, colon and percent.
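As an illustration of url-encoding a business id so that it fits the character class for the last part of the id, here is a minimal Java sketch. The class name `BusinessIdSketch`, the helper methods and the sample unit symbols are assumptions; the Storage service may encode ids differently.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Pattern;

class BusinessIdSketch {
    // Character class proposed for the last part of the id, anchored.
    static final Pattern LAST_PART = Pattern.compile("^[\\w\\-\\.\\:\\%]+$");

    // URL-encodes a business id. URLEncoder writes a space as '+', which the
    // class above does not allow, so it is rewritten to "%20" here.
    static String encode(String businessId) {
        try {
            return URLEncoder.encode(businessId, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    static boolean isValidLastPart(String value) {
        return LAST_PART.matcher(value).matches();
    }

    public static void main(String[] args) {
        String unit = "m/s"; // hypothetical unit symbol
        System.out.println(isValidLastPart(unit));         // false: '/' not allowed
        System.out.println(encode(unit));                  // m%2Fs
        System.out.println(isValidLastPart(encode(unit))); // true
    }
}
```

Note that `URLEncoder` leaves `*` unencoded, so a business id containing `*` would still fail the check and would need separate handling.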
RECORD_ID_REGEX = "^[\\w\\-\\.]+:[\\w-\\.]+:[\\w\\-\\.\\:\\%]+$"
as used in regex101: ^[\w\-\.]+:[\w-\.]+:[\w\-\.\:\%]+$
RECORD_ID_WITH_VERSION_REGEX = "^[\\w\\-\\.]+:[\\w-\\.]+:[\\w\\-\\.\\:\\%]+:[0-9]+$"
as used in regex101: ^[\w\-\.]+:[\w-\.]+:[\w\-\.\:\%]+:[0-9]+$
KIND_REGEX = "^[\\w\\-\\.]+:[\\w\\-\\.]+:[\\w\\-\\.]+:[0-9]+.[0-9]+.[0-9]+$"
as used in regex101 ^[\w\-\.]+:[\w\-\.]+:[\w\-\.]+:[0-9]+.[0-9]+.[0-9]+$
If we eventually support 'optionally versioned' id references in the Storage API, there is another regex required:
RECORD_ID_WITH_OPTIONAL_VERSION_REGEX = "^[\\w\\-\\.]+:[\\w-\\.]+:[\\w\\-\\.\\:\\%]+:[0-9]*$"
as used in regex101 ^[\w\-\.]+:[\w-\.]+:[\w\-\.\:\%]+:[0-9]*$
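The following Java sketch exercises the three record id expressions above against a few sample ids. The class name `RecordIdSketch` and the sample ids are assumptions for illustration. It shows two things: because the last part allows `:`, a versioned id also satisfies RECORD_ID_REGEX, and the 'optional version' expression still requires the trailing colon.

```java
import java.util.regex.Pattern;

class RecordIdSketch {
    static final Pattern RECORD_ID = Pattern.compile(
        "^[\\w\\-\\.]+:[\\w-\\.]+:[\\w\\-\\.\\:\\%]+$");
    static final Pattern RECORD_ID_WITH_VERSION = Pattern.compile(
        "^[\\w\\-\\.]+:[\\w-\\.]+:[\\w\\-\\.\\:\\%]+:[0-9]+$");
    static final Pattern RECORD_ID_WITH_OPTIONAL_VERSION = Pattern.compile(
        "^[\\w\\-\\.]+:[\\w-\\.]+:[\\w\\-\\.\\:\\%]+:[0-9]*$");

    static boolean matches(Pattern pattern, String id) {
        return pattern.matcher(id).matches();
    }

    public static void main(String[] args) {
        String plain = "opendes:unit-of-measure:m%2Fs"; // hypothetical ids
        String versioned = plain + ":1611652800000";
        String trailingColon = plain + ":";

        System.out.println(matches(RECORD_ID, plain));                  // true
        System.out.println(matches(RECORD_ID_WITH_VERSION, plain));     // false
        System.out.println(matches(RECORD_ID_WITH_VERSION, versioned)); // true
        // The last part may contain ':', so a versioned id is also a valid plain id.
        System.out.println(matches(RECORD_ID, versioned));              // true
        // The version digits are optional, the separating colon is not.
        System.out.println(matches(RECORD_ID_WITH_OPTIONAL_VERSION, trailingColon)); // true
        System.out.println(matches(RECORD_ID_WITH_OPTIONAL_VERSION, plain));         // false
    }
}
```

A consumer that needs to tell versioned from unversioned ids therefore cannot rely on RECORD_ID_REGEX alone and would have to test RECORD_ID_WITH_VERSION_REGEX first.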