Speed up OpenZGY by adding several new OpenMP loops.
On read, decompression and copying out the results are now multi-threaded inside OpenZGY. Applications can therefore get good read performance without requesting data from multiple threads, as long as each request is reasonably large.
On write, finalize() is now significantly faster because of the read speed-up: finalize() eventually needs to read back all the full-resolution bricks. Parallelizing finalize() itself makes little sense, because each thread would then get much less memory to work with.
Several lower-level functions have also been optimized and/or parallelized, such as the conversion of user-supplied float data to int8/int16 on write.
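
For illustration only, here is a minimal sketch of what an OpenMP-parallelized float to int16 conversion might look like. It is not the actual OpenZGY code: the function name floatToInt16, the scale/offset convention, and the 65536-sample threshold are all assumptions made for this example. The if() clause reflects the general idea that threading only pays off for reasonably large requests.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper, NOT part of the OpenZGY API.
// Converts float samples to int16 storage values using the assumed mapping
// storage = round((value - offset) / scale), clamped to the int16 range.
std::vector<std::int16_t> floatToInt16(const std::vector<float>& src,
                                       float scale, float offset)
{
  assert(scale != 0.0f);
  const float lo = std::numeric_limits<std::int16_t>::min();
  const float hi = std::numeric_limits<std::int16_t>::max();
  std::vector<std::int16_t> dst(src.size());
  const std::int64_t n = static_cast<std::int64_t>(src.size());

  // Each iteration writes only dst[i], so the loop needs no synchronization.
  // The threshold is an arbitrary assumption: small buffers stay serial to
  // avoid paying thread start-up overhead. Without OpenMP enabled at compile
  // time the pragma is ignored and the loop runs serially.
#pragma omp parallel for if(n > 65536)
  for (std::int64_t i = 0; i < n; ++i) {
    const float scaled = std::round((src[i] - offset) / scale);
    // Clamp before the cast; casting an out-of-range float is undefined.
    dst[i] = static_cast<std::int16_t>(std::min(hi, std::max(lo, scaled)));
  }
  return dst;
}
```

Build with OpenMP enabled (for example -fopenmp with GCC or Clang) to get the threaded loop. The result is the same either way, because the iterations are independent.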