Commit 4e94d830 authored by Vicente Bolea's avatar Vicente Bolea 💬
Browse files

Add release notes for v1.6.0

parent 9d345733
This diff is collapsed.
# Remove VTKDataSetWriter::WriteDataSet just_points parameter
In the method `VTKDataSetWriter::WriteDataSet`, `just_points` parameter has been
removed due to lack of usage.
The purpose of `just_points` was to allow exporting only the points of a
DataSet without its cell data.
# Add Kokkos backend
Adds a new device backend `Kokkos` which uses the kokkos library for parallelism.
User must provide the kokkos build and Vtk-m will use the default configured execution
space.
# Extract component arrays from unknown arrays
One of the problems with the data structures of VTK-m is that non-templated
classes like `DataSet`, `Field`, and `UnknownArrayHandle` (formally
`VariantArrayHandle`) internally hold an `ArrayHandle` of a particular type
that has to be cast to the correct task before it can be reasonably used.
That in turn is problematic because the list of possible `ArrayHandle`
types is very long.
At one time we were trying to compensate for this by using
`ArrayHandleVirtual`. However, for technical reasons this class is
infeasible for every use case of VTK-m and has been deprecated. Also, this
was only a partial solution since using it still required different code
paths for, say, handling values of `vtkm::Float32` and `vtkm::Vec3f_32`
even though both are essentially arrays of 32-bit floats.
The extract component feature compensates for this problem by allowing you
to extract the components from an `ArrayHandle`. This feature allows you to
create a single code path to handle `ArrayHandle`s containing scalars or
vectors of any size. Furthermore, when you extract a component from an
array, the storage gets normalized so that one code path covers all storage
types.
## `ArrayExtractComponent`
The basic enabling feature is a new function named `ArrayExtractComponent`.
This function takes takes an `ArrayHandle` and an index to a component. It
then returns an `ArrayHandleStride` holding the selected component of each
entry in the original array.
We will get to the structure of `ArrayHandleStride` later. But the
important part is that `ArrayHandleStride` does _not_ depend on the storage
type of the original `ArrayHandle`. That means whether you extract a
component from `ArrayHandleBasic`, `ArrayHandleSOA`,
`ArrayHandleCartesianProduct`, or any other type, you get back the same
`ArrayHandleStride`. Likewise, regardless of whether the input
`ArrayHandle` has a `ValueType` of `FloatDefault`, `Vec2f`, `Vec3f`, or any
other `Vec` of a default float, you get the same `ArrayHandleStride`. Thus,
you can see how this feature can dramatically reduce code paths if used
correctly.
It should be noted that `ArrayExtractComponent` will (logically) flatten
the `ValueType` before extracting the component. Thus, nested `Vec`s such
as `Vec<Vec3f, 3>` will be treated as a `Vec<FloatDefault, 9>`. The
intention is so that the extracted component will always be a basic C type.
For the purposes of this document when we refer to the "component type", we
really mean the base component type.
Different `ArrayHandle` implementations provide their own implementations
for `ArrayExtractComponent` so that the component can be extracted without
deep copying all the data. We will visit how `ArrayHandleStride` can
represent different data layouts later, but first let's go into the main
use case.
## Extract components from `UnknownArrayHandle`
The principle use case for `ArrayExtractComponent` is to get an
`ArrayHandle` from an unknown array handle without iterating over _every_
possible type. (Rather, we iterate over a smaller set of types.) To
facilitate this, an `ExtractComponent` method has been added to
`UnknownArrayHandle`.
To use `UnknownArrayHandle::ExtractComponent`, you must give it the
component type. You can check for the correct component type by using the
`IsBaseComponentType` method. The method will then return an
`ArrayHandleStride` for the component type specified.
### Example
As an example, let's say you have a worklet, `FooWorklet`, that does some
per component operation on an array. Furthermore, let's say that you want
to implement a function that, to the best of your ability, can apply
`FooWorklet` on an array of any type. This function should be pre-compiled
into a library so it doesn't have to be compiled over and over again.
(`MapFieldPermutation` and `MapFieldMergeAverage` are real and important
examples that have this behavior.)
Without the extract component feature, the implementation might look
something like this (many practical details left out):
``` cpp
struct ApplyFooFunctor
{
template <typename ArrayType>
void operator()(const ArrayType& input, vtkm::cont::UnknownArrayHandle& output) const
{
ArrayType outputArray;
vtkm::cont::Invoke invoke;
invoke(FooWorklet{}, input, outputArray);
output = outputArray;
}
};
vtkm::cont::UnknownArrayHandle ApplyFoo(const vtkm::cont::UnknownArrayHandle& input)
{
vtkm::cont::UnknownArrayHandle output;
input.CastAndCallForTypes<vtkm::TypeListAll, VTKM_DEFAULT_STORAGE_LIST_TAG>(
ApplyFooFunctor{}, output);
return output;
}
```
Take a look specifically at the `CastAndCallForTypes` call near the bottom
of this example. It calls for all types in `vtkm::TypeListAll`, which is
about 40 instances. Then, it needs to be called for any type in the desired
storage list. This could include basic arrays, SOA arrays, and lots of
other specialized types. It would be expected for this code to generate
over 100 paths for `ApplyFooFunctor`. This in turn contains a worklet
invoke, which is not a small amount of code.
Now consider how we can use the `ExtractComponent` feature to reduce the
code paths:
``` cpp
struct ApplyFooFunctor
{
template <typename T>
void operator()(T,
const vtkm::cont::UnknownArrayHandle& input,
cont vtkm::cont::UnknownArrayHandle& output) const
{
if (!input.IsBasicComponentType<T>()) { return; }
VTKM_ASSERT(output.IsBasicComponentType<T>());
vtkm::cont::Invoke invoke;
invoke(FooWorklet{}, input.ExtractComponent<T>(), output.ExtractComponent<T>());
}
};
vtkm::cont::UnknownArrayHandle ApplyFoo(const vtkm::cont::UnknownArrayHandle& input)
{
vtkm::cont::UnknownArrayHandle output = input.NewInstanceBasic();
output.Allocate(input.GetNumberOfValues());
vtkm::cont::ListForEach(ApplyFooFunctor{}, vtkm::TypeListScalarAll{}, input, output);
return output;
}
```
The number of lines of code is about the same, but take a look at the
`ListForEach` (which replaces the `CastAndCallForTypes`). This calling code
takes `TypeListScalarAll` instead of `TypeListAll`, which reduces the
instances created from around 40 to 13 (every basic C type). It is also no
longer dependent on the storage, so these 13 instances are it. As an
example of potential compile savings, changing the implementation of the
`MapFieldMergePermutation` and `MapFieldMergeAverage` functions in this way
reduced the filters_common library (on Mac, Debug build) by 24 MB (over a
third of the total size).
Another great advantage of this approach is that even though it takes less
time to compile and generates less code, it actually covers more cases.
Have an array containg values of `Vec<short, 13>`? No problem. The values
were actually stored in an `ArrayHandleReverse`? It will still work.
## `ArrayHandleStride`
This functionality is made possible with the new `ArrayHandleStride`. This
array behaves much like `ArrayHandleBasic`, except that it contains an
_offset_ parameter to specify where in the buffer array to start reading
and a _stride_ parameter to specify how many entries to skip for each
successive entry. `ArrayHandleStride` also has optional parameters
`divisor` and `modulo` that allow indices to be repeated at regular
intervals.
Here are how `ArrayHandleStride` extracts components from several common
arrays. For each of these examples, we assume that the `ValueType` of the
array is `Vec<T, N>`. They are each extracting _component_.
### Extracting from `ArrayHandleBasic`
When extracting from an `ArrayHandleBasic`, we just need to start at the
proper component and skip the length of the `Vec`.
* _offset_: _component_
* _stride_: `N`
### Extracting from `ArrayHandleSOA`
Since each component is held in a separate array, they are densly packed.
Each component could be represented by `ArrayHandleBasic`, but of course we
use `ArrayHandleStride` to keep the type consistent.
* _offset_: 0
* _stride_: 1
### Extracting from `ArrayHandleCartesianProduct`
This array is the basic reason for implementing the _divisor_ and _modulo_
parameters. Each of the 3 components have different parameters, which are
the following (given that _dims_[3] captures the size of the 3 arrays for
each dimension).
* _offset_: 0
* _stride_: 1
* case _component_ == 0
* _divisor_: _ignored_
* _modulo_: _dims_[0]
* case _component_ == 1
* _divisor_: _dims_[0]
* _modulo_: _dims_[1]
* case _component_ == 2
* _divisor_: _dims_[0]
* _modulo_: _ignored_
### Extracting from `ArrayHandleUniformPointCoordinates`
This array cannot be represented directly because it is fully implicit.
However, it can be trivially converted to `ArrayHandleCartesianProduct` in
typically very little memory. (In fact, EAVL always represented uniform
point coordinates by explicitly storing a Cartesian product.) Thus, for
very little overhead the `ArrayHandleStride` can be created.
## Runtime overhead of extracting components
These benefits come at a cost, but not a large one. The "biggest" cost is
the small cost of computing index arithmetic for each access into
`ArrayHandleStride`. To make this as efficient as possible, there are
conditions that skip over the modulo and divide steps if they are not
necessary. (Integer modulo and divide tend to take much longer than
addition and multiplication.) It is for this reason that we probably do not
want to use this method all the time.
Another cost is the fact that not every `ArrayHandle` can be represented by
`ArrayHandleStride` directly without copying. If you ask to extract a
component that cannot be directly represented, it will be copied into a
basic array, which is not great. To make matters worse, for technical
reasons this copy happens on the host rather than the device.
# Create `ArrayHandleOffsetsToNumComponents`
`ArrayHandleOffsetsToNumComponents` is a fancy array that takes an array of
offsets and converts it to an array of the number of components for each
packed entry.
It is common in VTK-m to pack small vectors of variable sizes into a single
contiguous array. For example, cells in an explicit cell set can each have
a different amount of vertices (triangles = 3, quads = 4, tetra = 4, hexa =
8, etc.). Generally, to access items in this list, you need an array of
components in each entry and the offset for each entry. However, if you
have just the array of offsets in sorted order, you can easily derive the
number of components for each entry by subtracting adjacent entries. This
works best if the offsets array has a size that is one more than the number
of packed vectors with the first entry set to 0 and the last entry set to
the total size of the packed array (the offset to the end).
When packing data of this nature, it is common to start with an array that
is the number of components. You can convert that to an offsets array using
the `vtkm::cont::ConvertNumComponentsToOffsets` function. This will create
an offsets array with one extra entry as previously described. You can then
throw out the original number of components array and use the offsets with
`ArrayHandleOffsetsToNumComponents` to represent both the offsets and num
components while storing only one array.
This replaces the use of `ArrayHandleDecorator` in `CellSetExplicit`.
The two implementation should do the same thing, but the new
`ArrayHandleOffsetsToNumComponents` should be less complex for
compilers.
# `ArrayRangeCompute` works on any array type without compiling device code
Originally, `ArrayRangeCompute` required you to know specifically the
`ArrayHandle` type (value type and storage type) and to compile using any
device compiler. The method is changed to include only overloads that have
precompiled versions of `ArrayRangeCompute`.
Additionally, an `ArrayRangeCompute` overload that takes an
`UnknownArrayHandle` has been added. In addition to allowing you to compute
the range of arrays of unknown types, this implementation of
`ArrayRangeCompute` serves as a fallback for `ArrayHandle` types that are
not otherwise explicitly supported.
If you really want to make sure that you compute the range directly on an
`ArrayHandle` of a particular type, you can include
`ArrayRangeComputeTemplate.h`, which contains a templated overload of
`ArrayRangeCompute` that directly computes the range of an `ArrayHandle`.
Including this header requires compiling for device code.
# Implemented ArrayHandleRandomUniformBits and ArrayHandleRandomUniformReal
ArrayHandleRandomUniformBits and ArrayHandleRandomUniformReal were added to provide
an efficient way to generate pseudo random numbers in parallel. They are based on the
Philox parallel pseudo random number generator. ArrayHandleRandomUniformBits provides
64-bits random bits in the whole range of UInt64 as its content while
ArrayHandleRandomUniformReal provides random Float64 in the range of [0, 1). User can
either provide a seed in the form of Vec<vtkm::Uint32, 1> or use the default random
source provided by the C++ standard library. Both of the ArrayHandles are lazy evaluated
as other Fancy ArrayHandles such that they only have O(1) memory overhead. They are
stateless and functional and does not change once constructed. To generate a new set of
random numbers, for example as part of a iterative algorithm, a new ArrayHandle
needs to be constructed in each iteration. See the user's guide for more detail and
examples.
# `ArrayHandleGroupVecVariable` holds now one more offset.
This change affects the usage of both `ConvertNumComponentsToOffsets` and
`make_ArrayHandleGroupVecVariable`.
The reason of this change is to remove a branch in
`ArrayHandleGroupVecVariable::Get` which is used to avoid an array overflow,
this in theory would increases the performance since at the CPU level it will
remove penalties due to wrong branch predictions.
The change affects `ConvertNumComponentsToOffsets` by both:
1. Increasing the numbers of elements in `offsetsArray` (its second parameter)
by one.
2. Setting `sourceArraySize` as the sum of all the elements plus the new one
in `offsetsArray`
Note that not every specialization of `ConvertNumComponentsToOffsets` does
return `offsetsArray`. Thus, some of them would not be affected.
Similarly, this change affects `make_ArrayHandleGroupVecVariable` since it
expects its second parameter (offsetsArray) to be one element bigger than
before.
# Algorithms for Control and Execution Environments
The `<vtkm/Algorithms.h>` header has been added to provide common STL-style
generic algorithms that are suitable for use in both the control and execution
environments. This is necessary as the STL algorithms in the `<algorithm>`
header are not marked up for use in execution environments such as CUDA.
In addition to the markup, these algorithms have convenience overloads to
support ArrayPortals directly, simplifying their usage with VTK-m data
structures.
Currently, three related algorithms are provided: `LowerBounds`, `UpperBounds`,
and `BinarySearch`. `BinarySearch` differs from the STL `std::binary_search`
algorithm in that it returns an iterator (or index) to a matching element,
rather than just a boolean indicating whether a or not key is present.
The new algorithm signatures are:
```c++
namespace vtkm
{
template <typename IterT, typename T, typename Comp>
VTKM_EXEC_CONT
IterT BinarySearch(IterT first, IterT last, const T& val, Comp comp);
template <typename IterT, typename T>
VTKM_EXEC_CONT
IterT BinarySearch(IterT first, IterT last, const T& val);
template <typename PortalT, typename T, typename Comp>
VTKM_EXEC_CONT
vtkm::Id BinarySearch(const PortalT& portal, const T& val, Comp comp);
template <typename PortalT, typename T>
VTKM_EXEC_CONT
vtkm::Id BinarySearch(const PortalT& portal, const T& val);
template <typename IterT, typename T, typename Comp>
VTKM_EXEC_CONT
IterT LowerBound(IterT first, IterT last, const T& val, Comp comp);
template <typename IterT, typename T>
VTKM_EXEC_CONT
IterT LowerBound(IterT first, IterT last, const T& val);
template <typename PortalT, typename T, typename Comp>
VTKM_EXEC_CONT
vtkm::Id LowerBound(const PortalT& portal, const T& val, Comp comp);
template <typename PortalT, typename T>
VTKM_EXEC_CONT
vtkm::Id LowerBound(const PortalT& portal, const T& val);
template <typename IterT, typename T, typename Comp>
VTKM_EXEC_CONT
IterT UpperBound(IterT first, IterT last, const T& val, Comp comp);
template <typename IterT, typename T>
VTKM_EXEC_CONT
IterT UpperBound(IterT first, IterT last, const T& val);
template <typename PortalT, typename T, typename Comp>
VTKM_EXEC_CONT
vtkm::Id UpperBound(const PortalT& portal, const T& val, Comp comp);
template <typename PortalT, typename T>
VTKM_EXEC_CONT
vtkm::Id UpperBound(const PortalT& portal, const T& val);
}
```
# `vtkm::cont::internal::Buffer` now can have ownership transferred
Memory once transferred to `Buffer` always had to be managed by VTK-m. This is problematic
for applications that needed VTK-m to allocate memory, but have the memory ownership
be longer than VTK-m.
`Buffer::TakeHostBufferOwnership` allows for easy transfer ownership of memory out of VTK-m.
When taking ownership of an VTK-m buffer you are provided the following information:
- Memory: A `void*` pointer to the array
- Container: A `void*` pointer used to free the memory. This is necessary to support cases such as allocations transferred into VTK-m from a `std::vector`.
- Delete: The function to call to actually delete the transferred memory
- Reallocate: The function to call to re-allocate the transferred memory. This will throw an exception if users try
to reallocate a buffer that was 'view' only
- Size: The size in number of elements of the array
To properly steal memory from VTK-m you do the following:
```cpp
vtkm::cont::ArrayHandle<T> arrayHandle;
...
auto stolen = arrayHandle.GetBuffers()->TakeHostBufferOwnership();
...
stolen.Delete(stolen.Container);
```
# Redesign of ArrayHandle to access data using typeless buffers
The original implementation of `ArrayHandle` is meant to be very generic.
To define an `ArrayHandle`, you actually create a `Storage` class that
maintains the data and provides portals to access it (on the host). Because
the `Storage` can provide any type of data structure it wants, you also
need to define an `ArrayTransfer` that describes how to move the
`ArrayHandle` to and from a device. It also has to be repeated for every
translation unit that uses them.
This is a very powerful mechanism. However, one of the major problems with
this approach is that every `ArrayHandle` type needs to have a separate
compile path for every value type crossed with every device. Because of
this limitation, the `ArrayHandle` for the basic storage has a special
implementation that manages the actual data allocation and movement as
`void *` arrays. In this way all the data management can be compiled once
and put into the `vtkm_cont` library. This has dramatically improved the
VTK-m compile time.
This new design replicates the basic `ArrayHandle`'s success to all other
storage types. The basic idea is to make the implementation of
`ArrayHandle` storage slightly less generic. Instead of requiring it to
manage the data it stores, it instead just builds `ArrayPortal`s from
`void` pointers that it is given. The management of `void` pointers can be
done in non-templated classes that are compiled into a library.
This initial implementation does not convert all `ArrayHandle`s to avoid
making non-backward compatible changes before the next minor revision of
VTK-m. In particular, it would be particularly difficult to convert
`ArrayHandleVirtual`. It could be done, but it would be a lot of work for a
class that will likely be removed.
## Buffer
Key to these changes is the introduction of a
`vtkm::cont::internal::Buffer` object. As the name implies, the `Buffer`
object manages a single block of bytes. `Buffer` is agnostic to the type of
data being stored. It only knows the length of the buffer in bytes. It is
responsible for allocating space on the host and any devices as necessary
and for transferring data among them. (Since `Buffer` knows nothing about
the type of data, a precondition of VTK-m would be that the host and all
devices have to have the same endian.)
The idea of the `Buffer` object is similar in nature to the existing
`vtkm::cont::internal::ExecutionArrayInterfaceBasicBase` except that it
will manage a buffer of data among the control and all devices rather than
in one device through a templated subclass.
As explained below, `ArrayHandle` holds some fixed number of `Buffer`
objects. (The number can be zero for implicit `ArrayHandle`s.) Because all
the interaction with the devices happen through `Buffer`, it will no longer
be necessary to compile any reference to `ArrayHandle` for devices (e.g.
you won't have to use nvcc just because the code links `ArrayHandle.h`).
## Storage
The `vtkm::cont::internal::Storage` class changes dramatically. Although an
instance will be kept, the intention is for `Storage` itself to be a
stateless object. It will manage its data through `Buffer` objects provided
from the `ArrayHandle`.
That said, it is possible for `Storage` to have some state. For example,
the `Storage` for `ArrayHandleImplicit` must hold on to the instance of the
portal used to manage the state.
## ArrayTransport
The `vtkm::cont::internal::ArrayTransfer` class will be removed completely.
All data transfers will be handled internally with the `Buffer` object
## Portals
A big change for this design is that the type of a portal for an
`ArrayHandle` will be the same for all devices and the host. Thus, we no
longer need specialized versions of portals for each device. We only have
one portal type. And since they are constructed from `void *` pointers, one
method can create them all.
## Advantages
The `ArrayHandle` interface should not change significantly for external
uses, but this redesign offers several advantages.
### Faster Compiles
Because the memory management is contained in a non-templated `Buffer`
class, it can be compiled once in a library and used by all template
instances of `ArrayHandle`. It should have similar compile advantages to
our current specialization of the basic `ArrayHandle`, but applied to all
types of `ArrayHandle`s.
### Fewer Templates
Hand-in-hand with faster compiles, the new design should require fewer
templates and template instances. We have immediately gotten rid of
`ArrayTransport`. `Storage` is also much shorter. Because all
`ArrayPortal`s are the same for every device and the host, we need many
fewer versions of those classes. In the device adapter, we can probably
collapse the three `ArrayManagerExecution` classes into a single, much
simpler class that does simple memory allocation and copy.
### Fewer files need to be compiled for CUDA
Including `ArrayHandle.h` no longer adds code that compiles for a device.
Thus, we should no longer need to compile for a specific device adapter
just because we access an `ArrayHandle`. This should make it much easier to
achieve our goal of a "firewall". That is, code that just calls VTK-m
filters does not need to support all its compilers and flags.
### Simpler ArrayHandle specialization
The newer code should simplify the implementation of special `ArrayHandle`s
a bit. You need only implement an `ArrayPortal` that operates on one or
more `void *` arrays and a simple `Storage` class.
### Out of band memory sharing
With the current version of `ArrayHandle`, if you want to take data from
one `ArrayHandle` you pretty much have to create a special template to wrap
another `ArrayHandle` around that. With this new design, it is possible to
take data from one `ArrayHandle` and give it to another `ArrayHandle` of a
completely different type. You can't do this willy-nilly since different
`ArrayHandle` types will interpret buffers differently. But there can be
some special important use cases.
One such case could be an `ArrayHandle` that provides strided access to a
buffer. (Let's call it `ArrayHandleStride`.) The idea is that it interprets
the buffer as an array for a particular type (like a basic `ArrayHandle`)
but also defines a stride, skip, and repeat so that given an index it looks
up the value `((index / skip) % repeat) * stride`. The point is that it can
take an AoS array of tuples and represent an array of one of the
components.
The point would be that if you had a `VariantArrayHandle` or `Field`, you
could pull out an array of one of the components as an `ArrayHandleStride`.
An `ArrayHandleStride<vtkm::Float32>` could be used to represent that data
that comes from any basic `ArrayHandle` with `vtkm::Float32` or a
`vtkm::Vec` of that type. It could also represent data from an
`ArrayHandleCartesianProduct` and `ArrayHandleSoA`. We could even represent
an `ArrayHandleUniformPointCoordinates` by just making a small array. This
allows us to statically access a whole bunch of potential array storage
classes with a single type.
### Potentially faster device transfers
There is currently a fast-path for basic `ArrayHandle`s that does a block
cuda memcpy between host and device. But for other `ArrayHandle`s that do
not defer their `ArrayTransfer` to a sub-array, the transfer first has to
copy the data into a known buffer.
Because this new design stores all data in `Buffer` objects, any of these
can be easily and efficiently copied between devices.
## Disadvantages
This new design gives up some features of the original `ArrayHandle` design.
### Can only interface data that can be represented in a fixed number of buffers
Because the original `ArrayHandle` design required the `Storage` to
completely manage the data, it could represent it in any way possible. In
this redesign, the data need to be stored in some fixed number of memory
buffers.
This is a pretty open requirement. I suspect most data formats will be
storable in this. The user's guide has an example of data stored in a
`std::deque` that will not be representable. But that is probably not a
particularly practical example.
### VTK-m would only be able to support hosts and devices with the same endian
Because data are transferred as `void *` blocks of memory, there is no way
to correct words if the endian on the two devices does not agree. As far as
I know, there should be no issues with the proposed ECP machines.