Commit 774d7a56 authored by Robert Maynard's avatar Robert Maynard
Browse files

Add release notes for v1.4.0

parent 0eac06f5
VTK-m 1.4 Release Notes
=======================
# Table of Contents
1. [Core](#Core)
- Remove templates from `ControlSignature` field tags
- Worklets can now be specialized for a specific device adapter
- Worklets now support an execution mask
- Redesign VTK-m Runtime Device Tracking
- `vtkm::cont::Initialize` added to make setting up VTK-m runtime state easier
2. [ArrayHandle](#ArrayHandle)
- Add `vtkm::cont::ArrayHandleVirtual`
- `vtkm::cont::ArrayHandleZip` provides a consistent API even with non-writable handles
- `vtkm::cont::VariantArrayHandle` replaces `vtkm::cont::DynamicArrayHandle`
- `vtkm::cont::VariantArrayHandle` CastAndCall supports casting to concrete types
- `vtkm::cont::VariantArrayHandle::AsVirtual<T>()` performs casting
- `StorageBasic::StealArray()` now provides delete function to new owner
3. [Control Environment](#Control-Environment)
- `vtkm::cont::CellLocatorGeneral` has been added
- `vtkm::cont::CellLocatorTwoLevelUniformGrid` has been renamed to `vtkm::cont::CellLocatorUniformBins`
- `vtkm::cont::Timer` now supports asynchronous and device independent timers
- `vtkm::cont::DeviceAdapterId` construction from strings are now case-insensitive
- `vtkm::cont::Initialize` will only parse known arguments
4. [Execution Environment](#Execution-Environment)
- VTK-m logs details about each CUDA kernel launch
- VTK-m CUDA allocations can have managed memory (cudaMallocManaged) enabled/disabled from C++
- VTK-m CUDA kernel scheduling improved including better defaults, and user customization support
- VTK-m Reduction algorithm now supports differing input and output types
- Added specialized operators for ArrayPortalValueReference
5. [Worklets and Filters](#Worklets-and-Filters)
- `vtkm::worklet::Invoker` now supports worklets which require a Scatter object
- `BitFields` are now a support field input/out type for VTK-m worklets
- Added a Point Merging worklet
- `vtkm::filter::CleanGrid` now can do point merging
- Added a connected component worklets and filters
6. [Build](#Build)
- CMake 3.8+ now required to build VTK-m
- VTK-m now can verify that it installs itself correctly
- VTK-m now requires `CUDA` separable compilation to build
- VTK-m provides a `vtkm_filter` CMake target
- `vtkm::cont::CellLocatorBoundingIntervalHierarchy` is compiled into `vtkm_cont`
7. [Other](#Other)
- LodePNG added as a thirdparty package
- Optionparser added as a thirdparty package
- Thirdparty diy now can coexist with external diy
- Merge benchmark executables into a device dependent shared library
- Merge rendering testing executables to a shared library
- Merge worklet testing executables into a device dependent shared library
- VTK-m runtime device detection properly handles busy CUDA devices
# Core
## Remove templates from `ControlSignature` field tags
Previously, several of the `ControlSignature` tags had a template to
specify a type list. This was to specify potential valid value types for an
input array. The importance of this typelist was to limit the number of
code paths created when resolving a `vtkm::cont::VariantArrayHandle`
(formerly a `DynamicArrayHandle`). This (potentially) reduced the compile
time, the size of libraries/executables, and errors from unexpected types.
Much has changed since this feature was originally implemented. Since then,
the filter infrastructure has been created, and it is through this that
most dynamic worklet invocations happen. However, since the filter
infrastrcture does its own type resolution (and has its own policies) the
type arguments in `ControlSignature` are now of little value.
### Script to update code
This update requires changes to just about all code implementing a VTK-m
worklet. To facilitate the update of this code to these new changes (not to
mention all the code in VTK-m) a script is provided to automatically remove
these template parameters from VTK-m code.
This script is at
[Utilities/Scripts/update-control-signature-tags.sh](../../Utilities/Scripts/update-control-signature-tags.sh).
It needs to be run in a Unix-compatible shell. It takes a single argument,
which is a top level directory to modify files. The script processes all C++
source files recursively from that directory.
### Selecting data types for auxiliary filter fields
The main rational for making these changes is that the types of the inputs
to worklets is almost always already determined by the calling filter.
However, although it is straightforward to specify the type of the "main"
(active) scalars in a filter, it is less clear what to do for additional
fields if a filter needs a second or third field.
Typically, in the case of a second or third field, it is up to the
`DoExecute` method in the filter implementation to apply a policy to that
field. When applying a policy, you give it a policy object (nominally
passed by the user) and a traits of the filter. Generally, the accepted
list of types for a field should be part of the filter's traits. For
example, consider the `WarpVector` filter. This filter only works on
`Vec`s of size 3, so its traits class looks like this.
``` cpp
template <>
class FilterTraits<WarpVector>
{
public:
// WarpVector can only applies to Float and Double Vec3 arrays
using InputFieldTypeList = vtkm::TypeListTagFieldVec3;
};
```
However, the `WarpVector` filter also requires two fields instead of one.
The first (active) field is handled by its superclass (`FilterField`), but
the second (auxiliary) field must be managed in the `DoExecute`. Generally,
this can be done by simply applying the policy with the filter traits.
### The corner cases
Most of the calls to worklets happen within filter implementations, which
have their own way of narrowing down potential types (as previously
described). The majority of the remainder either use static types or work
with a variety of types.
However, there is a minority of corner cases that require a reduction of
types. Since the type argument of the worklet `ControlSignature` arguments
are no longer available, the narrowing of types must be done before the
call to `Invoke`.
This narrowing of arguments is not particularly difficult. Such type-unsure
arguments usually come from a `VariantArrayHandle` (or something that uses
one). You can select the types from a `VariantArrayHandle` simply by using
the `ResetTypes` method. For example, say you know that a variant array is
supposed to be a scalar.
``` cpp
dispatcher.Invoke(variantArray.ResetTypes(vtkm::TypeListTagFieldScalar()),
staticArray);
```
Even more common is to have a `vtkm::cont::Field` object. A `Field` object
internally holds a `VariantArrayHandle`, which is accessible via the
`GetData` method.
``` cpp
dispatcher.Invoke(field.GetData().ResetTypes(vtkm::TypeListTagFieldScalar()),
staticArray);
```
### Change in executable size
The whole intention of these template parameters in the first place was to
reduce the number of code paths compiled. The hypothesis of this change was
that in the current structure the code paths were not being reduced much
if at all. If that is true, the size of executables and libraries should
not change.
Here is a recording of the library and executable sizes before this change
(using `ds -h`).
```
3.0M libvtkm_cont-1.2.1.dylib
6.2M libvtkm_rendering-1.2.1.dylib
312K Rendering_SERIAL
312K Rendering_TBB
22M Worklets_SERIAL
23M Worklets_TBB
22M UnitTests_vtkm_filter_testing
5.7M UnitTests_vtkm_cont_serial_testing
6.0M UnitTests_vtkm_cont_tbb_testing
7.1M UnitTests_vtkm_cont_testing
```
After the changes, the executable sizes are as follows.
```
3.0M libvtkm_cont-1.2.1.dylib
6.0M libvtkm_rendering-1.2.1.dylib
312K Rendering_SERIAL
312K Rendering_TBB
21M Worklets_SERIAL
21M Worklets_TBB
22M UnitTests_vtkm_filter_testing
5.6M UnitTests_vtkm_cont_serial_testing
6.0M UnitTests_vtkm_cont_tbb_testing
7.1M UnitTests_vtkm_cont_testing
```
As we can see, the built sizes have not changed significantly. (If
anything, the build is a little smaller.)
## Worklets can now be specialized for a specific device adapter
This change adds an execution signature tag named `Device` that passes
a `DeviceAdapterTag` to the worklet's parenthesis operator. This allows the
worklet to specialize its operation. This features is available in all
worklets.
The following example shows a worklet that specializes itself for the CUDA
device.
```cpp
struct DeviceSpecificWorklet : vtkm::worklet::WorkletMapField
{
using ControlSignature = void(FieldIn, FieldOut);
using ExecutionSignature = _2(_1, Device);
// Specialization for the Cuda device.
template <typename T>
T operator()(T x, vtkm::cont::DeviceAdapterTagCuda) const
{
// Special cuda implementation
}
// General implementation
template <typename T, typename Device>
T operator()(T x, Device) const
{
// General implementation
}
};
```
### Effect on compile time and binary size
This change necessitated adding a template parameter for the device that
followed at least from the schedule all the way down. This has the
potential for duplicating several of the support methods (like
`DoWorkletInvokeFunctor`) that would otherwise have the same type. This is
especially true between the devices that run on the CPU as they should all
be sharing the same portals from `ArrayHandle`s. So the question is whether
it causes compile to take longer or cause a significant increase in
binaries.
To informally test, I first ran a clean debug compile on my Windows machine
with the serial and tbb devices. The build itself took **3 minutes, 50
seconds**. Here is a list of the binary sizes in the bin directory:
```
kmorel2 0> du -sh *.exe *.dll
200K BenchmarkArrayTransfer_SERIAL.exe
204K BenchmarkArrayTransfer_TBB.exe
424K BenchmarkAtomicArray_SERIAL.exe
424K BenchmarkAtomicArray_TBB.exe
440K BenchmarkCopySpeeds_SERIAL.exe
580K BenchmarkCopySpeeds_TBB.exe
4.1M BenchmarkDeviceAdapter_SERIAL.exe
5.3M BenchmarkDeviceAdapter_TBB.exe
7.9M BenchmarkFieldAlgorithms_SERIAL.exe
7.9M BenchmarkFieldAlgorithms_TBB.exe
22M BenchmarkFilters_SERIAL.exe
22M BenchmarkFilters_TBB.exe
276K BenchmarkRayTracing_SERIAL.exe
276K BenchmarkRayTracing_TBB.exe
4.4M BenchmarkTopologyAlgorithms_SERIAL.exe
4.4M BenchmarkTopologyAlgorithms_TBB.exe
712K Rendering_SERIAL.exe
712K Rendering_TBB.exe
708K UnitTests_vtkm_cont_arg_testing.exe
1.7M UnitTests_vtkm_cont_internal_testing.exe
13M UnitTests_vtkm_cont_serial_testing.exe
14M UnitTests_vtkm_cont_tbb_testing.exe
18M UnitTests_vtkm_cont_testing.exe
13M UnitTests_vtkm_cont_testing_mpi.exe
736K UnitTests_vtkm_exec_arg_testing.exe
136K UnitTests_vtkm_exec_internal_testing.exe
196K UnitTests_vtkm_exec_serial_internal_testing.exe
196K UnitTests_vtkm_exec_tbb_internal_testing.exe
2.0M UnitTests_vtkm_exec_testing.exe
83M UnitTests_vtkm_filter_testing.exe
476K UnitTests_vtkm_internal_testing.exe
148K UnitTests_vtkm_interop_internal_testing.exe
1.3M UnitTests_vtkm_interop_testing.exe
2.9M UnitTests_vtkm_io_reader_testing.exe
548K UnitTests_vtkm_io_writer_testing.exe
792K UnitTests_vtkm_rendering_testing.exe
3.7M UnitTests_vtkm_testing.exe
320K UnitTests_vtkm_worklet_internal_testing.exe
65M UnitTests_vtkm_worklet_testing.exe
11M vtkm_cont-1.3.dll
2.1M vtkm_interop-1.3.dll
21M vtkm_rendering-1.3.dll
3.9M vtkm_worklet-1.3.dll
```
After making the singular change to the `Invocation` object to add the
`DeviceAdapterTag` as a template parameter (which should cause any extra
compile instances) the compile took **4 minuts and 5 seconds**. Here is the
new list of binaries.
```
kmorel2 0> du -sh *.exe *.dll
200K BenchmarkArrayTransfer_SERIAL.exe
204K BenchmarkArrayTransfer_TBB.exe
424K BenchmarkAtomicArray_SERIAL.exe
424K BenchmarkAtomicArray_TBB.exe
440K BenchmarkCopySpeeds_SERIAL.exe
580K BenchmarkCopySpeeds_TBB.exe
4.1M BenchmarkDeviceAdapter_SERIAL.exe
5.3M BenchmarkDeviceAdapter_TBB.exe
7.9M BenchmarkFieldAlgorithms_SERIAL.exe
7.9M BenchmarkFieldAlgorithms_TBB.exe
22M BenchmarkFilters_SERIAL.exe
22M BenchmarkFilters_TBB.exe
276K BenchmarkRayTracing_SERIAL.exe
276K BenchmarkRayTracing_TBB.exe
4.4M BenchmarkTopologyAlgorithms_SERIAL.exe
4.4M BenchmarkTopologyAlgorithms_TBB.exe
712K Rendering_SERIAL.exe
712K Rendering_TBB.exe
708K UnitTests_vtkm_cont_arg_testing.exe
1.7M UnitTests_vtkm_cont_internal_testing.exe
13M UnitTests_vtkm_cont_serial_testing.exe
14M UnitTests_vtkm_cont_tbb_testing.exe
19M UnitTests_vtkm_cont_testing.exe
13M UnitTests_vtkm_cont_testing_mpi.exe
736K UnitTests_vtkm_exec_arg_testing.exe
136K UnitTests_vtkm_exec_internal_testing.exe
196K UnitTests_vtkm_exec_serial_internal_testing.exe
196K UnitTests_vtkm_exec_tbb_internal_testing.exe
2.0M UnitTests_vtkm_exec_testing.exe
86M UnitTests_vtkm_filter_testing.exe
476K UnitTests_vtkm_internal_testing.exe
148K UnitTests_vtkm_interop_internal_testing.exe
1.3M UnitTests_vtkm_interop_testing.exe
2.9M UnitTests_vtkm_io_reader_testing.exe
548K UnitTests_vtkm_io_writer_testing.exe
792K UnitTests_vtkm_rendering_testing.exe
3.7M UnitTests_vtkm_testing.exe
320K UnitTests_vtkm_worklet_internal_testing.exe
68M UnitTests_vtkm_worklet_testing.exe
11M vtkm_cont-1.3.dll
2.1M vtkm_interop-1.3.dll
21M vtkm_rendering-1.3.dll
3.9M vtkm_worklet-1.3.dll
```
So far the increase is quite negligible.
## Worklets now support an execution mask
There have recently been use cases where it would be helpful to mask out
some of the invocations of a worklet. The idea is that when invoking a
worklet with a mask array on the input domain, you might implement your
worklet more-or-less like the following.
```cpp
VTKM_EXEC void operator()(bool mask, /* other parameters */)
{
if (mask)
{
// Do interesting stuff
}
}
```
This works, but what if your mask has mostly false values? In that case,
you are spending tons of time loading data to and from memory where fields
are stored for no reason.
You could potentially get around this problem by adding a scatter to the
worklet. However, that will compress the output arrays to only values that
are active in the mask. That is problematic if you want the masked output
in the appropriate place in the original arrays. You will have to do some
complex (and annoying and possibly expensive) permutations of the output
arrays.
Thus, we would like a new feature similar to scatter that instead masks out
invocations so that the worklet is simply not run on those outputs.
### New Interface
The new "Mask" feature that is similar (and orthogonal) to the existing
"Scatter" feature. Worklet objects now define a `MaskType` that provides on
object that manages the selections of which invocations are skipped. The
following Mask objects are defined.
* `MaskNone` - This removes any mask of the output. All outputs are
generated. This is the default if no `MaskType` is explicitly defined.
* `MaskSelect` - The user to provides an array that specifies whether
each output is created with a 1 to mean that the output should be
created an 0 the mean that it should not.
* `MaskIndices` - The user provides an array with a list of indices for
all outputs that should be created.
It will be straightforward to implement other versions of masks. (For
example, you could make a mask class that selectes every Nth entry.) Those
could be made on an as-needed basis.
### Implementation
The implementation follows the same basic idea of how scatters are
implemented.
#### Mask Classes
The mask class is required to implement the following items.
* `ThreadToOutputType` - A type for an array that maps a thread index (an
index in the array) to an output index. A reasonable type for this
could be `vtkm::cont::ArrayHandle<vtkm::Id>`.
* `GetThreadToOutputMap` - Given the range for the output (e.g. the
number of items in the output domain), returns an array of type
`ThreadToOutputType` that is the actual map.
* `GetThreadRange` - Given a range for the output (e.g. the number of
items in the output domain), returns the range for the threads (e.g.
the number of times the worklet will be invoked).
#### Dispatching
The `vtkm::worklet::internal::DispatcherBase` manages a mask class in
the same way it manages the scatter class. It gets the `MaskType` from
the worklet it is templated on. It requires a `MaskType` object during
its construction.
Previously the dispatcher (and downstream) had to manage the range and
indices of inputs and threads. They now have to also manage a separate
output range/index as now all three may be different.
The `vtkm::Invocation` is changed to hold the ThreadToOutputMap array from
the mask. It likewises has a templated `ChangeThreadToOutputMap` method
added (similar to those already existing for the arrays from a scatter).
This method is used in `DispatcherBase::InvokeTransportParameters` to add
the mask's array to the invocation before calling `InvokeSchedule`.
#### Thread Indices
With the addition of masks, the `ThreadIndices` classes are changed to
manage the actual output index. Previously, the output index was always the
same as the thread index. However, now these two can be different. The
`GetThreadIndices` methods of the worklet base classes have an argument
added that is the portal to the ThreadToOutputMap.
The worklet `GetThreadIndices` is called from the `Task` classes. These
classes are changed to pass in this additional argument. Since the `Task`
classes get an `Invocation` object from the dispatcher, which contains the
`ThreadToOutputMap`, this change is trivial.
### Interaction Between Mask and Scatter
Although it seems weird, it should work fine to mix scatters and masks. The
scatter will first be applied to the input to generate a (potential) list
of output elements. The mask will then be applied to these output elements.
## Redesign VTK-m Runtime Device Tracking
The device tracking infrastructure in VTK-m has been redesigned to
remove multiple redundant codes paths and to simplify reasoning
about around what an instance of RuntimeDeviceTracker will modify.
`vtkm::cont::RuntimeDeviceTracker` tracks runtime information on
a per-user thread basis. This is done to allow multiple calling
threads to use different vtk-m backends such as seen in this
example:
```cpp
vtkm::cont::DeviceAdapterTagCuda cuda;
vtkm::cont::DeviceAdapterTagOpenMP openmp;
{ // thread 1
auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
tracker->ForceDevice(cuda);
vtkm::worklet::Invoker invoke;
invoke(LightTask{}, input, output);
vtkm::cont::Algorithm::Sort(output);
invoke(HeavyTask{}, output);
}
{ // thread 2
auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
tracker->ForceDevice(openmp);
vtkm::worklet::Invoker invoke;
invoke(LightTask{}, input, output);
vtkm::cont::Algorithm::Sort(output);
invoke(HeavyTask{}, output);
}
```
Note: `GetGlobalRuntimeDeviceTracker` has ben refactored to be `GetRuntimeDeviceTracker`
as it always returned a unique instance for each control side thread. This design allows
for different threads to have different runtime device settings. By removing the term `Global`
from the name it becomes more clear what scope this class has.
While this address the ability for threads to specify what
device they should run on. It doesn't make it easy to toggle
the status of a device in a programmatic way, for example
the following block forces execution to only occur on
`cuda` and doesn't restore previous active devices after
```cpp
{
vtkm::cont::DeviceAdapterTagCuda cuda;
auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
tracker->ForceDevice(cuda);
vtkm::worklet::Invoker invoke;
invoke(LightTask{}, input, output);
}
//openmp/tbb/... still inactive
```
To resolve those issues we have `vtkm::cont::ScopedRuntimeDeviceTracker` which
has the same interface as `vtkm::cont::RuntimeDeviceTracker` but additionally
resets any per-user thread modifications when it goes out of scope. So by
switching over the previous example to use `ScopedRuntimeDeviceTracker` we
correctly restore the threads `RuntimeDeviceTracker` state when `tracker`
goes out of scope.
```cpp
{
vtkm::cont::DeviceAdapterTagCuda cuda;
vtkm::cont::ScopedRuntimeDeviceTracker tracker(cuda);
vtkm::worklet::Invoker invoke;
invoke(LightTask{}, input, output);
}
//openmp/tbb/... are now again active
```
The `vtkm::cont::ScopedRuntimeDeviceTracker` is not limited to forcing
execution to occur on a single device. When constructed it can either force
execution to a device, disable a device or enable a device. These options
also work with the `DeviceAdapterTagAny`.
```cpp
{
//enable all devices
vtkm::cont::DeviceAdapterTagAny any;
vtkm::cont::ScopedRuntimeDeviceTracker tracker(any,
vtkm::cont::RuntimeDeviceTrackerMode::Enable);
...
}
{
//disable only cuda
vtkm::cont::DeviceAdapterTagCuda cuda;
vtkm::cont::ScopedRuntimeDeviceTracker tracker(cuda,
vtkm::cont::RuntimeDeviceTrackerMode::Disable);
...
}
```
## `vtkm::cont::Initialize` added to make setting up VTK-m runtime state easier
A new initialization function, `vtkm::cont::Initialize`, has been added.
Initialization is not required, but will configure the logging utilities (when
enabled) and allows forcing a device via a `-d` or `--device` command line
option.
Usage:
```cpp
#include <vtkm/cont/Initialize.h>
int main(int argc, char *argv[])
{
auto config = vtkm::cont::Initialize(argc, argv);
...
}
```
# ArrayHandle
## Add `vtkm::cont::ArrayHandleVirtual`
Added a new class named `vtkm::cont::ArrayHandleVirtual` that allows you to type erase an
ArrayHandle storage type by using virtual calls. This simplification makes
storing `Fields` and `Coordinates` significantly easier as VTK-m doesn't
need to deduce both the storage and value type when executing worklets.
To construct an `vtkm::cont::ArrayHandleVirtual` one can do the following:
```cpp
vtkm::cont::ArrayHandle<vtkm::Float32> pressure;
vtkm::cont::ArrayHandleConstant<vtkm::Float32> constant(42.0f);
// constrcut from an array handle
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> v(pressure);
// or assign from an array handle
v = constant;
```
To help maintain performance `vtkm::cont::ArrayHandleVirtual` provides a collection of helper
functions/methods to query and cast back to the concrete storage and value type:
```cpp
vtkm::cont::ArrayHandleConstant<vtkm::Float32> constant(42.0f);
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> v = constant;
const bool isConstant = vtkm::cont::IsType< decltype(constant) >(v);
if(isConstant)
vtkm::cont::ArrayHandleConstant<vtkm::Float32> t = vtkm::cont::Cast< decltype(constant) >(v);
```
Lastly, a common operation of calling code using `ArrayHandleVirtual` is a desire to construct a new instance
of an existing virtual handle with the same storage type. This can be done by using the `NewInstance` method
as seen below
```cpp
vtkm::cont::ArrayHandle<vtkm::Float32> pressure;
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> v = pressure;
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> newArray = v->NewInstance();
bool isConstant = vtkm::cont::IsType< vtkm::cont::ArrayHandle<vtkm::Float32> >(newArray); //will be true
```
## `vtkm::cont::ArrayHandleZip` provides a consistent API even with non-writable handles
Previously `vtkm::cont::ArrayHandleZip` could not wrap an implicit handle and provide a consistent experience.
The primary issue was that if you tried to use the PortalType returned by `GetPortalControl()` you
would get a compile failure. This would occur as the PortalType returned would try to call `Set`
on an ImplicitPortal which doesn't have a set method.
Now with this change, the `ZipPortal` use SFINAE to determine if `Set` and `Get` should call the
underlying zipped portals.
## `vtkm::cont::VariantArrayHandle` replaces `vtkm::cont::DynamicArrayHandle`
`vtkm::cont::ArrayHandleVariant` replaces `vtkm::cont::DynamicArrayHandle` as the