Commit 774d7a56 authored by Robert Maynard

Add release notes for v1.4.0

parent 0eac06f5
# `vtkm::worklet::Invoker` now able to launch worklets that have a non-default scatter type
This change allows the `Invoker` class to launch worklets that require
a custom scatter operation. The scatter is provided as the second
argument when launching a worklet with the `()` operator.
The following example shows a scatter being provided with a worklet launch.
```cpp
struct CheckTopology : vtkm::worklet::WorkletMapPointToCell
{
  using ControlSignature = void(CellSetIn cellset, FieldOutCell);
  using ExecutionSignature = _2(FromIndices);
  using ScatterType = vtkm::worklet::ScatterPermutation<>;
  ...
};

vtkm::worklet::Invoker invoke;
invoke( CheckTopology{}, vtkm::worklet::ScatterPermutation{}, cellset, result );
```
# LodePNG added as a third-party library
The lodepng library was brought in as a third-party library.
This gives the VTK-m rendering library robust
PNG decoding functionality.
# Allow masking of worklet invocations
There have recently been use cases where it would be helpful to mask out
some of the invocations of a worklet. The idea is that when invoking a
worklet with a mask array on the input domain, you might implement your
worklet more-or-less like the following.
```cpp
VTKM_EXEC void operator()(bool mask, /* other parameters */)
{
  if (mask)
  {
    // Do interesting stuff
  }
}
```
This works, but what if your mask has mostly false values? In that case,
you spend a lot of time moving field data to and from memory for
invocations that do nothing.
You could potentially get around this problem by adding a scatter to the
worklet. However, that will compress the output arrays to only values that
are active in the mask. That is problematic if you want the masked output
in the appropriate place in the original arrays. You will have to do some
complex (and annoying and possibly expensive) permutations of the output
arrays.
Thus, we would like a new feature similar to scatter that instead masks out
invocations so that the worklet is simply not run on those outputs.
## New Interface
The new "Mask" feature is similar (and orthogonal) to the existing
"Scatter" feature. Worklet objects now define a `MaskType` that provides an
object that manages the selection of which invocations are skipped. The
following Mask objects are defined.
* `MaskNone` - This removes any mask of the output. All outputs are
generated. This is the default if no `MaskType` is explicitly defined.
* `MaskSelect` - The user provides an array that specifies whether
each output is created, with a 1 to mean that the output should be
created and a 0 to mean that it should not.
* `MaskIndices` - The user provides an array with a list of indices for
all outputs that should be created.
It will be straightforward to implement other versions of masks. (For
example, you could make a mask class that selects every Nth entry.) Those
could be made on an as-needed basis.
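As a rough illustration of what masking means for the output, the sketch below converts a 0/1 select array (the `MaskSelect` form) into a list of output indices (the `MaskIndices` form) and then runs an operation only on those outputs. It uses plain `std::vector` instead of VTK-m arrays, and the names `SelectToOutputIndices` and `RunMasked` are hypothetical helpers, not VTK-m API.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: list the output indices whose select entry is nonzero.
inline std::vector<std::size_t> SelectToOutputIndices(const std::vector<int>& select)
{
  std::vector<std::size_t> indices;
  for (std::size_t outputIndex = 0; outputIndex < select.size(); ++outputIndex)
  {
    if (select[outputIndex] != 0)
    {
      indices.push_back(outputIndex);
    }
  }
  return indices;
}

// Hypothetical helper: run `op` once per selected output. Entry t of
// threadToOutput is the output index handled by "thread" t. Outputs that are
// not listed are never touched, so they keep their previous values.
inline void RunMasked(const std::vector<std::size_t>& threadToOutput,
                      const std::vector<double>& input,
                      std::vector<double>& output,
                      const std::function<double(double)>& op)
{
  for (std::size_t thread = 0; thread < threadToOutput.size(); ++thread)
  {
    const std::size_t outputIndex = threadToOutput[thread];
    output[outputIndex] = op(input[outputIndex]);
  }
}
```

Unlike a scatter-based compaction, the results land at their original positions in the output, which is the behavior the motivation above asks for.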
## Implementation
The implementation follows the same basic idea of how scatters are
implemented.
### Mask Classes
The mask class is required to implement the following items.
* `ThreadToOutputType` - A type for an array that maps a thread index (an
index in the array) to an output index. A reasonable type for this
could be `vtkm::cont::ArrayHandle<vtkm::Id>`.
* `GetThreadToOutputMap` - Given the range for the output (e.g. the
number of items in the output domain), returns an array of type
`ThreadToOutputType` that is the actual map.
* `GetThreadRange` - Given a range for the output (e.g. the number of
items in the output domain), returns the range for the threads (e.g.
the number of times the worklet will be invoked).
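The three required items can be sketched with the "every Nth entry" mask mentioned earlier. This is only an illustration of the shape of the interface: the class name `MaskEveryNth` is hypothetical, `std::vector<std::size_t>` stands in for `vtkm::cont::ArrayHandle<vtkm::Id>`, and the real method signatures in VTK-m may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical mask that keeps outputs 0, N, 2N, ... and skips the rest.
class MaskEveryNth
{
public:
  // Maps a thread index (position in the array) to an output index.
  using ThreadToOutputType = std::vector<std::size_t>;

  // A stride of 0 makes no sense, so it is treated as 1 (keep everything).
  explicit MaskEveryNth(std::size_t stride)
    : Stride(stride == 0 ? 1 : stride)
  {
  }

  // Number of worklet invocations for an output domain of the given size.
  std::size_t GetThreadRange(std::size_t outputRange) const
  {
    return (outputRange + this->Stride - 1) / this->Stride;
  }

  // Entry t holds the output index that thread t writes to.
  ThreadToOutputType GetThreadToOutputMap(std::size_t outputRange) const
  {
    ThreadToOutputType map(this->GetThreadRange(outputRange));
    for (std::size_t thread = 0; thread < map.size(); ++thread)
    {
      map[thread] = thread * this->Stride;
    }
    return map;
  }

private:
  std::size_t Stride;
};
```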
### Dispatching
The `vtkm::worklet::internal::DispatcherBase` manages a mask class in
the same way it manages the scatter class. It gets the `MaskType` from
the worklet it is templated on. It requires a `MaskType` object during
its construction.
Previously the dispatcher (and downstream) had to manage the range and
indices of inputs and threads. They now have to also manage a separate
output range/index as now all three may be different.
The `vtkm::Invocation` is changed to hold the `ThreadToOutputMap` array from
the mask. It likewise has a templated `ChangeThreadToOutputMap` method
added (similar to those already existing for the arrays from a scatter).
This method is used in `DispatcherBase::InvokeTransportParameters` to add
the mask's array to the invocation before calling `InvokeSchedule`.
### Thread Indices
With the addition of masks, the `ThreadIndices` classes are changed to
manage the actual output index. Previously, the output index was always the
same as the thread index. However, now these two can be different. The
`GetThreadIndices` methods of the worklet base classes have an argument
added that is the portal to the ThreadToOutputMap.
The worklet `GetThreadIndices` is called from the `Task` classes. These
classes are changed to pass in this additional argument. Since the `Task`
classes get an `Invocation` object from the dispatcher, which contains the
`ThreadToOutputMap`, this change is trivial.
## Interaction Between Mask and Scatter
Although it seems weird, it should work fine to mix scatters and masks. The
scatter will first be applied to the input to generate a (potential) list
of output elements. The mask will then be applied to these output elements.
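The order of application can be pictured as a composition of two index maps. In the sketch below, `outputToInput` plays the role of the scatter's map and `threadToOutput` the role of the mask's map. Both are plain `std::vector`s, and `ComposeThreadToInput` is a hypothetical helper used only to show which input each surviving thread ends up reading.

```cpp
#include <cstddef>
#include <vector>

// outputToInput: from the scatter, entry o is the input index that output o reads.
// threadToOutput: from the mask, entry t is the output index that thread t writes.
// Returns, for each thread, the input index it reads after both are applied.
inline std::vector<std::size_t> ComposeThreadToInput(
  const std::vector<std::size_t>& outputToInput,
  const std::vector<std::size_t>& threadToOutput)
{
  std::vector<std::size_t> threadToInput;
  threadToInput.reserve(threadToOutput.size());
  for (std::size_t outputIndex : threadToOutput)
  {
    // at() throws if the mask refers to an output the scatter never produced.
    threadToInput.push_back(outputToInput.at(outputIndex));
  }
  return threadToInput;
}
```

For example, if a scatter produces five outputs from inputs `{0, 0, 2, 2, 2}` and the mask keeps only outputs 1 and 4, the two threads that run read inputs 0 and 2.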
# Merge benchmark executables into a device dependent shared library
VTK-m has been updated to replace old per device benchmark executables with a device
dependent shared library so that it's able to accept a device adapter at runtime through
the "--device=" argument.
# Merge rendering testing executables to a shared library
This commit allows rendering testing executables to select the device at runtime.
# Merge worklet testing executables into a device dependent shared library
VTK-m has been updated to replace old per device worklet testing executables with a device
dependent shared library so that it's able to accept a device adapter at runtime through
the "--device=" argument.
# Wrap third party optionparser.h in vtkm/cont/internal/OptionParser.h
Previously we just took the optionparser.h file and stuck it right in
our source code. That was problematic for a variety of reasons.
1. It incorrectly assigned our license to external code.
2. It made lots of unnecessary changes to the original source (like
reformatting).
3. It made it near impossible to track patches we make and updates to
the original software.
Instead, use the third-party system to track changes to optionparser.h
in a different repository and then pull that into ours.
# Allow Initialize to parse only some arguments
When a library requires reading some command line arguments through a
function like Initialize, it is typical that it will parse through
arguments it supports and then remove those arguments from `argc` and
`argv` so that the remaining arguments can be parsed by the calling
program. Recent changes to the `vtkm::cont::Initialize` function support
that.
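The "parse what you know, hand back the rest" pattern can be sketched independently of VTK-m. The function below is a hypothetical, much-reduced stand-in for what a library initializer does: it consumes `--device=<name>` arguments, compacts the remaining arguments to the front of `argv`, and updates `argc`. It is not the actual `Initialize` implementation.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical library-side parser. Consumes every "--device=<name>" argument,
// moves the arguments it does not recognize to the front of argv (preserving
// their order), and shrinks argc to match. Returns a pointer to the last
// device name seen, or nullptr if none was given.
inline const char* ConsumeDeviceArgument(int& argc, char* argv[])
{
  static const char prefix[] = "--device=";
  const std::size_t prefixLength = sizeof(prefix) - 1;

  const char* device = nullptr;
  int kept = (argc > 0) ? 1 : 0; // argv[0] (the program name) is always kept
  for (int i = 1; i < argc; ++i)
  {
    if (std::strncmp(argv[i], prefix, prefixLength) == 0)
    {
      device = argv[i] + prefixLength; // recognized: remember it and drop it
    }
    else
    {
      argv[kept++] = argv[i]; // not ours: leave it for the caller
    }
  }
  argc = kept;
  return device;
}
```

After the call, the calling program can parse `argv[1]` through `argv[argc - 1]` as if the library's options had never been there.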
## Use Case
Say you are creating a simple benchmark where you want to provide a command
line option `--size` that allows you to adjust the size of the data that
you are working on. However, you also want to support flags like `--device`
and `-v` that are handled by `vtkm::cont::Initialize`. Rather than
re-implement all of `Initialize`'s parsing, you can now first call
`Initialize` to handle its arguments and then parse the remaining arguments.
The following is a simple (and rather incomplete) example:
```cpp
#include <vtkm/cont/Initialize.h>

#include <cstdlib>
#include <cstring>
#include <iostream>

static int g_size = 0; // data size; stays 0 unless --size is given

int main(int argc, char** argv)
{
  // Initialize consumes the arguments it recognizes and leaves the rest in argc/argv.
  vtkm::cont::InitializeResult initResult = vtkm::cont::Initialize(argc, argv);

  if ((argc > 1) && (strcmp(argv[1], "--size") == 0))
  {
    if (argc < 3)
    {
      std::cerr << "--size option requires a numeric argument" << std::endl;
      std::cerr << "USAGE: " << argv[0] << " [options]" << std::endl;
      std::cerr << "Options are:" << std::endl;
      std::cerr << " --size <number>\tSpecify the size of the data." << std::endl;
      std::cerr << initResult.Usage << std::endl;
      exit(1);
    }
    g_size = atoi(argv[2]);
  }

  std::cout << "Using device: " << initResult.Device.GetName() << std::endl;

  return 0;
}
```
## Additional Initialize Options
Because `Initialize` no longer has the assumption that it is responsible
for parsing _all_ arguments, some options have been added to
`vtkm::cont::InitializeOptions` to manage these different use cases. The
following options are now supported.
* `None` A placeholder for having all options off, which is the default.
(Same as before this change.)
* `RequireDevice` Issue an error if the device argument is not specified.
(Same as before this change.)
* `DefaultAnyDevice` If no device is specified, treat it as if the user
gave `--device=Any`. This means that `DeviceAdapterTagUndefined` will never
be returned in the result.
* `AddHelp` Add a help argument. If `-h` or `--help` is provided, prints
a usage statement. Of course, the usage statement will only print out
arguments processed by VTK-m.
* `ErrorOnBadOption` If an unknown option is encountered, the program
terminates with an error and a usage statement is printed. If this
option is not provided, any unknown options are returned in `argv`. If
this option is used, it is a good idea to use `AddHelp` as well.
* `ErrorOnBadArgument` If an extra argument is encountered, the program
terminates with an error and a usage statement is printed. If this
option is not provided, any unknown arguments are returned in `argv`.
* `Strict` If supplied, Initialize treats its own arguments as the only
ones supported by the application and provides an error if not followed
exactly. This is a convenience option that is a combination of
`ErrorOnBadOption`, `ErrorOnBadArgument`, and `AddHelp`.
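Options that can be combined, with one option defined as the union of others, are commonly modeled as bit flags. The sketch below shows that idea for the options listed above. The enum name `InitOptions`, the bit values, and the `HasOption` helper are assumptions made for illustration; only the option names and the composition of `Strict` come from the list.

```cpp
#include <cstdint>

// Hypothetical bit-flag model of the options. The numeric values are
// illustrative only and are not taken from VTK-m.
enum class InitOptions : std::uint8_t
{
  None = 0x00,
  RequireDevice = 0x01,
  DefaultAnyDevice = 0x02,
  AddHelp = 0x04,
  ErrorOnBadOption = 0x08,
  ErrorOnBadArgument = 0x10,
  // Strict is a convenience combination of three other options.
  Strict = ErrorOnBadOption | ErrorOnBadArgument | AddHelp
};

// Combine two option sets.
constexpr InitOptions operator|(InitOptions a, InitOptions b)
{
  return static_cast<InitOptions>(static_cast<std::uint8_t>(a) |
                                  static_cast<std::uint8_t>(b));
}

// True if every bit of `flag` is set in `set`.
constexpr bool HasOption(InitOptions set, InitOptions flag)
{
  return (static_cast<std::uint8_t>(set) & static_cast<std::uint8_t>(flag)) ==
    static_cast<std::uint8_t>(flag);
}
```

With this layout, asking for `Strict` answers "yes" to a query for `AddHelp`, which matches the description of `Strict` as a combination rather than a separate mode.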
## InitializeResult Changes
The changes in `Initialize` have also necessitated the changing of some of
the fields in the `InitializeResult` structure. The following fields are
now provided in the `InitializeResult` struct.
* `Device` Returns the device selected in the command line arguments as a
`DeviceAdapterId`. If no device was selected,
`DeviceAdapterTagUndefined` is returned. (Same as before this change.)
* `Usage` Returns a string containing the usage for the options
recognized by `Initialize`. This can be used to build larger usage
statements containing options for both `Initialize` and the calling
program. See the example above.
Note that the `Arguments` field has been removed from `InitializeResult`.
This is because the unparsed arguments are now returned in the modified
`argc` and `argv`, which provides a more complete result than the
`Arguments` field did.
# Add point merge capabilities to CleanGrid filter
We have added a `PointMerge` worklet that uses a virtual grid approach to
identify nearby points. The worklet works by creating a very fine but
sparsely represented locator grid. It then groups points by grid bins and
finds those within a specified radius.
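The general idea of a fine, sparsely stored locator grid can be sketched in a few lines of serial C++. This is not the `PointMerge` worklet: the names `BinOf` and `FindMergeTargets` are hypothetical, the bin width is simply set equal to the radius (assumed positive), and each point is mapped to the lowest-indexed point within the radius rather than going through a full parallel merge.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

using Point = std::array<double, 3>;
using BinId = std::array<long, 3>;

// Integer coordinates of the grid bin that contains point p.
inline BinId BinOf(const Point& p, double binWidth)
{
  return BinId{ { static_cast<long>(std::floor(p[0] / binWidth)),
                  static_cast<long>(std::floor(p[1] / binWidth)),
                  static_cast<long>(std::floor(p[2] / binWidth)) } };
}

// For each point, returns the smallest index of a point within `radius` of it
// (or its own index if no lower-indexed point is that close).
inline std::vector<std::size_t> FindMergeTargets(const std::vector<Point>& points,
                                                 double radius)
{
  // Sparse representation: only bins that actually contain points are stored.
  std::map<BinId, std::vector<std::size_t>> bins;
  for (std::size_t i = 0; i < points.size(); ++i)
  {
    bins[BinOf(points[i], radius)].push_back(i);
  }

  std::vector<std::size_t> target(points.size());
  for (std::size_t i = 0; i < points.size(); ++i)
  {
    target[i] = i;
    const BinId center = BinOf(points[i], radius);
    // With bin width equal to the radius, every point within the radius lies
    // in the point's own bin or one of its 26 neighbors.
    for (long dx = -1; dx <= 1; ++dx)
    {
      for (long dy = -1; dy <= 1; ++dy)
      {
        for (long dz = -1; dz <= 1; ++dz)
        {
          const BinId neighbor{ { center[0] + dx, center[1] + dy, center[2] + dz } };
          const auto found = bins.find(neighbor);
          if (found == bins.end())
          {
            continue;
          }
          for (std::size_t j : found->second)
          {
            const double ex = points[i][0] - points[j][0];
            const double ey = points[i][1] - points[j][1];
            const double ez = points[i][2] - points[j][2];
            if ((j < target[i]) && (ex * ex + ey * ey + ez * ez <= radius * radius))
            {
              target[i] = j;
            }
          }
        }
      }
    }
  }
  return target;
}
```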
This functionality has been integrated into the `CleanGrid` filter. The
following flags have been added to `CleanGrid` to modify the behavior of
point merging.
* `Set`/`GetMergePoints` - a flag to turn on/off the merging of
duplicated coincident points. This extra operation will find points
spatially located near each other and merge them together.
* `Set`/`GetTolerance` - Defines the tolerance used when determining
whether two points are considered coincident. If the
`ToleranceIsAbsolute` flag is false (the default), then this tolerance
is scaled by the diagonal of the bounds of the dataset. This parameter is
only used when merge points is on.
* `Set`/`GetToleranceIsAbsolute` - When ToleranceIsAbsolute is false (the
default) then the tolerance is scaled by the diagonal of the bounds of
the dataset. If true, then the tolerance is taken as the actual
distance to use. This parameter is only used when merge points is on.
* `Set`/`GetFastMerge` - When FastMerge is true (the default), some
corners are cut when computing coincident points. The point merge will
go faster but the tolerance will not be strictly followed.
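The relationship between `Tolerance` and `ToleranceIsAbsolute` comes down to one multiplication. The helper below is a hypothetical sketch of that rule, assuming the "diagonal of the bounds" means the Euclidean length of the bounding-box diagonal. It is not the filter's actual code.

```cpp
#include <array>
#include <cmath>

// Hypothetical helper: resolve the distance actually used when merging points.
// boundsMin and boundsMax are opposite corners of the dataset's bounding box.
inline double EffectiveTolerance(double tolerance,
                                 bool toleranceIsAbsolute,
                                 const std::array<double, 3>& boundsMin,
                                 const std::array<double, 3>& boundsMax)
{
  if (toleranceIsAbsolute)
  {
    return tolerance; // taken as the actual distance
  }
  const double dx = boundsMax[0] - boundsMin[0];
  const double dy = boundsMax[1] - boundsMin[1];
  const double dz = boundsMax[2] - boundsMin[2];
  // Relative tolerance: scale by the length of the bounding-box diagonal.
  return tolerance * std::sqrt(dx * dx + dy * dy + dz * dz);
}
```

For a dataset whose bounds span 3 by 4 by 0 units (diagonal 5), a relative tolerance of 0.01 corresponds to a merge distance of 0.05.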
# Added specialized operators for ArrayPortalValueReference
The `ArrayPortalValueReference` is supposed to behave just like the value it
encapsulates and does so by automatically converting to the base type when
necessary. However, when that base type can itself be converted to something else,
it is possible to get errors about ambiguous overloads. To avoid these, we added
specialized versions of the operators to specify which ones should be used.
We also consolidated the CUDA version of `ArrayPortalValueReference` into the
standard one. The two implementations were equivalent and we would like
changes to apply to both.
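The general pattern is a proxy object with an implicit conversion plus explicitly declared operators, so that overload resolution has an exact match to prefer over conversion chains. The class below is a minimal, hypothetical sketch of that pattern over a `std::vector`. It is not the `ArrayPortalValueReference` implementation and covers only a few operators.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical proxy that stands in for one element of a std::vector<T>.
template <typename T>
class ValueReference
{
public:
  ValueReference(std::vector<T>& storage, std::size_t index)
    : Storage(&storage)
    , Index(index)
  {
  }

  // Implicit conversion lets the proxy be read like a plain T.
  operator T() const { return (*this->Storage)[this->Index]; }

  // Assignment writes through to the underlying storage.
  ValueReference& operator=(const T& value)
  {
    (*this->Storage)[this->Index] = value;
    return *this;
  }

  // Explicit operators give overload resolution an exact match on the proxy,
  // instead of leaving it to choose among built-in operators via conversions.
  T operator+(const T& rhs) const { return static_cast<T>(*this) + rhs; }
  bool operator==(const T& rhs) const { return static_cast<T>(*this) == rhs; }

private:
  std::vector<T>* Storage;
  std::size_t Index;
};
```

This sketch only handles right-hand operands of exactly type `T`. Mixed-type operands are where ambiguity tends to reappear, which is why a complete implementation needs a larger set of specialized operators.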
# Redesign Runtime Device Tracking
The device tracking infrastructure in VTK-m has been redesigned to
remove multiple redundant code paths and to simplify reasoning
about what an instance of `RuntimeDeviceTracker` will modify.
`vtkm::cont::RuntimeDeviceTracker` tracks runtime information on
a per-user-thread basis. This allows multiple calling
threads to use different VTK-m backends, as in this
example:
```cpp
vtkm::cont::DeviceAdapterTagCuda cuda;
vtkm::cont::DeviceAdapterTagOpenMP openmp;

{ // thread 1
  // tracker is bound by reference, so members are accessed with '.'
  auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
  tracker.ForceDevice(cuda);
  vtkm::worklet::Invoker invoke;
  invoke(LightTask{}, input, output);
  vtkm::cont::Algorithm::Sort(output);
  invoke(HeavyTask{}, output);
}

{ // thread 2
  auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
  tracker.ForceDevice(openmp);
  vtkm::worklet::Invoker invoke;
  invoke(LightTask{}, input, output);
  vtkm::cont::Algorithm::Sort(output);
  invoke(HeavyTask{}, output);
}
```
This addresses the ability of threads to specify which
device they should run on, but it doesn't make it easy to toggle
the status of a device programmatically. For example,
the following block forces execution to occur only on
`cuda` and doesn't restore the previously active devices afterwards:
```cpp
{
  vtkm::cont::DeviceAdapterTagCuda cuda;
  auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
  tracker.ForceDevice(cuda);
  vtkm::worklet::Invoker invoke;
  invoke(LightTask{}, input, output);
}
//openmp/tbb/... still inactive
```
To resolve those issues we have `vtkm::cont::ScopedRuntimeDeviceTracker`, which
has the same interface as `vtkm::cont::RuntimeDeviceTracker` but additionally
resets any per-user-thread modifications when it goes out of scope. So by
switching the previous example over to `ScopedRuntimeDeviceTracker` we
correctly restore the thread's `RuntimeDeviceTracker` state when `tracker`
goes out of scope.
```cpp
{
  vtkm::cont::DeviceAdapterTagCuda cuda;
  vtkm::cont::ScopedRuntimeDeviceTracker tracker(cuda);
  vtkm::worklet::Invoker invoke;
  invoke(LightTask{}, input, output);
}
//openmp/tbb/... are now again active
```
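The restore-on-scope-exit behavior is the usual RAII save/restore idiom. The sketch below shows the idiom with toy types: `ToyTracker` and `ToyScopedTracker` are hypothetical stand-ins that track device names in a `std::set`, and they say nothing about how `ScopedRuntimeDeviceTracker` is implemented internally.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for a per-thread tracker: a set of enabled device names.
struct ToyTracker
{
  std::set<std::string> Enabled{ "cuda", "openmp", "serial" };

  // Disable everything except the named device.
  void ForceDevice(const std::string& name) { this->Enabled = { name }; }
};

// Saves the tracker's state on construction and restores it on destruction.
class ToyScopedTracker
{
public:
  ToyScopedTracker(ToyTracker& tracker, const std::string& forced)
    : Tracker(tracker)
    , Saved(tracker.Enabled)
  {
    this->Tracker.ForceDevice(forced);
  }

  ~ToyScopedTracker() { this->Tracker.Enabled = this->Saved; }

  // Copying would restore the saved state twice, so it is disallowed.
  ToyScopedTracker(const ToyScopedTracker&) = delete;
  ToyScopedTracker& operator=(const ToyScopedTracker&) = delete;

private:
  ToyTracker& Tracker;
  std::set<std::string> Saved;
};
```

Because the restore happens in the destructor, it also runs when the scope is left early through a `return` or an exception.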
The `vtkm::cont::ScopedRuntimeDeviceTracker` is not limited to forcing
execution to occur on a single device. When constructed it can either force
execution to a device, disable a device or enable a device. These options
also work with the `DeviceAdapterTagAny`.
```cpp
{
  //enable all devices
  vtkm::cont::DeviceAdapterTagAny any;
  vtkm::cont::ScopedRuntimeDeviceTracker tracker(any,
    vtkm::cont::RuntimeDeviceTrackerMode::Enable);
  ...
}

{
  //disable only cuda
  vtkm::cont::DeviceAdapterTagCuda cuda;
  vtkm::cont::ScopedRuntimeDeviceTracker tracker(cuda,
    vtkm::cont::RuntimeDeviceTrackerMode::Disable);
  ...
}
```
# DeviceAdapter Reduction supports differing input and output types
It is common to want to perform a reduction where the input and output
have different types. A basic example is when the input is `vtkm::UInt8`
but the output is `vtkm::UInt64`. This has been supported since v1.2, as the input
type is implicitly convertible to the output type.
What we now support is the case where the input type is not implicitly convertible to the output type,
such as when the output type is `vtkm::Pair<vtkm::UInt64, vtkm::UInt64>`. For this to work,
we require that the custom binary operator also implements a unary `operator()` that handles
the transformation of input to output.
An example of a custom reduction operator for differing input and output types is:
```cxx
struct CustomMinAndMax
{
  using OutputType = vtkm::Pair<vtkm::Float64, vtkm::Float64>;

  VTKM_EXEC_CONT
  OutputType operator()(vtkm::Float64 a) const
  {
    return OutputType(a, a);
  }

  VTKM_EXEC_CONT
  OutputType operator()(vtkm::Float64 a, vtkm::Float64 b) const
  {
    return OutputType(vtkm::Min(a, b), vtkm::Max(a, b));
  }

  VTKM_EXEC_CONT
  OutputType operator()(const OutputType& a, const OutputType& b) const
  {
    return OutputType(vtkm::Min(a.first, b.first), vtkm::Max(a.second, b.second));
  }

  VTKM_EXEC_CONT
  OutputType operator()(vtkm::Float64 a, const OutputType& b) const
  {
    return OutputType(vtkm::Min(a, b.first), vtkm::Max(a, b.second));
  }

  VTKM_EXEC_CONT
  OutputType operator()(const OutputType& a, vtkm::Float64 b) const
  {
    return OutputType(vtkm::Min(a.first, b), vtkm::Max(a.second, b));
  }
};
```
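To see why several overloads are needed, it helps to look at the shapes of calls a chunked reduction can make. The standalone sketch below is an assumption-laden illustration, not VTK-m's device adapter code: `MinAndMax` mirrors the operator above using `std::pair` and `double`, and the hypothetical `ReduceInChunks` combines inputs two at a time and then folds the partial results.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Plain-C++ analogue of the operator above (three of its five overloads).
struct MinAndMax
{
  using OutputType = std::pair<double, double>;

  // Unary form: lift a single input value to the output type.
  OutputType operator()(double a) const { return OutputType(a, a); }

  // Input with input.
  OutputType operator()(double a, double b) const
  {
    return OutputType(std::min(a, b), std::max(a, b));
  }

  // Output with output.
  OutputType operator()(const OutputType& a, const OutputType& b) const
  {
    return OutputType(std::min(a.first, b.first), std::max(a.second, b.second));
  }
};

// Hypothetical two-stage reduction. Stage 1 turns inputs into partial results
// (pairs of inputs, or a lone trailing input via the unary form). Stage 2
// folds the partial results together.
template <typename T, typename Op>
typename Op::OutputType ReduceInChunks(const std::vector<T>& values, Op op)
{
  using Out = typename Op::OutputType;
  if (values.empty())
  {
    throw std::invalid_argument("ReduceInChunks requires at least one value");
  }

  std::vector<Out> partials;
  for (std::size_t i = 0; i < values.size(); i += 2)
  {
    if (i + 1 < values.size())
    {
      partials.push_back(op(values[i], values[i + 1]));
    }
    else
    {
      partials.push_back(op(values[i])); // odd element out
    }
  }

  Out result = partials.front();
  for (std::size_t i = 1; i < partials.size(); ++i)
  {
    result = op(result, partials[i]);
  }
  return result;
}
```

A different schedule could just as easily combine a partial result with a raw input, which is the situation the mixed overloads in `CustomMinAndMax` cover.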
# Renamed RuntimeDeviceTrackers to no longer use the term Global
The `GetGlobalRuntimeDeviceTracker` function never actually returned a process-wide
runtime device tracker but always a unique one for each control-side thread.
This was by design, as it allows different threads to have different
runtime device settings.
Removing the term Global from the name makes the scope of this
class clearer.
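One common way to get "a unique instance per calling thread" from a free function is a function-local `thread_local` object. The sketch below shows only that idiom. `ToyDeviceState` and `GetToyDeviceState` are hypothetical names, and this is an assumption about the general technique rather than a description of how VTK-m stores its trackers.

```cpp
// Hypothetical per-thread settings object.
struct ToyDeviceState
{
  bool CudaEnabled = true;
};

// Each thread that calls this function gets its own ToyDeviceState, created
// on first use and shared by all later calls from that same thread.
inline ToyDeviceState& GetToyDeviceState()
{
  thread_local ToyDeviceState state;
  return state;
}
```

A name containing "Global" would suggest that a change made through this function is visible to every thread, which is not the case for per-thread state like this.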
# Add ability to specialize a worklet for a device
This change adds an execution signature tag named `Device` that passes
a `DeviceAdapterTag` to the worklet's parenthesis operator. This allows the
worklet to specialize its operation. This feature is available in all
worklets.
The following example shows a worklet that specializes itself for the CUDA
device.
```cpp
struct DeviceSpecificWorklet : vtkm::worklet::WorkletMapField
{
  using ControlSignature = void(FieldIn, FieldOut);
  using ExecutionSignature = _2(_1, Device);

  // Specialization for the Cuda device.
  template <typename T>
  T operator()(T x, vtkm::cont::DeviceAdapterTagCuda) const
  {
    // Special cuda implementation
  }

  // General implementation
  template <typename T, typename Device>
  T operator()(T x, Device) const
  {
    // General implementation
  }
};
```
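The selection between the two `operator()` overloads relies on ordinary C++ overload resolution: when both templates match, the one with the concrete tag type is more specialized and wins. The standalone sketch below demonstrates that rule with hypothetical tag types (`ToyTagSerial`, `ToyTagCuda`) standing in for the real device adapter tags.

```cpp
#include <string>

// Hypothetical device tags standing in for the DeviceAdapterTag types.
struct ToyTagSerial
{
};
struct ToyTagCuda
{
};

struct ToyWorklet
{
  // Chosen when the tag argument is exactly ToyTagCuda, because this template
  // is more specialized than the general one below.
  template <typename T>
  std::string operator()(T, ToyTagCuda) const
  {
    return "cuda";
  }

  // Chosen for every other tag type.
  template <typename T, typename Device>
  std::string operator()(T, Device) const
  {
    return "general";
  }
};
```

The choice is made at compile time, so the unused implementation costs nothing at run time for a given device.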
## Effect on compile time and binary size
This change necessitated adding a template parameter for the device that
is carried at least from the schedule all the way down. This has the
potential to duplicate several of the support methods (like
`DoWorkletInvokeFunctor`) that would otherwise have the same type. This is
especially true between the devices that run on the CPU, as they should all
be sharing the same portals from `ArrayHandle`s. So the question is whether
it causes compiles to take longer or causes a significant increase in
binary size.
To informally test, I first ran a clean debug compile on my Windows machine
with the serial and tbb devices. The build itself took **3 minutes, 50
seconds**. Here is a list of the binary sizes in the bin directory:
```
kmorel2 0> du -sh *.exe *.dll
200K BenchmarkArrayTransfer_SERIAL.exe
204K BenchmarkArrayTransfer_TBB.exe
424K BenchmarkAtomicArray_SERIAL.exe
424K BenchmarkAtomicArray_TBB.exe
440K BenchmarkCopySpeeds_SERIAL.exe
580K BenchmarkCopySpeeds_TBB.exe
4.1M BenchmarkDeviceAdapter_SERIAL.exe
5.3M BenchmarkDeviceAdapter_TBB.exe
7.9M BenchmarkFieldAlgorithms_SERIAL.exe
7.9M BenchmarkFieldAlgorithms_TBB.exe
22M BenchmarkFilters_SERIAL.exe
22M BenchmarkFilters_TBB.exe
276K BenchmarkRayTracing_SERIAL.exe
276K BenchmarkRayTracing_TBB.exe
4.4M BenchmarkTopologyAlgorithms_SERIAL.exe
4.4M BenchmarkTopologyAlgorithms_TBB.exe
712K Rendering_SERIAL.exe
712K Rendering_TBB.exe
708K UnitTests_vtkm_cont_arg_testing.exe
1.7M UnitTests_vtkm_cont_internal_testing.exe
13M UnitTests_vtkm_cont_serial_testing.exe
14M UnitTests_vtkm_cont_tbb_testing.exe
18M UnitTests_vtkm_cont_testing.exe
13M UnitTests_vtkm_cont_testing_mpi.exe
736K UnitTests_vtkm_exec_arg_testing.exe
136K UnitTests_vtkm_exec_internal_testing.exe
196K UnitTests_vtkm_exec_serial_internal_testing.exe
196K UnitTests_vtkm_exec_tbb_internal_testing.exe
2.0M UnitTests_vtkm_exec_testing.exe
83M UnitTests_vtkm_filter_testing.exe
476K UnitTests_vtkm_internal_testing.exe
148K UnitTests_vtkm_interop_internal_testing.exe
1.3M UnitTests_vtkm_interop_testing.exe
2.9M UnitTests_vtkm_io_reader_testing.exe
548K UnitTests_vtkm_io_writer_testing.exe
792K UnitTests_vtkm_rendering_testing.exe
3.7M UnitTests_vtkm_testing.exe
320K UnitTests_vtkm_worklet_internal_testing.exe
65M UnitTests_vtkm_worklet_testing.exe
11M vtkm_cont-1.3.dll
2.1M vtkm_interop-1.3.dll
21M vtkm_rendering-1.3.dll
3.9M vtkm_worklet-1.3.dll
```
After making the single change to the `Invocation` object to add the
`DeviceAdapterTag` as a template parameter (which should be what causes any extra
template instantiations), the compile took **4 minutes and 5 seconds**. Here is the
new list of binaries.
```
kmorel2 0> du -sh *.exe *.dll
200K BenchmarkArrayTransfer_SERIAL.exe
204K BenchmarkArrayTransfer_TBB.exe
424K BenchmarkAtomicArray_SERIAL.exe
424K BenchmarkAtomicArray_TBB.exe
440K BenchmarkCopySpeeds_SERIAL.exe
580K BenchmarkCopySpeeds_TBB.exe
4.1M BenchmarkDeviceAdapter_SERIAL.exe
5.3M BenchmarkDeviceAdapter_TBB.exe
7.9M BenchmarkFieldAlgorithms_SERIAL.exe
7.9M BenchmarkFieldAlgorithms_TBB.exe
22M BenchmarkFilters_SERIAL.exe
22M BenchmarkFilters_TBB.exe
276K BenchmarkRayTracing_SERIAL.exe
276K BenchmarkRayTracing_TBB.exe
4.4M BenchmarkTopologyAlgorithms_SERIAL.exe
4.4M BenchmarkTopologyAlgorithms_TBB.exe
712K Rendering_SERIAL.exe
712K Rendering_TBB.exe
708K UnitTests_vtkm_cont_arg_testing.exe
1.7M UnitTests_vtkm_cont_internal_testing.exe
13M UnitTests_vtkm_cont_serial_testing.exe
14M UnitTests_vtkm_cont_tbb_testing.exe
19M UnitTests_vtkm_cont_testing.exe
13M UnitTests_vtkm_cont_testing_mpi.exe
736K UnitTests_vtkm_exec_arg_testing.exe
136K UnitTests_vtkm_exec_internal_testing.exe
196K UnitTests_vtkm_exec_serial_internal_testing.exe
196K UnitTests_vtkm_exec_tbb_internal_testing.exe
2.0M UnitTests_vtkm_exec_testing.exe
86M UnitTests_vtkm_filter_testing.exe
476K UnitTests_vtkm_internal_testing.exe
148K UnitTests_vtkm_interop_internal_testing.exe
1.3M UnitTests_vtkm_interop_testing.exe
2.9M UnitTests_vtkm_io_reader_testing.exe
548K UnitTests_vtkm_io_writer_testing.exe
792K UnitTests_vtkm_rendering_testing.exe
3.7M UnitTests_vtkm_testing.exe
320K UnitTests_vtkm_worklet_internal_testing.exe
68M UnitTests_vtkm_worklet_testing.exe
11M vtkm_cont-1.3.dll
2.1M vtkm_interop-1.3.dll
21M vtkm_rendering-1.3.dll
3.9M vtkm_worklet-1.3.dll
```
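So far the increase is quite negligible: the build took 15 seconds longer
(about 6.5%), and only three test executables grew
(`UnitTests_vtkm_cont_testing` from 18M to 19M,
`UnitTests_vtkm_filter_testing` from 83M to 86M, and
`UnitTests_vtkm_worklet_testing` from 65M to 68M).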