release-notes.md 51 KB
Newer Older
Robert Maynard's avatar
Robert Maynard committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
VTK-m 1.4 Release Notes
=======================

# Table of Contents
1. [Core](#Core)
    - Remove templates from `ControlSignature` field tags
    - Worklets can now be specialized for a specific device adapter
    - Worklets now support an execution mask
    - Redesign VTK-m Runtime Device Tracking
    - `vtkm::cont::Initialize` added to make setting up VTK-m runtime state easier
2. [ArrayHandle](#ArrayHandle)
    - Add `vtkm::cont::ArrayHandleVirtual`
    - `vtkm::cont::ArrayHandleZip` provides a consistent API even with non-writable handles
    - `vtkm::cont::VariantArrayHandle` replaces `vtkm::cont::DynamicArrayHandle`
    - `vtkm::cont::VariantArrayHandle` CastAndCall supports casting to concrete types
    - `vtkm::cont::VariantArrayHandle::AsVirtual<T>()` performs casting
    - `StorageBasic::StealArray()` now provides delete function to new owner
3. [Control Environment](#Control-Environment)
    - `vtkm::cont::CellLocatorGeneral` has been added
    - `vtkm::cont::CellLocatorTwoLevelUniformGrid` has been renamed to `vtkm::cont::CellLocatorUniformBins`
    - `vtkm::cont::Timer` now supports  asynchronous and device independent timers
    - `vtkm::cont::DeviceAdapterId` construction from strings are now case-insensitive
    - `vtkm::cont::Initialize` will only parse known arguments
4. [Execution Environment](#Execution-Environment)
    - VTK-m logs details about each CUDA kernel launch
    - VTK-m CUDA allocations can have managed memory (cudaMallocManaged) enabled/disabled from C++
    - VTK-m CUDA kernel scheduling improved including better defaults, and user customization support
    - VTK-m Reduction algorithm now supports differing input and output types
    - Added specialized operators for ArrayPortalValueReference
5. [Worklets and Filters](#Worklets-and-Filters)
    - `vtkm::worklet::Invoker` now supports worklets which require a Scatter object
    - `BitFields` are now a support field input/out type for VTK-m worklets
    - Added a Point Merging worklet
    - `vtkm::filter::CleanGrid` now can do point merging
    - Added a connected component worklets and filters
6. [Build](#Build)
    - CMake 3.8+ now required to build VTK-m
    - VTK-m now can verify that it installs itself correctly
    - VTK-m now requires `CUDA` separable compilation to build
    - VTK-m provides a `vtkm_filter` CMake target
    - `vtkm::cont::CellLocatorBoundingIntervalHierarchy` is compiled into `vtkm_cont`
7. [Other](#Other)
    - LodePNG added as a thirdparty package
    - Optionparser added as a thirdparty package
    - Thirdparty diy now can coexist with external diy
    - Merge benchmark executables into a device dependent shared library
    - Merge rendering testing executables to a shared library
    - Merge worklet testing executables into a device dependent shared library
    - VTK-m runtime device detection properly handles busy CUDA devices

# Core

## Remove templates from `ControlSignature` field tags

Previously, several of the `ControlSignature` tags had a template to
specify a type list. This was to specify potential valid value types for an
input array. The importance of this typelist was to limit the number of
code paths created when resolving a `vtkm::cont::VariantArrayHandle`
(formerly a `DynamicArrayHandle`). This (potentially) reduced the compile
time, the size of libraries/executables, and errors from unexpected types.

Much has changed since this feature was originally implemented. Since then,
the filter infrastructure has been created, and it is through this that
most dynamic worklet invocations happen. However, since the filter
infrastrcture does its own type resolution (and has its own policies) the
type arguments in `ControlSignature` are now of little value.

### Script to update code

This update requires changes to just about all code implementing a VTK-m
worklet. To facilitate the update of this code to these new changes (not to
mention all the code in VTK-m) a script is provided to automatically remove
these template parameters from VTK-m code.

This script is at
[Utilities/Scripts/update-control-signature-tags.sh](../../Utilities/Scripts/update-control-signature-tags.sh).
It needs to be run in a Unix-compatible shell. It takes a single argument,
which is a top level directory to modify files. The script processes all C++
source files recursively from that directory.

### Selecting data types for auxiliary filter fields

The main rational for making these changes is that the types of the inputs
to worklets is almost always already determined by the calling filter.
However, although it is straightforward to specify the type of the "main"
(active) scalars in a filter, it is less clear what to do for additional
fields if a filter needs a second or third field.

Typically, in the case of a second or third field, it is up to the
`DoExecute` method in the filter implementation to apply a policy to that
field. When applying a policy, you give it a policy object (nominally
passed by the user) and a traits of the filter. Generally, the accepted
list of types for a field should be part of the filter's traits. For
example, consider the `WarpVector` filter. This filter only works on
`Vec`s of size 3, so its traits class looks like this.

``` cpp
template <>
class FilterTraits<WarpVector>
{
public:
  // WarpVector can only applies to Float and Double Vec3 arrays
  using InputFieldTypeList = vtkm::TypeListTagFieldVec3;
};
```

However, the `WarpVector` filter also requires two fields instead of one.
The first (active) field is handled by its superclass (`FilterField`), but
the second (auxiliary) field must be managed in the `DoExecute`. Generally,
this can be done by simply applying the policy with the filter traits.

### The corner cases

Most of the calls to worklets happen within filter implementations, which
have their own way of narrowing down potential types (as previously
described). The majority of the remainder either use static types or work
with a variety of types.

However, there is a minority of corner cases that require a reduction of
types. Since the type argument of the worklet `ControlSignature` arguments
are no longer available, the narrowing of types must be done before the
call to `Invoke`.

This narrowing of arguments is not particularly difficult. Such type-unsure
arguments usually come from a `VariantArrayHandle` (or something that uses
one). You can select the types from a `VariantArrayHandle` simply by using
the `ResetTypes` method. For example, say you know that a variant array is
supposed to be a scalar.

``` cpp
dispatcher.Invoke(variantArray.ResetTypes(vtkm::TypeListTagFieldScalar()),
                  staticArray);
```

Even more common is to have a `vtkm::cont::Field` object. A `Field` object
internally holds a `VariantArrayHandle`, which is accessible via the
`GetData` method.

``` cpp
dispatcher.Invoke(field.GetData().ResetTypes(vtkm::TypeListTagFieldScalar()),
                  staticArray);
```

### Change in executable size

The whole intention of these template parameters in the first place was to
reduce the number of code paths compiled. The hypothesis of this change was
that in the current structure the code paths were not being reduced much
if at all. If that is true, the size of executables and libraries should
not change.

Here is a recording of the library and executable sizes before this change
(using `ds -h`).

```
3.0M    libvtkm_cont-1.2.1.dylib
6.2M    libvtkm_rendering-1.2.1.dylib
312K    Rendering_SERIAL
312K    Rendering_TBB
 22M    Worklets_SERIAL
 23M    Worklets_TBB
 22M    UnitTests_vtkm_filter_testing
5.7M    UnitTests_vtkm_cont_serial_testing
6.0M    UnitTests_vtkm_cont_tbb_testing
7.1M    UnitTests_vtkm_cont_testing
```

After the changes, the executable sizes are as follows.

```
3.0M    libvtkm_cont-1.2.1.dylib
6.0M    libvtkm_rendering-1.2.1.dylib
312K    Rendering_SERIAL
312K    Rendering_TBB
 21M    Worklets_SERIAL
 21M    Worklets_TBB
 22M    UnitTests_vtkm_filter_testing
5.6M    UnitTests_vtkm_cont_serial_testing
6.0M    UnitTests_vtkm_cont_tbb_testing
7.1M    UnitTests_vtkm_cont_testing
```

As we can see, the built sizes have not changed significantly. (If
anything, the build is a little smaller.)


## Worklets can now be specialized for a specific device adapter

This change adds an execution signature tag named `Device` that passes
a `DeviceAdapterTag` to the worklet's parenthesis operator. This allows the
worklet to specialize its operation. This features is available in all
worklets.

The following example shows a worklet that specializes itself for the CUDA
device.

```cpp
struct DeviceSpecificWorklet : vtkm::worklet::WorkletMapField
{
  using ControlSignature = void(FieldIn, FieldOut);
  using ExecutionSignature = _2(_1, Device);

  // Specialization for the Cuda device.
  template <typename T>
  T operator()(T x, vtkm::cont::DeviceAdapterTagCuda) const
  {
    // Special cuda implementation
  }

  // General implementation
  template <typename T, typename Device>
  T operator()(T x, Device) const
  {
    // General implementation
  }
};
```

### Effect on compile time and binary size

This change necessitated adding a template parameter for the device that
followed at least from the schedule all the way down. This has the
potential for duplicating several of the support methods (like
`DoWorkletInvokeFunctor`) that would otherwise have the same type. This is
especially true between the devices that run on the CPU as they should all
be sharing the same portals from `ArrayHandle`s. So the question is whether
it causes compile to take longer or cause a significant increase in
binaries.

To informally test, I first ran a clean debug compile on my Windows machine
with the serial and tbb devices. The build itself took **3 minutes, 50
seconds**. Here is a list of the binary sizes in the bin directory:

```
kmorel2 0> du -sh *.exe *.dll
200K    BenchmarkArrayTransfer_SERIAL.exe
204K    BenchmarkArrayTransfer_TBB.exe
424K    BenchmarkAtomicArray_SERIAL.exe
424K    BenchmarkAtomicArray_TBB.exe
440K    BenchmarkCopySpeeds_SERIAL.exe
580K    BenchmarkCopySpeeds_TBB.exe
4.1M    BenchmarkDeviceAdapter_SERIAL.exe
5.3M    BenchmarkDeviceAdapter_TBB.exe
7.9M    BenchmarkFieldAlgorithms_SERIAL.exe
7.9M    BenchmarkFieldAlgorithms_TBB.exe
22M     BenchmarkFilters_SERIAL.exe
22M     BenchmarkFilters_TBB.exe
276K    BenchmarkRayTracing_SERIAL.exe
276K    BenchmarkRayTracing_TBB.exe
4.4M    BenchmarkTopologyAlgorithms_SERIAL.exe
4.4M    BenchmarkTopologyAlgorithms_TBB.exe
712K    Rendering_SERIAL.exe
712K    Rendering_TBB.exe
708K    UnitTests_vtkm_cont_arg_testing.exe
1.7M    UnitTests_vtkm_cont_internal_testing.exe
13M     UnitTests_vtkm_cont_serial_testing.exe
14M     UnitTests_vtkm_cont_tbb_testing.exe
18M     UnitTests_vtkm_cont_testing.exe
13M     UnitTests_vtkm_cont_testing_mpi.exe
736K    UnitTests_vtkm_exec_arg_testing.exe
136K    UnitTests_vtkm_exec_internal_testing.exe
196K    UnitTests_vtkm_exec_serial_internal_testing.exe
196K    UnitTests_vtkm_exec_tbb_internal_testing.exe
2.0M    UnitTests_vtkm_exec_testing.exe
83M     UnitTests_vtkm_filter_testing.exe
476K    UnitTests_vtkm_internal_testing.exe
148K    UnitTests_vtkm_interop_internal_testing.exe
1.3M    UnitTests_vtkm_interop_testing.exe
2.9M    UnitTests_vtkm_io_reader_testing.exe
548K    UnitTests_vtkm_io_writer_testing.exe
792K    UnitTests_vtkm_rendering_testing.exe
3.7M    UnitTests_vtkm_testing.exe
320K    UnitTests_vtkm_worklet_internal_testing.exe
65M     UnitTests_vtkm_worklet_testing.exe
11M     vtkm_cont-1.3.dll
2.1M    vtkm_interop-1.3.dll
21M     vtkm_rendering-1.3.dll
3.9M    vtkm_worklet-1.3.dll
```

After making the singular change to the `Invocation` object to add the
`DeviceAdapterTag` as a template parameter (which should cause any extra
compile instances) the compile took **4 minuts and 5 seconds**. Here is the
new list of binaries.

```
kmorel2 0> du -sh *.exe *.dll
200K    BenchmarkArrayTransfer_SERIAL.exe
204K    BenchmarkArrayTransfer_TBB.exe
424K    BenchmarkAtomicArray_SERIAL.exe
424K    BenchmarkAtomicArray_TBB.exe
440K    BenchmarkCopySpeeds_SERIAL.exe
580K    BenchmarkCopySpeeds_TBB.exe
4.1M    BenchmarkDeviceAdapter_SERIAL.exe
5.3M    BenchmarkDeviceAdapter_TBB.exe
7.9M    BenchmarkFieldAlgorithms_SERIAL.exe
7.9M    BenchmarkFieldAlgorithms_TBB.exe
22M     BenchmarkFilters_SERIAL.exe
22M     BenchmarkFilters_TBB.exe
276K    BenchmarkRayTracing_SERIAL.exe
276K    BenchmarkRayTracing_TBB.exe
4.4M    BenchmarkTopologyAlgorithms_SERIAL.exe
4.4M    BenchmarkTopologyAlgorithms_TBB.exe
712K    Rendering_SERIAL.exe
712K    Rendering_TBB.exe
708K    UnitTests_vtkm_cont_arg_testing.exe
1.7M    UnitTests_vtkm_cont_internal_testing.exe
13M     UnitTests_vtkm_cont_serial_testing.exe
14M     UnitTests_vtkm_cont_tbb_testing.exe
19M     UnitTests_vtkm_cont_testing.exe
13M     UnitTests_vtkm_cont_testing_mpi.exe
736K    UnitTests_vtkm_exec_arg_testing.exe
136K    UnitTests_vtkm_exec_internal_testing.exe
196K    UnitTests_vtkm_exec_serial_internal_testing.exe
196K    UnitTests_vtkm_exec_tbb_internal_testing.exe
2.0M    UnitTests_vtkm_exec_testing.exe
86M     UnitTests_vtkm_filter_testing.exe
476K    UnitTests_vtkm_internal_testing.exe
148K    UnitTests_vtkm_interop_internal_testing.exe
1.3M    UnitTests_vtkm_interop_testing.exe
2.9M    UnitTests_vtkm_io_reader_testing.exe
548K    UnitTests_vtkm_io_writer_testing.exe
792K    UnitTests_vtkm_rendering_testing.exe
3.7M    UnitTests_vtkm_testing.exe
320K    UnitTests_vtkm_worklet_internal_testing.exe
68M     UnitTests_vtkm_worklet_testing.exe
11M     vtkm_cont-1.3.dll
2.1M    vtkm_interop-1.3.dll
21M     vtkm_rendering-1.3.dll
3.9M    vtkm_worklet-1.3.dll
```

So far the increase is quite negligible.

## Worklets now support an execution mask

There have recently been use cases where it would be helpful to mask out
some of the invocations of a worklet. The idea is that when invoking a
worklet with a mask array on the input domain, you might implement your
worklet more-or-less like the following.

```cpp
VTKM_EXEC void operator()(bool mask, /* other parameters */)
{
  if (mask)
  {
    // Do interesting stuff
  }
}
```

This works, but what if your mask has mostly false values? In that case,
you are spending tons of time loading data to and from memory where fields
are stored for no reason.

You could potentially get around this problem by adding a scatter to the
worklet. However, that will compress the output arrays to only values that
are active in the mask. That is problematic if you want the masked output
in the appropriate place in the original arrays. You will have to do some
complex (and annoying and possibly expensive) permutations of the output
arrays.

Thus, we would like a new feature similar to scatter that instead masks out
invocations so that the worklet is simply not run on those outputs.

### New Interface

The new "Mask" feature that is similar (and orthogonal) to the existing
"Scatter" feature. Worklet objects now define a `MaskType` that provides on
object that manages the selections of which invocations are skipped. The
following Mask objects are defined.

  * `MaskNone` - This removes any mask of the output. All outputs are
    generated. This is the default if no `MaskType` is explicitly defined.
  * `MaskSelect` - The user to provides an array that specifies whether
    each output is created with a 1 to mean that the output should be
    created an 0 the mean that it should not.
  * `MaskIndices` - The user provides an array with a list of indices for
    all outputs that should be created.

It will be straightforward to implement other versions of masks. (For
example, you could make a mask class that selectes every Nth entry.) Those
could be made on an as-needed basis.

### Implementation

The implementation follows the same basic idea of how scatters are
implemented.

#### Mask Classes

The mask class is required to implement the following items.

  * `ThreadToOutputType` - A type for an array that maps a thread index (an
    index in the array) to an output index. A reasonable type for this
    could be `vtkm::cont::ArrayHandle<vtkm::Id>`.
  * `GetThreadToOutputMap` - Given the range for the output (e.g. the
    number of items in the output domain), returns an array of type
    `ThreadToOutputType` that is the actual map.
  * `GetThreadRange` - Given a range for the output (e.g. the number of
    items in the output domain), returns the range for the threads (e.g.
    the number of times the worklet will be invoked).

#### Dispatching

The `vtkm::worklet::internal::DispatcherBase` manages a mask class in
the same way it manages the scatter class. It gets the `MaskType` from
the worklet it is templated on. It requires a `MaskType` object during
its construction.

Previously the dispatcher (and downstream) had to manage the range and
indices of inputs and threads. They now have to also manage a separate
output range/index as now all three may be different.

The `vtkm::Invocation` is changed to hold the ThreadToOutputMap array from
the mask. It likewises has a templated `ChangeThreadToOutputMap` method
added (similar to those already existing for the arrays from a scatter).
This method is used in `DispatcherBase::InvokeTransportParameters` to add
the mask's array to the invocation before calling `InvokeSchedule`.

#### Thread Indices

With the addition of masks, the `ThreadIndices` classes are changed to
manage the actual output index. Previously, the output index was always the
same as the thread index. However, now these two can be different. The
`GetThreadIndices` methods of the worklet base classes have an argument
added that is the portal to the ThreadToOutputMap.

The worklet `GetThreadIndices` is called from the `Task` classes. These
classes are changed to pass in this additional argument. Since the `Task`
classes get an `Invocation` object from the dispatcher, which contains the
`ThreadToOutputMap`, this change is trivial.

### Interaction Between Mask and Scatter

Although it seems weird, it should work fine to mix scatters and masks. The
scatter will first be applied to the input to generate a (potential) list
of output elements. The mask will then be applied to these output elements.


## Redesign VTK-m Runtime Device Tracking

The device tracking infrastructure in VTK-m has been redesigned to
remove multiple redundant codes paths and to simplify reasoning
about around what an instance of RuntimeDeviceTracker will modify.


`vtkm::cont::RuntimeDeviceTracker` tracks runtime information on
a per-user thread basis. This is done to allow multiple calling
threads to use different vtk-m backends such as seen in this
example:

```cpp
  vtkm::cont::DeviceAdapterTagCuda cuda;
  vtkm::cont::DeviceAdapterTagOpenMP openmp;
  { // thread 1
    auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
    tracker->ForceDevice(cuda);
    vtkm::worklet::Invoker invoke;
    invoke(LightTask{}, input, output);
    vtkm::cont::Algorithm::Sort(output);
    invoke(HeavyTask{}, output);
  }

 { // thread 2
    auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
    tracker->ForceDevice(openmp);
    vtkm::worklet::Invoker invoke;
    invoke(LightTask{}, input, output);
    vtkm::cont::Algorithm::Sort(output);
    invoke(HeavyTask{}, output);
  }
```

Note: `GetGlobalRuntimeDeviceTracker` has ben refactored to be `GetRuntimeDeviceTracker`
as it always returned a unique instance for each control side thread. This design allows
for different threads to have different runtime device settings. By removing the term `Global`
from the name it becomes more clear what scope this class has.

While this address the ability for threads to specify what
device they should run on. It doesn't make it easy to toggle
the status of a device in a programmatic way, for example
the following block forces execution to only occur on
`cuda` and doesn't restore previous active devices after

```cpp
  {
  vtkm::cont::DeviceAdapterTagCuda cuda;
  auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
  tracker->ForceDevice(cuda);
  vtkm::worklet::Invoker invoke;
  invoke(LightTask{}, input, output);
  }
  //openmp/tbb/... still inactive
```

To resolve those issues we have `vtkm::cont::ScopedRuntimeDeviceTracker` which
has the same interface as `vtkm::cont::RuntimeDeviceTracker` but additionally
resets any per-user thread modifications when it goes out of scope. So by
switching over the previous example to use `ScopedRuntimeDeviceTracker` we
correctly restore the threads `RuntimeDeviceTracker` state when `tracker`
goes out of scope.
```cpp
  {
  vtkm::cont::DeviceAdapterTagCuda cuda;
  vtkm::cont::ScopedRuntimeDeviceTracker tracker(cuda);
  vtkm::worklet::Invoker invoke;
  invoke(LightTask{}, input, output);
  }
  //openmp/tbb/... are now again active
```

The  `vtkm::cont::ScopedRuntimeDeviceTracker` is not limited to forcing
execution to occur on a single device. When constructed it can either force
execution to a device, disable a device or enable a device. These options
also work with the `DeviceAdapterTagAny`.


```cpp
  {
  //enable all devices
  vtkm::cont::DeviceAdapterTagAny any;
  vtkm::cont::ScopedRuntimeDeviceTracker tracker(any,
                                                 vtkm::cont::RuntimeDeviceTrackerMode::Enable);
  ...
  }

  {
  //disable only cuda
  vtkm::cont::DeviceAdapterTagCuda cuda;
  vtkm::cont::ScopedRuntimeDeviceTracker tracker(cuda,
                                                 vtkm::cont::RuntimeDeviceTrackerMode::Disable);

  ...
  }
```


## `vtkm::cont::Initialize` added to make setting up VTK-m runtime state easier

A new initialization function, `vtkm::cont::Initialize`, has been added.
Initialization is not required, but will configure the logging utilities (when
enabled) and allows forcing a device via a `-d` or `--device` command line
option.


Usage:

```cpp
#include <vtkm/cont/Initialize.h>

int main(int argc, char *argv[])
{
  auto config = vtkm::cont::Initialize(argc, argv);

  ...
}
```


# ArrayHandle

## Add `vtkm::cont::ArrayHandleVirtual`

Added a new class named `vtkm::cont::ArrayHandleVirtual` that allows you to type erase an
ArrayHandle storage type by using virtual calls. This simplification makes
storing `Fields` and `Coordinates` significantly easier as VTK-m doesn't
need to deduce both the storage and value type when executing worklets.

To construct an `vtkm::cont::ArrayHandleVirtual` one can do the following:

```cpp
vtkm::cont::ArrayHandle<vtkm::Float32> pressure;
vtkm::cont::ArrayHandleConstant<vtkm::Float32> constant(42.0f);


// constrcut from an array handle
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> v(pressure);

// or assign from an array handle
v = constant;

```

To help maintain performance `vtkm::cont::ArrayHandleVirtual` provides a collection of helper
functions/methods to query and cast back to the concrete storage and value type:
```cpp
vtkm::cont::ArrayHandleConstant<vtkm::Float32> constant(42.0f);
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> v = constant;

const bool isConstant = vtkm::cont::IsType< decltype(constant) >(v);
if(isConstant)
  vtkm::cont::ArrayHandleConstant<vtkm::Float32> t = vtkm::cont::Cast< decltype(constant) >(v);

```

Lastly, a common operation of calling code using `ArrayHandleVirtual` is a desire to construct a new instance
of an existing virtual handle with the same storage type. This can be done by using the `NewInstance` method
as seen below
```cpp
vtkm::cont::ArrayHandle<vtkm::Float32> pressure;
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> v = pressure;

vtkm::cont::ArrayHandleVirtual<vtkm::Float32> newArray = v->NewInstance();
bool isConstant = vtkm::cont::IsType< vtkm::cont::ArrayHandle<vtkm::Float32> >(newArray); //will be true
```


## `vtkm::cont::ArrayHandleZip` provides a consistent API even with non-writable handles

Previously `vtkm::cont::ArrayHandleZip` could not wrap an implicit handle and provide a consistent experience.
The primary issue was that if you tried to use the PortalType returned by `GetPortalControl()` you
would get a compile failure. This would occur as the PortalType returned would try to call `Set`
on an ImplicitPortal which doesn't have a set method.

Now with this change, the `ZipPortal` use SFINAE to determine if `Set` and `Get` should call the
underlying zipped portals.


## `vtkm::cont::VariantArrayHandle` replaces `vtkm::cont::DynamicArrayHandle`

`vtkm::cont::ArrayHandleVariant` replaces `vtkm::cont::DynamicArrayHandle` as the
primary method for holding onto a type erased `vtkm::cont::ArrayHandle`. The major
difference between the two implementations is how they handle the Storage component of
an array handle.

`vtkm::contDynamicArrayHandle` approach was to find the fully deduced type of the `ArrayHandle`
meaning it would check all value and storage types it knew about until it found a match.
This cross product of values and storages would cause significant compilation times when
a `DynamicArrayHandle` had multiple storage types.

`vtkm::cont::VariantArrayHandle` approach is to only deduce the value type of the `ArrayHandle`
and return a `vtkm::cont::ArrayHandleVirtual` which uses polymorpishm to hide the actual
storage type. This approach allows for better compile times, and for calling code
to always expect an `ArrayHandleVirtual` instead of the fully deduced type. This conversion
to `ArrayHandleVirtual` is usually done internally within VTK-m when a  worklet or filter
is invoked.

In certain cases users of `VariantArrayHandle` want to be able to access the concrete
`ArrayHandle<T,S>` and not have it wrapped in a `ArrayHandleVirtual`. For those occurrences
`VariantArrayHandle` provides a collection of helper functions/methods to query and
cast back to the concrete storage and value type:

```cpp
vtkm::cont::ArrayHandleConstant<vtkm::Float32> constant(42.0f);
vtkm::cont::ArrayHandleVariant v(constant);

const bool isConstant = vtkm::cont::IsType< decltype(constant) >(v);
if(isConstant)
  vtkm::cont::ArrayHandleConstant<vtkm::Float32> t = vtkm::cont::Cast< decltype(constant) >(v);

```

Lastly, a common operation of calling code using `VariantArrayHandle` is a desire to construct a new instance
of an existing virtual handle with the same storage type. This can be done by using the `NewInstance` method
as seen below:

```cpp
vtkm::cont::ArrayHandle<vtkm::Float32> pressure;
vtkm::cont::ArrayHandleVariant v(pressure);

vtkm::cont::ArrayHandleVariant newArray = v->NewInstance();
const bool isConstant = vtkm::cont::IsType< decltype(pressure) >(newArray); //will be true
```


## `vtkm::cont::VariantArrayHandle` CastAndCall supports casting to concrete types

Previously, the `VariantArrayHandle::CastAndCall` (and indirect calls through
`vtkm::cont::CastAndCall`) attempted to cast to only
`vtkm::cont::ArrayHandleVirtual` with different value types. That worked, but
it meant that whatever was called had to operate through virtual functions.

Under most circumstances, it is worthwhile to also check for some common
storage types that, when encountered, can be accessed much faster. This
change provides the casting to concrete storage types and now uses
`vtkm::cont::ArrayHandleVirtual` as a fallback when no concrete storage
type is found.

By default, `CastAndCall` checks all the storage types in
`VTKM_DEFAULT_STORAGE_LIST_TAG`, which typically contains only the basic
storage. The `ArrayHandleVirtual::CastAndCall` method also allows you to
override this behavior by specifying a different type list in the first
argument. If the first argument is a list type, `CastAndCall` assumes that
all the types in the list are storage tags. If you pass in
`vtkm::ListTagEmpty`, then `CastAndCall` will always cast to an
`ArrayHandleVirtual` (the previous behavior). Alternately, you can pass in
storage tags that might be likely under the current usage.

As an example, consider the following simple code.

``` cpp
vtkm::cont::VariantArrayHandle array;

// stuff happens

array.CastAndCall(myFunctor);
```

Previously, `myFunctor` would be called with
`vtkm::cont::ArrayHandleVirtual<T>` with different type `T`s. After this
change, `myFunctor` will be called with that and with
`vtkm::cont::ArrayHandle<T>` of the same type `T`s.

If you want to only call `myFunctor` with
`vtkm::cont::ArrayHandleVirtual<T>`, then replace the previous line with

``` cpp
array.CastAndCall(vtkm::ListTagEmpty(), myFunctor);
```

Let's say that additionally using `vtkm::cont::ArrayHandleIndex` was also
common. If you want to also specialize for that array, you can do so with
the following line.

``` cpp
array.CastAndCall(vtkm::ListTagBase<vtkm::cont::StorageBasic,
                                    vtkm::cont::ArrayHandleIndex::StorageTag>,
                  myFunctor);
```

Note that `myFunctor` will be called with
`vtkm::cont::ArrayHandle<T,vtkm::cont::ArrayHandleIndex::StorageTag>`, not
`vtkm::cont::ArrayHandleIndex`.


## `vtkm::cont::VariantArrayHandle::AsVirtual<T>()` performs casting

The `AsVirtual<T>` method of `vtkm::cont::VariantArrayHandle` now works for
any arithmetic type, not just the actual type of the underlying array. This
works by inserting an `ArrayHandleCast` between the underlying concrete array
and the new `ArrayHandleVirtual` when needed.


## `StorageBasic::StealArray()` now provides delete function to new owner

Memory that is stolen from VTK-m has to be freed correctly. This is required
as the memory could have been allocated with `new`, `malloc` or even `cudaMallocManaged`.

Previously it was very easy to transfer ownership of memory out of VTK-m and
either fail to capture the free function, or ask for it after the transfer
operation which would return a nullptr. Now stealing an array also
provides the free function reducing one source of memory leaks.

To properly steal memory from VTK-m you do the following:
```cpp
  vtkm::cont::ArrayHandle<T> arrayHandle;

  ...

  auto* stolen = arrayHandle.StealArray();
  T* ptr = stolen.first;
  auto free_function = stolen.second;

  ...

  free_function(ptr);
```


# Control Environment

## `vtkm::cont::CellLocatorGeneral` has been added

`vtkm::cont::CellLocatorUniformBins` can work with all kinds of datasets, but there are cell
locators that are more efficient for specific data sets. Therefore, a new cell
locator - `vtkm::cont::CellLocatorGeneral` has been implemented that can be configured to use
specialized cell locators based on its input data. A "configurator" function object
can be specified using the `SetConfigurator()` function. The configurator should
have the following signature:

```cpp
void (std::unique_ptr<vtkm::cont::CellLocator>&,
     const vtkm::cont::DynamicCellSet&,
     const vtkm::cont::CoordinateSystem&);
```

The configurator is invoked whenever the `Update` method is called and the input
has changed. The current cell locator is passed in a `std::unique_ptr`. Based on
the types of the input cellset and coordinates, and possibly some heuristics on
their values, the current cell locator's parameters can be updated, or a different
cell-locator can be instantiated and transferred to the `unique_ptr`. The default
configurator configures a `vtkm::cont::CellLocatorUniformGrid` for uniform grid datasets,
a `vtkm::cont::CellLocatorRecitlinearGrid` for rectilinear datasets, and
`vtkm::cont::CellLocatorUniformBins` for all other dataset types.

The class `CellLocatorHelper` that implemented similar functionality to `CellLocatorGeneral`
 has been removed.

## `vtkm::cont::CellLocatorTwoLevelUniformGrid` has been renamed to `vtkm::cont::CellLocatorUniformBins`

`CellLocatorTwoLevelUniformGrid` has been renamed to `CellLocatorUniformBins`
for brevity. It has been modified to be a subclass of `vtkm::cont::CellLocator`
and can be used wherever a `CellLocator` is accepted.

## `vtkm::cont::Timer` now supports  asynchronous and device independent timers

`vtkm::cont::Timer` can now track execution time on a single device or across all
enabled devices as seen below:

```cpp
vtkm::cont::Timer tbb_timer{vtkm::cont::DeviceAdaptertagTBB()};
vtkm::cont::Timer all_timer;

all_timer.Start();
tbb_timer.Start();
// Run blocking algorithm on tbb
tbb_timer.Stop();
// Run async-algorithms cuda
all_timer.Stop();

// Do more work

//Now get time for all tbb work, and tbb_cuda work
auto tbb_time = tbb_timer.GetElapsedTime();
auto all_time = tbb_timer.GetElapsedTime();
```

When `Timer` is constructed without an explicit `vtkm::cont::DeviceAdapterId` it
will track all device adapters and return the maximum elapsed time over all devices
when `GetElapsedTime` is called.


## `vtkm::cont::DeviceAdapterId` construction from strings are now case-insensitive

You can now construct a `vtkm::cont::DeviceAdapterId` from a string no matter
the case of it. The following all will construct the same `vtkm::cont::DeviceAdapterId`.

```cpp
vtkm::cont::DeviceAdapterId id1 = vtkm::cont::make_DeviceAdapterId("cuda");
vtkm::cont::DeviceAdapterId id2 = vtkm::cont::make_DeviceAdapterId("CUDA");
vtkm::cont::DeviceAdapterId id3 = vtkm::cont::make_DeviceAdapterId("Cuda");

auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
vtkm::cont::DeviceAdapterId id4 = tracker.GetDeviceAdapterId("cuda");
vtkm::cont::DeviceAdapterId id5 = tracker.GetDeviceAdapterId("CUDA");
vtkm::cont::DeviceAdapterId id6 = tracker.GetDeviceAdapterId("Cuda");
```

## `vtkm::cont::Initialize` will only parse known arguments

When a library requires reading some command line arguments through a
function like Initialize, it is typical that it will parse through
arguments it supports and then remove those arguments from `argc` and
`argv` so that the remaining arguments can be parsed by the calling
program. Recent changes to the `vtkm::cont::Initialize` function support
that.

### Use Case

Say you are creating a simple benchmark where you want to provide a command
line option `--size` that allows you to adjust the size of the data that
you are working on. However, you also want to support flags like `--device`
and `-v` that are performed by `vtkm::cont::Initialize`. Rather than have
to re-implement all of `Initialize`'s parsing, you can now first call
`Initialize` to handle its arguments and then parse the remaining objects.

The following is a simple (and rather incomplete) example:

```cpp
int main(int argc, char** argv)
{
  vtkm::cont::InitializeResult initResult = vtkm::cont::Initialize(argc, argv);

  if ((argc > 1) && (strcmp(argv[1], "--size") == 0))
  {
    if (argc < 3)
	{
	  std::cerr << "--size option requires a numeric argument" << std::endl;
	  std::cerr << "USAGE: " << argv[0] << " [options]" << std::endl;
	  std::cerr << "Options are:" << std::endl;
	  std::cerr << "  --size <number>\tSpecify the size of the data." << std::endl;
	  std::cerr << initResult.Usage << std::endl;
	  exit(1);
	}

	g_size = atoi(argv[2]);
  }

  std::cout << "Using device: " << initResult.Device.GetName() << std::endl;
```

### Additional Initialize Options

Because `vtkm::cont::Initialize` no longer has the assumption that it is responsible
for parsing _all_ arguments, some options have been added to
`vtkm::cont::InitializeOptions` to manage these different use cases. The
following options are now supported.

  * `None` A placeholder for having all options off, which is the default.
    (Same as before this change.)
  * `RequireDevice` Issue an error if the device argument is not specified.
    (Same as before this change.)
  * `DefaultAnyDevice` If no device is specified, treat it as if the user
    gave --device=Any. This means that DeviceAdapterTagUndefined will never
    be return in the result.
  * `AddHelp` Add a help argument. If `-h` or `--help` is provided, prints
    a usage statement. Of course, the usage statement will only print out
    arguments processed by VTK-m.
  * `ErrorOnBadOption` If an unknown option is encountered, the program
    terminates with an error and a usage statement is printed. If this
    option is not provided, any unknown options are returned in `argv`. If
    this option is used, it is a good idea to use `AddHelp` as well.
  * `ErrorOnBadArgument` If an extra argument is encountered, the program
    terminates with an error and a usage statement is printed. If this
    option is not provided, any unknown arguments are returned in `argv`.
  * `Strict` If supplied, Initialize treats its own arguments as the only
    ones supported by the application and provides an error if not followed
    exactly. This is a convenience option that is a combination of
    `ErrorOnBadOption`, `ErrorOnBadArgument`, and `AddHelp`.

### InitializeResult Changes

The changes in `Initialize` have also necessitated the changing of some of
the fields in the `InitializeResult` structure. The following fields are
now provided in the `InitializeResult` struct.

  * `Device` Returns the device selected in the command line arguments as a
    `DeviceAdapterId`. If no device was selected,
    `DeviceAdapterTagUndefined` is returned. (Same as before this change.)
  * `Usage` Returns a string containing the usage for the options
    recognized by `Initialize`. This can be used to build larger usage
    statements containing options for both `Initialize` and the calling
    program. See the example above.

Note that the `Arguments` field has been removed from `InitializeResult`.
This is because the unparsed arguments are now returned in the modified
`argc` and `argv`, which provides a more complete result than the
`Arguments` field did.

# Execution Environment

## VTK-m logs details about each CUDA kernel launch

The VTK-m logging infrastructure has been extended with a new log level
`KernelLaunches` which exists between `MemTransfer` and `Cast`.

This log level reports the number of blocks, threads per block, and the
PTX version of each CUDA kernel launched.

This logging level was primarily introduced to help developers that are
tracking down issues that occur when VTK-m components have been built with
different `sm_XX` flags and help people looking to do kernel performance
tuning.


## VTK-m CUDA allocations can have managed memory (cudaMallocManaged) enabled/disabled from C++

Previously it was impossible for calling code to explicitly disable cuda managed memory. This can be desirable for projects that know they don't need managed memory and are super performance critical.

```cpp
const bool usingManagedMemory = vtkm::cont::cuda::internal::CudaAllocator::UsingManagedMemory();
if(usingManagedMemory)
  {  //disable managed memory
  vtkm::cont::cuda::internal::CudaAllocator::ForceManagedMemoryOff();
  }
```


## VTK-m CUDA kernel scheduling improved including better defaults, and user customization support

VTK-m now offers a more GPU aware set of defaults for kernel scheduling.
When VTK-m first launches a kernel we do system introspection and determine
what GPU's are on the machine and than match this information to a preset
table of values. The implementation is designed in a way that allows for
VTK-m to offer both specific presets for a given GPU ( V100 ) or for
an entire generation of cards ( Pascal ).

Currently VTK-m offers preset tables for the following GPU's:
- Tesla V100
- Tesla P100

If the hardware doesn't match a specific GPU card we than try to find the
nearest know hardware generation and use those defaults. Currently we offer
defaults for
- Older than Pascal Hardware
- Pascal Hardware
- Volta+ Hardware

Some users have workloads that don't align with the defaults provided by
VTK-m. When that is the cause, it is possible to override the defaults
by binding a custom function to `vtkm::cont::cuda::InitScheduleParameters`.
As shown below:

```cpp
  ScheduleParameters CustomScheduleValues(char const* name,
                                          int major,
                                          int minor,
                                          int multiProcessorCount,
                                          int maxThreadsPerMultiProcessor,
                                          int maxThreadsPerBlock)
  {

    ScheduleParameters params  {
        64 * multiProcessorCount,  //1d blocks
        64,                        //1d threads per block
        64 * multiProcessorCount,  //2d blocks
        { 8, 8, 1 },               //2d threads per block
        64 * multiProcessorCount,  //3d blocks
        { 4, 4, 4 } };             //3d threads per block
    return params;
  }
  vtkm::cont::cuda::InitScheduleParameters(&CustomScheduleValues);
```


## VTK-m Reduction algorithm now supports differing input and output types

It is common to want to perform a reduction where the input and output types
are of differing types. A basic example would be when the input is `vtkm::UInt8`
but the output is `vtkm::UInt64`. This has been supported since v1.2, as the input
type can be implicitly convertible to the output type.

What we now support is when the input type is not implicitly convertible to the output type,
such as when the output type is `vtkm::Pair< vtkm::UInt64, vtkm::UInt64>`. For this to work
we require that the custom binary operator implements also an `operator()` which handles
the unary transformation of input to output.

An example of a custom reduction operator for differing input and output types is:

```cxx

  struct CustomMinAndMax
  {
    using OutputType = vtkm::Pair<vtkm::Float64, vtkm::Float64>;

    VTKM_EXEC_CONT
    OutputType operator()(vtkm::Float64 a) const
    {
    return OutputType(a, a);
    }

    VTKM_EXEC_CONT
    OutputType operator()(vtkm::Float64 a, vtkm::Float64 b) const
    {
      return OutputType(vtkm::Min(a, b), vtkm::Max(a, b));
    }

    VTKM_EXEC_CONT
    OutputType operator()(const OutputType& a, const OutputType& b) const
    {
      return OutputType(vtkm::Min(a.first, b.first), vtkm::Max(a.second, b.second));
    }

    VTKM_EXEC_CONT
    OutputType operator()(vtkm::Float64 a, const OutputType& b) const
    {
      return OutputType(vtkm::Min(a, b.first), vtkm::Max(a, b.second));
    }

    VTKM_EXEC_CONT
    OutputType operator()(const OutputType& a, vtkm::Float64 b) const
    {
      return OutputType(vtkm::Min(a.first, b), vtkm::Max(a.second, b));
    }
  };


```

## Added specialized operators for ArrayPortalValueReference

The `ArrayPortalValueReference` is supposed to behave just like the value it
encapsulates and does so by automatically converting to the base type when
necessary. However, when it is possible to convert that to something else,
it is possible to get errors about ambiguous overloads. To avoid these, add
specialized versions of the operators to specify which ones should be used.

Also consolidated the CUDA version of an `ArrayPortalValueReference` to the
standard one. The two implementations were equivalent and we would like
changes to apply to both.


# Worklets and Filters

## `vtkm::worklet::Invoker` now supports worklets which require a Scatter object

This change allows the `Invoker` class to support launching worklets that require
a custom scatter operation. This is done by providing the scatter as the second
argument when launch a worklet with the `()` operator.

The following example shows a scatter being provided with a worklet launch.

```cpp
struct CheckTopology : vtkm::worklet::WorkletMapPointToCell
{
  using ControlSignature = void(CellSetIn cellset, FieldOutCell);
  using ExecutionSignature = _2(FromIndices);
  using ScatterType = vtkm::worklet::ScatterPermutation<>;
  ...
};


vtkm::worklet::Ivoker invoke;
invoke( CheckTopology{}, vtkm::worklet::ScatterPermutation{}, cellset, result );
```


## `BitFields` are now a support field input/out type for VTK-m worklets

`BitFields` are:
  - Stored in memory using a contiguous buffer of bits.
  - Accessible via portals, a la ArrayHandle.
  - Portals operate on individual bits or words.
  - Operations may be atomic for safe use from concurrent kernels.

The new `BitFieldToUnorderedSet` device algorithm produces an
ArrayHandle containing the indices of all set bits, in no particular
order.

The new AtomicInterface classes provide an abstraction into bitwise
atomic operations across control and execution environments and are
used to implement the BitPortals.

BitFields may be used as boolean-typed ArrayHandles using the
ArrayHandleBitField adapter. `vtkm::cont::ArrayHandleBitField` uses atomic operations to read
and write bits in the BitField, and is safe to use in concurrent code.

For example, a simple worklet that merges two arrays based on a boolean
condition:

```cpp
class ConditionalMergeWorklet : public vtkm::worklet::WorkletMapField
{
public:
using ControlSignature = void(FieldIn cond,
                              FieldIn trueVals,
                              FieldIn falseVals,
                              FieldOut result);
using ExecutionSignature = _4(_1, _2, _3);

template <typename T>
VTKM_EXEC T operator()(bool cond, const T& trueVal, const T& falseVal) const
{
  return cond ? trueVal : falseVal;
}

};

BitField bits = ...;
auto condArray = vtkm::cont::make_ArrayHandleBitField(bits);
auto trueArray = vtkm::cont::make_ArrayHandleCounting<vtkm::Id>(20, 2, NUM_BITS);
auto falseArray = vtkm::cont::make_ArrayHandleCounting<vtkm::Id>(13, 2, NUM_BITS);
vtkm::cont::ArrayHandle<vtkm::Id> output;

vtkm::worklet::Invoker invoke( vtkm::cont::DeviceAdaptertagTBB{} );
invoke(ConditionalMergeWorklet{}, condArray, trueArray, falseArray, output);

```


## Added a Point Merging worklet

We have added `vtkm::worklet::PointMerge` which uses a virtual grid approach to
identify nearby points. The worklet works by creating a very fine but
sparsely represented locator grid. It then groups points by grid bins and
finds those within a specified radius.


## `vtkm::filter::CleanGrid` now can do point merging

The `CleanGrid` filter has been extended to use `vtkm::worklet::PointMerge` to
allow for point merging. The following flags have been added to `CleanGrid` to
modify the behavior of point merging.

  - `Set`/`GetMergePoints` - a flag to turn on/off the merging of
    duplicated coincident points. This extra operation will find points
    spatially located near each other and merge them together.
  - `Set`/`GetTolerance` - Defines the tolerance used when determining
    whether two points are considered coincident. If the
    `ToleranceIsAbsolute` flag is false (the default), then this tolerance
    is scaled by the diagonal of the points. This parameter is only used
    when merge points is on.
  - `Set`/`GetToleranceIsAbsolute` - When ToleranceIsAbsolute is false (the
     default) then the tolerance is scaled by the diagonal of the bounds of
     the dataset. If true, then the tolerance is taken as the actual
     distance to use. This parameter is only used when merge points is on.
  - `Set`/`GetFastMerge` - When FastMerge is true (the default), some
     corners are cut when computing coincident points. The point merge will
     go faster but the tolerance will not be strictly followed.


## Added a connected component worklets and filters

We have added the `vtkm::filter::ImageConnectivity` and `vtkm::filter::CellSetConnectivity` filters
to identify connected components in DataSets and the corresponding worklets. The `ImageConnectivity`
identify connected components in `vtkm::cont::CellSetStructured`, based on same field value of neighboring
cells. The `CellSetConnectivit`y identify connected components based on cell connectivity.

Currently Moore neighborhood (i.e. 8 neighboring pixels for 2D and 27 neighboring pixels
for 3D) is used for `ImageConnectivity`. For `CellSetConnectivity`, neighborhood is defined
as cells sharing a common edge.


# Build

## CMake 3.8+ now required to build VTK-m

Historically VTK-m has offered the ability to build a small
subset of device adapters with CMake 3.3. As both our primary
consumers have moved to CMake 3.8, and HPC machines continue
to provide new CMake versions we have decided to simplify
our CMake build system by requiring CMake 3.8 everywhere.


## VTK-m now can verify that it installs itself correctly

It was a fairly common occurrence of VTK-m to have a broken install
tree as it had no easy way to verify that all headers would be installed.

Now VTK-m offers a testing infrastructure that creates a temporary installed
version and compile tests that build against the installed  VTK-m version. Currently
we have tests that verify each header listed in VTK-m is installed, users can
compile a custom `vtkm::filter` that uses diy, and users can call `vtkm::rendering`.

## VTK-m now requires `CUDA` separable compilation to build

With the introduction of `vtkm::cont::ArrayHandleVirtual` and the related infrastructure, vtk-m now
requires that all CUDA code be compiled using separable compilation ( -rdc ).


## VTK-m provides a `vtkm_filter` CMake target

VTK-m now provides a `vtkm_filter` target that contains pre-built components
of filters for consuming projects.


## `vtkm::cont::CellLocatorBoundingIntervalHierarchy` is compiled into `vtkm_cont`

All of the methods in CellLocatorBoundingIntervalHierarchy were listed in
header files. This is sometimes problematic with virtual methods. Since
everything implemented in it can just be embedded in a library, move the
code into the vtkm_cont library.

These changes caused some warnings in clang to show up based on virtual
methods in other cell locators. Hence, the rest of the cell locators
have also had some of their code moved to vtkm_cont.


# Other

## LodePNG added as a thirdparty package

The lodepng library was brought is an thirdparty library.
This has allowed the VTK-m rendering library to have a robust
png decode functionality.


## Optionparser added as a thirdparty package

Previously we just took the optionparser.h file and stuck it right in
our source code. That was problematic for a variety of reasons.

  - It incorrectly assigned our license to external code.
  - It made lots of unnecessary changes to the original source (like reformatting).
  - It made it near impossible to track patches we make and updates to the original software.

Now we use the third-party system to track changes to optionparser.h
in the https://gitlab.kitware.com/third-party/optionparser repository.


## Thirdparty diy now can coexist with external diy

Previously VTK-m would leak macros that would cause an external diy
to be incorrectly mangled breaking consumers of VTK-m that used diy.

Going forward to use `diy` from VTK-m all calls must use the `vtkmdiy`
namespace instead of the `diy` namespace. This allows for VTK-m to
properly forward calls to either the external or internal version correctly.


## Merge benchmark executables into a device dependent shared library

VTK-m has been updated to replace old per device benchmark executables with a single
multi-device executable. Selection of the device adapter is done at runtime through
the `--device=` argument.


## Merge rendering testing executables to a shared library

VTK-m has been updated to replace old per device rendering testing executables with a single
multi-device executable. Selection of the device adapter is done at runtime through
the `--device=` argument.


## Merge worklet testing executables into a device dependent shared library

VTK-m has been updated to replace old per device working testing executables with a single
multi-device executable. Selection of the device adapter is done at runtime through
the `--device=` argument.

## VTK-m runtime device detection properly handles busy CUDA devices

When an application that uses VTK-m is first launched it will
do a check to see if CUDA is supported at runtime. If for
some reason that CUDA card is not allowing kernel execution
VTK-m would report the hardware doesn't have CUDA support.

This was problematic as was over aggressive in disabling CUDA
support for hardware that could support kernel execution in
the future. With the fact that every VTK-m worklet is executed
through a TryExecute it is no longer necessary to be so
aggressive in disabling CUDA support.

Now the behavior is that VTK-m considers a machine to have
CUDA runtime support if it has 1+ GPU's of Kepler or
higher hardware (SM_30+).