Async Timer Log Discussion
This is a design document for the timer log implementation in VTK-m. Feedback, comments, and suggestions are welcome.
Motivation
Our current timer design is insufficient:
- Only two operations: Construct/Reset and GetElapsedTime, used to start and "stop" the timer.
- No way to freeze the timer at a specific time for later logging.
- Must be synchronous. Timer can only query current elapsed time, triggering a sync.
Requirements
We need an asynchronous device timing log to properly profile execution times in asynchronous execution environments:
- vtkRenderTimerLog is a good example of the sort of API we need.
- Can start / stop timers without immediately retrieving elapsed time
- Removes unnecessary device synchronization
- Requires either or both:
- API to check if a result is ready (a fence is set after the timers; log frames must be cached or dropped to avoid framerate impacts).
- Single synchronization point for all timers in log once per frame (useful for debugging, inappropriate for production)
- Manage multiple timers
- Timer scoping
- Starting a new timer while an existing timer is running creates a new scope.
- Child timers must be stopped before parent timers.
- Better organization of logs
- For timers inside a scope, print the fraction of the parent scope's time used by each child timer/scope.
- Add RAII convenience classes for easier use.
- Per-thread logging
- Keeps scoping sane.
- Ability to manually insert other timers / logs (e.g. combining timer logs from worker threads into parent)
- Implies that a scope's children may use more time total than the parent.
- Track times in both the execution and control environments, reporting asynchronous job timings at sync points, along with time spent waiting for sync to finish.
- Ability to both easily dump to a stream for debugging and provide a parseable tree structure for application use.
Example Output
[Scope indicator] [% of parent scope] [Time] | [Cont/Exec indicator] [Title]
> 100.00% 1.000000s | (C) Frame
|-> 50.00% 0.500000s | (C) Simulation/Analysis
. |-> 25.00% 0.125000s | (C) Filter 1
. . |-> 40.00% 0.050000s | (C) Serial Preparation
. . |-> 40.00% 0.050000s | (C) CPU Parallel Operation
. . . ==> 100.00% 0.050000s | (E) Worker Thread 1
. . . ==> 100.00% 0.050000s | (E) Worker Thread 2
. . |-> 10.00% 0.012500s | (C) Launch "Filter 1: GPU Parallel Operation 1"
. . |-> 10.00% 0.012500s | (C) Launch "Filter 1: GPU Parallel Operation 2"
. |-> 50.00% 0.250000s | (C) Filter 2
. . |-> 20.00% 0.025000s | (C) Serial Preparation
. . . |-> 80.00% 0.020000s | (C) GPU Sync "ArrayHandle::GetPortalConstControl()"
. . . . ==> 150.00% 0.030000s | (E) Filter 1: GPU Parallel Operation 1
. . . . ==> 75.00% 0.015000s | (E) Filter 1: GPU Parallel Operation 2
. . |-> 80.00% 0.100000s | (C) CPU Parallel Operation
. . . ==> 50.00% 0.050000s | (E) Worker Thread 1
. . . ==> 100.00% 0.100000s | (E) Worker Thread 2
. . . ==> 25.00% 0.025000s | (E) Worker Thread 3
. . . ==> 10.00% 0.010000s | (E) Worker Thread 4
. |-> 25.00% 0.125000s | (C) Filter 3
. . |-> 20.00% 0.025000s | (C) Serial Preparation
. . |-> 80.00% 0.100000s | (C) CPU Parallel Operation
. . . ==> 100.00% 0.100000s | (E) Worker Thread 1
. . . ==> 100.00% 0.100000s | (E) Worker Thread 2
. . . ==> 100.00% 0.100000s | (E) Worker Thread 3
. . . ==> 100.00% 0.100000s | (E) Worker Thread 4
|-> 50.00% 0.500000s | (C) Render
. |-> 50.00% 0.250000s | (C) Build host-side buffers/lookup tables
. . ==> 50.00% 0.125000s | (E) Worker Thread 1
. . ==> 100.00% 0.250000s | (E) Worker Thread 2
. . ==> 20.00% 0.050000s | (E) Worker Thread 3
. . ==> 10.00% 0.025000s | (E) Worker Thread 4
. |-> 25.00% 0.125000s | (C) Launch "Upload to device"
. |-> 25.00% 0.125000s | (C) Launch "Draw"
|-> 100.00% 0.500000s | (C) Force Logging Sync "TimerLog::LogToStream(std::ostream&)"
. ==> 20.00% 0.100000s | (E) Upload to device
. ==> 125.00% 0.625000s | (E) Draw
Description of Example Output
Here, a hypothetical frame consists of two phases, simulation and rendering, each using half of the total frame time. The log includes timings from both the control (C) and execution (E) environments.
The simulation phase consists of 3 filters, executed sequentially.
The first filter does some work in serial, then some work on parallel CPU threads, and finally launches some CUDA kernels. Some things to note:
- The serial execution takes 20% of the filter's total runtime.
- The CPU parallel execution is well balanced and magically has zero overhead: each thread takes the same amount of time to finish, utilizing 100% of the total time in the CPU parallel section.
- The GPU jobs are launched asynchronously and only the kernel launch time is recorded. The execution times aren't reported until the next sync occurs.
The second filter does both serial and CPU parallel work.
- 80% of the serial execution time is spent waiting on the GPU kernels to complete due to data dependencies.
- The GPU execution times may be >100% of the parent scope, since they may have started prior to the sync event.
- It is readily apparent that the CPU parallel workload is horribly balanced.
The third filter also has serial and CPU parallel sections, which are much better balanced.
The render timing consists of three phases: build, upload, and draw.
The build phase is a badly balanced CPU parallel workload.
Uploading and drawing jobs are launched on the GPU, and then the frame ends.
At the end of the frame, the log is dumped to std::cout. This forces a sync that finishes any pending asynchronous device timers (the rendering jobs), whose times are then reported.