Minimize CUDA synchronizations
- Keep a per-thread pinned array for CUDA errors
- Check for errors before scheduling new tasks and at explicit sync points
- Remove explicit synchronizations from most places
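
The pattern in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this change: `ErrorSlots`, `init_error_slots`, `first_error_slot`, and `checked_sqrt` are hypothetical names, and the sketch assumes the device can map pinned host memory (`cudaHostAllocMapped`). Kernels write an error code into a pinned slot, and the host reads the slots without synchronizing before it schedules more work, then once more at an explicit sync point.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// All names below are hypothetical; they illustrate the pattern only.
constexpr int kErrorSlots = 8;

// Pinned, device-mapped error slots owned by one host thread.
struct ErrorSlots {
    int* host = nullptr;    // host view of the pinned allocation
    int* device = nullptr;  // device view of the same memory
};
thread_local ErrorSlots t_errors;

// Allocate and zero the calling thread's error slots.
cudaError_t init_error_slots() {
    cudaError_t st = cudaHostAlloc(reinterpret_cast<void**>(&t_errors.host),
                                   kErrorSlots * sizeof(int), cudaHostAllocMapped);
    if (st != cudaSuccess) return st;
    for (int i = 0; i < kErrorSlots; ++i) t_errors.host[i] = 0;
    return cudaHostGetDevicePointer(reinterpret_cast<void**>(&t_errors.device),
                                    t_errors.host, 0);
}

// Non-blocking check: index of the first non-zero slot, or -1 if all are clear.
int first_error_slot(const volatile int* slots, int count) {
    for (int i = 0; i < count; ++i)
        if (slots[i] != 0) return i;
    return -1;
}

// Example task: reports a negative input through its slot and leaves it untouched.
__global__ void checked_sqrt(float* data, int n, volatile int* errors, int slot) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (data[i] < 0.0f) { errors[slot] = 1; return; }
    data[i] = sqrtf(data[i]);
}

int main() {
    if (init_error_slots() != cudaSuccess) return 1;

    const int n = 256;
    float host_data[n];
    for (int i = 0; i < n; ++i) host_data[i] = static_cast<float>(i);
    host_data[10] = -1.0f;  // deliberately invalid input

    float* dev_data = nullptr;
    cudaStream_t stream;
    if (cudaMalloc(reinterpret_cast<void**>(&dev_data), n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;
    cudaMemcpyAsync(dev_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice, stream);

    for (int task = 0; task < 2; ++task) {
        // Check before scheduling: reads pinned memory, does not synchronize.
        // Errors from kernels still in flight may not be visible yet.
        int bad = first_error_slot(t_errors.host, kErrorSlots);
        if (bad >= 0) { printf("task %d reported an error, not scheduling more\n", bad); break; }
        checked_sqrt<<<(n + 127) / 128, 128, 0, stream>>>(dev_data, n, t_errors.device, task);
    }

    // Explicit sync point: errors written by completed kernels are visible now.
    cudaStreamSynchronize(stream);
    int bad = first_error_slot(t_errors.host, kErrorSlots);
    if (bad >= 0) printf("error reported by task %d\n", bad);

    cudaStreamDestroy(stream);
    cudaFree(dev_data);
    cudaFreeHost(t_errors.host);
    return bad >= 0 ? 1 : 0;
}
```

The trade-off in this sketch is that the pre-scheduling check can miss an error from a kernel that has not finished yet. Such an error is picked up by a later check or at the next explicit sync point, which is why the sketch checks in both places.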
Addresses part 2 of #168 (closed)