Currently the call to optimize can take a non-negligible amount of time for complex trees. There are a few reasons for this:
* There is no notion that optimization has already been done (memoization might help).
* Each optimize call returns an entirely new object.
* The optimize calls were built in an ad-hoc way as needed.
The optimize machinery would benefit from added memoization or perhaps a redesign to use existing AST transformer techniques. I'm not exactly sure where to start with this, but users of this project should be aware that this is an issue, and any help with mitigating it would be greatly appreciated.
The logic is split over multiple classes, but all of the classes are in the delayed_image/delayed_nodes.py file (https://gitlab.kitware.com/computer-vision/delayed_image/-/blob/main/delayed_image/delayed_nodes.py?ref_type=heads). The idea is that each node could be the root of an operation tree, and when optimize is called on a node it propagates that call down the tree. For instance, the optimize call for the DelayedWarp node will check whether it is close to the identity (and if so eliminate itself), whether its successor nodes know how to optimize themselves when followed by a warp (e.g. two warps can fuse together), and whether it can split itself into a new "DelayedWarp" and a "DelayedOverview" that factors out a scale factor.
So each node has its own logic. I think what really should be happening is that all of this logic is mapped into some sort of "tree-transformer" which contains all possible optimizations and works similarly to the way a Python AST NodeTransformer works, because these trees of operations are effectively a domain-specific AST.
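To sketch what I mean, here is a rough, self-contained toy modeled on Python's ast.NodeTransformer. The stub node classes and attributes here (Node, children, transform) are simplified stand-ins for illustration and do not match the real delayed_image API:

```python
import numpy as np

# Stub node types so the sketch is self-contained; the real classes live in
# delayed_image/delayed_nodes.py and have much richer APIs.
class Node:
    def __init__(self, *children):
        self.children = list(children)

class DelayedLoad(Node):
    ...

class DelayedWarp(Node):
    def __init__(self, child, transform):
        super().__init__(child)
        self.transform = transform  # a 3x3 affine matrix in this sketch

class TreeTransformer:
    """Visit nodes bottom-up; visit_<ClassName> methods rewrite matches."""
    def visit(self, node):
        node.children = [self.visit(c) for c in node.children]
        method = getattr(self, 'visit_' + type(node).__name__, None)
        return method(node) if method else node

class OptimizeTransformer(TreeTransformer):
    """All rewrite rules live in one place instead of on each node class."""
    def visit_DelayedWarp(self, node):
        child = node.children[0]
        # Rule: a warp close to the identity eliminates itself.
        if np.allclose(node.transform, np.eye(3)):
            return child
        # Rule: two consecutive warps fuse into a single warp.
        if isinstance(child, DelayedWarp):
            return DelayedWarp(child.children[0],
                               node.transform @ child.transform)
        return node

# An identity warp over a scaling warp over a load collapses to one warp.
tree = DelayedWarp(DelayedWarp(DelayedLoad(), np.diag([2., 2., 1.])), np.eye(3))
optimized = OptimizeTransformer().visit(tree)
assert isinstance(optimized, DelayedWarp)
assert isinstance(optimized.children[0], DelayedLoad)
```

The appeal is that every rewrite rule lives in one transformer class, so adding memoization or a fixed-point loop would only have to happen in one place.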
To further expand on this, there are 4 main operations that could be in the tree:
* warp - a general Affine transform (perhaps projective in the future)
* crop - a slicing, translation, or band/channel sub-selection
* get_overview - reads data from an "overview", which is effectively a precomputed downscale. It is efficient to replace downscales with overviews, but this does require careful modification of any other operation in the tree (see the sketch after this list).
* dequantize - converts integer quantized data back to its original floating point representation. It is important to do this operation before applying any sort of warping interpolation.
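To make the get_overview rule concrete: overview level n is a precomputed downscale by 1 / 2**n, so a requested downscale can be factored into the coarsest usable overview plus a small residual warp. A minimal sketch of that factorization (the function name and clamping policy are made up for illustration, not the library's actual logic):

```python
import math

def factor_scale_into_overview(scale, num_overviews):
    """
    Split a downscale factor into (overview_level, residual_scale) such
    that scale == residual_scale / 2 ** overview_level.
    """
    # The largest overview level whose downscale does not overshoot the
    # requested scale, clamped to the overviews that actually exist.
    level = int(math.floor(math.log2(1 / scale)))
    level = max(0, min(level, num_overviews))
    residual = scale * (2 ** level)
    return level, residual

# A 0.4x downscale of an image with 3 overviews becomes: read overview
# level 1 (a 0.5x downscale) and then warp by the 0.8x residual.
assert factor_scale_into_overview(0.4, num_overviews=3) == (1, 0.8)
```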
An example of how these can be applied is as follows:
Creating the unoptimized tree leaf:
```python
from delayed_image.delayed_leafs import DelayedLoad
import kwimage

# Grab a test image that contains 3 precomputed overviews
fpath = kwimage.grab_test_image_fpath(overviews=3)

# Start by pointing at an image on disk.
base = DelayedLoad(fpath, channels='r|g|b')

# Metadata about size / channels can be specified, but if it doesn't exist
# prepare will read it from disk.
base = base.prepare()

# We can view the tree of operations at any time with print_graph
base.print_graph(fields=True)
```
Constructing a set of operations on top of the leaf to build out an entire operation tree (in this case a chain):
```python
class mkslice:
    """
    Helper to build slices
    """
    def __class_getitem__(self, index):
        return index

# A typical operation tree might be constructed like so
delayed = base
delayed = delayed.get_overview(1)
delayed = delayed.scale(0.4)
delayed = delayed.crop(mkslice[0:1024, 0:1024], chan_idxs=[0, 2],
                       clip=False, wrap=False)
delayed = delayed.dequantize({
    'orig_min': 0, 'orig_max': 1,
    'quant_min': 0, 'quant_max': 255,
    'nodata': 0
})
delayed = delayed.warp(kwimage.Affine.random(rng=0))
delayed = delayed.warp(kwimage.Affine.random(rng=1))
delayed = delayed.warp(kwimage.Affine.random(rng=2))
delayed = delayed.crop(mkslice[0:32, 0:64], clip=False, wrap=False)

# We can display the tree of operations as is like
delayed.print_graph()
```
The final optimize call returns the root of a new optimized tree:
```python
# However, we have several places that could be replaced with more efficient
# operations:
# * The linear warp operations can be fused together.
# * The downscale operations can be transformed into overviews.
# * And the crop operations can be moved as close to the data loading as
#   possible so the subsequent operations need to handle less data - as much
#   of the manipulated data will get cropped away.
optimized = delayed.optimize()

# The optimized tree looks like this
optimized.print_graph()
```
That will show the slow parts of the decorated functions that are covered by the doctests. Note that the actual reads are expected to be slow. It's the stuff around the reads that needs to be optimized.
I think the main reason for the current slowness is that each optimize call returns a new tree. That only really needs to be done for the first call to optimize; the rest of the optimizations could be done in place. The trick is knowing when you don't need to optimize a subtree anymore. Currently I just call optimize on the subtree each time it's potentially modified, but adding a better way of knowing when a subtree is already done being optimized, and preventing an expensive call that essentially results in a no-op, might be the trick to speeding this up. I'm not sure if there is a way to hack that into the existing structure or if it requires the AST-Transformer-style rewrite.
Each node of the tree (internal or leaf) when __init__ialized will set self.cache = None.
Whenever a node is modified it invalidates its own cache by setting its own self.cache = None.
The API for each node's optimize function will now return a pair (result, hadToBeRecomputed), where result is what it has previously returned and hadToBeRecomputed is set to False only if the node and all its descendants were available in cache.
The implementation for each node's optimize function proceeds as follows.
If the present node has any direct children, call each child node's optimize() -- this is where the recursion is. Continue through all children regardless of the values each returns for hadToBeRecomputed.
If any child replies with hadToBeRecomputed==True OR if the present node's self.cache is None then there is work to do for the present node. Do the work of optimizing the present node using the results already returned by the children. Use the result of that work to set self.cache. Return that result along with hadToBeRecomputed=True.
Otherwise -- when all children report hadToBeRecomputed==False (or there are no children) AND self.cache is not None -- then don't do the work specific to this node. Instead simply return self.cache along with hadToBeRecomputed=False.
Also, choose a better name than hadToBeRecomputed.
My reply is:
Yes, I think this implementation strategy would work. The first "better" name that comes to mind is "was_recomputed".
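For reference, here is a minimal sketch of that protocol on a generic node. The _optimize_here hook is a hypothetical stand-in for the node-specific logic that already exists (the _opt_* methods); everything else follows the steps described above:

```python
class DelayedNode:
    def __init__(self, *children):
        self.children = list(children)
        self.cache = None  # any mutation of this node must reset this to None

    def _optimize_here(self):
        # Stand-in for the node-specific logic that already exists
        # (e.g. DelayedWarp fusing with a child warp); returns the
        # optimized replacement for this node.
        return self

    def optimize(self):
        """
        Returns (result, was_recomputed). was_recomputed is False only
        when this node and all of its descendants came from cache.
        """
        any_child_recomputed = False
        new_children = []
        for child in self.children:
            result, was_recomputed = child.optimize()
            new_children.append(result)
            any_child_recomputed |= was_recomputed

        if any_child_recomputed or self.cache is None:
            # Something below changed (or we were never optimized), so the
            # node-specific work runs against the optimized children.
            self.children = new_children
            self.cache = self._optimize_here()
            return self.cache, True

        # All descendants were cached and so are we: skip the no-op work.
        return self.cache, False
```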
Taking another look at the code, there are also lots of copy.copy operations littered about and I can't imagine those are helping the speed. The entire thing would probably benefit from a concept review if you want to take a peek.
I've kept the implementation of each node's optimize function as-is (i.e. how it determines which optimizations to apply), but I've removed the implementation details of each individual optimization and instead just left each _opt_<opt-name> function stub with its docstring. The idea is to make the high-level code path a bit easier to see. The stub is about 250 lines, which is a bit easier to peruse than the original 2552 lines of delayed_nodes.py.