Currently the call to optimize can take a non-negligible amount of time for complex trees. There are a few reasons for this:
* There is no notion that optimization has already been done (memoization might help).
* Each optimize call returns an entirely new object.
* The optimize calls were built in an ad-hoc way as needed.
The optimize machinery would benefit from added memoization or perhaps a redesign to use existing AST transformer techniques. I'm not exactly sure where to start with this, but users of this project should be aware that this is an issue, and any help with mitigating it would be greatly appreciated.
The logic is split over multiple classes, but all of the classes are in the delayed_image/delayed_nodes.py file (https://gitlab.kitware.com/computer-vision/delayed_image/-/blob/main/delayed_image/delayed_nodes.py?ref_type=heads). The idea is that each node could be the root of an operation tree, and when optimize is called on a node it propagates that call down the tree. For instance, the optimize call for the DelayedWarp node will check whether it is close to the identity (and if so eliminate itself), whether its successor nodes know how to optimize themselves when followed by a warp (e.g. two warps can fuse together), and whether it can split itself into a new "DelayedWarp" and a "DelayedOverview" that factors out a scale factor.
So each node has its own logic. I think what really should be happening is that all of this logic is mapped into some sort of "tree-transformer" which contains all possible optimizations and works similarly to the way a Python AST NodeTransformer works, because these trees of operations are effectively a domain-specific AST.
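To sketch what I mean, here is a rough, self-contained toy modeled on Python's ast.NodeTransformer. The stub node classes and attributes here (Node, children, transform) are simplified stand-ins for illustration and do not match the real delayed_image API:

```python
import numpy as np

# Stub node types so the sketch is self-contained; the real classes live in
# delayed_image/delayed_nodes.py and have much richer APIs.
class Node:
    def __init__(self, *children):
        self.children = list(children)

class DelayedLoad(Node):
    ...

class DelayedWarp(Node):
    def __init__(self, child, transform):
        super().__init__(child)
        self.transform = transform  # a 3x3 affine matrix in this sketch

class TreeTransformer:
    """Visit nodes bottom-up; visit_<ClassName> methods rewrite matches."""
    def visit(self, node):
        node.children = [self.visit(c) for c in node.children]
        method = getattr(self, 'visit_' + type(node).__name__, None)
        return method(node) if method else node

class OptimizeTransformer(TreeTransformer):
    """All rewrite rules live in one place instead of on each node class."""
    def visit_DelayedWarp(self, node):
        child = node.children[0]
        # Rule: a warp close to the identity eliminates itself.
        if np.allclose(node.transform, np.eye(3)):
            return child
        # Rule: two consecutive warps fuse into a single warp.
        if isinstance(child, DelayedWarp):
            return DelayedWarp(child.children[0],
                               node.transform @ child.transform)
        return node

# An identity warp over a scaling warp over a load collapses to one warp.
tree = DelayedWarp(DelayedWarp(DelayedLoad(), np.diag([2., 2., 1.])), np.eye(3))
optimized = OptimizeTransformer().visit(tree)
assert isinstance(optimized, DelayedWarp)
assert isinstance(optimized.children[0], DelayedLoad)
```

The appeal is that every rewrite rule lives in one transformer class, so adding memoization or a fixed-point loop would only have to happen in one place.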
To further expand on this, there are 4 main operations that could be in the tree:
* warp - a general Affine transform (perhaps projective in the future)
* crop - a slicing, translation, or band/channel sub-selection
* get_overview - reads data from an "overview", which is effectively a precomputed downscale. It is efficient to replace downscales with overviews, but this does require careful modification of any other operation in the tree (see the sketch after this list).
* dequantize - converts integer quantized data back to its original floating point representation. It is important to do this operation before applying any sort of warping interpolation.
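To make the get_overview rule concrete: overview level n is a precomputed downscale by 1 / 2**n, so a requested downscale can be factored into the coarsest usable overview plus a small residual warp. A minimal sketch of that factorization (the function name and clamping policy are made up for illustration, not the library's actual logic):

```python
import math

def factor_scale_into_overview(scale, num_overviews):
    """
    Split a downscale factor into (overview_level, residual_scale) such
    that scale == residual_scale / 2 ** overview_level.
    """
    # The largest overview level whose downscale does not overshoot the
    # requested scale, clamped to the overviews that actually exist.
    level = int(math.floor(math.log2(1 / scale)))
    level = max(0, min(level, num_overviews))
    residual = scale * (2 ** level)
    return level, residual

# A 0.4x downscale of an image with 3 overviews becomes: read overview
# level 1 (a 0.5x downscale) and then warp by the 0.8x residual.
assert factor_scale_into_overview(0.4, num_overviews=3) == (1, 0.8)
```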
An example of how these can be applied is as follows:
Creating the unoptimized tree leaf:
```python
from delayed_image.delayed_leafs import DelayedLoad
import kwimage

# Grab a test image that contains 3 precomputed overviews
fpath = kwimage.grab_test_image_fpath(overviews=3)

# Start by pointing at an image on disk.
base = DelayedLoad(fpath, channels='r|g|b')

# Metadata about size / channels can be specified, but if it doesn't exist
# prepare will read it from disk.
base = base.prepare()

# We can view the tree of operations at any time with print_graph
base.print_graph(fields=True)
```
Constructing a set of operations on top of the leaf to build out an entire operation tree (in this case a chain):
```python
class mkslice:
    """
    Helper to build slices
    """
    def __class_getitem__(self, index):
        return index

# A typical operation tree might be constructed like so
delayed = base
delayed = delayed.get_overview(1)
delayed = delayed.scale(0.4)
delayed = delayed.crop(mkslice[0:1024, 0:1024], chan_idxs=[0, 2],
                       clip=False, wrap=False)
delayed = delayed.dequantize({
    'orig_min': 0, 'orig_max': 1,
    'quant_min': 0, 'quant_max': 255,
    'nodata': 0
})
delayed = delayed.warp(kwimage.Affine.random(rng=0))
delayed = delayed.warp(kwimage.Affine.random(rng=1))
delayed = delayed.warp(kwimage.Affine.random(rng=2))
delayed = delayed.crop(mkslice[0:32, 0:64], clip=False, wrap=False)

# We can display the tree of operations as is like
delayed.print_graph()
```
The final optimize call returns the root of a new optimized tree:
```python
# However, we have several places that could be replaced with more efficient
# operations:
# * The linear warp operations can be fused together.
# * The downscale operations can be transformed into overviews.
# * And the crop operations can be moved as close to the data loading as
#   possible so the subsequent operations need to handle less data - as much
#   of the manipulated data will get cropped away.
optimized = delayed.optimize()

# The optimized tree looks like this
optimized.print_graph()
```
That will show the slow parts of the decorated functions that are covered by the doctests. Note that the actual reads are expected to be slow. It's the stuff around the reads that needs to be optimized.
I think the main reason for the current slowness is that each optimize call returns a new tree. That only really needs to be done for the first call to optimize; the rest of the optimizations could be done in place. The trick is knowing when you don't need to optimize a subtree anymore. Currently I just call optimize on the subtree each time it's potentially modified, but adding a better way of knowing when a subtree is already done being optimized, and preventing an expensive call that essentially results in a no-op, might be the trick to speeding this up. I'm not sure if there is a way to hack that into the existing structure or if it requires the AST-Transformer-style rewrite.
Each node of the tree (internal or leaf) when __init__ialized will set self.cache = None.
Whenever a node is modified it invalidates its own cache by setting its own self.cache = None.
The API for each node's optimize function will now return a pair (result, hadToBeRecomputed), where result is what it has previously returned and hadToBeRecomputed is set to False only if the node and all its descendants were available in cache.
The implementation for each node's optimize function proceeds as follows.
If the present node has any direct children, call each child node's optimize() -- this is where the recursion is. Continue through all children regardless of the values each returns for hadToBeRecomputed.
If any child replies with hadToBeRecomputed==True OR if the present node's self.cache is None then there is work to do for the present node. Do the work of optimizing the present node using the results already returned by the children. Use the result of that work to set self.cache. Return that result along with hadToBeRecomputed=True.
Otherwise -- when all children report hadToBeRecomputed==False (or there are no children) AND self.cache is not None -- then don't do the work specific to this node. Instead simply return self.cache along with hadToBeRecomputed=False.
Also, choose a better name than hadToBeRecomputed.
My reply is:
Yes, I think this implementation strategy would work. The first "better" name that comes to mind is "was_recomputed".
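For reference, here is a minimal sketch of that protocol on a generic node. The _optimize_here hook is a hypothetical stand-in for the node-specific logic that already exists (the _opt_* methods); everything else follows the steps described above:

```python
class DelayedNode:
    def __init__(self, *children):
        self.children = list(children)
        self.cache = None  # any mutation of this node must reset this to None

    def _optimize_here(self):
        # Stand-in for the node-specific logic that already exists
        # (e.g. DelayedWarp fusing with a child warp); returns the
        # optimized replacement for this node.
        return self

    def optimize(self):
        """
        Returns (result, was_recomputed). was_recomputed is False only
        when this node and all of its descendants came from cache.
        """
        any_child_recomputed = False
        new_children = []
        for child in self.children:
            result, was_recomputed = child.optimize()
            new_children.append(result)
            any_child_recomputed |= was_recomputed

        if any_child_recomputed or self.cache is None:
            # Something below changed (or we were never optimized), so the
            # node-specific work runs against the optimized children.
            self.children = new_children
            self.cache = self._optimize_here()
            return self.cache, True

        # All descendants were cached and so are we: skip the no-op work.
        return self.cache, False
```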
Taking another look at the code, there are also lots of copy.copy operations littered about and I can't imagine those are helping the speed. The entire thing would probably benefit from a concept review if you want to take a peek.
I've kept the implementation of each node's optimize function as-is (i.e. how it determines which optimizations to apply), but I've removed the implementation details of each individual optimization and instead just left each _opt_<opt-name> function stub with its docstring. The idea is to make the high-level code path a bit easier to see. The stub is about 250 lines, which is a bit easier to peruse than the original 2552 lines of delayed_nodes.py.