Asynchronous rendering
As discussed in the explanations document on rendering, asynchronous
rendering is a feature that allows napari to stay usable and responsive
even when data is loading slowly. There are two experimental asynchronous
rendering features; they can be enabled using the environment variables
NAPARI_ASYNC and NAPARI_OCTREE.
NAPARI_ASYNC¶
Running napari with NAPARI_ASYNC=1 enables asynchronous rendering using
the existing Image class. The Image class will no longer call np.asarray()
in the GUI thread. We do this so that if np.asarray() blocks on IO or a
computation, the GUI thread will not block and the framerate will not suffer.
To avoid blocking the GUI thread, the Image class will load chunks using
the new ChunkLoader class. The ChunkLoader will call np.asarray() in a
worker thread. When the worker thread finishes, it will call
on_chunk_loaded() with the loaded data. On the next frame, Image can
display the new data.
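The hand-off described above can be sketched with the standard library alone. This is a minimal illustration, not napari's ChunkLoader: SlowData, SketchLayer and load_async are hypothetical names, and materialize() merely stands in for a slow np.asarray() call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class SlowData:
    """Hypothetical lazy data source; materialize() stands in for np.asarray()."""

    def __init__(self, values, delay_s=0.05):
        self._values = values
        self._delay_s = delay_s

    def materialize(self):
        time.sleep(self._delay_s)  # simulate slow IO or computation
        return list(self._values)


class SketchLayer:
    """Hypothetical layer that receives loaded data via on_chunk_loaded()."""

    def __init__(self):
        self.data = None

    def on_chunk_loaded(self, future):
        # add_done_callback runs this in the worker thread (or in the
        # submitting thread if the future is already done). A real GUI
        # would hand the result back to the GUI thread instead.
        self.data = future.result()


def load_async(executor, layer, lazy_data):
    """Start the load in a worker thread and return immediately."""
    future = executor.submit(lazy_data.materialize)
    future.add_done_callback(layer.on_chunk_loaded)
    return future


layer = SketchLayer()
with ThreadPoolExecutor(max_workers=1) as executor:
    load_async(executor, layer, SlowData([1, 2, 3]))
    # The calling thread is free to keep drawing frames here.
# Leaving the "with" block waits for the worker to finish.
print(layer.data)
```

The point of the design is that the submitting thread never waits on the slow call; it only sees the data once the callback has stored it.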
Time-series data¶
Without NAPARI_ASYNC, napari will block when switching slices: it will
hang until the new slice has loaded. If the slice loads slowly enough,
you might see the “spinning wheel of death” on a Mac, indicating the
process is hung.
Asynchronous rendering allows the user to interrupt the loading of a slice at any time. The user can freely move the slice slider. This is especially nice for remote or slow-loading data.
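One common way to make a load "interruptible" from the user's point of view is to ignore results that arrive for a slice the user has already left. The sketch below shows that idea only; SliceRequests is a hypothetical class, and nothing here is taken from napari's actual implementation.

```python
class SliceRequests:
    """Hypothetical bookkeeping: only the most recent request may be displayed."""

    def __init__(self):
        self._latest_id = 0
        self.displayed = None

    def request(self, slice_index):
        """Called when the slider moves; returns an id for this request."""
        self._latest_id += 1
        return self._latest_id, slice_index

    def on_loaded(self, request_id, slice_index, data):
        """Called when a load finishes; stale results are ignored."""
        if request_id != self._latest_id:
            return False  # the user has moved on, drop this result
        self.displayed = (slice_index, data)
        return True


requests = SliceRequests()
first_id, first_slice = requests.request(10)
second_id, second_slice = requests.request(11)  # user moved the slider again

stale = requests.on_loaded(first_id, first_slice, "data for 10")
fresh = requests.on_loaded(second_id, second_slice, "data for 11")
```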
Multi-scale images¶
With today’s Image class there are no tiles or chunks. Instead, whenever
the camera is panned or zoomed, napari fetches all the data needed to draw
the entire current canvas. This actually works amazingly well with local
data. Fetching the whole canvas of data each time can be quite fast.
With remote or other high latency data, however, this method can be very slow. Even if you pan only a tiny amount, napari has to fetch the whole canvas worth of data, and you cannot interrupt the load to further adjust the camera.
With NAPARI_ASYNC overall performance is the same, but the advantage is
that you can interrupt the load by moving the camera at any time. This is
a nice improvement, but working with slow-loading data is still slow. Most
large image viewers improve on this experience with chunks or tiles. With
chunks or tiles, when the image is panned the existing tiles are translated
and re-used. Then the viewer only needs to fetch tiles which newly slid
onto the screen. This style of rendering is what our NAPARI_OCTREE flag
enables.
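To make the tile re-use concrete, here is a small sketch that computes which tiles a viewport overlaps before and after a pan. The function name and the viewport convention (top-left corner plus size, in data pixels) are assumptions made for illustration.

```python
def visible_tiles(x0, y0, width, height, tile_size=256):
    """Return the (row, col) indices of tiles overlapping a viewport.

    The viewport is given in data pixels: top-left corner (x0, y0) and
    its width and height. All names here are illustrative.
    """
    first_col, last_col = x0 // tile_size, (x0 + width - 1) // tile_size
    first_row, last_row = y0 // tile_size, (y0 + height - 1) // tile_size
    return {
        (row, col)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    }


before = visible_tiles(0, 0, 512, 512)
after = visible_tiles(256, 0, 512, 512)  # pan right by one tile width

reused = after & before    # already loaded, only needs translating
to_fetch = after - before  # newly slid onto the screen
```

In this example half of the visible tiles are re-used and only the two tiles in the new column need to be fetched.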
NAPARI_OCTREE¶
Set NAPARI_OCTREE=1 to use the experimental OctreeImage class instead of
the normal Image class. The new OctreeImage class will use the same
ChunkLoader that NAPARI_ASYNC enables. In addition, NAPARI_OCTREE will use
the new TiledImageVisual instead of the Vispy ImageVisual class that
napari’s Image class uses.
Note
The current OCTREE implementation only fully supports a single 2D image and
may not function with 3D images or multiple images. Improving support
for 3D and multiple images is part of future work on the OCTREE.
See Octree configuration file for Octree configuration options.
Octree visuals¶
The visual portion of Octree rendering is implemented by three classes:
VispyTiledImageLayer, TiledImageVisual, and TextureAtlas2D.
The first two classes are named “tiled image” rather than “octree” because
currently they do not know that they are rendering out of an octree. We did
this intentionally to keep the visuals simpler and more general. However,
the approach has some limitations, and we might later need to create a
subclass of TiledImageVisual which is Octree-specific; see Future work:
Extending TextureAtlas2D.
The TextureAtlas2D class is a subclass of the generic Vispy Texture2D
class. Like Texture2D, the TextureAtlas2D class owns one texture. However,
TextureAtlas2D uses this one texture as an “atlas” which can hold multiple
tiles.
For example, by default TextureAtlas2D uses a (4096, 4096) texture that
stores 256 different (256, 256) pixel tiles. Adding or removing a single
tile from the full atlas texture is very fast. Under the hood, adding one
tile calls glTexSubImage2D(), which only updates the data in that specific
(256, 256) portion of the full texture.
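The arithmetic behind the default atlas can be sketched as follows. AtlasSlots is a hypothetical bookkeeping class, not the TextureAtlas2D API; it only shows how 256 slots fit in a (4096, 4096) texture and where each slot sits, which is the region a sub-image upload would touch.

```python
class AtlasSlots:
    """Hypothetical slot bookkeeping for a square texture atlas."""

    def __init__(self, texture_size=4096, tile_size=256):
        self.tile_size = tile_size
        self.tiles_per_row = texture_size // tile_size
        self.num_slots = self.tiles_per_row ** 2
        # Reverse order so that pop() hands out slot 0 first.
        self._free = list(range(self.num_slots - 1, -1, -1))

    def add_tile(self):
        """Reserve a slot and return its index, or None if the atlas is full."""
        return self._free.pop() if self._free else None

    def remove_tile(self, index):
        """Release a slot so that a later tile can reuse it."""
        self._free.append(index)

    def offset(self, index):
        """Pixel (row, col) offset of a slot inside the full texture."""
        row, col = divmod(index, self.tiles_per_row)
        return row * self.tile_size, col * self.tile_size


atlas = AtlasSlots()
slot = atlas.add_tile()
```

Adding or removing a tile only changes the free list, which is why no other tile in the atlas has to move.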
Aside from the data transfer cost, TextureAtlas2D is also fast because we
do not have to modify the scene graph or rebuild any shaders when a tile is
added or removed. In an early version of tiled rendering we created a new
ImageVisual for every tile. This resulted in scene graph changes and shader
rebuilds. At the time the scene graph changes were causing crashes with
PyQt5, but the atlas approach is better for multiple reasons, so even if
that crash were fixed the atlas is a better solution.
Octree rendering¶
The interface between the visuals and the Octree is the OctreeImage method
get_drawable_chunks(). The method is called by the VispyTiledImageLayer
method _update_drawn_chunks() every frame so it can update which tiles are
drawn.
OctreeImage calls the get_intersection() method on its OctreeSlice to get
an OctreeIntersection object which contains the “ideal chunks” that should
be drawn for the current camera position.
The ideal chunks are the chunks at the preferred level of detail, the level of detail that best matches the current canvas resolution. Drawing chunks which are more detailed than this will look fine, since the graphics card will downsample them to the screen resolution, but it’s not efficient to use higher resolution chunks than are needed. Meanwhile, drawing chunks that are coarser than the ideal level will look blurry, but it’s much better than drawing nothing.
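One plausible way to choose such a level is sketched below. It assumes level 0 is full resolution and that each level halves the resolution of the one below it; the function name and its inputs are illustrative and are not taken from napari's code.

```python
import math


def ideal_level(data_pixels_per_screen_pixel, num_levels):
    """Pick the level whose resolution best matches the screen.

    Assumes level 0 is full resolution and each level halves the
    resolution of the one below it.
    """
    if data_pixels_per_screen_pixel <= 1:
        return 0  # zoomed in past full resolution: use the finest level
    level = math.floor(math.log2(data_pixels_per_screen_pixel))
    return min(level, num_levels - 1)
```

Rounding down favors a level that is slightly too detailed over one that is slightly too coarse, which matches the trade-off described above: extra detail is downsampled, while missing detail looks blurry.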
The decision about what level of detail to use is made by the OctreeLoader
class and its method get_drawable_chunks().
There are many different approaches one could take here as far as what to
draw when. Today we are doing something reasonable but it could potentially
be improved. In addition to deciding what level of detail to draw for each
ideal chunk, the class initiates asynchronous loads with the ChunkLoader
for chunks it wants to draw in the future.
The loader will only use chunks from a higher resolution if they are already being drawn, for example when zooming out. However, it will never initiate loads on higher resolution chunks, since it’s better off loading and drawing the ideal chunks.
The loader will load lower resolution chunks in some cases. Although this can slightly delay when the ideal chunks are loaded, it’s a very quick way to get reasonable looking “coverage” of the area of interest. Often data from one or two levels up isn’t even that noticeably degraded. This table shows how many ideal chunks are “covered” by a chunk at a higher level:
Levels Above Ideal | Coverage |
---|---|
1 | 4 |
2 | 16 |
3 | 64 |
Although data 3 levels above will be quite blurry, it’s pretty amazing you can load one chunk and it will cover 64 ideal chunks. This is the heart of the power of Octrees, Quadtrees or image pyramids.
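The coverage numbers follow from each level halving the resolution along both image axes, so one chunk covers four chunks of the level below it. The sketch below reproduces the table and shows how a coarser chunk could be looked up for a given ideal chunk. It assumes 2D (row, col) chunk indices at the ideal level; the function names and the `loaded` mapping are hypothetical.

```python
def coverage(levels_above_ideal):
    """Number of ideal chunks covered by one chunk this many levels up."""
    return 4 ** levels_above_ideal


def ancestor(row, col, levels_above_ideal):
    """Index of the coarser chunk that covers ideal chunk (row, col)."""
    return row >> levels_above_ideal, col >> levels_above_ideal


def best_available(row, col, loaded, max_levels_up=3):
    """Return (levels_up, chunk_index) for the finest loaded chunk covering
    the ideal chunk, or None. `loaded` maps levels_up -> set of indices."""
    for levels_up in range(max_levels_up + 1):
        index = ancestor(row, col, levels_up)
        if index in loaded.get(levels_up, set()):
            return levels_up, index
    return None
```

For instance, if only a chunk two levels up has been loaded, best_available() falls back to it, and it switches to the ideal chunk as soon as that one is present.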
Octree configuration file¶
Setting NAPARI_OCTREE=1 enables Octree rendering with the default
configuration. To customize the configuration set NAPARI_OCTREE to be
the path of a JSON config file, such as NAPARI_OCTREE=/tmp/octree.json.
See DEFAULT_OCTREE_CONFIG for the current config file format:
{
    "loader_defaults": {
        "log_path": None,
        "force_synchronous": False,
        "num_workers": 10,
        "use_processes": False,
        "auto_sync_ms": 30,
        "delay_queue_ms": 100,
    },
    "octree": {
        "enabled": True,
        "tile_size": 256,
        "log_path": None,
        "loaders": {
            0: {"num_workers": 10, "delay_queue_ms": 100},
            2: {"num_workers": 10, "delay_queue_ms": 0},
        },
    },
}
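The listing above uses Python literals (None, False, integer keys), so a file on disk needs the JSON equivalents. The sketch below writes such a file and points NAPARI_OCTREE at it. It assumes the file should mirror the layout of the listing, and that string loader keys are accepted; neither is confirmed here, and the tile_size of 512 is just an example value.

```python
import json
import os
import tempfile

# Hypothetical custom configuration. The key names are copied from the
# listing above; that the file must mirror this layout is an assumption.
config = {
    "loader_defaults": {
        "log_path": None,
        "force_synchronous": False,
        "num_workers": 10,
        "use_processes": False,
        "auto_sync_ms": 30,
        "delay_queue_ms": 100,
    },
    "octree": {
        "enabled": True,
        "tile_size": 512,  # changed from the default of 256
        "log_path": None,
        "loaders": {
            0: {"num_workers": 10, "delay_queue_ms": 100},
            2: {"num_workers": 10, "delay_queue_ms": 0},
        },
    },
}

config_path = os.path.join(tempfile.gettempdir(), "octree.json")
with open(config_path, "w") as config_file:
    # json.dump writes None as null, False as false, and the integer
    # loader keys as the strings "0" and "2".
    json.dump(config, config_file, indent=4)

# Set the variable before napari starts so it can pick up the file.
os.environ["NAPARI_OCTREE"] = config_path
```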
The loader_defaults key contains settings that will be used by the
ChunkLoader.
Setting | Description |
---|---|
log_path | Write |
force_synchronous | If |
num_workers | The number of worker threads or processes. |
use_processes | If |
auto_sync_ms | Switch to synchronous if loads are faster than this. |
delay_queue_ms | Delay loads by this much. |
The octree key contains these settings:
Setting | Description |
---|---|
enabled | If |
tile_size | Size of render tiles to use for rendering. |
log_path | Octree specific log file for debugging. |
loaders | Optional custom loaders, see below. |
The loaders key lets you define and configure multiple LoaderPool pools.
The key of each loader is the level relative to the ideal level. In the
above example configuration we define two loaders. The first, with key 0,
is for loading chunks at the ideal level or one above, while the second,
with key 2, will load chunks two above the ideal level or higher.
Each loader uses the loader_defaults, but you can override the
num_workers, auto_sync_ms and delay_queue_ms values in each loader defined
in loaders.
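Read literally, the rules above amount to two small lookups: pick the loader with the largest key that does not exceed the chunk's distance from the ideal level, then layer that loader's overrides on top of the defaults. The sketch below illustrates this reading with the values from the example configuration; the function names are hypothetical and it assumes a loader with key 0 always exists.

```python
loader_defaults = {"num_workers": 10, "auto_sync_ms": 30, "delay_queue_ms": 100}
loaders = {
    0: {"num_workers": 10, "delay_queue_ms": 100},
    2: {"num_workers": 10, "delay_queue_ms": 0},
}


def loader_key_for(levels_above_ideal, loaders):
    """Pick the loader with the largest key that does not exceed the distance."""
    eligible = [key for key in loaders if key <= levels_above_ideal]
    return max(eligible)


def loader_settings(key, loaders, loader_defaults):
    """Per-loader settings: defaults first, then that loader's overrides."""
    return {**loader_defaults, **loaders[key]}
```

With these values, chunks at the ideal level or one above use loader 0, and chunks two or more levels above use loader 2 with no delay.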
Multiple loaders¶
We allow multiple loaders to improve loading performance. There are a lot of different strategies one could use when loading chunks. For example, we tend to load chunks at a higher level prior to loading the chunks at the ideal level. This gets “coverage” on the screen quickly, and then the data can be refined by loading the ideal chunks.
One consideration is that during rapid movement of the camera it’s easy to
clog up the loader pool with workers loading chunks that have already moved
out of view. The DelayQueue was created to help with this problem.
While we can’t cancel a load if a worker has started working on it, we can trivially cancel loads that are still in our delay queue. If the chunk goes out of view, we cancel the load. If the user pauses for a bit, we initiate the loads.
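The behavior described above can be sketched with a small queue that holds requests until a delay has elapsed. SketchDelayQueue is a hypothetical stand-in, not napari's DelayQueue; times are passed in explicitly so the example is deterministic.

```python
class SketchDelayQueue:
    """Hypothetical delay queue: requests wait before reaching a worker."""

    def __init__(self, delay_ms):
        self.delay_ms = delay_ms
        self._pending = {}  # chunk key -> time (ms) at which it was queued

    def add(self, chunk_key, now_ms):
        # setdefault keeps the original queue time if the chunk is re-added.
        self._pending.setdefault(chunk_key, now_ms)

    def cancel(self, chunk_key):
        """Drop a request that has not been handed to a worker yet."""
        return self._pending.pop(chunk_key, None) is not None

    def pop_ready(self, now_ms):
        """Remove and return the requests that have waited long enough."""
        ready = [
            key for key, queued_ms in self._pending.items()
            if now_ms - queued_ms >= self.delay_ms
        ]
        for key in ready:
            del self._pending[key]
        return ready


queue = SketchDelayQueue(delay_ms=100)
queue.add("chunk-a", now_ms=0)
queue.add("chunk-b", now_ms=0)
cancelled = queue.cancel("chunk-a")  # chunk-a scrolled out of view
early = queue.pop_ready(now_ms=50)   # too soon, nothing is released
ready = queue.pop_ready(now_ms=120)  # the user paused, chunk-b is released
```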
With multiple loaders we can delay the ideal chunks, but we can configure
zero delay for the higher levels. A single chunk from two levels up will
cover 16 ideal chunks, so immediately loading them is a good way to get
data on the screen quickly. When the camera stops moving, the LoaderPool
for the ideal level will often be empty, so all of those workers can
immediately start loading the ideal chunks.
The ability to have multiple loaders was only recently added. We still need to experiment to figure out the best configuration, and how that configuration needs to vary based on the latency of the data or other considerations.
Future work: Compatibility with the existing Image class¶
The focus for initial Octree development was Octree-specific behaviors and
infrastructure: loading chunks asynchronously and rendering them as
individual tiles. One question we wanted to answer was whether a
Python/Vispy implementation of Octree rendering would be performant enough,
because if not, we might need a totally different approach. It’s not been
fully proven out, but it seems like the performance will be good enough, so
the next step is full compatibility with the existing Image class.
The OctreeImage class is derived from Image, while VispyTiledImageLayer is
derived from VispyImageLayer, and TiledImageVisual is derived from the
regular Vispy ImageVisual class. To bring full Image capability to
OctreeImage, in most cases we just need to duplicate what those base
classes are doing, but do it on a per-tile basis, since there is no full
image for them to operate on. This might involve chaining to the base class
or it could mean duplicating that functionality somehow in the derived
class.
Some Image functionality that needs to be duplicated in Octree code:
Contrast limits and color transforms¶
The contrast limit code in Vispy’s ImageVisual needs to be moved into
the tiled visual’s _build_texture(). Instead of operating on self.data, it
needs to transform tiles which are newly being added to the visual. The
color transform similarly needs to be per-tile.
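As a rough picture of what "per-tile" means here, the sketch below rescales a single tile to the contrast limits and clips values outside them. It is a generic normalization written with plain lists, not Vispy's code: the function name is hypothetical, and it assumes the upper limit is greater than the lower one.

```python
def apply_contrast_limits(tile, low, high):
    """Map values in [low, high] to [0.0, 1.0], clipping anything outside.

    `tile` is a list of rows of numbers; real code would more likely
    work on NumPy arrays or do this in a shader.
    """
    span = high - low
    return [
        [min(max((value - low) / span, 0.0), 1.0) for value in row]
        for row in tile
    ]
```

Because the function only ever sees one tile, it can run as each tile is added, without needing the full image.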
Blending and opacity¶
It might be hard to get opacity working correctly for tiles where loads are
in progress. The way TiledImageVisual works today is the OctreeLoader
potentially passes the visual tiles of various sizes, from different levels
of the Octree. The tiles are rendered on top of each other from largest
(coarsest level) to smallest (finest level). This is a nice trick so that
bigger tiles provide “coverage” for an area, while the smaller tiles add
detail only where that data has been loaded.
However, this breaks blending and opacity. We draw multiple tiles on top of
each other, so the image is blending with itself. One solution, which is
kind of a big change, is to keep TiledImageVisual for the generic “tiled”
case, but introduce a new OctreeVisual that knows about the Octree. It can
walk up and down the Octree chopping up larger tiles to make sure we do not
render anything on top of anything else.
Until we do that, we could punt on making things look correct while loads are in progress. We could even highlight the fact that a tile has not been fully loaded (purposely making it look different until the data is fully loaded). Aside from blending, this would address a common complaint with tiled image viewers: you often can’t tell if the data is still being loaded. This could be a big issue for scientific uses, because you don’t want people drawing the wrong conclusions from the data.
Time-series multiscale¶
Making time-series multiscale work should not be too hard. We just need to
correctly create a new OctreeSlice every time the slice changes.
The challenge will probably be performance. For starters we probably need to stop creating the “extra” downsampled levels, as described in Future work: Extending TextureAtlas2D. We need to make sure constructing and tearing down the Octree is fast enough, and make sure loads for the previous slices are canceled and everything is cleaned up.
Future work: Extending TextureAtlas2D¶
We could improve our TextureAtlas2D class in a number of ways:
Support setting the atlas’s full texture size on the fly.
Support setting the atlas’s tile size on the fly.
Support a mix of tile sizes in one atlas.
Allow an atlas to have more than one backing texture.
One reason to consider these changes is so we could support “large tiles” in certain cases. Often the coarsest level of multi-scale data “in the wild” is much bigger than one of our (256, 256) tiles. Today we solve that by creating additional Octree levels, downsampling the data until the coarsest level fits within a single tile.
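The number of additional levels this requires is easy to estimate. The sketch below assumes each level halves both dimensions, rounding up; the function name is illustrative and the rounding rule is an assumption.

```python
import math


def extra_levels_needed(shape, tile_size=256):
    """Count the 2x downsampling steps until the image fits in one tile."""
    height, width = shape
    levels = 0
    while max(height, width) > tile_size:
        height, width = math.ceil(height / 2), math.ceil(width / 2)
        levels += 1
    return levels
```

Under these assumptions a (4096, 4096) coarsest level needs four more levels before it fits in a single (256, 256) tile, which is the work that larger root tiles would avoid.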
If we could support multiple tile sizes and multiple backing textures, we could potentially have “interior tiles” which were small, but then allow large root tiles. Graphics cards can handle pretty big textures. A layer that’s (100000, 100000) obviously needs to be broken into tiles, but a layer that’s (4096, 4096) really does not need to be broken into tiles. That could be a single large tile.
Long term it would be nice if we did not have to support two image classes:
Image and OctreeImage. Maintaining two code paths and two sets of visuals
will become tiresome and lead to discrepancies and bugs.
Instead, it would be nice if OctreeImage became the only image class. One
image class to rule them all. For that to happen, though, we need to render
small images just as efficiently as the Image class does today. We do not
want Octree rendering to worsen cases which work well today. To keep
today’s performance for smaller images we probably need to add support for
variable size tiles.
Future work: Level-zero-only Octrees¶
In issue #1300 it takes 1500ms to switch slices. There we are rendering a (16384, 16384) image that is entirely in RAM. The delay is not from loading into RAM, since the data is already there; the delay is from transferring all that data to VRAM in one big gulp.
The image is not a multi-scale image. So can we turn it into a multi-scale image? Generally we’ve found downsampling to create multi-scale image layers is slow. So the question is how can we draw this large image without hanging? One idea is we could create an Octree that only has a level zero and no downsampled levels.
This is an option because chopping up a NumPy array into tiles is very
fast. This chopping up phase is really just creating a bunch of “views”
into the single existing array. So creating a level zero Octree should be
very fast. From there we can use our existing Octree code and our existing
TiledImageVisual to transfer over one tile at a time without hurting the
frame rate.
The insight here is that our Octree code is really two things: one is an Octree, and two is a tiled or chunked image, basically a flat image chopped into a grid of tiles. How would this look to the user? With this approach, switching slices would be similar to panning and zooming a multiscale Octree image: you’d see the new tiles loading in over time, but the framerate would not tank, and you could switch slices at any time.
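A level-zero grid of this kind can be described purely by slices. The sketch below computes them for a given image shape; tile_slices is a hypothetical helper. Indexing a NumPy array with one of these slice pairs uses basic slicing, which returns a view rather than a copy, so no pixel data is duplicated.

```python
def tile_slices(shape, tile_size=256):
    """Map (row, col) tile indices to the slices that select each tile."""
    height, width = shape
    tiles = {}
    for row, y in enumerate(range(0, height, tile_size)):
        for col, x in enumerate(range(0, width, tile_size)):
            tiles[(row, col)] = (
                slice(y, min(y + tile_size, height)),
                slice(x, min(x + tile_size, width)),
            )
    return tiles


tiles = tile_slices((16384, 16384))
```

For the (16384, 16384) image this yields a 64 by 64 grid of tiles, each of which can be transferred on its own instead of in one large upload.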
Future work: Caching¶
Basically no work has gone into caching or memory management for Octree data. It’s very likely there are leaks and extended usage will run out of memory. This hasn’t been addressed because using Octree for long periods of time is just now becoming possible.
One caching issue is figuring out how to combine the ChunkCache with
Dask’s built-in caching. We probably want to keep the ChunkCache for
rendering non-Dask arrays? But when using Dask, we defer to its cache? We
certainly don’t want to cache the data in both places.
Another issue is whether to cache OctreeChunks or tiles in the visual,
beyond just caching the raw data. If re-creating both is fast enough, the
simpler thing is to evict them fully when a chunk falls out of view, and
re-create them if it comes back into view. It’s simplest to keep nothing
but what we are currently drawing.
However, if that’s not fast enough, we could have an MRU cache of
OctreeChunks and tiles in VRAM, so that reviewing the same data is nearly
instant. This adds complexity, but the performance might be worth it.
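A bounded cache of this kind takes only a few lines with the standard library. The sketch below keeps the most recently used entries and evicts the least recently used one when full; RecentChunkCache is a hypothetical class, and reading "MRU cache" as "keep what was used most recently" is an assumption.

```python
from collections import OrderedDict


class RecentChunkCache:
    """Hypothetical cache that keeps only the most recently used entries."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
```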