Love2D 12 compute shaders 101: Finding True Content Bounds
Last time we talked about Love 12’s new compute shader support and set up a basic project to demonstrate data transport between GPU and CPU. Today, let’s build on that and do something useful, but not terribly complex: true content bounds (TCB) detection.
TCB are quite important in compositing: there might be really expensive shader effects applied to the contents of a Canvas (Texture), but the canvas itself is larger than it needs to be. For example, if we want to apply effects to the drawing operations performed on the main window, we’d first target them towards a canvas, then apply the effect. But the canvas size is likely much larger than the content, with the rest being transparent nothingness. Dragging this emptiness through the next compositing passes wastes GPU cycles and, more importantly, memory.
Ideally, we’d trim the excess at the start of the pipeline, operate on the content only, and then return the composite, with its own bounds and origin offset. This also allows us to cache effect operations based on the content, and avoid recomputing it if the content merely moves or scales.
Here’s a visual demonstration of the love app we’re going to build:
We also need to make sure this works for love apps in 1:1 and HiDPI rendering modes. This is a pretty nasty pitfall, because we’re crossing the boundary between Love’s logical pixels and the shader’s physical pixels!
Let’s start by declaring what we need:
local dpi = love.graphics.getDPIScale()
local canvas = love.graphics.newCanvas(512, 512, {
    format = "rgba32f",
    computewrite = true
})
local quad = love.graphics.newQuad(0, 0, 1, 1, canvas)
local time = 0
local workGroupSize = 16
local cs = love.graphics.newComputeShader("bounds.comp")
local buf = love.graphics.newBuffer(
    "uint32",
    4,
    { shaderstorage = true }
)
Our demo rendering area will be 512x512 logical pixels big. For later rendering we also declare a Quad (so we don’t have to copy into a “trimmed” Canvas), and load our shader. The shader output will be the physical pixel TCB, for which we just need 4 uints (other types would also work).
workGroupSize determines the edge length of the square 2D work group in the compute shader. GPUs organize the shader invocations (runs of the main entry point) into work groups of threads. A work group has 1, 2 or 3 dimensions (the missing ones default to 1). We’ll work with a 16x16 2D work group size, which means each work group runs 256 invocations of the main entry point. The GPU hardware then executes these threads in smaller batches, often of 32 or 64. Threads within the same work group can also utilize synchronization primitives to coordinate with each other.
The group size is set at compile time (you’ll see later in the shader code), and at runtime we have to make sure to schedule enough groups to cover our data. In our example, we’re working with image data, which is 2D. So we can choose a 2D work group, so that each thread maps to one pixel. The dimensions don’t have to match, and you need to profile your shader + hardware + data together to find the right parameters.
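To make the scheduling arithmetic concrete, here is a small plain-Lua sketch (no LÖVE required). The helper names groupCount and globalId are made up for illustration; the second one mirrors how GLSL derives gl_GlobalInvocationID from the work group ID and the local invocation ID.

```lua
-- Hypothetical helpers, for illustration only.

-- Number of work groups needed to cover `size` elements along one axis.
local function groupCount(size, groupSize)
    return math.ceil(size / groupSize)
end

-- Global index of a thread along one axis (0-based, like in GLSL):
-- gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
local function globalId(groupId, groupSize, localId)
    return groupId * groupSize + localId
end

print(groupCount(512, 16))   -- 32
print(groupCount(1000, 16))  -- 63
print(globalId(62, 16, 15))  -- 1007
```

Note the second case: 63 groups of 16 cover 1008 columns, so for a 1000 pixel wide image the last group contains invocations with x from 1000 to 1007 that fall outside the image. Rounding up is what guarantees full coverage, at the price of a few surplus threads whenever the size is not a multiple of the group size.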
Let’s write the function to invoke our shader:
function getContentBounds(canvas, dpi)
    local W, H = canvas:getPixelDimensions()
    -- [minX,minY,maxX,maxY] ← [W-1,H-1,0,0]
    buf:setArrayData({ W - 1, H - 1, 0, 0 })
    cs:send("Src", canvas)
    cs:send("Bounds", buf)
    love.graphics.dispatchThreadgroups(
        cs,
        math.ceil(W / workGroupSize),
        math.ceil(H / workGroupSize),
        1
    )
    local raw = love.graphics.readbackBuffer(buf) -- ByteData
    local data = raw:getString() -- 16-byte string
    local minX, minY, maxX, maxY = love.data.unpack("I4I4I4I4", data)
    if minX > maxX or minY > maxY then return nil end -- fully transparent
    return minX / dpi, minY / dpi, (maxX - minX + 1) / dpi, (maxY - minY + 1) / dpi
end
This needs to be DPI-aware. This is done by:
- reading the physical pixel dimension of the Canvas, not the logical size
- scaling the output bounds back to logical pixels using the DPI parameter (e.g. 2.0 on a Retina display)
Other than that, we pass the parameters to the shader, schedule enough work groups, and parse the output.
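The physical-to-logical conversion is easy to check in isolation. Below is a minimal plain-Lua sketch of just that step; toLogicalRect is a hypothetical helper name, not part of LÖVE or of the project.

```lua
-- Hypothetical helper: convert inclusive physical-pixel bounds
-- (minX, minY, maxX, maxY) into a logical x, y, width, height rectangle.
local function toLogicalRect(minX, minY, maxX, maxY, dpi)
    if minX > maxX or minY > maxY then
        return nil -- the bounds were never updated: no content found
    end
    -- +1 because min and max are both inclusive pixel indices
    local w = maxX - minX + 1
    local h = maxY - minY + 1
    return minX / dpi, minY / dpi, w / dpi, h / dpi
end

-- Physical bounds as they might come back at a DPI scale of 2
print(toLogicalRect(200, 100, 599, 299, 2))
```

For this input the rectangle starts at logical (100, 50) and measures 200 by 100 logical pixels. Keep in mind that the result is not always a whole number: an odd physical coordinate at a DPI scale of 2 lands on a half logical pixel.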
The core idea of the shader is to continually refine the bounds stored in buf. Inside the shader, every invocation that finds a content pixel expands the box held in the buf SSBO:
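/* bounds.comp – find min/max XY of pixels whose alpha > 0 */
layout(local_size_x = 16, local_size_y = 16) in;

/* Texture to scan */
layout(rgba32f) readonly uniform image2D Src;

/* SSBO with four uints: minX, minY, maxX, maxY */
layout(std430) buffer Bounds { uint b[]; };

void computemain()
{
    ivec2 id = ivec2(gl_GlobalInvocationID.xy);

    /* Skip surplus invocations when the image size is not a multiple of 16 */
    ivec2 size = imageSize(Src);
    if (id.x >= size.x || id.y >= size.y) return;

    vec4 p = imageLoad(Src, id);
    if (p.a > 0.0) {
        /* explicit casts: the buffer holds uints, the coordinates are ints */
        atomicMin(b[0], uint(id.x));
        atomicMin(b[1], uint(id.y));
        atomicMax(b[2], uint(id.x));
        atomicMax(b[3], uint(id.y));
    }
}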
What’s going on here:
- declare the work group layout: 2D 16x16
- declare the readonly input canvas
- declare the mutable buf, now Bounds, as an unsized array
- the entrypoint to a compute shader in Love is always computemain
- gl_GlobalInvocationID is the global 3D index of this thread; since our work group dimensions align with the input data (2D image), we can grab the x and y components and treat them as pixel coordinates directly
- imageLoad then uses those coordinates to pull from the input
- 0.0 is the alpha threshold for content; content may be blurred, anti-aliased, etc. and any active pixel should count as content
- atomicMin/atomicMax are atomic read-modify-write operations: each one stores the smaller (or larger) of the current value and the argument in a single step, so even though many threads run in parallel, no update gets lost
So each work group covers one 16x16 tile of the image, and every content pixel atomically expands the TCB box.
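For comparison, the same reduction can be written as a plain loop on the CPU. This is only a reference sketch for understanding and for checking shader output on tiny inputs, not how the project does it; contentBoundsCPU and the alpha table layout are assumptions made for this example.

```lua
-- CPU reference of the same reduction, for illustration and testing only.
-- `alpha` is a hypothetical table of rows, alpha[y][x], using 1-based Lua
-- indices; the returned bounds are 0-based like the shader's pixel coordinates.
local function contentBoundsCPU(alpha, W, H)
    -- Same sentinel start values as the buffer upload: an "inverted" box
    local minX, minY, maxX, maxY = W - 1, H - 1, 0, 0
    for y = 0, H - 1 do
        for x = 0, W - 1 do
            if alpha[y + 1][x + 1] > 0 then
                -- What atomicMin / atomicMax do, minus the parallelism
                minX = math.min(minX, x)
                minY = math.min(minY, y)
                maxX = math.max(maxX, x)
                maxY = math.max(maxY, y)
            end
        end
    end
    if minX > maxX or minY > maxY then return nil end -- no content
    return minX, minY, maxX, maxY
end

-- 6x4 image with content at (2,1) and (4,2)
local img = {
    {0, 0, 0, 0, 0,   0},
    {0, 0, 1, 0, 0,   0},
    {0, 0, 0, 0, 0.5, 0},
    {0, 0, 0, 0, 0,   0},
}
print(contentBoundsCPU(img, 6, 4)) -- 2  1  4  2
```

The order in which pixels are visited does not matter, because min and max are commutative and associative. That property is what lets the GPU version process all pixels in parallel and still arrive at the same box.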
Finally, we can call it and render a demo:
local function drawToCanvas()
    love.graphics.setCanvas(canvas)
    love.graphics.clear(0, 0, 0, 0)
    -- rotating white square
    love.graphics.push()
    love.graphics.translate(256, 256)
    love.graphics.rotate(time)
    love.graphics.setColor(1, 1, 1, 1)
    love.graphics.rectangle("fill", -64, -64, 128, 128)
    love.graphics.pop()
    -- moving red circle
    love.graphics.setColor(1, 0, 0, 0.8)
    love.graphics.circle("fill",
        256 + math.cos(time * 2) * 150,
        256 + math.sin(time * 2) * 150,
        40)
    love.graphics.setCanvas() -- back to the screen
end

function love.update(dt)
    time = time + dt * 0.25
end
function love.draw()
    love.graphics.clear(0.25, 0.25, 0.25)
    drawToCanvas()
    -- trimmed region
    local x, y, w, h = getContentBounds(canvas, dpi)
    if x then
        quad:setViewport(x, y, w, h)
        love.graphics.setColor(1, 1, 1, 1)
        love.graphics.draw(canvas, quad, 32 + x, 32 + y)
        love.graphics.setColor(1, 0, 1, 1)
        love.graphics.rectangle("line", 32 + x, 32 + y, w, h)
        love.graphics.setColor(1, 1, 1, 1)
        love.graphics.print(
            ("trimmed %dx%d @ (%d,%d)"):format(w, h, x, y),
            32, 16)
    else
        love.graphics.print("Canvas fully transparent", 32, 16)
    end
    -- full canvas
    love.graphics.setColor(1, 1, 1, 1)
    love.graphics.draw(canvas, 600, 32)
    love.graphics.rectangle("line", 600, 32,
        canvas:getWidth(), canvas:getHeight())
end
You can find the full project here: turbo/love12-true-content-bounds - be sure to play with the highdpi setting in conf.lua!
There are many ways to calculate TCB, and this is but one of them. Still, it serves as a nice introduction to basic compute shader concepts.
Cheers ☕