Love2D 12 compute shaders 101: Finding True Content Bounds

May 03, 2025

Last time we talked about Love 12’s new compute shader support and set up a basic project to demonstrate data transport between GPU and CPU. Today, let’s build on that and do something useful, but not terribly complex: true content bounds (TCB) detection.

TCB are quite important in compositing: there might be really expensive shader effects applied to the contents of a Canvas (Texture), but the canvas itself is larger than it needs to be. For example, if we want to apply effects to the drawing operations performed on the main window, we’d first target them towards a canvas, then apply the effect. But the canvas size is likely much larger than the content, with the rest being transparent nothingness. Dragging this emptiness through the next compositing passes wastes GPU cycles and, more importantly, memory.

Ideally, we’d trim the excess at the start of the pipeline, operate on the content only, and then return the composite, with its own bounds and origin offset. This also allows us to cache effect operations based on the content, and avoid recomputing it if the content merely moves or scales.

Here’s a visual demonstration of the love app we’re going to build:

[Image: demo tcb]

We also need to make sure this works for love apps in 1:1 and HiDPI rendering modes. This is a pretty nasty pitfall, because we’re crossing the boundary between Love’s logical pixels and the shader’s physical pixels!

Let’s start by declaring what we need:

local dpi = love.graphics.getDPIScale()

local canvas  = love.graphics.newCanvas(512, 512, {
  format       = "rgba32f",
  computewrite = true
})

local quad          = love.graphics.newQuad(0, 0, 1, 1, canvas)
local time          = 0
local workGroupSize = 16
local cs            = love.graphics.newComputeShader("bounds.comp")
local buf           = love.graphics.newBuffer(
  "uint32",
  4,
  { shaderstorage = true }
)

Our demo rendering area will be 512x512 logical pixels big. For later rendering we also declare a Quad (so we don’t have to copy into a “trimmed” Canvas), and load our shader. The shader output will be the physical pixel TCB, for which we just need 4 uints (other types would also work).

workGroupSize is the edge length of the square 2D work group used by the compute shader. GPUs organize shader invocations (runs of the main entry point) into work groups of threads. A work group has 1, 2 or 3 dimensions (the missing ones default to 1). We’ll work with a 16x16 2D work group, which means that each group runs the main entry point on 256 threads in parallel. The GPU hardware may further split these threads across cores, often in smaller batches of 32 or 64. Threads within the same work group can also use synchronization primitives to coordinate with each other.

The group size is set at compile time (you’ll see later in the shader code), and at runtime we have to make sure to schedule enough groups to cover our data. In our example, we’re working with image data, which is 2D. So we can choose a 2D work group, so that each thread maps to one pixel. The dimensions don’t have to match, and you need to profile your shader + hardware + data together to find the right parameters.
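To make the scheduling arithmetic concrete, here is a small plain-Lua sketch (no LÖVE required). The helper names groupCount and globalIndex are made up for illustration; they show how many groups are needed per axis and how a thread's global index follows from its group index and local index.

```lua
-- Hypothetical helper: number of work groups needed to cover `pixels`
-- along one axis when each group spans `groupSize` threads.
local function groupCount(pixels, groupSize)
  return math.ceil(pixels / groupSize)
end

-- Global thread index along one axis (all indices 0-based), mirroring how
-- GLSL derives gl_GlobalInvocationID from the group and local indices.
local function globalIndex(groupIndex, groupSize, localIndex)
  return groupIndex * groupSize + localIndex
end

local size = 16
print(groupCount(512, size))              -- 32: an exact fit
print(groupCount(500, size))              -- 32: the last group is only partly inside
print(groupCount(500, size) * size - 500) -- 12: threads per row outside the image
print(globalIndex(31, size, 15))          -- 511: last thread of the last group
```

Whenever the image size is not a multiple of the group size, some threads land outside the image, so the shader has to tolerate or skip those coordinates.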

Let’s write the function to invoke our shader:

function getContentBounds(canvas, dpi)
  local W, H = canvas:getPixelDimensions()

  -- [minX,minY,maxX,maxY] ← [W-1,H-1,0,0]
  buf:setArrayData({ W - 1, H - 1, 0, 0 })

  cs:send("Src",    canvas)
  cs:send("Bounds", buf)

  love.graphics.dispatchThreadgroups(
    cs,
    math.ceil(W / workGroupSize),
    math.ceil(H / workGroupSize),
    1
  )

  local raw  = love.graphics.readbackBuffer(buf) -- ByteData
  local data = raw:getString()                   -- 16-byte string

  local minX, minY, maxX, maxY = love.data.unpack("I4I4I4I4", data)

  if minX > maxX or minY > maxY then return nil end   -- fully transparent

  return minX / dpi, minY / dpi, (maxX - minX + 1) / dpi, (maxY - minY + 1) / dpi
end

The function needs to be DPI-aware, which we achieve by:

reading the physical pixel dimension of the Canvas, not the logical size
scaling the output bounds back to logical pixels using the DPI parameter (e.g. 2.0 on a Retina display)

Other than that, we pass the parameters to the shader, schedule enough work groups, and parse the output.
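The physical-to-logical conversion on the last line of getContentBounds can be tried in isolation. This is a minimal sketch in plain Lua; toLogicalRect is a hypothetical name, not part of the project, and it assumes the bounds are inclusive on both ends, as the shader produces them.

```lua
-- Hypothetical helper: convert inclusive physical-pixel bounds
-- (minX, minY, maxX, maxY) into a logical-pixel rectangle (x, y, w, h).
local function toLogicalRect(minX, minY, maxX, maxY, dpi)
  local w = maxX - minX + 1   -- +1 because maxX is inclusive
  local h = maxY - minY + 1   -- +1 because maxY is inclusive
  return minX / dpi, minY / dpi, w / dpi, h / dpi
end

-- A 100x50 physical-pixel box whose top-left corner is at (20, 40):
print(toLogicalRect(20, 40, 119, 89, 1)) -- 1:1 rendering
print(toLogicalRect(20, 40, 119, 89, 2)) -- HiDPI with a scale of 2
```

With a scale of 2 the same physical box shrinks to half its size in logical pixels, which is the unit Quad viewports and draw positions expect.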

The core idea of the shader is to continually refine the bounds stored in buf. Every invocation that lands on a non-transparent pixel expands the box held in the buf SSBO:

/* bounds.comp – find min/max XY of pixels whose alpha > 0 */

layout(local_size_x = 16, local_size_y = 16) in;

/* Texture to scan */
layout(rgba32f) readonly uniform image2D Src;

/* SSBO with four uints: minX, minY, maxX, maxY */
layout(std430) buffer Bounds { uint b[]; };

void computemain()
{
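    ivec2 id = ivec2(gl_GlobalInvocationID.xy);

    /* Skip threads of partial tiles that fall outside the image */
    if (any(greaterThanEqual(id, imageSize(Src)))) return;

    vec4  p  = imageLoad(Src, id);

    if (p.a > 0.0) {
        /* b[] holds uints, so cast the signed coordinates explicitly */
        atomicMin(b[0], uint(id.x));
        atomicMin(b[1], uint(id.y));
        atomicMax(b[2], uint(id.x));
        atomicMax(b[3], uint(id.y));
    }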
}

What’s going on here:

declare the work group layout: 2D 16x16
declare the readonly input canvas
declare the mutable buf, now Bounds as an unsized array
the entrypoint to a compute shader is always computemain
gl_GlobalInvocationID is the global 3D index of this thread across all work groups
- since our work group dimensions align with the input data (2D image), we can grab the x and y components and treat them as pixel coordinates directly
imageLoad then uses those coordinates to pull from the input
0.0 is the alpha threshold for content
- content may be blurred, anti-aliased, etc. and any active pixel should count as content
atomicMin/atomicMax are atomic read-modify-write operations: each stores the smaller (or larger) of the current buffer value and its argument
- many threads across all work groups run in parallel, but the atomics guarantee that no update to the bounds is lost

So the image is covered by 16x16 tiles, and every thread that finds content atomically expands the TCB box.
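For comparison, the same reduction can be written as a plain loop on the CPU. This is only a reference sketch for checking results, not the article's GPU method: boundsOnCpu and the alphaAt callback are hypothetical names, and the code assumes 0-based physical pixel coordinates like the shader.

```lua
-- CPU reference for the min/max reduction (a sketch, not the shader itself).
-- `alphaAt(x, y)` is a hypothetical callback returning the alpha of the
-- pixel at 0-based physical coordinates (x, y).
local function boundsOnCpu(W, H, alphaAt)
  -- Same sentinel as the GPU version: mins start at the far edge, maxes at 0.
  local minX, minY, maxX, maxY = W - 1, H - 1, 0, 0
  for y = 0, H - 1 do
    for x = 0, W - 1 do
      if alphaAt(x, y) > 0 then
        minX, minY = math.min(minX, x), math.min(minY, y)
        maxX, maxY = math.max(maxX, x), math.max(maxY, y)
      end
    end
  end
  -- Same emptiness check as getContentBounds: the sentinel was never narrowed.
  if minX > maxX or minY > maxY then return nil end
  return minX, minY, maxX, maxY
end

-- Tiny 8x8 test image: opaque pixels for x in 2..5 and y in 3..4.
local function alphaAt(x, y)
  if x >= 2 and x <= 5 and y >= 3 and y <= 4 then return 1 end
  return 0
end

print(boundsOnCpu(8, 8, alphaAt))                 -- 2  3  5  4
print(boundsOnCpu(8, 8, function() return 0 end)) -- nil
```

In a LÖVE app, alphaAt could be backed by an ImageData read back from the canvas (for example via ImageData:getPixel). That is handy for verifying the shader's output in a test, but far too slow to run every frame.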

Finally, we can call it and render a demo:

local function drawToCanvas()
  love.graphics.setCanvas(canvas)
  love.graphics.clear(0, 0, 0, 0)

  -- rotating white square
  love.graphics.push()
    love.graphics.translate(256, 256)
    love.graphics.rotate(time)
    love.graphics.setColor(1, 1, 1, 1)
    love.graphics.rectangle("fill", -64, -64, 128, 128)
  love.graphics.pop()

  -- moving red circle
  love.graphics.setColor(1, 0, 0, 0.8)
  love.graphics.circle("fill",
      256 + math.cos(time * 2) * 150,
      256 + math.sin(time * 2) * 150,
      40)

  love.graphics.setCanvas()                -- back to the screen
end

function love.update(dt)
  time = time + dt * 0.25
end

function love.draw()
  love.graphics.clear(0.25, 0.25, 0.25)

  drawToCanvas()

  -- trimmed region
  local x, y, w, h = getContentBounds(canvas, dpi)
  if x then
    quad:setViewport(x, y, w, h)

    love.graphics.setColor(1, 1, 1, 1)
    love.graphics.draw(canvas, quad, 32 + x, 32 + y)

    love.graphics.setColor(1, 0, 1, 1)
    love.graphics.rectangle("line", 32 + x, 32 + y, w, h)

    love.graphics.setColor(1, 1, 1, 1)
    love.graphics.print(
          ("trimmed %dx%d @ (%d,%d)"):format(w, h, x, y),
          32, 16)
  else
    love.graphics.print("Canvas fully transparent", 32, 16)
  end

  -- full canvas
  love.graphics.setColor(1, 1, 1, 1)
  love.graphics.draw(canvas, 600, 32)
  love.graphics.rectangle("line", 600, 32,
      canvas:getWidth(), canvas:getHeight())
end

You can find the full project here: turbo/love12-true-content-bounds - be sure to play with the highdpi setting in conf.lua!

There are many ways to calculate TCB, and this is but one of them. Still, it serves as a nice introduction to basic compute shader concepts.

Cheers ☕

#gpu #graphics #lang-lua #love2d