For the virtual production work we do at MPC, we constantly strive to achieve a high frame rate for a smooth experience.
One thing that kept cropping up over and over in our profiling sessions was culling. We often deal with a lot of geometry in the scene and we wanted to see how we could improve upon that.
The low-hanging fruit in this scenario is instancing: in our daily work it is common to deal with huge sets containing a lot of geometry complexity and large numbers of instances.
There have been several advances in GPU culling in recent years, for both frustum and occlusion culling, as covered in this talk by Kostas Anagnostou at Digital Dragons 2018.
We started our journey in dealing with the former.
The above is one of the first prototypes we created to test the performance of the frustum culling, a test scene with one million cubes.
Let’s have a look at how we can move the CPU algorithm to the GPU. The algorithm itself involves several stages:
• Find the instances in the scene
• Perform the culling and store the surviving instances in an array
• Render the instances
We are going to briefly cover all the steps one by one, but we will mostly focus on the Unity integration side of things; in the next post we will have a look at how to optimize the computation.
The first prototype we did was to spawn matrices for the objects in a grid, which is a nice and easy way to get something going but does not represent a real use case scenario. In the future, we plan to use USD to manage our scene, so we can find and read those instances directly from a USD stage.
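As a sketch, the grid spawn we used in that first prototype can be as simple as generating one transform matrix per cell (variable names here are our own illustration):

```csharp
// Illustrative grid spawn: one object-to-world matrix per grid cell.
// "side" and "spacing" are hypothetical parameters of the prototype.
var matrices = new Matrix4x4[side * side];
for (int x = 0; x < side; ++x)
    for (int z = 0; z < side; ++z)
        matrices[x * side + z] = Matrix4x4.TRS(
            new Vector3(x, 0, z) * spacing,   // position on the grid
            Quaternion.identity,              // no rotation
            Vector3.one);                     // no scaling
```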
For the time being, we had to scan the scene at start up and get all the objects that had a custom instancing component.
Once the objects were found we wrapped them into a Renderable class, which is in charge of drawing a single set of instances, something along these lines:
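A minimal sketch of what such a wrapper might look like; the class layout and member names below are our own illustration, not the exact MPC implementation:

```csharp
// Sketch of a Renderable wrapper: one instance of this class owns
// everything needed to draw a single set of instanced geometry.
public class Renderable
{
    public Mesh mesh;                  // mesh shared by every instance
    public Material material;          // material with instancing support
    public Matrix4x4[] transforms;     // one object-to-world matrix per instance
    public Bounds localBounds;         // AABB of the mesh, used for culling

    public int InstanceCount => transforms.Length;
}
```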
The Renderable class is in charge of collecting and making the instances data easier to access and manage.
Dealing with nodes in the scene had a couple of downsides. First, all those game objects make the editor harder to work with, slowing it down and causing hangs. In addition, scanning the scene at start-up adds extra time before the game starts. Dealing with a USD scene directly, we would not have to create the Unity nodes at all.
There is a lot of material on the Internet about CPU culling, so we will not spend much time on this, but focus mostly on what to change in moving the algorithm to the GPU.
Whether or not a transform is in the frustum is determined by testing the vertices of the instance's AABB against the six planes of the frustum: if all the vertices fall outside a single plane, the instance is culled; otherwise it is considered visible.
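A minimal CPU version of that test might look like the following, assuming the planes come from something like Unity's GeometryUtility.CalculateFrustumPlanes (which returns planes with inward-facing normals):

```csharp
// The instance is culled only if all corners of its AABB fall behind
// one of the six frustum planes; otherwise it is treated as visible.
static bool IsVisible(Plane[] frustumPlanes, Vector3[] aabbCorners)
{
    foreach (var plane in frustumPlanes)
    {
        int behind = 0;
        foreach (var corner in aabbCorners)
            if (plane.GetDistanceToPoint(corner) < 0f)
                behind++;

        // Every corner is behind this plane: the box is fully outside.
        if (behind == aabbCorners.Length)
            return false;
    }
    return true;
}
```

Note that this is conservative: a box whose corners straddle different planes is kept even if it is actually outside the frustum, which is fine for culling purposes.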
What we want to achieve is a contiguous array of matrices of instances that survived the culling. This is a fairly straightforward task on CPU, not so much on the GPU.
The solution is an algorithm called scan and compact, which is well covered by online resources.
The main idea is that you first perform a vote and generate an array of 0s and 1s, where 0 means the instance did not survive the culling and 1 means it needs to be rendered.
The scan algorithm lets you figure out, in parallel, where each surviving matrix needs to be copied, specifically the index in the output array at which to write your matrix.
The algorithm works in three steps:
• Vote – generate a boolean vote for each instance, defining whether or not it survived the culling.
• Scan – compute the final address of all the matrices that survived.
• Compact – use the computed address to copy each matrix to its final destination.
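The steps above can be sketched as compute kernels along these lines (the scan itself is a standard parallel prefix sum, omitted here for brevity; buffer and kernel names, and the IsInFrustum helper, are our own illustration):

```hlsl
StructuredBuffer<float4x4>   matrices;      // all instance matrices
RWStructuredBuffer<uint>     votes;         // 1 = survived, 0 = culled
StructuredBuffer<uint>       scannedVotes;  // exclusive prefix sum of votes
RWStructuredBuffer<float4x4> outMatrices;   // compacted survivors

[numthreads(64, 1, 1)]
void Vote(uint3 id : SV_DispatchThreadID)
{
    // IsInFrustum is a hypothetical helper testing the instance AABB
    // against the six frustum planes, as described earlier.
    votes[id.x] = IsInFrustum(matrices[id.x]) ? 1 : 0;
}

[numthreads(64, 1, 1)]
void Compact(uint3 id : SV_DispatchThreadID)
{
    // The scanned vote gives each survivor its slot in the output array.
    if (votes[id.x] == 1)
        outMatrices[scannedVotes[id.x]] = matrices[id.x];
}
```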
All this can be achieved with compute shaders (more on this later), you just need to load all the needed data (frustum, matrices) in Unity’s Compute buffers, bind them and kick the shader.
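The Unity-side setup might look like this sketch, with illustrative variable and kernel names (the frustum planes are packed as one float4 each, normal plus distance):

```csharp
// Upload the instance matrices and frustum planes to compute buffers,
// bind them, and kick the vote kernel. Names are illustrative.
var matrixBuffer = new ComputeBuffer(instanceCount, sizeof(float) * 16);
matrixBuffer.SetData(instanceMatrices);

var frustumBuffer = new ComputeBuffer(6, sizeof(float) * 4);
frustumBuffer.SetData(frustumPlanes);

int kernel = cullShader.FindKernel("Vote");
cullShader.SetBuffer(kernel, "matrices", matrixBuffer);
cullShader.SetBuffer(kernel, "votes", voteBuffer);
cullShader.SetBuffer(kernel, "frustumPlanes", frustumBuffer);

// One thread per instance, 64 threads per group.
cullShader.Dispatch(kernel, Mathf.CeilToInt(instanceCount / 64f), 1, 1);
```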
We now have an array of matrices representing the positions of the meshes we want to render. A naive approach would be to read those matrices back to the CPU and call a DrawMeshInstanced method.
That would certainly work, but it would be extremely inefficient: every CPU-GPU synchronization costs a lot of performance. Reading back from the GPU is extremely slow, and it stalls the queued GPU frames, leaving the GPU idle with no work to do.
Usually this cost can be amortized by accepting latency: you use the result a few frames later and try to compensate for the camera movement, which is how CPU occlusion queries worked, for example.
Luckily this is such a common problem that there is a built-in solution in the graphics API.
It is called indirect rendering: since during the GPU computation you know exactly how many elements survived and need to be rendered, you write the arguments for the render call into a small GPU buffer and tell the driver to render the geometry, fetching the configuration from that buffer.
In this way there is no need to sync the CPU and GPU and the GPU is automatically able to generate extra work for itself.
Let's have a look at how we can do that in Unity.
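As a sketch, the small configuration buffer for Unity's DrawMeshInstancedIndirect holds five uints; the instance-count slot starts at zero and is filled on the GPU by the compact step:

```csharp
// Indirect args layout expected by DrawMeshInstancedIndirect:
// index count per instance, instance count, start index location,
// base vertex location, start instance location.
var args = new uint[5]
{
    mesh.GetIndexCount(0),  // indices of submesh 0
    0,                      // instance count, written by the GPU later
    mesh.GetIndexStart(0),
    mesh.GetBaseVertex(0),
    0
};
var indirectBuffer = new ComputeBuffer(
    1, args.Length * sizeof(uint), ComputeBufferType.IndirectArguments);
indirectBuffer.SetData(args);
```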
At this point we have already performed the culling computation, and the result is in the buffer named “inMatrices”.
You can bind the matrix buffer in the following way:
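A sketch of that binding, assuming the shader declares a StructuredBuffer named inMatrices:

```csharp
// Expose the compacted matrix buffer to the instanced material.
material.SetBuffer("inMatrices", inMatrices);
```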
We then perform the render with:
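A minimal version of that call might look as follows; the explicit bounds are needed because Unity no longer knows where the geometry is:

```csharp
// Submit the draw call; the GPU reads the instance count directly
// from indirectBuffer, so no CPU read-back is needed.
Graphics.DrawMeshInstancedIndirect(
    mesh,            // mesh shared by all instances
    0,               // submesh index
    material,        // material using the instanced shader
    renderBounds,    // world-space bounds of the whole instance set
    indirectBuffer); // args buffer written by the compact step
```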
Where the last argument is the indirectBuffer holding the render configuration.
To note, when rendering with DrawMeshInstanced/Indirect, you need to manually enable some shader flags to get a correct render:
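For a surface shader these are the pragmas we would expect to need; "setup" is the name of a user-defined function (shown further below) that patches the per-instance matrices:

```hlsl
// Enable GPU instancing and procedural instance data fetching.
#pragma multi_compile_instancing
#pragma instancing_options procedural:setup
```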
The shader side of things is particularly simple: the only thing we want to do is fetch the correct matrix and pass it to the vertex shader.
The only thing to do in the shader is to patch the unity_ObjectToWorld matrix with the one coming from our StructuredBuffer, then we just forward the vertex to the normal deferred Unity function.
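A sketch of that setup function, assuming the buffer is bound under the name inMatrices:

```hlsl
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
// Compacted matrices written by the culling pass.
StructuredBuffer<float4x4> inMatrices;
#endif

void setup()
{
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    // Patch the built-in matrix with ours; the rest of the Unity
    // vertex path then runs unchanged.
    unity_ObjectToWorld = inMatrices[unity_InstanceID];
    // unity_WorldToObject must be patched too (see the note below).
#endif
}
```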
To note: you will also need to compute unity_WorldToObject, which is the inverse of unity_ObjectToWorld. If you put some restrictions in place (no scaling, etc.) you can compute it more cheaply. It might be worth computing it at compact time.
At this point we have a basic prototype of GPU culling, but does it work? How does it perform? We tested the culling on a real scene, an outdoor set with around 400,000 trees, grouped in four different instancing groups of 100,000 instances each.
The regular Unity frame rate was around 10-12 FPS; the GPU culling version ran at around 2,500 FPS. That is not a bad speed-up! Don't get too focused on performance right now; we will discuss it at length in the next post.
The major speed-up did not come only from the GPU culling itself, but also from Unity not having to handle that many components.
This is really promising, but it is only one side of the story. There is a major issue with this system, and weirdly enough, the problem is shadowing.
When Unity performs the culling, it also computes which geometry ends up in which shadow cascade, and will only submit the necessary geometries to the render of each shadow cascade.
When using DrawMeshInstanced/Indirect, Unity has no idea of what is being rendered or where it is located, so the full set of geometry is rendered for each shadow cascade. With the default four cascades, we are submitting more than a million trees: we are rendering four times the amount of geometry we normally would.
As far as we are aware, in the normal rendering pipeline there is no way to fix this. The solution might be a custom scriptable render pipeline, or modifying Unity's HD render pipeline.
The SRP is Unity's new way of rendering, and it completely exposes the render loop to the user. Upon inspection, we were able to find where Unity renders the cascaded shadow maps, and we should be able to hook in there, perform culling on a per-cascade basis, and only render the surviving instances.
On a scene with a high count of high-resolution stones, our custom culling ended up being as fast as the regular Unity rendering, although we were rendering four times as much geometry.
There is quite a bit of potential there!