Working Towards Importance Sampled Direct Lighting

I haven’t made a post in a few weeks now since I’ve been working on a number of different things all of which aren’t quite done yet. Since its been a few weeks, here’s a writeup of one of the things I’m working on and where I am with that.

One of the major features I’ve been working towards for the past few weeks is full multiple importance sampling, which will serve a couple of purposes. First, importance sampling the direct lighting contribution in the image should allow for significantly higher convergence rates for the same amount of compute, allowing for much smoother renders for the same render time. Second, MIS will serve as groundwork for future bidirectional integration schemes, such as Metropolis transport and photon mapping. I’ve been working with my friend Xing Du on understanding the math behind MIS and figuring out how exactly the math should translate into implementation.

So first off, some ground truth tests. All ground truth tests are rendered using brute force pathtracing with hundreds of thousands of iterations per pixel. Here is the test scene I’ve been using lately, with all surfaces reduced to lambert diffuse for the sake of simplification:

Ground truth global illumination render, representing 512000 samples per pixel. All lights sampled by BRDF only.

Ground truth for direct lighting contribution only, with all lights sampled by BRDF only.

Ground truth for indirect lighting contribution only.

The motivation behind importance sampling lights by directly sampling objects with emissive materials comes from the difficulty of finding useful samples from the BRDF only; for example, for the lambert diffuse case, since sampling from only the BRDF produces outgoing rays in totally random (or, slightly better, cosine weighted random) directions, the probability of any ray coming from a diffuse surface actually hitting a light is relatively low, meaning that the contribution of each sample is likely to be low as well. As a result, finding the direct lighting contribution via just BRDF sampling.

For example, here’s the direct lighting contribution only, after 64 samples per pixel with only BRDF sampling:

Direct lighting contribution only, all lights sampled by BRDF only, 64 samples per pixel.

Instead of sampling direct lighting contribution by shooting a ray off in a random direction and hoping that maybe it will hit a light, a much better strategy would be to… shoot the ray towards the light source. This way, the contribution from the sample is guaranteed to be useful. There’s one hitch though: the weighting for a sample chosen using the BRDF is relatively simple to determine. For example, in the lambert diffuse case, since the probability of any particular random sample within a hemisphere is the same as any other sample, the weighting per sample is even with all other samples. Once we selectively choose the ray direction specifically towards the light though, the weighting per sample is no longer even. Instead, we must weight each sample by the probability of a ray going in that particular direction towards the light, which we can calculate by the solid angle subtended by the light source divided by the total solid angle of the hemisphere.

So, a trivial example case would be if a point was being lit by a large area light subtending exactly half of the hemisphere visible from the point. In this case, the area light subtends Pi steradians, making its total weight Pi/(2*Pi), or one half.

The tricky part of calculating the solid angle weighting is in calculating the fractional unit-spherical surface area projection for non-uniform light sources. In other words, figuring out what solid angle a sphere subtends is easy, but figuring out what solid angle a Stanford Bunny subtends is…. less easy.

The initial approach that Xing and I arrived at was to break complex meshes down into triangles and treat each triangle as a separate light, since calculating the solid angle subtended by a triangle is once again easy. However, since treating a mesh as a cloud of triangle area lights is potentially very expensive, for each iteration the direct lighting contribution from all lights in the scene becomes potentially untenable, meaning that each iteration of the render will have to randomly select a small number of lights to directly sample.

As a result, we brainstormed some ideas for potential shortcuts. One shortcut idea we came up with was that instead of choosing an evenly distributed point on the surface of the light to serve as the target for our ray, we could instead shoot a ray at the bounding sphere for the target light and weight the resulting sample by the solid angle subtended not by the light itself, but by the bounding sphere. Our thinking was that this approach would dramatically simplify the work of calculating the solid angle weighting, while still maintaining mathematical correctness and unbiasedness since the number of rays fired at the bounding sphere that will miss the light should exactly offset the overweighting produced by using the bounding sphere’s subtended solid angle.

I went ahead and tried out this idea, and produced the following image:

Direct lighting contribution only, all lights sampled by direct sampling weighted by subtended solid angle, 64 samples per pixel.

First off, for the most part, it works! The resulting direct illumination matches the ground truth and the BRDF-sampling render, but is significantly more converged than the BRDF-sampling render for the same number of samples. BUT, there is a critical flaw: note the black circle around the light source on the ceiling. That black circle happens to fall exactly within the bounding sphere for the light source, and results from a very simple mathematical fact: calculating the solid angle subtended by the bounding sphere for a point INSIDE of the bounding sphere is undefined. In other words, this shortcut approach will fail for any points that are too close to a light source.

One possible workaround I tried was to have any points inside of a light’s bounding sphere to fall back to pure BRDF sampling, but this approach is also undesirable, as a highly visible discontinuity between the differently sampled area develops due to vastly different convergence rates.

So, while the overall solid angle weighting approach checks out, our shortcut does not. I’m now working on implementing the first approach described above, which should produce a correct result, and will post in the next few days.

Stratified versus Uniform Sampling

As part of Takua Render’s new pathtracing core, I’ve implemented a system allowing for multiple sampling methods instead of just uniform sampling. The first new sampling method I’ve added in addition to uniform sampling is stratified sampling. Basically, in stratified sampling, instead of spreading samples per iteration across the entire probability region, the probability region is first divided into a number of equal sized, non-overlapping subregions, and then for each iteration, a sample is drawn with uniform probability from within a single subregion, called a strata. The result of stratified sampling is that samples are guaranteed to be more evenly spread across the entire probability domain instead of clustered within a single area, resulting in less visible noise for the same number of samples compared to uniform sampling. At the same time, since stratified sampling still maintains a random distribution within each strata, the aliasing problems associated with a totally even sample distribution are still avoided.

Here’s a video showing a scene rendered in Takua Render with uniform and then stratified sampling. The video also shows a side-by-side comparison in its last third.

In the renders in the above video, stratified sampling is being used to choose new ray directions from diffuse surface bounces; instead of choosing a random point over the entire cosine-weighted hemisphere at an intersection point, the renderer first chooses a strata with the same steradian as all other strata, and then chooses a random sample within that solid angle. The strata is chosen sequentially for primary bounces, and then chosen randomly for all secondary bounces to maintain unbiased sampling over the whole render. As a result of the sequential strata selection for primary bounces, images rendered in Takua Render will not converged to an unbiased solution until N iterations have elapsed, where N is the number of strata the probability region is divided into. The number of strata can be set by the user as a value in the scene description which is then squared to get the total strata number. So, if a user specifies a strata level of 16, then the probability region will be divided into 256 strata and a unbiased result will not be reached until 256 or more samples per pixel have been taken.

Here’s the Lamborghini model from last post at 256 samples per pixel with stratified (256 strata) and uniform sampling, to demonstrate how much less perceptible noise there is with the stratified sampler. From a distance, the uniform sampler renders may seem slightly darker side by side due to the higher percentage of noise, but if you compare them using the lightbox, you can see that the lighting and brightness is the same.

Stratified sampling, 256 strata, 256 samples per pixel

Uniform sampling, 256 samples per pixel

…and up-close crops with 400% zoom:

Stratified sampling, 256 strata, 256 samples per pixel, 400% crop

Uniform sampling, 256 samples per pixel, 400% crop

At some point soon I will also be implementing Halton sequence sampling and [0,2]-sequence sampling, but for the time being, stratified sampling is already providing a huge visual boost over uniform! In fact, I have a small secret to confess: all of the renders in the last post were rendered with the stratified sampler!

First Progress on New Pathtracing Core

I’ve started work on a completely new pathtracing core to replace the one used in Rev 2. The purpose of totally rewriting the entire pathtracing integrator and brdf systems is to produce something much more modular and robust; as much as possible, I am now decoupling brdf and new ray direction calculation from the actual pathtracing loop.

I’m still in the earliest stages of this rewrite, but I have some test images! Each of the following images was rendered out to somewhere around 25000 samples per pixel (a lot!), at about 5/6 samples per pixel per second. I let the renders run without a hard ending point and terminated them after I walked away for a while and came back, hence the inexact but enormous samples per pixel counts. Each scene was lit with my standard studio-styled lighting setup and in addition to the showcased model, uses a smooth backdrop that consists of about 10000 triangles.

Approximately 100000 face Stanford Dragon:

Approximately 150000 face Deloreon model:

Approximately 250000 face Lamborghini Aventador model:

Short-stack KD-Tree Traversal

In my last post, I talked about implementing history flag based kd-tree traversal. While the history flag based stackless traverse worked perfectly fine in terms of traversing the tree and finding the nearest intersection, I discovered over the past week that its performance is… less than thrilling. Unfortunately, the history flag system results in a huge amount of redundant node visits, since the entire system is state based and therefore necessarily needs to visit every node in a branch of the tree both to move down and up the branch.

So instead, I decided to try out a short-stack based approach. My initial concern with short-stack based approaches was the daunting memory requirements that keeping a short stack for a few hundred threads, however, I realized that realistically, a short stack never needs to be any larger than the maximum depth of the kd-tree being traversed. Since I haven’t yet had a need to test a tree with a depth beyond 100, the memory usage required for keeping short stacks is reasonably predictable and manageable; as a precaution, however, I’ve also decided to allow for the system to fall back to a stackless traverse in the case that a tree’s depth causes short stack memory usage to become unreasonable.

The actual short-stack traverse I’m using is a fairly standard while-while traverse based on the 2008 Kun Zhou realtime kd-tree paper and the 2007 Daniel Horn GPU kd-tree paper. I’ve added one small addition though: in addition to keeping a short stack for traversing the kd-tree, I’ve also added an optional second short stack that tracks the last N intersection test objects. The reason for keeping this second short stack is that kd-trees allow for objects to be split across multiple nodes; by tracking which objects we have already encountered, we can safely detect and skip objects that have already been tested. The object tracking short stack is meant to be rather small (say, no more than 10 to 15 objects at a time), and simply loops back and overwrites the oldest values in the stack when it overflows.

The new while-while traversal is significantly faster than the history flag approach, to the order of a 10x or better performance increase in some cases.

In order to validate that the entire kd traversal system works, I did a quick and dirty port of the old Rev 2 pathtracing integrator to run on top of the new Rev 3 framework. The following test images contain about 20000 faces and objects, and clocked in at about 6 samples per pixel per second with a tree depth of 15. Each image was rendered to 1024 samples per pixel:

I also attempted to render these images without any kd-tree acceleration as a control. Without kd-tree acceleration, each sample per pixel took upwards of 5 seconds, and I wound up terminating the renders before they got even close to completion.

The use of my old Rev 2 pathtracing core is purely temporary, however. The next task I’ll be tackling is a total rewrite of the entire pathtracing system and associated lighting and brdf evaluation systems. Previously, this systems have basically been monolithic blocks of code, but with this rewrite, I want to create a more modular, robust system that can recycle as much code as possible between GPU and CPU implementations, the GL debugger, and eventually other integration methods, such as photon mapping.

Stackless KD-Tree Traversal

I have a working, reasonably optimized, speedy GPU stackless kd-tree traversal implementation! Over the past few days, I implemented the history flag-esque approach I outlined in this post, and it works quite well!

The following image is a heatmap of a kd-tree built for the Stanford Dragon, showing the cost of tracing a ray through each pixel in the image. Brighter values mean more node traversals and intersection tests had to be done for that particular ray. The image was rendered entirely using Takua Render’s CUDA pathtracing engine, and took roughly 100 milliseconds to complete:

…and a similar heatmap, this time generated for a scene containing two mesh cows, two mesh helixes, and some cubes and spheres in a box:

Although room for even further optimization still exists, as it always does, I am quite happy with the results so far. My new kd-tree construction system and stackless traversal system are both several orders of magnitude faster and more efficient than my older attempts.

Here’s a bit of a cool image: in my OpenGL debugging view, I can now follow the kd-tree traversal for a single ray at a time and visualize the exact path and nodes encountered. This tool has been extremely useful for optimizing… without a visual debugging tool, no wonder my previous implementations had so many problems! The scene here is the same cow/helix scene, but rotated 90 degrees. The bluish green line coming in from the left is the ray, and the green boxes outline the nodes of the kd-tree that traversal had to check to get the correct intersection.

…and here’s the same image as above, but with all nodes that were skipped drawn in red. As you can see, the system is now efficient enough to cull the vast vast majority of the scene for each ray:

The size of the nodes relative to the density of the geometry in their vicinity also speaks towards the efficiency of the new kd-tree construction system: empty spaces are quickly skipped through with enormous bounding boxes, whereas high density areas have much smaller bounding boxes to allow for efficient culling.

Over the next day or so, I fully expect I’ll be able to reintegrate the actual pathtracing core, and have some nice images! Since the part of Takua that needed rewriting the most was the underlying scene and kd-tree system, I will be able to reuse a lot of the BRDF/emittance/etc. stuff from Takua Rev 2.

Revision 3 KD-Tree/ObjCore

The one piece of Takua Render that I’ve been proudest of so far has been the KD-Tree and obj mesh processing systems that I built. So of course, over the past week I completely threw away the old versions of KdCore and ObjCore and totally rewrote new versions entirely from scratch. The motive behind this rewrite came mostly from the fact that over the past year, I’ve learned a lot more about KD-Trees and programming in general; as a result, I’m pleased to report that the new versions of KdCore and ObjCore are significantly faster and more memory efficient than previous versions. KdCore3 is now able to process a million objects into an efficient, optimized KD-Tree with a depth of 20 and a minimum of 5 objects per leaf node in roughly one second.

Here’s my kitchen scene, exported to Takua Render’s AvohkiiScene format, and processed through KdCore3. White lines are the wireframe lines for the geometry itself, red lines represent KD-Tree node boundaries:

…and the same image as above, but with only the KD-Tree. You can use the lightbox to switch between the two images for comparisons:

One of the most noticeable improvements in KdCore3 over KdCore2, aside from the speed increases, is in how KdCore3 manages empty space. In the older versions of KdCore, empty space was often repeatedly split into multiple nodes, meaning that ray traversal through empty space was very inefficient, since repeated intersection tests would be required only for a ray to pass through the KD-Tree without actually hitting anything. The images in this old post demonstrate what I mean. The main source of this problem came from how splits were being chosen in KdCore2; in KdCore2, the chosen split was the lowest cost split regardless of axis. As a result, splits were often chosen that resulted in long, narrow nodes going through empty space. In KdCore3, the best split is chosen as the lowest cost split on the longest axis of the node. As a result, empty space is culled much more efficiently.

Another major change to KdCore3 is that the KD-Tree is no longer built recursively. Instead, KdCore3 builds the KD-Tree layer by layer through an iterative approach that is well suited for adaptation to the GPU. Instead of attempting to guess how deep to build the KD-Tree, KdCore3 now just takes a maximum depth from the user and builds the tree no deeper than the given depth. The entire tree is also no longer stored as a series of nodes with pointers to each other, but instead all nodes are stored in a flat array with a clever indexing scheme to allow nodes to implicitly know where their parent and child nodes are within the array. Furthermore, instead of building as a series of nodes with pointers, the tree builds directly into the array format. This array storage format again makes KdCore3 more suitable to a GPU adaptation, and also makes serializing the Kd-Tree out to disk significantly easier for memory caching purposes.

Another major change is how split candidates are chosen; in KdCore2, the candidates along each axis were the median of all contained object center-points, the middle of the axis, and some randomly chosen candidates. In KdCore3, the user can specify a number of split candidates to try along each axis, and then KdCore3 will simply divide each axis into that number of equally spaced points and use those points as candidates. As a result, KdCore3 is far more efficient than KdCore2 at calculating split candidates, can often find a better candidate with more deterministic results due to the removal of random choices, and offers the user more control over the quality of the final split.

The following series of images demonstrate KD-Trees built by KdCore3 for the Stanford Dragon with various settings. Again, feel free to use the lightbox for comparisons.

Max depth 2, min objects per node 20, min volume .0001% of whole tree

Max depth 5, min objects per node 20, min volume .0001% of whole tree

Max depth 10, min objects per node 20, min volume .0001% of whole tree

Max depth 15, min objects per node 20, min volume .0001% of whole tree

Max depth 20, min objects per node 20, min volume .0001% of whole tree

Max depth 20, min objects per node 5, min volume .0001% of whole tree

KdCore3 is also capable of figuring out when the number of nodes in the tree makes traversing the tree more expensive than brute force intersection testing all of the objects in the tree, and will stop tree construction beyond that point. I’ve also given KdCore3 an experiment method for finding best splits based on a semi-Monte-Carlo approach. In this mode, instead of using evenly split candidates, KdCore3 will make three random guesses, and then based on the relative costs of the guesses, begin making additional guesses with a probability distribution weighted towards where ever the lower relative cost is. With this approach, KdCore3 will eventually arrive at the absolute optimal cost split, although getting to this point may take some time. The number of guesses KdCore3 will attempt can be limited by the user, of course.

Finally, another one of the major improvements I made in KdCore3 was simply better use of C++. Over the past two years, my knowledge of how to write fast, effective C++ has evolved immensely, and I now write code very differently than how I did when I wrote KdCore2 and KdCore1. For example, KdCore3 avoids relying on class inheritance and virtual method table lookup (KdCore2 relied on inheritance quite heavily). Normally, virtual method lookup doesn’t add a noticeable amount of execution time to a single virtual method, but when repeated for a few million objects, the slowdown becomes extremely apparent. In talking with my friend Robert Mead, I realized that virtual method table lookup in almost, if not all implementations today necessarily means a minimum of three pointer lookups in memory to find a function, whereas a direct function call is a single pointer lookup.

If I have time to later, I’ll post some benchmarks of KdCore3 versus KdCore2. However, for now, here’s a final pair of images showcasing a scene with highly variable density processed through KdCore3. Note the keavy amount of nodes clustered where large amounts of geometry exist, and the near total emptyness of the KD-Tree in areas where the scene is sparse:

Next up: implementing some form of highly efficient stackless KD-Tree traversal, possibly even using that history based approach I wrote about before!