Bounding Boxes for Ellipsoids

Update (2014): Tavian Barnes has written a far better and more detailed post on this topic; I suggest you go read Tavian's post instead of this one.

Warning: this post is going to be pretty math-heavy.

Let’s talk about spheres, or more generally, ellipsoids. Specifically, let’s talk about calculating axis aligned bounding boxes for arbitrarily transformed ellipsoids, which is a bit of an interesting problem I recently stumbled upon while working on Takua Rev 3. I’m making this post because finding a solution took a lot of searching and I didn’t find any single collected source of information on this problem, so I figured I’d post it for both my own reference and for anyone else who may find this useful.

So what’s so hard about calculating tight axis aligned bounding boxes for arbitrary ellipsoids?

Well, consider a basic, boring sphere. The easiest way to calculate a tight axis aligned bounding box (or AABB) for a mesh is to simply min/max all of the vertices in the mesh to get two points representing the min and max points of the AABB. Similarly, getting a tight AABB for a box is easy: just use the eight vertices of the box for the min/max process. A naive approach to getting a tight AABB for a sphere seems simple then: along the three axes of the sphere, place one point on each end of the axis on the surface of the sphere, and then min/max. Figure 1 shows a 2D example of this naive approach; to extend the example to 3D, simply add two more points for the Z axis (I drew the illustrations quickly in Photoshop, so apologies for only showing 2D examples):

Figure 1.

This naive approach, however, quickly fails if we rotate the sphere such that its axes are no longer lined up nicely with the world axes. In Figure 2, our sphere is rotated, resulting in an AABB that is way too small if we min/max points along the sphere's axes:

Figure 2.

If we scale the sphere such that it becomes an ellipsoid, the same problem persists, as the sphere is just a subtype of ellipsoid. In Figures 3 and 4, the same problem found in Figures 1/2 is illustrated with an ellipsoid:

Figure 3.

Figure 4.

One possible solution is to continue using the naive min/max axes approach, but simply expand the resultant AABB by some percentage such that it encompasses the whole sphere. However, we have no way of knowing what percentage will give an exact bound, so the only feasible way to use this fix is by making the AABB always larger than a tight fit would require. As a result, this solution is almost as undesirable as the naive solution, since the whole point of this exercise is to create as tight an AABB as possible for efficient intersection culling!

Instead of min/maxing the axes, we need to use some more advanced math to get a tight AABB for ellipsoids.

We begin by noting our transformation matrix, which we'll call M. We'll also need the transpose of M, which we'll call MT. Next, we define a sphere S using a 4x4 matrix; S is just the quadric form of a sphere of radius r centered at the origin, that is, the surface x^2/(r*r) + y^2/(r*r) + z^2/(r*r) - 1 = 0:

[ 1/(r*r)    0        0      0 ]
[    0    1/(r*r)     0      0 ]
[    0       0     1/(r*r)   0 ]
[    0       0        0     -1 ]

where r is the radius of the sphere. So for a unit diameter sphere, r = .5. Once we have built S, we'll take its inverse, which we'll call SI.

We now calculate a new 4x4 matrix R = M*SI*MT. R should be symmetric when we're done, such that R = transpose(R). We'll name R's entries with zero-based row/column indices, to match the code below:

R = [ r00 r01 r02 r03 ]
  [ r01 r11 r12 r13 ]
  [ r02 r12 r22 r23 ]
  [ r03 r13 r23 r33 ]

Using R, we can now get our bounds. One note: for an affine M, r33 comes out negative, so each "max" expression below can actually evaluate to the smaller of the two roots; in practice it is safest to just min/max each pair.

zmax = (r23 + sqrt(pow(r23,2) - (r33*r22)) ) / r33;
zmin = (r23 - sqrt(pow(r23,2) - (r33*r22)) ) / r33;
ymax = (r13 + sqrt(pow(r13,2) - (r33*r11)) ) / r33;
ymin = (r13 - sqrt(pow(r13,2) - (r33*r11)) ) / r33;
xmax = (r03 + sqrt(pow(r03,2) - (r33*r00)) ) / r33;
xmin = (r03 - sqrt(pow(r03,2) - (r33*r00)) ) / r33;

…and we’re done!

Just to prove that it works, here's a screenshot of a transformed ellipsoid inside of a tight AABB in 3D from Takua Rev 3's GL Debug view:

I’ve totally glossed over the mathematical rationale behind this method in this post and focused just on how to quickly get a working implementation, but if you want to read more about the actual math behind how it works, these are the two sources I pulled this from:

Stack Overflow post by user fd

Article by Inigo Quilez

In other news, Takua Rev 3’s new scene system is now complete and I am working on a brand new, better, faster, stackless KD-tree implementation. More on that later!

Revision 3, Old Renders

At the beginning of the semester, I decided to re-architect Takua again, hence the lack of updates for a couple of weeks now. I’ll talk more in-depth about the details of how this new architecture works in a later post, so for now I’ll just quickly describe the motivation behind this second round of re-architecting. As I wrote about before, I’ve been keeping parallel CPU and GPU branches of my renderer so far, but the two branches have increasingly diverged. On top of that, the GPU branch of my renderer, although significantly better organized than the experimental CUDA renderer from spring 2012, still is rather suboptimal; after TAing CIS565 for a semester, I’ve developed what I think are some better ways of architecting CUDA code. Over winter break, I began to wonder if merging the CPU and GPU branches might be possible, and if such a task could be done, how I might go about doing it.

This newest re-structuring of Takua accomplishes that goal. I’m calling this new version of Takua “Revision 3”, as it is the third major rewrite.

My new architecture centers around a couple of observations. First, we can observe that the lowest common denominator (so to speak) for structured data in CUDA and C++ is… a struct. Similarly, the easiest way to recycle code between CUDA and C++ is to implement code as inlineable, C style functions that can either be embedded in a CUDA kernel at compile time, or wrapped within a C++ class for use in C++. Therefore, one possible way to unify CPU C++ and GPU CUDA codebases could be to implement core components of the renderer using structs and C-style, inlineable functions, allowing easy integration into CUDA kernels, and then write thin wrapper classes around said structs and functions to allow for nice, object oriented C++ code. This exact system is how I am building Takua Revision 3; the end result should be a unified codebase that can compile to both CPU and GPU versions, and allow for both versions to develop in near lockstep.
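As a rough sketch of what this looks like in practice (hypothetical names, not Takua's actual code), a core component might be structured something like this:

// When compiling with nvcc, mark shared functions as usable on both host and
// device; when compiling as plain C++, the qualifiers compile away.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Plain struct: the lowest common denominator between C++ and CUDA.
struct SphereData {
    float center[3];
    float radius;
};

// C-style, inlineable function that can be embedded in a CUDA kernel or
// called from C++ unchanged.
HOST_DEVICE inline bool sphereContains(const SphereData& s, float x, float y, float z) {
    float dx = x - s.center[0];
    float dy = y - s.center[1];
    float dz = z - s.center[2];
    return dx * dx + dy * dy + dz * dz <= s.radius * s.radius;
}

// Thin wrapper class for nice, object oriented host-side C++ code.
class Sphere {
public:
    explicit Sphere(const SphereData& data) : m_data(data) {}
    bool contains(float x, float y, float z) const { return sphereContains(m_data, x, y, z); }
private:
    SphereData m_data;
};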

Again, I’ll go into a more detailed explanation once this process is complete.

I’ll leave this post with a slightly orthogonal note; whilst in the process of merging code, I found some images from Takua Revision 1 that I never posted for some reason. Here’s a particularly cool pair of images from when I was implementing depth of field. The first image depicts a glass Stanford dragon without any depth of field, and the second image depicts the same exact scene with some crazy shallow aperture (I don’t remember the exact settings). You can tell these are from the days of Takua Revision 1 by the ceiling; I often made the entire ceiling a light source to speed up renders back then, until Revision 2’s huge performance increases rendered cheats like that unnecessary.

Glass Stanford dragon without depth of field

Glass Stanford dragon with depth of field

Texture Mapping

A few weeks back I started work on another piece of super low-hanging fruit: texture mapping! Before I delve into the details, here’s a test render showing three texture mapped spheres with varying degrees of glossiness in a glossy-walled Cornell box. I was also playing with logos for Takua render and put a test logo idea on the back wall for fun:

…and the same scene with the camera tilted down just to show off the glossy floor (because I really like the blurry glossy reflections):

My texturing system can, of course, support textures of arbitrary resolution. The black and white grid and colored UV tile textures in the above render are square 1024x1024, while the Earth texture is rectangular 1024x512. Huge textures are handled just fine, as demonstrated by the following render using a giant 2048x2048, color tweaked version of Simon Page’s Space Janus wallpaper:

Of course UV transformations are supported. Here’s the same texture with a 35 degree UV rotation applied and tiling switched on:

Since memory is always at a premium, especially on the GPU, I've implemented textures in a fashion inspired by geometry instancing and node-based material systems, such as Maya's. Inside of my renderer, I represent texture files as a file node containing the raw image data, streamed from disk via stb_image. I then apply transformations, UV operations, etc. through a texture properties node, which maintains a pointer to the relevant texture file node, and then materials point to whatever texture properties nodes they need. This way, texture data can be read and stored once in memory and recycled as many times as needed, meaning that a well-formatted scene file can altogether eliminate the need for redundant texture read/storage in memory. This system allows me to create amusing scenes like the following one, where a single striped texture is reused in a number of materials with varied properties:

Admittedly I made that stripe texture really quickly in Photoshop without too much care for straightness of lines, so it doesn’t actually tile very well. Hence why the sphere in the lower front shows a discontinuity in its texture… that’s not glitchy UVing, just a crappy texture!

I’ve also gone ahead and extended my materials system to allow any material property to be driven with a texture. In fact, the stripe room render above is using the same stripe texture to drive reflectiveness on the side walls, resulting in reflective surfaces where the texture is black and diffuse surfaces where the texture is white. Here’s another example of texture driven material properties showing emission being driven using the same color-adjusted Space Janus texture from before:

Even refractive and reflective index of refraction can be driven with textures, which can yield some weird/interesting results. Here are a pair of renders showing a refractive red cube with uniform IOR, and with IOR driven with a Perlin noise map:

Uniform refractive IOR

Refractive IOR driven with a Perlin noise texture map

The nice thing about a node-style material representation is that I should be able to easily plug in procedural functions in place of textures whenever I get around to implementing some (that way I can use procedural Perlin noise instead of using a noise texture).
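To make the node idea a little more concrete, here's a minimal sketch of what such a setup might look like (hypothetical names and fields, not Takua's actual classes):

#include <memory>
#include <string>
#include <vector>

// File node: owns the raw image data, loaded from disk (via stb_image) exactly once.
struct TextureFileNode {
    std::string filepath;
    int width = 0;
    int height = 0;
    std::vector<unsigned char> pixels;
};

// Properties node: UV transforms, tiling, etc. applied to a shared file node.
// A procedural noise node could eventually slot in here in place of a file node.
struct TexturePropertiesNode {
    std::shared_ptr<TextureFileNode> file;
    float uvRotationDegrees = 0.0f;
    float uvScale[2] = {1.0f, 1.0f};
    bool tile = false;
};

// Materials point at whichever properties nodes they need, so any number of
// materials can reuse the same underlying image data without duplicating it.
struct Material {
    std::shared_ptr<TexturePropertiesNode> color;
    std::shared_ptr<TexturePropertiesNode> reflectiveness;
    std::shared_ptr<TexturePropertiesNode> emission;
};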

Here’s an admittedly kind of ugly render using the color UV grid texture to drive refractive color:

For some properties, I've had to require the user to specify a range of valid values when using a texture map, since RGB values don't map well to said properties. An example would be glossiness, where a gloss value range of 0% to 100% leaves little room for detailed adjustment. Of course, this issue can be fixed by adding support for floating point image formats such as OpenEXR, which is coming very soon! In the following render, the back wall's glossiness is being driven using the stripe texture (texture driven IOR is also still in effect on the red refractive cube):

Of course, even with nice instancing schemes, textures can potentially take up a gargantuan amount of memory, which poses a huge problem in the GPU world where onboard memory is at a premium. I still need to think more about how I'm going to deal with memory footprints larger than on-device memory, but at the moment my plan is to let the renderer allocate and overflow into pinned host memory whenever it detects that the needed footprint is within some margin of total available device memory. This concern is also a major reason why I've decided to stick with CUDA for now… until OpenCL gets support for a unified address space for pinned memory, I'm not wholly sure how I'm supposed to deal with memory overflow issues in OpenCL. I haven't reexamined OpenCL in a little while now though, so perhaps it is time to take another look.
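For reference, here's a rough sketch of that overflow policy using standard CUDA runtime calls; the policy itself is still just a plan, and the safety margin here is arbitrary:

#include <cuda_runtime.h>

// Allocate a buffer on the device if it fits within a safety margin of free
// device memory; otherwise fall back to pinned host memory mapped into the
// device address space. Assumes cudaSetDeviceFlags(cudaDeviceMapHost) was
// called at startup so that mapped pinned memory is available.
void* allocateBuffer(size_t bytes, bool& onDevice) {
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    const size_t safetyMargin = totalBytes / 10;  // keep ~10% headroom (arbitrary)
    void* devicePtr = nullptr;
    if (bytes + safetyMargin < freeBytes && cudaMalloc(&devicePtr, bytes) == cudaSuccess) {
        onDevice = true;
        return devicePtr;
    }

    // Overflow path: pinned, mapped host memory that the GPU can read over the bus.
    void* hostPtr = nullptr;
    cudaHostAlloc(&hostPtr, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devicePtr, hostPtr, 0);
    onDevice = false;
    return devicePtr;
}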

Unfortunately, something I discovered while in the process of extending my material system to support texture driven properties is that my renderer could probably use a bit of refactoring for the sake of organization and readability. Since I now have some time over winter break and am planning on making my Github repo for Takua-RT public soon, I’ll probably undertake a bit of code refactoring over the next few weeks.

Blurred Glossy Reflections

Over the past few months I haven't been making as much progress on my renderer as I would have liked, mainly because another major project has been occupying most of my attention: TAing/restructuring the GPU Programming course here at Penn. I'll probably write a post at the end of the semester with detailed thoughts and comments about that experience.

I recently had a bit of extra time, which I used to tackle a piece of super low hanging fruit: blurred glossy reflections. The simplest brute force approach to blurred glossy reflections is to take the reflected ray direction from specular reflection, construct a lobe around that ray, and sample across the lobe instead of only along the reflected direction. The wider the lobe, the blurrier the glossy reflection gets. The following diagram, borrowed from Wikipedia, illustrates this property:

In a normal raytracer or rasterization based renderer, blurred glossy reflections require something of a compromise between speed and visual quality (much like many other effects!), since using a large number of samples within the glossy specular lobe to achieve a high quality reflection can be prohibitively expensive. This cost-quality tradeoff is therefore similar to the tradeoffs that must be made in any distributed raytracing effect. However, in a pathtracer, we’re already using a massive number of samples, so we can fold the blurred glossy reflection work into our existing high sample count. In a GPU renderer, we have massive amounts of compute as well, making blurred glossy reflections far more tractable than in a traditional system.
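For the lobe sampling itself, here's a minimal sketch of one common approach, uniformly sampling a cone of directions around the mirror reflection direction. This is an illustration rather than Takua-RT's exact sampling scheme; u1 and u2 are uniform random numbers in [0,1):

#include <cmath>

struct Vec3 { float x, y, z; };

Vec3 normalize(const Vec3& v) {
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x / len, v.y / len, v.z / len};
}

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Sample a direction within a cone of half-angle maxTheta around the perfect
// specular reflection direction r (assumed normalized). A wider cone gives a
// blurrier glossy reflection; maxTheta = 0 degenerates to a mirror.
Vec3 sampleGlossyLobe(const Vec3& r, float maxTheta, float u1, float u2) {
    const float PI = 3.14159265358979f;
    float cosTheta = 1.0f - u1 * (1.0f - std::cos(maxTheta));
    float sinTheta = std::sqrt(1.0f - cosTheta * cosTheta);
    float phi = 2.0f * PI * u2;

    // Build an orthonormal basis around r, then offset within the cone.
    Vec3 helper = std::fabs(r.x) < 0.9f ? Vec3{1.0f, 0.0f, 0.0f} : Vec3{0.0f, 1.0f, 0.0f};
    Vec3 tangent = normalize(cross(helper, r));
    Vec3 bitangent = cross(r, tangent);

    return normalize({r.x * cosTheta + (tangent.x * std::cos(phi) + bitangent.x * std::sin(phi)) * sinTheta,
                      r.y * cosTheta + (tangent.y * std::cos(phi) + bitangent.y * std::sin(phi)) * sinTheta,
                      r.z * cosTheta + (tangent.z * std::cos(phi) + bitangent.z * std::sin(phi)) * sinTheta});
}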

The image at the top of this post shows three spheres of varying gloss amounts in a modified Cornell box with a glossy floor and reflective colored walls, rendered entirely inside of Takua-RT. Glossy to glossy light transport is an especially inefficient scenario to resolve in pathtracing, but throwing brute force GPU compute at it allows for arriving at a good solution reasonably quickly: the above image took around a minute to render at 800x800 resolution. Here is another test of blurred glossy reflections, this time in a standard diffuse Cornell box:

…and some tests showing varying degrees of gloss, within a modified Cornell box with glossy left and right walls. Needless to say, all of these images were also rendered entirely inside of Takua-RT.

Full specular reflection

Approximately 10% blurred glossy reflection

Approximately 30% blurred glossy reflection

Finally, here’s another version of the first image in this post, but with the camera in the wrong place. You can see a bit of the stand-in sky I have right now. I’m working on a sun & sky system right now, but since its not ready yet, I have a simple gradient serving as a stand-in right now. I’ll post more about sun & sky when I’m closer to finishing with it… I’m not doing anything fancy like Peter Kutz is doing (his sky renderer blog is definitely worth checking out, by the way), just standard Preetham et. al. style.

Thoughts on Stackless KD-Tree Traversal

Edit: Erwin Coumans in the comments section has pointed me to a GDC 2009 talk by Takahiro Harada proposing something called Tree Traversal using History Flags, which is essentially the same as the idea proposed in this post, with the small exception that Harada's technique uses a bit field to track previously visited nodes on the up traverse. I think that Harada's technique is actually better than the pointer check I wrote about in this post, since keeping a bit field would allow for tracking the previously visited node without having to go back to global memory to do a node check. In other words, the bit field method allows for less thrashing of global memory, which I should think allows for a nice performance edge. So, much as I suspected, the idea in this post is one that folks smarter than me have arrived upon previously, and my knowledge of the literature on this topic is indeed incomplete. Much thanks to Erwin for pointing me to the Harada talk! The original post is preserved below, in case anyone still has an interest in reading it.

Of course, one of the biggest challenges to implementing a CUDA pathtracer is the lack of recursion on pre-Fermi GPUs. Since I intend for Takua-RT to be able to run on any CUDA enabled GPU, I necessarily have to work with the assumption that I won't have recursion support. Getting around this problem in the core pathtracer is not actually a significant issue, as building raytracing systems that operate in an iterative fashion as opposed to in a recursive fashion is a well-covered topic.

Traversing a kd-tree without recursion, however, is a more challenging proposition. So far as I can tell from a very cursory glance at existing literature on the topic, there are presently two major approaches: fully stack-less methods that require some amount of pre-processing of the kd-tree, such as the rope-based method presented in Popov et al. [2007], and methods utilizing a short stack or something similar, such as the method presented in Zhou et al. [2008]. I'm in the process of reading both of these papers more carefully, and will probably explore at least one of these approaches soon. In the meantime, however, I thought it might be a fun exercise to try to come up with some solution of my own, which I'll summarize in this post. I have to admit that I have no idea if this is actually a novel approach, or if it's something that somebody also came up with and rejected a long time ago and I just haven't found yet. My coverage of the literature in this area is highly incomplete, so if you, the reader, are aware of a pre-existing version of this idea, please let me know so that I can attribute it properly!

The basic idea I’m starting with is that when traversing a KD-tree (or any other type of tree, for that matter), at a given node, there’s only a finite number of directions one can go in, and a finite number of previous nodes one could have arrived at the current node from. In other words, one could conceivably define a finite-state machine type system for traversing a KD-tree, given an input ray. I say finite-state machine type, because what I shall define here isn’t actually strictly a FSM, as this method requires storing information about the previous state in addition to the current state. So here we go:

We begin by tracking two pieces of information: what the current node we are at is, and what direction we had to take from the previous node to get to the current node. There are three possible directions we could have come from:

  1. Down from the current node’s parent node
  2. Up from the current node’s left child
  3. Up from the current node’s right child

Similarly, there are only three directions we can possibly travel in from the current node:

  1. Up to the current node’s parent node
  2. Down to the current node’s left child
  3. Down to the current node’s right child

When we travel up from the current node to its parent, we can easily figure out if we are traveling up from the right or the left by looking at whether the current node is the parent node’s left or right child.

Now we need a few rules on which direction to travel in given the direction we came from and some information on where our ray currently is in space:

  1. If we came down from the parent node and if the current node is not a leaf node, intersection test our ray with both children of the current node. If the ray only intersects one of the children, traverse down to that child. If the ray intersects both of the children, traverse down to the left child.
  2. If we came down from the parent node and if the current node is a leaf node, carry out intersection tests between the ray and the contents of the node and store the nearest intersection.
  3. If we came up from the left child, intersection test our ray with the right child of the current node. If we have an intersection, traverse down the right child. If we don’t have an intersection, traverse upwards to the parent.
  4. If we came up from the right child, traverse upwards to the parent.

That’s it. With those four rules, we can traverse an entire KD-Tree in a DFS fashion, while skipping branches that our ray does not intersect for a more efficient traverse, and avoiding any use of recursion or the use of a stack in memory.

There is, of course, the edge case that our ray is coming in to the tree from the “back”, so that the right child of each node is “in front” of the left child instead of “behind”, but we can easily deal with this case by simply testing which side of the KD-tree we’re entering from and swapping left and right in our ruleset accordingly.

I haven’t actually gotten around to implementing this idea yet (as of September 15th, when I started writing this post, although this post may get published much later), so I’m not sure what the performance looks like. There are some inefficiencies in how many nodes our traverse will end up visiting, but on the flipside, we won’t need to keep much of anything in memory except for two pieces of state information and the actual KD-tree itself. On the GPU, I might run into implementation level problems that could impact performance, such as too many branching statements or memory thrashing if the KD-tree is kept in global memory and a gazillion threads try to traverse it at once, so these issues will need to be addressed later.

Again, if you, the reader, know of this idea from a pre-existing place, please let me know! Also, if you see a gaping hole in my logic, please let me know too!

Since this has been a very text heavy post, I’ll close with some pictures of a KD-tree representing the scene from the Takua-RT post. They don’t really have much to do with the traverse method presented in this post, but they are KD-tree related!

Vimeo’s compression really does not like thin narrow lines, so here are some stills:

TAKUA/Avohkii Render

One question I’ve been asking myself ever since my friend Peter Kutz and I wrapped our little GPU Pathtracer experiment is “why am I writing Takua Render as a CPU-only renderer?” One of the biggest lessons learned from the GPU Pathtracer experiment was that GPUs can indeed provide vast quantities of compute suitable for use in pathtracing rendering. After thinking for a bit at the beginning of the summer, I’ve decided that since I’m starting my renderer from scratch and don’t have to worry about the tons of legacy that real-world big boy renderers like RenderMan have to deal with, there is no reason why I shouldn’t architect my renderer to use whatever compute power is available.

With that being said, from this point forward, I will be concurrently developing CPU and GPU based implementations of Takua Render. I call this new overall project TAKUA/Avohkii, mainly because Avohkii is a cool name. Within this project, I will continue developing the C++ based x86 version of Takua, which will retain the name of just Takua, and I will also work on a CUDA based GPU version, called Takua-RT, with full feature parity. I’m also planning on investigating the idea of an ARM port, but that’s an idea for later. I’m going to stick with CUDA for the GPU version now since I know CUDA better than OpenCL and since almost all of the hardware I have access to develop and test on right now is NVIDIA based (the SIG Lab runs on NVIDIA cards…), but that could change down the line. The eventual goal is to have a set of renderers that together cover as many hardware bases as possible, and can all interoperate and intercommunicate for farming purposes.

I’ve already gone ahead and finished the initial work of porting Takua Render to CUDA. One major lesson learned from the GPU Pathtracer experiment was that enormous CUDA kernels tend to run into a number of problems, much like massive monolithic GL shaders do. One problem in particular is that enormous kernels tend to take a long time to run and can result in the GPU driver terminating the kernel, since NVIDIA’s drivers by default assume that device threads taking longer than 2 seconds to run are hanging and cull said threads. In the GPU Pathtracer experiment, we used a giant monolithic kernel for a single ray bounce, which ran into problems as geometry count went up and subsequently intersection testing and therefore kernel execution time also increased. For Takua-RT, I’ve decided to split a single ray bounce into a sequence of micro-kernels that launch in succession. Basically, each operation is now a kernel; each intersection test is a kernel, BRDF evaluation is a kernel, etc. While I suppose I lose a bit of time in having numerous kernel launches, I am getting around the kernel time-out problem.

Another important lesson learned was that culling useless kernel launches is extremely important. I'm checking for empty threads at the end of each ray bounce and culling via stream compaction for now, but this can of course be further extended to the micro-kernels for intersection testing later.
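The compaction itself can be as simple as something like this sketch using Thrust (the Ray struct and its alive flag here are hypothetical; this may or may not be how the final renderer ends up doing it):

#include <thrust/device_ptr.h>
#include <thrust/remove.h>

// Hypothetical ray struct; the only assumption is that each ray carries an
// alive flag that gets cleared when its path terminates.
struct Ray {
    float origin[3];
    float direction[3];
    bool alive;
};

struct RayIsDead {
    __host__ __device__ bool operator()(const Ray& r) const { return !r.alive; }
};

// Compact the device-side ray pool in place after a bounce; returns the new
// active count so the next round of kernels only launches live threads.
int compactRays(Ray* d_rays, int numRays) {
    thrust::device_ptr<Ray> begin(d_rays);
    thrust::device_ptr<Ray> newEnd = thrust::remove_if(begin, begin + numRays, RayIsDead());
    return static_cast<int>(newEnd - begin);
}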

Anyhow, enough text. Takua-RT, even in its super-naive unoptimized CUDA-port state right now, is already so much faster than the CPU version that I can render frames with fairly high convergence in seconds to minutes. That means the renderer is now fast enough for use on rendering animations, such as the one at the top of this post. No post-processing whatsoever was applied, aside from my name watermark in the lower right hand corner. The following images are raw output frames from Takua-RT, this time straight from the renderer, without even watermarking:

Each of these frames represents 5000 iterations of convergence, and took about a minute to render on an NVIDIA GeForce GTX 480. The flickering in the glass ball in the animated version comes from having a low trace depth of 3 bounces, including for glass surfaces.