Porting Takua Renderer to 64-bit ARM, Part 1

For almost its entire existence, my hobby renderer, Takua Renderer, has built and run on Mac, Windows, and Linux on x86-64. I maintain Takua on all three major desktop operating systems because I routinely run and use all three operating systems, and I’ve found that building with different compilers on different platforms is a good way to make sure that I don’t have code that is actually wrong but just happens to work because of the implementation quirks of a particular compiler and / or platform. As of last year, Takua Renderer now also runs on 64-bit ARM, for both Linux and Mac! 64-bit ARM is often called either aarch64 or arm64; these two terms are interchangeable and mean the same thing (aarch64 is the official name for 64-bit ARM and is what Linux tends to use, while arm64 is the name that Apple and Microsoft’s tools tend to use). For the sake of consistency, I’ll use the term arm64.

This post is the first of a two-part writeup of the process I undertook to port Takua Renderer to run on arm64, along with interesting stuff that I learned along the way. In this first part, I’ll write about motivation and the initial port I undertook in the spring to arm64 Linux (specifically Fedora). I’ll also write about how arm64 and x86-64’s memory ordering guarantees differ and what that means for lock-free code, and I’ll also do some deeper dives into topics such as floating point differences between different processors and a case study examining how code compiles to x86-64 versus to arm64. In the second part, I’ll write about porting to arm64-based Apple Silicon Macs and I’ll also write about getting Embree up and running on ARM, creating Universal Binaries, and some other miscellaneous topics.

Motivation

So first, a bit of a preamble: why port to arm64 at all? Today, most, if not all, of the animation/VFX industry renders on x86-64 machines (and the vast majority of those machines are likely running Linux), so pretty much all contemporary production rendering development happens on x86-64. However, this has not always been true! A long long time ago, much of the computer graphics world was based on MIPS hardware running SGI’s IRIX Unix variant; in the early 2000s, as SGI’s custom hardware began to fall behind the performance-per-dollar, performance-per-watt, and even absolute performance that commodity x86-based machines could offer, the graphics world undertook a massive migration to the current x86 world that we live in today. Apple undertook a massive migration from PowerPC to x86 in the mid/late 2000s for similar reasons.

At this point, an ocean of text has been written about why it is that x86 (and by (literal) extension x86-64) became the dominant ISA in desktop computing and in the server space. One common theory that I like is that x86’s dominance was a classic example of disruptive innovation from the low end. A super short summary of disruptive innovation from the low end is that sometimes, a new player enters an existing market with a product that is much less capable but also much cheaper than existing competing products. By being so much cheaper, the new product can generate a new, larger market that existing competing products can’t access due to their higher cost or different set of requirements or whatever. As a result, the new product gets massive investment since the new product is the only thing that can capture this new larger market, and in turn this massive influx of investment allows the new player to iterate faster and rapidly grow its product in capabilities until the new player becomes capable of overtaking the old market as well. This theory maps well to x86; x86-based desktop PCs started off being much cheaper but also much less capable than specialized hardware such as SGI machines, but the investment that poured into the desktop PC space allowed x86 chips to rapidly grow in absolute performance capability until they were able to overtake specialized hardware in basically every comparable metric. At that point, moving to x86 became a no-brainer for many industries, including the computer graphics realm.

I think that ARM is following the same disruptive innovation path that x86 did, only this time the starting “low end” point is smartphones and tablets, which is an even lower starting point than desktop PCs were. More importantly, I think we’re now at a tipping point for ARM. For many years now, ARM chips have offered better performance-per-dollar and performance-per-watt than any x86-64 chip from Intel or AMD, and the point where arm64 chips can overtake x86-64 chips in absolute performance seems plausibly within sight over the next few years. Notably, Amazon’s in-house Graviton2 arm64 CPU and Apple’s M1 arm64-based Apple Silicon chip are both already highly competitive in absolute performance terms with high end consumer x86-64 CPUs, while consuming less power and costing less. Actually, I think that this trend should have been obvious to anyone paying attention to Apple’s A-series chips since the A9 chip was released in 2015.

In cases of disruptive innovation from the low end, the outer edge of the absolute high end is often the last place where the disruption reaches. One of the interesting things about the high-end rendering field is that high-end rendering is one of a relatively small handful of applications that sits at the absolute outer edge of high end compute performance. All of the major animation and VFX studios have render farms (either on-premises or in the cloud) with core counts somewhere in the tens of thousands of cores; these render farms have more similarities with supercomputers than they do with a regular consumer desktop or laptop. I don’t know that anyone has actually tried this, but my guess is that if someone benchmarked any major animation or VFX studio’s render farm using the LINPACK supercomputer benchmark, the score would sit very respectably somewhere in the upper half of the TOP500 supercomputer list. With the above in mind, the fact that the fastest supercomputer in the world is now an arm64-based system should be an interesting indicator of where ARM is now in the process of catching up to x86-64, and of how seriously all of us in high-end computer graphics should be taking the possibility of an ARM-based future.

So all of the above brings me to why I undertook porting Takua to arm64. The reason is because I think we can now plausibly see a potential near future in which the fastest, most efficient, and most cost effective chips in the world are based on arm64 instead of x86-64, and the moment this potential future becomes reality, high-performance software that hasn’t already made the jump will face growing pressure to port to arm64. With Apple’s in-progress shift to arm64-based Apple Silicon Macs, we may already be at this point. I can’t speak for any animation or VFX studio in particular; everything I have written here is purely personal opinion and personal conjecture, but I’d like to be ready in the event that a move to arm64 becomes something we have to face as an industry, and what better way is there to prepare than to try with my own hobby renderer first! Also, for several years now I’ve thought that Apple eventually moving Macs to arm64 was obvious given the progress the A-series Apple chips were making, and since macOS is my primary personal daily use platform, I figured I’d have to port Takua to arm64 eventually anyway.

Porting to arm64 Linux

Figure 1: Takua Renderer running on arm64 Fedora 32, on a Raspberry Pi 4B.

I actually first attempted an ARM port of Takua several years ago, when Fedora 27 became the first version of Fedora to support arm64 single-board computers (SBCs) such as the Raspberry Pi 3B or the Pine A64. I’ve been a big fan of the Raspberry Pi basically since the original first came out, and the thought of porting Takua to run on a Raspberry Pi as an experiment has been with me basically since 2012. However, Takua is written very much with 64-bit in mind, and the first two generations of Raspberry Pis only had 32-bit ARMv7 processors. I actually backed the original Pine A64 on Kickstarter in 2015 precisely because it was one of the very first 64-bit ARMv8 boards on the market, and if I remember correctly, I also ordered the Raspberry Pi 3B the week it was announced in 2016 because it was the first 64-bit ARMv8 Raspberry Pi. However, my Pine A64 and Raspberry Pi 3B mostly just sat around not doing much because I was working on a bunch of other stuff, but that actually wound up working out because by the time I got back around to tinkering with SBCs in late 2017, Fedora 27 had just been released. Thanks to a ton of work from Peter Robinson at Red Hat, Fedora 27 added native arm64 support that basically worked out-of-the-box on both the Raspberry Pi 3B and the Pine A64, which was ideal for me since my Linux distribution of choice for personal hobby projects is Fedora. Since I already had Takua building and running on Fedora on x86-64, being able to use Fedora as the target distribution for arm64 as well meant that I could eliminate different compiler and system library versions as a variable factor; I “just” had to move everything in my Fedora x86-64 build over to Fedora arm64. However, back in 2017, I found that a lot of the foundational libraries that Takua depends on just weren’t quite ready on arm64 yet. The problem usually wasn’t with the actual source code itself, since anything written in pure C++ without any intrinsics or inline assembly should just compile directly on any platform with a supported compiler; instead, the problem was usually just in build scripts not knowing how to handle small differences in where system libraries were located or stuff like that. At the time I was focused on other stuff, so I didn’t try particularly hard to diagnose and work around the problems I ran into; I kind of just shrugged and put it all aside to revisit some other day.

Fast forward to early 2020, when rumors started circulating of a potential macOS transition to 64-bit ARM. As the rumors grew, I figured that this was a good time to return to porting Takua to arm64 Fedora in preparation for if a macOS transition actually happened. I had also recently bought a Raspberry Pi 4B with 4 GB of RAM; the 4 GB of RAM made actually building and running complex code on-device a lot easier than with the Raspberry Pi 3B/3B+’s 1 GB of RAM. By this point, the arm64 build support level for Takua’s dependencies had improved dramatically. I think that as arm64 devices like the iPhone and iPad Pro have gotten more and more powerful processors over the last few years and enabled more and more advanced and complex iOS / iPadOS apps (and similarly with Android devices and Android apps), more and more open source libraries have seen adoption on ARM-based platforms and have seen ARM support improve as a result. Almost everything just built and worked out-of-the-box on arm64, including (to my enormous surprise) Intel’s TBB library! I had assumed that TBB would be x86-64-only since TBB is an Intel project, but it turns out that over the years, the community has contributed support for ARMv7 and arm64 and even PowerPC to TBB. The only library that didn’t work out-of-the-box or with minor changes was Embree, which relies heavily on SSE and AVX intrinsics and has small amounts of inline x86-64 assembly. To get things up and running initially, I just disabled Takua’s Embree-based traversal backend and fell back to my own custom BVH traversal backend. My own custom BVH traversal backend isn’t nearly as fast as Embree and is instead meant to serve as a reference implementation and fallback for when Embree isn’t available, but for the time being, since the goal was just to get Takua working at all, losing performance due to not having Embree was fine. As you can see by the “Traverser: Embree” label in Takua Renderer’s UI in Figure 1, I later got Embree up and running on arm64 using Syoyo Fujita’s embree-aarch64 port, but I’ll write more about that in the next post. To be honest, the biggest challenge with getting everything compiled and running was just the amount of patience that was required. I never seem to be able to get cross-compilation for a different architecture right because I always forget something, so instead of cross-compiling for arm64 from my nice big powerful x86-64 Fedora workstation, I just compiled for arm64 directly on the Raspberry Pi 4B. While the Raspberry Pi 4B is much faster than the Raspberry Pi 3B, it’s still nowhere near as fast as a big fancy dual-Xeon workstation, so some libraries took forever to compile locally (especially Boost, which I wish I didn’t have to have a dependency on, but I have to since OpenVDB depends on Boost). Overall, getting a working build of Takua up and running on arm64 was very fast; from deciding to undertake the port to getting a first image back took only about a day’s worth of work, and most of that time was just waiting for stuff to compile.

However, getting code to build is a completely different question from getting code to run correctly (unless you’re using one of those fancy proof-solver languages I guess). The first test renders I did with Takua on arm64 Fedora looked fine to my eye, but when I diff’d them against reference images rendered on x86-64, I found some subtle differences; the source of these differences took me a good amount of digging to understand! Chasing this problem down led down some interesting rabbit holes exploring important differences between x86-64 and arm64 that need to be considered when porting code between the two platforms; just because code is written in portable C++ does not necessarily mean that it is always actually as portable as one might think!

Floating Point Consistency (or lack thereof) on Different Systems

Takua has two different types of image comparison based regression tests: the first type of test renders out to high samples-per-pixel numbers and does comparisons with near-converged images, while the second type of test renders out and does comparisons using a single sample-per-pixel. The reason for these two different types of tests is how difficult it is to get floating point calculations to match across different compilers / platforms / processors. Takua’s single-sample-per-pixel tests are only meant to catch regressions on the same compiler / platform / processor, while Takua’s longer tests are meant to test overall correctness of converged renders. Because of differences in how floating point operations come out on different compilers / platforms / processors, Takua’s convergence tests don’t require an exact match; instead, the tests use small, predefined difference thresholds that comparisons must stay within to pass. The difference thresholds are basically completely ad-hoc; I picked them to be at a level where I can’t perceive any difference when flipping between the images, since I put together my testing system before image differencing systems that formally factor in perception [Andersson et al. 2020] were published. A large part of the differences between Takua’s test results on x86-64 versus arm64 comes from these problems with floating point reproducibility across different systems. Because of how commonplace this issue is and how often this issue is misunderstood by programmers who haven’t had to deal with it, I want to spend a few paragraphs talking about floating point numbers.

A lot of programmers that don’t have to routinely deal with floating point calculations might not realize that even though floating point numbers are standardized through the IEEE754 standard, in practice reproducibility is not at all guaranteed when carrying out the same set of floating point calculations using different compilers / platforms / processors! In fact, starting with the same C++ floating point code, determinism is only really guaranteed for successive runs using binaries generated using the same compiler, with the same optimizations enabled, on the same processor family; sometimes running on the same operating system is also a requirement for guaranteed determinism. There are three main reasons [Kreinin 2008] why reproducing exactly the same results from the same set of floating point calculations across different systems is so inconsistent: compiler optimizations, processor implementation details, and different implementations of built-in “complex” functions like sine and cosine.

The first reason above is pretty easy to understand: mathematically, operations like addition and multiplication are associative and commutative, meaning they can be regrouped and done in any order, and often a compiler in an optimization pass may choose to reorder or regroup math operations accordingly. However, as anyone who has dealt extensively with floating point numbers knows, due to how floating point numbers are represented [Goldberg 1991], the associative property of addition and multiplication does not actually hold true for floating point numbers; not even for IEEE754 floating point numbers! Sometimes reordering floating point math is expressly permitted by the language, and sometimes doing this is not actually allowed by the language but happens anyway in the compiler because the user has specified flags like -ffast-math, which tells the compiler that it is allowed to sacrifice strict IEEE754 and language math requirements in exchange for additional optimization opportunities. Sometimes the compiler can just have implementation bugs too; here is an example that I found on the llvm-dev mailing lists describing a bug with loop vectorization that impacts floating point consistency! The end result of all of the above is that the same floating point source code can produce subtly different results depending on which compiler is used and which compiler optimizations are enabled within that compiler. Also, while some compiler optimization passes operate purely on the AST built from the parser or operate purely on the compiler’s intermediate representation, there can also be optimization passes that take into account the underlying target instruction set and choose to carry out different optimizations depending on what’s available in the target processor architecture. These architecture-specific optimizations mean that even the same floating point source code compiled using the same compiler can still produce different results on different processor architectures! Architecture-specific optimizations are one reason why floating point results on x86-64 versus arm64 can be subtly different. Also, another fun fact: the C++ specification doesn’t actually specify a binary representation for floating point numbers, so in principle a C++ compiler could outright ignore IEEE754 and use something else entirely, although in practice this is basically never the case since all modern compilers like GCC, Clang, and MSVC use IEEE754 floats.
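
To make the associativity point concrete, here's a tiny standalone example (not code from Takua) that you can compile and run on any platform. Without something like -ffast-math, IEEE754 requires both expressions below to evaluate exactly to the values shown in the comments; a compiler that is allowed to regroup floating point additions could legally turn one expression into the other, silently changing the answer:

#include <cstdio>

int main() {
    // Mathematically these three additions should give the same result no
    // matter how they are grouped, but in 32-bit floats they do not, because
    // each intermediate sum is rounded to the nearest representable value.
    float a = 1e20f;
    float b = -1e20f;
    float c = 1.0f;
    float leftToRight = (a + b) + c; // (1e20 + -1e20) + 1 == 0 + 1 == 1
    float regrouped = a + (b + c);   // 1e20 + (-1e20 + 1) == 1e20 + -1e20 == 0
    std::printf("(a + b) + c = %f\n", leftToRight); // prints 1.000000
    std::printf("a + (b + c) = %f\n", regrouped);   // prints 0.000000
    return 0;
}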

The second reason floating point math is so hard to reproduce exactly across different systems is in how floating point math is implemented in the processor itself. Differences at this level are a huge source of floating point differences between x86-64 and arm64. In both x86-64 and arm64, at the assembly level individual arithmetic instructions such as add, subtract, multiply, divide, etc. all adhere strictly to the IEEE754 standard. However, the IEEE754 standard is itself… surprisingly loosely specified in some areas! For example, the IEEE754 standard specifies that intermediate results should be as precise as possible, but this means that two different implementations of a floating point addition instruction both adhering to IEEE754 can actually produce different results for the same input if they use different levels of increased precision internally. Here’s a bit of a deprecated example that is still useful to know for historical reasons: everyone knows that a single-precision IEEE754 floating point number is 32 bits, but older 32-bit x86 specifies that internal calculations be done using 80-bit precision, which is a holdover from the Intel 8087 math coprocessor. Every x86 (and by extension x86-64) processor when using x87 FPU instructions actually does floating point math using 80 bit internal precision and then rounds back down to 32 bit floats in hardware; the 80 bit internal representation is known as the x86 extended precision format. But even within the same x86 processor, we can still get different floating point results depending on if the compiler has output x87 FPU instructions or SSE instructions; SSE stays within 32 bits at all times, which means SSE and x87 on the same processor doing the same floating point math isn’t guaranteed to produce the exact same answer. Of course, modern x86-64 generally uses SSE for floating point math instead of x87, but different amounts of precision truncation can still happen depending on what order values are loaded into SSE registers and back into other non-SSE registers. Furthermore, parts of SSE are sufficiently under-specified that the actual implementation details can differ, which is why some SSE floating point instructions (most notably the approximate reciprocal and reciprocal square root instructions) can produce different results on Intel versus AMD processors. Similarly, the ARM architecture doesn’t actually specify a particular FPU implementation at all; the internals of the FPU are left up to each processor designer; for example, the VFP/NEON floating point units that ship on the Raspberry Pi 4B’s Cortex-A72-based CPU use up to 64 bits of internal precision [Johnston 2020]. So, while the x87, SSE on Intel, SSE on AMD, and VFP/NEON FPU implementations are IEEE754-compliant, because of their internal maximum precision differences they can still all produce different results from each other. There are many more examples of areas where IEEE754 leaves in wiggle room for different implementations to do different things [Obiltschnig 2006], and in practice different CPUs do use this wiggle room to do things differently from each other. For example, this wiggle room is why for floating point operations at the extreme ends of the IEEE754 float range, Intel’s x86-64 versus AMD’s x86-64 versus arm64 can produce results with minor differences from each other in the end of the mantissa.

Finally, the third reason floating point math can vary across different systems is because of transcendental functions such as sine and cosine. Transcendental functions like sine and cosine have exact, precise mathematical definitions, but unfortunately these precise mathematical definitions can’t be implemented exactly in hardware. Think back to high school trigonometry; the exact answer for a given input to functions like sine and cosine has to be determined using a Taylor series, but actually implementing a Taylor series in hardware is neither practical nor performant. Instead, modern processors typically use some form of a CORDIC algorithm to approximate functions like sine and cosine, often to reasonably high levels of accuracy. However, the level of precision to which any given processor approximates sine and cosine is completely unspecified by either IEEE754 or any language standard; as a result, these approximations can and do vary widely between different hardware implementations on different processors! However, how much this reason actually matters in practice is complicated and compiler/language dependent. As an example using cosine, the standard library could choose to implement cosine in software using a variety of different methods, or the standard library could choose to just pass through to the hardware cosine implementation. To illustrate how much the actual execution path depends on the compiler: I originally wanted to include a simple small example using cosine that you, the reader, could go and compile and run yourself on an x86-64 machine and then on an arm64 machine to see the difference, but I wound up having so much difficulty convincing different compilers on different platforms to actually compile the cosine function (even using intrinsics like __builtin_cos!) down to a hardware instruction reliably that I wound up having to abandon the idea.
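
Short of forcing a hardware instruction, one thing that is still easy to do is compare the exact bits that std::cos() produces on two different machines. The little standalone program below (not code from Takua) prints the bit pattern of each result; any differences you see when comparing its output across machines come from the standard math library implementation (and whatever it does internally), not necessarily from a hardware cosine instruction:

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Prints the exact bit patterns that std::cos() produces for a few inputs.
// Comparing this program's output across different machines and standard
// libraries is an easy way to see whether the cosine implementations agree
// bit-for-bit. The inputs are marked volatile to keep the compiler from
// constant-folding the calls at compile time.
int main() {
    volatile double inputs[] = { 0.1, 1.0, 4.7, 100000.5, 1e9 };
    for (std::size_t i = 0; i < sizeof(inputs) / sizeof(inputs[0]); ++i) {
        double x = inputs[i];
        double c = std::cos(x);
        std::uint64_t bits;
        std::memcpy(&bits, &c, sizeof(bits));
        std::printf("cos(%-9g) = %.17g (bits: 0x%016llx)\n", x, c,
                    static_cast<unsigned long long>(bits));
    }
    return 0;
}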

One of the things that makes all of the above even more difficult to reason about is that which specific factors are applicable at any given moment depends heavily on what the compiler is doing, what compiler flags are in use, and what the compiler’s defaults are. Actually getting floating point determinism across different systems is a notoriously difficult problem [Fiedler 2010] that volumes of material have been written about! On top of that, while in principle getting floating point code to produce consistent results across many different systems is possible (hard, but possible) by disabling compiler optimizations and by relying entirely on software implementations of floating point operations to ensure strict, identical IEEE754 compliance on all systems, actually doing all of the above comes with major trade-offs. The biggest trade-off is simply performance: all of the changes necessary to make floating point code consistent across different systems (and especially across different processor architectures like x86-64 versus arm64) will also likely make the floating point code considerably slower.

All of the above reasons mean that modern usage of floating point code basically falls into three categories. The first category is: just don’t use floating point code at all. Included in this first category are applications that require absolute precision and absolute consistency and determinism across all implementations; examples are banking and financial industry code, which tend to store monetary values entirely using only integers. The second category is applications that absolutely must use floats but also must ensure absolute consistency; good examples of applications in this category are high-end scientific simulations that run on supercomputers. For applications in this second category, the difficult work and the performance sacrifices that have to be made in favor of consistency are absolutely worthwhile. Also, tools do exist that can help with ensuring floating point consistency; for example, Herbie is a tool that can detect potentially inaccurate floating point expressions and suggest more accurate replacements. The last category is applications where the requirement for consistency is not necessarily absolute, and the requirement for performance may weigh heavier. This is the space that things like game engines and renderers and stuff live in, and here the trade-offs become more nuanced and situation-dependent. A single-player game may choose absolute performance over any kind of cross-platform guaranteed floating point consistency, whereas a multi-player multi-platform game may choose to sacrifice some performance in order to guarantee that physics and gameplay calculations produce the same result for all players regardless of platform.

Takua Renderer lives squarely in the third category, and historically the point in the trade-off space that I’ve chosen for Takua Renderer is to favor performance over cross-platform floating point consistency. I have a couple of reasons for choosing this trade-off, some of which are good and some of which are… just laziness, I guess! As a hobby renderer, I’ve never had shipping Takua as a public release in any form in mind, and so consistency across many platforms has never really mattered to me. I know exactly which systems Takua will be run on, because I’m the only one running Takua on anything, and to me having Takua run slightly faster at the cost of minor noise differences on different platforms seems worthwhile. As long as Takua is converging to the correct image, I’m happy, and for my purposes, I consider converged images that are perceptually indistinguishable when compared with a known correct reference to also be correct. I do keep determinism within the same platform as a major priority though, since determinism within each platform is important for being able to reliably reproduce bugs and is important for being able to reason about what’s going on in the renderer.

Here is a concrete example of the noise differences I get on x86-64 versus on arm64. This scene is the iced tea scene I originally created for my Nested Dielectrics post; I picked this scene for this comparison purely because it has a small memory footprint and therefore fits within the relatively constrained 4 GB of memory on my Raspberry Pi 4B, while also being slightly more interesting than a Cornell Box. Here is a comparison of a single sample-per-pixel render using bidirectional path tracing on a dual-socket Xeon E5-2680 x86-64 system versus on a Raspberry Pi 4B with a Cortex-A72 based arm64 processor. The scene actually appears somewhat noisier than it normally would be coming out of Takua Renderer because for this demonstration, I disabled low-discrepancy sampling and had the renderer fall back to purely random PCG-based sample sequences, with the goal of trying to produce more noticeable noise differences:

Figure 2: A single-spp render demonstrating noise pattern differences between x86-64 (left) versus arm64 (right). Differences are most noticeable on the rim of the cup, especially on the left near the handle. For a full screen comparison, click here.

The noise differences are actually relatively minimal! The most noticeable noise differences are on the rim of the cup; note the left of the rim near the handle. Since the noise differences can be fairly difficult to see in the full render on a small screen, here is a 2x zoomed-in crop:

Figure 3: A zoomed-in crop of Figure 2 showing noise pattern differences between x86-64 (left) versus arm64 (right). For a full screen comparison, click here.

The differences are still kind of hard to see even in the zoomed-in crop! So, here’s the absolute difference between the x86-64 and arm64 renders, created by just subtracting the images from each other and taking the absolute value of the difference at each pixel. Black pixels indicate pixels where the absolute difference is zero (or at least, so close to zero so as to be completely imperceptible). Brighter pixels indicate greater differences between the x86-64 and arm64 renders; from where the bright pixels are, we can see that most of the differences occur on the rim of the cup, on ice cubes in the cup, and in random places mostly in the caustics cast by the cup. There’s also a faint horizontal line of small differences across the background; that area lines up with where the seamless white cyclorama backdrop starts to curve upwards:

Figure 4: Absolute difference between the x86-64 and arm64 renders from Figure 2. Black indicates identical pixels, while brighter values indicate greater differences in pixel values between x86-64 and arm64.
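
For reference, producing these difference visualizations doesn't require anything fancy. Here's a minimal sketch (not Takua's actual code; the function name and flat-array image representation are just for illustration) of how images like Figures 4 and 6 can be generated: a per-pixel, per-channel absolute difference, with an optional exposure adjustment of 2^stops to make tiny differences visible (the exposed-up version of Figure 6 further below corresponds to exposureStops = 10, which is a 1024x multiplier):

#include <cmath>
#include <cstddef>
#include <vector>

// Per-pixel, per-channel absolute difference between two floating point
// images stored as flat arrays of equal size, optionally scaled by 2^stops.
std::vector<float> absoluteDifference(const std::vector<float>& imageA,
                                      const std::vector<float>& imageB,
                                      int exposureStops = 0) {
    std::vector<float> difference(imageA.size(), 0.0f);
    const float scale = std::exp2(static_cast<float>(exposureStops));
    for (std::size_t i = 0; i < imageA.size(); ++i) {
        difference[i] = std::fabs(imageA[i] - imageB[i]) * scale;
    }
    return difference;
}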

Understanding why the areas with the highest differences are where they are requires thinking about how light transport is functioning in this specific scene and how differences in floating point calculations impact that light transport. This scene is lit fairly simply; the only light sources are two rect lights and a skydome. Basically everything is illuminated through direct lighting, meaning that for most areas of the scene, a ray starting from the camera is directly hitting the diffuse background cyclorama and then sampling a light source, and a ray starting from the light is directly hitting the diffuse background cyclorama and then immediately sampling the camera lens. So, even with bidirectional path tracing, the total path length for a lot of the scene is just two path segments, or one bounce. That’s not a whole lot of path for differences in floating point calculations to accumulate during. On the flip side, most of the areas with the greatest differences are areas where a lot of paths pass through the glass tea cup. For paths that go through the glass tea cup, the path lengths can be very long, especially if a path gets caught in total internal reflection within the glass walls of the cup. As the path lengths get longer, the floating point calculation differences at each bounce accumulate until the entire path begins to diverge significantly between the x86-64 and arm64 versions of the render. Fortunately, these differences basically eventually “integrate out” thanks to the magic of Monte Carlo integration; by the time the renders are near converged, the x86-64 and arm64 results are basically perceptually indistinguishable from each other:

Figure 5: The same cup scene from Figure 1, but now much closer to convergence (2048 spp), rendered using x86-64 (left) and arm64 (right). Note how differences between the x86-64 and arm64 renders are now basically imperceptible to the eye; these are in fact two different images! For a full screen comparison, click here.

Below is the absolute difference between the two images above. To the naked eye the absolute difference image looks completely black, because the differences between the two images are so small that they’re basically below the threshold of normal perception. So, to confirm that there are in fact differences, I’ve also included below a version of the absolute difference exposed up 10 stops, or made 1024 times brighter. Much like in the single spp renders in Figure 2, the areas of greatest difference are in the areas where the path lengths are the longest, which in this scene are areas where paths refract through the glass cup, the tea, and the ice cubes. The differences between individual paths for the same sample across x86-64 and arm64 simply become tiny to the point of insignificance once averaged across 2048 samples-per-pixel:

Figure 6: Left: Absolute difference between the x86-64 and arm64 renders from Figure 5. Right: Since the absolute difference image basically looks completely black to the eye, I've also included a version of the absolute difference exposed up 10 stops (made 1024 times brighter) to make the differences more visible. For a full screen comparison, click here.

For many extremely precise scientific applications, the level of differences above would still likely be unacceptable, but for our purposes in just making pretty pictures, I’ll call this good enough! In fact, many rendering teams only target perceptual indistinguishability when deciding whether renders count as deterministic enough, as opposed to aiming for absolute binary-level determinism; great examples include Pixar’s RenderMan XPU, Disney Animation’s Hyperion, and DreamWorks Animation’s MoonRay.

Eventually maybe I’ll get around to putting more work into trying to get Takua Renderer’s per-path results to be completely consistent even across different systems and processor architectures and compilers, but for the time being I’m fine with keeping that goal as a fairly low priority relative to everything else I want to work on, because as you can see, once the renders are converged, the difference doesn’t really matter! Floating point calculations accounted for most of the differences I was finding when comparing renders on x86-64 versus renders on arm64, but only most. The remaining source of differences turned out… to be an actual bug!

Weak Memory Ordering in arm64 and Atomic Bugs in Takua

Multithreaded programming with atomics and locks has a reputation for being one of the relatively more challenging skills for programmers to master, and for good reason. Since different processor architectures often have different semantics and guarantees and rules around multithreading-related things like memory reordering, porting between different architectures is often a great way to expose subtle multithreading bugs. The remaining source of major differences between the x86-64 and arm64 renders I was getting turned out to be caused by a memory reordering-related bug in some old multithreading code that I wrote a long time ago and forgot about.

In addition to outputting the main render, Takua Renderer is also able to generate some additional render outputs, including some useful diagnostic images. One of the diagnostic render outputs is a sample heatmap, which shows how many pixel samples were used for each pixel in the image. I originally added the sample heatmap render output to Takua when I was implementing adaptive sampling, and since then the sample heatmap render output has been a useful tool for understanding how much time Takua is spending on different parts of the image. The sample heatmap render output has also served as a simple sanity check that Takua’s multithreaded work dispatching system is functioning correctly. For a render where the adaptive sampler is disabled, the sample heatmap should contain exactly the same value for every single pixel in the entire image, since without adaptive sampling, every pixel is just being rendered to the target samples-per-pixel of the entire render. So, in some of my tests, I have the renderer scripted to always output the sample heatmap, and the test system checks that the sample heatmap is completely uniform after the render as a sanity check to make sure that the renderer has rendered everything that it was supposed to. To my surprise, sometimes on arm64, a test would fail because the sample heatmap for a render without adaptive sampling would come back as nonuniform! Specifically, the sample heatmap would come back indicating that some pixels had received one fewer sample than the total target samples-per-pixel count across the whole render. These pixels were always in square blocks corresponding to a specific tile, or multithreaded work dispatch unit. The specific bug was in how Takua Renderer dispatches rendering work to each thread; to provide the relevant context and explain what I mean by a “tile”, I’ll first have to quickly describe how Takua Renderer is multithreaded.
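
As a quick aside before getting into the multithreading details: the uniformity check itself is trivial. Here's a minimal sketch of what it amounts to (the names here are hypothetical and not Takua's actual API):

#include <vector>

// With adaptive sampling disabled, every entry in the sample heatmap should
// exactly equal the target samples-per-pixel count; anything else means a
// tile got skipped or double-counted somewhere in the work dispatcher.
bool sampleHeatmapIsUniform(const std::vector<int>& sampleHeatmap,
                            int targetSamplesPerPixel) {
    for (int samples : sampleHeatmap) {
        if (samples != targetSamplesPerPixel) {
            return false;
        }
    }
    return true;
}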

In university computer graphics courses, path tracing is often taught as being trivially simple to parallelize: since a path tracer traces individual paths in a depth-first fashion, individual paths don’t have dependencies on other paths, so just assign each path that has to be traced to a separate thread. The easiest way to implement this simple parallelization scheme is to just run a parallel_for loop over all of the paths that need to be traced for a given set of samples, and to just repeat this for each set of samples until the render is complete. However, in reality, parallelizing a modern production-grade path tracing renderer is often not as simple as the classic “embarrassingly parallel” approach. Modern advanced path tracers often are written to take into account factors such as cache coherency, memory access patterns and memory locality, NUMA awareness, optimal SIMD utilization, and more. Also, advanced path tracers often make use of various complex data structures such as out-of-core texture caches, photon maps, path guiding trees, and more. Making sure that these data structures can be built, updated, and accessed on-the-fly by multiple threads simultaneously and efficiently often introduces complex lock-free data structure design problems. On top of that, path tracers that use a wavefront or breadth-first architecture instead of a depth-first approach are far from trivial to parallelize, since various sorting and batching operations and synchronization points need to be accounted for.
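
To make the baseline concrete, the classic "embarrassingly parallel" formulation described at the start of this section looks roughly like the following sketch. Other than tbb::parallel_for itself, everything here is hypothetical: tracePath() and accumulateSample() are stand-ins for a renderer's actual integrator and framebuffer accumulation, declared but not defined since this is only a sketch:

#include <tbb/parallel_for.h>

// Hypothetical stand-ins for the renderer's actual integrator and framebuffer
// accumulation functions.
struct RadianceSample { float r, g, b; };
RadianceSample tracePath(int x, int y, int sampleIndex);
void accumulateSample(int x, int y, const RadianceSample& radiance);

// The textbook "embarrassingly parallel" formulation: one parallel_for over
// every pixel per sample pass, with no attempt at cache or memory locality.
void renderNaive(int width, int height, int totalSamplesPerPixel) {
    for (int sample = 0; sample < totalSamplesPerPixel; ++sample) {
        tbb::parallel_for(0, width * height, [&](int pixelIndex) {
            int x = pixelIndex % width;
            int y = pixelIndex / width;
            accumulateSample(x, y, tracePath(x, y, sample));
        });
    }
}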

Even for relatively straightforward depth-first architectures like the one Takua has used for the past six years, the direct parallel_for approach can be improved upon in some simple ways. Before progressive rendering became the standard modern approach, many renderers used an approach called “bucket” rendering [Geupel 2018], where the image plane was divided up into a bunch of small tiles, or buckets. Each thread would be assigned a single bucket, and each thread would be responsible for rendering that bucket to completion before being assigned another bucket. For offline, non-interactive rendering, bucket rendering often ends up being faster than just a simple parallel_for because it allows for a higher degree of memory access coherency and cache coherency within each thread, since each thread is always working in roughly the same area of space (at least for the first few bounces). Even with progressive rendering as the standard approach for renderers running in an interactive mode today, many (if not most) renderers still use a bucketed approach when dispatched to a renderfarm. For CPU path tracers today, the number of pixels that need to be rendered for a typical image is much, much larger than the number of hardware threads available on the CPU. As a result, the basic locality idea that bucket rendering utilizes also ends up being applicable to progressive, interactive rendering in CPU path tracers (for GPU path tracing though, the GPU’s completely different, wavefront-based SIMT threading model means a bit of a different approach is necessary). RenderMan, Arnold, and Vray in interactive progressive mode all still render pixels in a bucket-like order, although instead of having each thread render all samples-per-pixel to completion in each bucket all at once, each thread just renders a single sample-per-pixel for each bucket and then the renderer loops over the entire image plane for each sample-per-pixel number. To differentiate using buckets in a progressive mode from using buckets in a batch mode, I will refer to buckets in progressive mode as “tiles” for the rest of this post.

Takua Renderer also supports using a tiled approach for assigning work to individual threads. At renderer startup, Takua precalculates a work assignment order, which can be in a tiled fashion, or can use a more naive parallel_for approach; the tiled mode is the default. When using a tiled work assignment order, the specific order of tiles supports several different options; the default is a spiral starting from the center of the image. Here’s a short screen recording demonstrating what this tiling work assignment looks like:

Figure 7: A short video showing Takua Renderer's tile assignment system running in spiral mode; each red outlined square represents a single tile. This video was captured on an arm64 M1 Mac Mini running macOS Big Sur instead of on a Raspberry Pi 4B because trying to screen record on a Raspberry Pi 4B while also running the renderer was not a good time. To see this video in a full window, click here.
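
For reference, precalculating a tiled work order at startup doesn't require much code. Below is a hypothetical sketch (not Takua's actual implementation): it chops the image plane into fixed-size tiles and then orders them starting from the image center. Takua's default mode walks the tiles in a true spiral like the one shown in Figure 7; sorting by distance from the center, as done here, is a simpler stand-in that produces a similar center-outward dispatch order:

#include <algorithm>
#include <vector>

struct Tile {
    int x, y;          // upper-left pixel coordinate of the tile
    int width, height; // tile dimensions (edge tiles may be smaller)
};

// Chop the image plane into tiles and order them from the image center
// outwards; the resulting list is what the work dispatcher hands out from.
std::vector<Tile> buildTileOrder(int imageWidth, int imageHeight, int tileSize) {
    std::vector<Tile> tiles;
    for (int y = 0; y < imageHeight; y += tileSize) {
        for (int x = 0; x < imageWidth; x += tileSize) {
            tiles.push_back({ x, y, std::min(tileSize, imageWidth - x),
                              std::min(tileSize, imageHeight - y) });
        }
    }
    const float centerX = 0.5f * imageWidth;
    const float centerY = 0.5f * imageHeight;
    auto distanceToCenter = [&](const Tile& t) {
        float dx = (t.x + 0.5f * t.width) - centerX;
        float dy = (t.y + 0.5f * t.height) - centerY;
        return dx * dx + dy * dy;
    };
    std::sort(tiles.begin(), tiles.end(), [&](const Tile& a, const Tile& b) {
        return distanceToCenter(a) < distanceToCenter(b);
    });
    return tiles;
}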

As threads free up, the work assignment system hands each free thread a tile to render; each thread then renders a single sample-per-pixel for every pixel in its assigned tile and then goes back to the work assignment system to request more work. Once the number of remaining tiles for the current samples-per-pixel number drops below the number of available threads, the work assignment system starts allowing multiple threads to team up on a single tile. In general, the additional cache coherency and more localized memory access patterns from using a tiled approach give Takua Renderer a minimum 3% speed improvement compared to using a naive parallel_for to assign work to each thread; sometimes the speed improvement can be even higher if the scene is heavily dependent on things like texture cache access or reading from a photon map.

The reason the work assignment system actually hands out tiles one by one upon request instead of just running a parallel_for loop over all of the tiles is because using something like tbb::parallel_for means that the tiles won’t actually be rendered in the correct specified order. Actually, Takua does have an “I don’t care what order the tiles are in” mode, which does in fact just run a tbb::parallel_for over all of the tiles and lets tbb’s underlying scheduler decide what order the tiles are dispatched in; rendering tiles in a specific order doesn’t actually matter for correctness. However, maintaining a specific tile ordering does make user feedback a bit nicer.

Implementing a work dispatcher that can still maintain a specific tile ordering requires some mechanism internally to track what the next tile that should be dispatched is; Takua does so using an atomic integer inside of the work dispatcher. This atomic is where the memory-reordering bug comes in that led to Takua occasionally dropping a single spp for a single tile on arm64. Here’s some pseudocode for how threads are launched and how they ask the work dispatcher for tiles to render; this is highly simplified and condensed from how the actual code in Takua is written (specifically, I’ve inlined together code from both individual threads and from the work dispatcher and removed a bunch of other unrelated stuff), but preserves all of the important details necessary to illustrate the bug:

int nextTileIndex = 0;
std::atomic<bool> nextTileSoftLock(false);
tbb::parallel_for(int(0), numberOfTilesToRender, [&](int /*i*/) {
    bool gotNewTile = false;
    int tile = -1;
    while (!gotNewTile) {
        bool expected = false;
        if (nextTileSoftLock.compare_exchange_strong(expected, true, std::memory_order_relaxed)) {
            tile = nextTileIndex++;
            nextTileSoftLock.store(false, std::memory_order_relaxed);
            gotNewTile = true;
        }
    }
    if (tileIsInRange(tile)) {
        renderTile(tile);
    }
});
Listing 1: Simplified pseudocode for the not-very-good work scheduling mechanism Takua used to assign tiles to threads. This version of the scheduler resulted in tiles occasionally being missed on arm64, but not on x86-64.

If you remember your memory ordering rules, you already know what’s wrong with the code above; this code is really really bad! In my defense, this code is an ancient part of Takua’s codebase; I wrote it back in college and haven’t really revisited it since, and back when I wrote it, I didn’t have the strongest grasp of memory ordering rules and how they apply to concurrent programming yet. First off, why does this code use an atomic bool as a makeshift mutex so that multiple threads can increment a non-atomic integer, as opposed to just using an atomic integer? Looking through the commit history, the earliest version of this code that I first prototyped (some eight years ago!) actually relied on a full-blown std::mutex to protect from race conditions around incrementing nextTileIndex; I must have prototyped this code completely single-threaded originally and then done a quick-and-dirty multithreading adaptation by just wrapping a mutex around everything, and then replaced the mutex with a cheaper atomic bool as an incredibly lazy port to a lock-free implementation instead of properly rewriting things. I haven’t had to modify it since then because it worked well enough, so over time I must have just completely forgotten about how awful this code is.

Anyhow, the fix for the code above is simple enough: just replace the first std::memory_order_relaxed in line 8 with std::memory_order_acquire and replace the second std::memory_order_relaxed in line 10 with std::memory_order_release. An even better fix, though, is to just outright replace the combination of an atomic bool and a non-atomic integer with a single atomic integer counter, which is what I actually did. But, going back to the original code, why exactly does using std::memory_order_relaxed produce correctly functioning code on x86-64, but produces code that occasionally drops tiles on arm64? Well, first, why did I use std::memory_order_relaxed in the first place? My commit comments from eight years ago indicate that I chose std::memory_order_relaxed because I thought it would compile down to something cheaper than if I had chosen some other memory ordering flag; I really didn’t understand this stuff back then! I wasn’t entirely wrong, although not for the reasons that I thought at the time. On x86-64, the relaxed and acquire/release memory order flags compile down to the same plain load and store instructions, since x86-64’s strong memory model already provides acquire and release semantics in hardware. On arm64, using std::memory_order_relaxed instead of std::memory_order_acquire/std::memory_order_release does indeed produce simpler and faster arm64 assembly, but the simpler and faster arm64 assembly is also wrong for what the code is supposed to do. Understanding why the above happens on arm64 but not on x86-64 requires understanding what a weakly ordered CPU is versus what a strongly ordered CPU is; arm64 is a weakly ordered architecture, whereas x86-64 is a strongly ordered architecture.
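
For completeness, here's what the simpler fix looks like, written in the same simplified pseudocode style as Listing 1 (so, a sketch rather than Takua's literal code):

std::atomic<int> nextTileIndex(0);
tbb::parallel_for(int(0), numberOfTilesToRender, [&](int /*i*/) {
    // fetch_add() hands each thread a unique, monotonically increasing tile
    // index; the default std::memory_order_seq_cst ordering on the atomic is
    // fine here, since there is no other shared state that needs to be
    // synchronized against the counter.
    int tile = nextTileIndex.fetch_add(1);
    if (tileIsInRange(tile)) {
        renderTile(tile);
    }
});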

One of the best resources on diving deep into weak versus strong memory orderings is the well-known series of articles by Jeff Preshing on the topic (parts 1, 2, 3, 4, 5, 6, and 7). Actually, while I was going back through the Preshing on Programming series in preparation to write this post, I noticed that by hilarious coincidence the older code in Takua represented by Listing 1, once boiled down to what it is fundamentally doing, is extremely similar to the canonical example used in Preshing on Programming’s “This Is Why They Call It a Weakly-Ordered CPU” article. If only I had read the Preshing on Programming series a year before implementing Takua’s work assignment system instead of a few years after! I’ll do my best to quickly recap what the Preshing on Programming series covers about weak versus strong memory orderings here, but if you have not read Jeff Preshing’s articles before, I’d recommend taking some time later to do so.

One of the single most important things that lock-free multithreaded code needs to take into account is the potential for memory reordering. Memory reordering is when the compiler and/or the processor decides to optimize code by changing the ordering of instructions that access and modify memory. Memory reordering is always carried out in such a way that the behavior of a single-threaded program never changes, and multithreaded code using locks such as mutexes forces the compiler and processor to not reorder instructions across the boundaries defined by locks. However, lock-free multithreaded code is basically fair game for the compiler and processor to reorder however they want; even though memory reordering is carried out for each individual thread in such a way that keeps the apparent behavior of that specific thread the same as before, this rule does not take into account the interactions between threads, so different reorderings in different threads that each keep single-threaded behavior the same can still produce very different overall multithreaded behavior.

The easiest way to disable any kind of memory reordering at compile time is to just… disable all compiler optimizations. However, in practice we never actually want to do this, because disabling compiler optimizations means all of our code will run slower (sometimes a lot slower). Also, even disabling all compiler optimizations may not be enough to ensure no memory reordering, both because instruction selection when lowering from IR to assembly can still shuffle memory accesses around, and because we still need to contend with potential memory reordering at runtime from the CPU.

Memory reordering in multithreaded code happens on the CPU because of how CPUs access memory: modern processors have a series of caches (L1, L2, sometimes L3, etc) sitting between the actual registers in each CPU core and main memory. Some of these cache levels (usually L1) are per-CPU-core, and some of these cache levels (usually L2 and higher) are shared across some or all cores. The lower the cache level number, the faster and also smaller that cache level typically is, and the higher the cache level number, the slower and larger that cache level is. When a CPU wants to read a particular piece of data, it will check for it in cache first, and if the value is not in cache, then the CPU must make a fetch request to main memory for the value; fetching from main memory is obviously much slower than fetching from cache. Where these caches get tricky is how data is propagated from a given CPU core’s registers and caches back to main memory and then eventually up again into the L1 caches for other CPU cores. This propagation can happen… whenever! A variety of different possible implementation strategies exist for when caches update from and write back to main memory, with the end result being that by default we as programmers have no reliable way of guessing when data transfers between cache and main memory will happen.

Imagine that we have some multithreaded code written such that one thread writes, or stores, to a value, and then a little while later, another thread reads, or loads, that same value. We would expect the store on the first thread to always precede the load on the second thread, so the second thread should always pick up whatever value the first thread wrote. However, if we implement this code just using a normal int or float or bool or whatever, what can actually happen at runtime is our first thread writes the value to L1 cache, and then eventually the value in L1 cache gets written back to main memory. However, before the value manages to get propagated from L1 cache back to main memory, the second thread reads the value out of main memory. In this case, from the perspective of main memory, the second thread’s load out of main memory takes place before the first thread’s store has rippled back down to main memory. This case is an example of StoreLoad reordering, so named because a store has been reordered with a later load. There are also LoadStore, LoadLoad, and StoreStore reorderings that are possible. Jeff Preshing’s “Memory Barriers are Like Source Control” article does a great job of describing these four possible reordering scenarios in detail.
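
If you want to see a StoreLoad reordering actually happen on your own machine, here's a small standalone experiment, closely modeled on the one in the Preshing on Programming articles (this is illustrative code, not anything from Takua). Each iteration, one thread stores to X and then loads Y, while the other stores to Y and then loads X, all with relaxed atomics so that nothing constrains the ordering; an iteration that ends with both loads seeing 0 means a store was reordered after a load. The std::barrier usage requires C++20 and is just there to keep the two threads in lockstep between iterations; since StoreLoad reordering is the one kind that even x86-64's strong memory model permits, you will likely see a nonzero count on both x86-64 and arm64, with the exact number varying wildly from run to run:

#include <atomic>
#include <barrier>
#include <cstdio>
#include <thread>

int main() {
    constexpr int kIterations = 1000000;
    std::atomic<int> X{0}, Y{0};
    int r1 = 0, r2 = 0;
    long long reorderings = 0;
    std::barrier<> startBarrier(2), endBarrier(2);

    std::thread threadA([&] {
        for (int i = 0; i < kIterations; ++i) {
            startBarrier.arrive_and_wait(); // begin the iteration together
            X.store(1, std::memory_order_relaxed);
            r1 = Y.load(std::memory_order_relaxed);
            endBarrier.arrive_and_wait();   // wait for thread B to finish too
        }
    });
    std::thread threadB([&] {
        for (int i = 0; i < kIterations; ++i) {
            startBarrier.arrive_and_wait();
            Y.store(1, std::memory_order_relaxed);
            r2 = X.load(std::memory_order_relaxed);
            endBarrier.arrive_and_wait();
            // Only thread B tallies and resets; the barriers keep thread A
            // from racing ahead into the next iteration while this happens.
            if (r1 == 0 && r2 == 0) { ++reorderings; }
            X.store(0, std::memory_order_relaxed);
            Y.store(0, std::memory_order_relaxed);
        }
    });
    threadA.join();
    threadB.join();
    std::printf("observed %lld StoreLoad reorderings in %d iterations\n",
                reorderings, kIterations);
    return 0;
}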

Different CPU architectures make different guarantees about which types of memory reordering can and can’t happen on that particular architecture at the hardware level. A processor that guarantees absolutely no memory reordering of any kind is said to have a sequentially consistent memory model. Few, if any, modern processor architectures provide a guaranteed sequentially consistent memory model. Some processors don’t guarantee absolutely sequential consistency, but do guarantee that at least when a CPU core makes a series of writes, other CPU cores will see those writes in the same sequence that they were made; CPUs that make this guarantee have a strong memory model. Strong memory models effectively guarantee that StoreLoad reordering is the only type of reordering allowed; x86-64 has a strong memory model. Finally, CPUs that allow for any type of memory reordering at all are said to have a weak memory model. The arm64 architecture uses a weak memory model, although arm64 at least guarantees that if we read a value through a pointer, the value read will be at least as new as the pointer itself.

So, how can we possibly hope to be able to reason about multithreaded code when both the compiler and the processor can happily reorder our memory access instructions between threads whenever they want for whatever reason they want? The answer is in memory barriers and fence instructions; these tools allow us to specify boundaries that the compiler cannot reorder memory access instructions across and allow us to force the CPU to make sure that values are flushed to main memory before being read. In C++, specifying barriers and fences can be done by using compiler intrinsics that map to specific underlying assembly instructions, but the easier and more common way of doing this is by using std::memory_order flags in combination with atomics. Other languages have similar concepts; for example, Rust’s atomic access flags are very similar to the C++ memory ordering flags.
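
As a quick illustration of what the standalone fence flavor of these tools looks like in C++, here's a minimal, hypothetical message-passing sketch (the names payload and ready are just for illustration) that pairs std::atomic_thread_fence with relaxed atomic operations; the equivalent formulation using per-operation memory ordering flags appears a bit further below:

#include <atomic>

// The release fence in the producer paired with the acquire fence in the
// consumer guarantees that once the consumer sees `ready` become true, it
// also sees the earlier plain write to `payload`.
int payload = 0;
std::atomic<bool> ready(false);

void producerWithFences() {
    payload = 42;                                          // plain write
    std::atomic_thread_fence(std::memory_order_release);   // standalone fence
    ready.store(true, std::memory_order_relaxed);
}

int consumerWithFences() {
    while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_acquire);   // standalone fence
    return payload;  // guaranteed to return 42
}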

std::memory_order flags specify how memory accesses for all operations surrounding an atomic are to be ordered; the impacted surrounding operations include all non-atomics. There are a whole bunch of std::memory_order flags; we’ll examine the few that are relevant to the specific example in Listing 1. The heaviest hammer of all of the flags is std::memory_order_seq_cst, which enforces absolute sequential consistency at the cost of potentially being more expensive due to potentially needing more loads and/or stores. For example, on x86-64, std::memory_order_seq_cst is often implemented using slower xchg or paired mov/mfence instructions instead of a single mov instruction, and on arm64, the overhead is even greater due to arm64’s weak memory model. Using std::memory_order_seq_cst also potentially disallows the CPU from reordering unrelated, longer running instructions to start (and therefore finish) earlier, potentially causing even more slowdowns. In C++, atomic operations default to using std::memory_order_seq_cst if no memory ordering flag is explicitly specified. Contrast this with std::memory_order_relaxed, which is the exact opposite of std::memory_order_seq_cst. std::memory_order_relaxed enforces no synchronization or ordering constraints whatsoever; on an architecture like x86-64, using std::memory_order_relaxed can be faster than using std::memory_order_seq_cst if your memory ordering requirements are already met in hardware by x86-64’s strong memory model. However, being sloppy with std::memory_order_relaxed can result in some nasty nondeterministic bugs on arm64 if your code requires specific memory ordering guarantees, due to arm64’s weak memory model. The above is the exact reason why the code in Listing 1 occasionally resulted in dropped tiles in Takua on arm64!

Without any kind of memory ordering constraints, with arm64’s weak memory ordering, the code in Listing 1 can sometimes execute in such a way that one thread sets nextTileSoftLock to true, but another thread checks nextTileSoftLock before the first thread’s new value has propagated out to all of the other threads. As a result, two threads can end up in a race condition, both trying to increment the non-atomic nextTileIndex at the same time. When this happens, two threads can end up working on the same tile at the same time, or a tile can get skipped! We could fix this problem by just removing the memory ordering flags entirely from Listing 1, allowing everything to default back to std::memory_order_seq_cst. However, as just mentioned above, we can do better than std::memory_order_seq_cst if we know specifically what memory ordering requirements we need for the code to work correctly.

Enter std::memory_order_acquire and std::memory_order_release, which represent acquire semantics and release semantics respectively and, when used correctly, always come in a pair. Acquire semantics apply to load (read) operations and prevent memory reordering of the load operation with any subsequent read or write operation. Release semantics apply to store (write) operations and prevent memory reordering of the store operation with any preceding read or write operation. In other words, std::memory_order_acquire tells the compiler to issue instructions that prevent LoadLoad and LoadStore reordering from happening, and std::memory_order_release tells the compiler to issue instructions that prevent LoadStore and StoreStore reordering from happening. Using acquire and release semantics allows Listing 1 to work correctly on arm64, while being ever so slightly cheaper than enforcing absolute sequential consistency everywhere.
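
To make the acquire/release pairing concrete, here is a minimal generic sketch (with made-up names; this is not Takua’s actual tile dispatch code) of the classic pattern where one thread publishes a plain value and another thread consumes it:

#include <atomic>

int payload = 0;                   // plain, non-atomic data
std::atomic<bool> ready{false};    // flag guarding the payload

void producer() {
    payload = 42;  // plain write
    // Release: the write to payload above cannot be reordered to after this
    // store, so any thread that observes ready == true via an acquire load
    // is guaranteed to also observe payload == 42.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire: reads after this load cannot be reordered to before it.
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes
    }
    return payload;  // guaranteed to read 42, even on arm64's weak memory model
}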

What is the takeaway from this long tour through memory reordering and weak and strong memory models and memory ordering constraints? The takeaway is that when writing multithreaded code that needs to be portable across architectures with different memory ordering guarantees, such as x86-64 versus arm64, we need to think very carefully about how each architecture’s memory ordering guarantees (or lack thereof) impact any lock-free cross-thread communication we need to do! Atomic code often can be written more sloppily on x86-64 than on arm64 and still have a good chance of working, whereas arm64’s weak memory model means there’s much less room for being sloppy. If you want to smoke out potential bugs in your lock-free atomic code, porting to arm64 is a great way to find them!

A Deep Dive on x86-64 versus arm64 Through the Lens of Compiling std::atomic::compare_exchange_weak()

While I was looking for the source of the memory reordering bug, I found a separate interesting bug in Takua’s atomic framebuffer… or at least, I thought it was a bug. The thing I found turned out to not be a bug at all in the end, but at the time I thought that there was a bug in the form of a race condition in an atomic compare-and-exchange loop. I figured that the renderer must be just running correctly most of the time instead of all of the time, but as I’ll explain in a little bit, the renderer actually provably runs correctly 100% of the time. Understanding what was going on here led me to dive into the compiler’s assembly output, and wound up being an interesting case study in comparing how the same exact C++ source code compiles to x86-64 versus arm64. In order to provide the context for the not-a-bug and what I learned about arm64 from it, I need to first briefly describe what Takua’s atomic framebuffer is and how it is used.

Takua supports multiple threads writing to the same pixel in the framebuffer at the same time. There are two major use cases for this capability: first, integration techniques that use light tracing connect back to the camera completely arbitrarily, resulting in splats to the framebuffer that are unpredictable and can overlap on the same pixels. Second, adaptive sampling techniques that redistribute sample allocation within a single iteration (meaning launching a single set of pixel samples) can result in multiple samples for the same pixel in the same iteration, which means multiple threads can be calculating paths starting from the same pixel and therefore multiple threads need to write to the same framebuffer pixel. In order to support multiple threads writing simultaneously to the same pixel in the framebuffer, there are three possible implementation options. The first option is to just keep a separate framebuffer per thread and merge afterwards, but this approach obviously requires a potentially huge amount of memory. The second option is to never write to the framebuffer directly, but instead keep queues of framebuffer write requests that occasionally get flushed to the framebuffer by a dedicated worker thread (or some variation thereof). The third option is to just make each pixel in the framebuffer support exclusive operations through atomics (a mutex per pixel works too, but obviously this would involve much more overhead and might be slower); this option is the atomic framebuffer. I actually implemented the second option in Takua a long time ago, but the added complexity and performance impact of needing to flush the queue led me to eventually replace the whole thing with an atomic framebuffer.

The tricky part of implementing an atomic framebuffer in C++ is the need for atomic floats. Obviously each pixel in the framebuffer has to store, at the very least, accumulated radiance values for the base RGB primaries, along with potentially other AOV values, and accumulated radiance values and many common AOVs all have to be represented with floats. Modern C++ has standard library support for atomic types through std::atomic, and std::atomic works with floats. However, pre-C++20, std::atomic only provides atomic arithmetic operations for integer types. C++20 adds fetch_add() and fetch_sub() implementations for std::atomic<float>, but I wrote Takua’s atomic framebuffer way back when C++11 was still the latest standard. So, pre-C++20, if you want atomic arithmetic operations for std::atomic<float>, you have to implement them yourself. Fortunately, pre-C++20 std::atomic does provide compare-and-exchange operations for all atomic types, and that’s all we need to implement the rest ourselves.
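
For reference, here is what the C++20-and-later path looks like; this is just a minimal illustration of the standard library feature (with a made-up accumulator name), not code from Takua, which predates C++20:

#include <atomic>

std::atomic<float> accumulatedRadiance{0.0f};

void addSample(float radiance) {
    // Only valid under C++20 and later; pre-C++20, std::atomic<float> has no
    // fetch_add(), which is why the compare-and-exchange loop below is needed.
    accumulatedRadiance.fetch_add(radiance);
}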

Implementing fetch_add() for atomic floats is fairly straightforward. Let’s say we want to add a value f1 to an atomic float f0. The basic idea is to do an atomic load from f0 into some temporary variable oldval. A standard compare-and-exchange operation compares some input value with the current value of the atomic float, and if the two are equal, replaces the current value of the atomic float with a second input value; C++ provides implementations in the form of compare_exchange_weak() and compare_exchange_strong(). So, all we need to do is run compare_exchange_weak() on f0, where the value we use for the comparison test is oldval and the replacement value is oldval + f1; if compare_exchange_weak() succeeds, we return oldval; otherwise, we loop and repeat until compare_exchange_weak() succeeds. Here’s an example implementation:

float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    do {
        float oldval = f0.load();
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}
Listing 2: Example implementation of atomic float addition.

Seeing why the above implementation works should be very straightforward: imagine two threads are calling the above implementation at the same time. We want each thread to reload the atomic float on each iteration because we never want a situation where a first thread loads from f0, a second thread succeeds in adding to f0, and then the first thread also succeeds in writing its value to f0, because upon the first thread writing, the value of f0 that the first thread used for the addition operation is out of date!
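
To connect this back to the atomic framebuffer, here’s a minimal sketch (with hypothetical names; Takua’s actual framebuffer pixels store many more channels) of how a pixel might use a helper like addAtomicFloat() from Listing 2 to accumulate splats from multiple threads:

#include <atomic>

float addAtomicFloat(std::atomic<float>& f0, const float f1);  // from Listing 2

// Hypothetical pixel type for illustration; a real framebuffer pixel would
// also carry AOVs, sample counts, and so on.
struct AtomicPixel {
    std::atomic<float> r{0.0f};
    std::atomic<float> g{0.0f};
    std::atomic<float> b{0.0f};
};

// Safe to call from many threads on the same pixel simultaneously; each
// channel goes through the compare-and-exchange loop inside addAtomicFloat().
void splat(AtomicPixel& pixel, float radianceR, float radianceG, float radianceB) {
    addAtomicFloat(pixel.r, radianceR);
    addAtomicFloat(pixel.g, radianceG);
    addAtomicFloat(pixel.b, radianceB);
}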

Well, here’s the implementation that has actually been in Takua’s atomic framebuffer implementation for most of the past decade. This implementation is very similar to Listing 2, but compared with Listing 2, Lines 2 and 3 are swapped from where they should be; I likely swapped these two lines through a simple copy/paste error or something when I originally wrote it. This is the implementation that I suspected was a bug upon revisiting it during the arm64 porting process:

float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    float oldval = f0.load();
    do {
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}
Listing 3: What I thought was an incorrect implementation of atomic float addition.

In the Listing 3 implementation, note how the atomic load of f0 only ever happens once, outside of the loop. The following is what I thought was going on and why at the moment I thought this implementation was wrong: Think about what happens if a first thread loads from f0 and then a second thread’s call to compare_exchange_weak() succeeds before the first thread gets to compare_exchange_weak(); in this race condition scenario, the first thread should get stuck in an infinite loop. Since the value of f0 has now been updated by the second thread and the first thread never reloads the value of f0 inside of the loop, the first thread should have no way of ever succeeding at the compare_exchange_weak() call! However, in reality, with the Listing 3 implementation, Takua never actually gets stuck in an infinite loop, even when multiple threads are writing to the same pixel in the atomic framebuffer. I initially thought that I must have just been getting really lucky every time and multiple threads, while attempting to accumulate to the same pixel, just never happened to produce the specific compare_exchange_weak() call ordering that would cause the race condition and infinite loop. But then I repeatedly tried a simple test where I had 32 threads simultaneously call addAtomicFloat() for the same atomic float a million times per thread, and… still an infinite loop never occurred. So, the situation appeared to be that what I thought was incorrect code was always behaving as if it had been written correctly, and furthermore, this held true on both x86-64 and on arm64, across both compiling with Clang on macOS and compiling with GCC on Linux.

If you are well-versed in the C++ specifications, you already know which crucial detail I had forgotten that explains why Listing 3 is actually completely correct and functionally equivalent to Listing 2. Under the hood, std::atomic<T>::compare_exchange_weak(T& expected, T desired) requires doing an atomic load of the target value in order to compare the target value with expected. What I had forgotten was that if the comparison fails, std::atomic<T>::compare_exchange_weak() doesn’t just return a false bool; the function also replaces expected with the result of the atomic load on the target value! So really, there isn’t only a single atomic load of f0 in Listing 3; there’s actually an atomic load of f0 in every loop iteration as part of compare_exchange_weak(), and in the event that the comparison fails, the equivalent of oldval = f0.load() happens. Of course, I didn’t actually correctly remember what compare_exchange_weak() does in the comparison failure case, and I stupidly didn’t double check cppreference, so it took me much longer to figure out what was going on.
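
Here’s a tiny standalone demonstration (not from Takua) of the detail I had forgotten: when the comparison fails, compare_exchange_weak() overwrites expected with the value it just atomically loaded:

#include <atomic>
#include <cassert>

void compareExchangeFailureDemo() {
    std::atomic<int> value{42};
    int expected = 7;  // deliberately wrong guess
    bool succeeded = value.compare_exchange_weak(expected, 100);
    // The comparison fails (42 != 7), so nothing is stored into value...
    assert(!succeeded);
    assert(value.load() == 42);
    // ...but expected now holds the value that was atomically loaded from
    // value, which is exactly what makes the loop in Listing 3 correct.
    assert(expected == 42);
}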

So, still missing the key piece of knowledge that I had forgotten and assuming that compare_exchange_weak() didn’t modify any inputs upon comparison failure, my initial guess was that perhaps the compiler was inlining f0.load() wherever oldval was being used as an optimization, which would produce a result that should prevent the race condition from ever happening. However, after a bit more thought, I concluded that this optimization was very unlikely, for two reasons. First, it changes the written semantics of the code by effectively moving an operation from outside a loop to the inside of the loop. Second, inlining f0.load() wherever oldval is used is not actually a safe code transformation and can produce a different result from the originally written code, since having two atomic loads from f0 introduces the possibility that another thread can do an atomic write to f0 in between the current thread’s two atomic loads.

Things got even more interesting when I tried adding in an additional bit of indirection around the atomic load of f0 into oldval. Here is an actually incorrect implementation that I thought should be functionally equivalent to the implementation in Listing 3:

float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    const float oldvaltemp = f0.load();
    do {
        float oldval = oldvaltemp;
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}
Listing 4: An actually incorrect implementation of atomic float addition that might appear to be semantically identical to the implementation in Listing 3 if you've forgotten a certain very important detail about std::atomic<T>::compare_exchange_weak().

Creating the race condition and subsequent infinite loop is extremely easy and reliable with Listing 4. So, to summarize where I was at this point: Listing 2 is a correctly written implementation that produces a correct result in reality, Listing 4 is an incorrectly written implementation that, as expected, produces an incorrect result in reality, and Listing 3 is what I thought was an incorrectly written implementation that I thought was semantically identical to Listing 4, but actually produces the same correct result in reality as Listing 2!

So, left with no better ideas, I decided to just go look directly at the compiler’s output assembly. To make things a bit easier, we’ll look at and compare the x86-64 assembly for the Listing 2 and Listing 3 C++ implementations first, and explain what important detail I had missed that led me down this wild goose chase. Then, we’ll look at and compare the arm64 assembly, and we’ll discuss some interesting things I learned along the way by comparing the x86-64 and arm64 assembly for the same C++ function.

Here is the corresponding x86-64 assembly for the correct C++ implementation in Listing 2, compiled with Clang 10.0.0 using -O3. For readers who are not very used to reading assembly, I’ve included annotations as comments in the assembly code to describe what the assembly code is doing and how it corresponds back to the original C++ code:

addAtomicFloat(std::atomic<float>&, float):  # f0 is dword ptr [rdi], f1 is xmm0
.LBB0_1:
        mov           eax, dword ptr [rdi]   # eax = *arg0 = f0.load()
        movd          xmm1, eax              # xmm1 = eax = f0.load()
        movdqa        xmm2, xmm1             # xmm2 = xmm1 = eax = f0.load()
        addss         xmm2, xmm0             # xmm2 = (xmm2 + xmm0) = (f0 + f1)
        movd          ecx, xmm2              # ecx = xmm2 = (f0 + f1)
        lock cmpxchg  dword ptr [rdi], ecx   # if eax == *arg0 { ZF = 1; *arg0 = arg1 }
                                             #    else { ZF = 0; eax = *arg0 };
                                             #    "lock" means all done exclusively
        jne           .LBB0_1                # if ZF == 0 goto .LBB0_1
        movdqa        xmm0, xmm1             # return f0 value from before cmpxchg
        ret
Listing 5: x86-64 assembly corresponding to the implementation in Listing 2, with my annotations in the comments. Compiled using x86-64 Clang 10.0.0 using -O3. See on Godbolt Compiler Explorer

Here is the corresponding x86-64 assembly for the C++ implementation in Listing 3; again, this is the version that produces the same correct result as Listing 2. Just like with Listing 5, this was compiled using Clang 10.0.0 using -O3, and descriptive annotations are in the comments:

addAtomicFloat(std::atomic<float>&, float):  # f0 is dword ptr [rdi], f1 is xmm0
        mov           eax, dword ptr [rdi]   # eax = *arg0 = f0.load()
.LBB0_1:
        movd          xmm1, eax              # xmm1 = eax = f0.load()
        movdqa        xmm2, xmm1             # xmm2 = xmm1 = eax = f0.load()
        addss         xmm2, xmm0             # xmm2 = (xmm2 + xmm0) = (f0 + f1)
        movd          ecx, xmm2              # ecx = xmm2 = (f0 + f1)
        lock cmpxchg  dword ptr [rdi], ecx   # if eax == *arg0 { ZF = 1; *arg0 = arg1 }
                                             #    else { ZF = 0; eax = *arg0 };
                                             #    "lock" means all done exclusively
        jne           .LBB0_1                # if ZF == 0 goto .LBB0_1
        movdqa        xmm0, xmm1             # return f0 value from before cmpxchg
        ret
Listing 6: x86-64 assembly corresponding to the implementation in Listing 3, with my annotations in the comments. Compiled using x86-64 Clang 10.0.0 using -O3. See on Godbolt Compiler Explorer

The compiled x86-64 assembly in Listing 5 and Listing 6 is almost identical; the only difference is that in Listing 5, copying data from the address stored in register rdi to register eax happens after label .LBB0_1 and in Listing 6 the copy happens before label .LBB0_1. Comparing the x86-64 assembly with the C++ code, we can see that this difference corresponds directly to where f0’s value is atomically loaded into oldval. We can also see that std::atomic<float>::compare_exchange_weak() compiles down to a single cmpxchg instruction, which as the instruction name suggests, is a compare and exchange operation. The lock instruction prefix in front of cmpxchg ensures that the current CPU core has exclusive ownership of the corresponding cache line for the duration of the cmpxchg operation, which is how the operation is made atomic.

This is the point where I eventually realized what I had missed, although I didn’t notice immediately; figuring it out didn’t actually happen until several days later! The thing that finally made me realize what I had missed, and understand why Listing 3 / Listing 6 don’t actually result in an infinite loop and instead match the behavior of Listing 2 / Listing 5, lies in cmpxchg. Let’s take a look at the official Intel 64 and IA-32 Architectures Software Developer’s Manual’s description [Intel 2021] of what cmpxchg does:

Compares the value in the AL, AX, EAX, or RAX register with the first operand (destination operand). If the two values are equal, the second operand (source operand) is loaded into the destination operand. Otherwise, the destination operand is loaded into the AL, AX, EAX or RAX register. RAX register is available only in 64-bit mode.

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

If the compare part of cmpxchg fails, the first (destination) operand is loaded into the EAX register! After thinking about this property of cmpxchg for a bit, I finally had my head-smack moment and remembered that std::atomic<T>::compare_exchange_weak(T& expected, T desired) replaces expected with the result of the atomic load in the event of comparison failure. This property is exactly why std::atomic<T>::compare_exchange_weak() can be compiled down to a single cmpxchg instruction on x86-64 in the first place. We can actually see the compiler being clever here in Listing 6 and exploiting the fact that cmpxchg’s comparison failure mode writes into the eax register: the compiler chooses to use eax as the target for the mov instruction in Line 1 instead of some other register, so that a second move from eax into some other register isn’t necessary after cmpxchg. If anything, the implementation in Listing 3 / Listing 6 is actually slightly more efficient than the implementation in Listing 2 / Listing 5, since there is one fewer mov instruction needed in the loop.

So what does this have to do with learning about arm64? Well, while I was in the process of looking at the x86-64 assembly to try to understand what was going on, I also tried the implementation in Listing 3 on my Raspberry Pi 4B just to sanity check if things worked the same on arm64. At that point I hadn’t yet realized that the code in Listing 3 was actually correct, so I was beginning to consider possibilities like a compiler bug or weird platform-specific behavior that I hadn’t thought of; to rule those more exotic explanations out, I decided to see if the code worked the same on x86-64 and arm64. Of course the code worked exactly the same on both, so the next step was to also examine the arm64 assembly in addition to the x86-64 assembly. Comparing the same code’s corresponding assembly for x86-64 and arm64 at the same time proved to be a very interesting exercise in getting to better understand some low-level and general differences between the two instruction sets.

Here is the corresponding arm64 assembly for the implementation in Listing 2; this is the arm64 assembly that is the direct counterpart to the x86-64 assembly in Listing 5. This arm64 assembly was also compiled with Clang 10.0.0 using -O3. I’ve included annotations here as well, although admittedly my arm64 assembly comprehension is not as good as my x86-64 assembly comprehension, since I’m relatively new to compiling for arm64. If you’re well versed in arm64 assembly and see a mistake in my annotations, feel free to send me a correction!

addAtomicFloat(std::atomic<float>&, float):
        b       .LBB0_2              // goto .LBB0_2
.LBB0_1:
        clrex                        // clear this thread's record of exclusive lock
.LBB0_2:
        ldar    w8, [x0]             // w8 = *arg0 = f0, non-atomically loaded
        ldaxr   w9, [x0]             // w9 = *arg0 = f0.load(), atomically
                                     //    loaded (get exclusive lock on x0), with
                                     //    implicit synchronization
        fmov    s1, w8               // s1 = w8 = f0
        fadd    s2, s1, s0           // s2 = s1 + s0 = (f0 + f1)
        cmp     w9, w8               // compare non-atomically loaded f0 with atomically
                                     //    loaded f0 and store result in N
        b.ne    .LBB0_1              // if N==0 { goto .LBB0_1 }
        fmov    w8, s2               // w8 = s2 = (f0 + f1)
        stlxr   w9, w8, [x0]         // if this thread has the exclusive lock,
                                     //    { *arg0 = w8 = (f0 + f1), release lock },
                                     //    store whether or not succeeded in w9
        cbnz    w9, .LBB0_2          // if w9 says exclusive lock failed { goto .LBB0_2}
        mov     v0.16b, v1.16b       // return f0 value from ldaxr
        ret
Listing 7: arm64 assembly corresponding to Listing 2, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3. See on Godbolt Compiler Explorer

I should note here that the specific version of arm64 that Listing 7 was compiled for is ARMv8.0-A, which is what Clang and GCC both default to when compiling for arm64; this detail will become important a little bit later in this post. When we compare Listing 7 with Listing 5, we can immediately see some major differences between the arm64 and x86-64 instruction sets, aside from superficial stuff like how registers are named. The arm64 version is just under twice as long as the x86-64 version, and examining the code, we can see that most of the additional length comes from how the atomic compare-and-exchange is implemented. Actually, the rest of the code is very similar; the rest of the code is just moving stuff around to support the addition operation and to deal with setting up and jumping to the top of the loop. In the compare and exchange code, we can see that the arm64 version does not have a single instruction to implement the atomic compare-and-exchange! While the x86-64 version can compile std::atomic<float>::compare_exchange_weak() down into a single cmpxchg instruction, ARMv8.0-A has no equivalent instruction, so the arm64 version instead must use three separate instructions to implement the complete functionality: ldaxr to do an exclusive load, stlxr to do an exclusive store, and clrex to reset the current thread’s record of exclusive access requests.

This difference speaks directly towards x86-64 being a CISC architecture and arm64 being a RISC architecture. x86-64’s CISC nature calls for the ISA to have a large number of instructions carrying out complex, often multi-step operations, and this design philosophy is what allows x86-64 to encode complex multi-step operations like a compare-and-exchange as a single instruction. Conversely, arm64’s RISC nature means a design consisting of fewer, simpler operations [Patterson and Ditzel 1980]; for example, the RISC design philosophy mandates that memory access be done through specific single-cycle instructions instead of as part of a more complex instruction such as compare-and-exchange. These differing design philosophies mean that in arm64 assembly, we will often see many instructions used to implement what would be a single instruction in x86-64; given this difference, compiling Listing 2 produces surprisingly similar structure in the output x86-64 (Listing 5) and arm64 (Listing 7) assembly. However, if we take the implementation of addAtomicFloat() in Listing 3 and compile it for arm64’s ARMv8.0-A revision, the structural differences between the x86-64 and arm64 output become far more apparent:

addAtomicFloat(std::atomic<float>&, float):
        ldar    w9, [x0]             // w9 = *arg0 = f0, non-atomically loaded
        ldaxr   w8, [x0]             // w8 = *arg0 = f0.load(), atomically
                                     // loaded (get exclusive lock on x0), with
                                     // implicit synchronization
        fmov    s1, w9               // s1 = w9 = f0
        cmp     w8, w9               // compare non-atomically loaded f0 with atomically
                                     // loaded f0 and store result in N
        b.ne    .LBB0_3              // if N==0 { goto .LBB0_3 }
        fadd    s2, s1, s0           // s2 = s1 + s0 = (f0 + f1)
        fmov    w9, s2               // w9 = s2 = (f0 + f1)
        stlxr   w10, w9, [x0]        // if this thread has the exclusive lock,
                                     //    { *arg0 = w9 = (f0 + f1), release lock },
                                     //    store whether or not succeeded in w10
        cbnz    w10, .LBB0_4         // if w10 says exclusive lock failed { goto .LBB0_4 }
        mov     w9, #1               // w9 = 1 (???)
        tbz     w9, #0, .LBB0_8      // if bit 0 of w9 == 0 { goto .LBB0_8 }
        b       .LBB0_5              // goto .LBB0_5
.LBB0_3:
        clrex                        // clear this thread's record of exclusive lock
.LBB0_4:
        mov     w9, wzr              // w9 = 0
        tbz     w9, #0, .LBB0_8      // if bit 0 of w9 == 0 { goto .LBB0_8 }
.LBB0_5:
        mov     v0.16b, v1.16b       // return f0 value from ldaxr
        ret
.LBB0_6:
        clrex                        // clear this thread's record of exclusive lock
.LBB0_7:
        mov     w10, wzr             // w10 = 0
        mov     w8, w9               // w8 = w9
        cbnz    w10, .LBB0_5         // if w10 is not zero { goto .LBB0_5 }
.LBB0_8:
        ldaxr   w9, [x0]             // w9 = *arg0 = f0.load(), atomically
                                     //    loaded (get exclusive lock on x0), with
                                     //    implicit synchronization
        fmov    s1, w8               // s1 = w8 = f0
        cmp     w9, w8               // compare non-atomically loaded f0 with atomically
                                     // loaded f0 and store result in N
        b.ne    .LBB0_6              // if N==0 { goto .LBB0_6 }
        fadd    s2, s1, s0           // s2 = s1 + s0 = (f0 + f1)
        fmov    w8, s2               // w8 = s2 = (f0 + f1)
        stlxr   w10, w8, [x0]        // if this thread has the exclusive lock,
                                     //    { *arg0 = w8 = (f0 + f1), release lock },
                                     //    store whether or not succeeded in w10
        cbnz    w10, .LBB0_7         // if w10 says exclusive lock failed { goto .LBB0_7 }
        mov     w10, #1              // w10 = 1
        mov     w8, w9               // w8 = w9 = f0.load()
        cbz     w10, .LBB0_8         // if w10==0 { goto .LBB0_8 }
        b       .LBB0_5              // goto .LBB0_5
Listing 8: arm64 assembly corresponding to Listing 3, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3. See on Godbolt Compiler Explorer

Moving the atomic load out of the loop in Listing 3 resulted in a single-line change between Listing 5 and Listing 6’s x86-64 assembly, but causes the arm64 version to explode in size and radically change in structure between Listing 7 and Listing 8! The key difference between Listing 7 and Listing 8 is that in Listing 8, the entire first iteration of the while loop is lifted out into its own code segment, which can then either directly return out of the function or fall into the main body of the loop afterwards. I initially thought that Clang’s decision to lift out the first iteration of the loop was surprising, but it turns out that GCC 10.3 and MSVC v19.28’s respective arm64 backends also decide to lift the first iteration of the loop out. The need to lift the entire first iteration out of the loop likely comes from the need to use an ldaxr instruction to carry out the initial atomic load of f0. Compared with GCC 10.3 and MSVC v19.28, though, Clang 10.0.0’s arm64 output does seem to do a bit more jumping around (see .LBB0_4 through .LBB0_7). Also, admittedly I’m not entirely sure why register w9 (and later w10) gets set to 1 and then immediately tested against 0 in lines 16/17 and lines 47/49; maybe that’s just a convenient way to clear the Z bit of the CPSR (Current Program Status Register; this is analogous to EFLAGS on x86-64)? But anyhow, compared with Listing 7, the arm64 assembly in Listing 8 is much longer in terms of code length, but is actually only slightly less efficient in terms of total instructions executed. The slight additional inefficiency comes from some of the additional setup work needed to manage all of the jumping and the split loop. However, the fact that Listing 8 is less efficient compared with Listing 7 is interesting when we compare with what Listing 3 does to the x86-64 assembly; in the case of x86-64, pulling the initial atomic load out of the loop makes the output x86-64 assembly slightly more efficient, as opposed to slightly less efficient as we have here with arm64.

As a very loose general rule of thumb, arm64 assembly tends to be longer than the equivalent x86-64 assembly for the same high-level code, because CISC architectures simply tend to encode a lot more stuff per instruction compared with RISC architectures [Weaver and McKee 2009]. However, compiled x86-64 binaries having fewer instructions doesn’t actually mean that x86-64 binaries necessarily run faster than equivalent, less “instruction-dense” compiled arm64 binaries. x86-64 instructions are variable length, requiring more complex logic in the processor’s instruction decoder, and since x86-64 instructions are more complex, they can take many more cycles per instruction to execute. Contrast with arm64, in which instructions are fixed length. RISC architectures generally feature fixed length instructions, although this generalization isn’t a hard rule; the SuperH architecture (famously used in the Sega Saturn and Sega Dreamcast) is notably a RISC architecture with variable length instructions. Fixed length instructions allow arm64 chips to have simpler decoding logic, and arm64 instructions also tend to take many fewer cycles each to execute (often, but not always, as low as one or two cycles per instruction). The end result is that even though compiled arm64 binaries have lower instruction density than compiled x86-64 binaries, arm64 processors tend to be able to retire more instructions per cycle than comparable x86-64 processors, allowing arm64 as an architecture to make up for the difference in code density.

…except, of course, all of the above is only loosely true today! While the x86-64 instruction set is still definitively a CISC instruction set today and the arm64 instruction set is still clearly a RISC instruction set today, a lot of the details have gotten fuzzier over time. Processors today rarely directly implement the instruction set that they run; basically all modern x86-64 processors feed x86-64 instructions into a huge hardware decoder block that breaks down individual x86-64 instructions into lower-level micro-operations, or μops. Compared with older x86 processors from decades ago that directly implemented x86, these modern micro-operation-based x86-64 implementations are often much more RISC-like internally. In fact, if you were to examine all of the parts of a modern Intel or AMD x86-64 processor that come after the instruction decoding phase, without knowing what processor you were looking at beforehand, you likely would not be able to determine whether the processor implemented a CISC or a RISC ISA [Thomadakis 2011].

The same is true going the other way; while modern x86-64 is a CISC architecture that in practical implementation is often more RISC-like, modern arm64 is a RISC architecture that sometimes has surprisingly CISC-like elements if you look closely. Modern arm64 processors often also decode individual instructions into smaller micro-operations [ARM 2016], although the extent to which modern arm64 processors do this is a lot less intensive than what modern x86-64 does [Castellano 2015]. Modern arm64 instruction decoders usually rely on simple hardwired control to break instructions down into micro-operations, whereas modern x86-64 must use a programmable ROM containing advanced microcode to store mappings from x86-64 instructions to micro-instructions.

Another way that arm64 has slowly gained some CISC-like characteristics is that arm64 over time has gained some surprisingly specialized complex instructions! Remember the important note I made earlier about Listing 7 and Listing 8 being generated specifically for the ARMv8.0-A revision of arm64? Well, the specific ldaxr/stlxr combination in Listings 7 and 8 that is needed to implement an atomic compare-and-exchange (and generally any kind of atomic load-and-conditional-store operation) is a specific area where a more complex single-instruction implementation generally can perform better than an implementation using several instructions. As discussed earlier, one complex instruction is not necessarily always faster than several simpler instructions due to how the instructions actually have to be decoded and executed, but in this case, one atomic instruction allows for a faster implementation than several instructions combined, since a single atomic instruction can take advantage of more available information at once [Cownie 2021]. Accordingly, the ARMv8.1-A revision of arm64 introduces a collection of new single-instruction atomic operations. Of interest to our particular example here is the new casal instruction, which performs a compare-and-exchange to memory with acquire and release semantics; this new instruction is a direct analog to the x86-64 cmpxchg instruction with the lock prefix.

We can actually use these new ARMv8.1-A single-instruction atomic operations today; while GCC and Clang both target ARMv8.0-A by default, ARMv8.1-A support can be enabled using the -march=armv8.1-a flag starting in GCC 10.1 and starting in Clang 9.0.0. Actually, Clang’s support might go back even earlier; Clang 9.0.0 was just the furthest back I was able to test. Here’s what Listing 2 compiles to using the -march=armv8.1-a flag to enable the casal instruction:

addAtomicFloat(std::atomic<float>&, float):
.LBB0_1:
        ldar    w8, [x0]             // w8 = *arg0 = f0, non-atomically loaded
        fmov    s1, w8               // s1 = w8 = f0
        fadd    s2, s1, s0           // s2 = s1 + s0 = (f0 + f1)
        mov     w9, w8               // w9 = w8 = f0
        fmov    w10, s2              // w10 = s2 = (f0 + f1)
        casal   w9, w10, [x0]        // atomically read the contents of the address stored
                                     //    in x0 (*arg0 = f0) and compare with w9;
                                     //    if [x0] == w9:
                                     //       atomically set the contents of the
                                     //       [x0] to the value in w10
                                     //    else:
                                     //       w9 = value loaded from [x0]
        cmp     w9, w8               // compare w9 and w8 and store result in N
        cset    w8, eq               // if previous instruction's compare was true,
                                     //    set w8 = 1
        cmp     w8, #1               // compare if w8 == 1 and store result in N
        b.ne    .LBB0_1              // if N==0 { goto .LBB0_1 }
        mov     v0.16b, v1.16b       // return f0 value from ldar
        ret
Listing 9: arm64 revision ARMv8.1-A assembly corresponding to Listing 2, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3 and also -march=armv8.1-a. See on Godbolt Compiler Explorer

If we compare Listing 9 with the ARMv8.0-A version in Listing 7, we can see that Listing 9 is only slightly shorter in terms of total instructions used, but the need for separate ldaxr, stlxr, and clrex instructions has been completely replaced with a single casal instruction. Interestingly, Listing 9 is now structurally very similar to its x86-64 counterpart in Listing 5. My guess is that if someone who was familiar with x86-64 assembly but had never seen arm64 assembly before was given Listing 5 and Listing 9 to compare side-by-side, they’d be able to figure out almost immediately what each line in Listing 9 does.

Now let’s see what Listing 3 compiles to using the -march=armv8.1-a flag:

addAtomicFloat(std::atomic<float>&, float):
        ldar    w9, [x0]             // w9 = *arg0 = f0, non-atomically loaded
        fmov    s1, w9               // s1 = w9 = f0
        fadd    s2, s1, s0           // s2 = s1 + s0 = (f0 + f1)
        mov     w8, w9               // w8 = w9 = f0
        fmov    w10, s2              // w10 = s2 = (f0 + f1)
        casal   w8, w10, [x0]        // atomically read the contents of the address stored
                                     //    in x0 (*arg0 = f0) and compare with w8;
                                     //    if [x0] == w8:
                                     //       atomically set the contents of the
                                     //       [x0] to the value in w10
                                     //    else:
                                     //       w8 = value loaded from [x0]
        cmp     w8, w9               // compare w8 and w9 and store result in N
        b.eq    .LBB0_3              // if N==1 { goto .LBB0_3 }
        mov     w9, w8
.LBB0_2:
        fmov    s1, w8               // s1 = w8 = value previously loaded from [x0] = f0
        fadd    s2, s1, s0           // s2 = s1 + s0 = (f0 + f1)
        fmov    w10, s2              // w10 = s2 = (f0 + f1)
        casal   w9, w10, [x0]        // atomically read the contents of the address stored
                                     //    in x0 (*arg0 = f0) and compare with w9;
                                     //    if [x0] == w9:
                                     //       atomically set the contents of the
                                     //       [x0] to the value in w10
                                     //    else:
                                     //       w9 = value loaded from [x0]
        cmp     w9, w8               // compare w9 and w8 and store result in N
        cset    w8, eq               // if previous instruction's compare was true,
                                     //    set w8 = 1
        cmp     w8, #1               // compare if w8 == 1 and store result in N
        mov     w8, w9               // w8 = w9 = value previously loaded from [x0] = f0
        b.ne    .LBB0_2              // if N==0 { goto .LBB0_2 }
.LBB0_3:
        mov     v0.16b, v1.16b       // return f0 value from ldar
        ret
Listing 10: arm64 revision ARMv8.1-A assembly corresponding to Listing 3, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3 and also -march=armv8.1-a. See on Godbolt Compiler Explorer

Here, the availability of the casal instruction makes a huge difference in the compactness of the output assembly! Listing 10 is nearly half the length of Listing 8, and more importantly, Listing 10 is also structurally much simpler than Listing 8. In Listing 10, the compiler still decided to unroll the first iteration of the loop, but the amount of setup and jumping around in between iterations of the loop is significantly reduced, which should make Listing 10 a bit more performant than Listing 8 even before we take into account the performance improvements from using casal.

By the way, remember our discussion of weak versus strong memory models in the previous section? As you may have noticed, Takua’s implementation of addAtomicFloat() uses std::atomic<T>::compare_exchange_weak() instead of std::atomic<T>::compare_exchange_strong(). The difference between the weak and strong versions of std::atomic<T>::compare_exchange_*() is that the weak version is allowed to sometimes report a failed comparison even if the values are actually equal (that is, the weak version is allowed to spuriously report a false negative), while the strong version guarantees always accurately reporting the outcome of the comparison. On x86-64, there is no practical difference between using the weak and strong versions, because x86-64’s single lock cmpxchg instruction never fails spuriously (in other words, on x86-64 the weak version is allowed by the spec to report a false negative but never actually does). However, on arm64, where ARMv8.0-A implements the compare-and-exchange as an exclusive load/store pair, the weak version really does report false negatives in practice. The reason I chose to use the weak version is that when the compare-and-exchange is attempted repeatedly in a loop anyway, the weak version is usually faster than the strong version on processors that implement it with exclusive load/store instructions, since a spurious failure just means taking one more trip around a loop we were already prepared to repeat. To see why, let’s take a look at the arm64 ARMv8.0-A assembly corresponding to Listing 2, but with std::atomic<T>::compare_exchange_strong() swapped in instead of std::atomic<T>::compare_exchange_weak():

addAtomicFloat(std::atomic<float>&, float):
.LBB0_1:
        ldar    w8, [x0]       // w8 = *arg0 = f0, non-atomically loaded
        fmov    s1, w8         // s1 = w8 = f0
        fadd    s2, s1, s0     // s2 = s1 + s0 = (f0 + f1)
        fmov    w9, s2         // w9 = s2 = (f0 + f1)
.LBB0_2:
        ldaxr   w10, [x0]      // w10 = *arg0 = f0.load(), atomically
                               //    loaded (get exclusive lock on x0), with
                               //    implicit synchronization
        cmp     w10, w8        // compare non-atomically loaded f0 with atomically
                               //    loaded f0 and store result in N
        b.ne    .LBB0_4        // if N==0 { goto .LBB0_4 }
        stlxr   w10, w9, [x0]  // if this thread has the exclusive lock,
                               //    { *arg0 = w9 = (f0 + f1), release lock },
                               //    store whether or not succeeded in w10
        cbnz    w10, .LBB0_2   // if w10 says exclusive lock failed { goto .LBB0_2}
        b       .LBB0_5        // goto .LBB0_5
.LBB0_4:
        clrex                  // clear this thread's record of exclusive lock
        b       .LBB0_1        // goto .LBB0_1
.LBB0_5:
        mov     v0.16b, v1.16b // return f0 value from ldaxr
        ret
Listing 11: arm64 revision ARMv8.0-A assembly corresponding to Listing 2 but using std::atomic::compare_exchange_strong() instead of std::atomic::compare_exchange_weak(), with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3. See on Godbolt Compiler Explorer

If we compare Listing 11 with Listing 7, we can see that just changing the compare-and-exchange from the weak version to the strong version causes a major restructuring of the arm64 assembly and the addition of a bunch more jumps. In Listing 7, loads from [x0] (corresponding to reads of f0 in the C++ code) happen together at the top of the loop and the loaded values are reused throughout the rest of the loop. However, Listing 11 is restructured such that loads from [x0] happen immediately before the instruction that uses the loaded value to do a comparison or other operation. This change means that there is less time for another thread to change the value at [x0] while this thread is still doing stuff. Interestingly, if we compile using ARMv8.1-A, the availability of single-instruction atomic operations means that, just like on x86-64, the difference between the strong and weak versions of the compare-and-exchange goes away; both end up compiling to the same arm64 assembly.
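
As a side note, the weak-versus-strong tradeoff usually flips when the compare-and-exchange is not wrapped in a retry loop. Here’s a generic sketch (with hypothetical names, not Takua code) of a case where the strong version is the natural choice, since a spurious failure would be indistinguishable from genuinely losing the race:

#include <atomic>

// Try to claim a work item exactly once; -1 means unclaimed. With the weak
// version, a spurious failure would incorrectly report that some other thread
// already claimed the item, so the strong version is the right tool here.
bool tryClaim(std::atomic<int>& owner, const int threadId) {
    int expected = -1;
    return owner.compare_exchange_strong(expected, threadId);
}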

At this point in the process of porting Takua to arm64, I only had a couple of Raspberry Pis, as Apple Silicon Macs hadn’t even been announced yet. Unfortunately, the Raspberry Pi 3B’s Cortex-A53-based CPU and the Raspberry Pi 4B’s Cortex-A72-based CPU only implement ARMv8.0-A, which means I couldn’t actually test and compare the versions of the compiled assembly with and without casal. Fortunately though, we can still compile the code such that if the processor the code is running on implements ARMv8.1-A, the code will use casal and other ARMv8.1-A single-instruction atomic operations, and otherwise, if only ARMv8.0-A is implemented, the code will fall back to using ldaxr, stlxr, and clrex. We can get the compiler to do the above automatically by using the -moutline-atomics compiler flag, which Richard Henderson of Linaro contributed to GCC 10.1 [Tkachov 2020] and which was also recently added to Clang 12.0.0 in April 2021. The -moutline-atomics flag tells the compiler to generate a runtime helper function and call that helper function at the atomic operation call site instead of directly generating atomic instructions; the helper function then does a runtime check for which atomic instructions are available on the current processor and dispatches to the best possible implementation given the available instructions. This runtime check is cached to make subsequent calls to the helper function faster. Using this flag means that if a future Raspberry Pi 5 or something comes out, hopefully with support for something newer than ARMv8.0-A, Takua should be able to automatically take advantage of faster single-instruction atomics without me having to reconfigure Takua’s builds per processor.

Performance Testing

So, now that I have Takua up and running on arm64 on Linux, how does it actually perform? Here are some comparisons, although there are some important caveats. First, at this stage in the porting process, the only arm64 hardware I had that could actually run reasonably sized scenes was a Raspberry Pi 4B with 4 GB of memory. The Raspberry Pi 4B’s CPU is a Broadcom BCM2711, which has 4 Cortex-A72 cores; these cores aren’t exactly fast, and even though the Raspberry Pi 4B came out in 2019, the Cortex-A72 core actually dates back to 2015. So, for the x86-64 comparison point, I’m using my early 2015 MacBook Air, which also has only 4 GB of memory and has an Intel Core i5-5250U CPU with 2 cores / 4 threads. Also, as an extremely unfair comparison point, I also ran the comparisons on my workstation, which has 128 GB of memory and dual Intel Xeon E5-2680 CPUs with 8 cores / 16 threads each, for 16 cores / 32 threads in total. The three scenes I used were the Cornell Box seen in Figure 1, the glass teacup seen in Figure 2, and the bedroom scene from my shadow terminator blog post; these scenes were chosen because they fit in under 4 GB of memory. All scenes were rendered to 16 samples-per-pixel, because I didn’t want to wait forever. The Cornell Box and Bedroom scenes were rendered using unidirectional path tracing, while the Tea Cup scene was rendered using VCM. The Cornell Box scene was rendered at 1024x1024 resolution, while the Tea Cup and Bedroom scenes were rendered at 1920x1080 resolution.

Here are the results:

CORNELL BOX (1024x1024, PT)
Processor:              Wall Time:     Core-Seconds:
Broadcom BCM2711:       440.627 s      approx 1762.51 s
Intel Core i5-5250U:    272.053 s      approx 1088.21 s
Intel Xeon E5-2680 x2:  36.6183 s      approx 1139.79 s

TEA CUP (1920x1080, VCM)
Processor:              Wall Time:     Core-Seconds:
Broadcom BCM2711:       2205.072 s     approx 8820.32 s
Intel Core i5-5250U:    2237.136 s     approx 8948.56 s
Intel Xeon E5-2680 x2:  174.872 s      approx 5593.60 s

BEDROOM (1920x1080, PT)
Processor:              Wall Time:     Core-Seconds:
Broadcom BCM2711:       5653.66 s      approx 22614.64 s
Intel Core i5-5250U:    4900.54 s      approx 19602.18 s
Intel Xeon E5-2680 x2:  310.35 s       approx 9931.52 s

In the results above, “wall time” refers to how long the render took to complete in real-world time as if measured by a clock on the wall, while “core-seconds” is a measure of how long the render would have taken completely single-threaded. Both values are separately tracked by the renderer; “wall time” is just a timer that starts when the renderer begins working on its first sample and stops when the very last sample is finished, while “core-seconds” is tracked by using a separate timer per thread and adding up how much time each thread has spent rendering.
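
For the curious, here’s a minimal sketch of the bookkeeping idea, using hypothetical names; Takua’s actual timing code is more involved than this:

#include <chrono>
#include <vector>

// Wall time is a single clock around the entire render; core-seconds is the
// sum of per-thread accumulators that only tick while each thread is rendering.
struct RenderTimers {
    std::chrono::steady_clock::time_point wallStart;
    std::vector<double> perThreadSeconds;  // one accumulator per render thread
};

template <typename WorkFn>
void runTimed(RenderTimers& timers, const int threadIndex, WorkFn&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();  // render a tile, a batch of samples, etc.
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    timers.perThreadSeconds[threadIndex] += elapsed.count();
}

double coreSeconds(const RenderTimers& timers) {
    double total = 0.0;
    for (const double t : timers.perThreadSeconds) { total += t; }
    return total;
}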

The results are interesting! The Raspberry Pi 4B and 2015 MacBook Air are both just completely outclassed by the dual-Xeon workstation in absolute wall time, but that should come as a surprise to absolutely nobody. What’s more surprising is that the multiplier by which the dual-Xeon workstation is faster than the Raspberry Pi 4B in wall time is much higher than the multiplier in core-seconds. For the Cornell Box scene, the dual-Xeon is 12.033x faster than the Raspberry Pi 4B in wall time, but is only 1.546x faster in core-seconds. For the Tea Cup scene, the dual-Xeon is 12.61x faster than the Raspberry Pi 4B in wall time, but is only 1.577x faster in core-seconds. For the Bedroom scene, the dual-Xeon is 18.217x faster than the Raspberry Pi 4B in wall time, but is only 2.277x faster in core-seconds. This difference in wall time multiplier versus core-seconds multiplier indicates that the Raspberry Pi 4B and dual-Xeon workstation are shockingly close in single-threaded performance; the dual-Xeon workstation only has such a crushing lead in wall clock time because it just has way more cores and threads available than the Raspberry Pi 4B.

When we compare the Raspberry Pi 4B to the 2015 MacBook Air, the results are even more interesting. Between these two machines, the times are actually relatively close; for the Cornell Box and Bedroom scenes, the Raspberry Pi 4B is within striking distance of the 2015 MacBook Air, and for the Tea Cup scene, the Raspberry Pi 4B is actually faster than the 2015 MacBook Air. The Raspberry Pi 4B is likely faster than the 2015 MacBook Air at the Tea Cup scene because the Tea Cup scene was rendered using VCM; VCM requires the construction of a photon map, and from previous profiling I know that Takua’s photon map builder works better with more actual physical cores. The Raspberry Pi 4B has four physical cores, whereas the 2015 MacBook Air only has two physical cores and gets to four threads using hyperthreading; my photon map builder doesn’t scale well with hyperthreading.

So, overall, the Raspberry Pi 4B’s arm64 processor, originally intended for phones, got handily beaten by a dual-Xeon workstation but came very close to a 2015 MacBook Air. The thing to remember here, though, is that the Raspberry Pi 4B’s arm64-based processor has a TDP of just 4 watts! Contrast that with the MacBook Air’s Intel Core i5-5250U, which has a 15 watt TDP, and with the dual Xeon E5-2680s in my workstation, which have a 130 watt TDP each for a combined 260 watt TDP. For this comparison, I think using the max TDP of each processor is a relatively fair thing to do, since Takua Renderer pushes each CPU to 100% utilization for sustained periods of time. So, the real story here from an energy perspective is that the Raspberry Pi 4B was between 12 and 18 times slower than the dual-Xeon workstation, but the Raspberry Pi 4B also has a TDP that is 65x lower than the dual-Xeon workstation. Similarly, the Raspberry Pi 4B nearly matches the 2015 MacBook Air, but with a TDP that is 3.75x lower!

When factoring in energy utilization, the numbers get even more interesting once we look at total energy used across the whole render. We can get the total energy used for each render by multiplying the wall clock render time by the TDP of each processor (again, we’re assuming 100% processor utilization during each render); this gives us total energy used in watt-seconds, which we divide by 3600 seconds per hour to get watt-hours:

CORNELL BOX (1024x1024, PT)
Processor:              Max TDP:    Total Energy Used:
Broadcom BCM2711:       4 W         0.4895 Wh
Intel Core i5-5250U:    15 W        1.1336 Wh
Intel Xeon E5-2680 x2:  260 W       2.6450 Wh

TEA CUP (1920x1080, VCM)
Processor:              Max TDP:    Total Energy Used:
Broadcom BCM2711:       4 W         2.4500 Wh
Intel Core i5-5250U:    15 W        9.3214 Wh
Intel Xeon E5-2680 x2:  260 W       12.6297 Wh

BEDROOM (1920x1080, PT)
Processor:              Max TDP:    Total Energy Used:
Broadcom BCM2711:       4 W         6.2819 Wh
Intel Core i5-5250U:    15 W        20.4189 Wh
Intel Xeon E5-2680 x2:  260 W       22.4142 Wh
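
To spell out the arithmetic for one entry: for the Cornell Box render on the Broadcom BCM2711, 440.627 seconds × 4 watts ≈ 1762.5 watt-seconds, and 1762.5 / 3600 ≈ 0.49 watt-hours, which matches the first row of the table above.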

From the numbers above, we can see that even though the Raspberry Pi 4B is a lot slower than the dual-Xeon workstation in wall clock time, the Raspberry Pi 4B absolutely crushes both the 2015 MacBook Air and the dual-Xeon workstation in terms of energy efficiency. To render the same image, the Raspberry Pi 4B used between approximately 3.5x and 5.5x less energy overall than the dual-Xeon workstation, and used between approximately 2.3x and 3.8x less energy than the 2015 MacBook Air. It’s also worth noting that the 2015 MacBook Air cost $899 when it first launched (and the processor had a recommended price from Intel of $315), and the dual-Xeon workstation cost… I don’t actually know. I bought the dual-Xeon workstation used for a pittance when my employer retired it, so I don’t know how much it actually cost new. But, I do know that the processors in the dual-Xeon had a recommended price from Intel of $1723… each, for a total of $3446 when they were new. In comparison, the Raspberry Pi 4B with 4 GB of RAM costs about $55 for the entire computer, and the processor cost… well, the actual price for most ARM processors is not ever publicly disclosed, but since a baseline Raspberry Pi 4B costs only $35, the processor can’t have cost more than a few dollars at most, possibly even under a dollar.

I think the main takeaway from these performance comparisons is that even back with 2015 technology, even though most arm64 processors were slower in absolute terms compared to their x86-64 counterparts, single-threaded performance was already shockingly close, and arm64 energy usage per unit of compute and price were already leaving x86-64 in the dust. Fast forward to the present day in 2021, where we have seen Apple’s arm64-based M1 chip take the absolute performance crown in its category from all x86-64 competitors, at both a lower energy utilization level and a lower price. The even wilder thing is: the M1 is likely the slowest desktop arm64 chip that Apple will ever ship, and arm64 processors from NVIDIA and Samsung and Qualcomm and Broadcom won’t be far behind in the consumer space, while Amazon and Ampere and other companies are also introducing enormous, extremely powerful arm64 chips in the high-end server space. Intel and (especially) AMD aren’t sitting still in the x86-64 space either, though. The next few years are going to be very interesting; no matter what happens, on x86-64 or on arm64, Takua Renderer is now ready to be there!

Conclusion to Part 1

Through the process of porting to arm64 on Linux, I learned a lot about the arm64 architecture and how it differs from x86-64, and I also picked up a couple of good reminders about topics like memory ordering and how floating point works. Originally I thought that my post on porting Takua to arm64 would be nice, short, and fast to write, but instead here we are some 17,000 words later and I have not even gotten to porting Takua to arm64 on macOS and Apple Silicon yet! So, I think we will stop here for now and save the rest for an upcoming Part 2. In Part 2, I’ll write about the process of porting to arm64 on macOS and Apple Silicon, about how to create Universal Binaries, and I’ll examine Apple’s Rosetta 2 system for running x86-64 binaries on arm64. Also, in Part 2 we’ll examine how Embree works on arm64 and compare arm64’s NEON vector extensions with x86-64’s SSE vector extensions, and we’ll finish with some additional miscellaneous differences between x86-64 and arm64 that need to be considered when writing C++ code for both architectures.

Acknowledgements

Thanks so much to Mark Lee and Wei-Feng Wayne Huang for puzzling through some of the std::compare_exchange_weak() stuff with me. Thanks a ton to Josh Filstrup for proofreading and giving feedback and suggestions on this post pre-release! Josh was the one who told me about the Herbie tool mentioned in the floating point section, and he made an interesting suggestion about using e-graph analysis to better understand floating point behavior. Also Josh pointed out SuperH as an example of a variable width RISC architecture, which of course he would because he knows all there is to know about the Sega Dreamcast. Finally, thanks to my wife, Harmony Li, for being patient with me while I wrote up this monster of a blog post and for also puzzling through some of the technical details with me.

References

Pontus Andersson, Jim Nilsson, Tomas Akenine-Möller, Magnus Oskarsson, Kalle Åström, and Mark D. Fairchild. 2020. FLIP: A Difference Evaluator for Alternating Images. Proceedings of the ACM on Computer Graphics and Interactive Techniques. 3, 2 (2020), 15:1-15:23.

ARM Holdings. 2016. Cortex-A57 Software Optimization Guide. Retrieved May 12, 2021.

ARM Holdings. 2021. Arm Architecture Reference Manual Armv8, for Armv8-A Architecture Profile, Version G.a. Retrieved May 14, 2021.

ARM Holdings. 2021. Arm Architecture Reference Manual Supplement ARMv8.1, for ARMv8-A Architecture Profile, Version: A.b. Retrieved May 14, 2021.

Brandon Castellano. 2015. SuperUser Answer to “Do ARM Processors like Cortex-A9 Use Microcode?”. Retrieved May 12, 2021.

Jim Cownie. 2021. Atomics in AArch64. In CPU Fun. Retrieved May 14, 2021.

CppReference. 2021. std::atomic<T>::compare_exchange_weak. Retrieved April 02, 2021.

CppReference. 2021. std::memory_order. Retrieved March 20, 2021.

Intel Corporation. 2021. Intel 64 and IA-32 Architectures Software Developer’s Manual. Retrieved April 02, 2021.

Bruce Dawson. 2020. ARM and Lock-Free Programming. In Random ASCII. Retrieved April 15, 2021.

Glenn Fiedler. 2008. Floating Point Determinism. In Gaffer on Games. Retrieved April 20, 2021.

Martin Geupel. 2018. Bucket and Progressive Rendering. In CG Basics. Retrieved May 12, 2021.

David Goldberg. 1991. What Every Computer Scientist Should Know About Floating-Point Arithmetic. ACM Computing Surveys. 23, 1 (1991), 5-48.

Phillip Johnston. 2020. Demystifying ARM Floating Point Compiler Options. In Embedded Artistry. Retrieved April 20, 2021.

Yossi Kreinin. 2008. Consistency: How to Defeat the Purpose of IEEE Floating Point. In Proper Fixation. Retrieved April 20, 2021.

Günter Obiltschnig. 2006. Cross-Platform Issues with Floating-Point Arithmetics in C++. In ACCU Conference 2006.

David A. Patterson and David R. Ditzel. 1980. The Case for the Reduced Instruction Set Computer. ACM SIGARCH Computer Architecture News. 8, 6 (1980), 25-33.

Jeff Preshing. 2012. Memory Reordering Caught in the Act. In Preshing on Programming. Retrieved March 20, 2021.

Jeff Preshing. 2012. An Introduction to Lock-Free Programming. In Preshing on Programming. Retrieved March 20, 2021.

Jeff Preshing. 2012. Memory Ordering at Compile Time. In Preshing on Programming. Retrieved March 20, 2021.

Jeff Preshing. 2012. Memory Barriers Are Like Source Control Operations. In Preshing on Programming. Retrieved March 20, 2021.

Jeff Preshing. 2012. Acquire and Release Semantics. In Preshing on Programming. Retrieved March 20, 2021.

Jeff Preshing. 2012. Weak vs. Strong Memory Models. In Preshing on Programming. Retrieved March 20, 2021.

Jeff Preshing. 2012. This Is Why They Call It a Weakly-Ordered CPU. In Preshing on Programming. Retrieved March 20, 2021.

The Rust Team. 2021. Atomics. In The Rustonomicon. Retrieved March 20, 2021.

Michael E. Thomadakis. 2011. The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms. JFE Technical Report. Texas A&M University.

Kyrylo Tkachov. 2020. Making the Most of the Arm Architecture with GCC 10. In ARM Tools, Software, and IDEs Blog. Retrieved May 14, 2021.

Vincent M. Weaver and Sally A. McKee. 2009. Code Density Concerns for New Architectures. In 2009 IEEE International Conference on Computer Design. 459-464.

WikiBooks. 2021. Microprocessor Design: Instruction Decoder. Retrieved May 12, 2021.

Wikipedia. 2021. Complex Instruction Set Computer. Retrieved April 05, 2021.

Wikipedia. 2021. CPU Cache. Retrieved March 20, 2021.

Wikipedia. 2021. Extended Precision. Retrieved April 20, 2021.

Wikipedia. 2021. Hardwired Control Unit. Retrieved May 12, 2021.

Wikipedia. 2021. IEEE 754. Retrieved April 20, 2021.

Wikipedia. 2021. Intel 8087. Retrieved April 20, 2021.

Wikipedia. 2021. Micro-Code. Retrieved May 12, 2021.

Wikipedia. 2021. Micro-Operation. Retrieved May 10, 2021.

Wikipedia. 2021. Reduced Instruction Set Computer. Retrieved April 05, 2021.

Wikipedia. 2021. SuperH. Retrieved June 02, 2021.

New Responsive Layout and Blog Plans

I recently noticed that my blog and personal website’s layout looked really bad on mobile devices and in smaller browser windows. When I originally created the current layout for this blog and for my personal website back in 2013, I didn’t really design the layout with mobile in mind whatsoever. Back in 2013, responsive web design had only just started to take off, and being focused entirely on renderer development and computer graphics, I wasn’t paying much attention to the web design world at all! I then proceeded to not notice how bad the layout was on mobile and in small windows because… well, I don’t really visit my own website and blog very much, because why would I? I know everything that’s on them already!

Well, I finally visited my site on my iPhone, and immediately noticed how terrible the layout looked. On an iPhone, the layout was just the full desktop browser layout shrunk down to an unreadable size! So, last week, I spent two evenings extending the current layout to incorporate responsive web design principles. Responsive web design principles call for a site’s layout to adjust itself according to the device and window size such that the site renders in a way that is maximally readable in a variety of different viewing contexts. Generally this means that content and images and stuff should resize so that they’re always at a readable size, and elements on the page should be on a fluid grid that can reflow instead of being located at fixed positions.

Here is how the layout used by my blog and personal site used to look on an iPhone 11 display, compared with how the layout looks now with modern responsive web design principles implemented:

Figure 1: Old layout (left) vs. new responsive layout (right) in Safari on an iPhone 11.

So why did I bother with implementing these improvements to my blog and personal site now, some eight years after I first deployed the current layout and current version of the blog? To answer this (self-asked) question, I want to first write a bit about how the purpose of this blog has evolved over the years. I originally started this blog back when I first started college, and it originally didn’t have any clear purpose. If anything, starting a blog really was just an excuse to rewrite and expand a custom content management system that I had written in PHP 5 back in high school. Sometime in late 2010, as I got more interested in computer graphics, this blog became something of a personal journal to document my progress in exploring computer graphics. Around this time I also decided that I wanted to focus all of my attention on computer graphics, so I dropped most of the web-related projects I had at the time and moved this blog from my own custom CMS to Blogger. In grad school, I started to experiment with writing longer-form posts; for the first time for this blog, these posts were written primarily with a reader other than my future self in mind. In other words, this is the point where I actually started to write posts intended for an external audience. At this point I also moved the blog from Blogger to running on Jekyll hosted through Github Pages, and that’s when the first iterations of the current layout were put into place.

Fast forward to today; I’ve now been working at Disney Animation for six years, and (to my constant surprise) this blog has picked up a small but steady readership in the computer graphics field! The purpose I see for this blog now is to provide high quality, in-depth writeups of whatever projects I find interesting, with the hope that 1. my friends and colleagues and other folks in the field will find the posts similarly interesting and 2. that the posts I write can be informative and inspiring for aspirational students that might stumble upon this blog. When I was a student, I drew a lot of inspiration from reading a lot of really cool computer graphics and programming blogs, and I want to be able to give back the same to future students! Similarly, my personal site, which uses an extended version of the blog’s layout, now serves primarily as a place to collect and showcase projects that I’ve worked on with an eye towards hopefully inspiring other people, as opposed to serving as a tool to get recruited.

The rate that I post at now is much slower than when I was in school, but the reason for this slowdown is that I put far more thought and effort into each post now; while new posts appear less frequently, I like to think that I’ve vastly improved both the quality and quantity of content within each post. I recently ran wc -w on the blog’s archives, which yielded some interesting numbers. From 2014 to now, I’ve only written 38 posts, but these 38 posts total a bit over 96,000 words (which averages to roughly 2,500 words per post). Contrast that with 2010 through the end of 2013, when I wrote 78 posts that together total only about 28,000 words (which averages to roughly 360 words per post)! Those early posts came frequently, but a lot of those early posts are basically garbage; I only leave them there so that new students can see that my stuff wasn’t very good when I started either.

When I put the current layout into place eight years ago, I wanted the layout to have as little clutter as possible and focus on presenting a clear, optimized reading experience. I wanted computer graphics enthusiasts that come to read this blog to be able to focus on the content and imagery with as little distraction from the site’s layout as possible, and that meant keeping the layout as simple and minimal as possible while still looking good. Since the main topic this blog focuses on is computer graphics, and obviously computer graphics is all about pictures and the code that generates those pictures (hence the name of the blog being “Code & Visuals”), I wanted the layout to allow for large, full-width images. The focus on large full-width images is why the blog is single-column with no sidebars of any sort; in many ways, the layout is actually more about the images than the text, hence why text never wraps around an image either. Over the years I have also added additional capabilities to the layout in support of computer graphics content, such as MathJax integration so that I can embed beautiful LaTeX math equations, and an embedded sliding image comparison tool so that I can show before/after images with a wiping interface.

So with all of the above in mind, the reason for finally making the layout responsive is simple: I want the blog to be as clear and as readable as I can reasonably make it, and that means clear and readable on any device, not just in a desktop browser with a large window! I think a lot of modern “minimal” designs tend to use too much whitespace and sacrifice information and text density; a key driving principle behind my layout is to maintain a clean and simple look while still maintaining a reasonable level of information and text density. However, the old non-responsive layout’s density in smaller viewports was just ridiculous; nothing could be read without zooming in a lot, which on phones then meant a lot of swiping both up/down and left/right just to read a single sentence. For the new responsive improvements, I wanted to make everything readable in small viewports without any zooming or swiping left/right. I think the new responsive version of the layout largely accomplishes this goal; here’s an animation of how the layout resizes as the content window shrinks, as applied to the landing page of my personal site:

Figure 2: Animation of how the new layout changes as the window changes size.

Adapting my layout to be responsive was surprisingly easy and straightforward! My blog and personal site use the same layout design, but the actual implementations are a bit different. The blog’s layout is a highly modified version of an old layout called N-Coded, which in turn is an homage to what Ghost’s default Casper layout looked like back in 2014 (Casper looks completely different today). Since the blog’s layout inherited some bits of responsive functionality from the layout that I forked from, getting most things working just required updating, fixing, and activating some already existing but inactive parts of the CSS. My personal site, on the other hand, reimplements the same layout using completely hand-written CSS instead of using the same CSS as the blog; the reason for this difference is because my personal site extends the design language of the layout for a number of more customized pages such as project pages, publication pages, and more. Getting my personal site’s layout updated with responsive functionality required writing more new CSS from scratch.

I used to be fairly well versed in web stuff back in high school, but obviously the web world has moved on considerably since then. I’ve forgotten most of what I knew back then anyway since it’s been well over a decade, so I kind of had to relearn a lot of things. However, I guess a lot of things in programming are similar to riding a bicycle- once you learn, you never fully forget! Relearning what I had forgotten was pretty easy, and I quickly figured out that the only really new thing I needed to understand for implementing responsive stuff was the CSS @media rule, which was introduced in 2009 but only gained full support across all major browsers in 2012. For those totally unfamiliar with web stuff: the @media rule allows for checking things like the width and height and resolution of the current viewport and allows for specifying CSS rule overrides per media query. Obviously this capability is super useful for responsive layouts; implementing responsive layouts really boils down to just making sure that positions are specified as percentages or relative positions instead of fixed positions and then using @media rules to make larger adjustments to the layout as the viewport size reaches different thresholds. For example, I use @media rules to determine when to reorganize from a two-column layout into a stacked single-column layout, and I also use @media rules to determine when to adjust font sizes and margins and stuff. The other important part of implementing a responsive layout is to instruct the browser to set the width of the page to follow the screen-width of the viewing device on mobile. The easiest way to implement this requirement by far is to just insert the following into every page’s HTML headers:

<meta name="viewport" content="width=device-width">
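To make the @media part above a bit more concrete, here’s a small illustrative sketch of the two-column-to-single-column case; this isn’t the actual CSS from my layout, and the .post-column class name and 600px breakpoint are just made-up example values:

/* Default layout: two fluid columns, sized as percentages so they scale with the viewport. */
.post-column {
    float: left;
    width: 48%;
    margin-right: 2%;
}

/* Once the viewport is narrower than the example breakpoint, stack the columns into a single full-width column. */
@media (max-width: 600px) {
    .post-column {
        float: none;
        width: 100%;
        margin-right: 0;
    }
}

Everything responsive in the layout is conceptually just variations on this pattern: fluid percentage-based sizing by default, with @media overrides kicking in at a few chosen viewport widths.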

For the most part, the new responsive layout actually doesn’t really noticeably change how my blog and personal site look on full desktop browsers and in large windows much, aside from some minor cleanups to spacing and stuff. However, there is one big noticeable change: I got rid of the shrinking pinned functionality for the navbar. Previously, as a user scrolled down, the header for my blog and personal site would shrink and gradually transform into a more compact version that would then stay pinned to the top of the browser window:

Figure 3: Animation of how the old shrinking, pinned navbar worked.

The shrinking pinned navbar functionality was implemented by using a small piece of JavaScript to read how far down the user had scrolled and dynamically adjusting the CSS for the navbar accordingly. This feature was actually one of my favorite things that I implemented for my blog and site layout! However, I decided to get rid of it because on mobile, space in the layout is already at a premium, and taking up space that otherwise could be used for content with a pinned navbar just to have my name always at the top of the browser window felt wasteful. I thought about changing the navbar so that as the user scrolled down, the nav links would turn into a hidden menu accessible through a hamburger button, but I personally don’t actually really like the additional level of indirection and complexity that hamburger buttons add. So, the navbar is now just fixed and scrolls just like a normal element of each page:

Figure 4: Animation of how the new fixed navbar works.

I think a fixed navbar is fine for now; I figure that if someone is already reading a post on my blog or something on my personal site, they’ll already know where they are and don’t need a big pinned banner with my name on it to remind them of where they are. However, if I start to find that scrolling up to reach nav links is getting annoying, I guess I’ll put some more thought into if I can come up with a design that I like for a smaller pinned navbar that doesn’t take up too much space in smaller viewports.

While I was in the code, I also made a few other small improvements to both the blog and my personal site. On the blog, I made a small improvement to embedded code snippets: they now include line numbers on the side! The line numbers are added by a small bit of JavaScript and rendered entirely through CSS, so they don’t interfere with selecting and copying text out of the embedded code snippets. On my personal site, removing the shrinking/pinning aspect of the navbar actually allowed me to completely remove almost all JavaScript includes on the site, aside from some analytics code. On the blog, JavaScript is still present for some small things like the code line numbers, some caption features, MathJax, and analytics, but otherwise is at a bare minimum.

Over time I’d like to pare back what includes my layout uses even further to help improve load times even more. One of the big motivators for moving my blog from Blogger to Jekyll was simply for page loading speed; under the hood Blogger is a big fancy dynamic CMS, whereas Jekyll just serves up static pages that are pre-generated once from Markdown files. A few years ago, I similarly moved my personal site from using a simple dynamic templating engine I had written in PHP to instead be entirely 100% static; I now just write each page on my personal site directly as simple HTML and serve everything statically as well. As a result, my personal site loads extremely fast! My current layout definitely still has room for optimization though; currently, I use fonts from TypeKit because I like nice typography and having nice fonts like Futura and Proxima Nova is a big part of the overall “look” of the layout. Fonts can add a lot of weight if not optimized carefully though, so maybe down the line I’ll need to streamline how fonts work in my layout. Also, since the blog has a ton of images, I think updating the blog to use native browser lazy loading of images through the loading="lazy" attribute on img tags should help a lot with load speeds, but not all major browsers support this attribute yet. Some day I’d like to get my site down to something as minimal and lightweight as Tom MacWright’s blog, but still, for now I think things are in decent shape.

If for some reason you’re curious to see how all of the improvements mentioned in this post are implemented, the source code for both my blog and my personal site is available on my Github. Please feel free to either steal any bits of the layout that you may find useful, or if you want, feel free to even fork the entire layout to use as a basis for your own site. Although, if you do fork the entire layout, I would suggest and really prefer that you put some effort into personalizing the layout and really making it your own instead of just using it exactly how I have it!

Hopefully this is the last time for a very long while that I’ll write a blog post about the blog itself; I’m an excruciatingly slow writer these days, but I currently have the largest number of posts simultaneously near completion that I’ve had in a long time, and I’ll be posting them soon. As early as later this week I’ll be posting the first part of a two-part series about porting Takua Renderer to 64-bit ARM; get ready for a deep dive into some fun concurrency and atomics-related problems at the x86-64 and arm64 assembly level in that post. The second part of this series should come soon too, and over the summer I’m also hoping to finish posts about hex-tiling in Takua and about implementing/using different light visibility modes. Stay-at-home during the pandemic has also given me time to slowly chip away at the long-delayed second and third parts of what was supposed to be a series on mipmapped tiled texture caching, so with some luck maybe those posts will finally appear this year too. Beyond that, I’ve started some very initial steps on new next-generation from-the-ground-up reimplementations of Takua in CUDA/Optix and in Metal, and I’ve started to dip my toes into Rust as well, so who knows, maybe I’ll have stuff to write about that too in the future!

RenderMan Art Challenge: Magic Shop

Last fall, I participated in my third Pixar RenderMan Art Challenge, “Magic Shop”! I wasn’t initially planning on participating this time around due to not having as much free time on my hands, but after taking a look at the provided assets for this challenge, I figured that it looked fun and that I could learn some new things, so why not? Admittedly participating in this challenge is why some technical content I had planned for this blog in the fall wound up being delayed, but in exchange, here’s another writeup of some fun CG art things I learned along the way! This RenderMan Art Challenge followed the same format as usual: Pixar supplied some base models without any UVs, texturing, shading, lighting, etc., and participants had to start with the supplied base models and come up with a single final image. Unlike in previous challenges though, this time around Pixar also provided a rigged character in the form of the popular open-source Mathilda Rig, to be incorporated into the final entry somehow. Although my day job involves rendering characters all of the time, I have really limited experience working with characters in my personal projects, so I got to try some new stuff! Considering that my time spent on this project was far more limited than on previous RenderMan Art Challenges, and considering that I didn’t really know what I was doing with the character aspect, I’m pretty happy that my final entry won third place in the contest!

Figure 1: My entry to Pixar's RenderMan Magic Shop Art Challenge, titled "Books are Magic". Click for 4K version. Mathilda model by Xiong Lin and rig by Leon Sooi. Pixar models by Eman Abdul-Razzaq, Grace Chang, Ethan Crossno, Siobhán Ensley, Derrick Forkel, Felege Gebru, Damian Kwiatkowski, Jeremy Paton, Leif Pedersen, Kylie Wijsmuller, and Miguel Zozaya © Disney / Pixar - RenderMan "Magic Shop" Art Challenge.

Character Explorations

I originally wasn’t planning on entering this challenge, but I downloaded the base assets anyway because I was curious about playing with the rigged character a bit. I discovered really quickly that the Mathilda rig is reasonably flexible, but the flexibility meant that the rig can go off model really fast, and also the face can get really creepy really fast. I think part of the problem is just the overall character design; the rig is based on a young Natalie Portman’s character from the movie Léon: The Professional, and the character in that movie is… something of an unusual character, to say the least. The model itself has a head that’s proportionally a bit on the large side, and the mouth is especially large, which is part of why the facial rig gets so creepy so fast. One of the first things I discovered was that I had to scale down the rig’s mouth and teeth a bit just to bring things back into more normal proportions.

After playing with the rig for a few evenings, I started thinking about what I should make if I did enter the challenge after all. I’ve gotten a lot busier recently with personal life stuff, so I knew I wasn’t going to have as much time to spend on this challenge, which meant I needed to come up with a relatively straightforward simple concept and carefully choose what aspects of the challenge I was going to focus on. I figured that most of the other entries into the challenge were going to use the provided character in more or less its default configuration and look, so I decided that I’d try to take the rig further away from its default look and instead use the rig as a basis for a bit of a different character. The major changes I wanted to make to take the rig away from its default look were to add glasses, completely redo the hair, simplify the outfit, and shade the outfit completely differently from its default appearance.

With this plan in mind, the first problem I tackled was creating a completely new hairstyle for the character. The last time I did anything with making CG hair was about a decade ago, and I did a terrible job back then, so I wanted to figure out how to make passable CG hair first because I saw the hair as basically a make-or-break problem for this entire project. To make the hair in this project, I chose to use Maya’s XGen plugin, which is a generator for arbitrary primitives, including but not limited to curves for things like hair and fur. I chose to use XGen in part because it’s built into Maya, and also because I already have some familiarity with XGen thanks to my day job at Disney Animation. XGen was originally developed at Disney Animation [Thompson et al. 2003] and is used extensively on Disney Animation feature films; Autodesk licensed XGen from Disney Animation and incorporated XGen into Maya’s standard feature set in 2011. XGen’s origins as a Disney Animation technology explain why XGen’s authoring workflow uses Ptex [Burley and Lacewell 2008] for maps and SeExpr [Walt Disney Animation Studios 2011] for expressions. Of course, since 2011, the internal Disney Animation version of XGen has developed along its own path and gained capabilities and features [Palmer and Litaker 2016] beyond Autodesk’s version of XGen, but the basics are still similar enough that I figured I wouldn’t have too difficult of a time adapting.

I found a great intro to XGen course from Jesus FC, which got me up and running with guides/splines XGen workflow. I eventually found that the workflow that worked best for me was to actually model sheets of hair using just regular polygonal modeling tools, and then use the modeled polygonal sheets as a base surface to help place guide curves on to drive the XGen splines. After a ton of trial and error and several restarts from scratch, I finally got to something that… admittedly still was not very good, but at least was workable as a starting point. One of the biggest challenges I kept running into was making sure that different “planes” of hair didn’t intersect each other, which produces grooms that look okay at first glance but then immediately look unnatural after anything more than just a moment. Here are some early drafts of the custom hair groom:

Figure 2: Early iteration of a custom hair groom for the character, with placeholder glasses.

Figure 3: Another early iteration of a custom hair groom for the character, with pose test and with placeholder glasses.

To shade the hair, I used RenderMan’s PxrMarschnerHair shader, driven using RenderMan’s PxrHairColor node. PxrHairColor implements d’Eon et al. [2011], which allows for realistic hair colors by modeling melanin concentrations in hair fibers, and PxrMarschnerHair [Hery and Ling 2017] implements a version of the classic Marschner et al. [2003] hair model improved using adaptive importance sampling [Pekelis et al. 2015]. In order to really make hair look good, some amount of randomization and color variation between different strands is necessary; PxrHairColor supports randomization and separately coloring stray flyaway hairs based on primvars. In order to use the randomization features, I had to remember to check off the “id” and “stray” boxes under the “Primitive Shader Parameters” section of XGen’s Preview/Output tab. Overall I found the PxrHairColor/PxrMarschnerHair system a little bit difficult to use; the mapping from a selected melanin color to a final rendered look isn’t exactly 1-to-1 and requires some getting used to. This difference between authored hair color and final rendered hair color happens because the authored hair color is the color of a single hair strand, whereas the final rendered hair color is the result of multiple scattering between many hair strands combined with azimuthal roughness. Fortunately, hair shading should get easier in future versions of RenderMan, which are supposed to ship with an implementation of Disney Animation’s artist-friendly hair model [Chiang et al. 2016]. The Chiang model uses a color re-parameterization that allows for the final rendered hair color to closely match the desired authored color by remapping the authored color to account for multiple scattering and azimuthal roughness; this hair model is what we use in Disney’s Hyperion Renderer of course, and is also implemented in Redshift and is the basis of VRay’s modern VRayHairNextMtl shader.

Figure 4: More progressed iteration of a custom hair groom for the character, with final glasses.

Skin Shading and Subsurface Scattering

For shading the character’s skin, the approach I took was to use the rig’s default textures as a starting point, modify heavily to get the textures that I actually wanted, and then use the modified textures to author new materials using PxrSurface. The largest changes I made to the supplied skin textures are in the maps for subsurface; I basically had to redo everything to provide better inputs to subsurface color and mean free path to get the look that I wanted, since I used PxrSurface’s subsurface scattering set to exponential path-traced mode. I generally like the controllability and predictability that path-traced SSS brings, but RenderMan 23’s PxrSurface implementation includes a whole bunch of different subsurface scattering modes, and the reason for this is interesting and worth briefly discussing.

Subsurface scattering models how light penetrates the surface of a translucent object, bounces around and scatters inside of the object, and exits at a different surface point from where it entered; this effect is exhibited by almost all organic and non-conductive materials to some degree. However, subsurface scattering has existed in renderers for a long time; strong subsurface scattering support was actually a standout feature for RenderMan as early as 2002/2003ish [Hery 2003], when RenderMan was still a REYES rasterization renderer. Instead of relying on brute-force path tracing, earlier subsurface scattering implementations relied on diffusion approximations, which approximate the effect of light scattering around inside of an object by modeling the aggregate behavior of scattered light over a simplified surface. One popular way of implementing diffusion is through dipole diffusion [Jensen et al. 2001, d’Eon 2012, Hery 2012] and another popular technique is through the normalized diffusion model [Burley 2015, Christensen and Burley 2015] that was originally developed at Disney Animation for Hyperion. These models are implemented in RenderMan 23’s PxrSurface as the “Jensen and d’Eon Dipoles” subsurface model and the “Burley Normalized” subsurface model, respectively.

Diffusion models were the state-of-the-art for a long time, but diffusion models require a number of simplifying assumptions to work; one of the fundamental simplifications universal to all diffusion models is an assumption that subsurface scattering is taking place on a semi-infinite slab of material. Thin geometry breaks this fundamental assumption, and as a result, diffusion-based subsurface scattering tends to lose more energy than it should in thin geometry. This energy loss means that thin parts of geometry rendered with diffusion models tend to look darker than one would expect in reality. Along with other drawbacks, this thin geometry energy loss in diffusion models is one of the major reasons why most renderers have moved to brute-force path-traced subsurface scattering in the past half decade, and avoiding the artifacts from diffusion is exactly what the controllability and predictability that I mentioned earlier refer to. Subsurface scattering is most accurately simulated by brute-force path tracing within a translucent object, but up until roughly 5 or 6 years ago, brute-force path-traced subsurface scattering wasn’t really practical for production for two major reasons: first, computational cost, and second, the lack of an intuitive, artist-friendly parameterization for apparent color and scattering distance. Much like how the final color of a hair model is really the result of the color of individual hair fibers and the aggregate multiple scattering behavior between many hair strands, the final color result of subsurface scattering arises from a complex interaction between single-scattering albedo, mean free path, and numerous multiple scattering events. So, much like how an artist-friendly, controllable hair model requires being able to invert an artist-specified final apparent color to produce internally-used scattering albedos (this process is called albedo inversion), subsurface scattering similarly requires an albedo inversion step to allow for artist-friendly controllable parameterizations. The process of albedo inversion for diffusion models is relatively straightforward and can be computed using nice closed-form analytical solutions, but the same is not true for path-traced subsurface scattering. A key breakthrough in making path-traced subsurface scattering practical was the development of a usable data-fitted albedo inversion technique [Chiang et al. 2016] that allows path-traced subsurface scattering and diffusion subsurface scattering to use the same parameterization and controls. This technique was first developed at Disney Animation for Hyperion, and this technique was modified by Wrenninge et al. [2017] and combined with additional support for anisotropic scattering and non-exponential free flight to produce the “Multiple Mean Free Paths” and “path-traced” subsurface models in RenderMan 23’s PxrSurface.

In my initial standalone lookdev test setup, something that took a while was dialing the subsurface back from looking too gummy while at the same time trying to preserve something of a glow-y look, since the final scene I had in mind would be very glow-y. From both personal and production experience, I’ve found that one of the biggest challenges in moving from diffusion or point-based subsurface scattering solutions to brute-force path-traced subsurface scattering is often having to readjust mean free paths to prevent characters from looking too gummy, especially in areas where the geometry gets relatively thin, because of the aforementioned thin geometry problem that diffusion models suffer from. In order to compensate for energy loss and produce a more plausible result, parameters and texture maps for diffusion-based subsurface scattering are often tuned to overcompensate for energy loss in thin areas. However, applying these same parameters to an accurate brute-force path tracing model that already handles subsurface scattering in thin areas correctly results in overly bright thin areas, hence the gummier look. Since I started with the supplied skin textures for the character model, and the original skin shader for the character model was authored for a different renderer that used diffusion-based subsurface scattering, the adjustments I had to make were specifically to fight this overly glow-y gummy look in path-traced mode when using parameters authored for diffusion.

Clothes and Fuzz

For the character’s clothes and shoes, I wanted to keep the outfit geometry to save time, but I also wanted to completely re-texture and re-shade the outfit to give it my own look. I had a lot of trouble posing the character without getting lots of geometry interpenetration in the provided jacket, so I decided to just get rid of the jacket entirely. For the shirt, I picked a sort of plaid flannel-y look for no other reason than I like plaid flannel. The character’s shorts come with this sort of crazy striped pattern, which I opted to replace with a much more simplified denim shorts look. I used Substance Painter for texturing the clothes; Substance Painter comes with a number of good base fabric materials that I heavily modified to get to the fabrics that I wanted. I also wound up redoing the UVs for the clothing completely; my idea was to lay out the UVs similar to how the sewing patterns for each piece of clothing might work if they were made in reality; doing the UVs this way allowed for quickly getting the textures to meet up and align properly as if the clothes were actually sewn together from fabric panels. A nice added bonus is that Substance Painter’s smart masks and smart materials often use UV seams as hints for effects like wear and darkening, and all of that basically just worked out of the box perfectly with sewing pattern styled UVs.

Bringing everything back into RenderMan though, I didn’t feel that the flannel shirt looked convincingly soft and fuzzy and warm. I tried using PxrSurface’s fuzz parameter to get more of that fuzzy look, but the results still didn’t really hold up. The reason the flannel wasn’t looking right ultimately has to do with what the fuzz lobe in PxrSurface is meant to do, and where the fuzzy look in real flannel fabric comes from. PxrSurface’s fuzz lobe can only really approximate the look of fuzzy surfaces from a distance, where the fuzz fibers are small enough relative to the viewing position that they can essentially be captured as an aggregate microfacet effect. Even specialized cloth BSDFs really only hold up at a relatively far distance from the camera, since they all attempt to capture cloth’s appearance as an aggregated microfacet effect; an enormous body of research exists on this topic [Schröder et al. 2011, Zhao et al. 2012, Zhao et al. 2016, Aliaga et al. 2017, Deshmukh et al. 2017, Montazeri et al. 2020]. However, up close, the fuzzy look in real fabric isn’t really a microfacet effect at all- the fuzzy look really arises from multiple scattering happening between individual flyaway fuzz fibers on the surface of the fabric; while these fuzz fibers are very small to the naked eye, they are still a macro-scale effect when compared to microfacets. The way feature animation studios such as Disney Animation and Pixar have made fuzzy fabric look really convincing over the past half decade is to… just actually cover fuzzy fabric geometry with actual fuzz fiber geometry [Crow et al. 2018]. In the past few years, Disney Animation and Pixar and others have actually gone even further. On Frozen 2, embroidery details and lace and such were built out of actual curves instead of displacement on surfaces [Liu et al. 2020]. On Brave, some of the clothing made from very coarse fibers was rendered entirely as ray-marched woven curves instead of as subdivision surfaces and shaded using a specialized volumetric scheme [Child 2012], and on Soul, many of the hero character outfits (including ones made of finer woven fabrics) are similarly rendered as brute-force path-traced curves instead of as subdivision surfaces [Hoffman et al. 2020]. Animal Logic similarly renders hero cloth as actual woven curves [Smith 2018], and I wouldn’t be surprised if most VFX shops use a similar technique now.

Anyhow, in the end I decided to just bite the bullet in terms of memory and render speed and cover the flannel shirt in bazillions of tiny little actual fuzz fibers, instanced and groomed using XGen. The fuzz fibers are shaded using PxrMarschnerHair and colored to match the fabric surface beneath. I didn’t actually go as crazy as replacing the entire cloth surface mesh with woven curves; I didn’t have nearly enough time to write all of the custom software that would require, but fuzzy curves on top of the cloth surface mesh is a more-than-good-enough solution for the distance that I was going to have the camera at from the character. The end result instantly looked vastly better, as seen in this comparison of before and after adding fuzz fibers:

Figure 5: Shirt before (left) and after (right) XGen fuzz. For a full screen comparison, click here.

Putting fuzz geometry on the shirt actually worked well enough that I proceeded to do the same for the character’s shorts and socks as well. For the socks especially having actual fuzz geometry really helped sell the overall look. I also added fine peach fuzz geometry to the character’s skin as well, which may sound a bit extreme, but has actually been standard practice in the feature animation world for several years now; Disney Animation began adding fine peach fuzz on all characters on Moana [Burley et al. 2017], and Pixar started doing so on Coco. Adding peach fuzz to character skin ends up being really useful for capturing effects like rim lighting without the need for dedicated lights or weird shader hacks to get that distinct bright rim look; the rim lighting effect instead comes entirely from multiple scattering through the peach fuzz curves. Since I wanted my character to be strongly backlit in my final scene, I knew that having good rim lighting was going to be super important, and using actual peach fuzz geometry meant that it all just worked! Here is a comparison of my final character texturing/shading/look, backlit without and with all of the geometric fuzz. The lighting setup is exactly the same between the two renders; the only difference is the presence of fuzz causing the rim effect. This effect doesn’t happen when using only the fuzz lobe of PxrSurface!

Figure 6: Character backlit without and with fuzz. The rim lighting effect is created entirely by backlighting scattering through XGen fuzz on the character and the outfit. For a full screen comparison, click here. Click here and here to see the full 4K renders by themselves.

I used SeExpr expressions instead of using XGen’s guides/splines workflow to control all of the fuzz; the reason for using expressions was because I only needed some basic noise and overall orientation controls for the fuzz instead of detailed specific grooming. Of course, adding geometric fuzz to all of a character’s skin and clothing does increase memory usage and render times, but not by as much as one might expect! According to RenderMan’s stats collection system, adding geometric fuzz increased overall memory usage for the character by about 20%, and for the renders in Figure 8, adding geometric fuzz increased render time by about 11%. Without the geometric fuzz, there are 40159 curves on the character, and with geometric fuzz the curve count increases to 1680364. Even though there was a 41x increase in the number of curves, the total render time didn’t really increase by too much, thanks to logarithmic scaling of ray tracing with respect to input complexity. In a rasterizer, adding 41x more geometry would slow the render down to a crawl due to the linear scaling of rasterization, but ray tracing makes crazy things like actual geometric fuzz not just possible, but downright practical. Of course all of this can be made to work in a rasterizer with sufficiently clever culling and LOD and such, but in a ray tracer it all just works out of the box!
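As a rough back-of-the-envelope way to see why the hit is so small: if we assume that per-ray traversal cost scales roughly with the depth of the acceleration structure, which in turn scales roughly with the logarithm of the primitive count, then going from 40159 curves to 1680364 curves only takes log2 of the curve count from about 15.3 up to about 20.7, which is only around a 1.35x increase in tree depth despite the 41x increase in raw curve count. That estimate is obviously a huge simplification of how a real BVH behaves, but it lines up far better with the roughly 11% render time increase I measured than linear scaling would.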

Here are a few closeup test renders of all of the fuzz:

Figure 7: Closeup test render of the fuzz on the woolly socks, along with the character's shoes.

Figure 8: Closeup test render of fuzz on the shirt and peach fuzz on the character's skin.

Layout, Framing, and Building the Shop

After completing all of the grooming and re-shading work on the character, I finally reached a point where I felt confident enough in being able to make an okay looking character that I was willing to fully commit into entering this RenderMan Art Challenge. I got to this decision really late in the process relative to on previous challenges! Getting to this point late meant that I had actually not spent a whole lot of time thinking about the overall set yet, aside from a vague notion that I wanted backlighting and an overall bright and happy sort of setting. For whatever reason, “magic shop” and “gloomy dark place” are often associated with each other (and looking at many of the other competitors’ entries, that association definitely seemed to hold on this challenge too). I wanted to steer away from “gloomy dark place”, so I decided I instead wanted more of a sunny magic bookstore with lots of interesting props and little details to tell an overall story.

To build my magic bookstore set, I wound up remixing the provided assets fairly extensively; I completely dismantled the entire provided magic shop set and used the pieces to build a new corner set that would emphasize sunlight pouring in through windows. I initially was thinking of placing the camera up somewhere in the ceiling of the shop and showing a sort of overhead view of the entire shop, but I abandoned the overhead idea pretty quickly since I wanted to emphasize the character more (especially after putting so much work into the character). Once I decided that I wanted a more focused shot of the character with lots of bright sunny backlighting, I arrived at an overall framing and even set dressing that actually largely stayed mostly the same throughout the rest of the project, albeit with minor adjustments here and there. Almost all of the props are taken from the original provided assets, with a handful of notable exceptions: in the final scene, the table and benches, telephone, and neon sign are my own models. Figuring out where to put the character took some more experimentation; I originally had the character up front and center and sitting such that her side is facing the camera. However, having the character up front and center made her feel not particularly integrated with the rest of the scene, so I eventually placed her behind the big table and changed her pose so that she’s sitting facing the camera.

Here are some major points along the progression of my layout and set dressing explorations:

Figure 9: First layout test with set dressing and posed character.

Figure 10: Rotating the character and moving her behind the table for better integration into the overall scene.

One interesting change that I think had a huge impact on how the scene felt overall actually had nothing to do with the set dressing at all, but instead had to do with the camera itself. At some point I tried pulling the camera back further from the character and using a much narrower lens, which had the overall effect of pulling the entire frame much closer and tighter on the character and giving everything an ever-so-slightly more orthographic feel. I really liked how this lensing worked; to me it made the overall composition feel much more focused on the character. Also around this point is when I started integrating the character with completed shading and texturing and fuzz into the scene, and I was really happy to see how well the peach fuzz and clothing fuzz worked out with the backlighting:

Figure 11: Focusing on the character by using a narrower lens on a camera placed further back. Also at this point I integrated the reshaded/retextured outfit and fuzz elements in.

Once I had the overall blocking of the scene and rough set dressing done, the next step was to shade and texture everything! Since my scene is set indoors, I knew that global illumination coming off of the walls and floor and ceiling of the room itself was going to play a large role in the overall lighting and look of the final image, so I started the lookdev process with the room’s structure itself.

The first decision to tackle was whether or not to have glass in the big window thing behind the character. I didn’t really want to put glass in the window, since most of the light for the scene was coming through the window and having to sample the primary light source through glass was going to be really bad for render times. Instead, I decided that the window was going to be an interior window opening up into some kind of sunroom on the other side, so that I could get away with not putting glass in. The story I made up in my head was that the sunroom on the other side, being a sunroom, would be bright enough that I could just blow it out entirely to white in the final image. To help sell the idea, I thought it would be fun to have some ivy or vines growing through the window’s diamond-shaped sections; maybe they’re coming from a giant potted plant or something in the sunroom on the other side.

I initially tried creating the ivy vines using SpeedTree, but I haven’t really used SpeedTree too extensively before and the vines toolset was completely unfamiliar to me. Since I didn’t have a whole lot of time to work on this project overall, I wound up tabling SpeedTree on this project and instead opted to fall back on a (much) older but more familiar tool: Thomas Luft’s standalone Ivy Generator program. After several iterations to get an ivy growth pattern that I liked, I textured and shaded the vines and ivy leaves using some atlases from Quixel Megascans. The nice thing about adding in the ivy was that it helped break up how overwhelmingly bright the entire window was:

Figure 12: Scene with ivy vines integrated in to break up the giant background window. Also, at this point I had adjusted the camera lensing again to arrive at what was basically my final layout.

For the overall look of the room, I opted for a sort-of Mediterranean look inspired by the architecture of the tower that came with the scene (despite the fact that the tower isn’t actually in my image). Based on the Mediterranean idea, I wanted to make the windows out of a fired terracotta brick sort of material and, after initially experimenting with brick walls, I decided to go with stone walls. To help sell the look of a window made out of stacked fired terracotta blocks, I added a bit more unevenness to the window geometry, and I used fired orange terracotta clay flower pots as a reference for what the fired terracotta material should look like. To help break up how flat the window geometry is and to help give the blocks a more handmade look, I added unique color unevenness per block and also added a bunch of swirly and dimply patterns to the material’s displacement:

Figure 13: Lookdev test for the fired terracotta window blocks. All of the unevenness and swirly patterns are coming from roughness and displacement.

To create the stone walls, I just (heavily) modified a preexisting stone material that I got off of Substance Source; the final look relies very heavily on displacement mapping since the base geometry is basically just a flat plane. I made only the back wall a stone wall; I decided to make the side wall on the right out of plaster instead just so I wouldn’t have to figure out how to make two stone walls meet up at a corner. I also wound up completely replacing the stone floor with a parquet wood floor, since I wanted some warm bounce coming up from the floor onto the character. Each plank in the parquet wood floor is a piece of individual geometry. Putting it all together, here’s what the shading for the room structure looks like:

Figure 14: Putting the room all together. The rock walls rely entirely on displacement, while the parquet floor uses individually modeled floorboards instead of displacement.

The actual materials in my final image are not nearly as diffuse looking as everything looks in the above test render; my lookdev test setup uses relatively diffuse/soft lighting, which I guess didn’t really serve as a great predictor for how things would look in my actual scene, since the lighting in my actual scene wound up being super strongly backlit. Also, note how all of the places where different walls meet each other and where the walls meet the floor are super janky; I didn’t bother putting much effort in there since I knew that those areas were either going to be outside of the final frame or were going to be hidden behind props and furniture.

So Many Props!

With the character and room completed, all that was left to do for texturing and shading was just lots and lots of props. This part was both the easiest and most difficult part of the entire project- easy because all of the miscellaneous props were relatively straightforward to texture and shade, but difficult simply because there were a lot of props. However, the props were also the funnest part of the whole project! Thinking about how to make each prop detailed and interesting and unique was an interesting exercise, and I also had fun sneaking in a lot of little easter eggs and references to things I like here and there.

My process for texturing and shading props was a very straightforward workflow that is basically completely unchanged from the workflow I settled into on the previous Shipshape RenderMan Art Challenge: use Substance Painter for texturing, UDIM tiles for high resolution textures, and PxrSurface as the shader for everything. The only difference from previous projects was that I used a far lazier UV mapping process: almost every prop was just auto-UV’d with some minor adjustments here and there. The reason I relied on auto-UVs this time was just that I didn’t have a whole lot of time on this project and couldn’t afford to spend the time to do precise, careful, high-quality by-hand UVs for everything, but I figured that since all of the props would be relatively small in image space in the final frame, I could get away with hiding seams from crappy UVs by just exporting really high-resolution textures from Substance Painter. Yes, this approach is extremely inefficient, but it worked well enough considering how little time I had.

Since a lot of bounce lighting on the character’s face was going to have to come from the table, the first props I textured and shaded were the table and accompanying benches. I tried to make the table and bench match each other; they both use a darker wood for the support legs and have metal bits in the frame, and have a lighter wood for the top. I think I got a good amount of interesting wear and stuff on the benches on my first attempt, but getting the right amount of wear on the table’s top took a couple of iterations to get right. Again, due to how diffuse my lookdev test setup on this project was, the detail and wear in the table’s top showed up better in my final scene than in these test renders:

Figure 15: Bench with dark wood legs, metal diagonal braces, and lighter wood top.

Figure 16: Main table with chiseled dark wood legs, a metal underframe, a lighter wood top, and gold inlaid runes on the side.

To have a bit of fun and add a slight tiny hint of mystery and magic into the scene, I put some inlaid gold runes into the side of the table. The runes are a favorite scifi/fantasy quote of mine, which is an inversion of Clarke’s third law. They read: “any sufficiently rigorously defined magic is indistinguishable from technology”; this quote became something of a driving theme for the props in the scene. I wanted to give a sense that this shop is a bookshop specializing in books about magic, but the magic of this world is not arbitrary and random; instead, this world’s magic has been studied and systematized into almost another branch of science.

A lot of the props did require minor geometric modifications to make them more plausible. For example, the cardboard box was originally made entirely out of single-sided surfaces with zero thickness; I had to extrude the surfaces of the box in order to have enough thickness to seem convincing. There’s not a whole lot else interesting to write about with the cardboard box; it’s just corrugated cardboard. Although, I do have to say that I am pretty happy with how convincingly cardboard the cardboard boxes came out! Similarly, the scrolls just use a simple paper texture and, as one would expect with paper, use some diffuse transmission as well. Each of the scrolls has a unique design, which provided an opportunity for some fun personal easter eggs. Two of the scrolls have some SIGGRAPH paper abstracts translated into the same runes that the inlay on the table uses. One of the scrolls has a wireframe schematic of the wand prop that sits on the table in the final scene; my idea was that this scroll is one of the technical schematics that the character used to construct her wand. To fit with this technical schematic idea, the two sheets of paper in the background on the right wall use the same paper texture as the scrolls and similarly have technical blueprints on them for the record player and camera props. The last scroll in the box is a city map made using Oleg Dolya’s wonderful Medieval Fantasy City Generator tool, which is a fun little tool that does exactly what the name suggests and with which I’ve wasted more time than I’d like to admit generating and daydreaming about made up little fantasy towns.

Figure 17: Corrugated cardboard box containing technical magic scrolls and maps.

The next prop I worked on was the mannequin, which was even more straightforward than the cardboard box and scrolls. For the mannequin’s wooden components, I relied entirely on triplanar projections in Substance Painter oriented such that the grain of the wood would flow correctly along each part. The wood material is just a modified version of a default Substance Painter smart material, with additional wear and dust and stuff layered on top to give everything a bit more personality:

Figure 18: Mannequin prop made from wood and metal.

The record player was a fun prop to texture and shade, since there were a lot of components and a lot of room for adding little details and touches. I found a bunch of reference online for briefcase record players and, based off of the reference, I chose to make the actual record player part of the briefcase out of metal, black leather, and black plastic. The briefcase itself is made from a sort of canvas-like material stretched over a hard shell, with brass hardware for the clasps and corner reinforcements and stuff. For the speaker openings, instead of going with a normal grid-like dot pattern, I put in an interesting swirly design. The inside of the briefcase lid uses a red fabric, with a custom gold imprinted logo for an imaginary music company that I made up for this project: “SeneTone”. I don’t know why, but my favorite details to do when texturing and shading props are things like logos and labels; I think that it’s always the little labels and markings that you’d expect to see in real life that really help make something CG believable.

Figure 19: Record player briefcase prop, wide view.

Figure 20: Close-up of the actual record player part of the briefcase.

Figure 21: Close-up of the red fabric briefcase liner and gold "SeneTone" logo.

The camera prop took some time to figure out what to do with, mostly because I wasn’t actually sure whether it was a camera or a projector initially! While this prop looks like an old hand-cranked movie camera, the size of the prop in the scene that Pixar provided threw me off; the prop is way larger than any references for hand-cranked movie cameras that I could find. I eventually decided that the size could probably be handwaved away by explaining the camera as some sort of really large-format camera. I decided to model the look of the camera prop after professional film equipment from roughly the 1960s, when high-end cameras and stuff were almost uniformly made out of steel or aluminum housings with black leather or plastic grips. Modern high-end camera gear also tends to be made from metal, but in modern gear the metal is usually completely covered in plastic or colored powder coating, whereas the equipment from the 1960s I saw had a lot of exposed silvery-grey metal finishes with covering materials only in areas that a user would expect to touch or hold. So, I decided to give the camera prop an exposed gunmetal finish, with black leather and black plastic grips. I also reworked the lens and what I think is a rangefinder to include actual optical elements, so that they would look right when viewed from a straight-on angle. As an homage to old film cinema, I made a little “Super 35” logo for the camera (even though the Super 35 film format is a bit anachronistic for a 1960s era camera). The “Senecam” typemark is inspired by how camera companies often put their own typemark right across the top of the camera over the lens mount.

Figure 22: Camera prop front view. Note all of the layers of refraction and reflection in the lens.

Figure 23: Top view of the camera.

The crystal was really interesting to shade. I wanted to give the internals of the crystal some structure, and I didn’t want the crystal to refract a uniform color throughout. To get some interesting internal structure, I wound up just shoving a bunch of crumpled up quads inside of the crystal mesh. The internal crumpled up geometry refracts a couple of different variants of blue and light blue, and the internal geometry has a small amount of emission as well to get a bit of a glowy effect. The outer shell of the crystal refracts mostly pink and purple; this dual-color scheme gives the internals of the crystal a lot of interesting depth. The back-story in my head was that this crystal came from a giant geode or something, so I made the bottom of the crystal have bits of a more stony surface to suggest where the crystal was once attached to the inside of a stone geode. The displacement on the crystal is basically just a bunch of rocky displacement patterns piled on top of each other using triplanar projections in Substance Painter; I think the final look is suitably magical!

Figure 24: Wireframe of the crystal's internal geometry with crumpled up quads.

Figure 25: Final magical glowy look of the crystal.

Originally the crystal was going to be on one of the back shelves, but I liked how the crystal turned out so much that I decided to promote it to a foreground prop and put it on the foreground table. I then filled the crystal’s original location on the back shelf with a pile of books.

I liked the crystal look so much that I decided to make the star on the magic wand out of the same crystal material. The story I came up with in my head is that in this world, magic requires these crystals as a sort of focusing or transmitting element. The magic wand’s star is shaded using the same technique as the crystal: the inside has a bunch of crumpled up refractive geometry to produce all of the interesting color variation and appearance of internal fractures and cracks, and the outer surface’s displacement is just a bunch of rocky patterns randomly stacked on top of each other.

Figure 26: Magic wand star made from the same material as the crystal.

The flower-shaped lamps hanging above the table are also made from the same crystal material, albeit a much more simplified version. The lamps are polished completely smooth and don’t have all of the crumpled up internal geometry since I wanted the lamps to be crack-free.

The potted plant on top of the stack of record crates was probably one of the easiest props to texture and shade. The pot itself uses the same orange fired terracotta material as the main windows, but with displacement removed and with a bit less roughness. The leaves and bark on the branches are straight from Quixel Megascans. The displacement for the branches is actually slightly broken in both the test render below and in the final render, but since it’s a background prop and relatively far from the camera, I actually didn’t really notice until I was writing this post.

Figure 27: Potted plant prop sitting on top of the stack of record crates.

The reason that the character in my scene is talking on an old-school rotary dial phone is… actually, there isn’t a strong reason. I originally was tinkering with a completely different idea that did have a strong story reason for the phone, but I abandoned that idea very early on. Somehow the phone always stayed in my scene though! Since the setting of my final scene is a magic bookshop, I figured that maybe the character is working at the shop and maybe she’s casting a spell over the phone!

The phone itself is kit-bashed together from a stock model that I had in my stock model library. I did have to create the cord from scratch, since the cord needed to stretch from the main phone set to the receiver in the character’s hand. I modeled the cord in Maya by first creating a guide curve that described the path the cord was supposed to follow, and then making a helix and making it follow the guide curve using Maya’s Animate -> Motion Paths -> Flow Path Object tool. The Flow Path Object tool puts a lattice deformer around the helix and makes the lattice deformer follow the guide curve, which in turn deforms the helix to follow as well.
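For anyone curious, the whole setup boils down to something like the following rough maya.cmds sketch. This isn’t the actual script I used (I did all of this through Maya’s UI); the curve points and helix dimensions are made-up placeholder values, and exact flag names and defaults may vary a bit between Maya versions:

```python
import maya.cmds as cmds

# Guide curve describing the path the cord should follow; these points are placeholders.
guide = cmds.curve(degree=3, point=[(0, 0, 0), (2, 1, 0), (4, 0.5, 1), (6, 2, 1)])

# A long, skinny helix to serve as the coiled cord geometry; dimensions are placeholders.
helix = cmds.polyHelix(coils=60, height=6, width=0.3, radius=0.05)[0]

# Attach the helix to the guide curve as a motion path...
cmds.pathAnimation(helix, curve=guide, fractionMode=True, follow=True)

# ...then wrap the helix in a lattice that follows the motion path (this is what the
# Flow Path Object tool does), which deforms the helix along the guide curve.
cmds.flow(helix)
```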

As with everything else in the scene, all of the shading and texturing for the phone is my own. The phone is made from a simple red Bakelite plastic with some scuffs and scratches and fingerprints to make it look well used, while the dial and hook switch are made of a simple metal material. I noticed that in some of the reference images of old rotary phones that I found, the phones sometimes had a nameplate on them somewhere with the name of the phone company that provided the phone, so I made up yet another fictional logo and stuck it on the front of the phone. The fictional phone company is “Senecom”; all of the little references to a place called Seneca hint that maybe this image is set in the same world as my entry for the previous RenderMan Art Challenge. You can’t actually see the Senecom logo in the final image, but again, at least I know it’s there!

Figure 28: "Senecom" phone set, with custom modeled curly cord.

Figure 29: Phone handset, made from red plastic.

Signs and Records and Books

While I was looking up reference for bookstores with shading books in mind, I came across an image of a sign reading “Books are Magic” from a bookstore in Brooklyn with that name. Seeing that sign provided a good boost of inspiration for how I proceeded with theming my bookstore set, and I liked the sign so much that I decided to make a bit of an homage to it in my scene. I wasn’t entirely sure how to make a neon sign though, so I had to do some experimentation. I started by laying out curves in Adobe Illustrator and bringing them into Maya. I then made each glass tube by just extruding a cylinder along each curve, and then I extruded a narrower cylinder along the same curve for the glowy part inside of the glass tube. Each glass tube has a glass shader with colored refraction and uses the thin glass option, since real neon glass tubes are hollow. The glowy part inside is a mesh light. To make the renders converge more quickly, I actually duplicated each mesh light; one mesh light is white, is visible to camera, and has thin shadows disabled to provide the look of the glowy neon core, and the second mesh light is red, invisible to camera, and has thin shadows enabled to allow for casting colored glow outside of the glass tubes without introducing tons of noise. Inside of Maya, this setup looks like the following:

Figure 30: Neon sign setup in Maya.

After all of this setup work, I gave the neon tubes a test render, and to my enormous surprise and relief, it looked promising! This was the first test render of the neon tubes; when I saw this, I knew that the neon sign was going to work out after all:

Figure 31: First neon sign render test.

After getting the actual neon tubes part of the neon sign working, I added in a supporting frame and wires and stuff. In the final scene, the neon sign is held onto the back wall using screws (which I actually modeled as well, even though as usual for all of the tiny things that I put way too much effort into, you can’t really see them). Here is the neon sign on its frame:

Figure 32: Final neon sign prop with frame and wires.

The single most time consuming prop in the entire project wound up being the stack of record crates behind the character to the right; I don’t know why I decided to make a stack of record crates, considering how many unique records I wound up having to make to give the whole thing a plausible feel. In the end I made around twenty different custom album covers; the titles are borrowed from stuff I had recently listened to at the time, but all of the artwork is completely custom to avoid any possible copyright problems with using real album artwork. The sharp-eyed long-time blog reader may notice that a lot of the album covers reuse renders that I’ve previously posted on this blog before! For the record crates themselves, I chose a layered laminated wood, which I figured in real life is a sturdy but relatively inexpensive material. Of course, instead of making all of the crates identical duplicates of each other, I gave each crate a unique wood grain pattern. The vinyl records that are sticking out here and there have a simple black glossy plastic material with bump mapping for the grooves; I was pleasantly surprised at how well the grooves catch light given that they’re entirely done through bump mapping.

Coming up with all of the different album covers was pretty fun! Different covers have different neat design elements; some have metallic gold leaf text, others have embossed designs, there are a bunch of different paper varieties, etc. The common design element tying all of the album covers together is that they all have a “SeneTone” logo on them, to go with the “SeneTone” record player prop. To create the album covers, I created the designs in Photoshop with separate masks for different elements like metallic text and whatnot, and then used the masks to drive different layers in Substance Painter. In Substance Painter, I actually created different paper finishes for different albums; some have a matte paper finish, some have a high gloss magazine-like finish, some have rough cloth-like textured finishes, some have smooth finishes, and more. I guess none of this really matters from a distance, but it was fun to make, and more importantly, I know that all of those details are there! After randomizing which records get which album covers, here’s what the record crates look like:

Figure 33: Record crates stack with randomized, custom album covers. Click through for a high-res 4K render if you want to see all of the little details.

The various piles of books sitting around the scene also took a ton of time, for similar reasons to why the records took so much time: I wanted each book to be unique. Much like the records, I don’t know why I chose to have so many books, because it sure took a long time to make around twenty different unique books! My idea was to have a whole bunch of the books scattered around suggesting that the main character has been teaching herself how to build a magic wand and cast spells and such- quite literally “books are magic”, because the books are textbooks for various magical topics. Here is one of the textbooks- this one is about casting spells over the telephone, since the character is on the phone. Maybe she’s trying to charm whoever is on the other end!

Figure 34: Hero "Casting Spells over Telephone" book prop. This book was also the prototype for all of the other books!

I wound up significantly modifying the provided book model; I created several different basic book variants and also a few open book variants, for which I had to also model some pages and stuff. Because of how visible the books are in my framing, I didn’t want to have any obvious repeats in the books, so I textured every single one of them to be unique. I also added in some little sticky-note bookmarks into the books, to make it look like they’re being actively read and referenced.

Creating all of the different books with completely different cover materials and bindings and page styles was a lot of fun! Some of the most interesting covers to create were the ones with intricate gold or silver foil designs on the front; for many of these, I found pictures of really old books and did a bunch of Photoshop work to extract and clean up the cover design for use as a layer mask in Substance Painter. Here are some of the books I made:

Figure 35: Each one of these textbooks is a play on something I have on my home bookshelf.

Figure 36: Test render of various different types of pages, along with sticky notes.

Figure 37: Another test render of different types of pages and of pages sticking out.

Figure 38: A bunch more books, including a Seneca book!

Figure 39: Even more books. Did you notice the copy of PBRTv3 in the background?

One fun part of making all of these books was that they were a great opportunity for sneaking in a bunch of personal easter eggs. Many of the book titles are references to computer graphics and rendering concepts. Some of the book authors are just completely made up or pulled from whatever book caught my eye off of my bookshelf at the moment, but also included among the authors are all of the names of the Hyperion team’s current members at the time that I did this project. There is also, of course, a book about Seneca, and there’s a book referencing Minecraft. The green book titled “The Compleat Atlas of the House and Immediate Environs” is a reference to Garth Nix’s “Keys to the Kingdom” series, which my brother and I loved when we were growing up and which had a significant influence on the type of kind-of-a-science magic I like in fantasy settings. Also, of course, as is obligatory since I am a rendering engineer, there is a copy of Physically Based Rendering 3rd Edition hidden somewhere in the final scene; see if you can spot it!

Putting Everything Together

At this point, with all extra modeling completed and everything textured and shaded, the time came for final touches and lighting! Since one of the books I made is about levitation enchantments, I decided to use that to justify making one of the books float in mid-air in front of the character. To help sell that floating-in-air enchantment, I made some magical glowy pixie dust particles coming from the wand; the pixie dust is just some basic nParticles following a curve. The pixie dust is shaded using PxrSurface’s glow parameter. I used the particleId primvar to drive a PxrVary node, which in turn is used to randomize the pixie dust colors and opacity. Putting everything together at this point looked like this:

Figure 40: Putting everything together for the first time with everything textured and shaded.

I originally wanted to add some cobwebs in the corners of the room and stuff, but at this point I had so little time remaining that I had to move on directly to final shot lighting. I did however have time for two small last-minute tweaks: I adjusted the character’s pose a slight amount to tilt her head towards the phone more, which is closer to how people actually talk on the phone, and I also moved up the overhead lamps a bit to try not to crowd out her head.

The final shot lighting is not actually that far of a departure from the lighting I had already roughed in at this point; mostly the final lighting just consisted of tweaks and adjustments here and there. I added a bunch of PxrRodFilters to take down hot spots and help shape the lighting overall a bit more. The rods I added were to bring down the overhead lamps and prevent the lamps from blowing out, to slightly brighten up some background shelf books, to knock down a hot spot on a foreground book, and to knock down hot spots on the floor and on the bench. I also brought down the brightness of the neon sign a bit, since the brightness of the sign should be lower relative to how incredibly bright the windows were. Here is what my Maya viewport looked like with all of the rods; everything green in this screenshot is a rod:

Figure 41: Maya viewport with rods highlighted in green.

One of the biggest/trickiest changes I made to the lighting setup was actually for technical reasons instead of artistic reasons: the back window was originally so bright that the brightness was starting to break pixel filtering for any pixel that partially overlapped the back window. To solve this problem, I split the dome light outside of the window into two dome lights; the two new lights added up to the same intensity as the old one, but the two lights split the energy such that one light had 85% of the energy and was not visible to camera while the other light had 15% of the energy and was visible to camera. This change had the effect of preserving the overall illumination in the room while knocking down the actual whites seen through the windows to a level low enough that pixel filtering no longer broke as badly.
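The bookkeeping for this split is trivial, but here’s a tiny sketch of it anyway, just to make the energy-preserving part explicit. The function is purely illustrative; the 15% camera-visible fraction is the split described above:

```python
def split_dome_light(total_intensity, camera_visible_fraction=0.15):
    """Split one dome light into two lights that together preserve the original
    illumination: one carries most of the energy but is hidden from camera, and
    the other, much dimmer one is what the camera actually sees out the windows."""
    visible = total_intensity * camera_visible_fraction         # 15%, visible to camera
    hidden = total_intensity * (1.0 - camera_visible_fraction)  # 85%, hidden from camera
    # The two intensities sum back to the original, so overall room illumination is
    # unchanged; only the apparent brightness through the windows drops.
    return hidden, visible
```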

At this point I arrived at my final main beauty pass. In previous RenderMan Art Challenges, I broke out lights into several different render passes so that I could adjust them separately in comp before recombining, but for this project, I just rendered out everything on a single pass:

Figure 42: Final render, beauty pass.

Here is a comparison of the final beauty pass with the initial putting-everything-together render from Figure 40. Note how the overall lighting is actually not too different, but there are many small adjustments and tweaks:

Figure 43: Before (left) and after (right) final lighting. For a full screen comparison, click here.

To help shape the lighting a bit more, I added a basic atmospheric volume pass. Unlike in previous RenderMan Art Challenges where I used fancy VDBs and whatnot to create complex atmospherics and volumes, for this scene I just used a simple homogeneous volume box. My main goal with the atmospheric volume pass was to capture some subtly godray-like lighting effects coming from the back windows:

Figure 44: Final render, volumes pass.

For the final composite, I used the same Photoshop and Lightroom workflow that I used for the previous two RenderMan Art Challenges. For future personal art projects I’ll be moving to a DaVinci Resolve/Fusion compositing workflow, but this time around I reached for what I already knew since I was so short on time. Just like last time, I used basically only exposure adjustments in Photoshop, flattened out, and brought the image into Lightroom for final color grading. In Lightroom I further brightened things a bit, made the scene warmer, and added just a bit more glowy-ness to everything. Figure 45 is a gif that visualizes the compositing steps I took for the final image. Figure 46 shows what all of the lighting, comp, and color grading looks like applied to a 50% grey clay shaded version of the scene, and Figure 47 repeats what the final image looks like so that you don’t have to scroll all the way back to the top of this post.

Figure 45: Animated breakdown of compositing layers.

Figure 46: Final lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version.

Figure 47: Final image. Click for 4K version.

Conclusion

Despite having much less free time to work on this RenderMan Art Challenge, and despite not having really intended to even enter the contest initially, I think things turned out okay! I certainly wasn’t expecting to actually win a placed position again! I learned a ton about character shading, which I think is a good step towards filling a major hole in my areas of experience. For all of the props and stuff, I was pretty happy to find that my Substance Painter workflow is now sufficiently practiced and refined that I was able to churn through everything relatively efficiently. At the end of the day, stuff like art simply requires practice to get better at, and this project was a great excuse to practice!

Here is a progression video I put together from all of the test and in-progress renders that I made throughout this entire project:

Figure 48: Progression reel made from test and in-progress renders leading up to my final image.

As usual with these art projects, I owe an enormous debt of gratitude to my wife, Harmony Li, both for giving invaluable feedback and suggestions (she has a much better eye than I do!), and also for putting up with me going off on another wild time-consuming art adventure. Also, as always, Leif Pederson from Pixar’s RenderMan group provided lots of invaluable feedback, notes, and encouragement, as did everyone else in the RenderMan Art Challenge community. Seeing everyone else’s entries is always super inspiring, and being able to work side by side with such amazing artists and such friendly people is a huge honor and very humbling. If you would like to see more about my contest entry, check out the work-in-progress thread I kept on Pixar’s Art Challenge forum, and I also have an Artstation post for this project.

Finally, here’s a bonus alternate angle render of my scene. I made this alternate angle render for fun after the project and out of curiosity to see how well things held up from a different angle, since I very much “worked to camera” for the duration of the entire project. I was pleasantly surprised that everything held up well from a different angle!

Figure 49: Bonus image: alternate camera angle. Click for 4K version.

References

Carlos Aliaga, Carlos Castillo, Diego Gutierrez, Miguel A. Otaduy, Jorge López-Moreno, and Adrian Jarabo. 2017. An Appearance Model for Textile Fibers. Computer Graphics Forum. 36, 4 (2017), 35-45.

Brent Burley and Dylan Lacewell. 2008. Ptex: Per-face Texture Mapping for Production Rendering. Computer Graphics Forum. 27, 4 (2008), 1155-1164.

Brent Burley. 2015. Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering. In ACM SIGGRAPH 2015 Course Notes: Physically Based Shading in Theory and Practice.

Brent Burley, David Adler, Matt Jen-Yuan Chiang, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2017. Recent Advances in Disney’s Hyperion Renderer. In ACM SIGGRAPH 2017 Course Notes: Path Tracing in Production Part 1, 26-34.

Matt Jen-Yuan Chiang, Benedikt Bitterli, Chuck Tappan, and Brent Burley. 2016. A Practical and Controllable Hair and Fur Model for Production Path Tracing. Computer Graphics Forum. 35, 2 (2016), 275-283.

Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. Practical and Controllable Subsurface Scattering for Production Path Tracing. In ACM SIGGRAPH 2016 Talks, 49:1-49:2.

Philip Child. 2012. Ill-Loom-inating Brave’s Handmade Fabric. In ACM SIGGRAPH 2012 Talks.

Per H. Christensen and Brent Burley. 2015. Approximate Reflectance Profiles for Efficient Subsurface Scattering. Pixar Technical Memo #15-04.

Trent Crow, Michael Kilgore, and Junyi Ling. 2018. Dressed for Saving the Day: Finer Details for Garment Shading on Incredibles 2. In ACM SIGGRAPH 2018 Talks, 6:1-6:2.

Priyamvad Deshmukh, Feng Xie, and Eric Tabellion. 2017. DreamWorks Fabric Shading Model: From Artist Friendly to Physically Plausible. In ACM SIGGRAPH 2017 Talks. 38:1-38:2.

Eugene d’Eon. 2012. A Better Dipole. http://www.eugenedeon.com/project/a-better-dipole/

Eugene d’Eon, Guillaume Francois, Martin Hill, Joe Letteri, and Jean-Marie Aubry. 2011. An Energy-Conserving Hair Reflectance Model. Computer Graphics Forum. 30, 4 (2011), 1181-1187.

Christophe Hery. 2003. Implementing a Skin BSSRDF. In ACM SIGGRAPH 2003 Course Notes: RenderMan, Theory and Practice. 73-88.

Christophe Hery. 2012. Texture Mapping for the Better Dipole Model. Pixar Technical Memo #12-11.

Christophe Hery and Junyi Ling. 2017. Pixar’s Foundation for Materials: PxrSurface and PxrMarschnerHair. In ACM SIGGRAPH 2017 Course Notes: Physically Based Shading in Theory and Practice.

Jonathan Hoffman, Matt Kuruc, Junyi Ling, Alex Marino, George Nguyen, and Sasha Ouellet. 2020. Hypertextural Garments on Pixar’s Soul. In ACM SIGGRAPH 2020 Talks. 75:1-75:2.

Henrik Wann Jensen, Steve Marschner, Marc Levoy, and Pat Hanrahan. 2001. A Practical Model for Subsurface Light Transport. In Proceedings of SIGGRAPH 2001. 511-518.

Ying Liu, Jared Wright, and Alexander Alvarado. 2020. Making Beautiful Embroidery for “Frozen 2”. In ACM SIGGRAPH 2020 Talks, 73:1-73:2.

Steve Marschner, Henrik Wann Jensen, Mike Cammarano, Steve Worley, and Pat Hanrahan. 2003. Light Scattering from Human Hair Fibers. ACM Transactions on Graphics. 22, 3 (2003), 780-791.

Zahra Montazeri, Søren B. Gammelmark, Shuang Zhao, and Henrik Wann Jensen. 2020. A Practical Ply-Based Appearance Model of Woven Fabrics. ACM Transactions on Graphics. 39, 6 (2020), 251:1-251:13.

Sean Palmer and Kendall Litaker. 2016. Artist Friendly Level-of-Detail in a Fur-Filled World. In ACM SIGGRAPH 2016 Talks. 32:1-32:2.

Leonid Pekelis, Christophe Hery, Ryusuke Villemin, and Junyi Ling. 2015. A Data-Driven Light Scattering Model for Hair. Pixar Technical Memo #15-02.

Kai Schröder, Reinhard Klein, and Arno Zinke. 2011. A Volumetric Approach to Predictive Rendering of Fabrics. Computer Graphics Forum. 30, 4 (2011), 1277-1286.

Brian Smith, Roman Fedetov, Sang N. Le, Matthias Frei, Alex Latyshev, Luke Emrose, and Jean Pascal leBlanc. 2018. Simulating Woven Fabrics with Weave. In ACM SIGGRAPH 2018 Talks. 12:1-12:2.

Thomas V. Thompson, Ernest J. Petti, and Chuck Tappan. 2003. XGen: Arbitrary Primitive Generator. In ACM SIGGRAPH 2003 Sketches and Applications.

Walt Disney Animation Studios. 2011. SeExpr.

Magnus Wrenninge, Ryusuke Villemin, and Christophe Hery. 2017. Path Traced Subsurface Scattering using Anisotropic Phase Functions and Non-Exponential Free Flights. Pixar Technical Memo #17-07.

Shuang Zhao, Wenzel Jakob, Steve Marschner, and Kavita Bala. 2012. Structure-Aware Synthesis for Predictive Woven Fabric Appearance. ACM Transactions on Graphics. 31, 4 (2012), 75:1-75:10.

Shuang Zhao, Fujun Luan, and Kavita Bala. 2016. Fitting Procedural Yarn Models for Realistic Cloth Rendering. ACM Transactions on Graphics. 35, 4 (2016), 51:1-51:11.

Raya and the Last Dragon

After a break in 2020, Walt Disney Animation Studios has two films lined up for release in 2021! The first of these is Raya and the Last Dragon, which is simultaneously out in theaters and available on Disney+ Premiere Access on the day this post is being released. I’ve been working on Raya and the Last Dragon in some form or another since early 2018, and Raya and the Last Dragon is the first original film I’ve worked on at Disney Animation that I was able to witness from the very earliest idea all the way through to release; every other project I’ve worked on up until now was either based on a previous idea or began before I started at the studio. Raya and the Last Dragon was an incredibly difficult film to make, in every possible aspect. The story took time to really get right, the technology side of things saw many challenges and changes, and the main production of the film ran headfirst into the Covid-19 pandemic. Just as production was getting into the swing of things last year, the Covid-19 pandemic forced the physical studio building to temporarily shut down, and the studio’s systems/infrastructure teams had to scramble and go to heroic lengths to get production back up and running again from around 400 different homes. As a result, Raya and the Last Dragon is the first Disney Animation film made entirely from our homes instead of from the famous “hat building”.

In the end though, all of the trials and tribulations this production saw were more than worthwhile; Raya and the Last Dragon is the most beautiful film we’ve ever made, and the movie has a message and story about trust that is deeply relevant for the present time. The Druun as a concept and villain in Raya and the Last Dragon actually long predate the Covid-19 pandemic; they’ve been a part of every version of the movie going back years, but the Druun’s role in the movie’s plot meant that the onset of the pandemic suddenly lent extra weight to this movie’s core message. Also, as someone of Asian descent, I’m so so proud that Raya and the Last Dragon’s basis is found in diverse Southeast Asian cultures. Early in the movie’s conceptualization, before the movie even had a title or a main character, the movie’s producers and directors and story team reached out to all of the people in the studio of Asian descent and engaged us in discussing how the Asian cultures we came from shaped our lives and our families. These discussions continued for years throughout the production process, and throughlines from those discussions can be seen everywhere in the movie, from major thematic elements like the importance of food and sharing meals in the world of Kumandra, all the way down to tiny details like young Raya taking off her shoes when entering the Dragon Gem chamber. The way I get to contribute to our films is always in the technical realm, but thanks to Fawn Veerasunthorn, Scott Sakamoto, Adele Lim, Osnat Shurer, Paul Briggs, and Dean Wellins, this is the first time where I feel like I maybe made some small, tiny, but important contribution creatively too! Raya and the Last Dragon has spectacular fight scenes with real combat, and the fighting styles aren’t just made up- they’re directly drawn from Thailand, Malaysia, Cambodia, Laos, and Vietnam. Young Raya’s fighting sticks are Filipino Arnis sticks, the food in the film is recognizably dishes like fish amok, tom yam, chicken satay and more, Raya’s main mode of transport is her pet Tuk Tuk, who has the same name as those motorbike carriages that can be found all over Southeast Asia; the list goes on and on.

From a rendering technology perspective, Raya and the Last Dragon in a lot of ways represents the culmination of a huge number of many-year-long initiatives that began on previous films. Water is a huge part of Raya and the Last Dragon, and the water in the film looks so incredible because we’ve been able to build even further upon the water authoring pipeline [Palmer et al. 2017] that we first built on Moana and improved on Frozen 2. One small bit of rendering tech I worked on for this movie was further improving the robustness and stability of the water levelset meshing system that we first developed on Moana. Other elements of the film, such as being able to render convincing darker skin and black hair, along with the colorful fur of the dragons, are the result of multi-year efforts to productionize path traced subsurface scattering [Chiang et al. 2016b] (first deployed on Ralph Breaks the Internet) and a highly artistically controllable principled hair shading model [Chiang et al. 2016a] (first deployed on Zootopia). The huge geometric complexity challenges that we’ve had to face on all of our previous projects prepared us for rendering Raya and the Last Dragon’s setting, the vast world of Kumandra. Even more niche features, such as our adaptive photon mapping system [Burley et al. 2018], proved to be really useful on this movie, and even saw new improvements- Joe Schutte added support for more geometry types to the photon mapping system to allow for caustics to be cast on Sisu whenever Sisu was underwater. Raya and the Last Dragon also contains a couple of more stylized sequences that look almost 2D, but even these sequences were rendered using Hyperion! These more stylized sequences build upon the 3D-2D hybrid stylization experience that Disney Animation has gained over the years from projects such as Paperman, Feast, and many of the Short Circuit shorts [Newfield and Staub 2020]. I think all of the above is really what makes a production renderer a production renderer- years and years of accumulated research, development, and experience over a variety of challenging projects forging a powerful, reliable tool custom tailored to our artists’ work and needs. Difficult problems are still difficult, but they’re no longer scary, because now, we’ve seen them before!

For this movie though, the single biggest rendering effort by far was on volume rendering. After encountering many volume rendering challenges on Moana, our team undertook an effort to replace Hyperion’s previous volume rendering system [Fong et al. 2017] with a brand new, from scratch implementation based on new research we had conducted [Kutz et al. 2017]. The new system first saw wide deployment on Ralph Breaks the Internet, but all things considered, the volumes use cases on Ralph Breaks the Internet didn’t actually require us to encounter the types of difficult cases we ran into on Moana, such as ocean foam and spray. Frozen 2 was really the show where we got a second chance at tackling the ocean foam and spray and dense white clouds cases that we had first encountered on Moana, and new challenges on Frozen 2 with thin volumes gave my teammate Wayne Huang the opportunity to make the new volume rendering system even better. Raya and the Last Dragon is the movie where I feel like all of the past few years of development on our modern volume rendering system came together- this movie threw every single imaginable type of volume rendering problem at us, often in complex combinations with each other. On top of that, Raya and the Last Dragon has volumes in basically every single shot; the highly atmospheric, naturalistic cinematography on this film demanded more volumes than we’ve ever had on any past movie. Wayne really was our MVP in the volume rendering arena; Wayne worked with our lighters to introduce a swath of powerful new tools to give artists unprecedented control and artistic flexibility in our modern volume rendering system [Bryant et al. 2021], and Wayne also made huge improvements in the volume rendering system’s overall performance and efficiency [Huang et al. 2021]. We now have a single unified volume integrator that can robustly handle basically every volume you can think of: fog, thin atmospherics, fire, smoke, thick white clouds, sea foam, and even highly stylized effects such as the dragon magic [Navarro & Rice 2021] and the chaotic Druun characters [Rice 2021] in Raya and the Last Dragon.

A small fun new thing I got to do for this movie was to add support for arbitrary custom texture-driven camera aperture shapes. Raya and the Last Dragon’s cinematography makes extensive use of shallow depth-of-field, and one idea the film’s art directors had early on was to stylize bokeh shapes to resemble the Dragon Gem. Hyperion has long had extensive support for fancy physically-based lensing features such as uniformly bladed apertures and cateye bokeh, but the request for a stylized bokeh required much more art-directability than we previously had in this area. The texture-driven camera aperture feature I added to Hyperion is not necessarily anything innovative (similar features can be found in many commercial renderers), but iterating with artists to define and refine the feature’s controls and behavior was a lot of fun. There were also a bunch of fun nifty little details to solve, such as making sure that importance sampling ray directions based on an arbitrary textured aperture didn’t mess up stratified sampling and Sobol distributions; repurposing hierarchical sample warping [Clarberg et al. 2005] wound up being super useful here.
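To give a sense of what hierarchical sample warping does, here’s a small standalone Python sketch of the core idea from Clarberg et al. [2005]: build a sum pyramid over the aperture texture’s luminance, then descend the pyramid, splitting and rescaling a uniform 2D sample at each level so that samples land in texels proportionally to their brightness while staying reasonably stratified. This is just my own illustrative implementation of the published technique, not Hyperion’s code, and it assumes a square, power-of-two importance map:

```python
import numpy as np

def build_sum_pyramid(importance):
    """Build a sum pyramid from a square, power-of-two importance map (e.g. the
    luminance of the aperture texture). pyramid[0] is 1x1, pyramid[-1] is full res."""
    levels = [importance.astype(np.float64)]
    while levels[-1].shape[0] > 1:
        prev = levels[-1]
        # Each texel of the next-coarser level is the sum of a 2x2 block.
        levels.append(prev.reshape(prev.shape[0] // 2, 2, prev.shape[1] // 2, 2).sum(axis=(1, 3)))
    return levels[::-1]

def warp_sample(pyramid, u, v):
    """Warp a uniform sample (u, v) in [0, 1)^2 so that it is distributed
    proportionally to the importance map, by descending the sum pyramid."""
    row, col = 0, 0
    for level in pyramid[1:]:  # start at the 2x2 level, end at full resolution
        row, col = row * 2, col * 2
        a, b = level[row, col], level[row, col + 1]          # top-left, top-right
        c, d = level[row + 1, col], level[row + 1, col + 1]  # bottom-left, bottom-right
        total = a + b + c + d
        p_top = (a + b) / total if total > 0.0 else 0.5
        # Vertical split: decide between the top and bottom halves, then rescale v.
        if v < p_top:
            v = v / p_top
            p_left = a / (a + b) if (a + b) > 0.0 else 0.5
        else:
            v = (v - p_top) / (1.0 - p_top)
            row += 1
            p_left = c / (c + d) if (c + d) > 0.0 else 0.5
        # Horizontal split: decide between the left and right halves, then rescale u.
        if u < p_left:
            u = u / p_left
        else:
            u = (u - p_left) / (1.0 - p_left)
            col += 1
    res = pyramid[-1].shape[0]
    # The leftover (u, v) jitters the sample within the chosen texel.
    return (col + u) / res, (row + v) / res

# Toy usage: a 4x4 "aperture" that is bright only in its right half.
aperture = np.zeros((4, 4))
aperture[:, 2:] = 1.0
pyramid = build_sum_pyramid(aperture)
samples = [warp_sample(pyramid, u, v) for u, v in np.random.rand(1000, 2)]
# All warped samples land in the bright right half (x >= 0.5).
assert all(x >= 0.5 for x, _ in samples)
```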

There are a ton more really cool technical advancements that were made for Raya and the Last Dragon, and there were also several really ambitious, inspiring, and potentially revolutionary projects that just barely missed being deployed in time for this movie. One extremely important point I want to highlight is that, as cool as all of the tech that we develop at Disney Animation is, at the end of the day our tech and tools are only as good as the artists that use them every day to handcraft our films. Hyperion only renders amazing films because the artists using Hyperion are some of the best in the world; I count myself as super lucky to be able to work with my teammates and with our artists every day. At SIGGRAPH 2021, most of the talks about Raya and the Last Dragon are actually from our artists, not our engineers! Our artists had to come up with new crowd simulation techniques for handling the huge crowds seen in the movie [Nghiem 2021, Luceño Ros et al. 2021], new cloth simulation techniques for all of the beautiful, super complex outfits worn by all of the characters [Kaur et al. 2021, Kaur & Coetzee 2021], and even new effects techniques to simulate cooking delicious Southeast Asia-inspired food [Wang et al. 2021].

Finally, here are a bunch of stills from the movie, 100% rendered using Hyperion. Normally I post somewhere between 40 to 70 stills per film, but I had so many favorite images from Raya and the Last Dragon that for this post, there are considerably more. You may notice what looks like noise in the stills below- it’s not noise! The actual renders are super clean thanks to Wayne’s volumes work and David Adler’s continued work on our Disney-Research-tech-based deep learning denoising system [Dahlberg et al. 2019, Vogels et al. 2018], but the film’s cinematography style called for adding film grain back in after rendering.

I’ve pulled these from marketing materials, trailers, and Disney+; as usual, I’ll try to update this post with higher quality stills once the film is out on Bluray. Of course, the stills here are just a few of my favorites, and represent just a tiny fraction of the incredible imagery in this film. If you like what you see here, I’d strongly encourage seeing the film on Disney+ or on Blu-Ray; whichever way, I suggest watching on the biggest screen you have available to you!

To try to help avoid spoilers, the stills below are presented in no particular order; however, if you want to avoid spoilers entirely, then please go watch the movie first and then come back here to be able to appreciate each still on its own!

Here is the credits frame for Disney Animation’s rendering and visualization teams! The rendering and visualization teams are separate teams, but seeing them grouped together in the credits is very appropriate- we all are dedicated to making the best pixels possible for our films!

All images in this post are courtesy of and the property of Walt Disney Animation Studios.

Also, one more thing: in theaters (and also on Disney+ starting in the summer), Raya and the Last Dragon is accompanied by our first new theatrical short in 5 years, called Us Again. Us Again is one of my favorite shorts Disney Animation has ever made; it’s a joyous, visually stunning celebration of life and dance and music. I’ll probably dedicate a separate post to Us Again once it’s out on Disney+.

References

Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. The Design and Evolution of Disney’s Hyperion Renderer. ACM Transactions on Graphics. 37, 3 (2018), 33:1-33:22.

Marc Bryant, Ryan DeYoung, Wei-Feng Wayne Huang, Joe Longson, and Noel Villegas. 2021. The Atmosphere of Raya and the Last Dragon. In ACM SIGGRAPH 2021 Talks. 51:1-51:2.

Matt Jen-Yuan Chiang, Benedikt Bitterli, Chuck Tappan, and Brent Burley. 2016. A Practical and Controllable Hair and Fur Model for Production Path Tracing. Computer Graphics Forum. 35, 2 (2016), 275-283.

Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. Practical and Controllable Subsurface Scattering for Production Path Tracing. In ACM SIGGRAPH 2016 Talks. 49:1-49:2.

Petrik Clarberg, Wojciech Jarosz, Tomas Akenine-Möller, and Henrik Wann Jensen. 2005. Wavelet Importance Sampling: Efficiently Evaluating Products of Complex Functions. ACM Transactions on Graphics. 24, 3 (2005), 1166-1175.

Henrik Dahlberg, David Adler, and Jeremy Newlin. 2019. Machine-Learning Denoising in Feature Film Production. In ACM SIGGRAPH 2019 Talks. 21:1-21:2.

Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. Production Volume Rendering. In ACM SIGGRAPH 2017 Courses.

Wei-Feng Wayne Huang, Peter Kutz, Yining Karl Li, and Matt Jen-Yuan Chiang. 2021. Unbiased Emission and Scattering Importance Sampling for Heterogeneous Volumes. In ACM SIGGRAPH 2021 Talks. 3:1-3:2.

Avneet Kaur and Johann Francois Coetzee. 2021. Wrapped Clothing on Disney’s Raya and the Last Dragon. In ACM SIGGRAPH 2021 Talks. 28:1-28:2.

Avneet Kaur, Erik Eulen, and Johann Francois Coetzee. 2021. Creating Diversity and Variety in the People of Kumandra for Disney’s Raya and the Last Dragon. In ACM SIGGRAPH 2021 Talks. 58:1-58:2.

Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes. ACM Transactions on Graphics. 36, 4 (2017), 111:1-111:16.

Alberto Luceño Ros, Kristin Chow, Jack Geckler, Norman Moses Joseph, and Nicolas Nghiem. 2021. Populating the World of Kumandra: Animation at Scale for Disney’s Raya and the Last Dragon. In ACM SIGGRAPH 2021 Talks. 39:1-39:2.

Mike Navarro and Jacob Rice. 2021. Stylizing Volumes with Neural Networks. In ACM SIGGRAPH 2021 Talks. 54:1-54:2.

Jennifer Newfield and Josh Staub. 2020. How Short Circuit Experiments: Experimental Filmmaking at Walt Disney Animation Studios. In ACM SIGGRAPH 2020 Talks. 72:1-72:2.

Nicolas Nghiem. 2021. Mathematical Tricks for Scalable and Appealing Crowds in Walt Disney Animation Studios’ Raya and the Last Dragon. In ACM SIGGRAPH 2021 Talks. 38:1-38:2.

Sean Palmer, Jonathan Garcia, Sara Drakeley, Patrick Kelly, and Ralf Habel. 2017. The Ocean and Water Pipeline of Disney’s Moana. In ACM SIGGRAPH 2017 Talks. 29:1-29:2.

Jacob Rice. 2021. Weaving the Druun’s Webbing. In ACM SIGGRAPH 2021 Talks. 32:1-32:2.

Thijs Vogels, Fabrice Rousselle, Brian McWilliams, Gerhard Röthlin, Alex Harvill, David Adler, Mark Meyer, and Jan Novák. 2018. Denoising with Kernel Prediction and Asymmetric Loss Functions. ACM Transactions on Graphics. 37, 4 (2018), 124:1-124:15.

Cong Wang, Dale Mayeda, Jacob Rice, Thom Whicks, and Benjamin Huang. 2021. Cooking Southeast Asia-Inspired Soup in Animated Film. In ACM SIGGRAPH 2021 Talks. 35:1-35:2.

Art Exercise: Lupine Forest Cabin

I recently did a small personal art exercise to experiment with building out a detailed forest type of environment. I have worked with a detailed forest scene before, when I used one while I was developing Takua Renderer’s mipmapping and texture caching system, but in that case I didn’t actually make the scene, I just took an off-the-shelf scene and ported it to Takua. I also made a forest for the background of my Woodville RenderMan Art Challenge piece, but to be honest, I wasn’t very happy with how that turned out. I think the background forest in my Woodville piece looks decent when viewed from a distance, as it is seen in that scene, but it doesn’t come even remotely close to holding up when viewed close up. Also, that forest was incredibly difficult to author. For the Woodville piece, I chose Maya’s MASH toolset over the XGen toolset to create the background forest; I chose MASH because MASH generally feels much more native to Maya and doesn’t have all of the extra file management that XGen requires. However, I found that MASH has performance problems when the things being instanced have really heavy geometry and there are a lot of instances, as is the case when instancing trees and bushes and the type of stuff generally found in forests. For this exercise, I wanted to focus on building a forest environment that would be detailed enough to hold up at closer distances, which means including things like leaf litter and twigs and small rocks on the ground, displacement-based bark on trees, moss on various surfaces, and so on. I also wanted to find a more efficient way to author these kinds of environments.

This post is essentially a collated version of my notes from this project, published here mostly because having them all in one place is useful for me to refer back to later but also in case anyone else finds this stuff interesting. Before I go into the details of how this project went, here’s what the final result looks like:

Figure 1: Final piece, daylight variant. Made for fun and as an environment building exercise. Click for 4K version.

Initial Blocking

I didn’t really come into this project with much of a plan for what the final image would look like. To be honest, the goal of this project wasn’t even necessarily to make the most visually or artistically interesting image; the goal was primarily a technical exercise to practice building a complex scene. I actually didn’t start with forest related things at all; instead, I started with just trying to build an interesting groundplane. My plan was to use as many off-the-shelf premade assets as possible and focus on assembling everything together into a nice scene and on final lighting. Quixel Megascans has a lot of amazing large-scale 3D scans of rock formations and cliff faces; I had used a few of these back on the Woodville RenderMan Art Challenge and had made a note to myself back then to play with the Megascans large-scale rock formation assets more later. I started with placing a bunch of Megascans rock formations to get a sense of initial shaping for the scene. The overall shaping I went with for the scene draws from the principle of “compression and release”- I wanted the foreground to be seen through a narrower passage opening up into a larger space in the background, which I think helps give a sense that we’re looking into a larger world:

Figure 2: Starting point: kitbashing some Quixel Megascans rock formations together.

The next step was to put in a groundplane of dirt and soil. Since at the time RenderMan didn’t have hex-tiling [Heitz and Neyret 2018, Burley 2019] yet, I used the same technique that I used on the Woodville RenderMan Art Challenge to hide tiling in the groundplane textures: the groundplane texture is made up of several different tiled soil and dirt textures, projected using PxrRoundCube with different rotations per texture and blended together using a couple of different noise projections. One useful trick I added this time was to also use the displacement maps to drive which texture “wins” at each blend point; this allows for nice things like larger rocks poking through layers of dirt and stuff.

Figure 3: Roughing in a groundplane by blending together a bunch of different tiled soil and dirt textures, partially modulated by displacement height.
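In isolation, the height-driven blend trick described above boils down to something like this minimal numpy sketch. The real setup is built out of PxrRoundCube projections and blend nodes inside RenderMan’s shading graph rather than Python, so everything here (the function name, the softmax-style weighting, the sharpness value) is just an illustrative approximation of the idea, not my actual shading network:

```python
import numpy as np

def height_blend(albedos, heights, noise_masks, sharpness=8.0):
    """Blend several tiled texture layers so that, at each point, the layer with the
    greatest displacement height (nudged by its noise-projection mask) 'wins'.
    albedos:     list of (H, W, 3) color layers
    heights:     list of (H, W) displacement maps, roughly in [0, 1]
    noise_masks: list of (H, W) per-layer blend masks from the noise projections
    sharpness:   higher values give harder, more height-dominated transitions"""
    priority = np.stack(heights) + np.stack(noise_masks)              # per-layer "who wins" score
    weights = np.exp(sharpness * (priority - priority.max(axis=0)))   # softmax-style weights
    weights /= weights.sum(axis=0)
    return np.einsum('lhw,lhwc->hwc', weights, np.stack(albedos))

# Toy usage: a rock layer pokes through the dirt wherever its displacement is higher.
h, w = 64, 64
dirt = np.full((h, w, 3), [0.4, 0.3, 0.2])
rock = np.full((h, w, 3), [0.5, 0.5, 0.5])
dirt_height = np.random.rand(h, w) * 0.3
rock_height = np.random.rand(h, w)  # taller on average, so rock wins more often
blended = height_blend([dirt, rock], [dirt_height, rock_height],
                       [np.zeros((h, w)), np.zeros((h, w))])
```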

Overall though I didn’t put too much effort into the groundplane since I expected it to be almost entirely covered with vegetation in the final scene, meaning very little of it would be directly visible. After getting the groundplane to a good-enough state, I fired up Maya’s XGen Geometry Instancer and started roughing in what all of the vegetation in the scene would look like.

The XGen toolset that ships with Maya today is really two essentially completely different toolsets packaged into a single plugin- XGen Interactive Grooming, and XGen Geometry Instancer. XGen Geometry Instancer was forked from Disney Animation’s internal XGen instancing system over a decade ago, while XGen Interactive Grooming was then added some years later [Todd 2013] as a totally separate authoring interface for hair workflows. For this project, I only used the Geometry Instancer half of XGen. I’ve had a fair amount of experience with Disney Animation’s version of XGen, so using the Maya version of XGen Geometry Instancer has been an interesting experience- the version in Maya was fairly heavily reworked to be more integrated into Maya compared with Disney Animation’s standalone version, and the two versions have since diverged even more due to many years of independent development and evolution, but there are still clearly a lot of Disney-isms in Maya’s XGen, such as the use of SeExpr [Disney Animation 2011] and Ptex [Burley & Lacewell 2008] to control instancing distributions and properties.

The basic workflow for XGen Geometry Instancer is to first create a bunch of archives of geometry that the user wants to instance, and then create a description, which is a set of instances of the archives scattered across some area of geometry, called a patch. When the underlying geometry that XGen instances are being scattered on is a polygon mesh, patches bind to the faces of the mesh. Multiple descriptions can be organized together as a collection. Something that is a bit frustrating about XGen Geometry Instancer is that its integration into Maya is… somewhat strange. Maya’s XGen Geometry Instancer doesn’t really feel or behave like a native part of Maya- XGen descriptions and collections show up in the Maya outliner, but are edited through a dedicated separate interface instead of the standard Attribute or Channel editors, and instead of storing all data as part of the Maya file, XGen Geometry Instancer creates and depends on a bunch of external sidecar files; even worse, paths baked into these external files don’t respect the project workspace system that Maya uses, making reusing XGen descriptions between different Maya projects kind of annoying. All of this is the result of XGen Geometry Instancer being derived from an internal standalone tool at Disney Animation- the internal version of XGen makes a lot of sense integrated within the overall Disney Animation pipeline, but none of this was originally designed to be a deeply integrated part of Maya outside of Disney Animation, and even with all of the work that Autodesk put into making the commercially available Maya version of the system better integrated, the outside origin still shows through. However, I found on this project that XGen Geometry Instancer is nonetheless still super powerful and, for huge numbers of instances, is vastly more performant than the native Maya MASH instancing toolset.
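As a loose mental model of that archive/description/patch/collection hierarchy (and emphatically not XGen’s actual data structures or Python API), the relationships look something like this:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Archive:
    """A piece of geometry exported for instancing (e.g. a grass clump or a tree)."""
    geometry_file: str

@dataclass
class Description:
    """A set of instances of one or more archives, scattered across a patch.
    A patch binds the description to the faces of an underlying polygon mesh."""
    archives: List[Archive]
    patch: str                 # the bound mesh (or subset of its faces)
    density_mask: str = ""     # e.g. a painted Ptex mask driving instance density

@dataclass
class Collection:
    """Descriptions are organized and managed together as a collection."""
    descriptions: List[Description] = field(default_factory=list)
```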

The first vegetation test I did was to put down some basic grass, scattered using a super basic XGen Description setup where the base geometry used for scattering was just the groundplane itself. The grass is made up of just a couple of off-the-shelf grass models from Evermotion and Megascans, reshaded for the project’s needs, exported as XGen archives and scattered with random rotations applied per instance:

Figure 4: Initial grass scattered using Maya's XGen Geometry Instancer toolset.

Next, I started blocking in trees since this scene was meant to be a forest scene. Originally I had planned on scattering trees using XGen as well, but I wound up placing the trees by hand. Part of the reason for placing by hand was simply that there aren’t that many trees in this scene, but the primary reason was just for more direct control since the trees play a really large role in establishing the overall framing of the scene.

The base spruce tree models I used are from Evermotion, but I opted to reshade them, using the Evermotion textures as a starting point and then improving on them with some stuff from Megascans. The main reason for reshading was to improve how well the trees would hold up to being close to the camera, which requires more detailed displacement on the tree bark. I also put some work into making sure the spruce needles transmit light in a convincing way, which required some initial reverse-engineering of how the trees’ original Vray materials work in order to understand how to get a similar effect in RenderMan’s PxrSurface material, and then further tweaking and adjustment to both the spruce needle textures and the material.

Here are a couple of lookdev test renders of one of the spruce trees; the first two show the overall look of the trees with my reworked shading, and the last one shows the close-up detail and also how the spruce needles look with backlighting:

Figure 5: Lookdev test render of a spruce tree, upper half.

Figure 6: Lookdev test render of a spruce tree, lower half.

Figure 7: Closeup lookdev test render of a spruce tree, focusing on bark displacement detail and backlighting through the needles.

One interesting thing I ran into was RenderMan’s opacity caching system. The opacity caching system stores an opacity value per micropolygon, under the assumption that geometry is diced finely enough that micropolygons end up close to one micropolygon per screen-space pixel. As long as the one-micropolygon-per-screen-pixel assumption holds more or less true, this opacity caching approach works well; the average opacity value across a single screen-pixel sized micropolygon should match what the filtered mip-mapped texture lookup for opacity would be across the micropolygon, and therefore caching this value on the micropolygon allows for skipping a lot of texture lookups over the course of the render. However, as soon as micropolygons are larger than a single screen-pixel, the assumptions that allow opacity caching to work without visual artifacts break, because the frequency at which opacity caching can store values drops below the frequency that the filtered mip-mapped signal from texturing represents; the net visual result is blurring in opacity edges, which can look super strange.
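A toy example makes the failure mode easy to see; this is purely illustrative and has nothing to do with RenderMan’s actual implementation. Imagine a single micropolygon spanning eight screen pixels across a hard opacity edge:

```python
import numpy as np

# Filtered opacity that per-pixel texture lookups would return across the edge:
per_pixel_opacity = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Opacity caching stores one value for the whole micropolygon and reuses it for
# every pixel the micropolygon covers:
cached_opacity = np.full_like(per_pixel_opacity, per_pixel_opacity.mean())

print(per_pixel_opacity)  # [1. 1. 1. 1. 0. 0. 0. 0.] -> crisp edge
print(cached_opacity)     # [0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5] -> edge smeared across 8 pixels
```

When the micropolygon is subpixel, the two agree and the cache is a pure win; once the micropolygon covers many pixels, a single cached value simply can’t represent the edge anymore.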

PxrSurface in RenderMan for Maya enables opacity caching by default, but for the spruce needles on the trees, I had to disable opacity caching. The spruce needles on the trees are modeled as a single quad strip per bundle of needles with opacity mapping; I left the quad strips unsubdivided to keep memory usage lower, but this meant that spruce needle polygon sizes were often way larger than one pixel in screen space for trees closer to the camera. Before doing this project I didn’t realize that RenderMan has an opacity caching system and enables it by default, so figuring out why the spruce needles looked all messed up took me a moment. This wasn’t a problem in the Woodville project because in that scene, all of the trees are in the background and therefore are so far from camera that all unsubdivided quad strips are subpixel in screen space anyway.

Here are the spruce trees blocked out along with the initial grass and rock formations. While nothing here is as close to the camera as in my lookdev tests, the nearest foreground trees are nonetheless close enough to the camera that the reworked high-resolution displacement mapping and exact opacity evaluation on the spruce needles (instead of cached opacity) start to matter a great deal:

Figure 8: Initial blocking for where the spruce trees should go, along with initial grass and rock formations.

Forest Floor

At this stage, with conifer trees and large rock formations, the scene was starting to remind me of trips to Yosemite (even though technically Yosemite’s conifers include pine, fir, and sequoia but notably not spruce). I’ve seen lupine blooms in Yosemite before, which struck me as really visually interesting since large fields of blue-purple flowers feel somewhat unusual and unexpected normally. So, to make the foreground in this scene more interesting than just a grassy plane, I added in a couple of layers of lupine flowers to make a purple lupine meadow. The lupine flowers are placed as a couple of different overlapping XGen descriptions, with varying clustering rules per description. I also carved a footpath through the grass and flowers by painting a simple density mask.

I also realized here that the overall massing of the scene was all vertical but in a horizontal framing, which was starting to make the scene feel claustrophobic to me. Having the rock formations be completely vertical also didn't make sense to me for how they probably would have been formed in real life; a completely vertical cut through rock cliffs like that feels like it would have had to have been man-made. So, I tried sloping the rock formations back to give more of an impression of a small valley that could have been carved glacially or by water erosion. Sloping the rock formations back also helped bring some more light into the foreground of the scene. Here is a comparison of what the scene looked like at this point with vertical rock formations versus sloped rock formations:

Figure 9: Blocking in lupine flowers and carving out a footpath through the grass and flowers, with vertical rock formations (left) versus sloping rock formations (right). For a full screen comparison, click here.

One nice thing about XGen Geometry Instancing is that it treats driving properties through painted masks as a first-class citizen, as opposed to the MASH toolset where using textures as masks is a lot more annoying than I feel like it should be. In my experience, MASH encourages procedural placement or hand placement using the MASH Placer tool over using painted masks to drive everything; using a painted mask only really works well with MASH’s World node, but the World node in turn doesn’t always play well with everything else in MASH’s toolkit; in this sense I really prefer how easy using painted masks is in XGen. However, in XGen, painted masks also come with a minor complication- one Disney-ism that remains in XGen to this day is that XGen expects painted masks to all be Ptex files, which means the masks either have to be made using a paint tool that natively supports Ptex, or means that masks created as regular UV textures need to be converted to Ptex. Within Disney Animation, which has a native Ptex workflow, this is not a problem at all, but in a vanilla Maya workflow in the outside world, this Ptex dependency adds some friction and an additional layer of file juggling to the XGen workflow if one is more used to a UV-based painting workflow.

Here is what the painted density mask for the footpath looks like in Maya’s viewport. Since leaving this mask pretty coarse was fine for the use case, I just painted it directly in Maya using the built-in simple surface paint tool:

Figure 10: Density mask driving the footpath through the lupine flowers and grass.

Having just the lupine flowers and base grass looked kind of monotonous, so I also threw in some small shrubs here and there to help visually break up the fields of purple. In the areas where the groundplane shows through because of the footpath, I also added several more layered XGen descriptions to scatter various small bits of ground debris that I pulled from Megascans; the debris includes things you'd expect from a conifer forest, such as sticks and twigs and clusters of dead conifer needles, in addition to some pebbles and small rocks and stuff.

With all of the debris on the groundplane, even as instanced geometry, the scene was starting to get really heavy to render, so as a small optimization I took the density mask for the footpath through the lupine flowers and grass and simply inverted it and applied it to the debris XGen descriptions. This way, there’s only debris where it’ll actually be visible. Putting all of this together got the scene to a point where I think the groundplane was looking pretty good:

Figure 11: Groundplane with shrubs added in and twigs and leaf debris scattered over the footpath.

Here’s a top-down view of what all of the footpath debris looks like in the final version of the scene (I’ll get to the moss and volumes next). The nice thing about using instanced geometry instead of just adding decals to the groundplane texturing, or even adding displacement, is that the instanced geometry provides much more of a feeling of depth, especially in all of the areas where cavities and holes are visible under piles of dried needles:

Figure 12: Top-down detail view of the footpath debris, from the final version of the scene with moss, volumes, etc.

Making Moss

In order to add a bit more detail to the scene, I wanted to add moss to some of the tree trunks and rock formations and stuff. On the Woodville piece, the vast majority of the mossy surfaces in the scene are made using a purely textured approach, with only a small amount of geometric moss in specific places. For this scene, I wanted to move to a 100% geometry-based moss approach. Geometry-based moss simply holds up better close to camera, and even far away from camera, geometric moss with proper translucency allows for much more realistic lighting responses completely automatically.

I made the pieces of moss by just grabbing a bunch of Megascans atlases of moss bits and cutting out each piece from each atlas as its own little card. The cards aren't simple rectangles; each card is shaped to cut as closely to the actual moss shape as possible while still maintaining a very low triangle count. Using simple rectangular cards means that the vast majority of each card is actually going to be opacity mapped out, and in most raytraced renderers (including RenderMan), somewhat counter-intuitively we'd rather have a few more triangles to deal with in exchange for cutting down opacity map lookups. The reason is that triangle intersection using a BVH acceleration structure means that the runtime cost of intersecting N triangles is typically less than N times the cost of a single triangle, but having to either interrupt BVH traversal to evaluate an opacity map in-line or having to evaluate opacity as part of shading and then generate continuation rays at shading points where opacity is less than 1 is really expensive in the aggregate. Here's what some of the moss cards looked like:

Figure 13: Some of the moss cards; note that each card's shape tries to minimize the amount of opacity map lookups required. Black indicates areas that are completely transparent to the renderer.

At rendertime each moss card is actually subdivided and displacement mapped as well, so the total number of triangles per moss card actually can end up being relatively high, allowing for more detail. This card-based technique produced several dozen unique pieces of moss, which when combined with some random hue shifts in the material and some flipping and scaling was more than enough to create the impression of tons of unique moss bits when instanced and scattered.

To get an initial sense of how well this approach would work, I did a small-scale test on a single rock before expanding to the whole scene, using a simplified version of the rock for the underlying geometry to grow the XGen descriptions from and using a simple density mask painted directly in Maya:

Figure 14: Initial small-scale moss test on a rock.

One small problem with using XGen Geometry Instancer with Megascans assets is that XGen can be somewhat difficult to use with super high-resolution meshes. The reason high-resolution meshes get tricky to use with XGen comes back to XGen needing to bind a patch to every face in the mesh; with a really high-resolution mesh, this can result in a ton of patches, and since Maya and XGen are not particularly well multithreaded in some operations, having a ton of patches in XGen can result in really slow update times as Maya loops over and updates all patches. XGen in Maya does have a multithreading option, but I found that even with this option enabled, really high-resolution meshes still bogged down instance generation a lot. So, my workaround was to take all of the meshes that I wanted to grow moss on and create highly decimated versions to serve as base geometry for XGen.
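Making the decimated stand-ins is quick to script using Maya's polyReduce command; here's a rough Maya Python sketch of the idea (the mesh name and reduction percentage are just placeholders, and the exact polyReduce flags may vary a bit between Maya versions):

```python
# Sketch: duplicate a high-resolution mesh and heavily decimate the duplicate
# for use as XGen base geometry. Names and percentages here are just examples.
import maya.cmds as cmds

def make_xgen_base(high_res_mesh, reduction_percentage=95):
    # Duplicate first so that the original render mesh is left untouched.
    base = cmds.duplicate(high_res_mesh, name=high_res_mesh + "_xgenBase")[0]
    # Reduce the duplicate down to a small fraction of its original face count.
    cmds.polyReduce(base, version=1, percentage=reduction_percentage)
    return base

# Example usage with a hypothetical mesh name:
# make_xgen_base("rockFormation_hero", reduction_percentage=95)
```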

Below is a comparison of what the full-resolution geometry that I wanted moss on looks like in the viewport, versus the decimated version of that geometry for XGen. Note that the decimated XGen base geometry obviously isn't visible in the final render; XGen allows for separately hiding the underlying base geometry versus hiding the generated instances.

Figure 15: Full-resolution geometry that I wanted moss on (left) versus the decimated version used as base geometry for XGen Geometry Instancer (right). For a full screen comparison, click here.

For the moss density masks, I reached for Substance Painter instead of just painting in Maya using the basic built-in paint toolset. Painting the masks in Substance Painter made getting in nice bits of detail and complexity much easier. In order to make the Substance Painter workflow slightly easier, I also re-UV’d the decimated meshes to try to reduce the number of different UV islands; this step was much easier to do on low-resolution decimated meshes than it would have been on the full-resolution geometry!

Here’s what the painted masks for the “hero” foreground tree trunks look like in Substance Painter:

Figure 16: Foreground tree trunk moss density masks in Substance Painter.

In order to use UV textures as masks in XGen, the textures have to be converted to Ptex. Maya’s paint tool actually has functionality for this; when using the paint tool, there’s an option to save and load from UV textures, which XGen then converts to Ptex upon hitting the map save button in the XGen UI.

I also used Substance Painter to paint the moss density masks for all of the rock formations and cliff walls. The viewport screenshot shows that the final resolution of the density masks actually isn’t that high, but in this case that wound up being completely okay:

Figure 17: All moss density masks on trees and rock formations.

One downside of using low-resolution decimated geometry as the base to grow XGen instances from while keeping high-resolution geometry for rendering is that there can be mismatches between where XGen instances are and where the high-resolution surface is; this mismatch means instances can either float in space or be embedded too far down in the high-resolution surface. XGen Geometry Instancer allows for specifying a displacement map on the base geometry to help correct for this, but in this case I found that the mismatch didn’t really matter. To address floating geometry I just applied a universal offset back along the surface normal, and for something like moss, pieces being embedded too far down in the high-resolution surface just look like shorter pieces of moss.

Here is a test render of what the moss looks like on just the smaller foreground rocks, which I did to confirm that at least for this use case, mismatches between the low-resolution surface used for XGen and the high-resolution render surface weren’t important:

Figure 18: Isolated test render of XGen moss on the foreground rocks.

Rendering everything together at this point produced the following image; things were starting to look pretty good! I think the moss really helps with adding in a ton of additional detail, which in turn makes the whole scene look and feel more believable. In retrospect I suppose I should have adjusted the moss so that it only grows in areas that stay in the shade longer, but even without taking that additional real-world factor into account, I think the moss looks pretty convincing:

Figure 19: Scene rendered with all XGen instanced geometry, including debris, flowers, grass, and moss.

This scene really shows the importance of having a robust instancing system in a production renderer- with all of the XGen instances added in, the scene at this stage took up about 50GB of memory in RenderMan for geometry alone, but 50GB is actually not too bad when considering that the total number of visible triangles is around a billion. According to RenderMan’s stats breakdown, of that 50GB of geometry memory, around 33GB is used just for storing instancing information! Most of the triangles just belong to instanced geometry, which is crucial to keeping geometry memory under control.

I ran some analytics to count up the total number of XGen instances (debris, grass, lupine flowers, and moss) in this scene; the total tally is 11,433,11. Here’s a comparison of what the final viewport looks like with only non-instanced geometry, and with every single XGen instance drawn as a blue bounding box on top:

Figure 20: Final viewport with only non-instanced geometry (left) and with every single XGen instance drawn as a blue bounding box on top (right). For a full screen comparison, click here.

The original stated goal of this project was to learn to create forest detail that could hold up well when viewed close up, but admittedly the closest tree in the final composition for this project is only in the mid-ground at best. I wanted to see how well the moss would hold up when viewed closer, so after I finished the rest of the project, I set up an alternate camera much closer up to one of the trees. Here's what the final version of the scene looks like with the alternate close-up camera. It's far from perfect; in order to really look convincing this close up, I think the moss actually needs to be even smaller and even denser, but at least at a glance I think it does an okay job of holding up:

Figure 21: Alternate close-up view of one of the mossy tree trunks, from the final version of the scene.

Filling out the Clearing

To finish building the scene, I wanted to put something in the clearing that the footpath leads to. Because I started this project with no real plan whatsoever, I was a bit stumped at this point about what to put in the clearing! One initial idea I played with was throwing in a hot air balloon- I figured that a hot air balloon could provide a visually interesting splash of color into the background. I actually fully carried out this idea, including adding in some emissive volume flames for the hot air balloon's burners. I also filled out the background with some more trees:

Figure 22: Initial version of the scene including a hot air balloon in the clearing, also with a slightly lower camera angle.

While I think the hot air balloon concept wasn't the worst idea, I decided against it after seeing the hot air balloon fully in-context with the rest of the scene. Visually the balloon stood out, but I felt that it stood out too much and distracted from the overall composition. Also, in terms of the internal logic of the scene's world-building, I thought that the balloon didn't make a whole lot of sense- the clearing is pretty small, which when combined with all of the surrounding tall trees, makes for a hazardous landing site for a hot air balloon! I moved on to trying to find something more appropriate to the scene both visually and logically. I did keep the additional trees though.

Since the scene’s original layout principle of “compression and release” comes from the architecture world, I figured it would be fun to turn this scene into a bit of an architectural scene (which tend to be my favorite type of scene to make anyway). Evermotion released a wonderful 15th Anniversary scene earlier this year that includes two small cabins; I thought that something along the lines of those cabins would fit the bill well for the clearing in my scene. I was looking for something that would be small and relatively unassuming, but also still be visually interesting; I didn’t really have any preference for what the style of the building would be, but something super modern felt like a nice change of pace after working on the Woodville project. I didn’t want to just grab one of the Evermotion 15th Anniversary cabins and drop it into the scene as-is though. To keep things more visually interesting, I wanted to be able to see into the cabin more than is possible in the vanilla Evermotion 15th Anniversary cabins, and I also wanted the cabin to look cozy and inviting on the inside. To build the cabin for this scene, I kitbashed together parts from the two Evermotion 15th Anniversary cabins and made the front of my cabin one gigantic glass sheet:

Figure 23: Cabin for this piece, built by kitbashing together parts from the Evermotion 15th Anniversary scene.

To give the giant front window a bit more realism, I added a subtle amount of unevenness and wobbliness to the glass sheet, much like how giant glass sheets on skyscrapers often have small amounts of visible unevenness. In the following test render, note the back window behind the staircase; the wobbliness of the front glass is most noticeable by the effect it has on how the back window’s straight frame looks:

Figure 24: Front view of the cabin. Note how the wobbliness of the front window glass slightly distorts some of the cabin interior's appearance.

To integrate the cabin into the scene, I expanded the footpath further to reach the cabin and expanded it out into a small sort of pad surrounding the cabin. One major mistake I made in putting together this scene though was that I didn't keep track of the scale of anything, so finding the appropriate scale for the cabin to sit well in the scene took a few tries; for future projects I'll have to keep track of everything's scale more closely.

Overall I think compared with the hot air balloon idea, the cabin does a much better job of integrating into the scene without overly drawing attention to itself while still being visually interesting:

Figure 25: Full scene with cabin integrated in, and also with final camera angle.

Much like in the Shipshape Art Challenge project, I used PxrSurface exclusively as the shading model for everything in this project. However, for the cabin, using PxrSurface became kind of cumbersome because the shader parameterization that Evermotion assets come with doesn’t match up directly with PxrSurface’s parameterization. Specifically, Evermotion assets typically come with Vray, Corona, or generic “PBR” materials. Corona and generic “PBR” materials are the easiest to adapt to other renderers because they both follow the roughness/metallic/basecolor workflow that was originally introduced by the Disney BRDF [Burley 2012] and has since become the de-facto norm across many popular and widely-used physically based shading systems [Karis 2013, Kulla and Estevez 2017, Georgiev et al. 2019, Häussler et al. 2020, Noguer et al. 2021, Stockner 2022, Andersson et al. 2024]1. PxrSurface, however, notably does not follow the roughness/metallic/basecolor parameterization [Hery et al. 2017]; roughness is the same in PxrSurface as in pretty much every other modern physical shading model, but instead of metallic, PxrSurface uses a combination of face color, edge color, and fresnel exponent to control how metallic a material looks. In order to convert metallic/basecolor maps into something PxrSurface could use, the approach I took was:

\[diffuseColor_{PxrSurface} = baseColor_{PBR} * (1-metallic_{PBR}) \]
\[specFaceColor_{PxrSurface} = baseColor_{PBR} * metallic_{PBR} \]
\[specEdgeColor_{PxrSurface} = 1 \]

When creating textures from scratch using Substance Painter, as I did for the Shipshape project, PxrSurface's parameterization isn't an issue because Substance Painter has built-in export templates to save out textures in PxrSurface's expected parameterization even though Substance Painter uses a standard metallic/basecolor workflow. When bringing in existing metallic/basecolor textures though, the above conversion has to be done within the material's node network. I had to wire up a small node subnetwork to do this for every single Evermotion material that had metallic, which got really old really quickly. At least as of RenderMan 23 (the current version while I was working on this project), RenderMan doesn't have any utility node to do this conversion from metallic/basecolor to PxrSurface's diffuse/specular face/specular edge color parameterization2; such a node would be really helpful! Alternatively, a full implementation of the Disney BSDF [Burley 2015] (as opposed to the reduced-parameter Disney BRDF that is already available) would be useful too since a full implementation of the Disney BSDF would both be able to accept roughness/metallic/basecolor parameterized inputs and provide the full range of abilities that a production-quality shading model needs.
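For reference, here's a rough sketch of the kind of node subnetwork I mean, written as a Maya Python snippet instead of wired up by hand. The texture node and PxrSurface node names here are hypothetical, and the PxrSurface attribute names are assumptions that may need adjusting for a given RenderMan version:

```python
# Sketch of a metallic/basecolor -> PxrSurface conversion subnetwork in Maya.
# Assumes base_color_file and metallic_file are existing file texture nodes and
# pxr_surface is an existing PxrSurface node; node and attribute names are assumed.
import maya.cmds as cmds

def wire_metallic_workflow(base_color_file, metallic_file, pxr_surface):
    # diffuseColor = baseColor * (1 - metallic)
    invert = cmds.shadingNode("reverse", asUtility=True)
    cmds.connectAttr(metallic_file + ".outColor", invert + ".input")
    diffuse_mult = cmds.shadingNode("multiplyDivide", asUtility=True)
    cmds.connectAttr(base_color_file + ".outColor", diffuse_mult + ".input1")
    cmds.connectAttr(invert + ".output", diffuse_mult + ".input2")
    cmds.connectAttr(diffuse_mult + ".output", pxr_surface + ".diffuseColor")

    # specFaceColor = baseColor * metallic, specEdgeColor = white
    spec_mult = cmds.shadingNode("multiplyDivide", asUtility=True)
    cmds.connectAttr(base_color_file + ".outColor", spec_mult + ".input1")
    cmds.connectAttr(metallic_file + ".outColor", spec_mult + ".input2")
    cmds.connectAttr(spec_mult + ".output", pxr_surface + ".specularFaceColor")
    cmds.setAttr(pxr_surface + ".specularEdgeColor", 1, 1, 1, type="double3")

# Example usage with hypothetical node names:
# wire_metallic_workflow("baseColorFile", "metallicFile", "pxrSurface1")
```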

Mist and Atmospherics

With the surface geometry and shading for the scene complete, the last things to do were atmospherics and final lighting. Over the last two RenderMan Art Challenges, I’ve evolved my atmospherics and volume rendering workflow a lot. For the Woodville Art Challenge, I relied on a combination of homogeneous mist and heterogeneous fog made procedurally using Maya’s volume noise nodes plugged into the density field of a PxrVolume. This workflow was simple, but ultimately challenging to control and slow to iterate on since seeing what anything looked like required doing a full render. For the Shipshape Art Challenge, I moved to an entirely VDB-based volumes workflow, even for mist and stuff. Using an entirely VDB-based workflow allowed for more direct control, and because RenderMan for Maya can preview VDBs in Maya’s viewport, I was able to get a good sense of how volumes were placed and how they massed together in the scene without having to wait for full renders.

For this project, the volumes workflow was similar to the one on the Shipshape Art Challenge; everything is VDB-based, even the background haze. In making the mist and atmospherics in this scene, I knew I wanted to break things down into three distinct types of volumes: background haze, foreground godrays and atmospherics, and some mist low down clinging to the ground. I broke out and rendered these three types of volumes in their own passes/layers to give more control during compositing.

Here is the background haze layer, which is mostly uniform but does have a small amount of variation that creates a bit of a cascading effect:

Figure 26: Background haze volumetrics layer.

For the main foreground godrays and atmospherics layer, I used similar types of volumes as in the background haze layer, but with a different lighting scheme. The main lighting setup for the entire scene is basically just a single super high-resolution skydome IBL with a sun baked in. However, I wanted the godrays to be more prominent than could be achieved by just using the skydome IBL as-is. So, for this layer, I created a separate distant light with an orientation matched to the IBL's sun position, cranked up the exposure of the distant light, and twiddled with the size (angle) of the distant light until the godrays became more visually prominent:

Figure 27: Foreground godrays and atmospherics layer.
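As a side note, matching the distant light's orientation to the IBL's sun by eye can be a bit fiddly; one way to get a starting orientation is to find the brightest pixel in the lat-long HDR and convert its pixel coordinates into a direction vector. Here's a small sketch of that idea, assuming the environment map has already been loaded into a numpy array (the exact direction convention depends on how the skydome is oriented in the scene):

```python
# Sketch: estimate a sun direction from an equirectangular (lat-long) HDR.
# env is assumed to be a (height, width, 3) float numpy array, loaded elsewhere.
import numpy as np

def estimate_sun_direction(env):
    # Find the brightest pixel by luminance.
    luminance = 0.2126 * env[..., 0] + 0.7152 * env[..., 1] + 0.0722 * env[..., 2]
    y, x = np.unravel_index(np.argmax(luminance), luminance.shape)
    height, width = luminance.shape

    # Convert pixel coordinates to spherical angles, then to a unit direction;
    # theta runs from the zenith (top of the image) down to the nadir.
    phi = 2.0 * np.pi * (x + 0.5) / width
    theta = np.pi * (y + 0.5) / height
    return np.array([np.sin(theta) * np.cos(phi),
                     np.cos(theta),
                     np.sin(theta) * np.sin(phi)])
```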

The low ground-hugging mist is made up of VDBs with higher-frequency detail than the previous two layers. There are only a handful of unique VDBs here, but I instanced them around a bunch using native Maya instancing to cover the entire valley floor. I really like how the low mist settles down in between the lupine flowers and floods the footpath area:

Figure 28: Low ground-hugging mist layer.

Finally, here's what the three volumetrics layers look like stacked together, with the per-layer adjustments used in the final composite. One thing that I think is kind of neat about looking at the composited volumes together is that we can actually clearly see the shadow cast by the cabin; this detail isn't necessarily clear in any of the volumes layers in isolation, but becomes readily apparent once all three are stacked since each layer brings something different into the whole:

Figure 29: All volumetrics layers stacked together with per-layer adjustments used in the final composite.
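Since each of these volume layers was rendered out separately over black, the simplest way to think about stacking them is as a per-layer weighted add on top of the beauty render; the sketch below is just that idea in numpy-array form, with placeholder gains standing in for the per-layer comp adjustments, and it deliberately ignores the layers' alpha (a rough approximation for thin atmospherics, not a full over composite):

```python
# Sketch: additively stack separately rendered volume layers over a beauty render.
# Each input is assumed to be a (height, width, 3) float numpy array in linear space.

def comp_volume_layers(beauty, haze, godrays, ground_mist,
                       haze_gain=1.0, godray_gain=1.0, mist_gain=1.0):
    # Per-layer gains stand in for the per-layer adjustments made in comp;
    # this ignores volume alpha, so it is only an approximation of the real comp.
    return (beauty
            + haze_gain * haze
            + godray_gain * godrays
            + mist_gain * ground_mist)
```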

The clouds in the background in the final render are just part of the skydome IBL; in the future I’m hoping to build background cloudscapes from scratch, but that’ll have to be for a future art exercise.

Final Lighting

The final lighting setup for this scene was super simple; effectively the whole scene is lit by just a simple sun + skydome IBL setup. My goal was to keep the lighting as naturalistic as possible, so a simple sun + sky setup fit the bill, and even with the simple setup, the complexity of the forest geometry still provided a lot of interesting areas of highlight and shadow. For final lighting I did split out the sun as a distant light from the skydome IBL, much like what I did for the volumetrics passes, and I painted out the baked-in sun in the skydome IBL. The reason for splitting out the sun in this case wasn’t to adjust its brightness independently from the sky though, but instead was to allow for slightly adjusting the color temperature of the sun without affecting the fill provided by the sky. The sun is just a touch warmer in the final lighting than it was in all of the in-progress renders up to this point.

There are a bunch of practical lights in the cabin as well; the interior of the cabin is actually lit only using the practical lights and contains no other stage lighting. I initially calibrated all of the practical lights against the sun's brightness, but that wound up making the interior of the cabin a bit too dark to be visible on a bright sunny day, so I cheated and applied a uniform exposure boost to all of the cabin's practical lights to brighten up the inside of the cabin.
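Applying that kind of uniform boost is easy to script rather than adjusting each light by hand; here's a minimal Maya Python sketch, assuming the cabin's practical lights follow a naming convention and expose a standard exposure attribute measured in stops (the naming pattern here is hypothetical):

```python
# Sketch: uniformly boost the exposure of a group of practical lights in Maya.
# The "cabinLight*" naming pattern is a hypothetical convention.
import maya.cmds as cmds

def boost_practical_exposure(name_pattern="cabinLight*", stops=2.0):
    # Exposure is measured in stops, so +2 stops is a 4x increase in brightness.
    for node in cmds.ls(name_pattern) or []:
        shapes = cmds.listRelatives(node, shapes=True, fullPath=True) or [node]
        for shape in shapes:
            if cmds.attributeQuery("exposure", node=shape, exists=True):
                current = cmds.getAttr(shape + ".exposure")
                cmds.setAttr(shape + ".exposure", current + stops)
```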

Putting everything together with the volumetrics produced the following main daylight version of the scene. I also did a small amount of color correction in Lightroom, but in this case the color correction was limited to some basic contrast and saturation adjustments. The color correction for this piece wasn’t nearly as extensive as what I did for the Woodville and Shipshape projects. Here is the final piece, followed by a 50% grey clay shaded version with the same final lighting, comp, and color correction:

Figure 30: Final piece, main daylight variant. Click for 4K version.

Figure 31: Final lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version.

I had some fun with making some alternate time-of-day versions of the final lighting setup. These were quick and just for fun, so the sun angle is the same as in the daylight version even though in reality that would make zero sense. I kept the sun angle the same in the variants since they use the same volumetrics passes, just color corrected differently. These variants were made quickly on a whim, which is why I didn’t put in the additional effort to redo everything to work with more plausible sun angles.

Here’s a late afternoon golden hour variant, followed by a 50% grey clay shaded version. A lot more of the look of this variant was achieved by extensive color correction in Lightroom. I think this variant shows off the godray volumetrics better than the main daylight variant, even if the angle of the godrays and by extension the angle of the sun don’t make much sense for golden hour:

Figure 32: Final piece, alternate afternoon golden hour variant. Click for 4K version.

Figure 33: Alternate afternoon golden hour lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version.

The other variant I played with is a cooler overcast morning. A lot of the look of this variant also comes from Lightroom, possibly even more so than for the golden hour variant. I also boosted the contribution of the low ground-hugging mist in this variant to make it more visible, since low mist is typically characteristic of overcast mornings. I left the godrays in, even though having godrays at all on an overcast morning is fairly implausible; again, these two variants were quick and just for fun.

Here’s the overcast morning variant, followed by a 50% grey clay shaded version:

Figure 34: Final piece, alternate overcast morning variant. Click for 4K version.

Figure 35: Alternate morning overcast lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version.

Conclusion

I learned a lot from this project; the goal of the project was to figure out how to put together a detailed forest environment that holds up well at relatively close camera distances, and I think the end result accomplishes that well enough. The final image in this project is decent- it's not the most narratively interesting image, but it makes for a passable archviz type piece; to be honest though, a nice looking image was somewhat secondary to the goal of figuring out how to build a scene like this. Maya's version of the XGen toolset makes building this type of scene relatively easy and the XGen toolset is super powerful, even if it doesn't feel particularly native to the rest of Maya. XGen certainly feels a lot more scalable for this type of project than MASH, which I've run into performance problems with on scenes much smaller than this one. At the same time though, in my opinion, Maya's version of XGen isn't nearly as easy to use and isn't nearly as featureful as tools like ForestPack for 3ds Max, which remains the gold standard for instancing/scatter tools, or the Geo-Scatter plugin for Blender, or Houdini's native toolset.

At the end of the (metaphorical) day, I had a lot of fun making this piece! The process was much more experimental and aimless compared with working on RenderMan Art Challenge pieces, which was a nice change of pace for an art project. I learned a bunch of techniques that I’m sure I’ll make use of again in the future, and I got in more practice, which is ultimately the most important part of getting better at anything. I’m really happy that I’ve more or less figured out a reliable way of making good trees now, given how much time I spent a long time ago trying to figure out how to make decent trees.

Finally, here is a progression video I put together from all of the test and in-progress renders that I made throughout this entire project:

Figure 36: Progression reel made from test and in-progress renders leading up to my final image.

References

Zap Andersson, Paul Edmondson, Julien Guertault, Adrien Herubel, Alan King, Peter Kutz, Andréa Machizaud, Jamie Portsmouth, Frédéric Servant, and Jonathan Stone. 2024. OpenPBR Surface Specification. Academy Software Foundation white paper.

Autodesk. 2020. XGen Geometry Instancer. Autodesk Maya 2020 User Documentation.

Autodesk. 2020. XGen Interactive Grooming. Autodesk Maya 2020 User Documentation.

Brent Burley and Dylan Lacewell. 2008. Ptex: Per-face Texture Mapping for Production Rendering. Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2008), 27, 4 (2008), 1155-1164.

Brent Burley. 2012. Physically Based Shading at Disney. In ACM SIGGRAPH 2012 Course Notes: Practical Physically-Based Shading in Film and Game Production.

Brent Burley. 2015. Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering. In ACM SIGGRAPH 2015 Course Notes: Physically Based Shading in Theory and Practice.

Brent Burley. 2019. On Histogram-Preserving Blending for Randomized Texture Tiling. Journal of Computer Graphics Techniques. 8, 4 (2019), 31-53.

Iliyan Georgiev, Jamie Portsmouth, Zap Andersson, Adrien Herubel, Alan King, Shinji Ogaki, and Frederic Servant. 2019. Autodesk Standard Surface. Autodesk white paper.

Tobias Häussler, Holger Dammertz, and Bastian Sdorra. 2020. Enterprise PBR Shading Model. Dassault Systèmes white paper.

Eric Heitz and Fabrice Neyret. 2018. High-Performance By-Example Noise using a Histogram-Preserving Blending Operator. Proceedings of the ACM on Computer Graphics and Interactive Techniques. 1, 2 (2018), 31:1-31:25.

Christophe Hery, Ryusuke Villemin, and Junyi Ling. 2017. Pixar’s Foundation for Materials: PxrSurface and PxrMarschnerHair. In ACM SIGGRAPH 2017 Course Notes: Physically Based Shading in Theory and Practice.

Brian Karis. 2013. Real Shading in Unreal Engine 4. In ACM SIGGRAPH 2013 Course Notes: Physically-Based Shading in Theory and Practice.

Christopher Kulla and Alejandro Conty Estevez. 2017. Revisiting Physically Based Shading at Imageworks. In ACM SIGGRAPH 2017 Course Notes: Physically Based Shading in Theory and Practice.

Jeremie Noguer, Paul Edmondson, Michael Bond, and Peter Kutz. 2021. Adobe Standard Material Specification. Adobe white paper.

Lukas Stockner. 2022. The new Principled BSDF Model in Cycles. In BCON 2022: Blender Conference 2022.

Thomas V. Thompson, Ernest J. Petti, and Chuck Tappan. 2003. XGen: Arbitrary Primitive Generator. In ACM SIGGRAPH 2003 Sketches & Applications.

Michael Todd. 2013. Interactive Grooming System: User Friendly Hair & Fur. Michael Todd Portfolio.

Walt Disney Animation Studios. 2011. SeExpr.


Footnotes

1 Additional post-2020 references added in update to post in May 2025.

2 As of RenderMan 24, released in 2021, RenderMan does now have a PxrMetallicWorkflow node that converts metallic/basecolor inputs into the diffuse/specular face/specular edge color inputs that PxrSurface expects. RenderMan 24 also includes a new implementation of the full version of the Disney BSDF.

RenderMan Art Challenge: Shipshape

Last year, I participated in one of Pixar’s RenderMan Art Challenges as a way to learn more about modern RenderMan [Christensen et al. 2018] and as a way to get some exposure to tools outside of my normal day-to-day toolset (Disney’s Hyperion Renderer professionally, Takua Renderer as a hobby and learning exercise). I had a lot of fun, and wound up doing better in the “Woodville” art challenge contest than I expected to! Recently, I entered another one of Pixar’s RenderMan Art Challenges, “Shipshape”. This time around I entered just for fun; since I had so much fun last time, I figured why not give it another shot! That being said though, I want to repeat the main point I made in my post about the previous “Woodville” art challenge: I believe that for rendering engineers, there is enormous value in learning to use tools and renderers that aren’t the ones we work on ourselves. Our field is filled with brilliant people on every major rendering team, and I find both a lot of useful information/ideas and a lot of joy in seeing the work that friends and peers across the field have put into commercial renderers such as RenderMan, Arnold, Vray, Corona, and others.

As usual for the RenderMan Art Challenges, Pixar supplied some base models without any UVs, texturing, shading, lighting or anything else, and challenge participants had to start with the base models and come up with a single compelling image for a final entry. I had a lot of fun spending evenings and weekends throughout the duration of the contest to create my final image, which is below. I got to explore and learn a lot of new things that I haven't tried before, which this post will go through. To my enormous surprise, this time around my entry won first place in the contest!

Figure 1: My entry to Pixar's RenderMan Shipshape Art Challenge, titled "Oh Good, The Bus is Here". Click for 4K version. Base ship, robot, and sextant models are from Pixar; all shading, lighting, additional modeling, and environments are mine. Ship concept by Ian McQue. Robot concept by Ruslan Safarov. Models by Cheyenne Chapel, Aliya Chen, Damian Kwiatkowski, Alyssa Minko, Anthony Muscarella, and Miguel Zozaya © Disney / Pixar - RenderMan "Shipshape" Art Challenge.

Initial Explorations

For this competition, Pixar provided five models: a futuristic scifi ship based on an Ian McQue concept, a robot based on a Ruslan Safarov concept, an old wooden boat, a butterfly, and a sextant. The fact that one of the models was based on an Ian McQue concept was enough to draw me in; I’ve been a big fan of Ian McQue’s work for many years now! I like to start these challenges by just rendering the provided assets as-is from a number of different angles, to try to get a sense of what I like about the assets and how I will want to showcase them in my final piece. I settled pretty quickly on wanting to focus on the scifi ship and the robot, and leave the other three models aside. I did find an opportunity to bring in the sextant in my final piece as well, but wound up dropping the old wooden boat and the butterfly altogether. Here are some simple renders showing what was provided out of the box for the scifi ship and the robot:

Figure 2: Scifi ship base model provided by Pixar, rendered against a white cyclorama background using a basic skydome.

Figure 3: Robot base model provided by Pixar, rendered against a white cyclorama background using a basic skydome.

I initially had a lot of trouble settling on a concept and idea for this project; I actually started blocking out an entirely different idea before pivoting to the idea that eventually became my final image. My initial concept included the old wooden boat in addition to the scifi ship and the robot; this initial concept was called "River Explorer". My initial instinct was to try to show the scifi ship from a top-down view, in order to get a better view of the deck-boards and the big VG engine and the crane arm. I liked the idea of putting the camera at roughly forest canopy height, since forest canopy height is a bit of an unusual perspective for most photographs due to canopy height being this weird height that is too high off the ground for people to shoot from, but too low for helicopters or drones to be practical either. My initial idea was about a robot-piloted flying patrol boat exploring an old forgotten river in a forest; the ship would be approaching the old sunken boat in the river water. With this first concept, I got as far as initial compositional blocking and initial time-of-day lighting tests:

Figure 4: Initial "River Explorer" concept, daylight lighting test.

Figure 5: Initial "River Explorer" concept, dusk lighting test.

If you’ve followed my blog for a while now, those pine trees might look familiar. They’re actually the same trees from the forest scene I used a while back, ported from Takua’s shading system to RenderMan’s PxrSurface shader.

I wasn’t ever super happy with the “River Explorer” concept; I think the overall layout was okay, but it lacked a sense of dynamism and overall just felt very static to me, and the robot on the flying scifi ship felt kind of lost in the overall composition. Several other contestants wound up also going for similar top-down-ish views, which made me worry about getting lost in a crowd of similar-looking images. After a week of trying to get the “River Explorer” concept to work better, I started to play with some completely different ideas; I figured that this early in the process, a better idea was worth more than a week’s worth of sunk time.

Layout and Framing

I had started UV unwrapping the ship already, and whilst tumbling around the ship unwrapping all of the components one-by-one, I got to see a lot more of the ship and a lot more interesting angles, and I suddenly came up with a completely different idea for my entry. The idea that popped into my head was to have a bunch of the little robots waiting to board one of the flying ships at a quay or something of the sort. I wanted to convey a sense of scale between the robots and the flying scifi ship, so I tried putting the camera far away and zooming in using a really long lens. Since long lenses have the effect of flattening perspective a bit, using a long lens helped make the ships feel huge compared to the robots. At this point I was just doing very rough, quick, AO render “sketches”. This is the AO sketch where my eventual final idea started:

Figure 6: Rough AO render "sketch" that eventually evolved into my final idea.

I’ve always loved the idea of the mundane fantastical; the flying scifi ship model is fairly fantastical, which led me to want to do something more everyday with them. I thought it would be fun to texture the scifi ship model as if it was just part of a regular metro system that the robots use to get around their world. My wife, Harmony, suggested a fun idea: set the entire scene in drizzly weather and give two of the robots umbrellas, but give the third robot a briefcase instead and have the robot use the briefcase as a makeshift umbrella, as if it had forgotten its umbrella at home. The umbrella-less robot’s reaction to seeing the ship arriving provided the title for my entry- “Oh Good, The Bus Is Here”. Harmony also pointed out that the back of the ship has a lot more interesting geometric detail compared to the front of the ship, and suggested placing the focus of the composition more on the robots than on the ships. To incorporate all of these ideas, I played more with the layout and framing until I arrived at the following image, which is broadly the final layout I used:

Figure 7: Rough AO render "sketch" of my final layout.

I chose to put an additional ship in the background flying away from the dock for two main reasons. First, I wanted to be able to showcase more of the ship, since the front ship is mostly obscured by the foreground dock. Second, the background ship helps fill out and balance the right side of the frame more, which would otherwise have been kind of empty.

In both this project and in the previous Art Challenge, my workflow for assembling the final scene relies heavily on Maya's referencing capabilities. Each separate asset is kept in its own .ma file, and all of the .ma files are referenced into the main scene file. The only things the main scene file contains are references to assets, along with scene-level lighting, overrides, and global-scale effects such as volumes and, in the case of this challenge, the rain streaks. So, even though the flying scifi ship appears in my scene twice, it is actually just the same .ma file referenced into the main scene twice instead of two separate ships.
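For anyone unfamiliar with Maya referencing, the core of this workflow really just boils down to one call per asset; here's a small sketch of what referencing an asset file under its own namespace looks like (the file paths and namespace names are hypothetical):

```python
# Sketch: reference asset .ma files into the current scene, each under a namespace.
# File paths and namespace names here are hypothetical.
import maya.cmds as cmds

def reference_asset(asset_path, namespace):
    return cmds.file(asset_path, reference=True, namespace=namespace)

# The same asset file can be referenced multiple times, each time under a unique
# namespace; this is how a single ship .ma file can appear in the scene twice.
# reference_asset("assets/scifiShip.ma", "shipForeground")
# reference_asset("assets/scifiShip.ma", "shipBackground")
```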

The idea of a rainy scene largely drove the later lighting direction of my entry; from this point I basically knew that the final scene was going to have to be overcast and drizzly, with a heavy reliance on volumes to add depth separation into the scene and to bring out practical lights on the ships. I had a lot of fun modeling out the dock and gangway, and may have gotten slightly carried away. I modeled every single bolt and rivet that you would expect to be there in real life, and I also added lampposts to use later as practical light sources for illuminating the dock and the robots. Once I had finished modeling the dock and had made a few more layout tweaks, I arrived at a point where I was happy to start with shading and initial light blocking. Zoom in if you want to see all of the rivets and bolts and stuff on the dock:

Figure 8: AO render of my layout going into shading and lighting. Check out all of the crazy detail on the dock that I modeled!

UV Unwrapping

UV unwrapping the ship took a ton of time. For the last challenge, I relied on a combination of manual UV unwrapping by hand in Maya and using Houdini's Auto UV SOP, but I found that the Auto UV SOP didn't work as well on this challenge due to the ship and robot having a lot of strange geometry with really complex topology. On the treehouse in the last challenge, everything was more or less some version of a cylinder or a rectangular prism, with some morphs and warps and extra bits and bobs applied. Almost every piece of the ship aside from the floorboards is a very complex shape that isn't easy to find good seams for, so the Auto UV SOP wound up making a lot of choices for UV cuts that I didn't like. As a result, I basically manually UV unwrapped this entire challenge in Maya.

A lot of the complex undercarriage type stuff around the back thrusters on the ship was really insane to unwrap. The muffler manifold and mechanical parts of the crane arm were difficult too. Fortunately though, the models came with subdivision creases, and a lot of the subd crease tags wound up proving to be useful hints towards good places to place UV edge cuts. I also found that the new and improved UV tools in Maya 2020 performed way better than the UV tools in Maya 2019. For some meshes, I manually placed UV cuts and then used the unfold tool in Maya 2020, which I found generally worked a lot better than Maya 2019's version of the same tool. For other meshes, Maya 2020's auto unwrap actually often provided a useful starting place as long as I rotated the piece I was unwrapping into a more-or-less axis-aligned orientation and froze its transform. After using the auto-unwrap tool, I would then transfer the UVs back onto the piece in its original orientation using Maya's Mesh Transfer Attributes tool. The auto unwrap tended to cut meshes into too many UV islands, so I would then re-stitch islands together and place new cuts where appropriate.
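The UV transfer step can also be scripted; here's a rough sketch of the sort of call involved using Maya's transferAttributes command (the mesh names are hypothetical, and the sampleSpace value may need adjusting depending on the setup):

```python
# Sketch: copy UVs from an unwrapped, axis-aligned duplicate back onto the
# original mesh. Mesh names are hypothetical; flag values may need adjusting.
import maya.cmds as cmds

def copy_uvs(source_mesh, target_mesh):
    # transferUVs=2 transfers all UV sets; sampleSpace=5 samples by topology,
    # which is appropriate when the source and target share the same topology.
    cmds.transferAttributes(source_mesh, target_mesh,
                            transferUVs=2, transferColors=0, sampleSpace=5)
    # Delete construction history so that the transferred UVs are baked down.
    cmds.delete(target_mesh, constructionHistory=True)

# copy_uvs("hullPiece_unwrapped", "hullPiece")
```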

When UV unwrapping, a good test to see how good the resultant UVs are is to assign some sort of a checkerboard grid texture to the model and look for distortion in the checkerboard pattern. Overall I think I did an okay job here; not terrible, but could be better. I think I managed to hide the vast majority of seams pretty well, and the total distortion isn’t too bad (if you look closely, you’ll be able to pick out some less than perfect areas, but it was mostly okay). I wound up with a high degree of variability in the grid size between different areas, but I wasn’t too worried about that since my plan was to adjust texture resolutions to match.

Figure 9: Checkerboard test for my UV unwrapping of the scifi ship.
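Setting up this kind of checkerboard test is quick to script in Maya too; here's a minimal sketch that builds a simple lambert with a checker texture and assigns it to the current selection, just for viewport checking (the repeat count is an arbitrary example value):

```python
# Sketch: assign a simple checkerboard test material to the selected meshes in Maya.
import maya.cmds as cmds

def assign_checker_test(repeats=20):
    shader = cmds.shadingNode("lambert", asShader=True, name="uvCheckTest")
    checker = cmds.shadingNode("checker", asTexture=True)
    place2d = cmds.shadingNode("place2dTexture", asUtility=True)
    cmds.connectAttr(place2d + ".outUV", checker + ".uvCoord")
    cmds.connectAttr(place2d + ".outUvFilterSize", checker + ".uvFilterSize")
    cmds.setAttr(place2d + ".repeatU", repeats)
    cmds.setAttr(place2d + ".repeatV", repeats)
    cmds.connectAttr(checker + ".outColor", shader + ".color")
    shading_group = cmds.sets(renderable=True, noSurfaceShader=True,
                              empty=True, name=shader + "SG")
    cmds.connectAttr(shader + ".outColor", shading_group + ".surfaceShader")
    cmds.sets(cmds.ls(selection=True), edit=True, forceElement=shading_group)

# Select the meshes to check, then run:
# assign_checker_test(repeats=20)
```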

After UV unwrapping the ship, UV unwrapping the robot proved to be a lot easier in comparison. Many parts of the robot turn out to be the same mesh just duplicated and squash/stretch/scaled/rotated, which means that they share the same underlying topology. For all parts that share the same topology, I was able to just UV unwrap one of them, and then copy the UVs to all of the others. One great example is the robot’s fingers; most components across all fingers shared the same topology. Here’s the checkerboard test applied to my final UVs for the robot:

Figure 10: Checkerboard test for my UV unwrapping of the robot.

Texturing the Ship

After trying out Substance Painter for the previous RenderMan Art Challenge and getting fairly good results, I went with Substance Painter again on this project. The overall texturing workflow I used on this project was actually a lot simpler compared with the workflow I used for the previous Art Challenge. Last time I tried to leave a lot of final decisions about saturation and hue and whatnot as late as possible, which meant moving those decisions into the shader so that they could be changed at render-time. This time around, I decided to make those decisions upfront in Substance Painter; doing so makes the Substance Painter workflow much simpler since it means I can just paint colors directly in Substance Painter like a normal person would, as opposed to painting greyscale or desaturated maps in Substance Painter that are expected to be modulated in the shader later. Also, because of the nature of the objects in this project, I actually used very little displacement mapping; most detail was brought in through normal mapping, which makes more sense for hard surface metallic objects. Not having to worry about any kind of displacement mapping simplified the Substance Painter workflow a bit more too, since that was one fewer texture map type I had to worry about managing.

On the last challenge I relied on a lot of Quixel Megascans surfaces as starting points for texturing, but this time around I (unintentionally) found myself relying on Substance smart materials more for starting points. One thing I like about Substance Painter is how it comes with a number of good premade smart materials, and there are even more good smart materials on Substance Source. Importantly though, I believe that smart materials should only serve as a starting point; smart materials can look decent out-of-the-box, but to really make texturing shine, a lot more work is required on top of the out-of-the-box result in order to really create story and character and a unique look in texturing. I don't like when I see renders online where a smart material was applied and left in its out-of-the-box state; something gets lost when I can tell which default smart material was used at a glance! For every place that I used a smart material in this project, I used a smart material (or several smart materials layered and kitbashed together) as a starting point, but then heavily customized on top with custom paint layers, custom masking, decals, additional layers, and often even heavy custom modifications to the smart material itself.

Figure 11: Texturing the main piece of the ship's hull in Substance Painter.

I was originally planning on using a UDIM workflow for bringing the ship into Substance Painter, but I wound up with so many UDIM tiles that things quickly became unmanageable and Substance Painter ground to a halt with a gigantic file containing 80 (!!!) 4K UDIM tiles. To work around this, I broke up the ship into a number of smaller groups of meshes and brought each group into Substance Painter separately. Within each group I was able to use a UDIM workflow with usually between 5 to 10 tiles.

I had a lot of fun creating custom decals to apply to various parts of the ships and to some of the robots; even though a lot of the details and decals aren't very visible in the final image, I still put a good amount of time into making them simply to keep things interesting for myself. All of the decals were made in Photoshop and Illustrator and then brought into Substance Painter along with opacity masks and applied to surfaces using Substance Painter's projection mode, either in world space or in UV space depending on the situation. In Substance Painter, I created a new layer with a custom paint material and painted the base color for the paint material by projecting the decal, and then masked the decal layer using the opacity mask I made using the same projection that I used for the base color. The "Seneca" logo seen throughout my scene has shown up on my blog before! A few years ago on a Minecraft server that I played a lot on, a bunch of other players and I had a city named Seneca; ever since then, I've tried to sneak in little references to Seneca in projects here and there as a small easter egg.

Many of the buses around where I live have an orange and silver color scheme, and while I was searching the internet for reference material, I also found pictures of the Glasgow Subway's trains, which have an orange and black and white color scheme. Inspired by the above, I picked an orange and black color scheme for the ship's Seneca Metro livery. I like orange as a color, and I figured that orange would bring a nice pop of color to what was going to be an overall relatively dark image. I made the upper part of the hull orange but kept the lower part of the hull black since the black section was going to be the backdrop that the robots would be in front of in the final image; the idea was that keeping that part of the hull darker would allow the robots to pop a bit more visually.

One really useful trick I used for masking different materials was to just follow edgeloops that were already part of the model. Since everything in this scene is very mechanical anyway, following straightedges in the UVs helps give everything a manufactured, mechanical look. For example, Figure 12 shows how I used Substance Painter’s Polygon Fill tool to mask out the black paint from the back metal section of the ship’s thrusters. In some other cases, I added new edgeloops to the existing models just so I could follow the edgeloops while masking different layers.

Figure 12: Masking in the metal section of the ship's thrusters by following existing edgeloops using Substance Painter's Polygon Fill tool.

Shading the Ship

For the previous Art Challenge, I used a combination of PxrDisney and PxrSurface shaders; this time around, in order to get a better understanding of how PxrSurface works, I opted to go all-in on using PxrSurface for everything in the scene. Also, for the rain streaks effect (discussed later in this post), I needed some features that are available in the extended Disney Bsdf model [Burley 2015] and in PxrSurface [Hery et al. 2017], but RenderMan 23 only implements the base Disney Brdf [Burley 2012] without the extended Bsdf features; this basically meant I had to use PxrSurface.

One of the biggest differences I had to adjust to was how metallic color is controlled in PxrSurface. The Disney Bsdf drives the diffuse color and metallic color using the same base color parameter and shifts energy between the diffuse/spec and metallic lobes using a “metallic” parameter, but PxrSurface separates the diffuse and metallic colors entirely. PxrSurface uses a “Specular Face Color” parameter to directly drive the metallic lobe and has a separate “Specular Edge Color” control; this parameterization reminds me a lot of Framestore’s artist-friendly metallic fresnel parameterization [Gulbrandsen 2014], but I don’t know if this is actually what PxrSurface is doing under the hood. PxrSurface also has two different modes for its specular controls: an “artistic” mode and a “physical” mode; I only used the artistic mode. To be honest, while PxrSurface’s extensive controls are extremely powerful and offer an enormous degree of artistic control, I found trying to understand what every control did and how they interacted with each other to be kind of overwhelming. I wound up paring back the set of controls I used back to a small subset that I could mentally map back to what the Disney Bsdf or VRayMtl or Autodesk Standard Surface [Georgiev et al. 2019] models do.

Fortunately, converting from the Disney Bsdf’s baseColor/metallic parameterization to PxrSurface’s diffuse/specFaceColor is very easy:

\[ diffuse = baseColor * (1 - metallic) \\ specFaceColor = baseColor * metallic \]

The only gotcha to look out for is that everything needs to be in linear space first. Alternatively, Substance Painter already has an output template for PxrSurface as well. Once I had the maps in the right parameterization, for the most part all I had to do was plug the right maps into the right parameters in PxrSurface and then make minor manual adjustments to dial in the look. In addition to two different specular parameterization modes, PxrSurface also supports choosing from a few different microfacet models for the specular lobes; by default PxrSurface is set to use the Beckmann model [Beckmann and Spizzichino 1963], but I selected the GGX model [Walter et al. 2007] for everything in this scene since GGX is what I'm more used to.
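For cases where pre-converting the texture maps offline is preferable to doing the conversion inside the shading network, the same math is easy to apply directly to the map pixels; here's a small sketch, assuming the maps have already been loaded as float numpy arrays and that the base color map is stored with an sRGB transfer curve (load and save the images with whatever image I/O library you prefer):

```python
# Sketch: convert baseColor/metallic maps into PxrSurface-style diffuse color and
# specular face color maps. Inputs are assumed to be float numpy arrays in [0, 1].
import numpy as np

def srgb_to_linear(c):
    # Standard piecewise sRGB-to-linear conversion, applied per channel.
    return np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)

def convert_to_pxrsurface(base_color_srgb, metallic):
    base_color = srgb_to_linear(base_color_srgb)
    # Broadcast a single-channel metallic map against the 3-channel base color.
    if metallic.ndim == 2:
        metallic = metallic[..., np.newaxis]
    diffuse_color = base_color * (1.0 - metallic)
    spec_face_color = base_color * metallic
    return diffuse_color, spec_face_color
```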

For the actual look of the ship, I didn’t want to go with the dilapidated look that a lot of the other contestants went with. Instead, I wanted the ship to look like it was a well-maintained working vehicle, but with all of the grime and scratches that build up over daily use. So, there are scratches and dust and dirt streaks on the boat, but nothing is actually rusting. I also modeled some glass for the windows at the top of the tower superstructure, and added some additional lamps to the top of the ship’s masts and on the tower superstructure for use in lighting later. After getting everything dialed in, here is the “dry” look of the ship:

Figure 13: Fully shaded "dry" look for the ship.

Here’s a close-up render of the back engine section of the ship, which has all kinds of interesting bits and bobs on it. The engine exhaust kind of looks like it could be a volume, but it’s not. I made the engine exhaust by making a bunch of cards, arranging them into a truncated cone, and texturing them with a blue gradient in the diffuse slot and a greyscale gradient in PxrSurface’s “presence” slot. The glow effect is done using the glow parameter in PxrSurface. The nice thing about using this more cheat-y approach instead of a real volume is that it’s way faster to render!

Figure 14: Fully shaded "dry" look for the back engine area of the ship.

Most of the ship’s metal components are covered over using a black, semi-matte paint material, but in areas that I thought would be subjected to high temperatures, such as exhaust vents or the inside of the thrusters or the many floodlights on the ship, I chose to use a beaten copper material instead. Basically wherever I wound up placing a practical light, the housing around the practical light is made of beaten copper. Well, I guess it’s actually some kind of high-temperature copper alloy or copper-colored composite material, since real copper’s melting point is lower than real steel’s melting point. The copper color had an added nice effect of making practical lights look more yellow-orange, which I think helps sell the look of engine thrusters and hot exhaust vents more.

Each exhaust vent and engine thruster actually contains two practical lights: one extremely bright light near the back of the vent or thruster pointing into the vent or thruster, and one dimmer but more saturated light pointing outwards. This setup produces a nice effect where areas deeper into the vent or thruster look brighter and yellower, while areas closer to the outer edge of the vent or thruster look a bit dimmer and more orange. The light pointing outwards also casts light outside of the vent or thruster, providing some neat illumination on nearby surfaces or volumes. Later in this post, I’ll write more about how I made use of this in the final image.

Figure 15: Wide view of the back of the ship, showing the practical lights in the ship's various engine thrusters and exhaust vents.

Here’s a turntable video of the ship, showcasing all of the texturing and shading that I did. I had a lot of fun taking care of all of the tiny details that are part of the ship, even though many of them aren’t actually visible in my final image. The dripping wet rain effect is discussed later in this post.

Figure 16: Turntable of the ship showing both dry and wet variants.

Shading and Texturing the Robots

For the robots, I used the same Substance Painter based texturing workflow and the same PxrSurface based shading workflow that I used for the ship. However, since the robot has far fewer components than the ship, I was able to bring all of the robot’s UDIM tiles into Substance Painter at once. The main challenge with the robots wasn’t the sheer quantity of parts that had to be textured, but instead was in the variety of robot color schemes that had to be made. In order to populate the scene and give my final image a sense of life, I wanted to have a lot of robots on the ships, and I wanted all of the robots to have different paint and color schemes.

I knew from an early point that I wanted the robot carrying the suitcase to be yellow, and I knew I wanted a robot in some kind of conductor’s uniform, but aside from that, I didn’t do much pre-planning for the robot paint schemes. As a result, coming up with different robot paint schemes was a lot of fun and involved a lot of just goofing around and improvisation in Substance Painter until I found ideas that I liked. To help unify how all of the robots looked and to help with speeding up the texturing process, I came up with a base metallic look for the robots’ legs and arms and various functional mechanical parts. I alternated between steel and copper parts to help bring some visual variety to all of the mechanical parts. The metallic parts are the same across all of the robots; the parts that vary between robots are the body shell and various outer casing parts on the arms:

Figure 17: Robot with steel and copper mechanical parts and yellow outer shell.

I wanted very different looks for the other two robots that are on the dock with the yellow robot. I gave one of them a more futuristic-looking glossy white shell with a subtle hexagon imprint pattern and red accents. The hexagon imprint pattern is created using a hexagon pattern in the normal map. The red stripes use the same edgeloop-following technique that I used for masking some layers on the ship. I made the other robot a matte green color, and I thought it would be fun to make him into a sports fan. He’s wearing the logo and colors of the local in-world sports team, the Seneca Senators! Since the robots don’t wear clothes per se, I guess maybe the sports team logo and numbers are some kind of temporary sticker? Or maybe this robot is such a big fan that he had the logo permanently painted on… I don’t know! Since I knew these two robots would be seen from the back in the final image, I made sure to put all of the interesting stuff on their sides and back.

Figure 18: Futuristic robot with glossy white outer shell and red accents.

Figure 19: Sports fan robot wearing the colors of the in-world team, the Seneca Senators.

For the conductor robot, I chose a blue and gold color scheme based on real world conductor uniforms I’ve seen before. I made the conductor robot overall a bit more cleaned up compared to the other robots, since I figured the conductor robot should look a bit more crisp and professional. I also gave the conductor robot a gold mustache, for a bit of fun! To complete the look, I modeled a simple conductor’s hat for the conductor robot to wear. I also made a captain robot, which has a white/black/gold color scheme derived from the conductor robot. The white/black/gold color scheme is based on old-school ship’s captain uniforms. The captain robot required a bit of a different hat from the conductor hat; I made the captain hat a little bigger and a little bit more elaborate, complete with gold stitching on the front around the Seneca Metro emblem. In the final scene you don’t really see the captain robots, since they wound up inside of the wheelhouse at the top of the ship’s tower superstructure, but hey, at least the captain robots were fun to make, and at least I know that they’re there!

Figure 20: Conductor robot with a blue and gold color scheme and a hat!

Figure 21: Captain robot with a white and black and gold color scheme and an even fancier hat.

As a bit of a joke, I tried making a poncho for one of the robots. I thought it would look very silly, which for me was all the more reason to try! To make the poncho, I made a big flat disc in Maya and turned it into nCloth, and just let it fall onto the robot with the robot’s geometry acting as a static collider. This approach basically worked out-of-the-box, although I made some manual edits to the geometry afterwards just to get the poncho to billow a bit more on the bottom. The poncho’s shader is a simple glass PxrSurface shader, with the bottom frosted section and smooth diamond-shaped window section both driven using just roughness. The crinkly plastic sheet appearance is achieved entirely through a wrinkle normal map. The poncho bot is also not really visible in the final image, but somewhere in the final image, this robot is in the background on the deck of the front ship behind some other robots!

Figure 22: Robot wearing a clear plastic poncho.

Don’t worry, I didn’t forget about the fact that the robots have antennae! For the poncho robot, I modeled a hole into the poncho for the antenna to pass through, and I modeled similar holes into the captain robot and conductor robot’s hats as well. Again, this is a detail that isn’t visible in the final image at all, but is there mostly just so that I can know that it’s there:

Figure 23: Antenna pass-through hole modeled into the poncho.

In total I created 12 unique robot variants, with some variants duplicated in the final image. All 12 variants are actually present in the scene! Most of them are in the background (and a few variants are only on the background ship), so most of them aren’t very visible in the final image. You, the reader, have probably noticed a theme in this post now where I put a lot of effort into things that aren’t actually visible in the final image… for me, a large part of this project wasn’t necessarily about the final image and was instead just about having fun and getting some practice with the tools and workflows.

Here is a turntable showcasing all 12 robot variants. In the turntable, only the yellow robot has both a wet and dry variant, since all of the other robots in the scene remembered their umbrellas and were therefore able to stay dry. The green sports fan robot does have a variant with a wet right arm though, since in the final image the green sports fan robot’s right arm is extended beyond the umbrella to wave at the incoming ship.

Figure 24: Turntable of the robots, with all 12 robot variants.

The Wet Shader

Going into the shading process, the single problem that worried me the most was how I was going to make everything in the rain look wet. Having a good wet look is extremely important for selling the overall look of a rainy scene. I actually wasn’t too worried about the base dry shading, since hard metal/plastic surfaces are one of the things that CG is really good at by default. By contrast, getting a good wet rainy look took an enormous amount of experimentation and effort, and wound up even involving some custom tools.

From a cursory search online, I found some techniques for creating a wet rainy look that basically work by modulating the primary specular lobe and applying a normal map to the base normal of the surface. However, I didn’t really like how this looked; in some cases, this approach basically makes it look like the underlying surface itself has rivulets and dots in it, not like there’s water running on top of the surface. My hunch was to use PxrSurface’s clearcoat lobe instead, since from a physically motivated perspective, water streaks and droplets behave more like an additional transparent refractive coating layer on top of a base surface. A nice bonus from trying to use the clearcoat lobe is that PxrSurface supports using different normal maps for each specular lobe; this way, I could have a specific water droplets and streaks normal map plugged into the bump normal parameter for the clearcoat lobe without having to disturb whatever normal map I had plugged into the bump normal parameter for the base diffuse and primary specular lobes. My idea was to create a single shading graph for creating the wet rainy look, and then plug this graph into the clearcoat lobe parameters for any PxrSurface that I wanted a wet appearance for. Here’s what the final graph looked like:

Figure 25: Shading graph for creating the wet rainy look. This graph plugs into the clearcoat parameters of any shader that I wanted to have a wet appearance.

In the graph above, note how the input textures are fed into PxrRemap nodes for ior, edge color, thickness, and roughness; this is so I can rescale the 0-1 range inputs from the textures to whatever they need to be for each parameter. The node labeled “mastercontrol” allows for disabling the entire wet effect by feeding 0.0 into the clearcoat edge color parameter, which effectively disables the clearcoat lobe.
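As a concrete way to read that graph, the remap and mastercontrol nodes are really just doing the following bit of arithmetic; the function names here are my own placeholders for illustration, not actual PxrRemap parameters:

```python
def remap01(x, out_min, out_max):
    """Rescale a 0-1 texture value into an arbitrary parameter range, e.g.
    mapping the greyscale wet map into a usable clearcoat thickness,
    roughness, or ior range. (A stand-in for what the PxrRemap nodes do.)"""
    return out_min + x * (out_max - out_min)

def gated_edge_color(edge_color, master_control):
    """The 'mastercontrol' gate: multiplying the clearcoat edge color by 0.0
    effectively turns the whole clearcoat (wet) lobe off."""
    return [c * master_control for c in edge_color]
```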

Having to manually connect this graph into all of the clearcoat parameters in each PxrSurface shader I used was a bit of a pain. Ideally I would have preferred if I could have just plugged all of the clearcoat parameters into a PxrLayer, disabled all non-clearcoat lobes in the PxrLayer, and then plugged the PxrLayer into a PxrLayerSurface on top of underlying base layers. Basically, I wish PxrLayerSurface supported enabling/disabling layers on a per-lobe basis, but this ability currently doesn’t exist in RenderMan 23. In Disney’s Hyperion Renderer, we support this functionality for sparsely layering Disney Bsdf parameters [Burley 2015], and it’s really really useful.

There are only four input maps required for the entire wet effect: a greyscale rain rivulets map, a corresponding rain rivulets normal map, a greyscale droplets map, and a corresponding droplets normal map. The rivulets maps are used for the sides of a PxrRoundCube projection node, while the droplets maps are used for the top of the PxrRoundCube projection node; this makes the wet effect look more like rain drop streaks the more vertical a surface is, and more like droplets splashing on a surface the more horizontal a surface is. Even though everything in my scene is UV mapped, I chose to use PxrRoundCube to project the wet effect on everything in order to make the wet effect as automatic as possible; to make sure that repetitions in the wet effect textures weren’t very visible, I used a wide transition width for the PxrRoundCube node and made sure that the PxrRoundCube’s projection was rotated around the Y-axis to not be aligned with any model in the scene.
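The sides-versus-top behavior comes for free from the PxrRoundCube projection, but the core idea is easy to sketch: blend between the rivulet and droplet lookups based on how vertical the surface normal is. This is only a conceptual illustration of what the projection is doing for me, not how PxrRoundCube is actually implemented:

```python
import numpy as np

def wet_map_lookup(normal, rivulet_value, droplet_value, transition_width=0.5):
    """Blend rivulet (vertical surfaces) and droplet (horizontal surfaces)
    texture samples based on surface orientation.

    normal: unit surface normal (3,), with +Y up.
    rivulet_value, droplet_value: samples from the two wet maps.
    transition_width: how soft the blend between side and top projections is.
    """
    up_amount = abs(normal[1])  # 1.0 on horizontal surfaces, 0.0 on vertical walls
    # Smoothstep-style falloff so the transition region isn't a hard seam.
    t = np.clip((up_amount - (1.0 - transition_width)) / transition_width, 0.0, 1.0)
    t = t * t * (3.0 - 2.0 * t)
    return (1.0 - t) * rivulet_value + t * droplet_value
```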

To actually create the maps, I used a combination of Photoshop and a custom tool that I originally wrote for Takua Renderer. I started in Photoshop by kit-bashing together stuff I found online and hand-painting on top to produce a 1024 by 1024 pixel square example map with all of the characteristics I wanted. While in Photoshop, I didn’t worry about making sure that the example map could tile; tiling comes in the next step. After initial work in Photoshop, this is what I came up with:

Figure 26: Initial kit-bashed / hand-painted exemplars for streak and droplet wet maps.

Next, to make the maps repeatable and much larger, I used a custom tool I previously wrote that implements a practical form of histogram-blending hex tiling [Burley 2019]. Hex tiling with histogram preserving blending, originally introduced by Heitz and Neyret [2018], is one of the closest things to actual magic in recent computer graphics research; using hex tiling instead of normal rectilinear tiling basically completely hides obvious repetitions in the tiling from the human eye, and the histogram preserving blending makes sure that hex tile boundaries blend in a way that makes them completely invisible as well. I’ll write more about hex tiling and make my implementation publicly available in a future post. What matters for this project is hex tiling allowed me to convert my exemplar map from Photoshop into a much larger 8K seamlessly repeatable texture map with no visible repetition patterns. Below is a cropped section from each 8K map:

Figure 27: Crops from the 8K wet maps generated from the exemplar maps using my custom implementation of histogram-blending hex tiling.
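Since I’m saving a proper writeup of my hex tiling tool for a future post, I’ll only sketch the core of Heitz and Neyret’s idea here: find the three hex-grid vertices covering a UV lookup, sample the exemplar at a randomly offset location per vertex, and blend the three samples with variance-preserving weights. The Python below is just that blending core for illustration; it omits the histogram Gaussianization and inverse transform needed for fully histogram-preserving results, and it is not the implementation I actually used:

```python
import numpy as np

def triangle_grid(uv, tiling_rate=2.0 * np.sqrt(3.0)):
    """Return blend weights and the three hex-grid vertex ids covering uv."""
    st_uv = np.asarray(uv, dtype=np.float64) * tiling_rate
    # Skew into the triangle/simplex grid.
    skew = np.array([[1.0, -1.0 / np.sqrt(3.0)],
                     [0.0,  2.0 / np.sqrt(3.0)]])
    st = skew @ st_uv
    base = np.floor(st).astype(np.int64)
    f = st - base
    if f[0] + f[1] < 1.0:   # lower triangle
        weights = (1.0 - f[0] - f[1], f[1], f[0])
        verts = (base, base + [0, 1], base + [1, 0])
    else:                   # upper triangle
        weights = (f[0] + f[1] - 1.0, 1.0 - f[1], 1.0 - f[0])
        verts = (base + [1, 1], base + [1, 0], base + [0, 1])
    return weights, verts

def vertex_offset(vertex, salt=0):
    """Deterministic pseudo-random UV offset per grid vertex."""
    rng = np.random.default_rng(hash((int(vertex[0]), int(vertex[1]), salt)) & 0xFFFFFFFF)
    return rng.random(2)

def sample_bilinear_wrap(img, uv):
    """Bilinearly sample a (H, W) or (H, W, C) image with wrapping UVs."""
    h, w = img.shape[:2]
    x, y = (uv[0] % 1.0) * w, (uv[1] % 1.0) * h
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0
    x1, y1 = (x0 + 1) % w, (y0 + 1) % h
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bot = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bot * fy

def hex_tile_sample(exemplar, uv):
    """Blend three randomly offset exemplar lookups with variance preservation."""
    weights, verts = triangle_grid(uv)
    mean = exemplar.mean()
    samples = [sample_bilinear_wrap(exemplar, np.asarray(uv) + vertex_offset(v))
               for v in verts]
    # A plain linear blend would wash out contrast in the transition regions;
    # dividing by the L2 norm of the weights preserves variance instead.
    blended = sum(w * (s - mean) for w, s in zip(weights, samples))
    return blended / np.sqrt(sum(w * w for w in weights)) + mean
```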

For the previous Art Challenge, I also made some custom textures that had to be tileable. Last time though, I used Substance Designer to make the textures tileable, which required setting up a big complicated node graph and produced results where obvious repetition was still visible. Conversely, hex tiling basically works automatically and doesn’t require any kind of manual setup or complex graphs or anything.

To generate the normal maps, I used Photoshop’s “Generate Normal Map” filter, which is found under “Filter > 3D”. For generating normal maps from simple greyscale heightmaps, this Photoshop feature works reasonably well. Because of the deterministic nature of the hex tiling implementation though, I could have also generated normal maps from the grey scale exemplars and then fed the normal map exemplars through the hex tiling tool with the same parameters as how I fed in the greyscale maps, and I would have gotten the same result as below.

Figure 28: Crops from the 8K wet map normals generated using Photoshop's "Generate Normal Map" filter tool.
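As an aside, a similar normal map can also be generated without Photoshop by just taking finite differences of the greyscale heightmap. Here is a minimal NumPy sketch of that idea; this is not what Photoshop’s filter does internally, and depending on the target application the green channel convention may need flipping:

```python
import numpy as np

def height_to_normal_map(height, strength=1.0):
    """Convert a greyscale heightmap (H, W), values in [0, 1], into a
    tangent-space normal map encoded in [0, 1] per channel."""
    # Central differences with wraparound, which suits a tileable map.
    dx = (np.roll(height, -1, axis=1) - np.roll(height, 1, axis=1)) * 0.5
    dy = (np.roll(height, -1, axis=0) - np.roll(height, 1, axis=0)) * 0.5
    # Steeper gradients tilt the normal further away from +Z.
    n = np.dstack((-dx * strength, -dy * strength, np.ones_like(height)))
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    return n * 0.5 + 0.5   # remap from [-1, 1] to [0, 1] for storage
```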

For the wet effect’s clearcoat lobe, I chose to use the physical mode instead of the artistic mode (unlike for the base dry shaders, where I only used the artistic mode). The reason I used the physical mode for the wet effect is because of the layer thickness control, which darkens the underlying base shader according to how thick the clearcoat layer is supposed to be. I wanted this effect, since wet surfaces appear darker than their dry counterparts in real life. Using the greyscale wet map, I modulated the layer thickness control according to how much water there was supposed to be at each part of the surface.

Finally, after wiring everything together in Maya’s Hypershade editor, everything just worked! I think the wet look my approach produces looks reasonably convincing, especially from the distances that everything is from the camera in my final piece. Up close the effect still holds up okay, but isn’t as convincing as using real geometry for the water droplets with real refraction and caustics driven by manifold next event estimation [Hanika et al. 2015]. In the future, if I need to do close-up water droplets, I’ll likely try an MNEE based approach instead; fortunately, RenderMan 23’s PxrUnified integrator already comes with an MNEE implementation as an option, along with various other strategies for handling caustic cases [Hery et al. 2016]. However, the approach I used for this project is far cheaper from a render time perspective compared to using geometry and MNEE, and from a mid to far distance, I’m pretty happy with how it turned out!

Below are some comparisons of the ship and robot with and without the wet effect applied. The ship renders are from the same camera angles as in Figures 13, 14, and 15. Drag the slider left and right to compare:

Figure 29: Wide view of the ship with (left) and without (right) the wet shader applied. For a full screen comparison, click here.

Figure 30: Back view of the ship with (left) and without (right) the wet shader applied. For a full screen comparison, click here.

Figure 31: Side view of the ship with (left) and without (right) the wet shader applied. For a full screen comparison, click here.

Figure 32: Main yellow robot with (left) and without (right) the wet shader applied. For a full screen comparison, click here.

Additional Props and Set Elements

In addition to texturing and shading the flying scifi ship and robot models, I had to create from scratch several other elements to help support the story in the scene. By far the single largest new element that had to be created was the entire dock structure that the robots stand on top of. As mentioned earlier, I wound up modeling the dock to a fairly high level of detail; the dock model contains every single bolt and rivet and plate that would be necessary for holding together a similar real steel frame structure. Part of this level of detail is justifiable by the fact that the dock structure is in the foreground and therefore relatively close to camera, but part of having this level of detail is just because I could and I was having fun while modeling. To model the dock relatively quickly, I used a modular approach where I first modeled a toolkit of basic reusable elements like girders, connection points, bolts, and deckboards. Then, from these basic elements, I assembled larger pieces such as individual support legs and crossbeams and such, and then I assembled these larger pieces into the dock itself.

Shading the dock was relatively fast and straightforward; I created a basic galvanized metal material and applied it using a PxrRoundCube projection. To get a bit more detail and break up the base material a bit, I added a dirt layer on top that is basically just low-frequency noise multiplied by ambient occlusion. I did have to UV map the gangway section of the dock in order to add the yellow and black warning stripe at the end of the gangway; however, since the dock is made up almost entirely of essentially rectangular prisms oriented at 90 degree angles to each other, just using Maya’s automatic UV unwrapping provided something good enough to just use as-is. The yellow and black warning stripe uses the same thick worn paint material that the warning stripes on the ship uses. On top of all of this, I then applied my wet shader clearcoat lobe.

Figure 33: Shading test for the dock, with wet effect applied. The lampposts are in a different orientation compared to where they are in the final scene.

The metro sign on the dock is just a single rectangular prism with a dark glass material applied. The glowing text is a color texture map plugged into PxrSurface’s glow parameter; wherever there is glowing text, I also made the material diffuse instead of glass, with the diffuse color matching the glow color. To balance the intensity of the glow, I had to cheat a bit; turning the intensity of the glow down enough so that the text and colors read well means that the glow is no longer bright enough to show up in reflections or cast enough light to show up in a volume. My solution was to turn down the glow in the PxrSurface shader, and then add a PxrRectLight immediately in front of the metro sign driven by the same texture map. The PxrRectLight is set to be invisible to the camera. I suppose I could have done this in post using light path expressions, but cheating it this way was simpler and allowed for everything to just look right straight out of the render.

Figure 34: Closeup test of the metro sign on the dock.

The suitcase was a really simple prop to make. Basically it’s just a rounded cube with some extra bits stuck on to it for the handles and latch; the little rivets are actually entirely in shading and aren’t part of the geometry at all. I threw on a basic burlap material for the main suitcase, multiplied on some noise to make it look a bit dirtier and worn, and applied basic brass and leather materials to the latch and handle, and that was pretty much it. Since the suitcase was going to serve as the yellow robot’s makeshift umbrella, making sure that the suitcase looked good with the wet effect applied turned out to be really important. Here’s a lookdev test render of the suitcase, with and without the wet effect applied (slide left and right to compare):

Figure 35: Suitcase with (left) and without (right) the wet shader applied. For a full screen comparison, click here.

From early on, I was fairly worried about making the umbrellas look good; I knew that making sure that the umbrellas looked convincingly wet was going to be really important for selling the overall rainy day setting. I originally was going to make the umbrellas opaque, but realized that opaque umbrellas were going to cast a lot of shadows and block out a lot of parts of the frame. Switching to transparent umbrellas made out of clear plastic helped a lot with brightening up parts of the frame and making sure that large parts of the ship weren’t completely blocked out in the final image. As a bonus, I think the clear umbrellas also help the overall setting feel slightly more futuristic. I modeled the umbrella canopy as a single-sided mesh, so the “thin” setting in PxrSurface’s glass parameters was really useful here. Since the umbrella canopy is transparent with refraction roughness, having the wet effect work through the clearcoat lobe proved really important, since doing so allowed for the rain droplets and rivulets to have sharp specular highlights while simultaneously preserving the more blurred refraction in the underlying umbrella canopy material. In the end, lighting turned out to be really important for selling the look of the wet umbrella as well; I found that having tons of little specular highlights coming from all of the rain drops helped a lot.

As a bit of an aside, settling on a final umbrella canopy shape took a surprising amount of time! I started with a much flatter umbrella canopy, but eventually made it more bowed after looking at various umbrellas I have sitting around at home. Most clear umbrella references I found online are of these Japanese bubble umbrellas which are actually far more bowed than a standard umbrella, but I wanted a shape that more closely matched a standard opaque umbrella.

One late addition I made to the umbrella was the small lip at the bottom edge of the umbrella canopies; for much of the development process, I didn’t have this small lip and kept feeling like something was off about the umbrellas. I eventually realized that some real umbrellas have a bit of a lip to help catch and guide water runoff; adding this feature to the umbrellas helped them feel a bit more correct.

Figure 36: Lookdev test of the umbrella, with wet effect applied.

Shortly before the due date for the final image, I made a last-minute addition to my scene: I took the sextant that came with Pixar’s base models and made the white/red robot on the dock hold it. Since the green and yellow robots were both doing something a bit more dynamic than just standing around, I wanted the middle white/red robot to be doing something as well. Maybe the white/red robot is going to navigation school! I did a very quick-and-dirty shading job on the sextant using Maya’s automatic UVs; overall the sextant prop is not shaded to the same level of detail as most of the other elements in my scene, but considering how small the sextant is in the final image, I think it holds up okay. I still tried to add a plausible amount of wear and age to the metal materials on the sextant, but I didn’t have time to put in carved numbers and decals and grippy textures and stuff. There are also a few small areas where you can see visible texture stretching at UV seams, but again, in the final image, it didn’t matter too much.

Figure 37: Quick n' dirty lookdev test of the sextant. Model is by Aliyah Chen and was provided by Pixar as one of the contest's base models.

Rain FX

Having a good wet surface look was one half of getting my scene to look convincingly rainy; the other major problem to solve was making the rain itself! My initial, extremely naive plan was to simulate all of the rainfall as one enormous FLIP sim in Houdini. However, I almost immediately realized what a bad idea that was, due to the scale of the scene. Instead, I opted to simulate the rain as nParticles in Maya.

To start, I first duplicated all of the geometry that I wanted the rain to interact with, combined it all into one single huge mesh, and then decimated the mesh heavily and simplified as much as I could. This single mesh acted as a proxy for the full scene for use as a passive collider in the nParticles network. Using a decimated proxy for the collider instead of the full scene geometry was very important for making sure that the sim ran fast enough for me to be able to get in a good number of different iterations and attempts to find the look that I wanted. I mostly picked geometry that was upward facing for use in the proxy collider:

Figure 38: The proxy mesh I used for the rain nParticles sim. This is an earlier version of the proxy mesh before I settled on final scene geometry; the final sim was run with an updated proxy mesh made from the final scene geometry.

Next, I set up a huge volume nParticle emitter node above the scene, covering the region visible in the camera frustum. The only forces I set up were gravity and a small amount of wind, and then I let the nParticle system run until rain had filled all parts of the scene visible to the camera. To give the impression of fast-moving motion-blurred rain droplets, I set the rendering mode of the nParticles to ‘multistreak’, which makes each particle look like a set of lines with lengths varying according to velocity. I had to play with the collider proxy mesh’s properties a bit to get the right amount of raindrops bouncing off of surfaces and to dial in how high raindrops bounced. I initially tried allowing particles to collide with each other as well, but this slowed the entire sim down to basically a halt, so for the final scene I have particle-to-particle collision disabled.
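For anyone curious what this setup looks like in script form, here is a rough maya.cmds sketch of the basic pieces. This is a from-memory approximation with placeholder values rather than my actual scene setup, and the proxy collider is easiest to hook up through the nParticle/nCloth menus rather than through script:

```python
import maya.cmds as cmds

# Volume emitter above the scene; in the real setup this was positioned and
# scaled to cover the region visible in the camera frustum. The rate here is
# a placeholder, not my actual setting.
emitter = cmds.emitter(type='volume', rate=200000, pos=(0, 100, 0))[0]

# nParticle object to receive the emitted rain, connected to the emitter.
particle, particle_shape = cmds.nParticle()
cmds.connectDynamic(particle, emitters=emitter)

# Gravity plus a small amount of wind to angle the rain slightly.
gravity = cmds.gravity(magnitude=9.8)[0]
wind = cmds.air(magnitude=2.0, directionX=1.0, directionY=0.0, directionZ=0.0)[0]
cmds.connectDynamic(particle, fields=gravity)
cmds.connectDynamic(particle, fields=wind)

# 'MultiStreak' draws each particle as a few short velocity-stretched lines.
cmds.setAttr(particle_shape + '.particleRenderType', 1)  # 1 == MultiStreak

# The decimated proxy mesh is made into a passive (nRigid) collider separately,
# which is simplest to do through Maya's FX menus.
```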

After a couple of rounds of iteration, I started getting something that looked reasonably like rain! Using the proxy collision geometry was really useful for creating “rain shadows”, which are areas where rain isn’t present because it’s blocked by something else. I also tuned the wind speed a lot in order to get rain particles bouncing off of the umbrellas to look like they were being blown aside in the wind. After getting a sim that I liked, I baked out the frame of the sim that I wanted for my final render using Maya’s nCache system, which caches the nParticle simulation to disk so that it can be rapidly loaded up later without having to re-run the entire simulation.

Figure 39: Closeup of a work-in-progress version of the rain sim. Note how the umbrellas properly block rain from falling on the robots under the umbrellas.

To add just an extra bit of detail and storytelling, near the end of the competition period I revisited my original idea for making the rain in Houdini using a FLIP solver. I wanted to add in some “hero” rain drops around the foreground robots, running off of their umbrellas and suitcases and stuff. To create these “hero” droplets, I brought the umbrella canopies and suitcase into Houdini and built a basic FLIP simulation, meshed the result, and brought it back into Maya to integrate back into the scene.

Figure 40: Using a FLIP simulation in Houdini to create some "hero" rain droplets running off of the umbrella canopies and suitcase.

Dialing in the look of the rain required a lot of playing with both the width of the rain drop streaks and with the rain streak material. I was initially very wary of making the rain in my scene heavy, since I was concerned about how much a heavy rain look would prevent me from being able to pull good detail and contrast from the ships. However, after some successful initial tests, I felt a bit more confident about a heavier rain look and tried increasing the amount of rain by around 10x. I originally started working on the sim with only around a million particles, but by the end I had bumped up the particle count to around 10 million. In order to prevent the increased amount of rain from completely washing out the scene, I made each rain drop streak on the thinner and shorter side, and also tweaked the material to be slightly more forward scattering. My rain material is basically a mix of a rough glass and grey diffuse, with the reasoning being that rain needs to be a glass material since rain is water, but since the rain droplet streaks are meant to look motion blurred, throwing in some diffuse just helps them show up better in camera; making the rain material more forward scattering in this case just means changing the ratio of glass/diffuse to be more glass. I eventually arrived at a ratio of 60% diffuse light grey to 40% glass, which I found helped the rain show up in the camera and catch light a bit better. I also used the “presence” parameter (which is really just opacity) in PxrSurface to make final adjustments to balance how visible the rain was with how much it was washing out other details. For the “hero” droplets, I used a completely bog-standard glass material.

Figuring out how to simulate the rain and make it look good was by far the single largest source of worries for me in this whole project, so I was incredibly relieved at the end when it all came together and started looking good. Here’s a 2K crop from my final image showing the “hero” droplets and all of the surrounding rain streaks around the foreground robots.

Figure 41: 2K crop showing "hero" droplets and rain streaks.

Lighting and Compositing

Lighting this scene proved to be very interesting and very different from what I did for the previous challenge! Looking back, I think I actually may have “overlit” the scene in the previous challenge; I tend to prefer a slightly more naturalistic look, but while in the thick of lighting, it’s easy to get carried away and push things far beyond the point of looking naturalistic. Another aspect of this scene that made it very different from anything I’ve tried before is both the sheer number of practical lights in the scene and the fact that practical lights are the primary source of all lighting in this scene!

The key lighting in this scene is provided by the overhead lampposts on the dock, which illuminate the foreground robots. I initially had a bunch of extra invisible PxrRectLights providing additional illumination and shaping on the robots, but I got rid of all of them and in the final image I relied only on the actual lights on the lampposts. To prevent the visible light surfaces themselves from blowing out and aliasing, I used two lights for every lamppost: one visible-to-camera PxrRectLight set to a low intensity that wouldn’t alias in the render, and one invisible-to-camera PxrRectLight set to a relatively higher intensity for providing the actual lighting. The visible-to-camera PxrRectLight is rendered out as the only element on a separate render layer, which can then be added back in to the main key lighting render layer.

To better light the ships, I added a number of additional floodlights to the ship that weren’t part of the original model; you can see these additional floodlights mounted on top of the various masts of the ships and also on the sides of the tower superstructure. These additional floodlights illuminate the decks of the ships and help provide specular highlights to all of the umbrellas on the deck of the foreground ship, which enhances the rainy water droplet covered look. For the foreground robots on the dock, the ship floodlights also act as something of a rim light. Each of the ship floodlights is modeled as a visible-to-camera PxrDiscLight behind a glass lens with a second invisible-to-camera PxrDiscLight in front of the glass lens. The light behind the glass lens is usually lower in intensity and is there to provide the in-camera look of the physical light, while the invisible light in front of the lens is usually higher in intensity and provides the actual illumination in the scene.

In general, one of the major lessons I learned on this project was that when lighting using practical lights that have to be visible in camera, a good approach is to use two different lights: one visible-to-camera and one invisible-to-camera. This approach allows for separating how the light itself looks versus what kind of lighting it provides.
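If you are making a lot of these paired practicals, the pattern is simple to script; below is a rough maya.cmds sketch of the idea, with placeholder names and values. The camera-visibility toggle itself isn’t scripted here because the exact attribute depends on the RenderMan for Maya version, so treat that step as something you flip in the Attribute Editor:

```python
import maya.cmds as cmds

def make_practical_light_pair(name, position, look_intensity, lighting_intensity):
    """Create the visible-to-camera / invisible-to-camera light pair used for
    practical lights. Camera visibility on the second light still has to be
    toggled through the light's attributes for your RfM version."""
    # Low-intensity light that the camera actually sees; this one just has to
    # look right in-frame without blowing out or aliasing.
    visible = cmds.shadingNode('PxrRectLight', asLight=True, name=name + '_visible')
    cmds.setAttr(visible + '.intensity', look_intensity)

    # Higher-intensity light, hidden from camera, that does the real lighting work.
    hidden = cmds.shadingNode('PxrRectLight', asLight=True, name=name + '_illum')
    cmds.setAttr(hidden + '.intensity', lighting_intensity)

    # Move both lights to the practical's location in the scene.
    for light in (visible, hidden):
        parent = cmds.listRelatives(light, parent=True)
        target = parent[0] if parent else light
        cmds.xform(target, worldSpace=True, translation=position)
    return visible, hidden
```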

The overall fill lighting and time of day is provided by the skydome, which uses an overcast sky at dusk. I waffled back and forth for a while between a more mid-day setting and a dusk setting, but eventually settled on the dusk skydome since the overall darker time of day allows the practical lights to stand out more. I think allowing the background trees to fade almost completely to black actually helps a lot in keeping the focus of the image on the main story elements in the foreground. One feature of RenderMan 23 that really helped in quickly testing different lighting setups and iterating on ideas was RenderMan’s IPR mode, which has come a long way since RenderMan first moved to path tracing. In fact, throughout this whole project, I used the IPR mode extensively for both shading tests and for the lighting process. I have a lot of thoughts about the huge, compelling improvements to artist workflows that will be brought by even better interactivity (RenderMan XPU is very exciting!), but writing all of those thoughts down is probably better material for a different blog post in the future.

In total I had five lighting render layers: the key from the lampposts, the foreground rim and background fill from the floodlights, overall fill from the skydome, and two practicals layers for the visible-to-camera parts of all of the practical lights. Below are my lighting render layers, although with the two practicals layers merged:

Figure 42: Final render, lampposts key lighting pass.

Figure 43: Final render, floodlights lighting pass.

Figure 44: Final render, sky fill lighting pass.

Figure 45: Final render, practical lights lighting pass.

I used a number of PxrRodLightFilters to knock down some distractingly bright highlights in the scene (especially on the foreground robots’ umbrellas in the center of the frame). As a rendering engineer, rod light filters are a constant source of annoyance due to the sampling problems they introduce; rods allow for arbitrarily increasing or decreasing the amount of light going through an area, which throws off energy conservation, which can mess up importance sampling strategies that depend on a degree of energy conservation. However, as a user, rod light filters have become one of my favorite go-to tools for shaping and adjusting lighting on a local basis, since they offer an enormous amount of localized artistic control.

To convey the humidity of a rainstorm and to provide volumetric glow around all of the practical lights in the scene, I made extensive use of volume rendering on this project as well. Every part of the scene visible in-camera has some sort of volume in it! There are generally two types of volumes in this scene: a group of thinner, less dense volumes to provide atmospherics, and then a group of thicker, denser “hero” volumes that provide some of the more visible mist below the foreground ship and swirling around the background ship. All of these volumes are heterogeneous volumes brought in as VDB files.

One odd thing I found with volumes was some major differences in sampling behavior between RenderMan 23’s PxrPathtracer and PxrUnified integrators. I found that by default, whenever I had a light that was embedded in a volume, areas in the volume near the light were extremely noisy when rendered using PxrUnified but rendered normally when using PxrPathtracer. I don’t know enough about the details of how PxrUnified and PxrPathtracer’s volume integration [Fong et al. 2017] approaches differ, but it almost looks to me like PxrPathtracer is correctly using RenderMan’s equiangular sampling implementation [Kulla and Fajardo 2012] in these areas and PxrUnified for some reason is not. As a result, for rendering all volume passes I relied on PxrPathtracer, which did a great job with quickly converging on all passes.
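I can’t say what PxrUnified is or isn’t doing internally, but for context, equiangular sampling itself is compact enough to sketch in a few lines. This is just the textbook technique from the Kulla and Fajardo paper, written out in Python for illustration, not RenderMan’s implementation:

```python
import math

def sample_equiangular(ray_origin, ray_dir, t_min, t_max, light_pos, u):
    """Sample a distance t along a ray segment with density proportional to the
    inverse squared distance to a point light (equiangular sampling).

    ray_dir is assumed normalized; u is a uniform random number in [0, 1).
    Returns (t, pdf)."""
    # Distance along the ray to the point closest to the light.
    delta = sum((lp - o) * d for lp, o, d in zip(light_pos, ray_origin, ray_dir))
    # Perpendicular distance from the light to the ray.
    closest = [o + delta * d for o, d in zip(ray_origin, ray_dir)]
    D = math.sqrt(sum((lp - c) ** 2 for lp, c in zip(light_pos, closest)))
    D = max(D, 1e-6)  # guard against the light sitting exactly on the ray

    # Angular bounds of the segment as seen from the light's closest point.
    theta_a = math.atan2(t_min - delta, D)
    theta_b = math.atan2(t_max - delta, D)

    # Sample an angle uniformly, then map back to a distance along the ray.
    theta = theta_a + u * (theta_b - theta_a)
    t = delta + D * math.tan(theta)
    pdf = D / ((theta_b - theta_a) * (D * D + (t - delta) ** 2))
    return t, pdf
```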

An interesting unintended side effect of filling the scene with volumes was in how the volumes interacted with the orange thruster and exhaust vent lights. I had originally calibrated the lights in the thrusters and exhaust vents to provide an indication of heat coming from those areas of the ship without being so bright as to distract from the rest of the image, but the orange glows these lights produced in the volumes made the entire bottom of the image orange, which was distracting anyway. As a result, I had to re-adjust the orange thruster and exhaust vent lights to be considerably dimmer than I had originally had them, so that when interacting with the volumes, everything would be brought up to the apparent image-wide intensity that I had originally wanted.

In total I had eight separate render passes for volumes; each of the consolidated lighting passes from above had two corresponding volume passes. Within the two volume passes for each consolidated lighting pass, one volume pass was for the atmospherics and one was for the heavier mist and fog. Below are the volume passes consolidated into four images, with each image showing both the atmospherics and mist/fog in one image:

Figure 46: Final render, lampposts key volumes combined passes.

Figure 47: Final render, floodlights volumes combined passes.

Figure 48: Final render, sky fill volumes combined passes.

Figure 49: Final render, practical lights volumes combined passes.

One final detail I added in before final rendering was to adjust the bokeh shape to something more interesting than a uniform circle. RenderMan 23 offers a variety of controls for customizing the camera’s aperture shape, which in turn controls the bokeh shape when using depth of field. All of the depth of field in my final image is in-render, and because of all of the tiny specular hits from all of the raindrops and from the wet shader, there is a lot of visible bokeh going on. I wanted to make sure that all of this bokeh was interesting to look at! I picked a rounded 5-bladed aperture with a significant amount of non-uniform density (that is, the outer edges of the bokeh are much brighter than the center core).

For final compositing, I used a basic Photoshop and Lightroom workflow like I did in the previous challenge, mostly because Photoshop is a tool I already know extremely well and I don’t have Nuke at home. I took a relatively light-handed approach to compositing this time around; adjustments to individual layers were limited to just exposure changes. All of the layers shown above already have the exposure adjustments I made baked in. After making adjustments in Photoshop and flattening out to a single layer, I then brought the image into Lightroom for final color grading. For the final color grade, I tried to push the overall look to be a bit moodier and a bit more contrast-y, with the goal of having the contrast further draw the viewer’s eye to the foreground robots where the main story is. Figure 50 is a gif that visualizes the compositing process for my final image by showing how all of the successive layers are added on top of each other. Figure 51 shows what all of the lighting, comp, and color grading looks like applied to a 50% grey clay shaded version of the scene, and if you don’t want to scroll all the way back to the top of this post to see the final image, I’ve included it again as Figure 52.

Figure 50: Animated breakdown of compositing layers.

Figure 51: Final lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version.

Figure 52: Final image. Click for 4K version.

Conclusion

On the whole, I’m happy with how this project turned out! I think a lot of what I did on this project represents a decent evolution over my previous RenderMan Art Challenge entry and applies a lot of the lessons I learned from that project. I started this project mostly as an excuse to just have fun, but along the way I still learned a lot more, and going forward I’m definitely hoping to be able to do more pure art projects alongside my main programming and technical projects.

Here is a progression video I put together from all of the test and in-progress renders that I made throughout this entire project:

Figure 53: Progression reel made from test and in-progress renders leading up to my final image.

My wife, Harmony Li, deserves an enormous amount of thanks on this project. First off, the final concept I went with is just as much her idea as it is mine, and throughout the entire project she provided valuable critiques and suggestions and direction. As usual with the RenderMan Art Challenges, Leif Pederson from Pixar’s RenderMan group provided a lot of useful tips, advice, feedback, and encouragement as well. Many other entrants in the Art Challenge also provided a ton of support and encouragement; the community that has built up around the Art Challenges is really great and a fantastic place to be inspired and encouraged. Finally, I owe an enormous thanks to all of the judges for this RenderMan Art Challenge, because they picked my image for first place! Winning first place in a contest like this is incredibly humbling, especially since I’ve never really considered myself as much of an artist. Various friends have since pointed out that with this project, I no longer have the right to deny being an artist! If you would like to see more about my contest entry, check out the work-in-progress thread I kept on Pixar’s Art Challenge forum, and I also made an Artstation post for this project.

As a final bonus image, here’s a daylight version of the scene. My backup plan in case I wasn’t able to pull off the rainy look was to just go for a boring daylight setup; I figured that the lighting would be a lot more boring, but the additional visible detail would be an okay consolation prize for myself. Thankfully, the rainy look worked out and I didn’t have to go to my backup plan! After the contest wrapped up, I went back and made a daylight version out of curiosity:

Figure 54: Bonus image: daylight version. Click for 4K version.

References

Petr Beckmann and André Spizzichino. 1963. The Scattering of Electromagnetic Waves from Rough Surfaces. New York: Pergamon.

Brent Burley. 2012. Physically Based Shading at Disney. In ACM SIGGRAPH 2012 Course Notes: Practical Physically-Based Shading in Film and Game Production.

Brent Burley. 2015. Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering. In ACM SIGGRAPH 2015 Course Notes: Physically Based Shading in Theory and Practice.

Brent Burley. 2019. On Histogram-Preserving Blending for Randomized Texture Tiling. Journal of Computer Graphics Techniques. 8, 4 (2019), 31-53.

Per Christensen, Julian Fong, Jonathan Shade, Wayne Wooten, Brenden Schubert, Andrew Kensler, Stephen Friedman, Charlie Kilpatrick, Cliff Ramshaw, Marc Bannister, Brenton Rayner, Jonathan Brouillat, and Max Liani. 2018. RenderMan: An Advanced Path-Tracing Architecture for Movie Rendering. ACM Transactions on Graphics. 37, 3 (2018), 30:1–30:21.

Johannes Hanika, Marc Droske, and Luca Fascione. 2015. Manifold Next Event Estimation. Computer Graphics Forum. 34, 4 (2015), 87-97.

Eric Heitz and Fabrice Neyret. 2018. High-Performance By-Example Noise using a Histogram-Preserving Blending Operator. Proceedings of the ACM on Computer Graphics and Interactive Techniques. 1, 2 (2018), 31:1-31:25.

Christophe Hery, Ryusuke Villemin, and Junyi Ling. 2017. Pixar’s Foundation for Materials: PxrSurface and PxrMarschnerHair. In ACM SIGGRAPH 2017 Course Notes: Physically Based Shading in Theory and Practice.

Christophe Hery, Ryusuke Villemin, and Florian Hecht. 2016. Towards Bidirectional Path Tracing at Pixar. In ACM SIGGRAPH 2016 Course Notes: Physically Based Shading in Theory and Practice.

Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. Production Volume Rendering. In ACM SIGGRAPH 2017 Courses.

Iliyan Georgiev, Jamie Portsmouth, Zap Andersson, Adrien Herubel, Alan King, Shinji Ogaki, and Frederic Servant. 2019. Autodesk Standard Surface. Autodesk white paper.

Ole Gulbrandsen. 2014. Artist Friendly Metallic Fresnel. Journal of Computer Graphics Techniques. 3, 4 (2014), 64-72.

Christopher Kulla and Marcos Fajardo. 2012. Importance Sampling Techniques for Path Tracing in Participating Media. Computer Graphics Forum. 31, 4 (2012), 1519-1528.

Bruce Walter, Steve Marschner, Hongsong Li, and Kenneth E. Torrance. 2007. Microfacet Models for Refraction through Rough Surfaces. In Rendering Techniques 2007 (Proceedings of the 18th Eurographics Symposium on Rendering), 195-206.