Code & Visuals
2024-02-06T18:13:00+00:00
https://blog.yiningkarlli.com
Yining Karl Li
https://blog.yiningkarlli.com/2021/11/encanto.html
Encanto
2021-11-29T00:00:00+00:00
Yining Karl Li
<p>For the first time since 2016, Walt Disney Animation Studios is releasing not just one animated feature in a year, but two!
The second Disney Animation release of 2021 is <a href="https://movies.disney.com/encanto">Encanto</a>,
which marks a major milestone as Disney Animation’s 60th animated feature film.
Encanto is a musical set in Colombia about a girl named Mirabel and her family: the amazing, fantastical, magical Madrigals.
I’m proud of every Disney Animation project that I’ve had the privilege to work on, but I have to admit that this year was something different and something very special to me, because this year we completed both Raya and the Last Dragon and Encanto, which are together two of my favorite Disney Animation projects so far.
Earlier this year, I wrote about the <a href="https://blog.yiningkarlli.com/2021/03/raya-and-the-last-dragon.html">amazing work that went into Raya and the Last Dragon</a> and why I loved working on that project; with Encanto now in theaters, I now get to share why I’ve loved working on Encanto so much as well!</p>
<p>Disney Animation feature films take many years and hundreds of people to make, and often the film’s story can remain in a state of flux for much of the film’s production.
All of the above isn’t unusual; large-scale creative endeavors like filmmaking often entail an extremely complex and challenging process.
More often than not, a film requires time and many iterations to really find its voice and gain that spark that makes it a great film.
Encanto, however, is a film that a lot of my coworkers and I realized was going to be really special very early on in production.
Now obviously, that hunch didn’t mean that making Encanto was easy by any means; every film requires tons of hard work from the most amazing, inspiring, talented artists and engineers that I know.
But, I think in the end, that initial hunch about Encanto was proven correct: the finished Encanto has a story that is bursting with warmth and meaning, has one of Disney Animation’s best main characters to date with a huge cast of charming supporting characters, has the most beautiful, magical animation and visuals we’ve ever done, and sets all of the above to a wonderful soundtrack with a bunch of catchy, really cleverly written new songs.
Both the production process and final film for Encanto were a strong reminder for me of why I love working on Disney Animation films in the first place.</p>
<p>From a technical perspective, Encanto also represents something very special in the history of Disney Animation’s continual advancements in animation technology.
To understand why this is, a very brief review of the history of Disney Animation’s modern production pipeline and toolset is helpful.
In retrospect, Disney Animation’s 50th animated feature film, Tangled, was probably one of the most important films the studio has ever made from a technical perspective, because the production of Tangled required a near-total ground-up rebuild of the studio’s production pipeline and tools that wound up laying the technical foundations for Disney Animation’s modern era.
While every film we’ve made since Tangled has seen us make enormous technical strides in a variety of areas, the starting point of the production pipeline we’ve used and evolved for every CG film up until Encanto was put into place during Tangled.
The fact that Encanto is Disney Animation’s 60th animated feature film is therefore fitting; Encanto is the first film made using the successor to the production pipeline that was first built for Tangled, and just like how Tangled laid the technical foundations for the subsequent ten films that followed, Encanto lays the technical foundations for many more future films to come!
As presented in the USD Birds of a Feather session at SIGGRAPH 2021, this new production pipeline is built on the open-source Universal Scene Description project and brings massive upgrades to almost every piece of software and every custom tool that our artists use.
An absolutely monumental amount of work was put into building a new USD-based world at Disney Animation, but I think the effort was extremely worthwhile: thanks to the work done on Encanto, Disney Animation is now well set up for another decade of technical innovation and another decade of pushing animation as a medium forward.
I’m hoping that we’ll be able to present much more on this topic at SIGGRAPH 2022!</p>
<p>Moving to a new production pipeline meant also moving Disney’s Hyperion Renderer to work in the new production pipeline.
To me, one of the biggest advantages of an in-house production renderer is the ability for the renderer development team to work extremely closely with other teams in the studio in an integrated fashion, and moving Hyperion to work well in the new USD-based world exemplifies just how important this collaboration is.
We couldn’t have pulled off this effort without the huge amount of amazing work that engineers and TDs and artists from many other departments pitched in.
However, having to move an existing renderer to a new pipeline isn’t the only impact on rendering that the new USD-based world has had.
One of the most exciting things about the new pipeline is all of the new possibilities and capabilities that USD and Hydra unlock; one of the biggest projects our rendering team worked on during Encanto’s production was a new, very exciting next-generation rendering project.
I can’t talk too much about this project yet; all I can say is that we see it as a major step towards the future of rendering at Disney Animation, and that even in its initial deployment on Encanto, we’ve already seen huge fundamental improvements to how our lighters work every day.
Hopefully we’ll be able to reveal more soon!</p>
<p>Of course, just because Encanto saw huge foundational changes to how we make movies doesn’t mean that there weren’t the usual fun and interesting show-specific challenges as well.
Encanto presented many new, weird, fun problems for the rendering team to think about.
Geometry fracturing was a major effect used extensively throughout Encanto, and in order to author and render fractured geometry as efficiently as possible, the rendering team had to devise some really clever new geometry-processing features in Hyperion.
Encanto’s cinematography direction called for a beautiful, really colorful look that required pushing artistic controllability in our lighting capabilities even further, and to that end our team developed a bunch of cool new artistic control enhancements in Hyperion’s volume rendering and light shaping systems.
One of my favorite show-specific challenges that I got to work on for Encanto was for the holographic effect in Bruno’s emerald crystal prophecies.
For a variety of reasons, the artists wanted this effect done completely in-render; coming up with an in-render solution required many iterations and prototypes and experiments carried out over several months through a close collaboration between a number of artists and TDs and the rendering team.
Encanto also saw continued advancements to Hyperion’s state-of-the-art deep-learning denoiser and stereo rendering solutions and saw continued advancements in Hyperion’s shading models and traversal system.
These advancements helped us tackle many of the interesting complexity and scaling challenges that Encanto presented; effects like Isabella’s flowers and the glowing magical particles associated with the Madrigal family’s miracle pushed instancing counts to incredible new record levels, and for the first time ever on a Disney Animation film, we actually rendered some of the gorgeous costumes in the movie not as displaced triangle meshes with fuzz on top, but as <em>actual woven curves at the thread-level</em>.
The latter proved crucial to creating the chiffon and tulle in Isabella’s outfit and played a huge part in creating the look of Mirabel’s characteristic custom-embroidered skirt.
My mind was thoroughly blown when I saw those renders for the first time; on every film, I’m constantly amazed and impressed by what our artists can do with the tools we provide them with.
Again, I’m hoping that we’ll be able to share much more about all of these things later; keep an eye on SIGGRAPH 2022!</p>
<p>Encanto also saw rendering features that we first developed for previous films pushed even further and used in interesting new ways.
We first deployed a path guiding implementation in Hyperion back on Frozen 2, but path guiding wound up not seeing too much use on Raya and the Last Dragon since Raya’s setting was mostly outdoors, and path guiding doesn’t help as much in direct-lighting dominant scenarios such as outdoor scenes.
However, since a huge part of Encanto takes place inside of the magical Madrigal casita, indoor indirect illumination was a huge component of Encanto’s lighting.
We found that path guiding provided enormous benefits to render times in many indoor scenes, and especially in settings like the Madrigal family’s kitchen at night, where lighting was almost entirely provided by outdoor light sources coming in through windows and from candles and stuff.
I think this case was a great example of how we benefit from how closely our lighting artists and our rendering engineers work together on many shows over time; because we had all worked together on similar problems before, we all had shared experiences with past solutions that we were able to draw on together to quickly arrive at a common understanding of the new challenges on Encanto.
Another good example of how this collaboration continues to pay dividends over time is in the choices of lens and bokeh effects that were used on Encanto.
For Raya and the Last Dragon, we learned a lot about creating non-uniform bokeh and interesting lensing effects, and what we learned on Raya in turn helped further inform early cinematography and lensing experiments on Encanto.</p>
<p>In addition to all of the cool renderer development work that I usually do, I also got to take part in something a little bit different on Encanto.
Every year, the lighting department brings on a handful of trainees, who are put through several months of in-studio “lighting school” to learn our tools and pipeline and approach to lighting before lighting real shots on the film itself.
This year, I got to join in with the lighting trainees while they were going through lighting training; this experience wound up being one of my favorites from the past year.
I think that having to sit down and actually learn and use software the same way that the users have to is an extraordinarily valuable experience for any software engineer that is building tools for users.
Even though I’ve been working at Disney Animation for six years now, and even though I know the internals of how our renderer works extensively, I still learned a ton from having to actually use Hyperion to light shots and address notes from lighting supervisors and stuff!
Encanto’s lighting style required really leaning on the tools that we have for art-directing and pushing and modifying fully physical lighting, which really changed my perspective on some of these tools.
For most rendering engineers and researchers, features that allow for breaking purely physical light transport are often seen as annoying and difficult to implement but necessary concessions to the artists.
Having now used these features in order to hit artistic notes on short time frames though, I now have a better understanding of just how critical a component these features can be in an artist’s toolbox.
I owe a huge amount of thanks to Disney Animation’s technology department leadership and to the lighting department for having made this experience possible and for having strongly supported this entire “exchange program”; I’d strongly recommend that every rendering engineer should go try lighting some shots sometime!</p>
<p>Finally, here are a handful of stills from the movie, 100% created using Disney’s Hyperion Renderer by our amazing artists.
I’ve ordered the frames randomly, to try to prevent spoiling anything important.
These frames showcase just how gorgeous Encanto looks, but since they’re pulled from only the marketing materials that have been released so far, they only represent a small fraction of how breathtakingly beautiful and colorful the total film is.
Hopefully I’ll be able to share a bunch more cool and beautiful stills closer to SIGGRAPH 2022.
I highly recommend seeing Encanto on the biggest screen you can; if you are a computer graphics enthusiast, go see it twice: the first time for the wonderful, magical story and the second time for the incredible artistry that went into every single shot and every single frame!
I love working on Disney Animation films because Disney Animation is a place where some of the most amazing artists and engineers in the world work together to simultaneously advance animation as a storytelling medium, as a visual medium, and as a technology.
Art being inspired by technology and technology being challenged by art is a legacy that is deeply baked into the very DNA of Disney Animation, and that approach is exemplified by every single frame in Encanto:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_001.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_001.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_002.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_002.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_003.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_003.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_004.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_004.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_005.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_005.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_006.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_006.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_007.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_007.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_008.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_008.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_009.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_009.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_010.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_010.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_011.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_011.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p>Also, be sure to catch our new short, Far From the Tree, which is accompanying Encanto in theaters.
Far From the Tree deserves its own discussion later; all I’ll write here is that I’m sure it’s going to be fascinating for rendering and computer graphics enthusiasts to see!
Far From the Tree tells the story of a parent and child raccoon as they explore a beach; the short has a beautiful hand-drawn watercolor look that is actually CG rendered out of Disney’s Hyperion Renderer and extensively augmented with hand-crafted elements.
Be sure to see Far From the Tree in theaters with Encanto!</p>
https://blog.yiningkarlli.com/2021/10/takua-on-m1-max.html
Rendering on the Apple M1 Max Chip
2021-10-25T00:00:00+00:00
Yining Karl Li
<p>Over the past year, I ported my hobby renderer, Takua Renderer, to 64-bit ARM.
I wrote up the entire process and everything I learned as a three-part blog post series covering topics ranging from assembly-level comparison between x86-64 and arm64, to deep dives into various aspects of Apple Silicon, to a comparison of x86-64’s SSE and arm64’s Neon vector instructions.
In the intro to part 1 of my arm64 series, I wrote about my <a href="https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html#motivation">motivation for exploring arm64</a>, and in the <a href="https://blog.yiningkarlli.com/2021/07/porting-takua-to-arm-pt2.html#conclusion">conclusion to part 2</a> of my arm64 series, I wrote the following about the Apple M1 chip:</p>
<blockquote>
<p>There’s really no way to understate what a colossal achievement Apple’s M1 processor is; compared with almost every modern x86-64 processor in its class, it achieves significantly more performance for much less cost and much less energy. The even more amazing thing to think about is that the M1 is Apple’s low end Mac processor and likely will be the slowest arm64 chip to ever power a shipping Mac; future Apple Silicon chips will only be even faster.</p>
</blockquote>
<p>Well, those future Apple Silicon chips are now here!
Last week (relative to the time of posting), Apple announced new 14 and 16-inch MacBook Pro models, powered by the new Apple M1 Pro and Apple M1 Max chips.
Apple reached out to me last week immediately after the announcement of the new MacBook Pros, and as a result, for the past week I’ve had the opportunity to use a prerelease M1 Max-equipped 2021 14-inch MacBook Pro as my daily computer.
So, to my extraordinary surprise, this post is the unexpected Part 4 to what was originally supposed to be a two-part series about Takua Renderer on arm64.
This post will serve as something of a coda to my Takua Renderer on arm64 series, but will also be fairly different in structure and content to the previous three parts.
While the previous three parts dove deep into extremely technical details about arm64 assembly and Apple Silicon and such, this post will focus on a single question: now that professional-grade Apple Silicon chips exist in the wild, <em>how well do high-end rendering workloads run on workstation-class arm64</em>?</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/macbookpro14.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/macbookpro14.jpg" alt="Figure 1: The new 2021 14-inch MacBook Pro with an Apple M1 Max chip, running Takua Renderer." /></a></p>
<p>Before we dive in, I want to get a few important details out of the way.
First, this post is not really a product review or anything like that, and I will not be making any sort of endorsement or recommendation on what you should or should not buy; I’ll just be writing about my experiences so far.
Many amazing tech reviewers exist out there, and if what you are looking for is a general overview and review of the new M1 Pro and M1 Max based MacBook Pros, I would suggest you go check out reviews by The Verge, Anandtech, MKBHD, Dave2D, LinusTechTips, and so on.
Second, as with everything in this blog, the contents of this post represent only my personal opinion and do not in any way represent any kind of official or unofficial position, endorsement, or opinion on any matter from my employer, Walt Disney Animation Studios.
When Apple reached out to me, I received permission from Disney Animation to go ahead on a purely personal basis, and beyond that nothing with this entire process involves Disney Animation.
Finally, Apple is not paying me or giving me anything for this post; the 14-inch MacBook Pro I’ve been using for the past week is strictly a loaner unit that has to be returned to Apple at a later point.
Similarly, Apple has no say over the contents of this post; Apple has not even seen any version of this post before publishing.
What is here is only what I think!</p>
<p>Now that a year has passed since the first Apple Silicon arm64 Macs were released, I do have my hobby renderer up and running on arm64 with everything working, but I’ve only rendered relatively small scenes so far on arm64 processors.
The reason I’ve stuck to smaller scenes is because high-end workstation-class arm64 processors so far just have not existed; while large server-class arm64 processors with large core counts and tons of memory do exist, these server-class processors are mostly found in huge server farms and supercomputers and are not readily available for general use.
For general use, the only arm64 options so far have been low-power single-board computers like the Raspberry Pi 4 that are nowhere near capable of running large rendering workloads, or phones and tablets that don’t have software or operating systems or interfaces suitable for professional 3D applications, or M1-based Macs.
I have been using an M1 Mac Mini for the past year, but while the M1 performance-wise punches way above what a 15 watt TDP typically would suggest, the M1 only supports up to 16 GB of RAM and only represents Apple’s <em>entry</em> into Apple Silicon based Macs.
The M1 Pro and M1 Max, however, are Apple’s first high-powered arm64-based chips targeted at professional workloads, meant for things like high-end rendering and many other creative workloads; by extension, the M1 Pro and M1 Max are also the first arm64 chips of their class in the world with wide general availability.
So, in this post, answering the question “how well do high-end rendering workloads run on workstation-class arm64” really means examining how well the M1 Pro and M1 Max can do rendering.</p>
<p>Spoiler: the answer is <em>extremely well</em>; all of the renders in the post were rendered on the 14-inch MacBook Pro with an M1 Max chip.
Here is a screenshot of Takua Renderer running on the 14-inch MacBook Pro with an M1 Max chip:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/takua-on-m1max.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/takua-on-m1max.jpg" alt="Figure 2: Takua Renderer running on arm64 macOS 12 Monterey, on a 14-inch MacBook Pro with an M1 Max chip." /></a></p>
<p>The 14-inch MacBook Pro I’ve been using for the past week is equipped with the maximum configuration in every category: a full M1 Max chip with a 10-core CPU, 32-core GPU, 64 GB of unified memory, and 8 TB of SSD storage.
However, for this post, I’ll only focus on the 10-core CPU and 64 GB of RAM, since Takua Renderer is currently CPU-only (more on that later); for a deep dive into the M1 Pro and M1 Max’s entire system-on-a-chip, I’d suggest taking a look at <a href="https://www.anandtech.com/show/17019/apple-announced-m1-pro-m1-max-giant-new-socs-with-allout-performance">Anandtech’s great initial impressions</a> and later <a href="https://www.anandtech.com/show/17024/apple-m1-max-performance-review">in-depth review</a>.</p>
<p>The first M1 Max spec that jumped out at me is the 64 GB of unified memory; having this amount of memory meant I could finally render some of the largest scenes I have for my hobby renderer.
To test out the M1 Max with 64 GB of RAM, I chose the forest scene from my <a href="https://blog.yiningkarlli.com/2018/10/bidirectional-mipmap.html">Mipmapping with Bidirectional Techniques</a> post.
This scene has enormous amounts of complex geometry; almost every bit of vegetation in this scene has highly detailed displacement mapping that has to be stored in memory, and the large amount of textures in this scene is what drove me to implement a texture caching system in my hobby renderer in the first place.
In total, this scene requires just slightly under 30 GB of memory just to store all of the subdivided, tessellated, and displaced scene geometry, and requires an additional few more GB for the texture caching system (the scene can render with just a 1 GB texture cache, but having a larger texture cache helps significantly with performance).</p>
<p>I have only ever published two images from this scene: the <a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest.cam0.0.jpg">main forest path view</a> in the mipmapping blog post, and a closeup of a tree stump as the title image on my personal website.
I originally had several more camera angles set up that I wanted to render images from, and I actually did render out 1080p images.
However, to showcase the detail of the scene better, I wanted to wait until I had 4K renders to share, but unfortunately I never got around to doing the 4K renders.
The reason I never did the 4K renders is because I only have one large personal workstation that has both enough memory and enough processing power to actually render images from this scene in a reasonable amount of time, but I needed this workstation for other projects.
I also have a few much older spare desktops that do have just barely enough memory to render this scene, but unfortunately, those machines are so loud and so slow and produce so much heat that I prefer not to run them at all if possible, and I especially prefer not running them on long render jobs when I have to work from home in the same room!
However, over the past week, I have been able to render a bunch of 4K images from my forest scene on the M1 Max 14-inch MacBook Pro; quite frankly, being able to do this on a laptop is incredible to me.
Here is the title image from my personal website, but now rendered at 4K resolution on the M1 Max 14-inch MacBook Pro:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/forest.cam2.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/forest.cam2.0.jpg" alt="Figure 3: Forest scene title image from my personal website. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version." /></a></p>
<p>The M1 Max-based MacBook Pro is certainly not the first laptop to ever ship with 64 GB of RAM; the previous 2019 16-inch MacBook Pro was also configurable up to 64 GB of RAM, and there are crazy PC laptops out there that can be configured up even higher.
However, this is where the M1 Max and M1 Pro’s CPU performance comes into play: while previous laptops could support 64 GB of RAM and more, actually utilizing large amounts of RAM was difficult since previous laptop CPUs often couldn’t keep up!
Being able to fit a large data set into memory is one thing, but being able to run processing fast enough to actually make use of large data sets in a reasonable amount of time is the other half of the puzzle.
My wife has a 2019 16-inch MacBook Pro with 32 GB of memory, which is <em>just</em> enough to render my forest scene.
However, as seen in <a href="#results">the benchmark results later in this post</a>, the 2019 16-inch MacBook Pro’s Intel Core i7-9750H CPU with 6 cores and 12 threads is over twice as slow as the M1 Max at rendering this scene <em>at best</em>, and can be even slower depending on thermals, power, and more.
Rendering each of the images in this post took a few hours on the M1 Max, but on the Core i7-9750H, the renders have to become overnight jobs with the 16-inch MacBook Pro’s fans running at full speed.
With only a week to write this post, a few hours per image versus an overnight job per image made the difference between having images ready for this post versus not having any interesting renders to show at all!</p>
<p>Actually, the M1 Max isn’t just fast for a chip in a laptop; the M1 Max is stunningly competitive even with <em>desktop</em> workstation CPUs.
For the past few years, the large personal workstation that I offload large projects onto has been a machine with dual Intel Xeon E5-2680 workstation processors with 8 cores / 16 threads each for a total of 16 cores and 32 threads.
Even though the Xeon E5-2680s are ancient at this point, this workstation’s performance is still on-par with that of the current Intel-based 2020 27-inch iMac.
The M1 Max is faster than the dual-Xeon E5-2680 workstation at rendering my forest scene, and considerably so.
But of course, a comparison with aging Sandy Bridge era Xeons isn’t exactly a fair sporting competition; the M1 Max has almost a decade of improved processor design and die shrinks to give it an advantage.
So, I also tested the M1 Max against… the current generation 2019 Mac Pro, which uses an Intel Xeon W-3245 CPU with 16 cores and 32 threads.
As expected, the M1 Max loses to the 2019 Mac Pro… <em>but not by a lot</em>, and for a fraction of the power used.
The Intel Xeon W-3245 has a 205 watt TDP just for the CPU alone and has to be utilized in a huge desktop tower with an extremely elaborate custom-engineered cooling solution, whereas the M1 Max 14-inch MacBook Pro has a reported whole-system TDP of just 60 watts!</p>
<p>How does Apple pack so much performance with such little energy consumption into their arm64 CPU designs?
A number of factors come into play here, ranging from partnering with TSMC to manufacture on cutting-edge 5 nm process nodes to better microarchitecture design to better software and hardware integration; outside of Apple’s processor engineering labs, all anyone can really do is just hypothesize and guess.
However, there are some good guesses out there!
Several plausible theories have to do with the choice to use the arm64 instruction set; the argument goes that having been originally designed for low-power use cases, arm64 is better suited for efficient energy consumption than x86-64, and scaling up a more efficient design to huge proportions can mean more capable chips that use less power than their traditional counterparts.
Another theory revolving around the arm64 instruction set has to do with microarchitecture design considerations.
The M1, M1 Pro, and M1 Max’s high-performance “Firestorm” cores <a href="https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2">have been observed</a> to have an absolutely humongous reorder buffer, which enables extremely deep out-of-order execution capabilities; modern processors attain a lot of their speed by reordering incoming instructions to do things like hide memory latency and bypass stalled instruction sequences.
The M1 family’s high-performance cores possess an out-of-order window that is around twice as large as that in Intel’s current Willow Cove microarchitecture and around three times as large as that in AMD’s current Zen 3 microarchitecture.
That huge reorder buffer also supports the high degree of instruction-level parallelism in the M1 family’s high-performance cores, which is enabled by extremely wide instruction execution and extremely wide instruction decoding.
While wide instruction decoding is certainly possible on x86-64 and other architectures, scaling wide instruction-issue designs in a low power budget is generally accepted to be a very challenging chip design problem.
The theory goes that arm64’s fixed instruction length and relatively simple instructions make implementing extremely wide decoding and execution far more practical for Apple, compared with what Intel and AMD have to do in order to decode x86-64’s variable length, often complex compound instructions.</p>
<p>So what does any of the above have to do with ray tracing?
One concrete application has to do with opacity mapping in a ray tracing renderer.
Opacity maps are used to produce finer geometric detail on surfaces by using a texture map to specify whether a part of a given surface should actually exist or not.
Implementing opacity mapping in a ray tracer creates a surprisingly large number of design considerations that need to be solved for.
For example, texture lookups are usually done as part of a renderer’s shading system, which in a ray tracer only runs after ray intersection has been carried out.
However, evaluating whether a given hit point against a surface should be ignored <em>after</em> exiting the entire ray traversal system leads to massive inefficiencies due to the need to potentially re-enter the entire ray traversal system from scratch again.
As an example: imagine a tree where all of the leaves are modeled as rectangular cards, and the shape of each leaf is produced using an opacity map on each card.
If the renderer wants to test if a ray hits any part of the tree, and the renderer is architected such that opacity map lookups only happen in the shading system, then the renderer may need to cycle back and forth between the traversal and shading systems for every leaf encountered in a straight line path through the tree (and trees have a lot of leaves!).
An alternative way to handle opacity hits is to allow for direct texture map lookups or to evaluate opacity procedurally from within the traversal system itself, such that the renderer can immediately decide whether to accept a hit or not without having to exit out and run the shading system; this approach is what most renderers use and is what ray tracing libraries like Embree and OptiX largely expect.
However, this method produces a different problem: tight inner loop ray traversal code is now potentially dependent on slow texture fetches from memory!
Both of these approaches to implementing opacity mapping have downsides and potential performance impacts, which is why oftentimes just modeling detail into geometry instead of using opacity mapping can actually result in <em>faster</em> ray tracing performance, despite the heavier geometry memory footprint.
However, opacity mapping is often a lot easier to set up compared with modeling detail into geometry, and this is where a deep out-of-order buffer coupled with good branch prediction can make a big difference in ray tracing performance; these two tools combined can allow the processor to proceed with a certain amount of ray traversal work without having to wait for opacity map decisions.
Problems similar to this, coupled with the lack of out-of-order and speculative execution on GPUs, play a large role in why GPU ray tracing renderers often have to be architected fairly differently from CPU ray tracing renderers, but that’s a topic for another day.</p>
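<p>To make the tradeoff described above a bit more concrete, below is a minimal, self-contained C++ sketch contrasting the two strategies. This is not Takua Renderer’s actual code; every name in it (Ray, Hit, traceNextHit(), opacityExistsAt()) is a hypothetical placeholder, and the traversal and texture lookups are stubbed out. The first function shows the “re-enter traversal from the shading system” approach, while the second function’s signature shows the “filter hits inside traversal” approach that libraries in the style of Embree and OptiX expose:</p>

```cpp
// Minimal sketch (not Takua's actual implementation) of the two opacity-
// mapping strategies discussed above; all types and functions are placeholders.
#include <optional>

struct Ray { float tMin = 0.0f; float tMax = 1e30f; /* origin, direction, ... */ };
struct Hit { float t = 0.0f; int primId = -1; bool hasOpacityMap = false; };

// Stub for the low-level BVH traversal: returns the next candidate hit in
// [ray.tMin, ray.tMax], or nothing if the ray escapes the scene.
std::optional<Hit> traceNextHit(const Ray& ray) { return std::nullopt; }

// Stub for the texture lookup that decides whether the surface actually
// exists at the hit point; in a real renderer this reads an opacity map.
bool opacityExistsAt(const Hit& hit) { return true; }

// Approach 1: opacity is evaluated only after traversal returns, i.e. in the
// shading system. Every rejected hit forces traversal to be re-entered from
// scratch just past the previous hit point, which gets expensive for
// something like a tree full of opacity-mapped leaf cards.
std::optional<Hit> traceWithShadingSideOpacity(Ray ray) {
    while (std::optional<Hit> hit = traceNextHit(ray)) {
        if (!hit->hasOpacityMap || opacityExistsAt(*hit)) {
            return hit;                // accepted hit: hand off to shading
        }
        ray.tMin = hit->t + 1e-4f;     // rejected hit: restart traversal past it
    }
    return std::nullopt;
}

// Approach 2: the opacity test runs inside traversal itself (the pattern that
// Embree/OptiX-style intersection filters enable), so rejected hits never
// leave the inner traversal loop; the tradeoff is that the innermost loop is
// now dependent on potentially slow texture fetches from memory.
template <typename AcceptHitFn>
std::optional<Hit> traceWithInTraversalOpacity(const Ray& ray, AcceptHitFn acceptHit);

int main() {
    Ray ray;
    std::optional<Hit> hit = traceWithShadingSideOpacity(ray);
    return hit ? 0 : 1;  // returns 1 with the stub traversal (no geometry)
}
```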
<p>I give the specific example above because it turns out that the M1 Max’s deep reordering capabilities seem to make a fairly noticeable difference in my Takua Renderer’s performance when opacity maps are used extensively!
In the following rendered image, the ferns have an extremely detailed, complex appearance that depends heavily on opacity maps to cut out leaf shapes from simple underlying geometry.
In this case, I found that the slowdown introduced by using opacity maps in a render on the M1 Max is proportionally much lower than the slowdown introduced when using opacity maps in a render on the x86-64 machines that I tested.
Of course, I have no way of knowing whether the above theory for why the M1 Max seems to handle renders that use opacity maps better is correct, but either way, the end results look very nice and render faster than on any other computer that I have!
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/forest.cam3.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/forest.cam3.0.jpg" alt="Figure 4: Detailed close-up of a fern in the forest scene. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version." /></a></p>
<p>In terms of whether the M1 Pro or the M1 Max is better for CPU rendering, I only have the M1 Max to test, but my guess is that there shouldn’t actually be too large of a difference as long as the scene fits in memory.
However, the above guess comes with a major caveat revolving around memory bandwidth.
Where the M1 Pro and M1 Max differ is in the maximum number of GPU cores and maximum amount of unified memory configurable; the M1 Pro can go up to 16 GPU cores and 32 GB of RAM, while the M1 Max can go up to 32 GPU cores and 64 GB of RAM.
Outside of the GPU and maximum amount of memory, the M1 Pro and M1 Max chips actually share identical CPU configurations: both of them have a 10-core arm64 CPU with 8 high-performance cores and 2 energy-efficient cores, implementing a custom in-house Apple-designed microarchitecture.
However, for some workloads, I would not be surprised if the M1 Max is actually slightly faster, since the M1 Max also has twice the memory bandwidth of the M1 Pro (400 GB/s on the M1 Max versus 200 GB/s on the M1 Pro); this difference comes from the M1 Max having twice the number of memory controllers.
While consumer systems such as game consoles and desktop GPUs often do ship with memory bandwidth numbers comparable to or even better than the M1 Max’s 400 GB/s, seeing these levels of memory bandwidth in even workstation CPUs is relatively unheard of.
For example, AMD’s monster flagship Ryzen Threadripper 3990X is currently the most powerful high-end desktop CPU on the planet (outside of server processors), but the 3990X’s maximum memory bandwidth tops out at 95.37 GiB/s, or about 102.4 GB/s; seeing the M1 Max MacBook Pro ship with almost four times the memory bandwidth of the Threadripper 3990X is pretty wild.
The M1 Max also has twice the amount of system-level cache as the M1 Pro; on the M1 family of chips, the system-level cache is loosely analogous to L3 cache on other processors, but serves the entire system instead of just the CPU cores.</p>
<p>Production-grade CPU ray tracing is a process that depends heavily on being able to pin fast CPU cores at close to 100% utilization for long periods of time, while accessing extremely large datasets from system memory.
In an ideal world, intensive computational tasks should be structured in such a way that data can be pulled from memory in a relatively coherent, predictable manner, allowing the CPU cores to rely on data in cache over fetching from main memory as much as possible.
Unfortunately, making ray tracing coherent enough to utilize cache well is an extremely challenging problem.
Operations such as BVH traversal, which finds the closest point in a scene that a ray intersects, essentially represent an arbitrarily random walk through potentially vast amounts of geometry stored in memory, and any kind of incoherent walk through memory makes overall CPU performance dependent on memory performance.
As a result, operations like BVH traversal tend to be heavily bottlenecked by memory latency and memory bandwidth.
I expect that the M1 Max’s strong memory bandwidth numbers should provide some performance boost for rendering compared to the M1 Pro.
A complicating factor, however, is <a href="https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2">how the additional memory bandwidth on the M1 Max is utilized</a>; not all of it is available to just the CPU, since the M1 Max’s unified memory needs to also serve the system’s GPU, neural processing systems, and other custom onboard logic blocks.
The actual real-world impact should be easily testable by rendering the same scene on a M1 Pro and a M1 Max chip both with 32 GB of RAM, but in the week that I’ve had to test the M1 Max so far, I haven’t had the time or ability to be able to carry out this test on my own.
Stay tuned; I’ll update this post if I am able to try this test soon!</p>
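<p>As a quick aside for readers who haven’t written a renderer before, below is a minimal sketch of the kind of BVH walk described a few sentences ago. This is not Takua Renderer’s actual traversal code; the node layout, stack size, and intersection tests are all hypothetical stand-ins. The thing to notice is that each loop iteration fetches a node whose location depends on the contents of the previously fetched node, which is exactly the kind of data-dependent, incoherent access pattern that makes memory latency and memory bandwidth matter so much:</p>

```cpp
// Minimal sketch (not Takua's actual code) of iterative closest-hit BVH
// traversal with an explicit stack; every name here is a placeholder.
#include <cstdint>
#include <vector>

// Each node stores an axis-aligned bounding box plus either child indices
// (inner node) or a primitive range (leaf).
struct BVHNode {
    float boundsMin[3] = {0.0f, 0.0f, 0.0f};
    float boundsMax[3] = {0.0f, 0.0f, 0.0f};
    int32_t leftChild = -1;   // -1 marks a leaf
    int32_t rightChild = -1;
    int32_t primStart = 0;    // leaf-only: primitives to intersect
    int32_t primCount = 0;
};

struct Ray { float org[3]; float dir[3]; float tMax; };

// Placeholder tests; a real renderer uses vectorized slab tests and actual
// primitive intersections here.
bool intersectAABB(const BVHNode& node, const Ray& ray) { return true; }
void intersectLeafPrimitives(const BVHNode& node, Ray& ray) {}

void traverse(const std::vector<BVHNode>& nodes, Ray& ray) {
    if (nodes.empty()) { return; }
    int32_t stack[128];                                  // assumes bounded tree depth
    int stackTop = 0;
    stack[stackTop++] = 0;                               // start at the root
    while (stackTop > 0) {
        const BVHNode& node = nodes[stack[--stackTop]];  // data-dependent fetch: on a
        if (!intersectAABB(node, ray)) { continue; }     // huge, deep BVH this is often
        if (node.leftChild < 0) {                        // a cache miss out to memory
            intersectLeafPrimitives(node, ray);          // leaf: test primitives
        } else {
            stack[stackTop++] = node.leftChild;          // inner node: push children
            stack[stackTop++] = node.rightChild;
        }
    }
}

int main() {
    std::vector<BVHNode> nodes(1);  // trivial one-leaf "scene"
    Ray ray = {{0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 1.0f}, 1e30f};
    traverse(nodes, ray);
    return 0;
}
```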
<p>I’m very curious to see if the increased memory bandwidth on the M1 Max will make a difference over the M1 Pro on this forest scene in particular, due to how dense some of the geometry is and therefore how deep some of the BVHs have to go.
For example, every single pine needle in this next image is individually modeled geometry, and every tree trunk has sub-pixel-level tessellation and displacement; being able to render this image on a MacBook Pro instead of a giant workstation is incredible:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/forest.cam1.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/forest.cam1.0.jpg" alt="Figure 5: Forest canopy made up of pine trees, with every pine needle modeled as geometry. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version." /></a></p>
<p>In the previous posts about running Takua Renderer on arm64 processors, I included performance testing results across a wide variety of machines ranging from the Raspberry Pi 4B to the M1 Mac Mini all the way up to my dual Intel Xeon E5-2680 workstation.
However, all of those tests weren’t necessarily indicative of what real world rendering performance on huge scenes would be like, since all of those tests had to use scenes that were small enough to fit in to a M1 Mac Mini’s 16 GB memory footprint.
Now that I have access to a M1 Max MacBook Pro with 64 GB of memory, I can present some initial performance comparisons with larger machines rendering my forest scene.
I think these results are likely more indicative of what real-world production rendering performance looks like, since the forest scene is the closest thing I have to true production complexity (I haven’t ported Disney’s Moana Island data set to work in my renderer yet).
<p>The machines I tested this time are a 2021 14-inch MacBook Pro with an Apple M1 Max chip with 10 cores (8 performance, 2 efficiency) and 10 threads, a 2019 16-inch MacBook Pro with an Intel Core i7-9750H CPU with 6 cores and 12 threads, a 2019 Mac Pro with an Intel Xeon W-3245 CPU with 16 cores and 32 threads, and a Linux workstation with dual Intel Xeon E5-2680 CPUs with 8 cores and 16 threads per CPU for a total of 16 cores and 32 threads.
The Xeon E5-2680 workstation is, quite frankly, ancient, and makes for something of a strange comparison point, but it’s the main workstation that I use for personal rendering projects at the moment, so I included it.
I don’t exactly have piles of the latest server and workstation chips just laying around my house, so I had to work with what I got!
However, I was also able to borrow access to a Windows workstation with an AMD Threadripper 3990X CPU, which weighs in with 64 cores and 128 threads.
I figured that the Threadripper 3990X system is not at all a fair comparison point for the exact opposite reason why the Xeon E5-2680 is not a fair comparison point, but I thought I’d throw it in anyway out of sheer curiosity.
Notably, the regular Apple M1 chip does not make an appearance in these tests, since the forest scene doesn’t fit in memory on the M1.
I also borrowed a friend’s Razer Blade 15 to test, but wound up not using it since I discovered that it has the same Intel Core i7-9750H CPU as the 2019 16-inch MacBook Pro, but only has half the memory and therefore can’t fit the scene.</p>
<p>In the case of the two MacBook Pros, I did all tests twice: once with the laptops plugged in, and once with the laptops running entirely on battery power.
I wanted to compare plugged-in versus battery performance because of Apple’s claim that the new M1 Pro/Max based MacBook Pros perform the same whether plugged-in or on battery.
This claim is actually a huge deal; laptops traditionally have had to throttle down CPU performance when unplugged to conserve battery life, but the energy efficiency of Apple Silicon allows Apple to no longer have to do this on M1-family laptops.
I wanted to verify this claim for myself!</p>
<div id="results"></div>
<p>In the results below, I present three tests using the forest scene.
The first test measures how long Takua Renderer takes to run subdivision, tessellation, and displacement, which has to happen before any pixels can actually be rendered.
The subdivision/tessellation/displacement process has an interesting performance profile that looks very different from the performance profile of the main path tracing process.
Subdivision within a single mesh is not easily parallelizable, and even with a parallel implementation, scales very poorly beyond just a few threads.
Takua Renderer attempts to scale subdivision widely by running subdivision on multiple meshes in parallel, with each mesh’s subdivision task only receiving an allocation of at most four threads.
As a result, the subdivision step actually benefits slightly more from single-threaded performance over a larger number of cores and greater multi-threaded performance.
The second test is rendering the main view of the forest scene from my mipmapping blog post, at 1920x1080 resolution.
I chose to use 1920x1080 resolution since this is a more common maximum resolution to be working at during artistic iteration.
The third test is rendering the fern view of the forest scene from Figure 4 of this post, at final 4K 3840x2160 resolution.
For both of the main rendering tests, I only ran the renderer for 8 samples per pixel, since I didn’t want to sit around for days to collect all of the data.
For each test, I did five runs, discarded the highest and lowest results, and averaged the remaining three results to get the numbers below.
Wall time (as in a clock on a wall) measures the actual amount of real-world time that each test took, while core-seconds is an approximation of how long each test would have taken running on a single core.
So, wall time can be thought of as a measure of total computation <em>power</em>, whereas core-seconds is more a measure of computational <em>efficiency</em>; in both cases, lower numbers are better:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">Forest Subdivision/Displacement</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Apple M1 Max (Plugged in):</td>
<td style="text-align: center">128 s</td>
<td style="text-align: left">approx 1280 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Max (Battery):</td>
<td style="text-align: center">128 s</td>
<td style="text-align: left">approx 1280 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Plugged in):</td>
<td style="text-align: center">289 s</td>
<td style="text-align: left">approx 3468 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Battery):</td>
<td style="text-align: center">307 s</td>
<td style="text-align: left">approx 3684 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon W-3245:</td>
<td style="text-align: center">179 s</td>
<td style="text-align: left">approx 5728 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">222 s</td>
<td style="text-align: left">approx 7104 s</td>
</tr>
<tr>
<td style="text-align: right">AMD Threadripper 3990X:</td>
<td style="text-align: center">146 s</td>
<td style="text-align: left">approx 18688 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">Forest Rendering (Main Camera)</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, 8 spp, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Apple M1 Max (Plugged in):</td>
<td style="text-align: center">127.143 s</td>
<td style="text-align: left">approx 1271.4 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Max (Battery):</td>
<td style="text-align: center">126.421 s</td>
<td style="text-align: left">approx 1264.2 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Plugged in):</td>
<td style="text-align: center">288.089 s</td>
<td style="text-align: left">approx 3457.1 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Battery):</td>
<td style="text-align: center">347.898 s</td>
<td style="text-align: left">approx 4174.8 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon W-3245:</td>
<td style="text-align: center">106.332 s</td>
<td style="text-align: left">approx 3402.6 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">158.255 s</td>
<td style="text-align: left">approx 5064.2 s</td>
</tr>
<tr>
<td style="text-align: right">AMD Threadripper 3990X:</td>
<td style="text-align: center">38.887 s</td>
<td style="text-align: left">approx 4977.5 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">Forest Rendering (Fern Camera)</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">3840x2160, 8 spp, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Apple M1 Max (Plugged in):</td>
<td style="text-align: center">478.247 s</td>
<td style="text-align: left">approx 4782.5 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Max (Battery):</td>
<td style="text-align: center">496.384 s</td>
<td style="text-align: left">approx 4963.8 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Plugged in):</td>
<td style="text-align: center">1084.504 s</td>
<td style="text-align: left">approx 13014.0 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Battery):</td>
<td style="text-align: center">1219.59 s</td>
<td style="text-align: left">approx 14635.1 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon W-3245:</td>
<td style="text-align: center">345.292 s</td>
<td style="text-align: left">approx 11049.3 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">576.279 s</td>
<td style="text-align: left">approx 18440.9 s</td>
</tr>
<tr>
<td style="text-align: right">AMD Threadripper 3990X:</td>
<td style="text-align: center">108.2596 s</td>
<td style="text-align: left">approx 13857.2 s</td>
</tr>
</tbody>
</table>
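<p>For anyone curious about the exact bookkeeping behind the numbers above, here is a small sketch of the kind of post-processing involved; this is not the actual test harness, and the sample timings in it are made up. It simply sorts the five wall-time samples, discards the fastest and slowest, averages the remaining three, and approximates core-seconds as wall time multiplied by the number of hardware threads used, which is consistent with the values in the tables above:</p>

```cpp
// Small sketch (not the actual test harness) of the benchmark bookkeeping
// described earlier; the run times below are made-up example numbers.
#include <algorithm>
#include <cstdio>
#include <vector>

// Drop the fastest and slowest runs and average the rest.
double trimmedMean(std::vector<double> runsInSeconds) {
    std::sort(runsInSeconds.begin(), runsInSeconds.end());
    double sum = 0.0;
    for (size_t i = 1; i + 1 < runsInSeconds.size(); ++i) {
        sum += runsInSeconds[i];
    }
    return sum / static_cast<double>(runsInSeconds.size() - 2);
}

int main() {
    // Hypothetical wall-time samples from five runs on a 10-thread processor.
    std::vector<double> runs = {128.9, 127.0, 126.8, 127.6, 129.4};
    const int hardwareThreads = 10;
    double wallTime = trimmedMean(runs);
    double coreSeconds = wallTime * hardwareThreads;
    std::printf("wall time: %.3f s, core-seconds: approx %.1f s\n", wallTime, coreSeconds);
    return 0;
}
```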
<p>When rendering the main camera view, the 2021 14-inch MacBook Pro used on average about 7% of its battery charge, while the 2019 16-inch MacBook Pro used on average about 39% of its battery charge.
When rendering the fern view, the 2021 14-inch MacBook Pro used on average about 19% of its battery charge, while the 2019 16-inch MacBook Pro used on average about 48% of its battery charge.
Overall by every metric, the 2021 14-inch MacBook Pro achieves an astounding victory over the 2019 16-inch MacBook Pro: a little over twice the performance for a fraction of the total power consumption.
The 2021 14-inch MacBook Pro also lives up to Apple’s claim of identical performance plugged in and on battery power, whereas in the results above, the 2019 16-inch MacBook Pro suffers anywhere from roughly a 5% to a 20% performance hit just from switching to battery power.
The 2021 14-inch MacBook Pro’s performance win is even more astonishing when considering that the 2019 16-inch MacBook Pro is the previous flagship that the new M1 Pro/Max MacBook Pros are the direct successors to.
Seeing this kind of jump in a single hardware generation is unheard of in modern tech and represents a massive win for both Apple and for the arm64 ISA.
The M1 Max also handily beats the old dual Intel Xeon E5-2680 that I am currently using by a comfortable margin; for my personal workflow, this means that I can now do everything that I previously needed a large loud power-hungry workstation for on the 2021 14-inch MacBook Pro, and I can do everything <em>faster</em> on the 2021 14-inch MacBook Pro too.</p>
<p>The real surprises to me came with the 2019 Mac Pro and the Threadripper 3990X workstation.
In both of those cases, I expected the M1 Max to lose, but the 2021 14-inch MacBook Pro came surprisingly close to the 2019 Mac Pro’s performance in terms of wall time.
Even more importantly as a predictor of future scalability, the M1 Max’s efficiency as measured by core-seconds is far, far superior to that of both the Intel Xeon W-3245 and the AMD Threadripper 3990X.
Imagining what a hypothetical future Apple Silicon iMac or Mac Pro could do with an even more scaled-up M1 variant, or perhaps some kind of multi-M1 Max chiplet or multi-socket solution, is extremely exciting!
I think that with the upcoming Apple Silicon based large iMac and Mac Pro, Apple has a real shot at beating both Intel and AMD’s highest end CPUs to win the absolute workstation performance crown.</p>
<p>Of course, what makes the M1 Max’s performance numbers possible is the M1 Max’s energy efficiency; this kind of performance-per-watt is simply unparalleled in the desktop (meaning non-mobile, not desktop form factor) processor world.
The M1 architecture’s energy efficiency is what allows Apple to scale the design out into the M1 Pro and M1 Max and hopefully beyond.
Below is a breakdown of energy utilization for each of the rendering tests above; the total energy used for each render is the wall clock render time multiplied by the maximum TDP of each processor to get watt-seconds, which is then translated to watt-hours.
I assume maximum TDP for each processor since I ran Takua Renderer with processor utilization set to 100%.
For the two MacBook Pros, I’m just reporting the plugged-in results.</p>
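<p>As a concrete example of this calculation: the M1 Max rendered the main camera view in 127.143 seconds at an assumed whole-system maximum TDP of 60 watts, which works out to 127.143 s × 60 W ≈ 7629 watt-seconds, or about 2.12 watt-hours, which is the first entry in the table below.</p>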
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">Forest Rendering (Main Camera)</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, 8 spp, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Apple M1 Max:</td>
<td style="text-align: center">60 W</td>
<td style="text-align: left">2.1191 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">3.6011 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon W-3245:</td>
<td style="text-align: center">205 W</td>
<td style="text-align: left">6.0550 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">11.4295 Wh</td>
</tr>
<tr>
<td style="text-align: right">AMD Threadripper 3990X:</td>
<td style="text-align: center">280 W</td>
<td style="text-align: left">3.0246 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">Forest Rendering (Fern Camera)</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">3840x2160, 8 spp, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Apple M1 Max:</td>
<td style="text-align: center">60 W</td>
<td style="text-align: left">7.9708 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">13.5563 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon W-3245:</td>
<td style="text-align: center">205 W</td>
<td style="text-align: left">19.6625 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">41.6202 Wh</td>
</tr>
<tr>
<td style="text-align: right">AMD Threadripper 3990X:</td>
<td style="text-align: center">280 W</td>
<td style="text-align: left">8.4202 Wh</td>
</tr>
</tbody>
</table>
<p>At least for my rendering use case, the Apple M1 Max is easily the most energy efficient processor, even without taking into account that the 60 W TDP of the M1 Max is for the entire system-on-a-chip including CPU, GPU, and more, while the TDPs for all of the other processors are <em>just</em> for a CPU and don’t take into account the rest of the system.
The M1 Max manages to beat the 2019 16-inch MacBook Pro’s Intel Core i7-9750H in absolute performance by a factor of two while using between half and two-thirds of the energy, and the M1 Max comes close to matching the 2019 Mac Pro’s absolute performance while using about a third of the energy.
Of course the comparison with the Intel Xeon E5-2680 workstation isn’t exactly fair since the M1 Max is manufactured using a 5 nm process while the ancient Intel Xeon E5-2680s were manufactured on a 32 nm process a decade ago, but I think the comparison still underscores just how far processors have advanced over the past decade leading up to the M1 Max.
The only processor that really comes near the M1 Max in terms of energy efficiency is the AMD Threadripper 3990X, which makes sense since the AMD Threadripper 3990X and the M1 Max are the closest cousins in this list in terms of manufacturing process; both are using leading-edge TSMC photolithography.
However, on the whole, the M1 Max is still more efficient than the AMD Threadripper 3990X, and again, the AMD Threadripper 3990X TDP is for just a CPU, not an entire SoC!
Assuming near-linear scaling, a hypothetical M1-derived variant that is scaled up 4.5 times to a 270 W TDP should be able to handily defeat the AMD Threadripper 3990X in absolute performance.</p>
<p>The wider takeaway here, though, is that in order to give the M1 Max some real competition, one has to skip laptop chips entirely and reach past high-end desktop chips to server-class workstation hardware.
For workloads that push the CPU to maximum utilization for sustained periods of time, such as production-quality path traced rendering, the M1 Max represents a fundamental shift in what is possible in a laptop form factor.
Something even more exciting to think about is how the M1 Max really is the <em>middle</em> tier Apple Silicon solution; presumably the large iMac and Mac Pro will push things into even more absurd territory.</p>
<p>So those are my initial thoughts on the Apple M1 Max chip and my initial experiences with getting my hobby renderer up and running on the 2021 14-inch MacBook Pro.
I’m extremely impressed, and not just with the chip!
This post mostly focused on the chip itself, but the rest of the 2021 MacBook Pro lineup is just as impressive.
For rendering professionals and enthusiasts alike, one aspect of the 2021 MacBook Pros that will likely be just as important as the processor is the incredible screen.
The 2021 MacBook Pros ship with what I believe is an industry first: a mini-LED backlit 120 Hz display with an extended dynamic range that can go up to 1600 nits peak brightness.
The screen is absolutely gorgeous, which is a must for anyone who spends their time generating pixels with a 3D renderer!
One thing on my to-do list was to add extended dynamic range support to <a href="https://tom94.net">Thomas Müller</a>’s excellent <a href="https://github.com/Tom94/tev">tev image viewer</a>, which is a popular tool in the rendering research community.
However, it turns out that Thomas already added extended dynamic range support, and it looks amazing on the 2021 MacBook Pro’s XDR display.</p>
<p>In this post I didn’t go into the M1 Max’s GPU at all, even though the GPU in many ways might actually be even more interesting than the CPU (which is saying a lot considering how interesting the CPU is).
On paper at least, the M1 Max’s GPU aims for roughly mobile NVIDIA GeForce RTX 3070 performance, but how the M1 Max and a mobile NVIDIA GeForce RTX 3070 will actually compare for ray traced rendering is difficult to say without conducting some tests.
On one hand, the M1 Max’s unified memory architecture gives its GPU access to far more memory than any NVIDIA mobile GPU, and the unified memory architecture opens up a wide variety of interesting optimizations that are otherwise difficult to do when managing separate pools of CPU and GPU memory.
On the other hand though, the M1 Max’s GPU lacks the dedicated hardware ray tracing acceleration that modern NVIDIA and AMD GPUs and the upcoming Intel discrete GPUs all have, and in my experience so far, dedicated hardware ray tracing acceleration makes a huge difference in GPU ray tracing performance.
Maybe Apple will add hardware ray tracing acceleration in the future; Metal already has software ray tracing APIs, and there already is a precedent for Apple Silicon including dedicated hardware for accelerating relatively niche, specific professional workflows.
As an example, the M1 Pro and M1 Max include hardware ProRes acceleration for high-end video editing.
Over the next year, I am undertaking a large-scale effort to port the entirety of Takua Renderer to work on GPUs through CUDA on NVIDIA GPUs, and through Metal on Apple Silicon devices.
Even though I’ve just gotten started on this project, I’ve already learned a lot of interesting things comparing CUDA and Metal compute; I’ll have much more to say on the topic hopefully soon!</p>
<p>Beyond the CPU and GPU and screen, there are still even more other nice features that the new MacBook Pros have for professional workflows like high-end rendering, but I’ll skip going through them in this post since I’m sure they’ll be thoroughly covered by all of the various actual tech reviewers out on the internet.</p>
<p>To conclude for now, here are two more bonus images that I rendered on the M1 Max 14-inch MacBook Pro.
I originally planned on just rendering the earlier three images in this post, but to my surprise, I found that I had enough time to do a few more!
I think that kind of encapsulates the M1 Pro and M1 Max MacBook Pros in a nutshell: I expected incredible performance, but was surprised to find even my high expectations met and surpassed.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/forest.cam4.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/forest.cam4.0.jpg" alt="Figure 6: A mossy log, ferns, and debris on the forest floor. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/forest.cam5.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/forest.cam5.0.jpg" alt="Figure 7: Sunlight transmitting through pine leaves in the forest canopy. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version." /></a></p>
<p>A huge thanks to everyone at Apple that made this post possible!
Also a big thanks to Rajesh Sharma and Mark Lee for catching typos and making some good suggestions.</p>
https://blog.yiningkarlli.com/2021/09/neon-vs-sse.html
Comparing SIMD on x86-64 and arm64
2021-09-07T00:00:00+00:00
2021-09-07T00:00:00+00:00
Yining Karl Li
<p>I recently wrote a big two-part series about a ton of things that I learned throughout the process of porting my hobby renderer, Takua Renderer, to 64-bit ARM.
In the <a href="https://blog.yiningkarlli.com/2021/07/porting-takua-to-arm-pt2.html">second part</a>, one of the topics I covered was how the Embree ray tracing kernels library gained arm64 support by utilizing the sse2neon project to emulate x86-64 SSE2 SIMD instructions using arm64’s Neon SIMD instructions.
In the second part of the series, I had originally planned on diving much deeper into comparing writing vectorized code using SSE intrinsics versus using Neon intrinsics versus other approaches, but the comparison write-up became so large that I wound up leaving it out of the original post with the intention of making the comparison into its own standalone post.
This post is that standalone comparison!</p>
<p>As discussed in my porting to arm64 series, a huge proportion of the raw compute power in modern CPUs is located in vector <a href="https://en.wikipedia.org/wiki/SIMD">SIMD instruction set extensions</a>, and lots of things in computer graphics happen to be workload types that fit vectorization very well.
Long-time readers of this blog probably already know what SIMD instructions do, but for the unfamiliar, here’s a very brief summary.
SIMD stands for <em>single instruction, multiple data</em>, and is a type of parallel programming that exploits <em>data level parallelism</em> instead of concurrency.
What the above means is that, unlike multithreading, in which multiple different streams of instructions simultaneously execute on different cores over different pieces of data, in a SIMD program, a single instruction stream executes simultaneously over different pieces of data.
For example, a 4-wide SIMD multiplication instruction would simultaneously execute a single multiply instruction over four pairs of numbers; each pair is multiplied together at the same time as the other pairs.
SIMD processing makes processors more powerful by allowing the processor to process more data within the same clock cycle; many modern CPUs implement SIMD extensions to their base scalar instruction sets, and modern GPUs are at a very high level broadly similar to huge ultra-wide SIMD processors.</p>
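<p>As a small concrete illustration of the 4-wide multiply example above, here’s a minimal sketch comparing a plain scalar loop with the equivalent single SSE multiply; the Neon version would look the same but using vld1q_f32, vmulq_f32, and vst1q_f32 instead (much more on intrinsics later in this post):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <xmmintrin.h> // SSE intrinsics (x86-64 only)

// Scalar: four separate multiplies, processing one pair of floats at a time.
void multiply4Scalar(const float* a, const float* b, float* out) {
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] * b[i];
    }
}

// SIMD: a single 4-wide multiply instruction processes all four pairs at once.
void multiply4SSE(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));
}
</code></pre></div></div>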
<p>Multiple approaches exist today for writing vectorized code.
The four main ways available today are: directly write code using SIMD assembly instructions, write code using compiler-provided vector intrinsics, write normal scalar code and rely on compiler auto-vectorization to emit vectorized assembly, or write code using ISPC: the Intel SPMD Program Compiler.
Choosing which approach to use for a given project requires considering many different tradeoffs and factors, such as ease of programming, performance, and portability.
Since this post is looking at comparing SSE2 and Neon, portability is especially interesting here.
Auto-vectorization and ISPC are the most easily portable approaches, while vector intrinsics can be made portable using sse2neon, but each of these approaches requires different trade-offs in other areas.</p>
<p>In this post, I’ll compare vectorizing the same snippet of code using several different approaches.
On x86-64, I’ll compare implementations using SSE intrinsics, using auto-vectorization, and using ISPC emitting SSE assembly.
On arm64, I’ll compare implementations using Neon intrinsics, using SSE intrinsics emulated on arm64 using sse2neon, using auto-vectorization, and using ISPC emitting Neon assembly.
I’ll also evaluate how each approach does in balancing portability, ease-of-use, and performance.</p>
<p><strong>4-wide Ray Bounding Box Intersection</strong></p>
<p>For my comparisons, I wanted to use a small but practical real-world example.
I wanted something small since I wanted to be able to look at the assembly output directly, and keeping things smaller makes the assembly output easier to read all at once.
However, I also wanted something real-world to make sure that whatever I learned wasn’t just the result of a contrived artificial example.
The comparison that I picked is a common operation inside of ray tracing: 4-wide ray bounding box intersection.
By 4-wide, I mean intersecting the same ray against four bounding boxes at the same time.
Ray bounding box intersection tests are a fundamental operation in BVH traversal, and typically account for a large proportion (often a majority) of the computational cost in ray intersection against the scene.
Before we dive into code, here’s some background on BVH traversal and the role that 4-wide ray bounding box intersection plays in modern ray tracing implementations.</p>
<p>Acceleration structures are a critical component of ray tracing; tree-based acceleration structures convert tracing a ray against a scene from being a O(<em>N</em>) problem into a O(<em>log(N)</em>) problem, where <em>N</em> is the number of objects that are in the scene.
For scenes with lots of objects and for objects made up of lots of primitives, lowering the worst-case complexity of ray intersection from linear to logarithmic is what makes the difference between ray tracing being impractical and practical.
From roughly the late 90s through to the early 2010s, a number of different groups across the graphics field put an enormous amount of research and effort into establishing the best possible acceleration structures.
Early on, the broad general consensus was that KD-trees were the most efficient acceleration structure for ray intersection performance, while BVHs were known to be faster to build than KD-trees but less performant at actual ray intersection.
However, advancements over time improved BVH ray intersection performance <a href="https://doi.org/10.1145/1572769.1572771">[Stich et al. 2009]</a> to the point where today, BVHs are now the dominant acceleration structure used in pretty much every production ray tracing solution.
For a history and detailed survey of BVH research over the past twenty-odd years, please refer to Meister et al. <a href="https://doi.org/10.1111/cgf.142662">[2021]</a>.
One interesting thing to note when looking through the past twenty years of ray tracing acceleration research are the author names; many of these authors are the same people that went on to create the modern underpinnings of Embree, Optix, and the ray acceleration hardware found in NVIDIA’s RTX GPUs.</p>
<p>A BVH is a tree structure where bounding boxes are placed over all of the objects that need to be intersected, and then these bounding boxes are grouped into (hopefully) spatially local groups.
Each group is then enclosed in another bounding box, and these boxes are grouped again, and so on and so forth until a top-level bounding box is reached that contains everything below.
In university courses, BVHs are traditionally taught as being binary trees, meaning that each node within the tree structure bounds two children nodes.
Binary BVHs are the simplest possible BVH to build and implement, hence why they’re usually the standard version taught in schools.
However, the actual branching factor at each BVH node doesn’t have to be binary; the branching factor can be any integer greater than 2.
BVHs with 4-wide and even 8-wide branching factors have largely come to dominate production usage today.</p>
<p>The reason production BVHs today tend to have wide branching factors originates in the need to vectorize BVH traversal in order to utilize the maximum possible performance of SIMD-enabled CPUs.
Early attempts at vectorizing BVH traversal centered around tracing groups, or packets, of multiple rays through a BVH together; packet tracing allows for simultaneously intersecting N rays against a single bounding box at each node in the hierarchy <a href="https://doi.org/10.1111/1467-8659.00508">[Wald et al. 2001]</a>, where N is the vector width.
However, packet tracing only really works well for groups of rays that are all going in largely the same direction from largely the same origin; for incoherent rays, divergence in the traversal path each incoherent ray needs to take through the BVH destroys the efficacy of vectorized packet traversal.
To solve this problem, several papers concurrently proposed a different solution to vectorizing BVH traversal <a href="https://doi.org/10.1109/RT.2008.4634620">[Wald et al. 2008</a>, <a href="https://doi.org/10.1109/RT.2008.4634618">Ernst and Greiner 2008</a>, <a href="https://doi.org/10.1111/j.1467-8659.2008.01261.x">Dammertz et al. 2008]</a>: instead of simultaneously intersecting N rays against a single bounding box, this new solution simultaneously intersects a single ray against N bounding boxes.
Since the most common SIMD implementations are at least 4 lanes wide, BVH implementations that want to take maximum advantage of SIMD hardware also need to be able to present four bounding boxes at a time for vectorized ray intersection, hence the move from a splitting factor of 2 to a splitting factor of 4 or even wider.
In addition to being more performant when vectorized, a 4-wide splitting factor also tends to reduce the depth and therefore memory footprint of BVHs, and 4-wide BVHs have also been demonstrated to be able to outperform 2-wide BVHs even without vectorization <a href="https://psychopath.io/post/2017_08_03_bvh4_without_simd">[Vegdahl 2017]</a>.
Vectorized 4-wide BVH traversal can also be combined with the previous packet approach to yield even better performance for coherent rays <a href="https://doi.org/10.1145/1572769.1572793">[Tsakok 2009]</a>.</p>
<p>All of the above factors combined are why BVHs with wider branching factors are more popularly used today on the CPU; for example, the widely used Embree library <a href="https://doi.org/10.1145/2601097.2601199">[Wald et al. 2014]</a> offers 4-wide as the <em>minimum</em> supported split factor, and supports even wider split factors when vectorizing using wider AVX instructions.
On the GPU, the story is similar, although a little bit more complex since the GPU’s SIMT (as opposed to SIMD) parallelism model changes the relative importance of being able to simultaneously intersect one ray against multiple boxes.
GPU ray tracing systems today use a variety of different split factors; AMD’s RDNA2-based GPUs implement hardware acceleration for a 4-wide BVH <a href="https://gpuopen.com/rdna2-isa-available/">[AMD 2020]</a>.
NVIDIA does not publicly disclose what split factor their RTX GPUs assume in hardware, since their various APIs for accessing the ray tracing hardware are designed to allow for changing out for different, better future techniques under the hood without modification to client applications.
However, we can guess that support for multiple different splitting factors seems likely given that Optix 7 uses different splitting factors depending on whether an application wants to prioritize BVH construction speed or BVH traversal speed <a href="https://raytracing-docs.nvidia.com/optix7/guide/index.html">[NVIDIA 2021]</a>.
While not explicitly disclosed, as of writing, we can reasonably guess based off of what Optix 6.x implemented that Optix 7’s fast construction mode implements a TRBVH <a href="https://doi.org/10.1145/2492045.2492055">[Karras and Aila 2013]</a>, which is a binary BVH, and that Optix 7’s performance-optimized mode implements an 8-wide BVH with compression <a href="https://doi.org/10.1145/3105762.3105773">[Ylitie et al. 2017]</a>.</p>
<p>Since the most common splitting factor in production CPU cases is a 4-wide split, and since SSE and Neon are both 4-wide vector instruction sets, I think the core simultaneous single-ray-4-box intersection test is a perfect example case to look at!
To start off, we need an efficient intersection test between a single ray and a single axis-aligned bounding box.
I’ll be using the commonly utilized solution by Williams et al. <a href="https://doi.org/10.1080/2151237X.2005.10129188">[2005]</a>; improved techniques with better precision <a href="http://jcgt.org/published/0002/02/02/">[Ize 2013]</a> and more generalized flexibility <a href="http://jcgt.org/published/0007/03/04/">[Majercik 2018]</a> do exist, but I’ll stick with the original Williams approach in this post to keep things simple.</p>
<p><strong>Test Program Setup</strong></p>
<p>Everything in this post is implemented in a small test program that I have <a href="https://github.com/betajippity/sseneoncompare">put in an open Github repository</a>, licensed under the Apache-2.0 License.
Feel free to clone the repository for yourself to follow along using or to play with!
To build and run the test program yourself, you will need a version of <a href="https://cmake.org">CMake</a> that has ISPC support (so, CMake 3.19 or newer), a modern C++ compiler with support for C++17, and a version of <a href="https://ispc.github.io">ISPC</a> that supports Neon output for arm64 (so, ISPC v1.16.1 or newer); further instructions for building and running the test program are included in the repository’s README.md file.
The test program compiles and runs on both x86-64 and arm64; on each processor architecture, the appropriate implementations are automatically chosen for compilation.</p>
<p>The test program runs each single-ray-4-box intersection test implementation N times, where N is an integer that can be set by the user as the first input argument to the program.
By default, and for all results in this post, N is set to 100000 runs.
The four bounding boxes that the intersection tests run against are hardcoded into the test program’s main function and are reused for all N runs.
Since the bounding boxes are hardcoded, I had to take some care to make sure that the compiler wasn’t going to pull any optimization shenanigans and not actually run all N runs.
To make sure of the above, the test program is compiled in two separate pieces: all of the actual ray-bounding-box intersection functions are compiled into a static library using <code class="language-plaintext highlighter-rouge">-O3</code> optimization, and then the test program’s main function is compiled separately with all optimizations disabled, and then the intersection functions static library is linked in.</p>
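<p>For reference, here’s a rough sketch of what the timing loop described above looks like conceptually; this is not the actual harness code from the repository, and intersectFourBoxes() is just a made-up placeholder standing in for any one of the implementations compiled into the optimized static library:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <chrono>
#include <cstdio>

// Made-up placeholder standing in for one of the ray-box intersection
// implementations from the -O3 static library; not code from the repository.
void intersectFourBoxes() { /* ...ray-box intersection work... */ }

int main() {
    const int numRuns = 100000; // default run count used for all results in this post
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < numRuns; i++) {
        intersectFourBoxes();
    }
    auto end = std::chrono::steady_clock::now();
    double totalNs = std::chrono::duration<double, std::nano>(end - start).count();
    // Report the average time per run, which is what the results tables show.
    std::printf("average: %f ns per run\n", totalNs / numRuns);
    return 0;
}
</code></pre></div></div>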
<p>Ideally I would have liked to set up the project to compile directly to a Universal Binary on macOS, but unfortunately CMake’s built-in infrastructure for compiling multi-architecture binaries doesn’t really work with ISPC at the moment, and I was too lazy to manually set up custom CMake scripts to invoke ISPC multiple times (once for each target architecture) and call the macOS <code class="language-plaintext highlighter-rouge">lipo</code> tool; I just compiled and ran the test program separately on an x86-64 Mac and on an arm64 Mac.
However, on both the x86-64 and arm64 systems, I used the same operating system and compilers.
For all of the results in this post, I’m running on macOS 11.5.2 and I’m compiling using Apple Clang v12.0.5 (which comes with Xcode 12.5.1) for C++ code and ISPC v1.16.1 for ISPC code.</p>
<p>For the rest of the post, I’ll include results for each implementation in the section discussing that implementation, and then include all results together in a <a href="#results">results section</a> at the end.
All results were generated by running on a 2019 16 inch MacBook Pro with an Intel Core i7-9750H CPU for x86-64, and on a 2020 M1 Mac Mini for arm64 and Rosetta 2.
All results were generated by running the test program with 100000 runs per implementation, and I averaged results across 5 runs of the test program after throwing out the highest and lowest result for each implementation to discard outliers.
The timings reported for each implementation are the average across 100000 runs.</p>
<p><strong>Defining structs usable with both SSE and Neon</strong></p>
<p>Before we dive into the ray-box intersection implementations, I need to introduce and describe the handful of simple structs that the test program uses.
The most widely used struct in the test program is <code class="language-plaintext highlighter-rouge">FVec4</code>, which defines a 4-dimensional float vector by simply wrapping around four floats.
<code class="language-plaintext highlighter-rouge">FVec4</code> has one important trick: <code class="language-plaintext highlighter-rouge">FVec4</code> uses a union to accomplish type punning, which allows us to access the four floats in <code class="language-plaintext highlighter-rouge">FVec4</code> either as separate individual floats, or as a single <code class="language-plaintext highlighter-rouge">__m128</code> when using SSE or a single <code class="language-plaintext highlighter-rouge">float32x4_t</code> when using Neon.
<code class="language-plaintext highlighter-rouge">__m128</code> on SSE and <code class="language-plaintext highlighter-rouge">float32x4_t</code> on Neon serve the same purpose; since SSE and Neon use 128-bit wide registers with four 32-bit “lanes” per register, intrinsics implementations for SSE and Neon need a 128-bit data type that maps directly to the vector register when compiled.
The SSE intrinsics implementation defined in <code class="language-plaintext highlighter-rouge"><xmmintrin.h></code> uses <code class="language-plaintext highlighter-rouge">__m128</code> as its single generic 128-bit data type, whereas the Neon intrinsics implementation defined in <code class="language-plaintext highlighter-rouge"><arm_neon.h></code> defines separate 128-bit types depending on what is being stored.
For example, the Neon intrinsics implementation uses <code class="language-plaintext highlighter-rouge">float32x4_t</code> as its 128-bit data type for four 32-bit floats and <code class="language-plaintext highlighter-rouge">uint32x4_t</code> as its 128-bit data type for four 32-bit unsigned integers, and so on.
Each 32-bit component in a 128-bit vector register is often known as a <em>lane</em>.
The process of populating each of the lanes in a 128-bit vector type is sometimes referred to as a <em>gather</em> operation, and the process of pulling 32-bit values out of the 128-bit vector type is sometimes referred to as a <em>scatter</em> operation; the <code class="language-plaintext highlighter-rouge">FVec4</code> struct’s type punning makes gather and scatter operations nice and easy to do.</p>
<p>One of the comparisons that the test program does on arm64 machines is between an implementation using native Neon intrinsics, and an implementation written using SSE intrinsics that are emulated with Neon intrinsics under the hood on arm64 via the sse2neon project.
Since for this test program, SSE intrinsics are available on both x86-64 (natively) and on arm64 (through sse2neon), we don’t need to wrap the <code class="language-plaintext highlighter-rouge">__m128</code> member of the union in any <code class="language-plaintext highlighter-rouge">#ifdefs</code>.
We do need to <code class="language-plaintext highlighter-rouge">#ifdef</code> out the Neon implementation on x86-64 though, hence the check for <code class="language-plaintext highlighter-rouge">#if defined(__aarch64__)</code>.
Putting everything above all together, we can get a nice, convenient 4-dimensional float vector in which we can access each component individually and access the entire contents of the vector as a single intrinsics-friendly 128-bit data type on both SSE and Neon:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct FVec4 {
union { // Use union for type punning __m128 and float32x4_t
__m128 m128;
#if defined(__aarch64__)
float32x4_t f32x4;
#endif
struct {
float x;
float y;
float z;
float w;
};
float data[4];
};
FVec4() : x(0.0f), y(0.0f), z(0.0f), w(0.0f) {}
#if defined(__x86_64__)
FVec4(__m128 f4) : m128(f4) {}
#elif defined(__aarch64__)
FVec4(float32x4_t f4) : f32x4(f4) {}
#endif
FVec4(float x_, float y_, float z_, float w_) : x(x_), y(y_), z(z_), w(w_) {}
FVec4(float x_, float y_, float z_) : x(x_), y(y_), z(z_), w(0.0f) {}
float operator[](int i) const { return data[i]; }
float& operator[](int i) { return data[i]; }
};
</code></pre></div></div>
<div class="codecaption">Listing 1: <code class="language-plaintext highligher-rouge">FVec4</code> definition, which defines a 4-dimensional float vector that can be accessed as either a single 128-bit vector value or as individual 32-bit floats.</div>
<p>The actual implementation in the test project has a few more functions defined as part of <code class="language-plaintext highlighter-rouge">FVec4</code> to provide basic arithmetic operators.
In the test project, I also define <code class="language-plaintext highlighter-rouge">IVec4</code>, which is a simple 4-dimensional integer vector type that is useful for storing multiple indices together.
Rays are represented as a simple struct containing just two <code class="language-plaintext highlighter-rouge">FVec4</code>s and two floats; the two <code class="language-plaintext highlighter-rouge">FVec4</code>s store the ray’s direction and origin, and the two floats store the ray’s tMin and tMax values.</p>
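<p>Based on the description above, the Ray struct looks roughly like the following minimal sketch; the actual struct in the repository may include additional constructors and helpers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct Ray {
    FVec4 direction; // ray direction, stored as an FVec4
    FVec4 origin;    // ray origin, stored as an FVec4
    float tMin;      // minimum distance along the ray at which to accept a hit
    float tMax;      // maximum distance along the ray at which to accept a hit
};
</code></pre></div></div>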
<p>For representing bounding boxes, the test project has two different structs.
The first is <code class="language-plaintext highlighter-rouge">BBox</code>, which defines a single axis-aligned bounding box for purely scalar use.
Since <code class="language-plaintext highlighter-rouge">BBox</code> is only used for scalar code, it just contains normal floats and doesn’t have any vector data types at all inside:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct BBox {
union {
float corners[6]; // indexed as [minX minY minZ maxX maxY maxZ]
float cornersAlt[2][3]; // indexed as corner[minOrMax][XYZ]
};
BBox(const FVec4& minCorner, const FVec4& maxCorner) {
cornersAlt[0][0] = fmin(minCorner.x, maxCorner.x);
cornersAlt[0][1] = fmin(minCorner.y, maxCorner.y);
cornersAlt[0][2] = fmin(minCorner.z, maxCorner.z);
cornersAlt[1][0] = fmax(minCorner.x, maxCorner.x);
cornersAlt[1][1] = fmax(minCorner.y, maxCorner.y);
cornersAlt[1][2] = fmax(minCorner.z, maxCorner.z);
}
FVec4 minCorner() const { return FVec4(corners[0], corners[1], corners[2]); }
FVec4 maxCorner() const { return FVec4(corners[3], corners[4], corners[5]); }
};
</code></pre></div></div>
<div class="codecaption">Listing 2: Struct holding a single bounding-box.</div>
<p>The second bounding box struct is <code class="language-plaintext highlighter-rouge">BBox4</code>, which stores four axis-aligned bounding boxes together.
<code class="language-plaintext highlighter-rouge">BBox4</code> internally uses <code class="language-plaintext highlighter-rouge">FVec4</code>s in a union with two different arrays of regular floats to allow for vectorized operation and individual access to each component of each corner of each box.
The internal layout of <code class="language-plaintext highlighter-rouge">BBox4</code> is not as simple as just storing four <code class="language-plaintext highlighter-rouge">BBox</code> structs; I’ll discuss how the internal layout of <code class="language-plaintext highlighter-rouge">BBox4</code> works a little bit later in this post.</p>
<p><strong>Williams et al. 2005 Ray-Box Intersection Test: Scalar Implementations</strong></p>
<p>Now that we have all of the data structures that we’ll need, we can dive into the actual implementations.
The first implementation is the reference scalar version of ray-box intersection.
The implementation below is pretty close to being copy-pasted straight out of the Williams et al. 2005 paper, albeit with some minor changes to use our previously defined data structures:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bool rayBBoxIntersectScalar(const Ray& ray, const BBox& bbox, float& tMin, float& tMax) {
FVec4 rdir = 1.0f / ray.direction;
int sign[3];
sign[0] = (rdir.x < 0);
sign[1] = (rdir.y < 0);
sign[2] = (rdir.z < 0);
float tyMin, tyMax, tzMin, tzMax;
tMin = (bbox.cornersAlt[sign[0]][0] - ray.origin.x) * rdir.x;
tMax = (bbox.cornersAlt[1 - sign[0]][0] - ray.origin.x) * rdir.x;
tyMin = (bbox.cornersAlt[sign[1]][1] - ray.origin.y) * rdir.y;
tyMax = (bbox.cornersAlt[1 - sign[1]][1] - ray.origin.y) * rdir.y;
if ((tMin > tyMax) || (tyMin > tMax)) {
return false;
}
if (tyMin > tMin) {
tMin = tyMin;
}
if (tyMax < tMax) {
tMax = tyMax;
}
tzMin = (bbox.cornersAlt[sign[2]][2] - ray.origin.z) * rdir.z;
tzMax = (bbox.cornersAlt[1 - sign[2]][2] - ray.origin.z) * rdir.z;
if ((tMin > tzMax) || (tzMin > tMax)) {
return false;
}
if (tzMin > tMin) {
tMin = tzMin;
}
if (tzMax < tMax) {
tMax = tzMax;
}
return ((tMin < ray.tMax) && (tMax > ray.tMin));
}
</code></pre></div></div>
<div class="codecaption">Listing 3: A direct implementation of <a href="https://doi.org/10.1080/2151237X.2005.10129188">"An Efficient and Robust Ray-Box Intersection Algorithm" by Amy Williams et al. 2005.</a></div>
<p>For our test, we want to intersect a ray against four boxes, so we just write a wrapper function that calls <code class="language-plaintext highlighter-rouge">rayBBoxIntersectScalar()</code> four times in sequence.
In the wrapper function, <code class="language-plaintext highlighter-rouge">hits</code> is a reference to a <code class="language-plaintext highlighter-rouge">IVec4</code> where each component of the <code class="language-plaintext highlighter-rouge">IVec4</code> is used to store either <code class="language-plaintext highlighter-rouge">0</code> to indicate no intersection, or <code class="language-plaintext highlighter-rouge">1</code> to indicate an intersection:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void rayBBoxIntersect4Scalar(const Ray& ray,
const BBox& bbox0,
const BBox& bbox1,
const BBox& bbox2,
const BBox& bbox3,
IVec4& hits,
FVec4& tMins,
FVec4& tMaxs) {
hits[0] = (int)rayBBoxIntersectScalar(ray, bbox0, tMins[0], tMaxs[0]);
hits[1] = (int)rayBBoxIntersectScalar(ray, bbox1, tMins[1], tMaxs[1]);
hits[2] = (int)rayBBoxIntersectScalar(ray, bbox2, tMins[2], tMaxs[2]);
hits[3] = (int)rayBBoxIntersectScalar(ray, bbox3, tMins[3], tMaxs[3]);
}
</code></pre></div></div>
<div class="codecaption">Listing 4: Wrap and call <code class="language-plaintext highligher-rouge">rayBBoxIntersectScalar()</code> four times sequentially to implement scalar 4-way ray-box intersection.</div>
<p>The implementation provided in the original paper is easy to understand, but unfortunately is not in a form that we can easily vectorize.
Note the six branching if statements; branching statements do not bode well for good vectorized code.
The reason branching doesn’t go well with SIMD code is because with SIMD code, the same instruction has to be executed in lockstep across all four SIMD lanes; the only way for different lanes to execute different branches is to run all branches across all lanes sequentially, and for each branch mask out the lanes that the branch shouldn’t apply to.
Contrast this with normal scalar sequential execution, where we process one ray-box intersection at a time; each ray-box test can independently choose which codepath to execute at each branch and completely bypass executing branches that never get taken.
Scalar code also benefits from hardware features like advanced branch prediction to further speed things up.</p>
<p>In order to get to a point where we can more easily write vectorized SSE and Neon implementations of the ray-box test, we first need to refactor the original implementation into an intermediate scalar form that is more amenable to vectorization.
In other words, we need to rewrite the code in Listing 3 to be as branchless as possible.
We can see that each of the if statements in Listing 3 is comparing two values and, depending on which value is greater, assigning one value to be the same as the other value.
Fortunately, this type of compare-and-assign with floats can easily be replicated in a branchless fashion by just using a <code class="language-plaintext highlighter-rouge">min</code> or <code class="language-plaintext highlighter-rouge">max</code> operation.
For example, the branching statement <code class="language-plaintext highlighter-rouge">if (tyMin > tMin) { tMin = tyMin; }</code> can be easily replaced with the branchless statement <code class="language-plaintext highlighter-rouge">tMin = fmax(tMin, tyMin);</code>.
I chose to use <code class="language-plaintext highlighter-rouge">fmax()</code> and <code class="language-plaintext highlighter-rouge">fmin()</code> instead of <code class="language-plaintext highlighter-rouge">std::max()</code> and <code class="language-plaintext highlighter-rouge">std::min()</code> because I found <code class="language-plaintext highlighter-rouge">fmax()</code> and <code class="language-plaintext highlighter-rouge">fmin()</code> to be slightly faster in this example.
The good thing about replacing our branches with <code class="language-plaintext highlighter-rouge">min</code>/<code class="language-plaintext highlighter-rouge">max</code> operations is that SSE and Neon both have intrinsics to do vectorized <code class="language-plaintext highlighter-rouge">min</code> and <code class="language-plaintext highlighter-rouge">max</code> operations in the form of <code class="language-plaintext highlighter-rouge">_mm_min_ps</code> and <code class="language-plaintext highlighter-rouge">_mm_max_ps</code> for SSE and <code class="language-plaintext highlighter-rouge">vminq_f32</code> and <code class="language-plaintext highlighter-rouge">vmaxq_f32</code> for Neon.</p>
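<p>Here’s a tiny sketch showing the same compare-and-assign written three ways: the branching form from Listing 3, the branchless scalar form used in Listing 5, and (as a preview of the vectorized implementations coming up) the equivalent 4-wide min/max intrinsics; the input values here are made up purely for illustration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>        // fmax()
#if defined(__x86_64__)
#include <xmmintrin.h>  // SSE intrinsics
#elif defined(__aarch64__)
#include <arm_neon.h>   // Neon intrinsics
#endif

void compareAndAssignExample() {
    float tMin = 0.0f;
    float tyMin = 1.0f;

    // Branching form, as in Listing 3:
    if (tyMin > tMin) { tMin = tyMin; }

    // Branchless scalar form, as used in Listing 5:
    tMin = fmax(tMin, tyMin);

    // 4-wide vectorized equivalents, operating on all four lanes at once:
#if defined(__x86_64__)
    __m128 vMax = _mm_max_ps(_mm_set1_ps(tMin), _mm_set1_ps(tyMin));
    (void)vMax;
#elif defined(__aarch64__)
    float32x4_t vMax = vmaxq_f32(vdupq_n_f32(tMin), vdupq_n_f32(tyMin));
    (void)vMax;
#endif
}
</code></pre></div></div>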
<p>Also note how in Listing 3, the index of each corner is calculated while looking up the corner; for example: <code class="language-plaintext highlighter-rouge">bbox.cornersAlt[1 - sign[0]]</code>.
To make the code easier to vectorize, we don’t want to be computing indices in the lookup; instead, we want to precompute all of the indices that we will want to look up.
In Listing 5, the <code class="language-plaintext highlighter-rouge">IVec4</code> values named <code class="language-plaintext highlighter-rouge">near</code> and <code class="language-plaintext highlighter-rouge">far</code> are used to store precomputed lookup indices.
Finally, one more shortcut we can make with an eye towards easier vectorization is that we don’t actually care what the values of <code class="language-plaintext highlighter-rouge">tMin</code> and <code class="language-plaintext highlighter-rouge">tMax</code> are in the event that the ray misses the box; if the values that come out of a missed hit in our vectorized implementation don’t exactly match the values that come out of a missed hit in the scalar implementation, that’s okay!
We just need to check for the missed hit case and instead return whether or not a hit has occurred as a bool.</p>
<p>Putting all of the above together, we can rewrite Listing 3 into the following much more compact, much more SIMD-friendly scalar implementation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bool rayBBoxIntersectScalarCompact(const Ray& ray, const BBox& bbox, float& tMin, float& tMax) {
FVec4 rdir = 1.0f / ray.direction;
IVec4 near(int(rdir.x >= 0.0f ? 0 : 3), int(rdir.y >= 0.0f ? 1 : 4),
int(rdir.z >= 0.0f ? 2 : 5));
IVec4 far(int(rdir.x >= 0.0f ? 3 : 0), int(rdir.y >= 0.0f ? 4 : 1),
int(rdir.z >= 0.0f ? 5 : 2));
tMin = fmax(fmax(ray.tMin, (bbox.corners[near.x] - ray.origin.x) * rdir.x),
fmax((bbox.corners[near.y] - ray.origin.y) * rdir.y,
(bbox.corners[near.z] - ray.origin.z) * rdir.z));
tMax = fmin(fmin(ray.tMax, (bbox.corners[far.x] - ray.origin.x) * rdir.x),
fmin((bbox.corners[far.y] - ray.origin.y) * rdir.y,
(bbox.corners[far.z] - ray.origin.z) * rdir.z));
return tMin <= tMax;
}
</code></pre></div></div>
<div class="codecaption">Listing 5: A much more compact implementation of Williams et al. 2005; this implementation does not calculate a negative tMin if the ray origin is inside of the box.</div>
<p>The wrapper around <code class="language-plaintext highlighter-rouge">rayBBoxIntersectScalarCompact()</code> to make a function that intersects one ray against four boxes is exactly the same as in Listing 4, just with a call to the new function, so I won’t bother going into it.</p>
<p>Here is how the scalar compact implementation (Listing 5) compares to the original scalar implementation (Listing 3).
The “speedup” columns use the scalar compact implementation as the baseline:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">x86-64:</th>
<th style="text-align: center">x86-64 Speedup:</th>
<th style="text-align: center">arm64:</th>
<th style="text-align: center">arm64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">44.5159 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">41.8187 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">1.0x.</td>
</tr>
<tr>
<td style="text-align: right">Scalar Original:</td>
<td style="text-align: center">44.1004 ns</td>
<td style="text-align: center">1.0117x</td>
<td style="text-align: center">78.4001 ns</td>
<td style="text-align: center">0.5334x</td>
<td style="text-align: center">90.7649 ns</td>
<td style="text-align: center">0.8935x</td>
</tr>
<tr>
<td style="text-align: right">Scalar No Early-Out:</td>
<td style="text-align: center">55.6770 ns</td>
<td style="text-align: center">0.8014x</td>
<td style="text-align: center">85.3562 ns</td>
<td style="text-align: center">0.4899x</td>
<td style="text-align: center">102.763 ns</td>
<td style="text-align: center">0.7891x</td>
</tr>
</tbody>
</table>
<p>The original scalar implementation is actually ever-so-slightly faster than our scalar compact implementation on x86-64!
This result actually doesn’t surprise me; note that the original scalar implementation has early-outs when checking the values of <code class="language-plaintext highlighter-rouge">tyMin</code> and <code class="language-plaintext highlighter-rouge">tzMin</code>, whereas the early-outs have to be removed in order to restructure the original scalar implementation into the vectorization-friendly compact scalar implementation.
To confirm that the original scalar implementation is faster because of the early-outs, in the test program I also include a version of the original scalar implementation that has the early-outs removed.
Instead of returning when the checks on <code class="language-plaintext highlighter-rouge">tyMin</code> or <code class="language-plaintext highlighter-rouge">tzMin</code> fail, I modified the original scalar implementation to store the result of the checks in a bool that is stored until the end of the function and then checked at the end of the function.
In the results, this modified version of the original scalar implementation is labeled as “Scalar No Early-Out”; this modified version is considerably slower than the compact scalar implementation on both x86-64 and arm64.</p>
<p>The more surprising result is that the original scalar implementation is <em>slower</em> than the compact scalar implementation on arm64, and by a considerable amount!
Even more interesting is that the original scalar implementation and the modified “no early-out” version perform relatively similarly on arm64; this result strongly hints to me that for whatever reason, the version of Clang I used just wasn’t able to optimize for arm64 as well as it was able to for x86-64.
Looking at the <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxBKS0gAOqAqETgwe3r56icmOAkEh4SxRMVLSdpgOqUIETMQE6T5%2BXLaY9rkM1bUE%2BWGR0bFlNXUNmc0KQ93BvUX9pQCUtqhexMjsHOYAzHhUANRYNCHoEAD6x6oAHABsx5eSp3MmGgCCm8HI3lg7JhtuqiwswQIxGCADoEN9sI8XmYNq1tntMAdMEdTkxasgELd7lDXgx3l5Pt83LUWMcQgIwRCcTDXNtqVs8R9MF8fiwmAQEJSNtgdl9nlDxsQvA4dgAxABqFUkXwA7FZnrzeV4GKlZRZeQB6DU7RTM5Wq/jEHYEACe8WZ8WVKqMO1OLC4ZnOO0M6B2VFoqHZGzMqjuBHp8P2UxRZyuNzuxweCsVMbtDqd9sd33l0NhtEDiODJ2OaJWmIjUaeMZj7s9BG9vuOBDdFckyfptKoUOLisFwurJjlzZbxdL7J2qnr0Z7ir71ZNQ6LI9HHv7AC9J9OZ2WdgB3Rc9zsAEQ3LbHe3ZTBMAFYLHXjzuNimY9vJ93eRKpRA5jsQAOIBoQRoqAsdiaP1%2BP6kDsc4Ad%2Bv6rmBP5qre/IwhmSIhhc1xYpG95ipKyCSNmiZOlQkgvm%2BuHUARMEyjucFpghWaoui%2BbYsOGFPmOtZVm6pFvlQtYkS%2BnZWORDYMPgTb8oxj5YdQs7VqoxzAfuJqyW6UkgYp%2B6rpGr7vjJv7/gpv6gXOkbAZB6m8V2AliZh2H7jJcnKQpdkroZhFaUZf4QHpwEGW5kGfuBZEUS8jH7qg5rEOyJAnlYF4QICOx4C%2BaAMOMao7MQmAEMsDAHjUUV4Cel7qrBU68mO5iXDsoXRBFxBRQVsWCPFZnqulmXENl6CHnlBXJrKgXoeJ0pVeFRC1ZYEBJSlg3lTsETNWlGVZUxEmqF8lizSCqjASaa3qhEIIml5u0bXOxnHftq4PFefUDVZlVhTVAC0E0CFNVkzXNqWtUtg0QKtj0bVtf47AD%2B2HSBIMnWdoMgpdvXFTGg33dVo0AFQvcl1ZlWYFWfXxC1tdlv2rajs3bTspMREdlNnZTV1FRZJXLUND2jRqGMpdjuPzd97XM39OzalTwNC0dotroLs30zd/ICVeOLPME6YhMzyMjSQ7OTVjUkzUwwFa8zH3NehvNE1ZEBMJL%2B1A5bQsHcBttQ86Vuw/TOKMwKQLtjsACSd18eh%2BoCDBjExm2IoB6HPZxYO8tRy2cUTnHTMjnFC7J0uvJxeuGebnL17R41nW5ae56Feht7J%2BhftPi5qgfjpDdeU3a4NwF1fmzHimJ93jWGcB2caW%2B9fadtHlud5EEQKZ7eMTXEldwPjUOfFfdD65jeeSBEDOcZbcBx7jFxcNNV1TFcUJTsBv46bOVHqe%2BUXvDjMxoCM0n6NZ9bg11aXzfi182LvfCwj9CoyxePnBWTxw7VgsBYVAq1I5MyDtlJBI59xoHatEBQUVLg9Wuj2LUq8sCqGRM6BQXxTwAgYAADR2NQgAmvQ4IAAtehTBVB0LZKoJh3CWE9XjkpFcmCQjEAUE8Wg/pTxmB6qeDY%2BD1SC21MEEhZCmAUJEdEKK1CADyxAACyHDZEWBoQw/hT9GKVxTOhOBCCObVmmjjZhaQSCiP1q9Bx70nHcI8Fg4gxtBGaLERIqRFgNDGPCReFkW43TUIgNQ3xojNrAR8a46Im03aBLScEyRUVIk7lPFwAq0TYnBHicERJ6TwapL8QdTJKdFRBPEbk08%2BSooyKid8GJVA4kJOySCU67DVCVOIAM%2BpI4mkhKikUp%2BrTildNiRw8pLjalAxqUk1Q4yeyTJaRYGZBS9nzI2N07hyyRn2yGeck0WyWw7NCfs9pRyTlLL6aslJHDzmbN3OA26UpnEjOfFfDxX0AFmyfE0vJBV3F%2BJwYUqFQKYWPIvNLYqvysKXOyYC6%2BcoCY/XNhCuR8KCVniJdk2FFhjwFRRR7SBoloFexFAAJSYDtNBiokb4HSpUAQ3ykYkDwMAYI3z9wEH0UK3OpVlKiqMVXRizL/wG0cRVTlFR2iKUVV4iq/LBUMFUlKsVurHL9mldpdCMY3wqu5QwCAlq1W/m1WUh1urfyirKa651wETUQBNRpA%2B/VaVBSeBEVAng0ostsaoH2ghsGqpEGIWo9idjypmuFcGBsI0fWDUDLmxoDVGv9E4k1ASmZI2IJykpXBAKS1TSCW1qRvlxWSMABgUV5FPwlTsJtLa5mdOOTsCAZa8CjMQT8HYGgbmdoFd2w5vaYkDs5QdFkbgx0Tq7UimJCz51DoGUuldd5gpSpNHm40R6OGernMeggF6ZUF15O6kpEAIhZpBHcqKa6e0FIOfkyGNanUZIpmlBdsdb25o4Q%2Bp9CCX1kqmYUyG76wmyK/cUgGv7gQ6v/aTQdw7vmmgNeB59r7TzwYebMmdMSUMspBH%2B65AGsMHRw6ekdc6IOqCgzCmDey4NTumYhnjUSKMmio2h0ENHMMLqTiB%2BEUB70QhPYYzZso3CdmXd6o9wQWQ8iLcWkct8qBiCUN8hGiopO4fU7J912mewyb7aZltHajNZ12Kp%2BTu6i0hwaXelzCzcM3ori/RUV68ObpY2x0RzTQnwY6Z%2B9dP7KN/rnC%2BMT2704gcC2B4LBHoO7K4Fx5t67jFRdi4J%2BLiXAPJYbU571QXuTGmvQp5Tyn%2B2BbMzVrT7mdOgrdPpzAhn/OOaaxelrmmDWWZbNZmJzW7MgYc/FJzaWR3Lrc2ysbXmbN1d64FGMt9pPVeXTWpbOMZpVfSzV/bI36moueAAN1QHgV0qaI1RoIDGhwkg40GGIIm5NTjU2kDNZnAHQLMY7AzU4ljGg/uCMBxM4FoPcZZuaP96H0503wMHGDrNZhIceeR9s2HaPM0II2Nj3HgP551icQgQgCgSek8zkq3NwQadI7p4jTVoHVAKFGzsKnBByXfs3YCOYD20dPZe8IUQH2B0suAuDz1Br%2BfwpNYr5F3zefkv2Q%2BoXIuEFi7EbGyXCbfuzQR/LpnfGtzy44RrqlavqcxcF4IYX4bRfRv1w4d7RuZcm4QVjxnyV11W85%2Buid6vW1PP7drl3uu3dKA94bz7xuWPE/9%2BSttluOdp9t1XRmHAFi0E4MeXgfgOBaFIKgTgSnLDWE7UsFYzJNg8FIAQTQeeFgAGsQDHghwXjgkhi%2Bt/L5wXgCgQAQ5b6XvPpA4CwCQGgFg8Q6DRHIJQefi/6AxHeIYYAXAuAaGaDQSR2DKAREHxEYItQTScCb/PtgggdEMFoFfyfpAsBsiMOIF/daruYFHy/zApCyAXgz21%2BvAgIr
Qg%2B6YEQ4UxAJoHgWAg%2BQIeALAoBCw7oTAwACg4oeAmAq4Oi5oJeTe/Agg727AUgMggguoagg%2BugzQBgRgKA1g1g%2BgeAEQo%2BkACwoU7Qf%2Bj0OiGwI%2BrQqqqQLgQkIwTQpAgQUwhQxQWQSQKQAgYhshOQqQPQ0hswLQbQVQEwihYwghVqnQdQqhfQJQtg2hngjQeg4wXQRhMwJQCwCgdeqwegQImAawPA%2BeheA%2BL%2BFeHAyEj0twV89BwAOwu%2BX4IIOWEAVeVglgwEuAhAJAa0GwzQOwHgC%2BS%2BRojecwvAE%2BWgcwHeXePenA/epAJeZePhI%2BY%2BzereeR%2BgnAZgXhZRw%2BVRk%2BNRP%2BYiwhkgQAA">compiled x86-64 assembly</a> and the <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxBKS0gAOqAqETgwe3r56icmOAkEh4SxRMVLSdpgOqUIETMQE6T5%2BXLaY9rkM1bUE%2BWGR0bFlNXUNmc0KQ93BvUX9pQCUtqhexMjsHOYAzHhUANRYNCHoEAD6x6oAHABsx5eSp3MmGgCCm8HI3lg7JhtuqiwswQIxGCADoEN9sI8XmYNq1tntMAdMEdTkxasgELd7lDXgx3l5Pt83LUWMcQgIwRCcTDXNtqVs8R9MF8fiwmAQEJSNtgdl9nlDxsQvA4dgAxABqFUkXwA7FZnrzeV4GKlZRZeQB6DU7RTM5Wq/jEHYEACe8WZ8WVKqMO1OLC4ZnOO0M6B2VFoqHZGzMqjuBHp8P2UxRZyuNzuxweCsVMbtDqd9sd33l0NhtEDiODJ2OaJWmIjUaeMZj7s9BG9vuOBDdFckyfptKoUOLisFwurJjlzZbxdL7J2qnr0Z7ir71ZNQ6LI9HHv7AC9J9OZ2WdgB3Rc9zsAEQ3LbHe3ZTBMAFYLHXjzuNimY9vJ93eRKpRA5jsQAOIBoQRoqAsdiaP1%2BP6kDsc4Ad%2Bv6rmBP5qre/IwhmSIhhc1xYpG95ipKyCSNmiZOlQkgvm%2BuHUARMEyjucFpghWaoui%2BbYsOGFPmOtZVm6pFvlQtYkS%2BnZWORDYMPgTb8oxj5YdQs7VqoxzAfuJqyW6UkgYp%2B6rpGr7vjJv7/gpv6gXOkbAZB6m8V2AliZh2H7jJcnKQpdkroZhFaUZf4QHpwEGW5kGfuBZEUS8jH7qg5rEOyJAnlYF4QICOx4C%2BaAMOMao7MQmAEMsDAHjUUV4Cel7qrBU68mO5iXDsoXRBFxBRQVsWCPFZnqulmXENl6CHnlBXJrKgXoeJ0pVeFRC1ZYEBJSlg3lTsETNWlGVZUxEmqF8lizSCqjASaa3qhEIIml5u0bXOxnHftq4PFefUDVZlVhTVAC0E0CFNVkzXNqWtUtg0QKtj0bVtf47AD%2B2HSBIMnWdoMgpdvXFTGg33dVo0AFQvcl1ZlWYFWfXxC1tdlv2rajs3bTspMREdlNnZTV1FRZJXLUND2jRqGMpdjuPzd97XM39OzalTwNC0dotroLs30zd/ICVeOLPME6YhMzyMjSQ7OTVjUkzUwwFa8zH3NehvNE1ZEBMJL%2B1A5bQsHcBttQ86Vuw/TOKMwKQLtjsACSd18eh%2BoCDBjExm2IoB6HPZxYO8tRy2cUTnHTMjnFC7J0uvJxeuGebnL17R41nW5ae56Feht7J%2BhftPi5qgfjpDdeU3a4NwF1fmzHimJ93jWGcB2caW%2B9fadtHlud5EEQKZ7eMTXEldwPjUOfFfdD65jeeSBEDOcZbcBx7jFxcNNV1TFcUJTsBv46bOVHqe%2BUXvDjMxoCM0n6NZ9bg11aXzfi182LvfCwj9CoyxePnBWTxw7VgsBYVAq1I5MyDtlJBI59xoHatEBQUVLg9Wuj2LUq8sCqGRM6BQXxTwAgYAADR2NQgAmvQ4IAAtehTBVB0LZKoJh3CWE9XjkpFcmCQjEAUE8Wg/pTxmB6qeDY%2BD1SC21MEEhZCmAUJEdEKK1CADyxAACyHDZEWBoQw/hT9GKVxTOhOBCCObVmmjjZhaQSCiP1q9Bx70nHcI8Fg4gxtBGaLERIqRFgNDGPCReFkW43TUIgNQ3xojNrAR8a46Im03aBLScEyRUVIk7lPFwAq0TYnBHicERJ6TwapL8QdTJKdFRBPEbk08%2BSooyKid8GJVA4kJOySCU67DVCVOIAM%2BpI4mkhKikUp%2BrTildNiRw8pLjalAxqUk1Q4yeyTJaRYGZBS9nzI2N07hyyRn2yGeck0WyWw7NCfs9pRyTlLL6aslJHDzmbN3OA26UpnEjOfFfDxX0AFmyfE0vJBV3F%2BJwYUqFQKYWPIvNLYqvysKXOyYC6%2BcoCY/XNhCuR8KCVniJdk2FFhjwFRRR7SBoloFexFAAJSYDtNBiokb4HSpUAQ3ykYkDwMAYI3z9wEH0UK3OpVlKiqMVXRizL/wG0cRVTlFR2iKUVV4iq/LBUMFUlKsVurHL9mldpdCMY3wqu5QwCAlq1W/m1WUh1urfyirKa651wETUQBNRpA%2B/VaVBSeBEVAng0ostsaoH2ghsGqpEGIWo9idjypmuFcGBsI0fWDUDLmxoDVGv9E4k1ASmZI2IJykpXBAKS1TSCW1qRvlxWSMABgUV5FPwlTsJtLa5mdOOTsCAZa8CjMQT8HYGgbmdoFd2w5vaYkDs5QdFkbgx0Tq7UimJCz51DoGUuldd5gpSpNHm40R6OGernMeggF6ZUF15O6kpEAIhZpBHcqKa6e0FIOfkyGNanUZIpmlBdsdb25o4Q%2Bp9CCX1kqmYUyG76wmyK/cUgGv7gQ6v/aTQdw7vmmgNeB59r7TzwYebMmdMSUMspBH%2B65AGsMHRw6ekdc6IOqCgzCmDey4NTumYhnjUSKMmio2h0ENHMMLqTiB%2BEUB70QhPYYzZso3CdmXd6o9wQWQ8iLcWkct8qBiCUN8hGiopO4fU7J912mewyb7aZltHajNZ12Kp%2BTu6i0hwaXelzCzcM3ori/RUV68ObpY2x0RzTQnwY6Z%2B9dP7KN/rnC%2BMT2704gcC2B4LBHoO7K4Fx5t67jFRdi4J%2BLiXAPJYbU571QXuTGmvQp5Tyn%2B2BbMzVrT7mdOgrdPpzAhn/OOaaxelrmmDWWZbNZmJzW7MgYc/FJzaWR3Lrc2ysbXmbN1d64FGMt9pPVeXTWpbOMZpVfSzV/bI36moueAAN1QHgV0qaI1RoIDGhwkg40GGIIm5NTjU2kDNZnAHQLMY7AzU4ljGg/uCMBxM4FoPcZZuaP96H0503wMHGDrNZhIceeR9s2HaPM0II2Nj3HgP551i
cQgQgCgSek8zkq3NwQadI7p4jTVoHVAKFGzsKnBByXfs3YCOYD20dPZe8IUQH2B0suAuDz1Br%2BfwpNYr5F3zefkv2Q%2BoXIuEFi7EbGyXCbfuzQR/LpnfGtzy44RrqlavqcxcF4IYX4bRfRv1w4d7RuZcm4QVjxnyV11W85%2Buid6vW1PP7drl3uu3dKA94bz7xuWPE/9%2BSttluOdp9t1XRmHAFi0E4MeXgfgOBaFIKgTgSnLDWE7UsFYzJNg8FIAQTQeeFgAGsQDHghwXjgkhi%2Bt/L5wXgCgQAQ5b6XvPpA4CwCQGgFg8Q6DRHIJQefi/6AxBJFd84j13iGGAFwLgGhmg0EkdgygERB8RGCLUE0nAm/z7YIIHRDBaB38n6QLAbIjDiA/3Wq7mAo%2BH%2BmApCyAXgz29%2BvAgIrQg%2B6YEQ4UxAJoHgWAg%2BQIeALAkBCw7oTAwACg4oeAmAq4Oi5oJeTe/Agg727AUgMggguoagg%2BugzQBgRgKA1g1g%2BgeAEQo%2BkACwoU7QQBj0OiGwI%2BrQqqqQLgQkIwTQpAgQUwhQxQWQSQKQAgUhihOQqQPQ8hswLQbQVQEwqhYwohVqnQdQmhfQJQtg%2BhngjQeg4wXQZhMwJQCwCgdeqwegQImAawPA%2BeheA%2BH%2BFeHAW%2BO%2Blse%2BNoh%2BX4IIOWEAVeVglgwEuAhAJAa0GwzQOwHgC%2BS%2BRojecwvAE%2BWgcwHeXePenA/epAJeZeARI%2BY%2BzereBR%2BgnAZgfhFRw%2BNRk%2BdRABYi4hkgQAA">compiled arm64 assembly</a> on Godbolt Compiler Explorer for the original scalar implementation shows that the structure of the output assembly is very similar across both architectures though, so the cause of the slower performance on arm64 is not completely clear to me.</p>
<p>For all of the results in the rest of the post, the compact scalar implementation’s timings are used as the baseline that everything else is compared against, since all of the following implementations are derived from the compact scalar implementation.</p>
<p><strong>SSE Implementation</strong></p>
<p>The first vectorized implementation we’ll look at is using SSE on x86-64 processors.
The full SSE through SSE4 instruction set today contains 281 instructions, introduced over the past two decades-ish in a series of supplementary extensions to the original SSE instruction set.
All modern Intel and AMD x86-64 processors from at least the past decade support SSE4, and all x86-64 processors ever made support at least SSE2 since SSE2 is written into the base x86-64 specification.
As mentioned earlier, SSE uses 128-bit registers that can be split into two, four, eight, or even sixteen lanes; the most common (and original) use case is four 32-bit floats.
AVX and AVX2 later expanded the register width from 128-bit to 256-bit, and the latest AVX-512 extensions introduced 512-bit registers.
For this post though, we’ll just stick with 128-bit SSE.</p>
<p>In order to program directly using SSE instructions, we can either write SSE assembly directly, or we can use SSE intrinsics.
Writing SSE assembly directly is not particularly ideal for all of the same reasons that writing programs in regular assembly is not particularly ideal for most cases, so we’ll want to use intrinsics instead.
Intrinsics are functions whose implementations are specially handled by the compiler; in the case of vector intrinsics, each function maps directly to a known single or small number of vector assembly instructions.
Intrinsics kind of bridge between writing directly in assembly and using full-blown standard library functions; intrinsics are <em>higher</em> level than assembly, but <em>lower</em> level than what you typically find in standard library functions.
The headers for vector intrinsics are defined by the compiler; almost every C++ compiler that supports SSE and AVX intrinsics follows a convention where SSE/AVX intrinsics headers are named using the pattern *mmintrin.h, where * is a letter of the alphabet corresponding to a specific subset or version of either SSE or AVX (for example, x for SSE, e for SSE2, n for SSE4.2, i for AVX, etc.).
For example, <code class="language-plaintext highlighter-rouge">xmmintrin.h</code> is where the <code class="language-plaintext highlighter-rouge">__m128</code> type we used earlier in defining all of our structs comes from.
Intel’s searchable <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">online Intrinsics Guide</a> is an invaluable resource for looking up what SSE intrinsics there are and what each of them does.</p>
<p>The first thing we need to do for our SSE implementation is to define a new <code class="language-plaintext highlighter-rouge">BBox4</code> struct that holds four bounding boxes together.
How we store these four bounding boxes together is extremely important.
The easiest way to store four bounding boxes in a single struct is to just have <code class="language-plaintext highlighter-rouge">BBox4</code> store four separate <code class="language-plaintext highlighter-rouge">BBox</code> structs internally, but this approach is actually really bad for vectorization.
To understand why, consider something like the following, where we perform a <code class="language-plaintext highlighter-rouge">max</code> operation between the ray tMin and a distance to a corner of a bounding box:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fmax(ray.tMin, (bbox.corners[near.x] - ray.origin.x) * rdir.x);
</code></pre></div></div>
<p>Now consider if we want to do this operation for four bounding boxes in serial:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fmax(ray.tMin, (bbox0.corners[near.x] - ray.origin.x) * rdir.x);
fmax(ray.tMin, (bbox1.corners[near.x] - ray.origin.x) * rdir.x);
fmax(ray.tMin, (bbox2.corners[near.x] - ray.origin.x) * rdir.x);
fmax(ray.tMin, (bbox3.corners[near.x] - ray.origin.x) * rdir.x);
</code></pre></div></div>
<p>The above serial sequence is a perfect example of what we want to fold into a single vectorized line of code.
The inputs to a vectorized version of the above should be a 128-bit four-lane value with <code class="language-plaintext highlighter-rouge">ray.tMin</code> in all four lanes, another 128-bit four-lane value with <code class="language-plaintext highlighter-rouge">ray.origin.x</code> in all four lanes, another 128-bit four-lane value with <code class="language-plaintext highlighter-rouge">rdir.x</code> in all four lanes, and finally a 128-bit four-lane value where the first lane is a single index of a single corner from the first bounding box, the second lane is a single index of a single corner from the second bounding box, and so on and so forth.
Instead of an array of structs, we need the bounding box values to be provided as a struct of corner value arrays where each 128-bit value stores one 32-bit value from each corner of each of the four boxes.
Alternatively, the <code class="language-plaintext highlighter-rouge">BBox4</code> memory layout that we want can be thought of as an array of 24 floats, which is indexed as a 3D array where the first dimension is indexed by min or max corner, the second dimension is indexed by x, y, and z within each corner, and the third dimension is indexed by which bounding box the value belongs to.
Putting the above together with some accessors and setter functions yields the following definition for <code class="language-plaintext highlighter-rouge">BBox4</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct BBox4 {
union {
FVec4 corners[6]; // order: minX, minY, minZ, maxX, maxY, maxZ
float cornersFloat[2][3][4]; // indexed as corner[minOrMax][XYZ][bboxNumber]
float cornersFloatAlt[6][4];
};
inline __m128* minCornerSSE() { return &corners[0].m128; }
inline __m128* maxCornerSSE() { return &corners[3].m128; }
#if defined(__aarch64__)
inline float32x4_t* minCornerNeon() { return &corners[0].f32x4; }
inline float32x4_t* maxCornerNeon() { return &corners[3].f32x4; }
#endif
inline void setBBox(int boxNum, const FVec4& minCorner, const FVec4& maxCorner) {
cornersFloat[0][0][boxNum] = fmin(minCorner.x, maxCorner.x);
cornersFloat[0][1][boxNum] = fmin(minCorner.y, maxCorner.y);
cornersFloat[0][2][boxNum] = fmin(minCorner.z, maxCorner.z);
cornersFloat[1][0][boxNum] = fmax(minCorner.x, maxCorner.x);
cornersFloat[1][1][boxNum] = fmax(minCorner.y, maxCorner.y);
cornersFloat[1][2][boxNum] = fmax(minCorner.z, maxCorner.z);
}
BBox4(const BBox& a, const BBox& b, const BBox& c, const BBox& d) {
setBBox(0, a.minCorner(), a.maxCorner());
setBBox(1, b.minCorner(), b.maxCorner());
setBBox(2, c.minCorner(), c.maxCorner());
setBBox(3, d.minCorner(), d.maxCorner());
}
};
</code></pre></div></div>
<div class="codecaption">Listing 6: Struct holding four bounding boxes together with values interleaved for optimal vectorized access.</div>
<p>Note how the <code class="language-plaintext highlighter-rouge">setBBox</code> function (which the constructor calls) has a memory access pattern where a single value is written into every 128-bit <code class="language-plaintext highlighter-rouge">FVec4</code>.
Generally scattered access like this is extremely expensive in vectorized code, and should be avoided as much as possible; setting an entire 128-bit value at once is much faster than setting four separate 32-bit segments across four different values.
However, something like the above is often inevitably necessary just to get data loaded into a layout optimal for vectorized code; in the test program, <code class="language-plaintext highlighter-rouge">BBox4</code> structs are initialized and set up once, and then reused across all tests.
The time required to set up <code class="language-plaintext highlighter-rouge">BBox</code> and <code class="language-plaintext highlighter-rouge">BBox4</code> is not counted as part of any of the test runs; in a full BVH traversal implementation, the BVH’s bounds at each node should be pre-arranged into a vector-friendly layout before any ray traversal takes place.
In general, figuring out how to restructure an algorithm to be easily expressed using vector intrinsics is really only half of the challenge in writing good vectorized programs; the other half of the challenge is just getting the input data into a form that is amenable to vectorization.
Actually, depending on the problem domain, the data marshaling can account for far more than half of the total effort spent!</p>
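<p>As a rough sketch of what “pre-arranged into a vector-friendly layout” could look like in practice (this is a hypothetical structure, not code from the test program), a 4-wide BVH node could bake its children’s bounds into a <code class="language-plaintext highlighter-rouge">BBox4</code> once at build time, so that per-ray traversal never has to pay the scattered-write setup cost:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct BVHNode4 {
    BBox4 childBounds;                     // bounds of up to four children, already in SoA layout
    int children[4] = { -1, -1, -1, -1 };  // hypothetical child indices; -1 marks an unused slot
    BVHNode4(const BBox& a, const BBox& b, const BBox& c, const BBox& d)
        : childBounds(a, b, c, d) {}
};
</code></pre></div></div>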
<p>Now that we have four bounding boxes structured in a way that is amenable to vectorized usage, we also need to structure our ray inputs for vectorized usage.
This step is relatively easy; we just need to expand each component of each element of the ray into a 128-bit value where the same value is replicated across every 32-bit lane.
SSE has a specific intrinsic to do exactly this: <code class="language-plaintext highlighter-rouge">_mm_set1_ps()</code> takes in a single 32-bit float and replicates it to all four lanes in a 128-bit <code class="language-plaintext highlighter-rouge">__m128</code>.
SSE also has a bunch of more specialized instructions, which can be used in specific scenarios to do complex operations in a single instruction.
Knowing when to use these more specialized instructions can be tricky and requires extensive knowledge of the SSE instruction set; I don’t know these very well yet!
One good trick I did figure out was that in the case of taking a <code class="language-plaintext highlighter-rouge">FVec4</code> and creating a new <code class="language-plaintext highlighter-rouge">__m128</code> from each of the <code class="language-plaintext highlighter-rouge">FVec4</code>’s components, I could use <code class="language-plaintext highlighter-rouge">_mm_shuffle_ps</code> instead of <code class="language-plaintext highlighter-rouge">_mm_set1_ps()</code>.
The problem with using <code class="language-plaintext highlighter-rouge">_mm_set1_ps()</code> in this case is that with a <code class="language-plaintext highlighter-rouge">FVec4</code>, which internally uses <code class="language-plaintext highlighter-rouge">__m128</code> on x86-64, taking an element out and broadcasting it using <code class="language-plaintext highlighter-rouge">_mm_set1_ps()</code> compiles down to a <code class="language-plaintext highlighter-rouge">MOVSS</code> instruction in addition to a shuffle.
<code class="language-plaintext highlighter-rouge">_mm_shuffle_ps()</code>, on the other hand, compiles down to a single <code class="language-plaintext highlighter-rouge">SHUFPS</code> instruction.
<code class="language-plaintext highlighter-rouge">_mm_shuffle_ps()</code> takes in two <code class="language-plaintext highlighter-rouge">__m128</code>s as input and takes two components from the first <code class="language-plaintext highlighter-rouge">__m128</code> for the first two components of the output, and takes two components from the second <code class="language-plaintext highlighter-rouge">__m128</code> for the second two components of the output.
Which components from the inputs are taken is assignable using an input mask, which can conveniently be generated using the <code class="language-plaintext highlighter-rouge">_MM_SHUFFLE()</code> macro that comes with the SSE intrinsics headers.
Since our ray struct’s origin and direction elements are already backed by <code class="language-plaintext highlighter-rouge">__m128</code> under the hood, we can just use <code class="language-plaintext highlighter-rouge">_mm_shuffle_ps()</code> with the same element from the ray as both the first and second inputs to generate a <code class="language-plaintext highlighter-rouge">__m128</code> containing only a single component of each element.
For example, to create a <code class="language-plaintext highlighter-rouge">__m128</code> containing only the x component of the ray direction, we can write: <code class="language-plaintext highlighter-rouge">_mm_shuffle_ps(rdir.m128, rdir.m128, _MM_SHUFFLE(0, 0, 0, 0))</code>.</p>
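<p>To make the difference concrete, here is a minimal side-by-side sketch, assuming the same <code class="language-plaintext highlighter-rouge">FVec4</code> union from earlier that exposes both named float components and the underlying <code class="language-plaintext highlighter-rouge">__m128</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Both lines broadcast rdir.x into all four lanes of a __m128, but they compile differently:
// the first extracts the scalar and re-broadcasts it (MOVSS plus a shuffle), while the second
// shuffles the existing register in place (a single SHUFPS).
__m128 rdirXviaSet1    = _mm_set1_ps(rdir.x);
__m128 rdirXviaShuffle = _mm_shuffle_ps(rdir.m128, rdir.m128, _MM_SHUFFLE(0, 0, 0, 0));
</code></pre></div></div>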
<p>Translating the <code class="language-plaintext highlighter-rouge">fmin()</code> and <code class="language-plaintext highlighter-rouge">fmax()</code> functions is very straightforward with SSE; we can use SSE’s <code class="language-plaintext highlighter-rouge">_mm_min_ps()</code> and <code class="language-plaintext highlighter-rouge">_mm_max_ps()</code> as direct analogues.
Putting all of the above together allows us to write a fully SSE-ized version of the compact scalar ray-box intersection test that intersects a single ray against four boxes simultaneously:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void rayBBoxIntersect4SSE(const Ray& ray,
const BBox4& bbox4,
IVec4& hits,
FVec4& tMins,
FVec4& tMaxs) {
FVec4 rdir(_mm_set1_ps(1.0f) / ray.direction.m128);
/* use _mm_shuffle_ps, which translates to a single instruction while _mm_set1_ps involves a
MOVSS + a shuffle */
FVec4 rdirX(_mm_shuffle_ps(rdir.m128, rdir.m128, _MM_SHUFFLE(0, 0, 0, 0)));
FVec4 rdirY(_mm_shuffle_ps(rdir.m128, rdir.m128, _MM_SHUFFLE(1, 1, 1, 1)));
FVec4 rdirZ(_mm_shuffle_ps(rdir.m128, rdir.m128, _MM_SHUFFLE(2, 2, 2, 2)));
FVec4 originX(_mm_shuffle_ps(ray.origin.m128, ray.origin.m128, _MM_SHUFFLE(0, 0, 0, 0)));
FVec4 originY(_mm_shuffle_ps(ray.origin.m128, ray.origin.m128, _MM_SHUFFLE(1, 1, 1, 1)));
FVec4 originZ(_mm_shuffle_ps(ray.origin.m128, ray.origin.m128, _MM_SHUFFLE(2, 2, 2, 2)));
IVec4 near(int(rdir.x >= 0.0f ? 0 : 3), int(rdir.y >= 0.0f ? 1 : 4),
int(rdir.z >= 0.0f ? 2 : 5));
IVec4 far(int(rdir.x >= 0.0f ? 3 : 0), int(rdir.y >= 0.0f ? 4 : 1),
int(rdir.z >= 0.0f ? 5 : 2));
tMins = FVec4(_mm_max_ps(
_mm_max_ps(_mm_set1_ps(ray.tMin),
(bbox4.corners[near.x].m128 - originX.m128) * rdirX.m128),
_mm_max_ps((bbox4.corners[near.y].m128 - originY.m128) * rdirY.m128,
(bbox4.corners[near.z].m128 - originZ.m128) * rdirZ.m128)));
tMaxs = FVec4(_mm_min_ps(
_mm_min_ps(_mm_set1_ps(ray.tMax),
(bbox4.corners[far.x].m128 - originX.m128) * rdirX.m128),
_mm_min_ps((bbox4.corners[far.y].m128 - originY.m128) * rdirY.m128,
(bbox4.corners[far.z].m128 - originZ.m128) * rdirZ.m128)));
int hit = ((1 << 4) - 1) & _mm_movemask_ps(_mm_cmple_ps(tMins.m128, tMaxs.m128));
hits[0] = bool(hit & (1 << (0)));
hits[1] = bool(hit & (1 << (1)));
hits[2] = bool(hit & (1 << (2)));
hits[3] = bool(hit & (1 << (3)));
}
</code></pre></div></div>
<div class="codecaption">Listing 7: SSE version of the compact Williams et al. 2005 implementation.</div>
<p>The last part of <code class="language-plaintext highlighter-rouge">rayBBoxIntersect4SSE()</code> where <code class="language-plaintext highlighter-rouge">hits</code> is populated might require a bit of explaining.
This last part implements the check for whether or not a ray actually hit the box based on the results stored in <code class="language-plaintext highlighter-rouge">tMin</code> and <code class="language-plaintext highlighter-rouge">tMax</code>.
This implementation takes advantage of the fact that misses in this implementation produce <code class="language-plaintext highlighter-rouge">inf</code> or <code class="language-plaintext highlighter-rouge">-inf</code> values; to figure out if a hit has occurred, we just have to check that in each lane, the <code class="language-plaintext highlighter-rouge">tMin</code> value is less than the <code class="language-plaintext highlighter-rouge">tMax</code> value, and <code class="language-plaintext highlighter-rouge">inf</code> values play nicely with this check.
So, to conduct the check across all lanes at the same time, we use <code class="language-plaintext highlighter-rouge">_mm_cmple_ps()</code>, which checks whether the 32-bit float in each lane of the first input is less than or equal to the corresponding 32-bit float in each lane of the second input.
If the comparison succeeds, <code class="language-plaintext highlighter-rouge">_mm_cmple_ps()</code> writes <code class="language-plaintext highlighter-rouge">0xFFFFFFFF</code> into the corresponding lane in the output <code class="language-plaintext highlighter-rouge">__m128</code>, and if the comparison fails, <code class="language-plaintext highlighter-rouge">0</code> is written instead.
The remaining <code class="language-plaintext highlighter-rouge">_mm_movemask_ps()</code> call packs the most significant bit of each lane (which holds the comparison result) into the low four bits of an ordinary int, and the bit shifts then just copy each of those bits out into the corresponding component of <code class="language-plaintext highlighter-rouge">hits</code>.</p>
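<p>For a sense of how <code class="language-plaintext highlighter-rouge">rayBBoxIntersect4SSE()</code> might be called, here is a small hypothetical usage sketch; in a real traversal implementation the <code class="language-plaintext highlighter-rouge">BBox4</code> would be built once when the BVH is constructed instead of per ray:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical usage: intersect one ray against four candidate bounding boxes at once.
bool anyBoxHit(const Ray& ray, const BBox& a, const BBox& b, const BBox& c, const BBox& d) {
    BBox4 boxes(a, b, c, d); // in practice, set up ahead of time and reused across many rays
    IVec4 hits;
    FVec4 tMins, tMaxs;
    rayBBoxIntersect4SSE(ray, boxes, hits, tMins, tMaxs);
    // tMins[i] holds the entry distance along the ray for box i whenever hits[i] is nonzero.
    return hits[0] || hits[1] || hits[2] || hits[3];
}
</code></pre></div></div>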
<p>I think variants of this 4-wide SSE ray-box intersection function are fairly common in production renderers; I’ve seen something similar developed independently at multiple studios and in multiple renderers, which shouldn’t be surprising since the translation from the original Williams et al. 2005 paper to an SSE-ized version is relatively straightforward.
Also, the performance results further hint at why variants of this implementation are popular!
Here is how the SSE implementation (Listing 7) performs compared to the scalar compact representation (Listing 5):</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">x86-64:</th>
<th style="text-align: center">x86-64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">44.5159 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">1.0x.</td>
</tr>
<tr>
<td style="text-align: right">SSE:</td>
<td style="text-align: center">10.9660 ns</td>
<td style="text-align: center">4.0686x</td>
<td style="text-align: center">13.6353 ns</td>
<td style="text-align: center">5.9474x</td>
</tr>
</tbody>
</table>
<p>The SSE implementation is almost exactly four times faster than the reference scalar compact implementation, which is exactly what we would expect as a best case for a properly written SSE implementation.
Actually, in the results listed above, the SSE implementation is listed as being slightly <em>more</em> than four times faster, but that’s just an artifact of averaging together results from multiple runs; the amount over 4x falls within the statistical margin of error.
A 4x speedup is the maximum speedup we can possibly expect given that SSE is 4-wide for 32-bit floats.
In our SSE implementation, the <code class="language-plaintext highlighter-rouge">BBox4</code> struct is already set up before the function is called, but the function still needs to translate each incoming ray into a form suitable for vector operations, which is additional work that the scalar implementation doesn’t need to do.
In order to make this additional setup work not drag down performance, the <code class="language-plaintext highlighter-rouge">_mm_shuffle_ps()</code> trick becomes very important.</p>
<p>Running the x86-64 version of the test program on arm64 using Rosetta 2 produces a more surprising result: close to a 6x speedup!
Running through Rosetta 2 means that the x86-64 and SSE instructions have to be translated to arm64 and Neon instructions, and the nearly 6x speedup here hints that for this test, Rosetta 2’s SSE to Neon translation ran much more efficiently than Rosetta 2’s x86-64 to arm64 translation.
Otherwise, a greater-than-4x speedup should not be possible if both implementations are being translated with equal levels of efficiency.
I did not expect that to be the case!
Unfortunately, while we can speculate, only Apple’s developers can say for sure what Rosetta 2 is doing internally that produces this result.</p>
<p><strong>Neon Implementation</strong></p>
<p>The second vectorized implementation we’ll look at is using Neon on arm64 processors.
Much like how all modern x86-64 processors support at least SSE2 because the 64-bit extension to x86 incorporated SSE2 into the base instruction set, all modern arm64 processors support Neon because the 64-bit extension to ARM incorporates Neon in the base instruction set.
Compared with SSE, Neon is a much more compact instruction set, which makes sense since SSE belongs to a CISC ISA while Neon belongs to a RISC ISA.
Neon includes a little over a hundred instructions, which is less than half the number of instructions that the full SSE to SSE4 instruction set contains.
Neon has all of the basics that one would expect, such as arithmetic operations and various comparison operations, but Neon doesn’t have more complex high-level instructions like the fancy shuffle instructions we used in our SSE implementation.</p>
<p>Much like how Intel has a searchable SSE intrinsics guide, ARM provides a helpful <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/">searchable intrinsics guide</a>.
Howard Oakley’s <a href="https://eclecticlight.co/2021/07/27/code-in-arm-assembly-rounding-and-arithmetic/">recent blog series</a> on writing arm64 assembly also includes a great <a href="https://eclecticlight.co/2021/08/23/code-in-arm-assembly-lanes-and-loads-in-neon/">introduction to using Neon</a>.
Note that even though there are fewer Neon instructions in total than there are SSE instructions, the ARM intrinsics guide lists several <em>thousand</em> functions; this is because of one of the chief differences between SSE and Neon.
SSE’s <code class="language-plaintext highlighter-rouge">__m128</code> is just a generic 128-bit container that doesn’t actually specify what type or how many lanes it contains; what type a <code class="language-plaintext highlighter-rouge">__m128</code> value holds and how many lanes it is interpreted as containing is entirely up to each SSE instruction.
Contrast this with Neon, which has explicit separate types for floats and integers, and also defines separate types based on lane width.
Since Neon has many different 128-bit types, each Neon instruction has multiple corresponding intrinsics that differ simply by the input types and widths accepted in the function signature.
As a result of all of the above differences from SSE, writing a Neon implementation is not quite as simple as just doing a one-to-one replacement of each SSE intrinsic with a Neon intrinsic.</p>
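<p>As a small standalone illustration of that explicit typing (the function name here is purely for demonstration), the “same” lane-wise add maps to a different intrinsic for each element type, and each of those variants gets its own entry in ARM’s intrinsics guide:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <arm_neon.h>

void neonTypedAddExample() {
    float32x4_t aF = vdupq_n_f32(1.0f), bF = vdupq_n_f32(2.0f);
    int32x4_t   aI = vdupq_n_s32(1),    bI = vdupq_n_s32(2);
    uint32x4_t  aU = vdupq_n_u32(1),    bU = vdupq_n_u32(2);
    float32x4_t sumF = vaddq_f32(aF, bF); // four 32-bit float lanes
    int32x4_t   sumI = vaddq_s32(aI, bI); // four signed 32-bit integer lanes
    uint32x4_t  sumU = vaddq_u32(aU, bU); // four unsigned 32-bit integer lanes
    (void)sumF; (void)sumI; (void)sumU;   // silence unused-variable warnings in this sketch
}
</code></pre></div></div>
<p>All of which suggests that a mechanical find-and-replace from SSE intrinsics to Neon intrinsics shouldn’t really be possible…</p>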
<p>…or is it?
Writing C/C++ code utilizing Neon instructions can be done by using the native Neon intrinsics found in <code class="language-plaintext highlighter-rouge"><arm_neon.h></code>, but another option exists through <a href="https://github.com/DLTcollab/sse2neon">the sse2neon project</a>.
When compiling for arm64, the x86-64 SSE <code class="language-plaintext highlighter-rouge"><xmmintrin.h></code> header is not available for use because every function in the <code class="language-plaintext highlighter-rouge"><xmmintrin.h></code> header maps to a specific SSE instruction or group of SSE instructions, and there’s no sense in the compiler trying to generate SSE instructions for a processor architecture that SSE instructions don’t even work on.
However, the function definitions for each intrinsic are just function definitions, and the sse2neon project reimplements every SSE intrinsic function with a Neon implementation under the hood.
So, using sse2neon, code originally written for x86-64 using SSE intrinsics can be compiled without modification on arm64, with Neon instructions generated from the SSE intrinsics.
A number of large projects originally written on x86-64 now have arm64 ports that utilize sse2neon to support vectorized code without having to completely rewrite using Neon intrinsics; as discussed in <a href="https://blog.yiningkarlli.com/2021/07/porting-takua-to-arm-pt2.html">my previous Takua on ARM post</a>, this approach is the exact approach that was taken to port <a href="https://www.embree.org">Embree</a>
to arm64.</p>
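<p>Conceptually, the trick behind sse2neon looks something like the following highly simplified sketch; this is not the actual sse2neon source, which additionally wraps everything in the type-reinterpretation helpers discussed a bit further below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <arm_neon.h>

// On arm64, the SSE types and intrinsic names are redefined in terms of Neon equivalents.
typedef float32x4_t __m128;

static inline __m128 _mm_set1_ps(float w)           { return vdupq_n_f32(w); }
static inline __m128 _mm_max_ps(__m128 a, __m128 b) { return vmaxq_f32(a, b); }
static inline __m128 _mm_min_ps(__m128 a, __m128 b) { return vminq_f32(a, b); }
</code></pre></div></div>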
<p>The sse2neon project was originally started by John W. Ratcliff and a few others at NVIDIA to port a handful of games from x86-64 to arm64; the original version of sse2neon only implemented the small subset of SSE that was needed for their project.
However, after the project was posted to GitHub with an MIT license, other projects found sse2neon useful and contributed additional extensions that eventually fleshed out full coverage for MMX and all versions of SSE from SSE1 all the way through SSE4.2.
For example, Syoyo Fujita’s <a href="https://github.com/lighttransport/embree-aarch64">embree-aarch64 project</a>, which was the basis of Intel’s official Embree arm64 port, resulted in a number of improvements to sse2neon’s precision and faithfulness to the original SSE behavior.
Over the years sse2neon has seen contributions and improvements from NVIDIA, Amazon, Google, the Embree-aarch64 project, the Blender project, and recently Apple as part of Apple’s larger slew of contributions to various projects to improve arm64 support for Apple Silicon.
Similar open-source projects also exist to further generalize SIMD intrinsics headers (<a href="https://github.com/simd-everywhere/simde">simde</a>), to reimplement the AVX intrinsics headers using Neon (<a href="https://github.com/kunpengcompute/AvxToNeon">AvxToNeon</a>), and Intel even has a project to do the reverse of sse2neon: reimplement Neon using SSE (<a href="https://github.com/intel/ARM_NEON_2_x86_SSE">ARM_NEON_2_x86_SSE</a>).</p>
<p>While learning about Neon and while looking at how Embree was ported to arm64 using sse2neon, I started to wonder how efficient using sse2neon versus writing code directly using Neon intrinsics would be.
The SSE and Neon instruction sets don’t have a one-to-one mapping to each other for many of the more complex higher-level instructions that exist in SSE, and as a result, some SSE intrinsics that compiled down to a single SSE instruction on x86-64 have to be implemented on arm64 using many Neon instructions.
As a result, at least in principle, my expectation was that on arm64, code written directly using Neon intrinsics should typically have at least a small performance edge over SSE code ported using sse2neon.
So, I decided to do a direct comparison in my test program, which required implementing the 4-wide ray-box intersection test using Neon:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inline uint32_t neonCompareAndMask(const float32x4_t& a, const float32x4_t& b) {
uint32x4_t compResUint = vcleq_f32(a, b);
static const int32x4_t shift = { 0, 1, 2, 3 };
uint32x4_t tmp = vshrq_n_u32(compResUint, 31);
return vaddvq_u32(vshlq_u32(tmp, shift));
}
void rayBBoxIntersect4Neon(const Ray& ray,
const BBox4& bbox4,
IVec4& hits,
FVec4& tMins,
FVec4& tMaxs) {
FVec4 rdir(vdupq_n_f32(1.0f) / ray.direction.f32x4);
/* since Neon doesn't have a single-instruction equivalent to _mm_shuffle_ps, we just take
the slow route here and load into each float32x4_t */
FVec4 rdirX(vdupq_n_f32(rdir.x));
FVec4 rdirY(vdupq_n_f32(rdir.y));
FVec4 rdirZ(vdupq_n_f32(rdir.z));
FVec4 originX(vdupq_n_f32(ray.origin.x));
FVec4 originY(vdupq_n_f32(ray.origin.y));
FVec4 originZ(vdupq_n_f32(ray.origin.z));
IVec4 near(int(rdir.x >= 0.0f ? 0 : 3), int(rdir.y >= 0.0f ? 1 : 4),
int(rdir.z >= 0.0f ? 2 : 5));
IVec4 far(int(rdir.x >= 0.0f ? 3 : 0), int(rdir.y >= 0.0f ? 4 : 1),
int(rdir.z >= 0.0f ? 5 : 2));
tMins =
FVec4(vmaxq_f32(vmaxq_f32(vdupq_n_f32(ray.tMin),
(bbox4.corners[near.x].f32x4 - originX.f32x4) * rdirX.f32x4),
vmaxq_f32((bbox4.corners[near.y].f32x4 - originY.f32x4) * rdirY.f32x4,
(bbox4.corners[near.z].f32x4 - originZ.f32x4) * rdirZ.f32x4)));
tMaxs = FVec4(vminq_f32(vminq_f32(vdupq_n_f32(ray.tMax),
(bbox4.corners[far.x].f32x4 - originX.f32x4) * rdirX.f32x4),
vminq_f32((bbox4.corners[far.y].f32x4 - originY.f32x4) * rdirY.f32x4,
(bbox4.corners[far.z].f32x4 - originZ.f32x4) * rdirZ.f32x4)));
uint32_t hit = neonCompareAndMask(tMins.f32x4, tMaxs.f32x4);
hits[0] = bool(hit & (1 << (0)));
hits[1] = bool(hit & (1 << (1)));
hits[2] = bool(hit & (1 << (2)));
hits[3] = bool(hit & (1 << (3)));
}
</code></pre></div></div>
<div class="codecaption">Listing 8: Neon version of the compact Williams et al. 2005 implementation.</div>
<p>Even if you only know SSE and have never worked with Neon, you should already be able to tell broadly how the Neon implementation in Listing 8 works!
Just from the name alone, <code class="language-plaintext highlighter-rouge">vmaxq_f32()</code> and <code class="language-plaintext highlighter-rouge">vminq_f32()</code> obviously correspond directly to <code class="language-plaintext highlighter-rouge">_mm_max_ps()</code> and <code class="language-plaintext highlighter-rouge">_mm_min_ps()</code> in the SSE implementation, and understanding how the ray data is being loaded into Neon’s 128-bit registers using <code class="language-plaintext highlighter-rouge">vdupq_n_f32()</code> instead of <code class="language-plaintext highlighter-rouge">_mm_set1_ps()</code> should be relatively easy too.
However, because there is no fancy single-instruction shuffle intrinsic available in Neon, the way the ray data is loaded is potentially slightly less efficient.</p>
<p>The largest area of difference between the Neon and SSE implementations is in the processing of the tMin and tMax results to produce the output <code class="language-plaintext highlighter-rouge">hits</code> vector.
The SSE version uses just two intrinsic functions because SSE includes the fancy high-level <code class="language-plaintext highlighter-rouge">_mm_cmple_ps()</code> intrinsic, which compiles down to a single <code class="language-plaintext highlighter-rouge">CMPPS</code> SSE instruction, but implementing this functionality using Neon takes some more work.
The <code class="language-plaintext highlighter-rouge">neonCompareAndMask()</code> helper function implements the <code class="language-plaintext highlighter-rouge">hits</code> vector processing using four Neon intrinsics; a better solution may exist, but for now this is the best I can do given my relatively basic level of Neon experience.
If you have a better solution, feel free to let me know!</p>
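<p>To make the helper a little easier to follow, here is a step-by-step trace through <code class="language-plaintext highlighter-rouge">neonCompareAndMask()</code> for a hypothetical input where the first and third boxes are hit and the other two are missed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// vcleq_f32(a, b)              -> { 0xFFFFFFFF, 0x0, 0xFFFFFFFF, 0x0 }  per-lane all-ones / all-zeros mask
// vshrq_n_u32(compResUint, 31) -> { 1, 0, 1, 0 }                        reduce each lane to a single bit
// vshlq_u32(tmp, {0, 1, 2, 3}) -> { 1, 0, 4, 0 }                        shift each lane's bit into its position
// vaddvq_u32(...)              -> 5, i.e. 0b0101                        horizontal add packs the final bitmask
</code></pre></div></div>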
<p>Here’s how the native Neon intrinsics implementation performs compared with using sse2neon to translate the SSE implementation.
For an additional point of comparison, I’ve also included the Rosetta 2 SSE result from the previous section.
Note that the speedup column for Rosetta 2 here isn’t comparing how much faster the SSE implementation running over Rosetta 2 is with the compact scalar implementation running over Rosetta 2; instead, the Rosetta 2 speedup columns here compare how much faster (or slower) the Rosetta 2 runs are compared with the <em>native</em> arm64 compact scalar implementation:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">arm64:</th>
<th style="text-align: center">arm64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup over Native:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">41.8187 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">0.5157x</td>
</tr>
<tr>
<td style="text-align: right">SSE:</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
<td style="text-align: center">13.6353 ns</td>
<td style="text-align: center">3.0669x</td>
</tr>
<tr>
<td style="text-align: right">SSE2NEON:</td>
<td style="text-align: center">12.3090 ns</td>
<td style="text-align: center">3.3974x</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
</tr>
<tr>
<td style="text-align: right">Neon:</td>
<td style="text-align: center">12.2161 ns</td>
<td style="text-align: center">3.4232x</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
</tr>
</tbody>
</table>
<p>I originally also wanted to include a test that would have been the reverse of sse2neon: use Intel’s <a href="https://github.com/intel/ARM_NEON_2_x86_SSE">ARM_NEON_2_x86_SSE</a> project to get the Neon implementation working on x86-64.
However, when I tried using ARM_NEON_2_x86_SSE, I discovered that the project isn’t quite complete enough yet (as of the time of writing) to actually compile the Neon implementation in Listing 8.</p>
<p>I was very pleased to see that both of the native arm64 implementations ran faster than the SSE implementation running over Rosetta 2, which means that my native Neon implementation is at least halfway decent and also that sse2neon works as advertised.
The native Neon implementation is also just a hair faster than the sse2neon implementation, which indicates that at least here, rewriting using native Neon intrinsics instead of mapping from SSE to Neon does indeed produce slightly more efficient code.
However, the sse2neon implementation is very, very close in terms of performance, to the point where the difference may well be within the margin of error.
Overall, both of the native arm64 implementations get a respectable speedup over the compact scalar reference, even though the speedup amounts are a bit less than the ideal 4x.
I think that the slight performance loss compared to the ideal 4x is probably attributable to the more complex solution required for filling the output <code class="language-plaintext highlighter-rouge">hits</code> vector.</p>
<p>To better understand why the sse2neon implementation performs so close to the native Neon implementation, I tried just copy-pasting every single function implementation out of sse2neon into the SSE 4-wide ray-box intersection test.
Interestingly, the result was extremely similar to my native Neon implementation; structurally they were more or less identical, but the sse2neon version had some additional extraneous calls.
For example, instead of replacing <code class="language-plaintext highlighter-rouge">_mm_max_ps(a, b)</code> one-to-one with <code class="language-plaintext highlighter-rouge">vmaxq_f32(a, b)</code>, sse2neon’s version of <code class="language-plaintext highlighter-rouge">_mm_max_ps(a, b)</code> is <code class="language-plaintext highlighter-rouge">vreinterpretq_m128_f32(vmaxq_f32(vreinterpretq_f32_m128(a), vreinterpretq_f32_m128(b)))</code>.
<code class="language-plaintext highlighter-rouge">vreinterpretq_m128_f32</code> is a helper function defined by sse2neon to translate an input <code class="language-plaintext highlighter-rouge">__m128</code> into a <code class="language-plaintext highlighter-rouge">float32x4_t</code>.
There’s a lot of reinterpreting of inputs to specific float or integer types in sse2neon; all of the reinterpreting in sse2neon is to convert from SSE’s generic <code class="language-plaintext highlighter-rouge">__m128</code> to Neon’s more specific types.
In the specific case of <code class="language-plaintext highlighter-rouge">vreinterpretq_m128_f32</code>, the reinterpretation should actually compile down to a no-op since sse2neon typedefs <code class="language-plaintext highlighter-rouge">__m128</code> directly to <code class="language-plaintext highlighter-rouge">float32x4_t</code>, but many of sse2neon’s other reinterpretation functions do require additional extra Neon instructions to implement.</p>
<p>Even though the Rosetta 2 result is definitively slower than the native arm64 results, the Rosetta 2 result is far closer to the native arm64 results than I normally would have expected.
Rosetta 2 usually can be expected to perform somewhere in the neighborhood of 50% to 80% of native performance for compute-heavy code, and the Rosetta 2 performance for the compact scalar implementation lines up with this expectation.
However, the Rosetta 2 performance for the vectorized version lends further credence to the theory from the previous section that Rosetta 2 somehow is better able to translate vectorized code than scalar code.</p>
<p><strong>Auto-vectorized Implementation</strong></p>
<p>The unfortunate thing about writing vectorized programs using vector intrinsics is that… vector intrinsics can be hard to use!
Vector intrinsics are intentionally fairly low-level, which means that when compared to writing normal C or C++ code, using vector intrinsics is only a half-step above writing code directly in assembly.
The vector intrinsics APIs provided for SSE and Neon have very large surface areas, since a large number of intrinsic functions exist to cover the large number of vector instructions that there are.
Furthermore, unless compatibility layers like sse2neon are used, vector intrinsics are not portable between different processor architectures in the same way that normal higher-level C and C++ code is.
Even though I have some experience working with vector intrinsics, I still don’t consider myself even remotely close to comfortable or proficient in using them; I have to rely heavily on looking up everything using various reference guides.</p>
<p>One potential solution to the difficulty of using vector intrinsics is compiler <a href="https://en.wikipedia.org/wiki/Automatic_vectorization">auto-vectorization</a>.
Auto-vectorization is a compiler technique that aims to allow programmers to better utilize vector instructions without requiring programmers to write everything using vector intrinsics.
Instead of writing vectorized programs, programmers write standard scalar programs which the compiler’s auto-vectorizer then converts into a vectorized program at compile-time.
One common auto-vectorization technique that many compilers implement is loop vectorization, which takes a serial innermost loop and restructures the loop such that each iteration of the loop maps to one vector lane.
Implementing loop vectorization can be extremely tricky, since like with any other type of compiler optimization, the cardinal rule is that the originally written program behavior must be unmodified and the original data dependencies and access orders must be preserved.
Add in the need to consider all of the various concerns that are specific to vector instructions, and the result is that loop vectorization is easy to get wrong if not implemented very carefully by the compiler.
However, when loop vectorization is available and working correctly, the performance increase to otherwise completely standard scalar code can be significant.</p>
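<p>As a minimal sketch (not taken from the test program) of the kind of loop that a loop vectorizer targets: every iteration below is independent of the others and the trip count is known, so a compiler’s loop vectorizer can, assuming the pointers don’t alias, map the four scalar iterations onto 4-wide vector operations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <math.h>

void scaleMax4(const float* a, const float* b, const float* scale, float* out) {
    for (int i = 0; i < 4; i++) {
        out[i] = fmax(a[i], b[i]) * scale[i];
    }
}
</code></pre></div></div>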
<p>The 4-wide ray-box intersection test should be a perfect candidate for auto-vectorization!
The scalar implementations are implemented as just a single for loop that calls the single ray-box test once per iteration of the loop, for four iterations.
Inside of the loop, the ray-box test is fundamentally just a bunch of simple min/max operations and a little bit of arithmetic, which as seen in the SSE and Neon implementations, is the easiest part of the whole problem to vectorize.
I originally expected that I would have to compile the entire test program with all optimizations disabled, because I thought that with optimizations enabled, the compiler would auto-vectorize the compact scalar implementation and make comparisons with the hand-vectorized implementations difficult.
However, after some initial testing, I realized that the scalar implementations weren’t really getting auto-vectorized at all even with optimization level <code class="language-plaintext highlighter-rouge">-O3</code> enabled.
Or, more precisely, the compiler was emitting long stretches of code using vector instructions and vector registers… but the compiler was just utilizing one lane in all of those long stretches of vector code, and was still looping over each bounding box separately.
As a point of reference, here is <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGISWakrgAyeAyYAHI%2BAEaYxBJcAMykAA6oCoRODB7evv6BaRmOAqHhUSyx8VxJtpj2xQxCBEzEBDk%2BfgG19VlNLQSlkTFxCckKza3teV3j/YPllaMAlLaoXsTI7BzmiXhUANRYNOHoEAD6Z6oAHABsZzeSF0smGgCCO2HI3lj7JoluqhYLDCBGIYQAdAg/tgXu8zIk6ntDphjphThcmC1kAgHk9YR8GF8vD8/m4WiwzuEBJDofj4a49nTdoTvphfv8WEwCAgaYlsPtfm9YeNiF4HPsAGIANUwyEkvwA7FY3gKBV4GFlFRYBQB6HX7RRs9Wa/jEfYEACeKTZKXVGqM%2BwuLC4Ziu%2B0M6H2VFoqC5iTMqkeBCZSKOYTR50ut3ujzOzxVqsTTpdbudrr%2ByrhCNooZR4fRZ0xGxxsfjr0Tie9voI/sDZwIXtrkgzTIZVFhFdVIrFDZMSo7nYrVa5%2B1ULYTg9Vw4bFvH5cnU59I4AXnOF4vq/sAO5rwd9gAiu8708OXKYJgArBZmxfD4lM4mD3OBwLpbLJBAlvsQKOIBpwRoVArPsFp/gBQGkPsy5gYBwFbjBQFak%2BQrwrmqIFtcdy4nGL6SjKcqRmmbpUJIX4/kR1CkUhCqHih2ZofmkZFti2Flomb4EdOTb1l6VE/lQTaUV%2BfZWDRrYMPg7ZChOeHvtQS4NqoZyQSeFrKV6ClQepJ5bnG36/kpwGgWpwHQcucaQfBunCf2YkyRxH4nkpKmaWpLmbuZZEGRZIEQCZkFmT58H/rB1G0e8Mknqg1rEFyJCXlYt4QCC%2Bx4F%2BaAMOMWr7MQmAEOsDCns0CV4Jed7ash84CtO5g3Ps0VxHFxAJWVyWCKlNnarl%2BXEIV6BniVZUZoq4W4Q59UxU11gQBlWUObV%2BzRJ1OV5QVskEaovyWIt4KqJBFpbdq0TghaAWHTty6Wedx1bs894jWN%2BHyg1sVEMQAC0M0CHNT0LUt2XdWtDkQJt707XtIH7GDx2nVBUMXVd0Pgrdw2VexT0TY1b0AFRfZlDY1WYdX/SJK09YVwObdji37fs1PRGd9NXfTd0VXZVXrc9k1vTqeNZYTxPLYDvWcyD%2Bz6gzkMS2d0vbuLi2sw9Qpife%2BJvGEObhJzmOvSQvOzQTCkLUwkEG5zf2dbhwsU09EBMPLx0Q/bEsnZBzsI%2B6DvI6z%2BLs8KoI9vsACSGMibhxoCEhMmJt24ph9Hg4pWOqsJ52KWzinHOTilq6Z%2BuAopTued7irD6J%2B1/XFVeN7lbhT6Z7hIdyV5qh/kZbcBR325t2Fje20n6np4P7XmZBhd6T%2BreGftfk%2BYFcEQNZvcyU3BED2P7VualI8T957f%2BVBECeZZPdh37MkpS9U1Xq1KVpfsZuk9bRXnlepW3qj7OJiCC1X29LVJTvkLVaItK6vwsO/cqSt3ilzVq8WODYLAWFQJteOHMI6FTQZOE8aBepxAUAlG4Q17qDj1NvLAqg0TugUL8K8wIGAAA19j0IAJrMLCAALWYUwVQTDOSqDYfwjhQ1U4aU3Lg8IxAFCvFoMGK8ZghpXkSMQ7U4t9RhAoVQpgNCJFxASvQgA8sQAAsjwxRFgGEsOER/GS9dMy4SQSgvmDZ5pE3YdkEgkjTbfRcb9Nx/CPB4OIJbURuipEyLkRYDQ5jom3nZPuL09CID0MCZI3akEAmeLiLtH2oSsnhNkQlWJh4rxcDKvExJYRklhFSdk2GmSgknVyVnVUYTpGFKvMUhKCi4l/ASVQJJKT8ngkutw1QtTiAjOaZONpESEplI/p08pfTEk8OqR4xpEMGlpNUNMwcsyOkWAWSUo5yzEj9P4esiZrsxnXItHszsBzInHO6Wci5ayhmbIyTw65uyjzQMeu%2BdxEzPwPx8QDEBNs5JtKKWVbxQSCGlLhWChFrzbyK0qoCuUtz8mgsfkqMmQNbYwqUcikl14yX5MRRYC8ZUMV%2B1gdJeBAdxQACUmAHSwaqca%2BBcoOCyP88aJA8DADCP8k8BBjFiuLtVTSkqzENxkuy0CZtXF1V5bKBo6lVV%2BLqsK0VDBtJyqlYa9yI55WGVwomH8Gr%2BUCAgLarVwF9VVJdYa4CkqqmevdZBC1EALV6TPqNRlEVXjRFQJ4HKHLHGqCDoIfBmqRBiBaB4FgKQmAOGcfsZVC1YqwzNjGv64aIYC3NCas1wY3EWpCRzcaxBeUVK4OBeWebwSOoFTK4OGNwgtDagQCA9a8CTNQXyFZIUDh/AlPsDQ%2BlEjARBAO3lJ12QwnOdO5tk79hcH0qRUgVry79sHZM5cK6x0bsSFOsw%2BkLxLAeV2oFVBe0LqPbtU9a7x3sinYkfSGh52CEXUO5dtJ33nqnfKH8XAVj7rTv%2Bl9J7gMJI/Zui8%2BkzC3ufDJb1FSBlrJw63Vt3rIIQGiMW8E5Ke3DvKWDVtbqcl0xyku3Ze7RHYMucR0j5HMAtBOlRqNFpwS0fufRl9p1oP5wFOxlBZGqUJQoyM3jNGwQGqmcJpdy50OdotdhpJAyqkEdMRDSTqhpOoqvI%2ByjcTqMcoE0piEuzVOAaY2J1jVSjMmckdS8zPHLN8ZsyKiEQnqYieYy08TEmSNSfJV51cPnFP%2BYYCpoLamNNlzCxWK2kKy1hHZG4FZFrnzn1eAAN1QHgT0eaY1xoIAmhwkgk0GGIKm9NmazY5rcXmkLaWuv7PBYWtxEXVAaE691kbAoC3ILHP14tXBhuje6%2BNlBRaUGBGc3NmZvWJtLdUMkVba3Oyr2bG4hAhAFCzb2%2BJtVWXMpnfO%2BuS7FqFA1sTMdgg1LikVL7UsCrE2qs1eEKIBrTWM2Ho5ZBAbQ2rtveRQ92F6L/kvepccj7IIvvRp%2B/GqRiaAcptQGm4HA7QeLWm76k1iPocGbJ3DztCO0UJJWZ977KDfuY4cPVnHePM0daJ8tknYRqU9P3CTnh/O6Xw5OwlZRvS10M7R0zjHShWfY8a7j5rIPYYDeSJDiX5Phfa6p/Y9mHAVi0E4BeXgfgOBaFIKgTguXLDWH2AoNYGw2Q7B4KQAgmgjcrAANYgAvENk3HBJDm699bzgvAFAgCG57y3RvSBwFgEgNAaa6BxHIJQFPKQ0/xC%2BIYYA1QNAzZoLI/BlBohh%2BiGEFoFpODu5T2wQQBiGC0Fr3H0gWBORGHEO39tRXMBR/b5gShyAvDVbr7wEEdQw85miLFYgFoPBYDD6CPALAJ8rG9EwYACgpR4EwFuAx1oLfu/4IIer7ApAyEEIaNQYfdAzYMEYFA1hrD6DwNEKPkAVj
RQaIP96BiiQkedQmqWQLgEkUwfgXACoQQEk8wwwVQCoNwqQ6QmQAgkBEgMBhQaBDA8BFQIw0ByBdgoBAgfQkwngHQmB3QJBjQEwAw4YCwBBSBtgdBGB0BLBcwDBCBmBNwKwTu6wmwegoImAWwPAxupuoe7eNuHAmE70DwD8T%2BwAW6Ta/426EAduVglgkEuAhAJAW0iQM2%2BwTWOe%2BhkGvAseWgt6pAfuAe%2BgnAIepAFuVu0hke0eHuXuVhQeZgkhzhEe7hceVh/eUiYBkgQAA">the x86-64 compiled output</a> and <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGISWakrgAyeAyYAHI%2BAEaYxBJcAMykAA6oCoRODB7evv6BaRmOAqHhUSyx8VxJtpj2xQxCBEzEBDk%2BfgG19VlNLQSlkTFxCckKza3teV3j/YPllaMAlLaoXsTI7BzmiXhUANRYNOHoEAD6Z6oAHABsZzeSF0smGgCCO2HI3lj7JoluqhYLDCBGIYQAdAg/tgXu8zIk6ntDphjphThcmC1kAgHk9YR8GF8vD8/m4WiwzuEBJDofj4a49nTdoTvphfv8WEwCAgaYlsPtfm9YeNiF4HPsAGIANUwyEkvwA7FY3gKBV4GFlFRYBQB6HX7RRs9Wa/jEfYEACeKTZKXVGqM%2BwuLC4Ziu%2B0M6H2VFoqC5iTMqkeBCZSKOYTR50ut3ujzOzxVqsTTpdbudrr%2ByrhCNooZR4fRZ0xGxxsfjr0Tie9voI/sDZwIXtrkgzTIZVFhFdVIrFDZMSo7nYrVa5%2B1ULYTg9Vw4bFvH5cnU59I4AXnOF4vq/sAO5rwd9gAiu8708OXKYJgArBZmxfD4lM4mD3OBwLpbLJBAlvsQKOIBpwRoVArPsFp/gBQGkPsy5gYBwFbjBQFak%2BQrwrmqIFtcdy4nGL6SjKcqRmmbpUJIX4/kR1CkUhCqHih2ZofmkZFti2Flomb4EdOTb1l6VE/lQTaUV%2BfZWDRrYMPg7ZChOeHvtQS4NqoZyQSeFrKV6ClQepJ5bnG36/kpwGgWpwHQcucaQfBunCf2YkyRxH4nkpKmaWpLmbuZZEGRZIEQCZkFmT58H/rB1G0e8Mknqg1rEFyJCXlYt4QCC%2Bx4F%2BaAMOMWr7MQmAEOsDCns0CV4Jed7ash84CtO5g3Ps0VxHFxAJWVyWCKlNnarl%2BXEIV6BniVZUZoq4W4Q59UxU11gQBlWUObV%2BzRJ1OV5QVskEaovyWIt4KqJBFpbdq0TghaAWHTty6Wedx1bs894jWN%2BHyg1sVEMQAC0M0CHNT0LUt2XdWtDkQJt707XtIH7GDx2nVBUMXVd0Pgrdw2VexT0TY1b0AFRfZlDY1WYdX/SJK09YVwObdji37fs1PRGd9NXfTd0VXZVXrc9k1vTqeNZYTxPLYDvWcyD%2Bz6gzkMS2d0vbuLi2sw9Qpife%2BJvGEObhJzmOvSQvOzQTCkLUwkEG5zf2dbhwsU09EBMPLx0Q/bEsnZBzsI%2B6DvI6z%2BLs8KoI9vsACSGMibhxoCEhMmJt24ph9Hg4pWOqsJ52KWzinHOTilq6Z%2BuAopTued7irD6J%2B1/XFVeN7lbhT6Z7hIdyV5qh/kZbcBR325t2Fje20n6np4P7XmZBhd6T%2BreGftfk%2BYFcEQNZvcyU3BED2P7VualI8T957f%2BVBECeZZPdh37MkpS9U1Xq1KVpfsZuk9bRXnlepW3qj7OJiCC1X29LVJTvkLVaItK6vwsO/cqSt3ilzVq8WODYLAWFQJteOHMI6FTQZOE8aBepxAUAlG4Q17qDj1NvLAqg0TugUL8K8wIGAAA19j0IAJrMLCAALWYUwVQTDOSqDYfwjhQ1U4aU3Lg8IxAFCvFoMGK8ZghpXkSMQ7U4t9RhAoVQpgNCJFxASvQgA8sQAAsjwxRFgGEsOER/GS9dMy4SQSgvmDZ5pE3YdkEgkjTbfRcb9Nx/CPB4OIJbURuipEyLkRYDQ5jom3nZPuL09CID0MCZI3akEAmeLiLtH2oSsnhNkQlWJh4rxcDKvExJYRklhFSdk2GmSgknVyVnVUYTpGFKvMUhKCi4l/ASVQJJKT8ngkutw1QtTiAjOaZONpESEplI/p08pfTEk8OqR4xpEMGlpNUNMwcsyOkWAWSUo5yzEj9P4esiZrsxnXItHszsBzInHO6Wci5ayhmbIyTw65uyjzQMeu%2BdxEzPwPx8QDEBNs5JtKKWVbxQSCGlLhWChFrzbyK0qoCuUtz8mgsfkqMmQNbYwqUcikl14yX5MRRYC8ZUMV%2B1gdJeBAdxQACUmAHSwaqca%2BBcoOCyP88aJA8DADCP8k8BBjFiuLtVTSkqzENxkuy0CZtXF1V5bKBo6lVV%2BLqsK0VDBtJyqlYa9yI55WGVwomH8Gr%2BUCAgLarVwF9VVJdYa4CkqqmevdZBC1EALV6TPqNRlEVXjRFQJ4HKHLHGqCDoIfBmqRBiBaB4FgKQmAOGcfsZVC1YqwzNjGv64aIYC3NCas1wY3EWpCRzcaxBeUVK4OBeWebwSOoFTK4OGNwgtDagQCA9a8CTNQXyFZIUDh/AlPsDQ%2BlEjARBAO3lJ12QwnOdO5tk79hcH0qRUgVry79sHZM5cK6x0bsSFOsw%2BkLxLAeV2oFVBe0LqPbtU9a7x3sinYkfSGh52CEXUO5dtJ33nqnfKH8XAVj7rTv%2Bl9J7gMJI/Zui8%2BkzC3ufDJb1FSBlrJw63Vt3rIIQGiMW8E5Ke3DvKWDVtbqcl0xyku3Ze7RHYMucR0j5HMAtBOlRqNFpwS0fufRl9p1oP5wFOxlBZGqUJQoyM3jNGwQGqmcJpdy50OdotdhpJAyqkEdMRDSTqhpOoqvI%2ByjcTqMcoE0piEuzVOAaY2J1jVSjMmckdS8zPHLN8ZsyKiEQnqYieYy08TEmSNSfJV51cPnFP%2BYYCpoLamNNlzCxWK2kKy1hHZG4FZFrnzn1eAAN1QHgT0eaY1xoIAmhwkgk0GGIKm9NmazY5rcXmkLaWuv7PBYWtxEXVAaE691kbAoC3ILHP14tXBhuje6%2BNlBRaUGBGc3NmZvWJtLdUMkVba3Oyr2bG4hAhAFCzb2%2BJtVWXMpnfO%2BuS7FqFA1sTMdgg1LikVL7UsCrE2qs1eEKIBrTWM2Ho5ZBAbQ2rtveRQ92F6L/kvepccj7IIvvRp%2B/GqRiaAcptQGm4HA7QeLWm76k1iPocGbJ3DztCO0UJJWZ977KDfuY4cPVnHePM0daJ8tknYRqU9P3CTnh/O6Xw5OwlZRvS10M7R0zjHShWfY8a7j5rIPYYDeSJDiX5Phfa6p/Y9m
HAVi0E4BeXgfgOBaFIKgTguXLDWH2AoNYGw2Q7B4KQAgmgjcrAANYgAvENk3HBJDm699bzgvAFAgCG57y3RvSBwFgEgNAaa6BxHIJQFPKQ0/xHJEVq470viGGANUDQM2aCyPwZQaIYfohhBaBaTg7uU9sEEAYhgtBG9x9IFgTkRhxDd/bUVzAUfu%2BYEocgLw1Wm%2B8BBHUMPOZoixWIBaDwWAw%2BgjwCwGfKxvRMGAAoKUeBMBbgMdaC37v%2BCCHq%2BwKQMhBCGjUGH3QM2DBGBQNYaw%2Bg8DRCj5AFY0UDQo%2B70BiiQkedQmqWQLgEkUwfgM2IQ4YCwIwM2hQmQAgsBegqBDQ8wwwVQ3QkBAgfQkwngHQegdgBBjQEwAwiBuBZBVBGBM2swrQOBFQyBKwTu6wmwegoImAWwPAxupuoe3eNuHAeeBe9sReDo1QAE4I26EAduVglgkEuAhAJAW0iQM2%2BwTWOe6hkGvAseWgt6pAfuAe%2BgnAIepAFuVuIhke0eHuXuRhQeZgQh1hEe9hceRhw%2BUiUBkgQAA">the arm64 compiled output</a> for the compact scalar implementation.</p>
<p>Finding that the auto-vectorizer wasn’t really working on the scalar implementations led me to try to write a new scalar implementation that would auto-vectorize well.
To try to give the auto-vectorizer as good of a chance as possible at working well, I started with the compact scalar implementation and embedded the single-ray-box intersection test into the 4-wide function as an inner loop.
I also pulled apart the implementation into a more expanded form where every line in the inner loop carries out a single arithmetic operation that can be mapped exactly to one SSE or Neon instruction.
I also restructured the data input to the inner loop to be in a readily vector-friendly layout; the restructuring is essentially a scalar implementation of the vectorized setup code found in the SSE and Neon hand-vectorized implementations.
Finally, I put a <code class="language-plaintext highlighter-rouge">#pragma clang loop vectorize(enable)</code> in front of the inner loop to make sure that the compiler knows that it can use the loop vectorizer here.
Putting all of the above together produces the following, which is as auto-vectorization-friendly as I could figure out how to rewrite things:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void rayBBoxIntersect4AutoVectorize(const Ray& ray,
const BBox4& bbox4,
IVec4& hits,
FVec4& tMins,
FVec4& tMaxs) {
float rdir[3] = { 1.0f / ray.direction.x, 1.0f / ray.direction.y, 1.0f / ray.direction.z };
float rdirX[4] = { rdir[0], rdir[0], rdir[0], rdir[0] };
float rdirY[4] = { rdir[1], rdir[1], rdir[1], rdir[1] };
float rdirZ[4] = { rdir[2], rdir[2], rdir[2], rdir[2] };
float originX[4] = { ray.origin.x, ray.origin.x, ray.origin.x, ray.origin.x };
float originY[4] = { ray.origin.y, ray.origin.y, ray.origin.y, ray.origin.y };
float originZ[4] = { ray.origin.z, ray.origin.z, ray.origin.z, ray.origin.z };
float rtMin[4] = { ray.tMin, ray.tMin, ray.tMin, ray.tMin };
float rtMax[4] = { ray.tMax, ray.tMax, ray.tMax, ray.tMax };
IVec4 near(int(rdir[0] >= 0.0f ? 0 : 3), int(rdir[1] >= 0.0f ? 1 : 4),
int(rdir[2] >= 0.0f ? 2 : 5));
IVec4 far(int(rdir[0] >= 0.0f ? 3 : 0), int(rdir[1] >= 0.0f ? 4 : 1),
int(rdir[2] >= 0.0f ? 5 : 2));
float product0[4];
#pragma clang loop vectorize(enable)
for (int i = 0; i < 4; i++) {
product0[i] = bbox4.corners[near.y][i] - originY[i];
tMins[i] = bbox4.corners[near.z][i] - originZ[i];
product0[i] = product0[i] * rdirY[i];
tMins[i] = tMins[i] * rdirZ[i];
product0[i] = fmax(product0[i], tMins[i]);
tMins[i] = bbox4.corners[near.x][i] - originX[i];
tMins[i] = tMins[i] * rdirX[i];
tMins[i] = fmax(rtMin[i], tMins[i]);
tMins[i] = fmax(product0[i], tMins[i]);
product0[i] = bbox4.corners[far.y][i] - originY[i];
tMaxs[i] = bbox4.corners[far.z][i] - originZ[i];
product0[i] = product0[i] * rdirY[i];
tMaxs[i] = tMaxs[i] * rdirZ[i];
product0[i] = fmin(product0[i], tMaxs[i]);
tMaxs[i] = bbox4.corners[far.x][i] - originX[i];
tMaxs[i] = tMaxs[i] * rdirX[i];
tMaxs[i] = fmin(rtMax[i], tMaxs[i]);
tMaxs[i] = fmin(product0[i], tMaxs[i]);
hits[i] = tMins[i] <= tMaxs[i];
}
}
</code></pre></div></div>
<div class="codecaption">Listing 9: Compact scalar version written to be easily auto-vectorized.</div>
<p>How well is Apple Clang v12.0.5 able to auto-vectorize the implementation in Listing 9?
Well, looking at the output assembly <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGIAMwapK4AMngMmAByPgBGmMQSAOxcpAAOqAqETgwe3r4BQemZjgJhEdEscQlcybaY9iUMQgRMxAS5Pn6BdQ3Zza0EZVGx8UkpCi1tHfndEwNDFVVjAJS2qF7EyOwc5v54VADUWDQR6BAA%2BueqABwAbOe3kpfLJhoAgrvhyN5YByb%2BblULBY4QIxHCADoEP9sK8PmZ/PV9kdMCdMGdLkxWsgEI9nnDPgxvl5fv83K0WOcIgIoTCCQjXPt6XsiT9MH8ASwmAQELT/NgDn93nCJsQvA4DgAxABqmGQkj%2BiSs70Fgq8DGyiosgoA9DqDop2erNfxiAcCABPVLs1LqjVGA6XFhcMzXA6GdAHKi0VDc/xmVRPAjM5HHcLoi5XO4PJ7nF4q1WJp0ut3O13/ZXwxG0UOo8MY85Yza42Pxt6JxPe30Ef2B84EL21yQZ5mMqhwiuq0XihsmJUdzsVqvcg6qFsJweq4cNi3j8uTqc%2BkcALznC8X1YOAHc14O%2BwARXed6dHblMEwAVgszYvh/8mcTB7nA8FMrlkggywOIFHEA0EI0KhVgOC0/wAoDSAOZcwMA4CtxgoCtSfYUEVzNECxue48TjF8pVleVIzTN0qEkL8fyI6hSKQxJDxQ7M0PzSMixxbCy0TN8COnJt6y9KifyoJtKK/PsrBo1sGHwdthQnPD32oJcG1Uc5IJPC1lK9BSoPUk8tzjb9fyU4DQLU4DoOXONIPg3ThP7MSZI4j8TyUlTNLUlzN3MsiDIskCIBMyCzJ8%2BD/1g6jaI%2BGST1Qa1iG5EhLysW8IFBA48C/NAGAmLUDmITACA2BhTxaBK8EvO9tWQ%2BdBWncxbgOaL4ji4gErK5LBFSmztVy/LiEK9AzxKsqM0VcLcIc%2BqYqa6wIAyrKHNqg4Yk6nK8oK2SCNUP5LEWiFVEgi0tu1GIIQtALDp25dLPO46txee8RrG/CFQa2KiGIABaGaBDmp6FqW7LurWhyIE296dr2kCDjB47TqgqGLqu6GIVu4bKvYp6Jsat6ACovsyhsarMOr/pElaesK4HNuxxb9oOamYjO%2Bmrvpu6Krsqr1ueya3p1PGssJ4nlsB3rOZBg59QZyGJbO6Xt3FxbWYe4UxPvAl3nCHMIk5zHXpIXnZoJhSFqYSCDc5v7Otw4WKaeiAmHl46IftiWTsg52EfdB3kdZgl2ZFMEewOABJDGRNw40BCQmTE27CUw%2BjwcUrHVWE87FLZxTjnJxS1dM/XQUUp3PO9xVh9E/a/riqvG9ytwp9M9wkO5K81Q/yMtuAo77c27CxvbaT9T08H9rzMgwu9J/VvDP2vyfMCuCIGs3uZKbgiB7H9q3NSkeJ%2B89v/KgiBPMsnuw79mSUpeqar1alK0oOM3SetorzyvUrb1R9nE1BBar7elqkp3yFqtEWldX4WHfuVJWHxS5qzeLHBsFgLCoE2vHDmEdCpoMnCeNAvV4gKASrcIa91Bx6m3lgVQ6J3QKD%2BFeEEDAAAaBx6EAE1mHhAAFrMKYKoJhXJVBsP4RwoaqcNKblwREYgCg3i0GDFeMwQ0rz%2BGIdqcW%2BpwgUKoUwGhEj4gJXoQAeWIAAWR4YoiwDCWHCI/jJeumZcJIJQXzBs80ibsJyCQSRptvouN%2Bm4/hHg8HEEtqI3RUiZFyIsBocx0Tbwcn3F6ehEB6GBMkbtSCATPHxF2j7UJWTwmyISrEw8V4uBlXiYk8IyTwipOybDTJQSTq5KzqqMJ0jClXmKQlBRcT/gJKoEklJ%2BSISXW4aoWpxARnNMnG0iJCUykf06eUvpiSeHVI8Y0iGDS0mqGmYOWZHSLALJKUc5Z/h%2Bn8PWRM12YzrkWj2Z2A5kTjndLORctZQzNkZJ4dc3ZR5oGPXfO4iZn4H4%2BIBiAm2ck2lFLKt4oJBDSlwrBQi15t5FaVUBfKW5%2BTQWPyVGTIGtsYVKORSS68ZL8mIosBeMqGK/awOkvAgOEpHGBijugjUkcsGDnGuSohH8SH5zISQLACR3EMIyeEFhUqGAcO%2BbwhVMqxnCNETgqlkojbyPMcoxZFLBWqLIRozAlDPTaJRZI/R4QjGmLHHqyx1iTkxBiCgxYeibEtOqppNpmrqxzKvAKk5Nd/l2Lgd/Bgmt2TJldNTT5kihBCGwKCp%2BkKtq3HJcUiERFP7hXDZGx05wiKxp%2BfkhNSbgHkzTeS3V%2B4s0phzXAz4hwwynCYtiEs%2BIL4RvDGIv0AYgyxpqfkyImABDJoJc/WqGayoQgEv2htXb81cX7fWYt4zh2joYOOrqqap1UoSjW2dTYF1ZjbGG1UGse0ADdUB4E9EoAgbK2oNhdaoCo3j8bm38UOoJH6frvgWts%2BIITPUWvwb67ksK9VdKvK%2BiobzKlbrjdkrZJbNkPIrD6rVUTzEvNg66nwCGBlVOQ5M%2BpaG0n3P%2BYmLD1YoMnJ6U6gjLAiODJ/Wk0ZQHJnLgwzRjV2G8M4b1XBwjvTzmrNbqR9JOL0PUdafxujSK9WCZEyxsT7zJPsbqQqu5vH5MIog888xjGEqqdYx8rTkzUPrtk8XQUmKZJso/GbNlxs/2IOQWONxksXOeYWsgdzBxXNuPQCBycD6n1BHdFmyzn43ZZoo/ET8enBQRc8xAFIO1SNxay4l4gyW5Opbyk%2BswpsYsbMkTl5ACWbOVeWClg4aWnH%2BEgugcrILgJta4wVuzALGURWZWKCUAAlJgB0eWvgxvgXKDhsj/PGiQPAwBwj/JPAQYxK3etrdtc%2BGSo3QJm1cXVabcpGjqUO34uqi3lsMG0ppdb4Q7ubnWzwnCoifwndm2Oz7Z3gLXaqf927wEHtbpBz5F7rcId6TPqNfrcIyEeBYKkJgEoFCiAMGaK9%2BDNRbnBAQAgjBzSoEWuyTA2i6AHSYF4IgWOHCLeXOiOEN6705TG2yoOgh8GnckG8anqB8JvTwAz5xBx9sLViqdXC%2BdpcLl8yg5s3nX3SClzL1XiZV4K7qggQgChSAq7V2ro75oNuZT16Ig30ujcQ4UGFjcI5iDTYPQh0mXBwLywlxCH72RpOu8Au7sbnu8AzcaDc33hx9Qe69zSZcD0tuaQd0HhhCUa4VKfo7pZt5IIJ%2Bahn/cWf09CbzzlAvxTY9ly9ZubPLDk/O4nQX45%2Beg/zORdn5vmfi9N6Uwk0NkV4/Tesfq7v4m0%2Bd4sIxxvOex8t4L%2BPjvk/GNl9wlFcEN2k/V1r11APgPpMe%2B3xDXfK/IT7634fhgu1F%2B983ID6v6/1MAxP0tyEsMD%2BP7P8/h
/N2bkv8/%2BN/rlZNKA4D4p4rJPwf6QijLf7gFZ5gFn4QEwEjIX4cwnhtAm41536gEWgQgg7QGYHYGs64Em44FYEm6IH/6V4Q5oFD7d4ToB4Q5EF0H4HEE8L0G2qIF9xAoRCtDPoQCt654ciwjiYhSHD/CSgHAaD6T%2BDASgg8H15nICEJJCEciiFcD6SkRm6gblwEAyGj4L50iCFu4iEHBmD6QXj1b/Ia5ehcHSG8GF78ErKKGGH%2BD6QaBSGCDaGT7HJ2H6F%2B6GEKg/hcCrD66aHuFopD7yFiEGH%2BCiEXj6RmBmENyX4jipDEAYA9ixKD67ZZjJFMDABcgPwGAOg%2BjRQHC06C7C6MBMAxD0BsRTgkAHDPqpQVLRL3R4AchuAHDNgtHWDWC26djJGpEODpGQIVLOry4QjkqcFkbmLDFgzX6DQGpBEg7UrDErKjGBjjH7pXiTEjLTHlKzGn4D6QKFYHD9HoBpHzFUEnEpFnGDEXF0xz434QIqKLEm7LEIZLF3HUzZ6HHPGiKnHnFvzmatz/G3GAnt4fFgn7gNYQlPF35rGSAbGopbFk5Wa7FxL7Gv5r6wm1yiIwkrHiZ4nlJfHTZYlHG9aJiEl34DJrIoEraQmQSUlQnHGMkVLUnAnXEAnYkMmvEXHNJBEgkEBDEIbwmImSLUpUCtAnRokJIYk3aPFknl4VjW53GrFK6in4IJQSncbSnwyAEXHHEClCl36GmfEPH6nkmqjKmQkVJWnYn3HfHmmKmJgmnWkrLEZboulcnG48JvHorMm2q%2BmXEinkpam7Q6mynhCkm/Gga2n4kJKxlElz5RkLG4kBkqnibuk8EUH0nemqCBnQlpmukZlJKemQLck%2Bm8mZGTja4ECBk2k8lFluArIJkpkcyYo0QcCrC0CcAXi8B%2BAcBaCkCoCcBNmWDWCNbrCbDsi7A8CkAECaCdmrAADWIAF4QQ3ZHAkgfZC5Q5nAvACgIAQQ85A5nZpAcAsASAaASOdA8Q5AlAV5qQN5CQ3whgwAXAXAGgKQNAsi%2BClAMQO5MQ4QrQFonAs5V5bAggBiEaIFJ5pAWAXIRg4gsFXuWOB5sFJqco1O2ws5oI9QO5OYMQsUxAFoHgWAO5YIeALAoFp53oORCg0oeAmAW4Bi1o/Zs5/AggIgYg7AUgMggghoagO5ugKQBRxg3Rlg%2BgeAMQB5kAqw0UjQaF70Bi/g%2B59Qp22QLgEk0wfgCIwQEkCwIwCQgQiQaQGQWQAg2lIAulRQ5lDABllQowxlPQ6lAg/QUwngnQVlLWdgLlTQkwgw4YbqRlGgJlcw7leQOl3l/l9lSwxlqwCgk5WwegYImA2Fp5G5vZpA/Zg5w5HAmE70jw%2BRr5Bw75AEEIKhEAo5VgElBwuAhAdRM5kEiOj59AZoM5ywvAx5Wg9WpAK5a5%2BgnAW5WVO5uV%2B5h5c5C5PVG5Zg25sFo1E1J5PVWOUiGlkgQAA%3D%3D">on x86-64</a> and <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxCAAnGakAA6oCoRODB7evnrJqY4CQSHhLFEx8baY9vkMQgRMxASZPn5cFVXptfUEhWGR0XEJCnUNTdmtQ109xaUDAJS2qF7EyOwc5gDMeFQA1Fg0IegQAPpHqgAcAGxHF5InsyYaAIIbwcjeWNsm626qLCzBBGIwQAdAgvtgHs8zOtKltdph9phDicmPVkAgbndIS8GG8vB8vm56iwjiEBKDwdjoa4tlTNrj3phPt8WEwCAgKetsNtPk9IUNiF4HNsAGIANUwyEknwA7FYnjyeV4GOlZRYeQB6DXbRRM5Wq/jEbYEACeiSZiWVKqM2xOLC4ZjO20M6G2VFoqDZ6zMqluBDpcL2wSRx1Ol2utyO9wVitjdodTvtjq%2B8qhMNogYRweRR1Rywxkejj1jsfdnoI3t9RwIbsrkhTdJpVEhJcVAqFNZMcpbrZLZbZ21UDZjvcV/ZrJuHxdHY49A4AXlOZ7Py9sAO5L3tdgAim9b492bKYJgArBZ6yfd%2BtU7Gd1OezzxZLJBBZtsQIOIBpgRoqPNtiaX4/n%2BpDbPOQG/v%2Ba4QX%2Bap3ny0KZoiObnFcmJRg%2BooSlKoZJk6VCSG%2BH54dQhFwTKu4IemSHZqGebouhRaxk%2BOHjnW1ZumRH5UHWpFvl2VgUY2DD4M2fIjlhz7UHONaqEcoEHia8lujJYHKQea5Ru%2Bn5yf%2BgFKf%2B4HzlGoHQZp/HdkJEksS%2BB5yQpqlKQ5q7GUROkmQBEAGaBRkedB36QeRlHPBJB6oOaxBsiQp5WJeEAAtseBvmgDBDGq2zEJgBBLAwh51DFeCnle6rwdOPLjuYFzbOF0RRcQMVFfFgiJRZ6qZdlxC5egR4FUVKaysFmE2dVEV1dYEApWlNmVdsEStRlWU5ZJOGqJ8lizcCqigSaa3qhEwImj5u0bfOpnHfta73NeA1Ddh0o1ZFRDEAAtBNAhTXdM1zel7VLTZECrc9G1bQB2xA/th1gWDJ1neDwKXf1pXMXdI21U9ABUb2pTWFVmFV30CQtHW5f9q3o7N23bOTERHdTZ3U1dJVWWVy33aNT0aljaW4/j82/Z1rMA9s2o06DItHeL67C7NjM3XyQnXtiTzBBmISs6jj0kJzk04zJM1MKBOus19rWYfzJN3RATDS/tIPWyLB2gfbMPOjb8OM9izP8oCHbbAAkijAmYfqAhwRJsbtsKQfh72CVDorMetglk4Jyzo4JYuqfLjyCUblnW4KzesfNd1%2BVnhexWYXeqeYQHUluaoX56U3Pkt%2BuTdBbXltx8pye981xmgbnWkfo3unbV5Hm%2BVBEDmZ3El1zhPdD81TmJQPI/uc33lgRArmmR3QdexJCUPWNZ6NQlSXbEbhPm3lx5noVl6I8zsYAjNZ9PQ1cVX3zi0C1Lo/Cwz9ipy2eIXJWjxI41gsBYVAq1o4sxDrlJBo4DxoE6tEBQMULh9Wur2LU68sCqCRM6BQnwzz/AYAADW2NQgAmvQ4IAAtehTBVB0NZKoJh3CWF9UTipVcmCQjEAUI8Wg/ozxmD6medY%2BD1TC21MEEhZCmAUJEdEGK1CADyxAACyHDZEWBoQw/hL8JLV1TJhOBCCuY1mmnjZhGQSCiMNu9Bxn0nHcI8Fg4gptBGaLERIqRFgNDGPCZeZk243TUIgNQ3xojNqgR8a46Im0PaBLScEyRMVIm7jPFwIq0TYnBHicERJ6TIapL8QdTJadFRBPEbks8%2BSYoyKiV8GJVA4kJOycCU67DVCVOIAM%2Bpo4mkhJikUl%2BrTildNiRw8pLjakgxqUk1Q4zeyTJaRYGZBS9nz
PWN07hyyRmOyGeck0WzWw7NCfs9pRyTlLL6aslJHDzmbL3OA26z5nEjNfDfDxP0AEWykk0vJRV3F%2BJwYUqFQKYWPMvLLUqvypSXOyYC2%2BcoiZ/UthCuR8KCXniJdk2FFgTxFRRV7SB4loE%2B2FLY30YdkEqlDmg3sw1iV4JfgQ7ORCSBYBiM4mhKTggMLFQwFh7zOEyolUM/hgiMFkpFHraRxj5GzJJbyxRRCVGYFIa6dRCLRHaOCHowxQ4tWmPMQciIEQEHTC0RYhp5VVJNNVeWKZZ4eUHIrt8qxUCXg7CDAcOiaICxYhPgwVWTI2I%2Bj9OTV5ojQiYAEICu%2BoK1oXGJfk4EPEE2v2Cu/GNwYhFegTdWJNHzsmpvTf/Ym2biWau3PmusRbhKiSgSW2N2wABuqA8CuiUAQJlTUawOtUMUdx2NjbeIqdkmdH1nwzXWdEAJrqTXYM9WySFWq2lnkncUJ5pSGBnP6Wsmtqybklg9WqsJxiHmHsdT4E9PSynJqqTKq5N7Yx3vLHug5HS7UvpYG%2B3pC7amDLXaM%2Bcv7GkqvvU%2Bh9Wqj2vs6ccxZjdP2jMvcMi98GeT/t3XCrVyG0NgYw887DkGknVKvXRwjW6xE7vucY4DMUKPgZebR9JeHPk3tRRJJlL4jZMv1ku2B8ChxONFmJ6TM1kCSe2OJpx6AN2jhHWOjQTtgQ4dfLpmDr4mNaekxAVoG19P/n2kZ2YJmspjoSDfPTvHiAGec7Z%2Bzo6zPrFAugFzKzRHuf855gNNLK50pgdsAASkwHaHLHwo3wJlBw6RvnDRIHgYAwRvkHgIPonL%2Bc3Wrny0YmuElYuASNo4qqyXJTVGUtVrxVVMvZYYOpVS%2BXggdZK5ajCgiPx1dS%2BmobDX/ytbKRN9r/4utntmx5UrjdFtaSPoNWlIVHhEI8CwRITBhQKFEAYI0fbsGqjXECAgBBGDGlQLNJkmB1F0B2kwLwRATsOEy/OJEkIB1DoynFplftBDYPq5IR4r3UDYSengL79iYtxZmpFQ6mFs6o5nPJhB9ZZOTukCjtH%2BPYyLyx1VBAhAFCkDxwTgnNXjQFdShTwRVPUc08WwoDTK4BzEGSzFFtJTCZcGAtLJHwJRvpGSdsAXv4hdxZF3gFL1QLmS52NqYXovyTzhukV8tNYudy5oTFCufOcW6/qnMy8oETeAYt9zs325rdy8A5rouxXOfJYYQbk9d8beHPNxlb3%2Bz7em593bv3DvSMxMDaFVSJvzHaoj5hr3YeLDAcD0ikPlv1W%2B4z8n%2BZ63SyqSm/r8unvjcy6m%2BL4X5eQaV6BG1ivZfa8gkQXnjnNYpvu%2BL1Rn6DessgkhjX3vDALkD7r/3nvo%2BneYTCo3qVHuu933HyCQZI%2Bl8W8X0P5f6%2BBmT6j6uBodO5/x4j6Xk0wJZtr9P%2Bf/7l%2B6cX7P3TnfLMDz76MZ3o/3eb8cLv4t7/lrf8cMfy7j%2BRCHqHHQgGz3yWZAhEwwCh2C%2BBFG2A0G0nWH/ABHAP9yOWgJiVgOZAQK4G0kIgZ03WLgIHQKT2AygIWRwPgO2DMG0hPDs2%2BSJzdFALQIgMwKoMFxoPWG0g0FQMEDIKD32UoJgK4PWAQOlA/C4HmEpxIMELTxEOwLEIQJPG0jMEYPKyf1UkSGIAwA7EiTj3vCoh0KYGAFZBvgMBtA9HCn7Xq0%2B0wAgEYCYAiHoCYjHBIG2HHUShKXCWujwGZDcG2HrD8OsGsHZ1bB0L0IcAMNARKXtUx2BGJRANGUnC1ViKBnb16h1VkNm3JViIWXiN9ESLJRimSIGWMXSOqhn1j1AW%2BVjEiPQH0KyPfwaKaKfmKXJhNw7xAQURyLpzyJPVyOaKplD2IBqN6MEVaOiOGIWR6SWSmIIBiPhSGPaORTqMVBWJ6K70KMkGKMRTPDKKtQOUqML2aPWJ5E2PyMw0uI6NGKLy2Ii03RuK7zmMbn3xy1WJD2eO3CY2%2BJKVeIgAWKWN92%2BPqVkKBJmMwx2L2NEXJSoHqAOgqOKQyJn26NqK11jFZ0hJiWhOJXhNgyRKiRRMH3GOyMmN0MaOmM%2BJKQhOpM6LdzOIxI2MtQGK7yxLpNGNJMeNHFpIeP%2BLiV5NAVAnZIeN%2BJZOxNmhxxhOwRinxM2kJJiWJLa3uPROdxLBFKuJiQ1NuJNxVImKePFOpNmLiRfytQeOFMNNFPONpw4VZPf3fTPUFOWMtNATBMEVJwIDtJKT%2BMJAWW1LJJZlRQog4HmFoE4BPF4D8A4C0FIFQE4DcFCPWgUEWGWCZA2B4FIAIE0BDPmAAGsQATwdMwyOBJBIzszYzOBeAFAQAdMszoyQzSA4BYAkA0Ads6BohyBKBWzEh2yYhiQ%2B0zhno3hDBgAuAuANBWgaBJFsFKAIhyyIhgh6gTROAMzWy2BBAdEY1lz6zSAsBWQjBxAdzRcTtqydyDVJRXtVgMyARKhyyMwIhIpiATQPAsByzAQ8AWAVyGz3RTCFAxQ8BMA1wdFzQoyMz%2BBBARAxB2ApAZBBBdQ1ByzdBWhLDjBEybB7zqzIB5hwpqhTznodF1gqzKh6t0gXARJRgWhSBAhgwnUYhWhcg0gBAKKcgUhGKGApg%2Bg6K2gSKBBOgRhPBmg9A7AeKahhhugaLOKhKxLmLxgxKOKSh%2BhpCFglgVg9BARMAryGziyIzSAoyYy4yOB%2BzBzrZhybQxyfxgR8CIAEzLBrBQJcBCAPD0zQJtsez6AjR0zZheA6ytA7NSB8zCz9BOBSzdLyyDKqyazMzsy/LiyzAyydzwqor6y/KTsxFSLJAgA%3D%3D%3D">on arm64</a>… the result is disappointing.
Much like with the compact scalar implementation, the compiler is in fact emitting nice long sequences of vector intrinsics and vector registers… but the loop is still getting unrolled into four repeated blocks of code where only one lane is utilized per unrolled block, as opposed to producing a single block of code where all four lanes are utilized together.
The difference is especially apparent when compared with the hand-vectorized <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxBJmAOykAA6oCoRODB7evnrJqY4CQSHhLFExXPG2mPb5DEIETMQEmT5%2BXJXV6XUNBIVhkdGxCQr1jc3ZbcPdvcWlgwCUtqhexMjsHOYAzHhUANRYNCHoEAD6x6oAHABsx5eSp3MmGgCCm8HI3lg7JhtuqiwswQIxGCADoEN9sI8XmYNlVtntMAdMEdTkwGsgELd7lDXgx3l5Pt83A0WMcQgIwRCcTDXNtqVs8R9MF8fiwmAQEJSNtgdl9nlDhsQvA4dgAxABqmGQki%2BcSsz15vK8DHSsosvIA9BqdopmcrVfxiDsCABPRLMxLKlVGHanFjlc47QzoHZUWiodkbMyqO4Eenw/bBZEnM5XG53Y4PBWKmN2h07e1mc7feXQ2G0AOIoMo45olaYiNRp4xmNuj0EL0%2B44EV2VyQp%2Bm0qhQkuKwXCmsmOUt1slsvsnaqBvR3uK/s1k3D4ujsfugcALynM9n5Z2AHcl72uwARTet8d7dlMEwAVgs9ZPu42qZjO6nPd5Eqlkggcx2IEHEA0II0VAWOxNL8fz/UgdnnIDf3/NcIL/NU735GFMyRHMLmuLFIwfMVJWlENE0dKhJDfD88OoQi4LiXcEPTJDsxDPMMXQosYyfHDxzratXTIj8qDrUi3y7KwKMbBh8GbfkRyw59qDnGtVGOUCDxNeTXRksDlIPNdI3fT85P/QClP/cD50jUDoM0/juyEiSWJfA85IU1SlIc1djKInSTIAiADNAoyPOg79IPIyiXgkg9UHNYh2RIU8rEvCBAR2PA3zQBhhjVHZiEwAhlgYQ96hivBTyvdV4OnXlx3MS4dnC6IouIGKiviwREos9VMuy4hcvQI8CqKlNZWCzCbOqiK6usCAUrSmzKp2CJWoyrKcsknDVC%2BSxZpBVRQJNNb1QiEETR83aNvnUzjv2tcHmvAahuwmUasiohiAAWgmgQpruma5vS9qlpsiBVuejatoAnYgf2w6wLBk6zvBkFLv60rmLukbaqegAqN7UprCqzCq76BIWjrcv%2B1b0dm7adnJiIjups7qaukqrLK5b7tGp6NSxtLcfx%2Bbfs61mAZ2bUadBkWjvF9dhdmxmbv5ITrxxZ5ggzEJWdRx6SE5yacZkmamFAnXWa%2B1rMP5km7ogJhpf2kHrZFg7QPtmGnRt%2BHGZxZmBSBDsdgASRRgTMP1AQ4IkmN2xFIPw97BKh0VmPWwSycE5Z0cEsXVPl15BKNyzrcFZvWPmu6/KzwvYrMLvVPMIDqS3NUL89KbnyW/XJugtry24%2BU5Pe%2Ba4zQNzrSP0b3Ttq8jzfKgiBzM7iS65wnuh%2BapzEoHkf3Ob7ywIgVzTI7oOvYkhKHrGs9GoSpKdiNwnzby48z0Ky9EeZmNARms%2BnoauKr75xaBal0fhYZ%2BxU5YvELkrJ4kcawWAsKgVa0cWYh1ykg0cB40CdWiAoGKlw%2BrXV7FqdeWBVDIidAoL4Z4AQMAABoJmCAATXoQwAAWgmJgqg6FslUEw7hLC%2BqJxUquTBIRiAKCeLQP0Z4zB9TPBsfB6phbamCCQshTAKEiOiDFahAB5YgABZDhsiLA0IYfwl%2BElq6pkwnAhBXMazTTxswjwWCyA33eg4z6TjuEuNEabQRmixESKkRYDQxiwmXhZNuV01CIDUN8dETaoEfEkFEZtD2ATUnYOCTFCJu4zxcCKlEmJwQ4nBAScQR27DVAVIOhktOipAniMkbk4xMjInfGiVQWJ8SsmVNOtU2p856mjiaTkgp4SimdJiRwspGQ%2BlJMGQs1QIzexjJaRMl%2BmzonTO6bM3priqkpMOSaVZrZ1khMKVsiw7SdkbC6dwuZtSQbHLSSsvc4DbrPmcX0187jsY/QARbKSTTWmXkNn0nB2yIWuKhTcoqstSpfOlEs1xfzb5yiJn9S2oK5FFRhaIuFFcCXYJiieBFr9BqQPEtAn2IpbE%2BjDsglUoc0G9mGriiweCX4EOzkQkgWAYjMJockxhorWHJI4SK6pDDJWqH4YIjBkLRR62kcY%2BR1yK79SUcQzApCXTqPca47RwQ9GGKHNc0x5j8kWAiBEBBMwtEWIaeVVSTSVXlnGVy4xWr868isVA9%2BDBVbMjjEmcmBzRFCCENgP5d8gVrUuJyvJII8KUswirIMtpjh4QjRwip0bY3/2JomzlGrtypodOmmlrxdiBkOHRdEBZsQn2DVmti3pfQRvKX00ImABBxsxffSqyaiogh4p26tLNM1qw7VWAgeaam9v7QwQdbUE0jshTFct466xTs2E2QNioZ3MgAG6oDwC6JQBAGVNRrPa1QxQIUAscVVSN0Rn0fWfDNV50R/EuqNYSj17IwU2ryTFB9xQpn3JKau99lSXn5uWWcks7rVWhOMVcm1kGfDQYeaU%2BDRykMnJQzGND5ZQMxVuRBh1uGOkwe6QRnthyBm/v6aRxpyr0NYco2eHDLA8MzMboRxDS7DnvL9ahrjFHtkxR43x2jAn6MPP2cxtJkM2N1I%2BWR6TIHZNquufxwTezhNqcSaJ55KGkUSQZS%2BI2DL9aftgfAocTjRb2ZczNZATmdgOacegf9o5r23o0E7VNZniCvjC2x18HHeTBZcxANoG14NRZS8R0RsXtNtiyreswhtwvzLRf%2BZAqaMvRCy5JiOuXEsbFAugQrFS0sNZi3MKzXtqUhVpUKEUAAlJgO02WPhRvgTKDh0gfOGiQPAwBggfIPAQfRc3JMLfNfeCS/XAJG1fXsPAY2ajKW214qq03ZsMHUqpRbwQLurkWxwjCgiPyjalDUCAz3xsCA8qd0p33zv/iu6ugHHk7uNxB1pI%2BVLK40qIYWnYp7sGqlQLsDkzI0AsESEwEUAB1OgGZWAUKyk6WgIIdhmA0BoE8iV0f0DYIIdkE3njnsvRlAbDK/aCGwS9yQhb7E7E2zNSKh1MLZ2zh5hB9Y3MPukMLkXM5F4S6qggQgChSAy9l%2By47xolupVV4I9XyNv1OJBwoQLiphrEFGycf4xxr1cGOIkBQSXgJvm1ILkE72aiVqTChjU5MvBKFtNbhQCAvBUDdJge3Kv1xK4xMaSKqUDAEEwBQogrtUhGHoOvGBqo1xK8zzm0ktvI/r3PbQeHFDjyJ30To8U0bjrW2D6H8PVMNTIplBbvbNCreF5D2H%2BgkeIAd8qXhUCQ%2BvfnFAscfR%2BjjhCAABJyFFKKAIsbQs7DXxvtrKHzejYYd3m3vfw8D7HyPjKo3x%2BT%2Bn
7PhfS%2BV9JdAslx/W/JsoyHyw/fje%2B8R4d4P8/p%2BT8OiX4z7z6L7L6xr5ak6gQQFmDP6SZTbAhnZd4F4H5N794/5u6/YX4s4mggiYGn5T7AE35gFfigQb6kGwFFzDbfK/Z77IGf5H7oEDa4EIGgj/5MF4GAG2hX4gG36xqP4P4P4UFt7VQsGsIf6H5oGO4YGiFYHSEzasGcEEHX6gF37QFQFQFCE0oxjy47AhANB3q/57abQsiQgwYBS7DfCijr7aQbD/iAiGGVKDbcjTLmEshWFcDaSES64AbFwEAOEgjzgmEuHARuGk7aQnhtYfI6FUD6H2Fj6ILOFmEhGWE7AbDaQaB2GCD%2BFOGmHRKuEpEygfhcALBq6%2BH%2BGBFUhJG/ihGU4fgwH1KYQA4ULTL/TIHcID6lFtH3Y/50FZR26ME4EA7/gQB2ri4gicp6EIZjp4TQy/Y0Lj5vjkxD7zF4QlGCJdFyQ/4jFS7jFbpniTEHTTHxhAw0ELFUxn57YMIX6lHZzbFjETGYANABFHFJizGiEsJnFLGjYfGrGRGSbG7FKtHW7UIdHrHAnXY9FB59HH5MEg7DGjE%2Bi7GwoxQxFTGXjj5vHyG0KfEXHEArEOhrEAZtEQmO53GImcqomHHokzEnGiFXGrHnFD70mAE3HLhkmSBImEoolPGLjUnHEiFYk/EEmMnfELF/HWKto1hK6dgwZQAeFEhEg7BkRAzFGJqB6kgsCoDw5sgKAADWA%2ByByA1O3%2BjuTRWBxuYpHy0pcKeSxS9qngEA0papSWLIbgipTc4pmE1pcmgm9ptAjphAzp8pPw7pxRnpEk3pBmdy0SfpAZnYTiLpCpPwOwEA9RKGkZFg5adpqADpTpM0iZIZyZEAth4ZECwUHACwtAnAJ4vAfgHAWgpAqAnAbplg1gOwCgSwKwzImwPApABAmgFZCwupIAJ4oWVZHAkgtZA5jZnAvACgIAoW/Z9ZFZpAcAsASAaOiQdAH6FAb06O25MQ7whgwAXAXAGgbQNAki2ClAEQ05EQwQDQJonAvZaOtOBAOiwaT5y5pAWAbIGeawDZHueA5e05eqUoXgSez5vAgIVQ05GYEQkUxAJoHgWA05QIeALAUFCwboTAwACg4oeAmAa4Oi5odZvZ/AggIgYg7AUgMggguoag05ugbQBgRgKA1g1g%2BgeAEQ85kACw4UNQ85HAz0OiGwc5VQL26QLgIkYwrQpAgQQYjqZQSQKQaQAgslOQqlNQ0w/QyldgklAgXQowngLQeg%2BlH2tQIwPQilulZlVlGlEwVlOlJQAwxRiwywqwegQImAawPAlZ1ZU535TZHAqEz0twN8rFwAOwp5P4IIHhEALZVglgoEuAhAJAa0GwyWHg%2B59ARoPZcwvAS5WgbWpAw5o5%2BgnAk5pAdZDZwVc5C5fZA5JV45ZggVtVs5jVy5JV8OYiUlkgQAA%3D">SSE compiled output</a> and the hand-vectorized <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxCAAnGakAA6oCoRODB7evnrJqY4CQSHhLFEx8baY9vkMQgRMxASZPn5cFVXptfUEhWGR0XEJCnUNTdmtQ109xaUDAJS2qF7EyOwc5gDMeFQA1Fg0IegQAPpHqgAcAGxHF5InsyYaAIIbwcjeWNsm626qLCzBBGIwQAdAgvtgHs8zOtKltdph9phDicmPVkAgbndIS8GG8vB8vm56iwjiEBKDwdjoa4tlTNrj3phPt8WEwCAgKetsNtPk9IUNiF4HNsAGIANUwyEknwA7FYnjyeV4GOlZRYeQB6DXbRRM5Wq/jEbYEACeiSZiWVKqM2xOLC4ZjO20M6G2VFoqDZ6zMqluBDpcL2wSRx1Ol2utyO9wVitjdodTvtjq%2B8qhMNogYRweRR1Rywxkejj1jsfdnoI3t9RwIbsrkhTdJpVEhJcVAqFNZMcpbrZLZbZ21UDZjvcV/ZrJuHxdHY49A4AXlOZ7Py9sAO5L3tdgAim9b492bKYJgArBZ6yfd%2BtU7Gd1OezzxZLJBBZtsQIOIBpgRoqPNtiaX4/n%2BpDbPOQG/v%2Ba4QX%2Bap3ny0KZoiObnFcmJRg%2BooSlKoZJk6VCSG%2BH54dQhFwTKu4IemSHZqGebouhRaxk%2BOHjnW1ZumRH5UHWpFvl2VgUY2DD4M2fIjlhz7UHONaqEcoEHia8lujJYHKQea5Ru%2Bn5yf%2BgFKf%2B4HzlGoHQZp/HdkJEksS%2BB5yQpqlKQ5q7GUROkmQBEAGaBRkedB36QeRlHPBJB6oOaxBsiQp5WJeEAAtseBvmgDBDGq2zEJgBBLAwh51DFeCnle6rwdOPLjuYFzbOF0RRcQMVFfFgiJRZ6qZdlxC5egR4FUVKaysFmE2dVEV1dYEApWlNmVdsEStRlWU5ZJOGqJ8lizcCqigSaa3qhEwImj5u0bfOpnHfta73NeA1Ddh0o1ZFRDEAAtBNAhTXdM1zel7VLTZECrc9G1bQB2xA/th1gWDJ1neDwKXf1pXMXdI21U9ABUb2pTWFVmFV30CQtHW5f9q3o7N23bOTERHdTZ3U1dJVWWVy33aNT0aljaW4/j82/Z1rMA9s2o06DItHeL67C7NjM3XyQnXtiTzBBmISs6jj0kJzk04zJM1MKBOus19rWYfzJN3RATDS/tIPWyLB2gfbMPOjb8OM9izP8oCHbbAAkijAmYfqAhwRJsbtsKQfh72CVDorMetglk4Jyzo4JYuqfLjyCUblnW4KzesfNd1%2BVnhexWYXeqeYQHUluaoX56U3Pkt%2BuTdBbXltx8pye981xmgbnWkfo3unbV5Hm%2BVBEDmZ3El1zhPdD81TmJQPI/uc33lgRArmmR3QdexJCUPWNZ6NQlSXbEbhPm3lx5noVl6I8zsYAjNZ9PQ1cVX3zi0C1Lo/Cwz9ipy2eIXJWjxI41gsBYVAq1o4sxDrlJBo4DxoE6tEBQMULh9Wur2LU68sCqCRM6BQnwzz/AYAADW2NQgAmvQ4IAAtehTBVB0NZKoJh3CWF9UTipVcmCQjEAUI8Wg/ozxmD6medY%2BD1TC21MEEhZCmAUJEdEGK1CADyxAACyHDZEWBoQw/hL8JLV1TJhOBCCuY1mmnjZhGQSCiMNu9Bxn0nHcI8Fg4gptBGaLERIqRFgNDGPCZeZk243TUIgNQ3xojNqgR8a46Im0PaBLScEyRMVIm7jPFwIq0TYnBHicERJ6TIapL8QdTJadFRBPEbks8%2BSYoyKiV8GJVA4kJOycCU67DVCVOIAM%2Bpo4mkhJikUl%2BrTildNiRw8pLjakgxqUk1Q4zeyTJaRYGZBS9
nzPWN07hyyRmOyGeck0WzWw7NCfs9pRyTlLL6aslJHDzmbL3OA26z5nEjNfDfDxP0AEWykk0vJRV3F%2BJwYUqFQKYWPMvLLUqvypSXOyYC2%2BcoiZ/UthCuR8KCXniJdk2FFgTxFRRV7SB4loE%2B2FLY30YdkEqlDmg3sw1iV4JfgQ7ORCSBYBiM4mhKTggMLFQwFh7zOEyolUM/hgiMFkpFHraRxj5GzJJbyxRRCVGYFIa6dRCLRHaOCHowxQ4tWmPMQciIEQEHTC0RYhp5VVJNNVeWKZZ4eUHIrt8qxUCXg7CDAcOiaICxYhPgwVWTI2I%2Bj9OTV5ojQiYAEICu%2BoK1oXGJfk4EPEE2v2Cu/GNwYhFegTdWJNHzsmpvTf/Ym2biWau3PmusRbhKiSgSW2N2wABuqA8CuiUAQJlTUawOtUMUdx2NjbeIqdkmdH1nwzXWdEAJrqTXYM9WySFWq2lnkncUJ5pSGBnP6Wsmtqybklg9WqsJxiHmHsdT4E9PSynJqqTKq5N7Yx3vLHug5HS7UvpYG%2B3pC7amDLXaM%2Bcv7GkqvvU%2Bh9Wqj2vs6ccxZjdP2jMvcMi98GeT/t3XCrVyG0NgYw887DkGknVKvXRwjW6xE7vucY4DMUKPgZebR9JeHPk3tRRJJlL4jZMv1ku2B8ChxONFmJ6TM1kCSe2OJpx6AN2jhHWOjQTtgQ4dfLpmDr4mNaekxAVoG19P/n2kZ2YJmspjoSDfPTvHiAGec7Z%2Bzo6zPrFAugFzKzRHuf855gNNLK50pgdsAASkwHaHLHwo3wJlBw6RvnDRIHgYAwRvkHgIPonL%2Bc3Wrny0YmuElYuASNo4qqyXJTVGUtVrxVVMvZYYOpVS%2BXggdZK5ajCgiPx1dS%2BmobDX/ytbKRN9r/4utntmx5UrjdFtaSPoNWlIVHgqzLV4AE3oOJkhcSwRI9RMCPBEoYhQABrex5aKyVv9E4g2QLZ3xqrA93mLLYw7cEOxGsaAjvRcwAoOQCUFl9reJgAAjkcAtVtQJzW%2BRMRwyBntpV2/d7YCgEBbE7JhwmOntgWac%2BsG6RXtjfbu2940R2Sl9qx8QaH7WvDejegDoHIPBCgXWFwG998%2B1MHQOgPt0PmdmAgHThAtARcs4IEd0CWOcd2fvMfR4RC625T7dg1UqAdjsiZP947woADqdAMysAoVlZ0tBgTbDMBoDQJ5EpHfoGwQQbI0tPAHUOjKcWmV%2B0ENg%2Brkh1c3cqzNSKh1MLZ2j/JhB9ZZOTukFH6Py5F7x6qtjggChSDJ5T6OGrxoCupRz4IvPJYC%2BLYUBpxUw1iDJfF%2BgLwiRGcw5Z1wYCb5tQR%2BBKN9IbaE03o1OTVIuImShGwDo0IuxUBA4YGADgNYEBME167EfwB6DPWCDA1UUOdv8/oM1Igto/hHCx14Kg7pMBHESNn9cTJtBeDSnUS7mBc968xx6NcGVFgECZAgaITILo2wc4roAIqA2wmATA6It2v2VMGoaK0odeeAxANCDeTeLesOSBuGSuZOteyWDCaBzepIreYuWBdSN6eByBLChBGBLOZBcGFBKMU2qBfajeRB7WmBcWwIU2GSjBfyU2BBrB6BxBnBJo3BQIbW5B6WTBEhrCNBIhdBXBPBDB94C8KMIQ9Q46EAZBiCXICyAUOwXwIo2wGg2k6w/4AI2hyWB0zIEImGBhzIxhXA2khEJem6xcBAVhyBAyth%2BhwEjhtu2kJ4OBRcPIaebomhlhOhvh9h/hRh2wJOH4GgFhggXhoy8WehsRv4AR0oH4PObheeUR1h84MRMSDh8RjuH4ZgIR3aios2FCXSuegsfa3C0OsOLRHCbRLOQh7BJB2hXBs28wTRZe2cEA9qcewIxKGhuGRU/ezKQMzBcxZE5MWBNCSxQxpeZeHRqgXRYuYxiekxZKMU0xB0sxBa8x1UshDADC6xVMGU%2BBSxBRIxzx%2BxExUxkBsGZxdY0MU2LCtxKxyWfx5xhENREkleJS/0LRwQux4u1CMJPRtBpBAxlqGx7hzx0erxvohxiKZ4VA9Qm0XxCaPxVxaxwJb4AJyBpJdYqJ6Jy4UJDAMJmJkg2Joi5KeJ6RhJFxAh/x9xyBNxwJTxtJLx4xWJxK7JAynJ0oCxVxQJ1JdxWBspA%2BoJG2X26OHEmeJSB2HgR2J2Z26AF2129RjxheHCCg6x3yme5K%2BSJSDqngEAGpM05mzIbghI2wTcypsYlp0yJ6tptA9phA2abpzhhIrp5mdmN6Xp6qVGs0qAdpDpTiTpIZ3wbp1RHpiokZFgLaNpsZfp8ZVUiZ3woZ5hypO4HA8wtAnAJ4vAfgHAWgpAqAnALplg1gmOiwywTIGwPApABAmgZZ8wl2IAJ4OmFZHAkg1ZvZ9ZnAvACgIAOmPZtZZZpAcAsASABudA0Q5AlAa59AMQxIfaZwz0bwhgwAXAXAGgrQNAki2ClAEQE5EQwQ9QJonAXZ/2ruBAOiMaT5C5pAWArIRg4g35vemuM535BqkoXgv%2Bz5vAAIlQE5GYEQkUxAJoHgWAE5gIeALAUF8w7oTAwACgYoeAmAa4Oi5oNZXZ/AggIgYg7AUgMggguoagE5ugrQBgRgKA1g1g%2BgeAEQM5kA8w4U1QIFz0Oi6w05lQ9W6QLgIkowLQpAgQwYTqMQrQuQaQAgMlOQKQqlDAUwfQSlbQElAgnQIwngzQegdgBlNQww3QClulZlVl6l4wVlOlJQ/Q%2BRCgbZKweggImAqwPA5ZlZ4535DZHAe5B51sR5Nop5P4wIzhEATZVglgoEuAhAJAa03OoE2piQ65RonZswvA85WgdmpAA5Q5%2BgnAY5pANZdZwV05s53ZvZRVI5ZggV1VU59VC5RVmuYiklkgQAA%3D">Neon compiled output</a>.</p>
<p>Here are the results of running the auto-vectorized implementation above, compared with the reference compact scalar implementation:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">x86-64:</th>
<th style="text-align: center">x86-64 Speedup:</th>
<th style="text-align: center">arm64:</th>
<th style="text-align: center">arm64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">44.5159 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">41.8187 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">1.0x.</td>
</tr>
<tr>
<td style="text-align: right">Autovectorize:</td>
<td style="text-align: center">34.1398 ns</td>
<td style="text-align: center">1.3069x</td>
<td style="text-align: center">38.1917 ns</td>
<td style="text-align: center">1.0950x</td>
<td style="text-align: center">59.9757 ns</td>
<td style="text-align: center">1.3521x</td>
</tr>
</tbody>
</table>
<p>While the auto-vectorized version certainly is faster than the reference compact scalar implementation, the speedup is far from the 3x to 4x that we’d expect from well vectorized code that was properly utilizing each processor’s vector hardware.
On arm64, the speed boost from auto-vectorization is almost nothing.</p>
<p>So what is going on here?
Why is the compiler failing so badly at auto-vectorizing code that has been explicitly written to be easily vectorizable?
The answer is that the compiler is in fact producing vectorized code, but since the compiler doesn’t have a more complete understanding of what the code is actually trying to do, the compiler can’t set up the data appropriately to really be able to take advantage of vectorization.
Therein lies what is, in my opinion, one of the biggest current drawbacks of relying on auto-vectorization: without a higher-level, more complete understanding of what the program is trying to do overall, the compiler can only do so much.
On top of that, working around the compiler’s limitations requires a deep understanding of how the auto-vectorizer is implemented internally.
Structuring code to auto-vectorize well also requires thinking ahead to what the vectorized output assembly should be, which is not too far from just writing the code using vector intrinsics to begin with.
At least to me, if achieving maximum possible performance is a goal, then all of the above actually amounts to <em>more</em> complexity than just directly writing using vector intrinsics.
However, that isn’t to say that auto-vectorization is completely useless; we still did get a bit of a performance boost!
I think that auto-vectorization is definitely better than nothing, and when it does work, it works well.
But, I also think that auto-vectorization is not a magic bullet perfect solution to writing vectorized code, and when hand-vectorizing is an option, a well-written hand-vectorized implementation has a strong chance of outperforming auto-vectorization.</p>
<p><strong>ISPC Implementation</strong></p>
<p>Another option exists for writing portable vectorized code without having to directly use vector intrinsics: <a href="https://ispc.github.io/">ISPC</a>, which stands for “Intel SPMD Program Compiler”.
The ISPC project was started and initially developed by Matt Pharr after he realized that the reason auto-vectorization tends to work so poorly in practice is because <em>auto-vectorization is not a programming model</em> <a href="https://pharr.org/matt/blog/2018/04/30/ispc-all">[Pharr 2018]</a>.
A programming model both allows programmers to better understand what guarantees the underlying hardware execution model can provide, and also provides better affordances for compilers to rely on for generating assembly code.
ISPC utilizes a programming model known as <a href="https://en.wikipedia.org/wiki/SPMD">SPMD</a>, or single-program-multiple-data.
The SPMD programming model is generally very similar to the <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads">SIMT</a> programming model used on GPUs (in many ways, SPMD can be viewed as a generalization of SIMT): programs are written as a serial program operating over a single data element, and then the serial program is run in a massively parallel fashion over many different data elements.
In other words, the parallelism in a SPMD program is implicit, but unlike in auto-vectorization, the implicit parallelism is also a <em>fundamental</em> component of the programming model.</p>
<p>Mapping to SIMD hardware, writing a program using a SPMD model means that the serial program is written for a single SIMD lane, and the compiler is responsible for multiplexing the serial program across multiple lanes <a href="https://doi.org/10.1109/InPar.2012.6339601">[Pharr and Mark 2012]</a>.
The difference between SPMD-on-SIMD and auto-vectorization is that with SPMD-on-SIMD, the compiler can know much more and rely on much harder guarantees about how the program wants to be run, as enforced by the programming model itself.
ISPC compiles a special variant of the C programming language that has been extended with some vectorization-specific native types and control flow capabilities.
Compared to writing code using vector intrinsics, ISPC programs look a lot more like normal scalar C code, and often can even be compiled as normal scalar C code with little to no modification.
Since the actual transformation to vector assembly is up to the compiler, and since ISPC utilizes LLVM under the hood, programs written for ISPC can be written just once and then compiled to many different LLVM-supported backend targets such as SSE, AVX, Neon, and even CUDA.</p>
<p>Actually writing an ISPC program is, in my opinion, very straightforward; since the language is just C with some additional builtin types and keywords, if you already know how to program in C, you already know most of ISPC.
ISPC provides vector versions of all of the basic types like <code class="language-plaintext highlighter-rouge">float</code> and <code class="language-plaintext highlighter-rouge">int</code>; for example, ISPC’s <code class="language-plaintext highlighter-rouge">float<4></code> in memory corresponds exactly to the <code class="language-plaintext highlighter-rouge">FVec4</code> struct we defined earlier for our test program.
ISPC also adds qualifier keywords like <code class="language-plaintext highlighter-rouge">uniform</code> and <code class="language-plaintext highlighter-rouge">varying</code> that act as optimization hints for the compiler by providing the compiler with guarantees about how memory is used; if you’ve programmed in GLSL or a similar GPU shading language before, you already know how these qualifiers work.
There are a variety of other small extensions and differences, all of which are well covered by the <a href="https://ispc.github.io/ispc.html">ISPC User’s Guide</a>.</p>
<p>The most important extension that ISPC adds to C is the <code class="language-plaintext highlighter-rouge">foreach</code> control flow construct.
Normal loops are still written using <code class="language-plaintext highlighter-rouge">for</code> and <code class="language-plaintext highlighter-rouge">while</code>, but the <code class="language-plaintext highlighter-rouge">foreach</code> loop is really how parallel computation is specified in ISPC.
The inside of a <code class="language-plaintext highlighter-rouge">foreach</code> loop describes what happens on one SIMD lane, and the iterations of the <code class="language-plaintext highlighter-rouge">foreach</code> loop are what get multiplexed onto different SIMD lanes by the compiler.
In other words, the contents of the <code class="language-plaintext highlighter-rouge">foreach</code> loop is roughly analogous to the contents of a GPU shader, and the <code class="language-plaintext highlighter-rouge">foreach</code> loop statement itself is roughly analogous to a kernel launch in the GPU world.</p>
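<p>To make the <code class="language-plaintext highlighter-rouge">uniform</code> and <code class="language-plaintext highlighter-rouge">varying</code> qualifiers and the <code class="language-plaintext highlighter-rouge">foreach</code> construct a bit more concrete before getting to the actual ray-box test, here is a tiny standalone sketch; this example is purely illustrative and is not part of my actual test program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// 'uniform' values are shared by every SIMD lane; 'varying' values hold one
// value per lane. Inside a foreach loop, the loop index (and anything computed
// from it) is varying, so each iteration maps onto a different SIMD lane.
export void scaleAndBias(uniform float values[],
                         uniform int count,
                         uniform float scale,
                         uniform float bias) {
    foreach (i = 0...count) {
        varying float v = values[i];  // one array element per lane
        values[i] = v * scale + bias; // 'scale' and 'bias' are identical across lanes
    }
}
</code></pre></div></div>
<p>The body reads like plain scalar C, but each <code class="language-plaintext highlighter-rouge">foreach</code> iteration executes on a different SIMD lane, much like a tiny GPU kernel running over the array.</p>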
<p>Knowing all of the above, here’s how I implemented the 4-wide ray-box intersection test as an ISPC program.
Note how the actual intersection testing happens in the <code class="language-plaintext highlighter-rouge">foreach</code> loop; everything before that is setup:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef float<3> float3;
export void rayBBoxIntersect4ISPC(const uniform float rayDirection[3],
                                  const uniform float rayOrigin[3],
                                  const uniform float rayTMin,
                                  const uniform float rayTMax,
                                  const uniform float bbox4corners[6][4],
                                  uniform float tMins[4],
                                  uniform float tMaxs[4],
                                  uniform int hits[4]) {
    uniform float3 rdir = { 1.0f / rayDirection[0], 1.0f / rayDirection[1],
                            1.0f / rayDirection[2] };
    uniform int near[3] = { 3, 4, 5 };
    if (rdir.x >= 0.0f) {
        near[0] = 0;
    }
    if (rdir.y >= 0.0f) {
        near[1] = 1;
    }
    if (rdir.z >= 0.0f) {
        near[2] = 2;
    }
    uniform int far[3] = { 0, 1, 2 };
    if (rdir.x >= 0.0f) {
        far[0] = 3;
    }
    if (rdir.y >= 0.0f) {
        far[1] = 4;
    }
    if (rdir.z >= 0.0f) {
        far[2] = 5;
    }
    foreach (i = 0...4) {
        tMins[i] = max(max(rayTMin, (bbox4corners[near[0]][i] - rayOrigin[0]) * rdir.x),
                       max((bbox4corners[near[1]][i] - rayOrigin[1]) * rdir.y,
                           (bbox4corners[near[2]][i] - rayOrigin[2]) * rdir.z));
        tMaxs[i] = min(min(rayTMax, (bbox4corners[far[0]][i] - rayOrigin[0]) * rdir.x),
                       min((bbox4corners[far[1]][i] - rayOrigin[1]) * rdir.y,
                           (bbox4corners[far[2]][i] - rayOrigin[2]) * rdir.z));
        hits[i] = tMins[i] <= tMaxs[i];
    }
}
</code></pre></div></div>
<div class="codecaption">Listing 10: ISPC implementation of the compact Williams et al. 2005 implementation.</div>
<p>In order to call the ISPC function from our main C++ test program, we need to define a wrapper function on the C++ side of things.
When an ISPC program is compiled, ISPC automatically generates a corresponding header file, named after the ISPC program with “_ispc.h” appended.
This automatically generated header can be included by the C++ test program.
Using ISPC through CMake 3.19 or newer, ISPC programs can be added to any normal C/C++ project, and the automatically generated ISPC headers can be included like any other header; CMake takes care of placing the generated headers in the right location.</p>
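<p>As a rough sketch of what that CMake integration can look like (the project, target, and file names below are hypothetical and are not the actual build setup for my test program), with CMake 3.19 or newer the ISPC file can simply be listed as another source file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical CMakeLists.txt sketch; names are illustrative.
cmake_minimum_required(VERSION 3.19)
project(raybox LANGUAGES CXX ISPC)

add_executable(raybox
    main.cpp                # C++ test program; includes the generated *_ispc.h header
    rayBBoxIntersect4.ispc  # ISPC program; CMake runs the ispc compiler on this file
)
</code></pre></div></div>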
<p>Since ISPC is a separate language and since ISPC code has to be compiled as a separate object from our main C++ code, we can’t pass the various structs we’ve defined directly into the ISPC function.
Instead, we need a simple wrapper function that extracts pointers to the underlying basic data types from our custom structs, and passes those pointers to the ISPC function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void rayBBoxIntersect4ISPC(const Ray& ray,
                           const BBox4& bbox4,
                           IVec4& hits,
                           FVec4& tMins,
                           FVec4& tMaxs) {
    ispc::rayBBoxIntersect4ISPC(ray.direction.data, ray.origin.data, ray.tMin, ray.tMax,
                                bbox4.cornersFloatAlt, tMins.data, tMaxs.data, hits.data);
}
</code></pre></div></div>
<div class="codecaption">Listing 11: Wrapper function to call the ISPC implementation from C++.</div>
<p>Looking at the assembly output from ISPC <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1B4FAB2Skl9ZATwDKjdAGFUtAK4sGIAGylHAGTwGTAA5DwAjTGIQSQBWUlNUBUI7Bhd3Tx8EpJSBQOCwlkjouMtMa1sBIQImYgJ0jy9fK0wbVOragnzQiKiY%2BIUauobM5qGuoJ6ivriASktUN2Jkdg4AagIAT1NMLCo1qlpUJgIAUgBmJ3OL7AOjk%2BvzgCFTjQBBV7e177XMVUS6msAG6oPDoNbEJibJ5PVCqACSgiiShsknhQmUTggaAYgzWbgYeH4xBYd2OBAhUIAInhiK1KgxTrEntdYlTSJ8flzuTzeXyfji8QSiSRSYdyZTNgB5Yh4YBBJkspnszn8tXq7mCinC4li%2B4UyGbAAqAFkghz3hqreqtfjCbqySdJaamKoLV9rZ6ebadaLHRTwuE4ZI0MRgsQFIrvMrFZJle6vYmfr6Sf6NmbcbH46qk16U3qJQQTa7I8y42yE7m8/a/UEKQhCKWnuWqbM1qcAOwvS28/P%2B84Q/DEdvnKntrtrLgAOg0%2BwA9JKaXS2gJFRp45OZ/PF7T6alFVxsz2q2rp7O1gvDUu96vmQAmZXjqkXbsfY/cvt1tbBWqK1lji4AInc5SDWaQ1liJ8XxzbkiTWCBiCHKdVBHbBALWDQtzbTtXw1H9iDXR90PXZ4YK5Ttn3fLk4IQpDNlQ4isPHXD1Xwg8iNHSdoKon4KLIn4aMQ2kpwALwYzjMNnbCu34nk2PvDixwfUieO%2BPj3lku0RVTL8qF/Zl/xHICngw0CuFAu8oJUj0eUEpCUJuRipOYzSuT0gjmXXNkjLWR4WJ5dSbNg/ZaOE%2BjHIkpicNcn53PY7z0Ljay%2BUCvk7OEsSIrHSSqGk/z%2BTihSEs42JuKC8iO0ot9ytikhMCYZAEHgvAfMwqcp0kPKYu%2BIsgibPBFLWFhXQgYbVAQqFTXNeDA2DUNwybeSni8582UVAbvIAWklGU5QVTzlTbAAqQdhNUeZuqtMaoFm1QQxIBbFSWw81tep4NrHbbDV2%2BVGWZF7WzWE6hOIKdNkrE9uQgW77rDZEnvqjyngfV7Vvex8vqhH79uRw6gdO0GRNmWYyo1IsS3WwaWCCUaacNF03RmoM7vm%2BHmUK5aYzej61kx6VZV%2Bwi2WOgnkIu1TE2phgbuZ2HHvZ/SngBrn0a2naBZxgGRZBsGIchn5odl1mI0VDmUdWtGeb57G/tx4X8Z1omSeS9UGwIfrBt6zNmR5i4nHQ8nVA9taXYCyqYIojh5loThYl4LwOC0UhUE4ExzDWBRFmWTB2zvc4eFIAhNCj%2BYAGsQFiDR9E4SR4%2BL5POF4BQQCrovE6j0g4FgJA0BYUw6CichKF7/v6GiNPkC4LhvHMmhaAIZFKHCevwiCWpNk4Ave7YQQpQYWgN/b0gsGGoxxCPoc9yBTBm6Pv5WjcBfN94Otynr2g8HCSFiE2FwsHrggsoWDP3mIcJgwAFAADU8CYAAO5Sh2AnAu/BBAiDEOwKQMhBCKBUOoI%2BuhzIGCMCACe%2BhP7N0gPMVApgGS302lKAcm1NpDGAJgM4o4FBKFrmUCoqQHAMGcK4RoegAiTEKMUPQiRkgMhGF4cyUjcgMG6OIvo5kWgrgYB0YYQjMhqPKDeTR4xlG9GiGo8Ysi9CDE6MY6Ypj5iZyWCsPQgDMCrB4NHWOdcj4pw4BPTc08pxcGamYCwaxcCEBILnfOoEXB9wHsOU4ecuCzF4G3LQxNSDl0rtXDg3CE5Jx8U3Fuhdi4ZJjhwO8XiCmNxKe3DJ18Ix8MkEAA%3D">for x86-64 SSE4</a> and <a 
href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1B4FAB2Skl9ZATwDKjdAGFUtAK4sGIAGylHAGTwGTAA5DwAjTGIQSQBWUlNUBUI7Bhd3Tx8EpJSBQOCwlkjouMtMa1sBIQImYgJ0jy9fK0wbVOragnzQiKiY%2BIUauobM5qGuoJ6ivriASktUN2Jkdg4AagIAT1NMLCo1qlpUJgIAUgBmJ3OL7AOjk%2BvzgCFTjQBBV7e177XMVUS6msAG6oPDoNbEJibJ5PVCqACSgiiShsknhQmUTggaAYgzWbgYeH4xBYd2OBAhUIAInhiK1KgxTrEntdYlTSJ8flzuTzeXyfji8QSiSRSYdyZTNgB5Yh4YBBJkspnszn8tXq7mCinC4li%2B4UyGbAAqAFkghz3hqreqtfjCbqySdJaamKoLV9rZ6ebadaLHRTwuE4ZI0MRgsQFIrvMrFZJle6vYmfr6Sf6NmbcbH46qk16U3qJQQTa7I8y42yE7m8/a/UEKQhCKWnuWqbM1qcAOwvS28/P%2B84Q/DEdvnKntrtrLgAOg0%2BwA9JKaXS2gJFRp45OZ/PF7T6alFVxsz2q2rp7O1gvDUu96vmQAmZXjqkXbsfY/cvt1tbBWqK1lji4AInc5SDWaQ1liJ8XxzbkiTWCBiCHKdVBHbBALWDQtzbTtXw1H9iDXR90PXZ4YK5Ttn3fLk4IQpDNlQ4isPHXD1Xwg8iNHSdoKon4KLIn4aMQ2kpwALwYzjMNnbCu34nk2PvDixwfUieO%2BPj3lku0RVTL8qF/Zl/xHICngw0CuFAu8oJUj0eUEpCUJuRipOYzSuT0gjmXXNkjLWR4WJ5dSbNg/ZaOE%2BjHIkpicNcn53PY7z0Ljay%2BUCvk7OEsSIrHSSqGk/z%2BTihSEs42JuKC8iO0ot9ytikhMCYZAEHgvAfMwqcp0kPKYu%2BIsgibPBFLWFhXQgYbVAQqFTXNeDA2DUNwybeSni8582UVAbvIAWklGU5QVTzlTbAAqQdhNUeZuqtMaoFm1QQxIBbFSWw81tep4NrHbbDV2%2BVGWZF7WzWE6hOIKdNkrE9uQgW77rDZEnvqjyngfV7Vvex8vqhH79uRw6gdO0GRNmWYyo1IsS3WwaWCCUaacNF03RmoM7vm%2BHmUK5aYzej61kx6VZV%2Bwi2WOgnkIu1TE2phgbuZ2HHvZ/SngBrn0a2naBZxgGRZBsGIchn5odl1mI0VDmUdWtGeb57G/tx4X8Z1omSeS9UGwIfrBt6zNmR5i4nHQ8nVA9taXYCyqYIojh5loThYl4LwOC0UhUE4ExzDWBRFmWTB2zvc4eFIAhNCj%2BYAGsQFiDR9E4SR4%2BL5POF4BQQCrovE6j0g4FgJA0BYUw6CichKF7/v6GiNPkC4LhvHMmhaAIZFKHCevwiCWpNk4Ave7YQQpQYWgN/b0gsGGoxxCPoc9yBTBm6Pv5WjcBfN94Otynr2g8HCSFiE2FwsHrggsoWDP3mIcJgwAFAADU8CYAAO5Sh2AnAu/BBAiDEOwKQMhBCKBUOoI%2BuhzIGCMCACe%2BhP7N0gPMVApgGS302lKAcm1NpDGAJgM4o5ggCCbuUG8XgICOBGF4cyARJiFGKHoRIyQGSCIkTkBk3QxF9HMi0FcDAOjDFcI0PQKiGTqImAUXo0RlHjBkcYzoCjDESHmJnJYKw9CAMwKsHg0dY51yPinDgE9NzTynFwZqZgLBrFwIQEgud86gRcH3Aew5Th5y4LMXgbctDE1IOXSu1cOC11IAnJOHim4t0LsXFJMcOB3jcbkxuhT24pOvhGVIMQgA%3D%3D%3D">for arm64 Neon</a>, things look pretty good!
The contents of the <code class="language-plaintext highlighter-rouge">foreach</code> loop have been compiled down to a single straight run of vectorized instructions, with all four lanes filled beforehand.
Comparing ISPC’s output with the compiler output for the hand-vectorized implementations, the core of the ray-box test looks very similar between the two, while ISPC’s output for all of the precalculation logic actually seems slightly better than the output from the hand-vectorized implementation.</p>
<p>Here is how the ISPC implementation performs, compared to the baseline compact scalar implementation:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">x86-64:</th>
<th style="text-align: center">x86-64 Speedup:</th>
<th style="text-align: center">arm64:</th>
<th style="text-align: center">arm64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">44.5159 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">41.8187 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">1.0x.</td>
</tr>
<tr>
<td style="text-align: right">ISPC:</td>
<td style="text-align: center">8.2877 ns</td>
<td style="text-align: center">5.3835x</td>
<td style="text-align: center">11.2182 ns</td>
<td style="text-align: center">3.7278x</td>
<td style="text-align: center">11.3709 ns</td>
<td style="text-align: center">7.1317x</td>
</tr>
</tbody>
</table>
<p>The performance from the ISPC implementation looks really good!
Actually, on x86-64, the ISPC implementation’s performance looks <em>too good to be true</em>: at first glance, a 5.3835x speedup over the compact scalar baseline implementation shouldn’t be possible since the maximum expected possible speedup is only 4x.
I had to think about this result for a while; I think the explanation for this apparently better-than-possible speedup is that the setup and the actual intersection test parts of the 4-wide ray-box test need to be considered separately.
The actual intersection part is the part that is an apples-to-apples comparison across all of the different implementations, while the setup code can vary significantly both in how it is written and in how well it can be optimized across different implementations.
The reason for the above is that the setup code is inherently more scalar.
I think that the reason the ISPC implementation has an overall more-than-4x speedup over the baseline is that in the baseline implementation, the scalar setup code is not getting much out of the <code class="language-plaintext highlighter-rouge">-O3</code> optimization level, whereas the ISPC implementation’s setup code is both getting more out of ISPC’s <code class="language-plaintext highlighter-rouge">-O3</code> optimization level and is additionally just better vectorized on account of being ISPC code.
A data point that lends credence to this theory is that when Clang and ISPC are both forced to disable all optimizations using the <code class="language-plaintext highlighter-rouge">-O0</code> flag, the performance difference between the baseline and ISPC implementations falls back to a much more expected multiplier below 4x.</p>
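<p>As a purely illustrative sanity check (the symbols here are made up for the sake of the argument and were not measured separately): suppose the scalar baseline spends S nanoseconds on setup and T nanoseconds per individual box test, so one 4-box query costs roughly S + 4T, while the ISPC version costs roughly S′ + T for the same query. The intersection portion by itself can only improve by at most 4x, but if the ISPC setup cost S′ is meaningfully smaller than S, then the overall ratio (S + 4T) / (S′ + T) can exceed 4x, which is consistent with the measured results above.</p>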
<p>Generally, I really like ISPC!
ISPC delivers on the promise of write-once compiler-and-run-anywhere vectorized code, and unlike auto-vectorization, ISPC’s output compiler assembly performs as we expect for well-written vectorized code.
Of course, ISPC isn’t 100% fool-proof magic; care still needs to be taken to write ISPC programs that don’t contain excessive amounts of execution path divergence between SIMD lanes and that don’t perform too many expensive gather/scatter operations.
However, these types of considerations are just part of writing vectorized code in general and are not specific to ISPC, and furthermore, these types of considerations should be familiar territory for anyone with experience writing GPU code as well.
I think that’s a general strength of ISPC: writing vector CPU code using ISPC feels a lot like writing GPU code, and that’s by design!</p>
<div id="results"></div>
<p><strong>Final Results and Conclusions</strong></p>
<p>Now that we’ve walked through every implementation in the test program, below are the complete results for every implementation across x86-64, arm64, and Rosetta 2.
As mentioned earlier, all results were generated by running on a 2019 16-inch MacBook Pro with an Intel Core i7-9750H CPU for x86-64, and on a 2020 M1 Mac Mini for arm64 and Rosetta 2.
All results were generated by running the test program with 100000 runs per implementation; the timings reported are the average time for one run.
I ran the test program 5 times with 100000 runs each time; after throwing out the highest and lowest result for each implementation to discard outliers, I then averaged the remaining three results for each implementation for each architecture.
In the results, the “speedup” columns use the scalar compact implementation as the baseline for comparison:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">Results</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">x86-64:</th>
<th style="text-align: center">x86-64 Speedup:</th>
<th style="text-align: center">arm64:</th>
<th style="text-align: center">arm64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">44.5159 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">41.8187 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">1.0x.</td>
</tr>
<tr>
<td style="text-align: right">Scalar Original:</td>
<td style="text-align: center">44.1004 ns</td>
<td style="text-align: center">1.0117x</td>
<td style="text-align: center">78.4001 ns</td>
<td style="text-align: center">0.5334x</td>
<td style="text-align: center">90.7649 ns</td>
<td style="text-align: center">0.8935x</td>
</tr>
<tr>
<td style="text-align: right">Scalar No Early-Out:</td>
<td style="text-align: center">55.6770 ns</td>
<td style="text-align: center">0.8014x</td>
<td style="text-align: center">85.3562 ns</td>
<td style="text-align: center">0.4899x</td>
<td style="text-align: center">102.763 ns</td>
<td style="text-align: center">0.7891x</td>
</tr>
<tr>
<td style="text-align: right">SSE:</td>
<td style="text-align: center">10.9660 ns</td>
<td style="text-align: center">4.0686x</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
<td style="text-align: center">13.6353 ns</td>
<td style="text-align: center">5.9474x</td>
</tr>
<tr>
<td style="text-align: right">SSE2NEON:</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
<td style="text-align: center">12.3090 ns</td>
<td style="text-align: center">3.3974x</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
</tr>
<tr>
<td style="text-align: right">Neon:</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
<td style="text-align: center">12.2161 ns</td>
<td style="text-align: center">3.4232x</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
</tr>
<tr>
<td style="text-align: right">Autovectorize:</td>
<td style="text-align: center">34.1398 ns</td>
<td style="text-align: center">1.3069x</td>
<td style="text-align: center">38.1917 ns</td>
<td style="text-align: center">1.0950x</td>
<td style="text-align: center">59.9757 ns</td>
<td style="text-align: center">1.3521x</td>
</tr>
<tr>
<td style="text-align: right">ISPC:</td>
<td style="text-align: center">8.2877 ns</td>
<td style="text-align: center">5.3835x</td>
<td style="text-align: center">11.2182 ns</td>
<td style="text-align: center">3.7278x</td>
<td style="text-align: center">11.3709 ns</td>
<td style="text-align: center">7.1317x</td>
</tr>
</tbody>
</table>
<p>In each of the sections above, we’ve already looked at how the performance of each individual implementation compares against the baseline compact scalar implementation.
Ranking all of the approaches (at least for the specific example used in this post), ISPC produces the best performance, hand-vectorization using each processor’s native vector intrinsics comes in second, hand-vectorization using a translation layer such as sse2neon follows very closely behind using native vector intrinsics, and finally auto-vectorization comes in a distant last place.
Broadly, I think a good rule of thumb is that auto-vectorization is better than nothing, and that for large complex programs where vectorization is important and where cross-platform is required, ISPC is the way to go.
For smaller-scale things where the additional development complexity of bringing in an additional compiler isn’t justified, writing directly using vector intrinsics is a good solution. Using translation layers like sse2neon to port code written using one architecture’s vector intrinsics to another architecture without a total rewrite can work just as well as rewriting from scratch (assuming the translation layer is as well-written as sse2neon is).
Finally, as mentioned earlier, I was very surprised to learn that Rosetta 2 seems to be much better at translating vector instructions than it is at translating normal scalar x86-64 instructions.</p>
<p>Looking back over the final test program, around a third of the total lines of code aren’t ray-box intersection code at all; instead, that third is made up of just defining data structures and doing data marshaling to make sure that the actual ray-box intersection code can be efficiently vectorized at all.
I think that in most applications of vectorization, figuring out the data marshaling to enable good vectorization is just as important a problem as actually writing the core vectorized code, and the data marshaling can often be even harder than the actual vectorization part.
Even the ISPC implementation in this post only works because the specific memory layout of the <code class="language-plaintext highlighter-rouge">BBox4</code> data structure is designed for optimal vectorized access.</p>
<p>For much larger vectorized applications, such as full production renderers, planning ahead for vectorization doesn’t just mean figuring out how to lay out data structures in memory, but can mean having to incorporate vectorization considerations into the fundamental architecture of the entire system.
A great example of the above is DreamWorks Animation’s Moonray renderer, which has an entire architecture designed around coalescing enough coherent work in an incoherent path tracer to facilitate ISPC-based vectorized shading <a href="https://dl.acm.org/citation.cfm?doid=3105762.3105768">[Lee et al. 2017]</a>.
Weta Digital’s Manuka renderer goes even further by fundamentally restructuring the typical order of operations in a standard path tracer into a <em>shade-before-hit</em> architecture, also in part to facilitate vectorized shading <a href="https://doi.org/10.1145/3182161">[Fascione et al. 2018]</a>.
Pixar and Intel have also worked together recently to extend OSL with better vectorization for use in RenderMan XPU, which has necessitated the addition of a new batched interface to OSL <a href="https://www.youtube.com/watch?v=-WqrP50nvN4">[Liani and Wells 2020]</a>.
Some other interesting large non-rendering applications where vectorization has been applied through the use of clever rearchitecting include JPEG encoding <a href="https://blog.cloudflare.com/neon-is-the-new-black/">[Krasnov 2018]</a> and even <a href="https://github.com/simdjson/simdjson">JSON parsing</a> <a href="https://doi.org/10.1007/s00778-019-00578-5">[Langdale and Lemire 2019]</a>.
More generally, the entire domain of data-oriented design <a href="https://www.youtube.com/watch?v=rX0ItVEVjHc">[Acton 2014]</a> revolves around understanding how to structure data layout according to how computation needs to access said data; although data-oriented design was originally motivated by a need to efficiently utilize the CPU cache hierarchy, data-oriented design is also highly applicable to structuring vectorized programs.</p>
<p>In this post, we only looked at 4-wide 128-bit SIMD extensions.
Vectorization is not limited to 128-bits or 4-wide instructions, of course; x86-64’s newer <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX instructions</a> use 256-bit registers and, when used with 32-bit floats, AVX is 8-wide.
The newest version of AVX, <a href="https://en.wikipedia.org/wiki/AVX-512">AVX-512</a>, extends things even wider to 512-bit registers and can support a whopping 16 32-bit lanes.
Similarly, ARM’s new <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/sve/sve-programmers-guide">SVE vector extensions</a> serve as a wider successor to Neon (ARM also recently introduced a new lower-energy, lighter-weight companion vector extension to Neon, named <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/helium/helium-programmers-guide">Helium</a>).
Comparing AVX and SVE is interesting, because their design philosophies are much further apart than the relatively similar design philosophies behind SSE and Neon.
AVX serves as a direct extension to SSE, to the point where even AVX’s YMM registers are really just an expanded version of SSE’s XMM registers (on processors supporting AVX, the XMM registers are physically just the lower 128 bits of the full YMM registers).
Similar to AVX, the lower bits of SVE’s registers also overlap Neon’s registers, but SVE uses a new set of vector instructions separate from Neon.
The big difference between AVX and SVE is that while AVX and AVX-512 specify fixed 256-bit and 512-bit widths respectively, SVE allows for different implementations to define different widths from a minimum of 128-bit all the way up to a maximum of 2048-bit, in 128-bit increments.
At some point in the future, I think a comparison of AVX and SVE could be fun and interesting, but I didn’t touch on them in this post because of a number of current problems.
In many Intel processors today, AVX (and especially AVX-512) is so power-hungry that using AVX means that the processor has to throttle its clock speeds down <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/">[Krasnov 2017]</a>, which can in some cases completely negate any kind of performance improvement.
The challenge with testing SVE code right now is… there just aren’t many arm64 processors out there that actually implement SVE yet!
As of the time of writing, the only publicly released arm64 processor in the world that I know of that implements SVE is Fujitsu’s A64FX supercomputer processor, which is not exactly an off-the-shelf consumer part.
NVIDIA’s upcoming Grace arm64 server CPU is also supposed to implement SVE, but as of 2021, the Grace CPU is still a few years away from release.</p>
<p>At the end of the day, for any application where vectorization is a good fit, not using vectorization means leaving a large amount of performance on the table.
Of course, the example used in this post is just a single data point, and is a relatively small example; your mileage may and likely will vary for different and larger examples!
As with any programming task, understanding your problem domain is crucial for understanding how useful any given technique will be, and as seen in this post, great care must be taken to structure code and data to even be able to take advantage of vectorization.
Hopefully this post has served as a useful examination of several different approaches to vectorization!
Again, I have put all of the code in this post in <a href="https://github.com/betajippity/sseneoncompare">an open GitHub repository</a>; feel free to play around with it yourself (or if you are feeling especially ambitious, feel free to use it as a starting point for a full vectorized BVH implementation)!</p>
<p><strong>Addendum</strong></p>
<p>After I published this post, <a href="https://twitter.com/romainguy">Romain Guy</a> wrote in with a suggestion to use <code class="language-plaintext highlighter-rouge">-ffast-math</code> to improve the auto-vectorization results.
I gave the suggestion a try, and the result was indeed markedly improved!
Across the board, using <code class="language-plaintext highlighter-rouge">-ffast-math</code> cut the auto-vectorization timings by about half, corresponding to around a doubling of performance.
Using <code class="language-plaintext highlighter-rouge">-ffast-math</code>, the auto-vectorized implementation still trails behind the hand-vectorized and ISPC implementations, but by a much narrower margin than before, and overall is much, much better than the compact scalar baseline.
Romain previously presented <a href="https://www.youtube.com/watch?v=Lcq_fzet9Iw">a talk in 2019</a> about Google’s Filament real-time rendering engine, which includes many additional tips for making auto-vectorization work better.</p>
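<p>For anyone who wants to try this in a CMake-based setup like the hypothetical sketch shown earlier, enabling the flag is a one-line change (the target name here is again just illustrative, and keep in mind that <code class="language-plaintext highlighter-rouge">-ffast-math</code> relaxes strict IEEE floating point behavior):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch: enable fast-math for the auto-vectorization comparison build.
target_compile_options(raybox PRIVATE -ffast-math)
</code></pre></div></div>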
<p><strong>References</strong></p>
<p>Mike Acton. 2014. <a href="https://www.youtube.com/watch?v=rX0ItVEVjHc">Data-Oriented Design and C++</a>. In <em>CppCon 2014</em>.</p>
<p>AMD. 2020. <a href="https://gpuopen.com/rdna2-isa-available/">“RDNA 2” Instruction Set Architecture Reference Guide</a>. Retrieved August 30, 2021.</p>
<p>ARM Holdings. 2021. <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/">ARM Intrinsics</a>. Retrieved August 30, 2021.</p>
<p>ARM Holdings. 2021. <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/helium/helium-programmers-guide">Helium Programmer’s Guide</a>. Retrieved September 5, 2021.</p>
<p>ARM Holdings. 2021. <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/sve/sve-programmers-guide">SVE and SVE2 Programmer’s Guide</a>. Retrieved September 5, 2021.</p>
<p>Holger Dammertz, Johannes Hanika, and Alexander Keller. 2008. <a href="https://doi.org/10.1111/j.1467-8659.2008.01261.x">Shallow Bounding Volume Hierarchies for Fast SIMD Ray Tracing of Incoherent Rays</a>. <em>Computer Graphics Forum</em>. 27, 4 (2008), 1225-1234.</p>
<p>Manfred Ernst and Günther Greiner. 2008. <a href="https://doi.org/10.1109/RT.2008.4634618">Multi Bounding Volume Hierarchies</a>. In <em>RT 2008: Proceedings of the 2008 IEEE Symposium on Interactive Ray Tracing</em>. 35-40.</p>
<p>Luca Fascione, Johannes Hanika, Mark Leone, Marc Droske, Jorge Schwarzhaupt, Tomáš Davidovič, Andrea Weidlich, and Johannes Meng. 2018. <a href="https://doi.org/10.1145/3182161">Manuka: A Batch-Shading Architecture for Spectral Path Tracing in Movie Production</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 31:1-31:18.</p>
<p>Romain Guy and Mathias Agopian. 2019. <a href="https://www.youtube.com/watch?v=Lcq_fzet9Iw">High Performance (Graphics) Programming</a>. In <em>Android Dev Summit ‘19</em>. Retrieved September 7, 2021.</p>
<p>Intel Corporation. 2021. <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel Intrinsics Guide</a>. Retrieved August 30, 2021.</p>
<p>Intel Corporation. 2021. <a href="https://ispc.github.io/ispc.html">Intel ISPC User’s Guide</a>. Retrieved August 30, 2021.</p>
<p>Thiago Ize. 2013. <a href="http://jcgt.org/published/0002/02/02/">Robust BVH Ray Traversal</a>. <em>Journal of Computer Graphics Techniques</em>. 2, 2 (2013), 12-27.</p>
<p>Tero Karras and Timo Aila. 2013. <a href="https://doi.org/10.1145/2492045.2492055">Fast Parallel Construction of High-Quality Bounding Volume Hierarchies</a>. In <em>HPG 2013: Proceedings of the 5th Conference on High-Performance Graphics</em>. 89-99.</p>
<p>Vlad Krasnov. 2017. <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/">On the dangers of Intel’s frequency scaling</a>. In <em>Cloudflare Blog</em>. Retrieved August 30, 2021.</p>
<p>Vlad Krasnov. 2018. <a href="https://blog.cloudflare.com/neon-is-the-new-black/">NEON is the new black: fast JPEG optimization on ARM server</a>. In <em>Cloudflare Blog</em>. Retrieved August 30, 2021.</p>
<p>Geoff Langdale and Daniel Lemire. 2019. <a href="https://doi.org/10.1007/s00778-019-00578-5">Parsing Gigabytes of JSON per Second</a>. <em>The VLDB Journal</em>. 28 (2019), 941-960.</p>
<p>Mark Lee, Brian Green, Feng Xie, and Eric Tabellion. 2017. <a href="https://dl.acm.org/citation.cfm?doid=3105762.3105768">Vectorized Production Path Tracing</a>. In <em>HPG 2017: Proceedings of the 9th Conference on High-Performance Graphics</em>. 10:1-10:11.</p>
<p>Max Liani and Alex M. Wells. 2020. <a href="https://www.youtube.com/watch?v=-WqrP50nvN4">Supercharging Pixar’s RenderMan XPU with Intel AVX-512</a>. In <em>ACM SIGGRAPH 2020: Exhibitor Sessions</em>.</p>
<p>Alexander Majercik, Cyril Crassin, Peter Shirley, and Morgan McGuire. 2018. <a href="http://jcgt.org/published/0007/03/04/">A Ray-Box Intersection Algorithm and Efficient Dynamic Voxel Rendering</a>. <em>Journal of Computer Graphics Techniques</em>. 7, 3 (2018).</p>
<p>Daniel Meister, Shinji Ogaki, Carsten Benthin, Michael J. Doyle, Michael Guthe, and Jiri Bittner. 2021. <a href="https://doi.org/10.1111/cgf.142662">A Survey on Bounding Volume Hierarchies for Ray Tracing</a>. <em>Computer Graphics Forum</em>. 40, 2 (2021), 683-712.</p>
<p>NVIDIA. 2021. <a href="https://raytracing-docs.nvidia.com/optix7/guide/index.html">NVIDIA OptiX 7.3 Programming Guide</a>. Retrieved August 30, 2021.</p>
<p>Howard Oakley. 2021. <a href="https://eclecticlight.co/2021/08/23/code-in-arm-assembly-lanes-and-loads-in-neon/">Code in ARM Assembly: Lanes and loads in NEON</a>. In <em>The Eclectic Light Company</em>. Retrieved September 7, 2021.</p>
<p>Matt Pharr. 2018. <a href="https://pharr.org/matt/blog/2018/04/30/ispc-all">The Story of ISPC</a>. In <em>Matt Pharr’s Blog</em>. Retrieved July 18, 2021.</p>
<p>Matt Pharr and William R. Mark. 2012. <a href="https://doi.org/10.1109/InPar.2012.6339601">ispc: A SPMD compiler for high-performance CPU programming</a>. In <em>2012 Innovative Parallel Computing (InPar)</em>.</p>
<p>Martin Stich, Heiko Friedrich, and Andreas Dietrich. 2009. <a href="https://doi.org/10.1145/1572769.1572771">Spatial Splits in Bounding Volume Hierarchies</a>. In <em>HPG 2009: Proceedings of the 1st Conference on High-Performance Graphics</em>. 7-13.</p>
<p>John A. Tsakok. 2009. <a href="https://doi.org/10.1145/1572769.1572793">Faster Incoherent Rays: Multi-BVH Ray Stream Tracing</a>. In <em>HPG 2009: Proceedings of the 1st Conference on High-Performance Graphics</em>. 151-158.</p>
<p>Nathan Vegdahl. 2017. <a href="https://psychopath.io/post/2017_08_03_bvh4_without_simd">BVH4 Without SIMD</a>. In <em>Psychopath Renderer</em>. Retrieved August 20, 2021.</p>
<p>Ingo Wald, Carsten Benthin, and Solomon Boulos. 2008. <a href="https://doi.org/10.1109/RT.2008.4634620">Getting Rid of Packets - Efficient SIMD Single-Ray Traversal using Multi-Branching BVHs</a>. In <em>RT 2008: Proceedings of the 2008 IEEE Symposium on Interactive Ray Tracing</em>. 49-57.</p>
<p>Ingo Wald, Philipp Slusallek, Carsten Benthin, and Markus Wagner. 2001. <a href="https://doi.org/10.1111/1467-8659.00508">Interactive Rendering with Coherent Ray Tracing</a>. <em>Computer Graphics Forum</em>. 20, 3 (2001), 153-165.</p>
<p>Ingo Wald, Sven Woop, Carsten Benthin, Gregory S. Johnson, and Manfred Ernst. 2014. <a href="https://doi.org/10.1145/2601097.2601199">Embree: A Kernel Framework for Efficient CPU Ray Tracing</a>. <em>ACM Transactions on Graphics</em>. 33, 4 (2014), 143:1-143:8.</p>
<p>Amy Williams, Steve Barrus, Keith Morley, and Peter Shirley. 2005. <a href="https://doi.org/10.1080/2151237X.2005.10129188">An Efficient and Robust Ray-Box Intersection Algorithm</a>. <em>Journal of Graphics Tools</em>. 10, 1 (2005), 49-54.</p>
<p>Henri Ylitie, Tero Karras, and Samuli Laine. 2017. <a href="https://doi.org/10.1145/3105762.3105773">Efficient Incoherent Ray Traversal on GPUs Through Compressed Wide BVHs</a>. In <em>HPG 2017: Proceedings of the 9th Conference on High-Performance Graphics</em>. 4:1-4:13.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">Advanced Vector Extensions</a>. Retrieved September 5, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Automatic_vectorization">Automatic Vectorization</a>. Retrieved September 4, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/AVX-512">AVX-512</a>. Retrieved September 5, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads">Single Instruction, Multiple Threads</a>. Retrieved July 18, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/SPMD">SPMD</a>. Retrieved July 18, 2021.</p>
https://blog.yiningkarlli.com/2021/08/unbiased-emission-and-scattering-volumes.html
SIGGRAPH 2021 Talk- Unbiased Emission and Scattering Importance Sampling for Heterogeneous Volumes
2021-08-09T00:00:00+00:00
2021-08-09T00:00:00+00:00
Yining Karl Li
<p>This year at SIGGRAPH 2021, Wei-Feng Wayne Huang, Peter Kutz, Matt Jen-Yuan Chiang, and I have a talk that presents a pair of new distance-sampling techniques for improving emission and scattering importance sampling for volume path tracing cases where low-order heterogeneous scattering dominates.
These techniques were developed as part of our ongoing development on <a href="https://www.disneyanimation.com/technology/hyperion/">Disney’s Hyperion Renderer</a> and first saw full-fledged production use on Raya and the Last Dragon, although limited testing of in-progress versions also happened on Frozen 2.
This work was led by Wayne, building upon important groundwork that was put in place by Peter before Peter left Disney Animation.
Matt and I played more of an advisory or consulting role on this project, mostly helping with brainstorming, puzzling through ideas, and figuring out how to formally describe and present these new techniques.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Aug/unbiased-emission-and-scattering-volumes/teaser.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Aug/unbiased-emission-and-scattering-volumes/preview/teaser.jpg" alt="A higher-res version of Figure 1 from the paper: a torch embedded in thin anisotropic heterogeneous mist. Equal-time comparison of a conventional null-collision approach (left), incorporating our emission sampling strategy (middle), and additionally combining with our scattering sampling strategy via MIS (right)." /></a></p>
<p>Here is the paper abstract:</p>
<p><em>We present two new distance-sampling methods for production volume path tracing. We extend the null-collision integral formulation to efficiently gather heterogeneous volumetric emission, achieving higher-quality results. Additionally, we propose a tabulation-based approach to importance sample volumetric in-scattering through a spatial guiding data structure. Our methods improve the sampling efficiency for scenarios where low-order heterogeneous scattering dominates, which tends to cause high variance renderings with existing null-collision methods.</em></p>
<p>The paper and related materials can be found at:</p>
<ul>
<li><a href="https://www.yiningkarlli.com/projects/emissionscattervolumes.html">Project Page (Author’s Version and Supplementary Material)</a></li>
<li><a href="https://dl.acm.org/doi/10.1145/3450623.3464644">Official Print Version (ACM Library)</a></li>
</ul>
<p>As covered in several previous publications, several years ago we replaced Hyperion’s old residual ratio tracking <a href="https://dl.acm.org/citation.cfm?id=2661292">[Novák et al. 2014</a>, <a href="http://graphics.pixar.com/library/ProductionVolumeRendering">Fong et al. 2017]</a> based volume rendering system with a new, state-of-the-art volume rendering system based on null-collision tracking theory (also called delta tracking or Woodcock tracking).
Null-collision volume rendering systems are extremely good at dense volumes where light transport is dominated by high-order scattering, such as clouds and snow and sea foam.
However, null-collision volume rendering systems historically have struggled with efficiently rendering optically thin volumes dominated by low-order scattering, such as mist and fog.
The reason null-collision systems struggle with optically thin volumes is that in a thin volume, the average sampled distance is usually very large, meaning that the ray often goes right through the volume with very few scattering events <a href="http://jcgt.org/published/0007/03/03/">[Villemin et al. 2018]</a>.
Since we can only evaluate illumination at each scattering event, not having a lot of scattering events means that the illumination estimate is necessarily often very low-quality, leading to tons of noise.</p>
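<p>For readers who want a bit more intuition for why thin volumes are hard for this family of techniques, the relevant math is just the standard free-flight distance sampling used by null-collision methods (nothing here is specific to Hyperion): tentative collision distances are drawn from an exponential distribution defined by the majorant extinction coefficient, t = −ln(1 − ξ) / σ_maj, where ξ is a uniform random number. The expected value of that distribution is 1 / σ_maj, so when the volume is optically thin and the majorant is correspondingly small, the expected distance between collisions is very large and most paths exit the volume after few (or zero) real scattering events.</p>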
<p>Frozen 2’s forest scenes tended to include large amounts of atmospheric fog to lend the movie a moody look; these atmospherics proved to be a major challenge for Hyperion’s modern volume rendering system.
Going into Raya and the Last Dragon, we knew that the challenge was only going to get harder: from fairly early on in Raya and the Last Dragon’s production, we already knew that the cinematography direction for the film was going to rely heavily on atmospherics and fog <a href="https://doi.org/10.1145/3450623.3464676">[Bryant et al. 2021]</a>, even more than Frozen 2’s cinematography did.
To make things even harder, we also knew that a lot of these atmospherics were going to be lit using emissive volume light sources like fire or torches; not only did we need a good way to improve how we sampled scattering events, but we also needed a better way to sample emission.</p>
<p>The solution to the second problem (emission sampling) actually came long before the solution to the first problem (scattering sampling).
When we first implemented our new volume rendering system, we evaluated the emission term only when an absorption event happened, which is an intuitive interpretation of a random walk since each interaction is associated with one particular event.
However, shortly after we wrote our Spectral and Decomposition Tracking paper <a href="https://doi.org/10.1145/3072959.3073665">[Kutz et al. 2017]</a>, Peter realized that absorption and emission can actually also be evaluated at scattering and null-collision events too, and provided that some care was taken, doing so could be kept unbiased and mathematically correct as well.
Peter implemented this technique in Hyperion before he moved on from Disney Animation; later, through experiences from using an early version of this technique on Frozen 2, Wayne realized that the relationship between voxel size and majorant value needed to be factored into this technique.
When Wayne made the necessary modifications from his realization, the end result sped up this technique dramatically and in some scenes sped up overall volume rendering by up to a factor of 2x.
A complete description of how all of the above is done and how it can be kept unbiased and mathematically correct makes up the first part of our talk.</p>
<p>The solution to the first problem (scattering sampling) came out of many brainstorming and discussion sessions between Wayne, Matt, and myself.
At each volume scattering point, there are three terms that need to be sampled: transmittance, radiance, and the phase function.
The latter two are directly analogous to incoming radiance and the BRDF lobe at a surface scattering event; transmittance is an additional term that volumes have to worry about beyond what surfaces need to consider.
The problem we were facing in optically thin volumes fundamentally boiled down to cases where these three terms have extremely different distributions for the same point in space.
In surface path tracing, the solution to this type of problem is well understood: sample these different distributions using separate techniques and combine using MIS <a href="http://jcgt.org/published/0002/02/10/">[Villemin & Hery 2013]</a>.
However, we had two obstacles preventing us from using MIS here: first, MIS requires knowing a sampling pdf, and at the time, computing the sampling pdf for distance sampling in a null-collision system was an unsolved problem.
Second, we needed a way to do distance sampling based not on transmittance, but instead on the product of incoming radiance and the phase function; this product needed to be learned on-the-fly and stored in an easy-to-sample spatial data structure.
Fortunately, almost exactly around the time we were discussing these problems, Miller et al. <a href="https://doi.org/10.1145/3306346.3323025">[2019]</a> was published, which solved the longstanding open research problem around computing a usable pdf for distance samples, allowing for MIS.
Our idea for on-the-fly learning of the product of incoming radiance and the phase function was to simply piggyback off of Hyperion’s existing cache points light-selection-guiding data structure <a href="https://doi.org/10.1145/3182159">[Burley et al. 2018]</a>.
Wayne worked through the details of all of the above and implemented both in Hyperion, and also figured out how to combine this technique with the previously existing transmittance-based distance sampling and with Peter’s emission sampling technique; the detailed description of this technique makes up the second part of our talk.
The end product is a system that combines different techniques for handling thin and thick volumes to produce good, efficient results in a single unified volume integrator!</p>
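<p>As a brief aside for readers less familiar with MIS: the pdf requirement mentioned above comes directly from how MIS weights are computed. With the standard balance heuristic, a sample x drawn from technique i out of n techniques is weighted by w_i(x) = p_i(x) / (p_1(x) + … + p_n(x)), so evaluating the weight requires being able to evaluate every participating technique’s pdf at x, including the distance-sampling pdf, which is exactly the piece that Miller et al. [2019] made practical for null-collision volume rendering.</p>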
<p>Because of the limited length of the SIGGRAPH Talks short paper format, we had to compress our text significantly to fit into the required short paper length.
We put much more detail into the slides that Wayne presented at SIGGRAPH 2021; for anyone that is interested and is attending SIGGRAPH 2021, I’d highly recommend giving the talk a watch (and then going to see all of the other cool Disney Animation talks this year)!
For anyone interested in the technique post-SIGGRAPH 2021, hopefully we’ll be able to get a version of the slides cleared for release by the studio at some point.</p>
<p>Wayne’s excellent implementations of the above techniques proved to be an enormous win for both rendering efficiency and artist workflows on Raya and the Last Dragon; I personally think we would have had enormous difficulties in hitting the lighting art direction on Raya and the Last Dragon if it weren’t for Wayne’s work.
I owe Wayne a huge debt of gratitude for letting me be a small part of this project; the discussions were very fun, seeing it all come together was very exciting, and helping put the techniques down on paper for the SIGGRAPH talk was an excellent exercise in figuring out how to communicate cutting edge research clearly.</p>
<div class="embed-container-cinema">
<iframe src="/content/images/2021/Aug/unbiased-emission-and-scattering-volumes/comparisons/beforeaftercomparison_crop_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">A frame from Raya and the Last Dragon without our techniques (left), and with both our scattering and emission sampling applied (right). Both images are rendered using 32 spp per volume pass; surface passes are denoised and composited with non-denoised volume passes to isolate noise from volumes. A video version of this comparison is included in our talk's supplementary materials. For a larger still comparison, <a href="/content/images/2021/Aug/unbiased-emission-and-scattering-volumes/comparisons/beforeaftercomparison_crop.html">click here.</a></div>
<p><strong>References</strong></p>
<p>Marc Bryant, Ryan DeYoung, Wei-Feng Wayne Huang, Joe Longson, and Noel Villegas. 2021. <a href="https://doi.org/10.1145/3450623.3464676">The Atmosphere of Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 51:1-51:2.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. <a href="https://doi.org/10.1145/3182159">The Design and Evolution of Disney’s Hyperion Renderer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 33:1-33:22.</p>
<p>Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. <a href="http://graphics.pixar.com/library/ProductionVolumeRendering">Production Volume Rendering</a>. In <em>ACM SIGGRAPH 2017 Courses</em>. 2:1-2:97.</p>
<p>Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. <a href="https://doi.org/10.1145/3072959.3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 111:1-111:16.</p>
<p>Bailey Miller, Iliyan Georgiev, and Wojciech Jarosz. 2019. <a href="https://dl.acm.org/doi/10.1145/3306346.3323025">A Null-Scattering Path Integral Formulation of Light Transport</a>. <em>ACM Transactions on Graphics</em>. 38, 4 (2019), 44:1-44:13.</p>
<p>Jan Novák, Andrew Selle, and Wojciech Jarosz. 2014. <a href="https://dl.acm.org/citation.cfm?id=2661292">Residual Ratio Tracking for Estimating Attenuation in Participating Media</a>. <em>ACM Transactions on Graphics</em>. 33, 6 (2014), 179:1-179:11.</p>
<p>Ryusuke Villemin and Christophe Hery. 2013. <a href="http://jcgt.org/published/0002/02/10/">Practical Illumination from Flames</a>. <em>Journal of Computer Graphics Techniques</em>. 2, 2 (2013), 142-155.</p>
<p>Ryusuke Villemin, Magnus Wrenninge, and Julian Fong. 2018. <a href="http://jcgt.org/published/0007/03/03/">Efficient Unbiased Rendering of Thin Participating Media</a>. <em>Journal of Computer Graphics Techniques</em>. 7, 3 (2018), 50-65.</p>
https://blog.yiningkarlli.com/2021/07/porting-takua-to-arm-pt2.html
Porting Takua Renderer to 64-bit ARM- Part 2
2021-07-31T00:00:00+00:00
2021-07-31T00:00:00+00:00
Yining Karl Li
<p>This post is the second half of my two-part series about how I ported my hobby renderer (Takua Renderer) to 64-bit ARM and what I learned from the process.
In the <a href="https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html">first part</a>, I wrote about my motivation for undertaking a port to arm64 in the first place and described the process I took to get Takua Renderer up and running on an arm64-based Raspberry Pi 4B.
I also did a deep dive into several topics that I ran into along the way, which included floating point reproducibility across different processor architectures, a comparison of arm64 and x86-64’s memory reordering models, and a comparison of how the same example atomic code compiles down to assembly in arm64 versus in x86-64.
In this second part, I’ll write about developments and lessons learned after I got my initial arm64 port working correctly on Linux.</p>
<p>We’ll start with how I got Takua Renderer up and running on arm64 macOS, and discuss various interesting aspects of arm64 macOS, such as Universal Binaries and Apple’s Rosetta 2 binary translation layer for running x86-64 binaries on arm64 macOS.
As noted in the first part of this series, my initial port of Takua Renderer to arm64 did not include Embree; after the initial port, I added Embree support using Syoyo Fujita’s embree-aarch64 project (which has since been superseded by official arm64 support in Embree v3.13.0).
In this post I’ll look into how Embree, a codebase containing tons of x86-64 assembly and SSE and AVX intrinsics, was ported to arm64.
I will also use this exploration of Embree as a lens through which to compare x86-64’s SSE vector extensions to arm64’s Neon vector extensions.
Finally, I’ll wrap up with some additional important details to keep in mind when writing portable code between x86-64 and arm64, and I’ll also provide some more performance comparisons featuring the Apple M1 processor.</p>
<p><strong>Porting to arm64 macOS</strong></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/takua_macos_arm64.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/takua_macos_arm64.jpg" alt="Figure 1: Takua Renderer running on arm64 macOS 11, on an Apple Silicon Developer Transition Kit." /></a></p>
<p>At WWDC 2020 last year, Apple announced that Macs would be transitioning from using x86-64 processors to using custom Apple Silicon chips over a span of two years.
Apple Silicon chips package together CPU cores, GPU cores, and various other coprocessors and controllers onto a single die; the CPU cores implement arm64.
Actually, Apple Silicon implements a <em>superset</em> of arm64; there are some interesting extra special instructions that Apple has added to their arm64 implementation, which I’ll get to a bit later.
Similar to how Apple provided developers with preview hardware during the previous Mac transition from PowerPC to x86, Apple also announced that for this transition they would be providing Developer Transition Kits (DTKs) to developers in the form of special Mac Minis based on the iPad Pro’s A12Z chip.
I had been anticipating a Mac transition to arm64 for some time, so I ordered a Developer Transition Kit as soon as they were made available.</p>
<p>Since I had already gotten Takua Renderer up and running on arm64 on Linux, getting Takua Renderer up and running on the Apple Silicon DTK was very fast!
By far the most time consuming part of this process was just getting developer tooling set up and getting Takua’s dependencies built; once all of that was done, building and running Takua basically Just Worked™.
The only reason that getting developer tooling set up and getting dependencies built took a bit of work at the time was because this was just a week and a half after the entire Mac arm64 transition had even been announced.</p>
<p>Interestingly, the main stumbling block I ran into for most things on Apple Silicon macOS wasn’t the change to arm64 under the hood at all; the main stumbling block was… the macOS version number!
For the past 20 years, modern macOS (or Mac OS X as it was originally named) has used 10.x version numbers, but the first version of macOS to support arm64, macOS Big Sur, bumps the version number to 11.x.
This version number bump threw off a surprising number of libraries and packages!
Takua’s build system uses <a href="https://cmake.org">CMake</a> and <a href="https://ninja-build.org">Ninja</a>, and on macOS I get CMake and Ninja through <a href="https://www.macports.org">MacPorts</a>.
At the time, a lot of stuff in MacPorts wasn’t expecting an 11.x version number, so a bunch of stuff wouldn’t build, but fixing all of this just required manually patching build scripts and portfiles to expect an 11.x version number.
All of this pretty much got fixed within weeks of DTKs shipping out (and Apple actually contributed a huge number of patches themselves to various projects and stuff), but I didn’t want to wait at the time, so I just charged ahead.</p>
<p>Only three of Takua’s dependencies needed some minor patching to get working on arm64 macOS: <a href="https://github.com/oneapi-src/oneTBB">TBB</a>, <a href="https://github.com/AcademySoftwareFoundation/openexr">OpenEXR</a>, and <a href="https://github.com/wdas/ptex">Ptex</a>.
TBB’s build script just had to be updated to detect arm64 as a valid architecture for macOS; I submitted a pull request for this fix to the TBB Github repo, but I guess Intel doesn’t really take pull requests for TBB.
It’s okay though; the fix has since shown up in newer releases of TBB.
OpenEXR’s build script had to be patched so that inlined AVX intrinsics wouldn’t be used when building for arm64 on macOS; I submitted a pull request for this fix to OpenEXR that got merged, although this fix was later rendered unnecessary by a fix in the final release of Xcode 12.
Finally, Ptex just needed an extra include to pick up the <code class="language-plaintext highlighter-rouge">unlink()</code> system call correctly from <code class="language-plaintext highlighter-rouge">unistd.h</code> on macOS 11.
This change in Ptex was needed going from macOS Catalina to macOS Big Sur, and it’s also merged into the mainline Ptex repository now.</p>
<p>Once I had all of the above out of the way, getting Takua Renderer itself building and running correctly on the Apple Silicon DTK took no time at all, thanks to my previous efforts to port Takua Renderer to arm64 on Linux.
At this point I just ran <code class="language-plaintext highlighter-rouge">cmake</code> and <code class="language-plaintext highlighter-rouge">ninja</code> and a minute later out popped a working build.
From the moment the DTK arrived on my doorstep, I only needed about five hours to get Takua Renderer’s arm64 version building and running on the DTK with all tests passing.
Considering that at that point, outside of Apple nobody had done any work to get anything ready yet, I was very pleasantly surprised that I had everything up and working in just five hours!
Figure 1 is a screenshot of Takua Renderer running on arm64 macOS Big Sur Beta 1 on the Apple Silicon DTK.</p>
<p><strong>Universal Binaries</strong></p>
<p>The Mac has now had three processor architecture migrations in its history; the Mac line began in 1984 based on Motorola 68000 series processors, transitioned from the 68000 series to PowerPC in 1994, transitioned again from PowerPC to x86 (and eventually x86-64) in 2006, and is now in the process of transitioning from x86-64 to arm64.
Apple has used a similar strategy in all three of these processor architecture migrations to smooth the process.
Apple’s general transition strategy consists of two major components: first, provide a “fat” binary format that packages code from both architectures into a single executable that can run on both architectures, and second, provide some way for binaries from the old architecture to run directly on the new architecture.
I’ll look into the second part of this strategy a bit later; in this section, we are interested in Apple’s fat binary format.
Apple calls their fat binary format Universal Binaries; specifically, Apple uses the name “Universal 2” for the transition to arm64, since the original Universal Binary format was for the transition to x86.</p>
<p>Now that I had separate x86-64 and arm64 builds working and running on macOS, the next step was to modify Takua’s build system to automatically produce a single Universal 2 binary that could run on both Intel and Apple Silicon Macs.
Fortunately, creating Universal 2 binaries is very easy!
To understand why creating Universal 2 binaries can be so easy, we need to first understand at a high level how a Universal 2 binary works.
There actually isn’t much special about Universal 2 binaries per se, in the sense that multi-architecture support is an inherent feature of the Mach-O executable file format that Apple’s operating systems all use.
A multi-architecture Mach-O binary begins with a header that declares the file as a multi-architecture file and declares how many architectures are present.
The header is immediately followed by a list of architecture “slices”; each slice is a struct describing some basic information, such as what processor architecture the slice is for, the offset in the file that instructions begin at for the slice, and so on <a href="https://eclecticlight.co/2020/07/28/universal-binaries-inside-fat-headers/">[Oakley 2020]</a>.
After the list of architecture slices, the rest of the Mach-O file is pretty much like normal, except each architecture’s segments are concatenated after the previous architecture’s segments.
Also, Mach-O’s multi-architecture support allows for sharing non-executable resources between architectures.</p>
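<p>As a concrete illustration of the layout described above, here is a bare-bones C++ sketch that reads the multi-architecture header and slice list at the front of a Universal Binary, using the <code class="language-plaintext highlighter-rouge">fat_header</code> and <code class="language-plaintext highlighter-rouge">fat_arch</code> structs that macOS provides in <code class="language-plaintext highlighter-rouge">mach-o/fat.h</code>. This sketch is only meant to make the structure tangible; it skips error handling, ignores the 64-bit variant of the fat header format, and the file path in it is just a placeholder:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include <cstdio>
#include <arpa/inet.h>     // ntohl(); the fat header is stored big-endian on disk
#include <mach-o/fat.h>    // fat_header, fat_arch, FAT_MAGIC
#include <mach/machine.h>  // CPU_TYPE_ARM64, CPU_TYPE_X86_64

int main() {
    // Placeholder path; point this at any Universal Binary to try it out.
    std::FILE* file = std::fopen("/path/to/some/universal/binary", "rb");
    if (file == nullptr) {
        return 1;
    }
    fat_header header;
    std::fread(&header, sizeof(header), 1, file);
    if (ntohl(header.magic) != FAT_MAGIC) {
        std::printf("not a (32-bit header) multi-architecture Mach-O file\n");
        std::fclose(file);
        return 1;
    }
    uint32_t numSlices = ntohl(header.nfat_arch);
    std::printf("found %u architecture slices\n", numSlices);
    for (uint32_t i = 0; i < numSlices; i++) {
        fat_arch slice;
        std::fread(&slice, sizeof(slice), 1, file);
        cpu_type_t cputype = (cpu_type_t)ntohl((uint32_t)slice.cputype);
        const char* name = (cputype == CPU_TYPE_ARM64)  ? "arm64"
                         : (cputype == CPU_TYPE_X86_64) ? "x86_64"
                                                        : "other";
        // Each slice records where in the file that architecture's segments
        // begin and how many bytes they occupy.
        std::printf("slice %u: %s, offset %u, size %u\n", i, name,
                    ntohl(slice.offset), ntohl(slice.size));
    }
    std::fclose(file);
    return 0;
}
</code></pre></div></div>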
<p>So, because Universal 2 binaries are really just Mach-O multi-architecture binaries, and because Mach-O multi-architecture binaries don’t do any kind of crazy fancy interleaving and instead just concatenate each architecture after the previous one, all one needs to do to make a Universal 2 binary out of separate arm64 and x86-64 binaries is to concatenate the separate binaries into a single Mach-O file and set up the multi-architecture header and slices correctly.
Fortunately, <a href="https://developer.apple.com/documentation/apple-silicon/building-a-universal-macos-binary">a lot of tooling exists</a> to do exactly the above!
The version of clang that Apple ships with Xcode natively supports building Universal Binaries by just passing in multiple <code class="language-plaintext highlighter-rouge">-arch</code> flags; one for each architecture.
The Xcode UI of course also supports building Universal 2 binaries by just adding x86-64 and arm64 to an Xcode project’s architectures list in the project’s settings.
For projects using CMake, CMake has a <code class="language-plaintext highlighter-rouge">CMAKE_OSX_ARCHITECTURES</code> flag; this flag defaults to whatever the native architecture of the current system is, but can be set to <code class="language-plaintext highlighter-rouge">x86_64;arm64</code> to enable Universal Binary builds.
Finally, since the PowerPC to Intel transition, macOS has included a tool called lipo, which is used to query and create Universal Binaries; the larger LLVM compiler project also provides a compatible <a href="https://llvm.org/docs/CommandGuide/llvm-lipo.html">llvm-lipo tool</a>.
The lipo tool can combine any x86_64 Mach-O file with any arm64 Mach-O file to create a multi-architecture Universal Binary.
The lipo tool can also be used to “slim” a Universal Binary down into a single architecture by deleting architecture slices and segments from the Universal Binary.</p>
<p>Of course, when building a Universal Binary, any external libraries that have to be linked in also need to be Universal Binaries.
Takua has a relatively small number of direct dependencies, but unfortunately some of Takua’s dependencies pull in many more indirect (relative to Takua) dependencies; for example, Takua depends on <a href="https://www.openvdb.org">OpenVDB</a>, which in turn pulls in <a href="https://github.com/Blosc/c-blosc">Blosc</a>, <a href="https://www.zlib.net">zlib</a>, <a href="https://www.boost.org">Boost</a>, and several other dependencies.
While some of these dependencies are built using CMake and are therefore very easy to build as Universal Binaries themselves, some other dependencies use older or bespoke build systems that can be difficult to retrofit multi-architecture builds into.
Fortunately, this problem is where the lipo tool comes in handy.
For dependencies that can’t be easily built as Universal Binaries, I just built arm64 and x86-64 versions separately and then combined the separate builds into a single Universal Binary using the lipo tool.</p>
<p>Once all of Takua’s dependencies were successfully built as Universal Binaries, all I had to do to get Takua itself to build as a Universal Binary was to add a check in my CMakeLists file to not use a couple of x86-64-specific compiler flags in the event of an arm64 target architecture.
Then I just set the <code class="language-plaintext highlighter-rouge">CMAKE_OSX_ARCHITECTURES</code> flag to <code class="language-plaintext highlighter-rouge">x86_64;arm64</code>, ran <code class="language-plaintext highlighter-rouge">ninja</code>, and out came a working Universal Binary!
Figure 2 shows building Takua Renderer, checking that the current system architecture is an Apple Silicon Mac, using the lipo tool to see and confirm that the output Universal Binary contains both arm64 and x86-64 slices, and finally running the Universal Binary Takua Renderer build:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/universalbinary.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/universalbinary.png" alt="Figure 2: Building Takua Renderer as a Universal Binary, checking the current system architecture, checking the output Universal Binary's slices to confirm the presence of arm64 and x86-64 support, and finally running Takua Renderer from the Universal Binary build." /></a></p>
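<p>As a related aside, the same kind of architecture check that went into my CMakeLists file also shows up at the source level when building Universal Binaries: since each architecture slice is compiled separately from the same set of sources, any architecture-specific intrinsics have to sit behind compile-time guards so that each slice only ever sees instructions that actually exist for its architecture. The snippet below is a generic illustration of this pattern rather than Takua’s actual code, with an SSE path for x86-64, a Neon path for arm64, and a plain scalar fallback:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdio>

#if defined(__x86_64__)
#include <immintrin.h>   // SSE intrinsics, only available when targeting x86-64
#elif defined(__aarch64__)
#include <arm_neon.h>    // Neon intrinsics, only available when targeting arm64
#endif

// Add two 4-wide float vectors using whatever SIMD instruction set is native
// to the architecture currently being compiled for.
inline void add4(const float* a, const float* b, float* out) {
#if defined(__x86_64__)
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#elif defined(__aarch64__)
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
#else
    for (int i = 0; i < 4; i++) { out[i] = a[i] + b[i]; }
#endif
}

int main() {
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
    float out[4];
    add4(a, b, out);
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
</code></pre></div></div>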
<p>Out of curiosity, I also tried creating separate x86-64-only and arm64-only builds of Takua and assembling them into a Universal Binary using the lipo tool and comparing the result with the build of Takua that was natively built as a Universal Binary.
In theory natively building as a Universal Binary should be able to produce a slightly more compact output binary compared with using the lipo tool, since a natively built Universal Binary should be able to share non-code resources between different architectures, whereas the lipo tool just blindly encapsulates two separate Mach-O files into a single multi-architecture Mach-O file.
In fact, you can actually use the lipo tool to combine completely different programs into a single Universal Binary; after all, lipo has absolutely no way of knowing whether or not the arm64 and x86-64 code you want to combine is actually even from the same source code or implements the same functionality.
Indeed, the native Universal Binary Takua is slightly smaller than the lipo-generated Universal Binary Takua.
The size difference is tiny (basically negligible) though, likely because Takua’s binary contains very few non-code resources.
Figure 3 shows creating a Universal Binary by combining separate x86-64 and arm64 builds of Takua together using the lipo tool versus a Universal Binary built natively as a Universal Binary; the lipo version is just a bit over a kilobyte larger than the native version, which is negligible relative to the overall size of the files.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/lipocomparison.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/lipocomparison.png" alt="Figure 3: Examining the size of a Universal Binary created using the lipo tool versus the size of a Universal Binary built directly as a multi-architecture Mach-O." /></a></p>
<p><strong>Rosetta 2: Running x86-64 on Apple Silicon</strong></p>
<p>While getting Takua Renderer building and running as a native arm64 binary on Apple Silicon only took me about five hours, actually running Takua for the first time in <em>any</em> form on Apple Silicon happened much faster!
Before I did anything to get Takua’s arm64 build up and running on my Apple Silicon DTK, the first thing I did was just copy over the x86-64 macOS build of Takua to see if it would run on Apple Silicon macOS through Apple’s dynamic binary translation layer, Rosetta 2.
I was very impressed to find that the x86-64 version of Takua just worked out-of-the-box through Rosetta 2, and even passed my entire test suite!
I have now had Takua’s native arm64 build up and running as part of a Universal 2 binary for around a year, but I recently circled back to examine how Takua’s x86-64 build works through Rosetta 2.
I wanted to get a rough idea of how Rosetta 2 works, and much like many of the detours that I took on the entire Takua arm64 journey, I stumbled into a good opportunity to compare x86-64 and arm64 and learn more about how the two are similar and how they differ.</p>
<p>For every processor architecture transition that the Mac had undertaken, Apple has provided some sort of mechanism to run binaries for the outgoing processor architecture on Macs based on the new architecture.
During the 68000 to PowerPC transition, Apple’s approach was to emulate an entire 68000 system at the lowest levels of the operating system on PowerPC; in fact, during this transition, PowerPC Macs even allowed 68000 and PowerPC code to call back and forth to each other and be interspersed within the same binary.
During the PowerPC to x86 transition, Apple introduced Rosetta, which worked by JIT-compiling blocks of PowerPC code into x86 on-the-fly at program runtime.
For the x86-64 to arm64 transition, Rosetta 2 follows in the same tradition as in the previous two architecture transitions.
Rosetta 2 has two modes: the first is an ahead-of-time recompiler that converts an entire x86-64 binary to arm64 upon the binary’s first run and caches the translated binary for later reuse.
The second mode is a JIT translator, which is used for cases where the target program itself is also JIT-generating x86-64 code; obviously in these cases the target program’s JIT output cannot be recompiled to arm64 through an ahead-of-time process.</p>
<p>Apple does not publicly provide much information at all about how Rosetta 2 works under the hood.
Rosetta 2 is one of those pieces of Apple technology that basically “Just Works” well enough that the typical user never really has any need to know much about how it works internally, which is great for users but unfortunate for anyone that is more curious.
Fortunately though, Koh Nakagawa recently published <a href="https://ffri.github.io/ProjectChampollion/">a detailed analysis of Rosetta 2</a> produced through some careful reverse engineering work.
What I was interested in examining was how Rosetta 2’s output arm64 assembly looks compared with natively compiled arm64 assembly, so I’ll briefly summarize the relevant parts of how Rosetta 2 generates arm64 code.
There’s a lot more cool stuff about Rosetta 2, such as how the runtime and JIT mode works, that I won’t touch on here; if you’re interested, I’d highly recommend checking out Koh Nakagawa’s writeups.</p>
<p>When a user tries to run an x86-64 binary on an Apple Silicon Mac, Rosetta 2 first checks if this particular binary has already been translated by Rosetta 2 before; Rosetta 2 does this through a system daemon called <code class="language-plaintext highlighter-rouge">oahd</code>.
If Rosetta 2 has never encountered this particular binary before, <code class="language-plaintext highlighter-rouge">oahd</code> kicks off a new process called <code class="language-plaintext highlighter-rouge">oahd-helper</code> that carries out the ahead-of-time (AOT) binary translation process and caches the result in a folder located at <code class="language-plaintext highlighter-rouge">/var/db/oah</code>; cached AOT arm64 binaries are stored in subfolders named using a SHA-256 hash calculated from the contents and path of the original x86-64 binary.
If Rosetta 2 has encountered a binary before, as determined by finding a matching SHA-256 hash in <code class="language-plaintext highlighter-rouge">/var/db/oah</code>, then <code class="language-plaintext highlighter-rouge">oahd</code> just loads the previously cached AOT binary.</p>
<p>So what do these cached AOT binaries look like?
Unfortunately, <code class="language-plaintext highlighter-rouge">/var/db/oah</code> is by default not accessible to users at all, not even admin and root users.
Fortunately, as with all protected components of macOS, access can be granted by disabling System Integrity Protection (SIP).
I don’t recommend disabling SIP unless you have a very good reason to, since SIP is designed to protect core macOS files from getting damaged or modified, but for this exploration I temporarily disabled SIP just long enough to take a look in <code class="language-plaintext highlighter-rouge">/var/db/oah</code>.
Well, it turns out that the cached AOT binaries are just regular-ish arm64 Mach-O files named with an <code class="language-plaintext highlighter-rouge">.aot</code> extension; I say “regular-ish” because while the <code class="language-plaintext highlighter-rouge">.aot</code> files are completely normal Mach-O binaries, they cannot actually be executed on their own.
Attempting to directly run a <code class="language-plaintext highlighter-rouge">.aot</code> binary results in an immediate <code class="language-plaintext highlighter-rouge">SIGKILL</code>.
Instead, <code class="language-plaintext highlighter-rouge">.aot</code> binaries must be loaded by the Rosetta 2 runtime and require some special memory mapping to run correctly.
But that’s fine; I wasn’t interested in running the <code class="language-plaintext highlighter-rouge">.aot</code> file, I was interested in learning what it looks like inside, and since the <code class="language-plaintext highlighter-rouge">.aot</code> file is a Mach-O file, we can disassemble <code class="language-plaintext highlighter-rouge">.aot</code> files just like any other Mach-O file.</p>
<p>Let’s go through a simple example to compare how the same piece of C++ code compiles to arm64 natively, versus what Rosetta 2 generates from a x86-64 binary.
The simple example C++ code I’ll use here is the same basic atomic float addition implementation that I wrote about in my previous post; since that post already contains an exhaustive analysis of how this example compiles to both x86-64 and arm64 assembly, I figure that means I don’t need to go over all of that again and can instead dive straight into the Rosetta 2 comparison.
To make an actually executable binary though, I had to wrap the example <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function in a simple <code class="language-plaintext highlighter-rouge">main()</code> function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>

float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    do {
        float oldval = f0.load();
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}

int main() {
    std::atomic<float> t(0);
    addAtomicFloat(t, 1.0f);
    return 0;
}
</code></pre></div></div>
<div class="codecaption">Listing 1: Example <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> implementation and a very simple <code class="language-plaintext highlighter-rouge">main()</code> function to make an executable program. The <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> implementation is the same one from <a href="https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html#listing2">Listing 2 in my previous “Porting Takua Renderer to 64-bit ARM- Part 1” post</a>.</div>
<p>Modern versions of macOS’s Xcode Command Line Tools helpfully come with both otool and <a href="https://llvm.org/docs/CommandGuide/llvm-objdump.html">LLVM’s version of objdump</a>, both of which can be used to disassemble Mach-O binaries.
For this exploration, I used otool to disassemble arm64 binaries and objdump to disassemble x86-64 binaries.
I used different tools for disassembling x86-64 versus arm64 because of slightly different feature sets that I needed on each platform.
By default, Apple’s version of Clang emits newer ARMv8.1-A instructions like <code class="language-plaintext highlighter-rouge">casal</code>.
However, the version of objdump that Apple ships with the Xcode Command Line Tools only seems to support base ARMv8-A and doesn’t understand these newer ARMv8.1-A instructions, whereas otool does, hence using otool for arm64 binaries.
For x86-64 binaries, however, otool outputs x86-64 assembly using AT&T syntax, whereas I prefer reading x86-64 assembly in Intel syntax, which matches what <a href="https://godbolt.org">Godbolt Compiler Explorer</a> defaults to.
So, for x86-64 binaries, I used objdump, which can be set to output x86-64 assembly using Intel syntax with the <code class="language-plaintext highlighter-rouge">-x86-asm-syntax=intel</code> flag.</p>
<p>On both x86-64 and on arm64, I compiled the example in Listing 1 using the default Clang that comes with Xcode 12.5.1, which reports its version string as “Apple clang version 12.0.5 (clang-1205.0.22.11)”.
Note that Apple’s Clang version numbers have nothing to do with mainline upstream Clang version numbers; according to <a href="https://en.wikipedia.org/wiki/Xcode#12.x_series">this table on Wikipedia</a>, “Apple clang version 12.0.5” corresponds roughly with mainline LLVM/Clang 11.1.0.
Also, I compiled using the <code class="language-plaintext highlighter-rouge">-O3</code> optimization flag.</p>
<p>Disassembling the x86-64 binary using <code class="language-plaintext highlighter-rouge">objdump -disassemble -x86-asm-syntax=intel</code> produces the following x86-64 assembly.
I’ve only included the assembly for the <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function and not the assembly for the dummy <code class="language-plaintext highlighter-rouge">main()</code> function.
For readability, I have also replaced the offset for the <code class="language-plaintext highlighter-rouge">jne</code> instruction with a more readable label and added the label into the correct place in the assembly code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><__Z14addAtomicFloatRNSt3__16atomicIfEEf>: # f0 is dword ptr [rdi], f1 is xmm0
push rbp # save address of previous stack frame
mov rbp, rsp # move to address of current stack frame
nop word ptr cs:[rax + rax] # multi-byte no-op, probably to align
# subsequent instructions better for
# instruction fetch performance
nop # no-op
.LBB0_1:
mov eax, dword ptr [rdi] # eax = *arg0 = f0.load()
movd xmm1, eax # xmm1 = eax = f0.load()
movdqa xmm2, xmm1 # xmm2 = xmm1 = eax = f0.load()
addss xmm2, xmm0 # xmm2 = (xmm2 + xmm0) = (f0 + f1)
movd ecx, xmm2 # ecx = xmm2 = (f0 + f1)
lock cmpxchg dword ptr [rdi], ecx # if eax == *arg0 { ZF = 1; *arg0 = arg1 }
# else { ZF = 0; eax = *arg0 };
# "lock" means all done exclusively
jne .LBB0_1 # if ZF == 0 goto .LBB0_1
movdqa xmm0, xmm1 # return f0 value from before cmpxchg
pop rbp # restore address of previous stack frame
ret # return control to previous stack frame address
nop
</code></pre></div></div>
<div class="codecaption">Listing 2: The <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function from Listing 1 compiled to x86-64 using <code class="language-plaintext highligher-rouge">clang++ -O3</code> and disassembled using <code class="language-plaintext highligher-rouge">objdump -disassemble -x86-asm-syntax=intel</code>, with some minor tweaks for formatting and readability. My annotations are also included as comments.</div>
<p>If we compare the above code with <a href="https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html#listing5">Listing 5 in my previous post</a>, we can see that the above code matches what we got from Clang in Godbolt Compiler Explorer.
The only difference is the stack pointer pushing and popping code that happens in the beginning and end to make this function usable in a larger program; the core functionality in lines 8 through 18 of the above code matches the output from Clang in Godbolt Compiler Explorer exactly.</p>
<p>Next, here’s the assembly produced by disassembling the arm64 generated using Clang.
I disassembled the arm64 binary using <code class="language-plaintext highlighter-rouge">otool -Vt</code>; here’s the relevant <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function with the same minor changes as in Listing 2 for more readable section labels:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__Z14addAtomicFloatRNSt3__16atomicIfEEf:
.LBB0_1:
ldar w8, [x0] // w8 = *arg0 = f0, atomically loaded with acquire semantics
fmov s1, w8 // s1 = w8 = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
fmov w9, s2 // w9 = s2 = (f0 + f1)
mov x10, x8 // x10 (same as w10) = x8 (same as w8)
casal w10, w9, [x0] // atomically read the contents of the address stored
// in x0 (*arg0 = f0) and compare with w10;
// if [x0] == w10:
// atomically set the contents of the
// [x0] to the value in w9
// else:
// w10 = value loaded from [x0]
cmp w10, w8 // compare w10 and w8 and store result in N
cset w8, eq // if previous instruction's compare was true,
// set w8 = 1
cmp w8, #0x1 // compare if w8 == 1 and store result in N
b.ne .LBB0_1 // if N==0 { goto .LBB0_1 }
mov.16b v0, v1 // return f0 value from ldar
ret
</code></pre></div></div>
<div class="codecaption">Listing 3: The <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function from Listing 1 compiled to arm64 using <code class="language-plaintext highligher-rouge">clang++ -O3</code> and disassembled using <code class="language-plaintext highligher-rouge">otool -Vt</code>, with some minor tweaks for formatting and readability. <br />My annotations are also included as comments.</div>
<p>Note the use of the ARMv8.1-A <code class="language-plaintext highlighter-rouge">casal</code> instruction.
Apple’s version of Clang defaults to using ARMv8.1-A instructions when compiling for macOS because the M1 chip implements ARMv8.4-A, and since the M1 chip is the first arm64 processor that macOS supports, that means macOS can safely assume a more advanced minimum target instruction set.
Also, the arm64 assembly output in Listing 3 looks almost exactly identical structurally to the Godbolt Compiler Explorer Clang output in <a href="https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html#listing9">Listing 9 from my previous post</a>.
The only differences are small syntactic ones: how the <code class="language-plaintext highlighter-rouge">mov</code> instruction in line 20 specifies a 16 byte (128 bit) SIMD register, some different register choices, and a different ordering of <code class="language-plaintext highlighter-rouge">fmov</code> and <code class="language-plaintext highlighter-rouge">mov</code> instructions in lines 6 and 7.</p>
<p>Finally, let’s take a look at the arm64 assembly that Rosetta 2 generates through the AOT process described earlier.
Disassembling the Rosetta 2 AOT file using <code class="language-plaintext highlighter-rouge">otool -Vt</code> produces the following arm64 assembly; like before, I’m only including the relevant <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function.
Since the code below switches between <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">w</code> registers a lot, remember that in arm64 assembly, <code class="language-plaintext highlighter-rouge">x0</code>-<code class="language-plaintext highlighter-rouge">x30</code> and <code class="language-plaintext highlighter-rouge">w0</code>-<code class="language-plaintext highlighter-rouge">w30</code> are really the same registers; <code class="language-plaintext highlighter-rouge">x</code> just means use the full 64-bit register, whereas <code class="language-plaintext highlighter-rouge">w</code> just means use the lower 32 bits of the <code class="language-plaintext highlighter-rouge">x</code> register with the same register number.
Also, the <code class="language-plaintext highlighter-rouge">v</code> registers are 128-bit vector registers that are separate from the <code class="language-plaintext highlighter-rouge">x</code>/<code class="language-plaintext highlighter-rouge">w</code> set of registers; <code class="language-plaintext highlighter-rouge">s</code> registers are the bottom 32 bits of <code class="language-plaintext highlighter-rouge">v</code> registers.
In my annotations, I’ll use <code class="language-plaintext highlighter-rouge">x</code> for both <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">w</code> registers, and I’ll use <code class="language-plaintext highlighter-rouge">v</code> for both <code class="language-plaintext highlighter-rouge">v</code> and <code class="language-plaintext highlighter-rouge">s</code> registers.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__Z14addAtomicFloatRNSt3__16atomicIfEEf:
str x5, [x4, #-0x8]! // store value at x5 to ((address in x4) - 8) and
// write calculated address back into x4
mov x5, x4 // x5 = address in x4
.LBB0_1
ldr w0, [x7] // x0 = *arg0 = f0, non-atomically loaded
fmov s1, w0 // v1 = x0 = f0
mov.16b v2, v1 // v2 = v1 = f0
fadd s2, s2, s0 // v2 = v2 + v0 = (f0 + f1)
mov.s w1, v2[0] // x1 = v2 = (f0 + f1)
mov w22, w0 // x22 = x0 = f0
casal w22, w1, [x7] // atomically read the contents of the address stored
// in x7 (*arg0 = f0) and compare with x22;
// if [x7] == x22:
// atomically set the contents of the
// [x7] to the value in x1
// else:
// x22 = value loaded from [x7]
cmp w22, w0 // compare x22 and x0 and store result in N
csel w0, w0, w22, eq // if N==1 { x0 = x0 } else { x0 = x22 }
b.ne .LBB0_1 // if N==0 { goto .LBB0_1 }
mov.16b v0, v1 // v0 = v1 = f0
ldur x5, [x4] // x5 = value at address in x4, using unscaled load
add x4, x4, #0x8 // add 8 to address stored in x4
ldr x22, [x4], #0x8 // x22 = value at address in x4, then x4 += 8
ldp x23, x24, [x21], #0x10 // x23 = value at address in x21,
// x24 = value at ((address in x21) + 8),
// and then x21 += 16
sub x25, x22, x23 // x25 = x22 - x23
cbnz x25, .LBB0_2 // if x22 != x23 { goto .LBB0_2 }
ret x24
.LBB0_2
bl 0x4310 // branch (with link) to address 0x4310
</code></pre></div></div>
<div class="codecaption">Listing 4: The x86-64 assembly from Listing 2 translated to arm64 by Rosetta 2's ahead-of-time translator. Disassembled using <code class="language-plaintext highligher-rouge">otool -Vt</code>, with some minor tweaks for formatting and readability. My annotations are also included as comments.</div>
<p>In some ways, we can see similarities between the Rosetta 2 arm64 assembly in Listing 4 and the natively compiled arm64 assembly in Listing 3, but there are also a lot of things in the Rosetta 2 arm64 assembly that look very different from the natively compiled arm64 assembly.
The core functionality in lines 9 through 21 of Listing 4 bears a strong resemblance to the core functionality in lines 5 through 19 of Listing 3; both versions use a <code class="language-plaintext highlighter-rouge">fadd</code>, followed by a <code class="language-plaintext highlighter-rouge">casal</code> instruction to implement the atomic comparison, then follow with a <code class="language-plaintext highlighter-rouge">cmp</code> to compare the expected and actual outcomes, and then have some logic about whether or not to jump back to the top of the loop.
However, if we look more closely at the core functionality in the Rosetta 2 version, we can see some oddities.
In preparing for the <code class="language-plaintext highlighter-rouge">fadd</code> instruction on line 9, the Rosetta 2 version does a <code class="language-plaintext highlighter-rouge">fmov</code> followed by a 16-byte <code class="language-plaintext highlighter-rouge">mov</code> into register <code class="language-plaintext highlighter-rouge">v2</code>, and then the <code class="language-plaintext highlighter-rouge">fadd</code> takes a value from <code class="language-plaintext highlighter-rouge">v2</code>, adds the value to what is in <code class="language-plaintext highlighter-rouge">v0</code>, and stores the result back into <code class="language-plaintext highlighter-rouge">v2</code>.
The 16-byte move is pointless!
Instead of two <code class="language-plaintext highlighter-rouge">mov</code> instructions and an <code class="language-plaintext highlighter-rouge">fadd</code> where the first source register and the destination register are the same, a better version would be to omit the second <code class="language-plaintext highlighter-rouge">mov</code> instruction and instead just do <code class="language-plaintext highlighter-rouge">fadd s2, s1, s0</code>.
Indeed, in Listing 3 we can see that the natively compiled version just uses a single <code class="language-plaintext highlighter-rouge">mov</code> and does <code class="language-plaintext highlighter-rouge">fadd s2, s1, s0</code>.
So, what’s going on here?</p>
<p>Things begin to make more sense once we look at the x86-64 assembly that the Rosetta 2 version is translated from.
In Listing 2’s x86-64 version, the <code class="language-plaintext highlighter-rouge">addss</code> instruction only has two inputs because the first source register is always also the destination register.
So, the x86-64 version has no choice but to use a few extra <code class="language-plaintext highlighter-rouge">mov</code> instructions to make sure values that are needed later aren’t overwritten by the <code class="language-plaintext highlighter-rouge">addss</code> instruction; whatever value needs to be in <code class="language-plaintext highlighter-rouge">xmm2</code> during the <code class="language-plaintext highlighter-rouge">addss</code> instruction must also be squirreled away in a second location if that value is still needed after <code class="language-plaintext highlighter-rouge">addss</code> is executed.
Since the Rosetta 2 arm64 assembly is a direct translation from the x86-64 assembly, the extra <code class="language-plaintext highlighter-rouge">mov</code> needed in the x86-64 version gets translated into the extraneous <code class="language-plaintext highlighter-rouge">mov.16b</code> in Listing 4, and the two-operand x86-64 <code class="language-plaintext highlighter-rouge">addss</code> gets translated into a strange looking <code class="language-plaintext highlighter-rouge">fadd</code> where the same register is duplicated for the first source and destination operands; this duplication is a direct one-to-one mapping to what <code class="language-plaintext highlighter-rouge">addss</code> does.</p>
<p>I think from the above we can see two very interesting things about Rosetta 2’s translation.
On one hand, the fact that the overall structure of the core functionality in the Rosetta 2 and natively compiled versions is so similar is very impressive, especially when considering that Rosetta 2 had absolutely no access to the original high-level C++ source code!
I guess my example function here is a very simple test case, but nonetheless I was impressed that Rosetta 2’s output overall isn’t too bad.
On the other hand though, the Rosetta 2 version does have small oddities and inefficiencies that arise from doing a direct mechanical translation from x86-64.
Since Rosetta 2 has no access to the original source code, no context for what the code does, and no ability to build any kind of higher-level syntactic understanding, the best Rosetta 2 can really do is a direct mechanical translation with a relatively high level of conservatism with respect to preserving what the original x86-64 code is doing on an instruction-by-instruction basis.
I don’t think that this is actually a fault in Rosetta 2; I think it’s actually pretty much the only reasonable solution.
I don’t know how Rosetta 2’s translator is actually implemented internally, but my guess is that the translator is parsing the x86-64 machine code, generating some kind of IR, and then lowering that IR back to arm64 (who knows, maybe it’s even LLVM IR).
But, even if Rosetta 2 is generating some kind of IR, that IR at best can only correspond well to the IR that was generated by the last optimization pass in the original compilation to x86-64, and in any last optimization pass, a huge amount of higher level context is likely already lost from the original source program.
Short of doing heroic amounts of program analysis, there’s nothing Rosetta 2 can do about this lost higher level context, and even if implementing all of that program analysis were worthwhile (which it almost certainly is not), there’s only so much that static analysis can do anyway.
I guess all of the above is a long way of saying: looking at the above example, I think Rosetta 2’s output is really impressive and noticeably better than I would have guessed beforehand, but at the same time the inherent advantage that natively compiling to arm64 has is obvious.</p>
<p>However, all of the above is just looking at the core functionality of the original function.
If we look at the arm64 assembly surrounding this core functionality in Listing 4 though, we can see some truly strange stuff.
The Rosetta 2 version is doing a ton of pointer arithmetic and moving around addresses and stuff, and operands seem to be passed into the function using the wrong registers (<code class="language-plaintext highlighter-rouge">x7</code> instead of <code class="language-plaintext highlighter-rouge">x0</code>).
What is this stuff all about?
The answer lies in how the Rosetta 2 runtime works, and in what makes a Rosetta 2 AOT Mach-O file different from a standard macOS Mach-O binary.</p>
<p>One key fundamental difference between Rosetta 2 AOT binaries and regular arm64 macOS binaries is that Rosetta 2 AOT binaries use <em>a completely different ABI</em> from standard arm64 macOS.
On Apple platforms, the ABI used for normal arm64 Mach-O binaries is largely based on the standard ARM-developed arm64 ABI <a href="https://developer.arm.com/documentation/den0024/a/The-ABI-for-ARM-64-bit-Architecture/Register-use-in-the-AArch64-Procedure-Call-Standard/Parameters-in-general-purpose-registers">[ARM Holdings 2015]</a>, with some small differences <a href="https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms">[Apple 2020]</a> in function calling conventions and how some data types are implemented and aligned.
However, Rosetta 2 AOT binaries use an arm64-ized version of the System V AMD64 ABI, with a direct mapping between x86_64 and arm64 registers <a href="https://ffri.github.io/ProjectChampollion/part1/">[Nakagawa 2021]</a>.
This different ABI means that intermixing native arm64 code and Rosetta 2 arm64 code is not possible (or at least not at all practical), and this difference is also the explanation for why the Rosetta 2 assembly uses unusual registers for passing parameters into the function.
In the standard arm64 ABI calling convention, registers <code class="language-plaintext highlighter-rouge">x0</code> through <code class="language-plaintext highlighter-rouge">x7</code> are used to pass function arguments 0 through 7, with the rest going on the stack.
In the System V AMD64 ABI calling convention, function arguments are passed using registers <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">rcx</code>, <code class="language-plaintext highlighter-rouge">r8</code>, and <code class="language-plaintext highlighter-rouge">r9</code> for arguments 0 through 5 respectively, with everything else on the stack in reverse order.
In the arm64-ized version of the System V AMD64 ABI that Rosetta 2 AOT uses, the x86-64 <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">rcx</code>, <code class="language-plaintext highlighter-rouge">r8</code>, and <code class="language-plaintext highlighter-rouge">r9</code> registers map to the arm64 <code class="language-plaintext highlighter-rouge">x7</code>, <code class="language-plaintext highlighter-rouge">x6</code>, <code class="language-plaintext highlighter-rouge">x2</code>, <code class="language-plaintext highlighter-rouge">x1</code>, <code class="language-plaintext highlighter-rouge">x8</code>, and <code class="language-plaintext highlighter-rouge">x9</code>, respectively <a href="https://ffri.github.io/ProjectChampollion/part1/">[Nakagawa 2021]</a>.
So, that’s why in line 6 of Listing 4 we see a load from an address stored in <code class="language-plaintext highlighter-rouge">x7</code> instead of <code class="language-plaintext highlighter-rouge">x0</code>, because <code class="language-plaintext highlighter-rouge">x7</code> maps to x86-64’s <code class="language-plaintext highlighter-rouge">rdi</code> register, which is the first register used for passing arguments in the System V AMD64 ABI <a href="https://wiki.osdev.org/System_V_ABI">[OSDev 2018]</a>.
If we look at the corresponding instruction on line 9 of Listing 2, we can see that the x86-64 code does indeed use a <code class="language-plaintext highlighter-rouge">mov</code> instruction from the address stored in <code class="language-plaintext highlighter-rouge">rdi</code> to get the first function argument.</p>
<p>As for all of the pointer arithmetic and address trickery in lines 23 through 28 of Listing 4, I’m not 100% sure what it is for, but I have a guess.
Earlier I mentioned that <code class="language-plaintext highlighter-rouge">.aot</code> binaries cannot run like a normal binary and instead require some special memory mapping to work; I think all of this pointer arithmetic may have to do with that.
The way that the Rosetta 2 runtime interacts with the AOT arm64 code is that both the runtime and the AOT arm64 code are mapped into the same memory space at startup and the program counter is set to the entry point of the Rosetta 2 runtime; while running, the AOT arm64 code frequently can jump back into the Rosetta 2 runtime because the Rosetta 2 runtime is what handles things like translating x86_64 addresses into addresses in the AOT arm64 code <a href="https://ffri.github.io/ProjectChampollion/part1/">[Nakagawa 2021]</a>.
The Rosetta 2 runtime also directs system calls to native frameworks, which helps improve performance; this property of the Rosetta 2 runtime means that if an x86-64 binary does most of its work by calling macOS frameworks, the translated Rosetta 2 AOT binary can still run very close to native speed (as an interesting aside: Microsoft is adding a much more generalized version of this concept to Windows 11’s counterpart to Rosetta 2: Windows 11 on ARM will allow arbitrary mixing of native arm64 code and translated x86-64 code <a href="https://blogs.windows.com/windowsdeveloper/2021/06/28/announcing-arm64ec-building-native-and-interoperable-apps-for-windows-11-on-arm/">[Sweetgall 2021]</a>).
Finally, when a Rosetta 2 AOT binary is run, not only are the AOT arm64 code and the Rosetta 2 runtime mapped into the running program’s memory; the original x86-64 binary is mapped in as well.
The AOT binary that Rosetta 2 generates does not actually contain any constant data from the original x86-64 binary; instead, the AOT file references the constant data from the x86-64 binary, which is why the x86-64 binary also needs to be loaded in.
My guess is that the pointer arithmetic stuff happening in the end of Listing 4 is possibly either to calculate offsets to stuff in the x86-64 binary, or to calculate offsets into the Rosetta 2 runtime itself.</p>
<p>Now that we have a better understanding of what Rosetta 2 is actually doing under the hood and how good the translated arm64 code is compared with natively compiled arm64 code, how does Rosetta 2 actually perform in the real world?
I compared Takua Renderer running as native arm64 code versus as x86-64 code running through Rosetta 2 on four different scenes, and generally running through Rosetta 2 yielded about 65% to 70% of the performance of running as native arm64 code.
The results section at the end of this post contains the detailed numbers and data.
Generally, I’m very impressed with this amount of performance for emulating x86-64 code on an arm64 processor, especially when considering that with high-performance code like Takua Renderer, Rosetta 2 has close to zero opportunities to provide additional performance by calling into native system frameworks.
As can be seen in the <a href="#perftesting">data in the results section</a>, even more impressive is the fact that even running at 70% of native speed, x86-64 Takua Renderer running on the M1 chip through Rosetta 2 is often on-par with or <em>even faster</em> than x86-64 Takua Renderer running natively on a contemporaneous current-generation 2019 16-inch MacBook Pro with a 6-core Intel Core i7-9750H processor!</p>
<p><strong>TSO Memory Ordering on the M1 Processor</strong></p>
<p>As I covered extensively in my previous post, one major crucial architectural difference between arm64 and x86-64 is in memory ordering: arm64 is a weakly ordered architecture, whereas x86-64 is a strongly ordered architecture <a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">[Preshing 2012]</a>.
Any system emulating x86-64 binaries on an arm64 processor needs to overcome this memory ordering difference, which means emulating strong memory ordering on a weak memory architecture.
Unfortunately, doing this memory ordering emulation in software is extremely difficult and extremely inefficient, since emulating strong memory ordering on a weak memory architecture means providing stronger memory ordering guarantees than the hardware actually provides.
This memory ordering emulation is widely understood to be one of the main reasons why Microsoft’s x86 emulation mode for Windows on ARM incurs a much higher performance penalty compared with Rosetta 2, even though the two systems have broadly similar architectures <a href="https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on-arm-x86-emulation">[Hickey et al. 2021]</a> at a high level.</p>
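<p>To make the cost of doing this in software a bit more concrete: under x86-64’s TSO model, ordinary loads already behave roughly like acquire loads and ordinary stores already behave roughly like release stores. An emulator running x86-64 code on a weakly ordered arm64 CPU therefore cannot translate a plain x86-64 <code class="language-plaintext highlighter-rouge">mov</code> into a plain arm64 <code class="language-plaintext highlighter-rouge">ldr</code> or <code class="language-plaintext highlighter-rouge">str</code>; to be safe, it has to conservatively emit acquire/release operations (or insert explicit barriers) on essentially every memory access. The sketch below expresses that idea in C++ terms; it is purely illustrative and is not how any real emulator is actually implemented:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <cstdint>

// Purely illustrative: a guest memory word touched by emulated x86-64 code.
std::atomic<uint64_t> guestMemoryWord{0};

void emulatedX86Store(uint64_t value) {
    // A plain x86-64 store already carries release-like ordering under TSO, so
    // a conservative software translation on arm64 has to use a release store
    // (which compiles to stlr) instead of a plain store (str).
    guestMemoryWord.store(value, std::memory_order_release);
}

uint64_t emulatedX86Load() {
    // Likewise, a plain x86-64 load already carries acquire-like ordering, so
    // the conservative translation is an acquire load (ldar) instead of a
    // plain load (ldr).
    return guestMemoryWord.load(std::memory_order_acquire);
}

int main() {
    emulatedX86Store(42);
    return emulatedX86Load() == 42 ? 0 : 1;
}
</code></pre></div></div>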
<p>Apple’s solution to the difficult problem of emulating strong memory ordering in software was to… just completely bypass the problem altogether.
Rosetta 2 does nothing whatsoever to emulate strong memory ordering in software; instead, Rosetta 2 provides strong memory ordering through <em>hardware</em>.
Apple’s M1 processor has an unusual feature for an ARM processor: the M1 processor has optional total store memory ordering (TSO) support!
By default, the M1 processor only provides the weak memory ordering guarantees that the arm64 architecture specifies, but for x86-64 binaries running under Rosetta 2, the M1 processor is capable of switching to strong memory ordering in hardware on a core-by-core basis.
This capability is a great example of the type of hardware-software integration that Apple is able to accomplish by owning and building the entire tech stack from the software all the way down to the silicon.</p>
<p>Actually, the M1 is not the first Apple Silicon chip to have TSO support.
The A12Z chip that was in the Apple Silicon DTK also has TSO support, and the A12Z is known to be a re-binned but otherwise identical variant of the A12X chip from 2018, so we can likely safely assume that the TSO hardware support has been present (albeit unused) as far back as the 2018 iPad Pro!
However, the M1 processor’s TSO implementation does have a significant leg up on the implementation in the A12Z.
Both the M1 and the A12Z implement a version of ARM’s big.LITTLE technology, where the processor contains two different types of CPU cores: lower-power energy-efficient cores, and high-power performance cores.
On the A12Z, hardware TSO support is only implemented in the high-power performance cores, whereas in the M1, hardware TSO support is implemented on both the efficiency and performance cores.
As a result, on the A12Z-based Apple Silicon DTK, Rosetta 2 can only use four out of eight total CPU cores on the chip, whereas on M1-based Macs, Rosetta 2 can use all eight CPU cores.</p>
<p>I should mention here that, interestingly, the A12Z and M1 are actually not the first ARM CPUs to implement TSO as the memory model <a href="https://threedots.ovh/blog/2021/02/cpus-with-sequential-consistency/">[Threedots 2021]</a>.
Remember, when ARM specifies weak ordering in the architecture, what this actually means is that any arm64 implementation can actually choose to have any kind of stronger memory model since code written for a weaker memory model should also work correctly on a stronger memory model; only going the other way doesn’t work.
NVIDIA’s Denver and Carmel CPU microarchitectures (found in various NVIDIA Tegra and Xaviar system-on-a-chips) are also arm64 designs that implement a sequentially consistency memory model.
If I had to guess, I would guess that Denver and Carmel’s sequential consistency memory model is a legacy of the Denver Projects’s origins as a project to build an x86-64 CPU; the project was shifted to arm64 before release.
Fujitsu’s A64FX processor is another arm64 design that implements TSO as its memory model, which makes sense since the A64FX processor is meant for use in supercomputers as a successor to Fujitsu’s previous SPARC-based supercomputer processors, which also implemented TSO.
However, to the best of my knowledge, Apple’s A12Z and M1 are unique in their ability to execute in <em>both</em> the usual weak ordering mode and TSO mode.</p>
<p>To me, probably the most interesting thing about hardware TSO support in Apple Silicon is that switching ability.
Even more interesting is that the switching ability doesn’t require a reboot or anything like that; each core can be <em>independently</em> switched between strong and weak memory ordering on-the-fly at runtime through software.
On Apple Silicon processors, hardware TSO support is enabled by modifying a special register named <code class="language-plaintext highlighter-rouge">actlr_el1</code>; this register is actually <a href="https://developer.arm.com/documentation/100442/0100/register-descriptions/aarch64-system-registers/actlr-el1--auxiliary-control-register--el1">defined by the arm64 specification</a> as an implementation-defined auxiliary control register.
Since <code class="language-plaintext highlighter-rouge">actlr_el1</code> is implementation-defined, Apple has chosen to use it for toggling TSO and possibly for toggling other, so far publicly unknown special capabilities.
However, the <code class="language-plaintext highlighter-rouge">actlr_el1</code> register, being a special register, cannot be modified by normal code; modifications to <code class="language-plaintext highlighter-rouge">actlr_el1</code> can only be done by the kernel, and the only thing in macOS that the kernel enables TSO for is Rosetta 2…</p>
<p>…at least by default!
Shortly after Apple started shipping out Apple Silicon DTKs last year, <a href="https://saagarjha.com">Saagar Jha</a> figured out how to allow any program to toggle TSO mode through <a href="https://github.com/saagarjha/TSOEnabler">a custom kernel extension</a>.
The way the TSOEnabler kext works is extremely clever; the kext searches through the kernel to find where the kernel is modifying <code class="language-plaintext highlighter-rouge">actlr_el1</code> and then traces backwards to figure out what pointer the kernel is reading a flag from for whether or not to enable TSO mode.
Instead of setting TSO mode itself, the kext then intercepts the pointer to the flag and writes to it, allowing the kernel to handle all of the TSO mode setup work since there’s some other stuff that needs to happen in addition to modifying <code class="language-plaintext highlighter-rouge">actlr_el1</code>.
Out of sheer curiosity, I compiled the TSOEnabler kext and installed it on my M1 Mac Mini to give it a try!
I don’t suggest installing and using TSOEnabler casually, and definitely not for normal everyday use; installing a custom self-compiled, unsigned kext on modern macOS requires disabling SIP.
However, I already had SIP disabled due to my earlier Rosetta 2 AOT exploration, and so I figured why not give this a shot before I reset everything and reenable SIP.</p>
<p>The first thing I wanted to try was a simple test to confirm that the TSOEnabler kext was working correctly.
In my last post, I wrote about a case where weak memory ordering was exposing a bug in some code written around incrementing an atomic integer; the “canonical” example of this specific type of situation is <a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">Jeff Preshing’s multithreaded atomic integer incrementer example</a> using <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code>.
I adapted Jeff Preshing’s example for my test; in this test, two threads both increment a shared integer counter 1000000 times, with exclusive access to the integer guarded using an atomic integer flag.
Operations on the atomic integer flag use <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code>.
On strongly-ordered CPUs, using <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> works fine and at the end of the program, the value of the shared integer counter is always 2000000 as expected.
However, on weakly-ordered CPUs, weak memory ordering means that two threads can end up in a race condition to increment the shared integer counter; as a result, on weakly-ordered CPUs, at the end of the program the value of the shared integer counter is very often something slightly less than 2000000.
The key modification I made to this test program was to enable the M1 processor’s hardware TSO mode for each thread; if hardware TSO mode is correctly enabled, then the value of the shared integer counter should always end up being 2000000.
If you want to try for yourself, Listing 5 below includes the test program in its entirety; compile using <code class="language-plaintext highlighter-rouge">c++ tsotest.cpp -std=c++11 -o tsotest</code>.
The test program takes a single input parameter: <code class="language-plaintext highlighter-rouge">1</code> to enable hardware TSO mode, and anything else to leave TSO mode disabled.
Remember, to use this program, you must have compiled and installed the TSOEnabler kernel extension mentioned above.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <iostream>
#include <thread>
#include <sys/sysctl.h>
static void enable_tso(bool enable_) {
int enable = int(enable_);
size_t size = sizeof(enable);
int err = sysctlbyname("kern.tso_enable", NULL, &size, &enable, size);
assert(err == 0);
}
int main(int argc, char** argv) {
bool useTSO = false;
if (argc > 1) {
useTSO = std::stoi(std::string(argv[1])) == 1 ? true : false;
}
std::cout << "TSO is " << (useTSO ? "enabled" : "disabled") << std::endl;
std::atomic<int> flag(0);
int sharedValue = 0;
auto counter = [&](bool enable) {
enable_tso(enable);
int count = 0;
while (count < 1000000) {
int expected = 0;
if (flag.compare_exchange_strong(expected, 1, std::memory_order_relaxed)) {
// Lock was successful
sharedValue++;
flag.store(0, std::memory_order_relaxed);
count++;
}
}
};
std::thread thread1([&]() { counter(useTSO); });
std::thread thread2([&]() { counter(useTSO); });
thread2.join();
thread1.join();
std::cout << sharedValue << std::endl;
}
</code></pre></div></div>
<div class="codecaption">Listing 5: Jeff Preshing's weakly ordered atomic integer test program, modified to support using the M1 processor's hardware TSO mode.</div>
<p>Running my test program indicated that the kernel extension was working properly!
In the screenshot below, I check that the Mac I’m running on has an arm64 processor, then I compile the test program and check that the output is a native arm64 binary, and then I run the test program four times each with and without hardware TSO mode enabled.
As expected, with hardware TSO mode disabled, the program counts slightly less than 2000000 increments on the shared atomic counter, whereas with hardware TSO mode enabled, the program counts exactly 2000000 increments every time:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/tsotest.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/tsotest.png" alt="Figure 4: Building, examining, and running the test program to demonstrate hardware TSO mode disabled and then enabled." /></a></p>
<p>Being able to enable hardware TSO mode in a native arm64 binary outside of Rosetta 2 actually does have some practical uses.
After I confirmed that the kernel extension was working correctly, I temporarily hacked hardware TSO mode into Takua Renderer’s native arm64 version, which allowed me to further verify that everything was working correctly with all of the various weakly ordered atomic fixes that I described in my previous post.
As mentioned in my previous post, comparing renders across different processor architectures is difficult for a variety of reasons, and previously comparing Takua Renderer running on a weakly ordered CPU versus on a strongly ordered CPU required comparing renders made on arm64 versus renders made on x86-64.
Using the M1’s hardware TSO mode though, I was able to compare renders made on exactly the same processor, which confirmed that everything works correctly!
After doing this test, I then removed the hardware TSO mode from Takua Renderer’s native arm64 version.</p>
<p>One silly idea I tried was to <em>disable</em> hardware TSO mode from inside of Rosetta 2, just to see what would happen.
Rosetta 2 does not support running x86-64 kernel extensions on arm64; all macOS kernel extensions must be native to the architecture they are running on.
However, as mentioned earlier, the Rosetta 2 runtime bridges system framework calls from inside of x86-64 binaries to their native arm64 counterparts, and this includes <code class="language-plaintext highlighter-rouge">sysctl</code> calls!
So we can actually call <code class="language-plaintext highlighter-rouge">sysctlbyname("kern.tso_enable")</code> from inside of an x86-64 binary running through Rosetta 2, and Rosetta 2 will pass the call along correctly to the native TSOEnabler kernel extension, which will then properly set hardware TSO mode.
For a simple test, I added a bit of code to test if a binary is running under Rosetta 2 or not and compiled the test program from Listing 5 for x86-64.
For the sake of completeness, here is how to check if a process is running under Rosetta 2; this code sample was provided by Apple in <a href="https://developer.apple.com/videos/play/wwdc2020/10686/">a WWDC 2020 talk about Apple Silicon</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Use "sysctl.proc_translated" to check if running in Rosetta
// Returns 1 if running in Rosetta
int processIsTranslated() {
int ret = 0;
size_t size = sizeof(ret);
// Call the sysctl and if successful return the result
if (sysctlbyname("sysctl.proc_translated", &ret, &size, NULL, 0) != -1)
return ret;
// If "sysctl.proc_translated" is not present then must be native
if (errno == ENOENT)
return 0;
return -1;
}
</code></pre></div></div>
<div class="codecaption">Listing 6: Example code from Apple on how to check if the current process is running through Rosetta 2.</div>
<p>In Figure 5, I build the test program from Listing 5 as an x86-64 binary, with the Rosetta 2 detection function from Listing 6 added in.
I then check that the system architecture is arm64 and that the compiled program is x86-64, and run the test program with TSO disabled from inside of Rosetta 2.
The program reports that it is running through Rosetta 2 and reports that TSO is disabled, and then proceeds to report slightly less than 2000000 increments to the shared atomic counter:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/tsotest.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/tsotest.png" alt="Figure 5: Building, examining, and running the test program to demonstrate hardware TSO mode disabled and then enabled." /></a></p>
<p>Of course, being able to disable hardware TSO mode from inside of Rosetta 2 is only a curiosity; I can’t really think of any practical reason why anyone would ever want to do this.
I guess one possible answer is to try to claw back some performance whilst running through Rosetta 2, since the hardware TSO mode does have a tangible performance impact, but this answer isn’t actually valid, since there is no guarantee that x86-64 binaries running through Rosetta 2 will work correctly with hardware TSO mode enabled.
The simple example here only works precisely because it is extremely simple; I also tried hacking disabling hardware TSO mode into the x86-64 version of Takua Renderer and running that through Rosetta 2.
The result was that this hacked version of Takua Renderer would run for only a fraction of a second before running into a hard crash from somewhere inside of TBB.
More complex x86-64 programs not working correctly or even crashing with hardware TSO mode disabled shouldn’t be surprising, since the x86-64 code itself can have assumptions about strong memory ordering baked into whatever optimizations the code was compiled with.
As mentioned earlier, running a program written and compiled with weak memory ordering assumptions on a stronger memory model should work correctly, but running a program written and compiled with strong memory ordering assumptions on a weaker memory model can cause problems.</p>
<p>Speaking of the performance of hardware TSO mode, the last thing I tried was measuring the performance impact of enabling hardware TSO mode.
I hacked enabling hardware TSO mode into the native arm64 version of Takua Renderer, with the idea being that by comparing the Rosetta 2, custom TSO-enabled native arm64, and default TSO-disabled native arm64 versions of Takua Renderer, I could get a better sense of exactly how much performance cost there is to running the M1 with TSO enabled, and how much of the performance cost of Rosetta 2 comes from less efficient translated arm64 code versus from TSO-enabled mode.
The <a href="#perftesting">results section at the end of this post</a> contains the exact numbers and data for the four scenes that I tested; the general trend I found was that native arm64 code with hardware TSO enabled ran about 10% to 15% slower than native arm64 code with hardware TSO disabled.
When comparing with Rosetta 2’s overall performance, I think we can reasonably estimate that on the M1 chip, hardware TSO is responsible for somewhere between a third to a half of the performance discrepancy between Rosetta 2 and native weakly ordered arm64 code.</p>
<p>Apple Silicon’s hardware TSO mode is a fascinating example of Apple extending the base arm64 architecture and instruction set to accelerate application-specific needs.
Hardware TSO mode to support and accelerate Rosetta 2 is just the start; Apple Silicon is well known to already contain some other interesting custom extensions as well.
For example, Apple Silicon contains an entirely new, so far undocumented arm64 ISA extension centered around doing fast matrix operations for Apple’s “Accelerate” framework, which supports various deep learning and image processing applications <a href="https://gist.githubusercontent.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f/raw/60d491aeb70863363af1d4bdf4b8ade9be486af3/aarch64_amx.py">[Johnson 2020]</a>.
This extension, called AMX (for Apple Matrix coprocessor), is separate but likely related to the “Neural Engine” hardware <a href="https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1">[Engheim 2021]</a> that ships on the M1 chip alongside the M1’s arm64 processor and custom Apple-designed GPU.
Recent open-source code releases from Apple <a href="https://mobile.twitter.com/_saagarjha/status/1398959235954745346">also hint at</a> future Apple Silicon chips having dedicated built-in hardware for doing branch prediction around Objective-C’s objc_msgSend, which would considerably accelerate message passing in Cocoa apps.</p>
<p><strong>Embree on arm64 using sse2neon</strong></p>
<p>As mentioned earlier, porting Takua and Takua’s dependencies was relatively easy and straightforward and in large part worked basically out-of-the-box, because Takua and most of Takua’s dependencies are written in vanilla C++.
Gotchas like memory-ordering correctness in atomic and multithreaded code aside, porting vanilla C++ code between x86-64 and arm64 largely just involves recompiling, and popular modern compilers such as Clang, GCC, and MSVC all have mature, robust arm64 backends today.
However, for code written using inline assembly or architecture-specific vector SIMD intrinsics, recompilation is not enough to get things working on a different processor architecture.</p>
<p>A huge proportion of the raw compute power in modern processors is actually located in vector <a href="https://en.wikipedia.org/wiki/SIMD">SIMD instruction set extensions</a>, such as the various SSE and AVX extensions found in modern x86-64 processors and the NEON and upcoming SVE extensions found in arm64.
For workloads that can benefit from vectorization, using SIMD extensions means up to a 4x speed boost over scalar code when using SSE or NEON, and potentially even more using AVX or SVE.
One way to utilize SIMD extensions is just to write scalar C++ code like normal and let the compiler auto-vectorize the code at compile-time.
However, relying on auto-vectorization to leverage SIMD extensions in practice can be surprisingly tricky.
In order for compilers to be able to efficiently auto-vectorize code that was written to be scalar, compilers need to be able to deduce and infer an enormous amount of context and knowledge about what the code being compiled actually does, and doing this kind of work is extremely difficult and extremely prone to defeat by edge cases, complex scenarios, or even just straight up implementation bugs.
The end result is that getting scalar C++ code to go through auto-vectorization well in practice ends up requiring a lot of deep knowledge about how the compiler’s auto-vectorization implementation actually works under the hood, and small innocuous changes can often suddenly lead to the compiler falling back to generating completely scalar assembly.
Without a robust performance test suite, these fallbacks can happen unbeknownst to the programmer; I like the term that my friend <a href="https://twitter.com/superfunc">Josh Filstrup</a> uses for these scenarios: “real rugpull moments”.
Most high-performance applications that require good vectorization usually rely on at least one of several other options: write code directly in assembly utilizing SIMD instructions, write code using SIMD intrinsics, or write code for use with <a href="https://ispc.github.io">ISPC: the Intel SPMD Program Compiler</a>.</p>
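<p>As a hedged illustration of how fragile auto-vectorization can be (the function names below are made up for this example, and the exact behavior depends heavily on the specific compiler, version, and flags), consider the following two nearly identical loops; the first is the kind of straight-line loop that mainstream compilers will usually vectorize at <code class="language-plaintext highlighter-rouge">-O3</code>, while the second only adds a data-dependent early exit, which is often enough to make the compiler silently fall back to purely scalar code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstddef>

// Straight-line loop with no aliasing ambiguity (thanks to __restrict) and no
// data-dependent control flow; mainstream compilers will usually auto-vectorize
// this at -O3, although that is never guaranteed.
void scale(float* __restrict out, const float* __restrict in, float k, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * k;
    }
}

// The same loop with a small, innocuous-looking change: an early exit that
// depends on the data. Depending on the compiler and flags, this is often
// enough to defeat auto-vectorization entirely.
size_t scaleUntilNegative(float* __restrict out, const float* __restrict in, float k, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0.0f) {
            return i;
        }
        out[i] = in[i] * k;
    }
    return n;
}
</code></pre></div></div>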
<p>Writing SIMD code directly in assembly is more or less just like writing regular assembly, just with different instructions and wider registers; SSE uses <code class="language-plaintext highlighter-rouge">XMM</code> registers and many SSE instructions end in either <code class="language-plaintext highlighter-rouge">SS</code> or <code class="language-plaintext highlighter-rouge">PS</code>, AVX uses <code class="language-plaintext highlighter-rouge">YMM</code> registers (with AVX-512 extending these to <code class="language-plaintext highlighter-rouge">ZMM</code> registers), and NEON uses <code class="language-plaintext highlighter-rouge">D</code> and <code class="language-plaintext highlighter-rouge">Q</code> registers.
Since writing directly in assembly is often not desirable for a variety of readability and ease-of-use reasons, writing vector code directly in assembly is not nearly as common as writing vector code in normal C or C++ using vector intrinsics.
Vector intrinsics are functions that look like regular functions from the outside, but within the compiler have a direct one-to-one or near one-to-one mapping to specific assembly instructions.
For SSE and AVX, vector intrinsics are typically found in headers named using the pattern <code class="language-plaintext highlighter-rouge">*mmintrin.h</code>, where <code class="language-plaintext highlighter-rouge">*</code> is a letter of the alphabet corresponding to a specific subset or version of either SSE or AVX (for example, <code class="language-plaintext highlighter-rouge">x</code> for SSE, <code class="language-plaintext highlighter-rouge">e</code> for SSE2, <code class="language-plaintext highlighter-rouge">n</code> for SSE4.2, <code class="language-plaintext highlighter-rouge">i</code> for AVX, etc.).
For NEON, vector intrinsics are typically found in <code class="language-plaintext highlighter-rouge">arm_neon.h</code>.
Vector intrinsics are commonly found in many high-performance codebases, but another powerful and increasingly popular way to vectorize code is by using ISPC.
ISPC compiles a special variant of the C programming language written using a <a href="https://en.wikipedia.org/wiki/SPMD">SPMD, or single-program-multiple-data</a>, programming model to run on SIMD execution units; the idea is that an ISPC program describes what a single lane in a vector unit does, and ISPC itself takes care of making that program run across all of the lanes of the vector unit <a href="https://doi.org/10.1109/InPar.2012.6339601">[Pharr and Mark 2012]</a>.
While this may sound superficially like a form of auto-vectorization, there’s a crucial difference that makes ISPC far more reliable in outputting good vectorized assembly: ISPC bakes a vectorization-friendly programming model directly into the language itself, whereas normal C++ has no such affordances that C++ compilers can rely on.
This SPMD model is broadly very similar to how writing a GPU kernel works, although there are some key differences between SPMD as a programming model and the <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads">SIMT model</a> that GPUs run on (namely, a SPMD program can be at a different point on each lane, whereas a SIMT program keeps the progress across all lanes in lockstep).
A big advantage of using ISPC over vector intrinsics or vector assembly is that ISPC code is basically just normal C code; in fact, ISPC programs can often compile as normal scalar C code with little to no modification.
Since the actual transformation to vector assembly is up to the compiler, writing code for ISPC is far more processor architecture independent than vector intrinsics are; ISPC today includes backends to generate SSE, AVX, and NEON binaries.
<a href="https://pharr.org/matt/blog/2018/04/30/ispc-all">Matt Pharr has a great blog post series</a> that goes into much more detail about the history and motivations behind ISPC and the benefits of using ISPC.</p>
<p>In general, graphics workloads tend to fit the bill well for vectorization, and as a result, graphics libraries often make extensive use of SIMD instructions (actually, a surprisingly large number of problem types can be vectorized, including even <a href="https://github.com/simdjson/simdjson">JSON parsing</a>).
Since SIMD intrinsics are architecture-specific, I didn’t fully expect all of Takua’s dependencies to compile right out of the box on arm64; I expected that a lot of them would contain chunks of code written using x86-64 SSE and/or AVX intrinsics!
However, almost all of Takua’s dependencies compiled without a problem either because they provided arm64 NEON or scalar C++ fallback codepaths for every SSE/AVX codepath, or because they rely on auto-vectorization by the compiler instead of using intrinsics directly.
OpenEXR is an example of the former, while OpenVDB and OpenSubdiv are examples of the latter.
Embree was the notable exception: Embree is heavily vectorized, with code implemented directly using SSE and/or AVX intrinsics and no alternative scalar C++ or arm64 NEON fallback, and Embree also provides an ISPC interface.
Starting with Embree v3.13.0, Embree now provides an arm64 NEON codepath as well, but at the time I first ported Takua to arm64, Embree didn’t come with anything other than SSE and AVX implementations.</p>
<p>Fortunately, Embree is actually written in such a way that porting Embree to different processor architectures with different vector intrinsics is, at least in theory, relatively straightforward.
The Embree codebase internally is written as several different “layers”, where the bottommost layer is located in <code class="language-plaintext highlighter-rouge">embree/common/simd/</code> in the Embree source tree.
As one might be able to guess from the name, this bottommost layer is where all of the core SIMD functionality in Embree is implemented; this part of the codebase implements SIMD wrappers for things like 4/8/16 wide floats, SIMD math operations, and so on.
The rest of the Embree codebase doesn’t really contain many direct vector intrinsics at all; the parts of Embree that actually implement BVH construction and traversal and ray intersection all call into this base SIMD library.
As suggested by <a href="https://ingowald.blog/2018/07/15/cfi-embree-on-arm-power/">Ingo Wald in a 2018 blog post</a>, porting Embree to use something other than SSE/AVX mostly requires just reimplementing this base SIMD wrapper layer, and the rest of Embree should more or less “just work”.</p>
<p>In his blog post, Ingo mentioned experimenting with replacing all of Embree’s base SIMD layer with scalar implementations of all of the vectorized code.
Back in early 2020, as part of my effort to get Takua up and running on arm64 Linux, I actually tried doing a scalar rewrite of the base SIMD layer of Embree as well as a first attempt at porting to arm64.
Overall the process to rewrite to scalar was actually very straightforward; most things were basically just replacing a function that did something with float4 inputs using SSE instructions with a simple loop that iterates over the four floats in a float4.
I did find that in addition to rewriting all of the SIMD wrapper functions to replace SSE intrinsics with scalar implementations, I also had to replace some straight-up inlined x86-64 assembly with equivalent compiler intrinsics; basically all of this code lives in <code class="language-plaintext highlighter-rouge">common/sys/intrinsics.h</code>.
None of the inlined assembly replacement was very complicated either though; most of it was things like replacing an inlined assembly call to x86-64’s <code class="language-plaintext highlighter-rouge">bsf</code> bit-scan-forward instruction with a call to the more portable <code class="language-plaintext highlighter-rouge">__builtin_ctz()</code> trailing-zero-count builtin compiler function.
Embree’s build system also required modifications; since I was just doing this as an initial test, I did a terrible hack job on the CMake scripts and, with some troubleshooting, got things building and running on arm64 Linux.
Unfortunately, the performance of my quick-and-rough scalar Embree port was… very disappointing.
I had hoped that the compiler would be able to do a decent job of autovectorizing the scalar reimplementations of all of the SIMD code, but overall my scalar Embree port on x86-64 was between three and four times slower than standard SSE Embree, which indicated that the compiler basically hadn’t effectively autovectorized anything at all.
This level of performance regression basically meant that my scalar Embree port wasn’t actually significantly faster than Takua’s own internal scalar BVH implementation; the disappointing performance combined with how hacky and rough my scalar Embree port was led me to abandon using Embree on arm64 Linux for the time being.</p>
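<p>To give a sense of what that scalar rewrite looked like, here is a heavily simplified sketch (this is not Embree’s actual code; Embree’s real 4-wide float type and operators are more involved) of replacing one SSE-backed operation in a 4-wide float wrapper with a plain scalar loop over the four lanes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Heavily simplified sketch of a 4-wide float wrapper; not Embree's actual code.
struct vfloat4 {
    float v[4];
};

// The SSE-backed version of an operation in Embree's base SIMD layer looks
// conceptually like this:
//   inline vfloat4 operator+(const vfloat4& a, const vfloat4& b) {
//       return vfloat4(_mm_add_ps(a.m128, b.m128));
//   }

// The scalar replacement keeps the exact same interface, but just loops over
// the four lanes; the hope was that the compiler would auto-vectorize this,
// which in practice it largely did not.
inline vfloat4 operator+(const vfloat4& a, const vfloat4& b) {
    vfloat4 r;
    for (int i = 0; i < 4; i++) {
        r.v[i] = a.v[i] + b.v[i];
    }
    return r;
}
</code></pre></div></div>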
<p>A short while later in the spring of 2020 though, I remembered that Syoyo Fujita had already successfully ported Embree to arm64 with vectorization support!
Actually, Syoyo had started his <a href="https://github.com/lighttransport/embree-aarch64">Embree-aarch64</a> fork three years earlier in 2017 and had kept the project up-to-date with each new upstream official Embree release; I had just forgotten about the project until it popped up in my Twitter feed one day.
The approach that Syoyo took to getting vectorization working in the Embree-aarch64 fork was by using the <a href="https://github.com/DLTcollab/sse2neon">sse2neon</a> project, which implements SSE intrinsics on arm64 using NEON instructions and serves as a drop-in replacement for the various x86-64 <code class="language-plaintext highlighter-rouge">*mmintrin.h</code> headers.
Using sse2neon is actually the same strategy that had previously been used by <a href="https://mightynotes.wordpress.com/2017/01/24/porting-intel-embree-to-arm/">Martin Chang in 2017</a> to port Embree 2.x to work on arm64; Martin’s earlier effort provided the proof-of-concept that paved the way for Syoyo to fork Embree 3.x into Embree-aarch64.
Building the Embree-aarch64 fork on arm64 worked out-of-the-box, and on my Raspberry Pi 4, using Embree-aarch64 with Takua’s Embree backend produced a performance increase over Takua’s internal BVH implementation that was in the general range of what I expected.</p>
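<p>The core idea behind sse2neon is simple but powerful: provide drop-in replacements for the x86-64 <code class="language-plaintext highlighter-rouge">*mmintrin.h</code> headers in which each SSE type and intrinsic is reimplemented in terms of NEON types and intrinsics, so that existing SSE code can compile on arm64 without modification. Below is a bare-bones sketch of the idea (sse2neon’s actual implementation is more careful, going through reinterpret helpers so that the same 128-bit value can be viewed as either floats or integers):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Bare-bones sketch of the sse2neon idea; not sse2neon's actual implementation.
#include <arm_neon.h>

// Back SSE's 128-bit float vector type with NEON's 4-wide float type.
typedef float32x4_t __m128;

// Reimplement SSE intrinsics one-to-one on top of NEON intrinsics.
static inline __m128 _mm_add_ps(__m128 a, __m128 b) {
    return vaddq_f32(a, b);
}

static inline __m128 _mm_mul_ps(__m128 a, __m128 b) {
    return vmulq_f32(a, b);
}

// With enough of these wrappers in place, code written against SSE intrinsics
// compiles unmodified on arm64.
</code></pre></div></div>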
<p>Taking a look at the process that was taken to get Embree-aarch64 to a production-ready state with results that matched x86-64 Embree exactly provides a lot of interesting insights into how NEON works versus how SSE works.
In my previous post I wrote about how getting identical floating point behavior between different processor architectures can be challenging for a variety of reasons; getting floating point behavior to match between NEON and SSE is even harder!
Various NEON instructions such as <code class="language-plaintext highlighter-rouge">rcp</code> and <code class="language-plaintext highlighter-rouge">rsqrt</code> have different levels of accuracy from their corresponding SSE counterparts, which required the Embree-aarch64 project to <a href="https://github.com/lighttransport/embree-aarch64/issues/24">implement more accurate versions</a> of some SSE intrinsics than what sse2neon provided at the time; a lot of these improvements were later contributed back to sse2neon.
I originally was planning to include a deep dive into comparing SSE, NEON, ISPC, sse2neon, and SSE instructions running on Rosetta 2 as part of this post, but the writeup for that comparison has now gotten so large that it’s going to have to be its own post as a later follow-up to this post; stay tuned!</p>
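<p>As a hedged sketch of what those accuracy fixes look like conceptually (the actual Embree-aarch64 and sse2neon code differs in the details), NEON’s hardware reciprocal estimate on its own is noticeably less precise than SSE’s reciprocal instruction, but running the estimate through one or two Newton-Raphson refinement steps using NEON’s reciprocal step instruction brings the result much closer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <arm_neon.h>

// Conceptual sketch of a higher-accuracy 4-wide reciprocal on NEON; the actual
// Embree-aarch64 / sse2neon implementations differ in the details.
static inline float32x4_t recipRefined(float32x4_t x) {
    // Start with the hardware's low-precision reciprocal estimate.
    float32x4_t e = vrecpeq_f32(x);
    // Each Newton-Raphson step computes e = e * (2 - x * e); vrecpsq_f32
    // provides the (2 - x * e) part.
    e = vmulq_f32(e, vrecpsq_f32(x, e));
    e = vmulq_f32(e, vrecpsq_f32(x, e));
    return e;
}
</code></pre></div></div>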
<p>As a bit of an aside: the history of the sse2neon project is a great example of a community forming to build an open-source project around a new need.
The sse2neon project was originally started by John W. Ratcliff at NVIDIA along with a few other NVIDIA folks and <a href="https://github.com/jratcliff63367/sse2neon">implemented only a small subset of SSE</a> that was just enough for their own needs.
However, after posting the project to Github with the MIT license, a community gradually formed around sse2neon and fleshed it out into a full project with full coverage of MMX and all versions of SSE from SSE1 all the way through SSE4.2.
Over the years sse2neon has seen <a href="https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h#L9">contributions and improvements</a> from NVIDIA, Amazon, Google, the Embree-aarch64 project, the Blender project, and recently Apple as part of Apple’s larger slew of contributions to various projects to improve arm64 support for Apple Silicon.</p>
<p>Starting with Embree v3.13.0, released in May 2021, the official main Embree project now has also gained full support for arm64 NEON; I have since switched Takua Renderer’s arm64 builds from using the Embree-aarch64 fork to using the new official arm64 support in Embree v3.13.0.
The approach the official Embree project takes is directly based off of the work that Syoyo Fujita and others did in the Embree-aarch64 fork; sse2neon is used to emulate SSE, and the same math precision improvements that were made in Embree-aarch64 were also adopted upstream by the official Embree project.
Much like Embree-aarch64, the arm64 NEON backend for Embree v3.13.0 does not include ISPC support, even though ISPC has an arm64 NEON backend as well; maybe this will come in the future.
Brecht Van Lommel from the Blender project seems to have done <a href="https://github.com/embree/embree/pull/316">most of the work</a> to upstream Embree-aarch64’s changes, with additional work and additional optimizations from Sven Woop on the Intel Embree team.
Interestingly and excitingly, <a href="https://github.com/embree/embree/pull/330">Apple also recently submitted a patch</a> to the official Embree project that adds AVX2 support on arm64 by treating each 8-wide AVX value as a pair of 4-wide NEON values.</p>
<p><strong>(More) Differences in arm64 versus x86-64</strong></p>
<p>In my previous post and in this post, I’ve covered a bunch of interesting differences and quirks that I ran into and had to take into account while porting from x86-64 to arm64.
There are, of course, far more differences that I didn’t touch on.
However, in this small section, I thought I’d list a couple more small but interesting differences that I ran into and had to think about.</p>
<ul>
<li>arm64 and x86-64 handle float-to-int conversions slightly differently for some edge cases. Specifically, for edge values such as converting a float set to <code class="language-plaintext highlighter-rouge">INF</code> into a uint32_t, arm64 will make a best attempt to find the nearest possible integer to convert to, which would be 4294967295. x86-64, on the other hand, treats the <code class="language-plaintext highlighter-rouge">INF</code> case as basically undefined behavior and defaults to just zero. In path tracing code where occasional infinite values need to be handled for things like edge cases in sampling Dirac distributions, some care needs to be taken to make sure that the renderer is understanding and processing <code class="language-plaintext highlighter-rouge">INF</code> values correctly on both arm64 and x86-64 (see the short example after this list).</li>
<li>Similarly, implicit conversion from signed integers to unsigned integers can have some different behavior between the two platforms. On arm64, negative signed integers get trimmed to zero when implicitly converted to an unsigned integer; for code that must cast between signed and unsigned integers, care must be taken to make sure that all conversions are explicitly cast and that the edge case behavior on arm64 and x86-64 are accounted for.</li>
<li>The signedness of <code class="language-plaintext highlighter-rouge">char</code> is platform specific and defaults to being signed on x86-64 but defaults to being unsigned on ARM architectures <a href="https://www.drdobbs.com/architecture-and-design/portability-the-arm-processor/184405435#">[Harmon 2003]</a>, including arm64. For custom string processing functions, this may have to be taken into account.</li>
<li>x86-64 is always little-endian, but arm64 is a <a href="https://en.wikipedia.org/wiki/Endianness#Bi-endianness">bi-endian</a> architecture that can be either little-endian or big-endian, as set by the operating system at startup time. Most Linux flavors, including Fedora, default to little-endian on arm64, and Apple’s various operating systems all exclusively use little-endian mode on arm64 as well, so this shouldn’t be too much of a problem for most use cases. However, for software that does expect to have to run on both little and big endian systems, endianness has to be taken into account for reading/writing/handling binary data. For example, Takua has a checkpointing system that basically dumps state information from the renderer’s memory straight to disk; these checkpoint files would need to have their endianness checked and handled appropriately if I were to make Takua bi-endian. However, since I don’t expect to ever run my own hobby stuff on a big-endian system, I just have Takua check the endianness at startup right now and refuse to run if the system is big-endian.</li>
</ul>
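<p>Here is a tiny example program illustrating the float-to-integer and char signedness items above; note that converting a float <code class="language-plaintext highlighter-rouge">INF</code> to an unsigned integer is undefined behavior as far as the C++ standard is concerned, so neither architecture’s result is guaranteed by the language itself:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
    // Converting a float INF to uint32_t: arm64 saturates to 4294967295,
    // while x86-64 typically produces 0. Formally this is undefined behavior
    // in C++, so neither result is guaranteed by the language.
    float inf = std::numeric_limits<float>::infinity();
    uint32_t fromInf = static_cast<uint32_t>(inf);
    printf("uint32_t from INF: %u\n", fromInf);

    // char signedness is platform-defined: signed by default on x86-64,
    // unsigned by default on arm64, so this prints -1 on x86-64 and 255 on arm64.
    char c = char(-1);
    printf("char(-1) as int: %d\n", int(c));

    return 0;
}
</code></pre></div></div>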
<p>For more details to look out for when porting x86-64 code to arm64 code on macOS specifically, Apple’s developer documentation <a href="https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code">has a whole article</a> covering various things to consider.
Another fantastic resource for diving into arm64 assembly is Howard Oakley’s <a href="https://eclecticlight.co/2021/07/27/code-in-arm-assembly-rounding-and-arithmetic/">“Code in ARM Assembly” series</a>, which covers arm64 assembly programming on Apple Silicon in extensive detail (the bottom of each article in Howard Oakley’s series contains a table of contents linking out to all of the previous articles in the series).</p>
<div id="perftesting"></div>
<p><strong>(More) Performance Testing</strong></p>
<p>In my previous post, I included performance testing results from my initial port to arm64 Linux, running on a Raspberry Pi 4B.
Now that I have Takua Renderer up and running on a much more powerful M1 Mac Mini with 16 GB of memory, how does performance look on “big” arm64 hardware?
Last time around the machines / processors I compared were a Raspberry Pi 4B, which uses a Broadcom BCM2711 CPU with 4 Cortex-A72 cores dating back to 2015, a 2015 MacBook Air with a 2 core / 4 thread Intel Core i5-5250U CPU, and as an extremely unfair comparison point, my personal workstation with dual Intel Xeon E5-2680 CPUs from 2012 with 8 cores / 16 threads each (16 cores / 32 threads total).
The conclusion last time was that even though the Raspberry Pi 4B’s arm64 processor basically lost in terms of render time on almost every test, the Raspberry Pi 4B was actually the absolute <em>winner</em> by a wide margin when it came to <em>total energy usage</em> per render job.</p>
<p>This time around, since my expectation is that Apple’s M1 chip should be able to perform extremely well, I think my dual-Xeon personal workstation should absolutely be a fair competitor.
In fact, I think the comparison might actually be kind of <em>unfair</em> towards the dual-Xeon workstation, since the processors are from 2012 and were manufactured on the now-ancient 32 nm process, whereas the M1 is made on TSMC’s currently bleeding edge 5 nm process.
So, to give x86-64 more of a fighting chance, I’m also including a 2019 16 inch MacBook Pro with a 6 core / 8 thread Intel Core i7-9750H processor and 32 GB of memory, a.k.a. one of the fastest Intel-based laptops that Apple currently sells.</p>
<p>The first three test scenes are the same as last time: a standard Cornell Box, the glass teacup with ice seen in my <a href="https://blog.yiningkarlli.com/2019/05/nested-dielectrics.html">Nested Dielectrics post</a>, and the bedroom scene from my <a href="https://blog.yiningkarlli.com/2020/02/shadow-terminator-in-takua.html">Shadow Terminator in Takua post</a>.
Last time these three scenes were chosen since they fit in the 4 GB memory constraint that the Raspberry Pi 4B and the 2015 MacBook Air both have.
This time though, since the M1 Mac Mini has a much more modern 16 GB of memory, I’m including one more scene: <a href="https://blog.yiningkarlli.com/2018/02/scandinavian-room-scene.html">my Scandinavian Room scene</a>, as seen in Figure 1 of this post.
The Scandinavian Room scene is a much more realistic example of the type of complexity found in a real production render, and has much more interesting and difficult light transport.
Like before, the Cornell Box is rendered to 16 SPP using unidirectional path tracing and at 1024x1024 resolution, the Tea Cup is rendered to 16 SPP using VCM and at 1920x1080 resolution, and the Bedroom is rendered to 16 SPP using unidirectional path tracing and at 1920x1080 resolution.
Because the Scandinavian Room scene takes much longer to render due to being a much more complex scene, I rendered the Scandinavian Room scene to 4 SPP using unidirectional path tracing and at 1920x1080 resolution.
I left Takua Renderer’s texture caching system enabled for the Scandinavian Room scene, in order to test that the texture caching system was working correctly on arm64.
Using the texture cache could alter the performance results slightly due to disk latency when fetching texture tiles to populate the texture cache, but the texture cache hit rate after the first SPP on this scene is so close to 100% that it basically doesn’t make a difference beyond that point. To account for this, I actually rendered the Scandinavian Room scene to 5 SPP, threw out the timings for the first SPP, and counted the times for the last 4 SPP.</p>
<p>Each test’s recorded time below is the average of the three best runs, chosen out of five runs in total for each processor.
For the M1 processor, I actually did three different types of runs, which are presented separately below.
I did one test with the native arm64 build of Takua Renderer, a second test with a version of the native arm64 build hacked to run with the M1’s hardware TSO mode enabled, and a third test running the x86-64 build on the M1 through Rosetta 2.
Also, for the Cornell Box, Tea Cup, and Bedroom scenes, I used Takua Renderer’s internal BVH implementation instead of Embree in order to match the tests from the last post, which were done before I had Embree working on arm64.
The Scandinavian Room tests use Embree as the traverser instead.</p>
<p>Here are the results:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">CORNELL BOX</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1024x1024, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">440.627 s</td>
<td style="text-align: left">approx 1762.51 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">272.053 s</td>
<td style="text-align: left">approx 1088.21 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">36.6183 s</td>
<td style="text-align: left">approx 1139.79 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">41.7408 s</td>
<td style="text-align: left">approx 500.890 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">28.0611 s</td>
<td style="text-align: left">approx 224.489 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">32.5621 s</td>
<td style="text-align: left">approx 260.497 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">42.5824 s</td>
<td style="text-align: left">approx 340.658 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">TEA CUP</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, VCM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">2205.072 s</td>
<td style="text-align: left">approx 8820.32 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">2237.136 s</td>
<td style="text-align: left">approx 8948.56 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">174.872 s</td>
<td style="text-align: left">approx 5593.60 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">158.729 s</td>
<td style="text-align: left">approx 1904.75 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">115.253 s</td>
<td style="text-align: left">approx 922.021 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">128.299 s</td>
<td style="text-align: left">approx 1026.39 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">164.289 s</td>
<td style="text-align: left">approx 1314.31 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">BEDROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">5653.66 s</td>
<td style="text-align: left">approx 22614.64 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">4900.54 s</td>
<td style="text-align: left">approx 19602.18 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">310.35 s</td>
<td style="text-align: left">approx 9931.52 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">362.29 s</td>
<td style="text-align: left">approx 4347.44 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">256.68 s</td>
<td style="text-align: left">approx 2053.46 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">291.69 s</td>
<td style="text-align: left">approx 2333.50 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">366.01 s</td>
<td style="text-align: left">approx 2928.08 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">SCANDINAVIAN ROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">119.16 s</td>
<td style="text-align: left">approx 3813.18 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">151.81 s</td>
<td style="text-align: left">approx 1821.80 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">109.94 s</td>
<td style="text-align: left">approx 879.55 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">124.95 s</td>
<td style="text-align: left">approx 999.57 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">153.66 s</td>
<td style="text-align: left">approx 1229.32 s</td>
</tr>
</tbody>
</table>
<p>The first takeaway from these new results is that Intel CPUs have advanced enormously over the past decade!
My wife’s 2019 16 inch MacBook Pro comes extremely close to matching my 2012 dual Xeon workstation’s performance on most tests and even wins on the Tea Cup scene, which is extremely impressive considering that the Intel Core i7-9750H’s MSRP is around a tenth of what the dual Intel Xeon E5-2680s would have cost new in 2012, and the Intel Core i7-9750H also uses 5 times less energy at peak than the dual Intel Xeon E5-2680s do.</p>
<p>The real story though, is in the Apple M1 processor.
Quite simply, the Apple M1 processor completely smokes everything else on the list, often by margins that are downright stunning.
Depending on the test, the M1 processor beats the dual Xeons by anywhere between 10% and 30% in wall time and beats the 2019 MacBook Pro’s Core i7 by even more.
In terms of core-seconds, which is a measure of the overall performance of each processor core that approximates how long the render would have taken completely single-threaded, the M1’s lead is even larger; each of the M1’s processor cores is somewhere between 4 and 6 times faster than the dual Xeons’ individual cores and between 2 and 3 times faster than the more contemporaneous Intel Core i7-9750H’s individual cores.
The even more impressive result from the M1 though, is that even running the x86-64 version of Takua Renderer using Rosetta 2’s dynamic translation system, the M1 still matches <em>or beats</em> the Intel Core i7-9750H.</p>
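<p>As a concrete example of how the core-seconds numbers are derived for the M1 results: the native arm64 Cornell Box render took 28.0611 seconds of wall time while fully occupying all 8 of the M1’s cores, which works out to 28.0611 × 8 ≈ 224.489 core-seconds; since the M1 has no simultaneous multithreading, the number of hardware threads and the number of cores in use are the same.</p>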
<p>Below is the breakdown of energy utilization for each test; the total energy used for each render is the wall clock render time multiplied by the maximum TDP of each processor to get watt-seconds, which is then divided by 3600 seconds per hour to get watt-hours.
Maximum TDP is used since Takua Renderer pushes processor utilization to 100% during each render.
As a point of comparison, I’ve also included all of the results from my previous post:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">CORNELL BOX</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1024x1024, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">0.4895 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">1.1336 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">2.6450 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">0.5218 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.1169 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.1357 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.1774 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">TEA CUP</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, VCM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">2.4500 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">9.3214 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">12.6297 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">1.9841 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.4802 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.5346 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.6845 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">BEDROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">6.2819 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">20.4189 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">22.4142 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">4.5286 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">1.0695 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">1.2154 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">1.5250 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">SCANDINAVIAN ROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">8.606 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">1.8976 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.4581 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.5206 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.6403 Wh</td>
</tr>
</tbody>
</table>
<p>Again the first takeaway from these results is just how much processor technology has improved overall in the past decade; the total energy usage by the modern Intel Core i7-9750H and Apple M1 is leaps and bounds better than the dual Xeons from 2012.
Compared to what was essentially the most powerful workstation hardware that Intel sold a little under a decade ago, a modern Intel laptop chip can now do the same work in about the same amount of time for roughly 5x <em>less</em> energy consumption.</p>
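<p>To make that arithmetic concrete using the Bedroom scene as an example: the dual Intel Xeon E5-2680 system took 310.35 seconds at a combined 260 W maximum TDP, or (310.35 × 260) / 3600 ≈ 22.41 Wh, while the Intel Core i7-9750H took 362.29 seconds at a 45 W maximum TDP, or (362.29 × 45) / 3600 ≈ 4.53 Wh, which is where the roughly 5x difference comes from.</p>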
<p>The M1 though, once again entirely lives in a class of its own.
Running the native arm64 build, the M1 processor is <em>4 times more energy efficient</em> than the Intel Core i7-9750H to complete the same task.
The M1’s maximum TDP is only a third of the Intel Core i7-9750H’s maximum TDP, but the actual final energy utilization is a quarter because the M1’s faster performance means that the M1 runs for much less time than the Intel Core i7-9750H.
In other words, running native code, the M1 is both faster <em>and</em> more energy efficient than the Intel Core i7-9750H.
This result wouldn’t be impressive if the comparison was between the M1 and some low-end, power-optimized ultra-portable Intel chip, but that’s not what the comparison is with.
The comparison is with the Intel Core i7-9750H, which is a high-end, 45 W maximum TDP part that MSRPs for $395.
In comparison, the M1 is estimated to cost about $50, and the entire M1 Mac Mini only has a 39 W TDP total at maximum load; the M1 itself is reported to have a 15 W maximum TDP.
Where the comparison between the M1 and the Intel Core i7-9750H gets even more impressive is when looking at the M1’s energy utilization running x86-64 code under Rosetta 2: the M1 is <em>still</em> about 3 times more energy efficient than the Intel Core i7-9750H to do the same work.
Put another way, the M1 is an arm64 processor that can run emulated x86-64 code <em>faster than a modern native x86-64 processor that costs 5x more and uses 3x more energy can</em>.</p>
<p>Another interesting observation is that for the same work, the M1 is actually more energy efficient than the Raspberry Pi 4B as well!
In the case of the Raspberry Pi 4B comparison, while the M1’s maximum TDP is 3.75x higher than the Broadcom BCM2711’s maximum TDP, the M1 is also around 20x faster to complete each render; the M1’s massive performance uplift more than offsets the higher maximum TDP.</p>
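<p>As an aside, the energy totals in the tables above line up with a simple back-of-the-envelope model: take each processor’s max TDP, multiply by the render’s wall-clock time, and convert from watt-seconds to watt-hours. Below is a minimal sketch of that arithmetic; the two wall times plugged in are the 8-thread M1 native and Rosetta 2 numbers from the scaling tables later in this post. Since this model treats max TDP as a constant draw for the entire render, the results should be read as rough upper-bound estimates rather than measurements of actual instantaneous power consumption.</p>
<pre><code class="language-cpp">#include &lt;cstdio&gt;

// Rough energy estimate: max TDP (watts) multiplied by wall-clock render time
// (seconds), converted from watt-seconds (joules) to watt-hours. This treats max
// TDP as a constant draw, so it is an upper-bound sketch, not a measurement.
static double estimateWattHours(double maxTdpWatts, double wallTimeSeconds) {
    return maxTdpWatts * wallTimeSeconds / 3600.0;
}

int main() {
    // 15 W max TDP and the 8-thread Scandinavian Room wall times from the M1
    // scaling tables below: ~109.94 s native, ~153.66 s under Rosetta 2.
    std::printf("M1 native:    %.4f Wh\n", estimateWattHours(15.0, 109.9437));
    std::printf("M1 Rosetta 2: %.4f Wh\n", estimateWattHours(15.0, 153.6646));
    return 0;
}
</code></pre>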
<p>Another aspect of the M1 processor that I was curious enough about to test further is the M1’s big.LITTLE implementation.
The M1 has four “Firestorm” cores and four “Icestorm” cores, where Firestorm cores are high-performance but also use a ton of energy, and Icestorm cores are extremely energy-efficient but are also commensurately less performant.
I wanted to know just how much of the overall performance of the M1 was coming from the big Firestorm cores, and just how much slower the Icestorm cores are.
So, I did a simple thread scaling test where I did successive renders using 1 all the way through 8 threads.
I don’t know of a good way on the M1 to explicitly pin a given thread to a specific kind of core; on the A12Z, the easy way to pin to the high-performance cores is to just enable hardware TSO mode, since the A12Z only implements hardware TSO on its high-performance cores, but this is no longer the case on the M1.
But, I figured that the underlying operating system’s thread scheduler should be smart enough to notice that Takua Renderer is a performance-hungry job and schedule threads onto any available high-performance cores before falling back to the energy-efficiency cores.</p>
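<p>Takua’s actual job system is more complex than this, but just to make the test setup concrete, below is a minimal sketch of the kind of harness used for this thread scaling experiment: run a fixed amount of total work split across 1 through 8 threads, record the wall time for each run, and derive the speedup and core-seconds numbers that appear in the tables below. The busyWork() function here is just a hypothetical stand-in for rendering a fixed number of samples, and as described above, the sketch leaves core placement entirely up to the operating system’s scheduler.</p>
<pre><code class="language-cpp">#include &lt;chrono&gt;
#include &lt;cstdio&gt;
#include &lt;thread&gt;
#include &lt;vector&gt;

// Hypothetical stand-in for one thread's share of a render; in the real test this
// is Takua rendering its share of a fixed total sample count.
static void busyWork(long iterations) {
    volatile double x = 0.0;
    for (long i = 0; i &lt; iterations; i++) { x += 1.0 / double(i + 1); }
}

int main() {
    const long totalWork = 2000000000L;  // fixed total work, divided among threads
    double baselineWallTime = 0.0;
    for (int numThreads = 1; numThreads &lt;= 8; numThreads++) {
        auto start = std::chrono::steady_clock::now();
        std::vector&lt;std::thread&gt; threads;
        for (int i = 0; i &lt; numThreads; i++) {
            threads.emplace_back(busyWork, totalWork / numThreads);
        }
        for (auto&amp; t : threads) { t.join(); }
        std::chrono::duration&lt;double&gt; wall = std::chrono::steady_clock::now() - start;
        if (numThreads == 1) { baselineWallTime = wall.count(); }
        double wtSpeedup = baselineWallTime / wall.count();   // ideally == numThreads
        double coreSeconds = wall.count() * numThreads;
        double csMultiplier = baselineWallTime / coreSeconds; // ideally == 1.0
        std::printf("%d threads: %.2f s wall, %.4fx speedup, %.2f core-s, %.4fx CS\n",
                    numThreads, wall.count(), wtSpeedup, coreSeconds, csMultiplier);
    }
    return 0;
}
</code></pre>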
<p>Here are the results on the Scandinavian Room scene for native arm64, native arm64 with TSO-enabled, and x86-64 running using Rosetta 2:</p>
<table>
<thead>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">M1 Native</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center">Threads:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: center">WT Speedup:</th>
<th style="text-align: center">Core-Seconds:</th>
<th style="text-align: center">CS Multiplier:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1 (1 big, 0 LITTLE)</td>
<td style="text-align: center">575.6787 s</td>
<td style="text-align: center">1.0x</td>
<td style="text-align: center">575.6786 s</td>
<td style="text-align: center">1.0x</td>
</tr>
<tr>
<td style="text-align: center">2 (2 big, 0 LITTLE)</td>
<td style="text-align: center">292.521 s</td>
<td style="text-align: center">1.9679x</td>
<td style="text-align: center">585.042 s</td>
<td style="text-align: center">0.9839x</td>
</tr>
<tr>
<td style="text-align: center">3 (3 big, 0 LITTLE)</td>
<td style="text-align: center">197.04 s</td>
<td style="text-align: center">2.9216x</td>
<td style="text-align: center">591.1206 s</td>
<td style="text-align: center">0.9738x</td>
</tr>
<tr>
<td style="text-align: center">4 (4 big, 0 LITTLE)</td>
<td style="text-align: center">148.9617 s</td>
<td style="text-align: center">3.8646x</td>
<td style="text-align: center">595.8466 s</td>
<td style="text-align: center">0.9661x</td>
</tr>
<tr>
<td style="text-align: center">5 (4 big, 1 LITTLE)</td>
<td style="text-align: center">137.6307 s</td>
<td style="text-align: center">4.1827x</td>
<td style="text-align: center">688.1536 s</td>
<td style="text-align: center">0.8365x</td>
</tr>
<tr>
<td style="text-align: center">6 (4 big, 2 LITTLE)</td>
<td style="text-align: center">128.9223 s</td>
<td style="text-align: center">4.4653x</td>
<td style="text-align: center">773.535 s</td>
<td style="text-align: center">0.7442x</td>
</tr>
<tr>
<td style="text-align: center">7 (4 big, 3 LITTLE)</td>
<td style="text-align: center">120.496 s</td>
<td style="text-align: center">4.7775x</td>
<td style="text-align: center">843.4713 s</td>
<td style="text-align: center">0.6825x</td>
</tr>
<tr>
<td style="text-align: center">8 (4 big, 4 LITTLE)</td>
<td style="text-align: center">109.9437 s</td>
<td style="text-align: center">5.2361x</td>
<td style="text-align: center">879.5476 s</td>
<td style="text-align: center">0.6545x</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">M1 TSO-Enabled</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center">Threads:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: center">WT Speedup:</th>
<th style="text-align: center">Core-Seconds:</th>
<th style="text-align: center">CS Multiplier:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1 (1 big, 0 LITTLE)</td>
<td style="text-align: center">643.9846 s</td>
<td style="text-align: center">1.0x</td>
<td style="text-align: center">643.9846 s</td>
<td style="text-align: center">1.0x</td>
</tr>
<tr>
<td style="text-align: center">2 (2 big, 0 LITTLE)</td>
<td style="text-align: center">323.8036 s</td>
<td style="text-align: center">1.9888x</td>
<td style="text-align: center">647.6073 s</td>
<td style="text-align: center">0.9944x</td>
</tr>
<tr>
<td style="text-align: center">3 (3 big, 0 LITTLE)</td>
<td style="text-align: center">220.4093 s</td>
<td style="text-align: center">2.9217x</td>
<td style="text-align: center">661.2283 s</td>
<td style="text-align: center">0.9739x</td>
</tr>
<tr>
<td style="text-align: center">4 (4 big, 0 LITTLE)</td>
<td style="text-align: center">168.9733 s</td>
<td style="text-align: center">3.8111x</td>
<td style="text-align: center">675.8943 s</td>
<td style="text-align: center">0.9527x</td>
</tr>
<tr>
<td style="text-align: center">5 (4 big, 1 LITTLE)</td>
<td style="text-align: center">153.849 s</td>
<td style="text-align: center">4.1858x</td>
<td style="text-align: center">769.2453 s</td>
<td style="text-align: center">0.8371x</td>
</tr>
<tr>
<td style="text-align: center">6 (4 big, 2 LITTLE)</td>
<td style="text-align: center">143.7426 s</td>
<td style="text-align: center">4.4801x</td>
<td style="text-align: center">862.4576 s</td>
<td style="text-align: center">0.7466x</td>
</tr>
<tr>
<td style="text-align: center">7 (4 big, 3 LITTLE)</td>
<td style="text-align: center">132.7233 s</td>
<td style="text-align: center">4.8520x</td>
<td style="text-align: center">929.0633 s</td>
<td style="text-align: center">0.6931x</td>
</tr>
<tr>
<td style="text-align: center">8 (4 big, 4 LITTLE)</td>
<td style="text-align: center">124.9456 s</td>
<td style="text-align: center">5.1541x</td>
<td style="text-align: center">999.5683 s</td>
<td style="text-align: center">0.6442x</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">M1 Rosetta 2</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center">Threads:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: center">WT Speedup:</th>
<th style="text-align: center">Core-Seconds:</th>
<th style="text-align: center">CS Multiplier:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1 (1 big, 0 LITTLE)</td>
<td style="text-align: center">806.6843 s</td>
<td style="text-align: center">1.0x</td>
<td style="text-align: center">806.68433 s</td>
<td style="text-align: center">1.0x</td>
</tr>
<tr>
<td style="text-align: center">2 (2 big, 0 LITTLE)</td>
<td style="text-align: center">412.186 s</td>
<td style="text-align: center">1.9570x</td>
<td style="text-align: center">824.372 s</td>
<td style="text-align: center">0.9785x</td>
</tr>
<tr>
<td style="text-align: center">3 (3 big, 0 LITTLE)</td>
<td style="text-align: center">280.875 s</td>
<td style="text-align: center">2.8720x</td>
<td style="text-align: center">842.625 s</td>
<td style="text-align: center">0.9573x</td>
</tr>
<tr>
<td style="text-align: center">4 (4 big, 0 LITTLE)</td>
<td style="text-align: center">207.0996 s</td>
<td style="text-align: center">3.8951x</td>
<td style="text-align: center">828.39966 s</td>
<td style="text-align: center">0.9737x</td>
</tr>
<tr>
<td style="text-align: center">5 (4 big, 1 LITTLE)</td>
<td style="text-align: center">189.322 s</td>
<td style="text-align: center">4.2609x</td>
<td style="text-align: center">946.608 s</td>
<td style="text-align: center">0.8521x</td>
</tr>
<tr>
<td style="text-align: center">6 (4 big, 2 LITTLE)</td>
<td style="text-align: center">175.0353 s</td>
<td style="text-align: center">4.6086x</td>
<td style="text-align: center">1050.2133 s</td>
<td style="text-align: center">0.7681x</td>
</tr>
<tr>
<td style="text-align: center">7 (4 big, 3 LITTLE)</td>
<td style="text-align: center">166.1286 s</td>
<td style="text-align: center">4.8557x</td>
<td style="text-align: center">1162.9033 s</td>
<td style="text-align: center">0.6936x</td>
</tr>
<tr>
<td style="text-align: center">8 (4 big, 4 LITTLE)</td>
<td style="text-align: center">153.6646 s</td>
<td style="text-align: center">5.2496x</td>
<td style="text-align: center">1229.3166 s</td>
<td style="text-align: center">0.6562x</td>
</tr>
</tbody>
</table>
<p>In the above tables, WT speedup is how many times faster that given test was than the baseline single-threaded render; WT speedup is a measure of multithreading scaling efficiency.
The closer WT speedup is to the number of threads, the better the multithreading scaling efficiency; with perfect multithreading scaling efficiency, we’d expect the WT speedup number to be exactly the same as the number of threads.
The CS Multiplier value is another way to measure multithreading scaling efficiency; the closer the CS Multiplier number is to exactly 1.0, the closer each test is to achieving perfect multithreading scaling efficiency.</p>
<p>Since this test ran Takua Renderer in unidirectional path tracing mode, and depth-first unidirectional path tracing is largely trivially parallelizable using a simple parallel_for (okay, it’s not quite so simple once things like texture caching and learned path guiding data structures come into play, but close enough for now), my expectation for Takua Renderer is that on a system with homogeneous cores, multithreading scaling should be very close to perfect (assuming a fair scheduler in the underlying operating system).
Looking at the first four threads, which are all using the M1’s high-performance “big” Firestorm cores, close-to-perfect multithreading scaling efficiency is exactly what we see.
Adding the next four threads though, which use the M1’s low-performance energy-efficient “LITTLE” Icestorm cores, the multithreading scaling efficiency drops dramatically.
This drop in multithreading scaling efficiency is expected, since the Icestorm cores are far less performant than the Firestorm cores, but the <em>amount</em> that multithreading scaling efficiency drops by is what is interesting here, since that drop gives us a good estimate of just how much less performant the Icestorm cores are.
The answer is that the Icestorm cores are roughly a quarter as performant as the high-performance Firestorm cores.
However, according to Apple, the Icestorm cores only use a tenth of the energy that the Firestorm cores do; a 4x performance drop for a 10x drop in energy usage is very impressive.</p>
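<p>To make the “trivially parallelizable using a simple parallel_for” point a bit more concrete, here is a stripped-down sketch of the overall shape that a depth-first unidirectional path tracer’s main loop takes when parallelized using TBB. To be clear, none of this is Takua’s actual code; the Scene type and traceRadiance() function below are just hypothetical placeholders, and a real renderer has to deal with per-thread sampler state, texture caches, path guiding data structures, and so on. The key property is simply that every pixel’s samples are independent of every other pixel’s samples, which is what makes the pixel loop safe to hand off to a parallel_for:</p>
<pre><code class="language-cpp">#include &lt;tbb/blocked_range.h&gt;
#include &lt;tbb/parallel_for.h&gt;
#include &lt;vector&gt;

struct Vec3 { float x = 0.0f, y = 0.0f, z = 0.0f; };

// Hypothetical placeholders for the renderer's actual scene and integrator; a real
// traceRadiance() traces a full light path through the scene for one sample.
struct Scene {};
static Vec3 traceRadiance(const Scene&amp;, int x, int y, int s) {
    return Vec3{ float(x % 256) / 255.0f, float(y % 256) / 255.0f, float(s % 16) / 15.0f };
}

void renderFrame(const Scene&amp; scene, int width, int height, int spp,
                 std::vector&lt;Vec3&gt;&amp; framebuffer) {
    framebuffer.assign(size_t(width) * size_t(height), Vec3());
    // Every pixel's samples are independent, so the loop over pixels is trivially
    // parallel; TBB splits the pixel range into chunks and schedules the chunks
    // across however many worker threads are available.
    tbb::parallel_for(tbb::blocked_range&lt;size_t&gt;(0, size_t(width) * size_t(height)),
                      [&amp;](const tbb::blocked_range&lt;size_t&gt;&amp; range) {
        for (size_t p = range.begin(); p != range.end(); p++) {
            int x = int(p % size_t(width));
            int y = int(p / size_t(width));
            Vec3 accumulated;
            for (int s = 0; s &lt; spp; s++) {
                Vec3 radiance = traceRadiance(scene, x, y, s);
                accumulated.x += radiance.x;
                accumulated.y += radiance.y;
                accumulated.z += radiance.z;
            }
            framebuffer[p] = Vec3{ accumulated.x / float(spp),
                                   accumulated.y / float(spp),
                                   accumulated.z / float(spp) };
        }
    });
}
</code></pre>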
<div id="conclusion"></div>
<p><strong>Conclusion to Part 2</strong></p>
<p>There’s really no way to overstate what a colossal achievement Apple’s M1 processor is; compared with almost every modern x86-64 processor in its class, it achieves significantly more performance for much less cost and much less energy.
The even more amazing thing to think about is that the M1 is Apple’s <em>low end</em> Mac processor and likely will be the slowest arm64 chip to ever power a shipping Mac (the A12Z powering the DTK is slower, but the DTK is not a shipping consumer device); future Apple Silicon chips will only be even faster.
Combined with other extremely impressive high-performance arm64 chips such as Fujitsu’s A64FX supercomputer CPU, NVIDIA’s upcoming Grace CPU, Ampere’s monster 80-core Altra CPU, and Amazon’s Graviton2 CPU used in AWS, I think the future for high-end arm64 looks very bright.</p>
<p>That being said though, x86-64 chips aren’t exactly sitting still either.
In the comparisons above I don’t have any modern AMD Ryzen chips, entirely because I personally don’t have access to any Ryzen-based systems at the moment.
However, AMD has been making enormous advancements in both performance and energy efficiency with their Zen series of x86-64 microarchitectures, and the current Zen 3 microarchitecture thoroughly bests Intel in both performance and energy efficiency.
Intel is not standing still either, with ambitious plans to fight AMD for the x86-64 performance crown, and I’m sure both companies have no intention of taking the rising threat from arm64 lying down.</p>
<p>We are currently in a very exciting period of enormous advances in modern processor technology, with multiple large, well funded, very serious players competing to outdo each other.
For the end user, no matter who comes out on top and what happens, the end result is ultimately a win: faster chips using less energy for lower prices.
Now that I have Takua Renderer fully working with parity on both x86-64 and arm64, I’m ready to take advantage of each new advancement!</p>
<p><strong>Acknowledgements</strong></p>
<p>For both the last post and this post, I owe <a href="https://twitter.com/superfunc">Josh Filstrup</a> an enormous debt of gratitude for proofreading, giving plenty of constructive and useful feedback and suggestions, and for being a great discussion partner over the past year on many of the topics covered in this miniseries.
Also an enormous thanks to my wife, <a href="http://harmonymli.com/">Harmony Li</a>, who was patient with me while I took ages with the porting work and then was patient again with me as I took even longer to get these posts written.
Harmony also helped me brainstorm through various topics and provided many useful suggestions along the way.
Finally, thanks to you, the reader, for sticking with me through these two giant blog posts!</p>
<p><strong>References</strong></p>
<p>Apple. 2020. <a href="https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code">Addressing Architectural Differences in Your macOS Code</a>. Retrieved July 19, 2021.</p>
<p>Apple. 2020. <a href="https://developer.apple.com/documentation/apple-silicon/building-a-universal-macos-binary">Building a Universal macOS Binary</a>. Retrieved June 22, 2021.</p>
<p>Apple. 2020. <a href="https://developer.apple.com/videos/play/wwdc2020/10686/">Explore the New System Architecture of Apple Silicon Macs</a>. Retrieved June 15, 2021.</p>
<p>Apple. 2020. <a href="https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms">Writing ARM64 Code for Apple Platforms</a>. Retrieved June 26, 2021.</p>
<p>ARM Holdings. 2015. <a href="https://developer.arm.com/documentation/den0024/a/The-ABI-for-ARM-64-bit-Architecture/Register-use-in-the-AArch64-Procedure-Call-Standard/Parameters-in-general-purpose-registers">Parameters in General-Purpose Registers</a>. In <em>ARM Cortex-A Series Programmer’s Guide for ARMv8-A</em>. Retrieved June 26, 2021.</p>
<p>ARM Holdings. 2017. <a href="https://developer.arm.com/documentation/100442/0100/register-descriptions/aarch64-system-registers/actlr-el1--auxiliary-control-register--el1">ACTLR_EL1, Auxiliary Control Register, EL1</a>. In <em>ARM Cortex-A55 Core Technical Reference Manual</em>. Retrieved June 26, 2021.</p>
<p>Martin Chang. 2017. <a href="https://mightynotes.wordpress.com/2017/01/24/porting-intel-embree-to-arm/">Porting Intel Embree to ARM</a>. In <em>MightyNotes: A Developer’s Blog</em>. Retrieved July 18, 2021.</p>
<p>Erik Engheim. 2021. <a href="https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1">The Secret Apple M1 Coprocessor</a>. Retrieved July 23, 2021.</p>
<p>Trevor Harmon. 2003. <a href="https://www.drdobbs.com/architecture-and-design/portability-the-arm-processor/184405435#">Portability & the ARM Processor</a>. In <em>Dr. Dobb’s</em>. Retrieved July 19, 2021.</p>
<p>Shawn Hickey, Matt Wojiakowski, Shipa Sharma, David Coulter, Theano Petersen, Mike Jacobs, and Michael Satran. 2021. <a href="https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on-arm-x86-emulation">How x86 Emulation works on ARM</a>. In <em>Windows on ARM</em>. Retrieved June 26, 2021.</p>
<p>Saagar Jha. 2020. <a href="https://github.com/saagarjha/TSOEnabler">TSOEnabler</a>. Retrieved June 15, 2021.</p>
<p>Dougall Johnson. 2020. <a href="https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f">AMX: Apple Matrix Coprocessor</a>. Retrieved July 23, 2021.</p>
<p>LLVM Project. 2021. <a href="https://llvm.org/docs/CommandGuide/llvm-lipo.html">llvm-lipo - LLVM Tool for Manipulating Universal Binaries</a>. Retrieved June 22, 2021.</p>
<p>LLVM Project. 2021. <a href="https://llvm.org/docs/CommandGuide/llvm-objdump.html">llvm-objdump - LLVM’s object file dumper</a>. Retrieved June 22, 2021.</p>
<p>Koh M. Nakagawa. 2021. <a href="https://ffri.github.io/ProjectChampollion/part1/">Reverse-Engineering Rosetta 2 Part 1: Analyzing AOT Files and the Rosetta 2 Runtime</a>. In <em>Project Champollion</em>. Retrieved June 23, 2021.</p>
<p>Koh M. Nakagawa. 2021. <a href="https://ffri.github.io/ProjectChampollion/part2/">Reverse-Engineering Rosetta 2 Part 2: Analyzing Other aspects of Rosetta 2 Runtime and AOT Shared Cache Files</a>. In <em>Project Champollion</em>. Retrieved June 23, 2021.</p>
<p>Howard Oakley. 2020. <a href="https://eclecticlight.co/2020/07/28/universal-binaries-inside-fat-headers/">Universal Binaries: Inside Fat Headers</a>. In <em>The Eclectic Light Company</em>. Retrieved June 22, 2021.</p>
<p>Howard Oakley. 2021. <a href="https://eclecticlight.co/2021/07/27/code-in-arm-assembly-rounding-and-arithmetic/">Code in ARM Assembly Series</a>. In <em>The Eclectic Light Company</em>. Retrieved July 19, 2021.</p>
<p>OSDev. 2018. <a href="https://wiki.osdev.org/System_V_ABI">System V ABI</a>. Retrieved June 26, 2021.</p>
<p>Matt Pharr. 2018. <a href="https://pharr.org/matt/blog/2018/04/30/ispc-all">The Story of ISPC</a>. In <em>Matt Pharr’s Blog</em>. Retrieved July 18, 2021.</p>
<p>Matt Pharr and William R. Mark. 2012. <a href="https://doi.org/10.1109/InPar.2012.6339601">ispc: A SPMD compiler for high-performance CPU programming</a>. In <em>2012 Innovative Parallel Computing (InPar)</em>.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">This Is Why They Call It a Weakly-Ordered CPU</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Marc Sweetgall. 2021. <a href="https://blogs.windows.com/windowsdeveloper/2021/06/28/announcing-arm64ec-building-native-and-interoperable-apps-for-windows-11-on-arm/">Announcing ARM64EC: Building Native and Interoperable Apps for Windows 11 on ARM</a>. In <em>Windows Developers Blog</em>. Retrieved June 26, 2021.</p>
<p>Threedots. 2021. <a href="https://threedots.ovh/blog/2021/02/cpus-with-sequential-consistency/">Arm CPUs with Sequential Consistency</a>. In <em>Random Blog</em>. Retrieved June 26, 2021.</p>
<p>Ingo Wald. 2018. <a href="https://ingowald.blog/2018/07/15/cfi-embree-on-arm-power/">CfI: Embree on ARM/Power/…?</a>. In <em>Ingo Wald’s Blog</em>. Retrieved July 18, 2021.</p>
<p>Amy Williams, Steve Barrus, R. Keith Morley, and Peter Shirley. 2005. <a href="https://doi.org/10.1080/2151237X.2005.10129188">An Efficient and Robust Ray-Box Intersection Algorithm</a>. <em>Journal of Graphics Tools</em>. 10, 1 (2005), 49-54.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Endianness">Endianness</a>. Retrieved July 19, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/SIMD">SIMD</a>. Retrieved July 18, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads">Single Instruction, Multiple Threads</a>. Retrieved July 18, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/SPMD">SPMD</a>. Retrieved July 18, 2021.</p>
https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html
Porting Takua Renderer to 64-bit ARM- Part 1
2021-05-29T00:00:00+00:00
2021-05-29T00:00:00+00:00
Yining Karl Li
<p>For almost its entire existence my hobby renderer, Takua Renderer, has built and run on Mac, Windows, and Linux on x86-64.
I maintain Takua on all three major desktop operating systems because I routinely run and use all three, and because I’ve found that building with different compilers on different platforms is a good way to make sure that I don’t have code that is actually wrong but just happens to work because of the implementation quirks of a particular compiler and / or platform.
As of last year, Takua Renderer now also runs on 64-bit ARM, for both Linux and Mac!
64-bit ARM is often called either aarch64 or arm64; these two terms are interchangeable and mean the same thing (aarch64 is the official name for 64-bit ARM and is what Linux tends to use, while arm64 is the name that Apple and Microsoft’s tools tend to use).
For the sake of consistency, I’ll use the term arm64.</p>
<p>This post is the first of a two-part writeup of the process I undertook to port Takua Renderer to run on arm64, along with interesting stuff that I learned along the way.
In this first part, I’ll write about motivation and the initial port I undertook in the spring to arm64 Linux (specifically Fedora).
I’ll also write about how arm64 and x86-64’s memory ordering guarantees differ and what that means for lock-free code, and I’ll also do some deeper dives into topics such as floating point differences between different processors and a case study examining how code compiles to x86-64 versus to arm64.
In the second part, I’ll write about porting to arm64-based Apple Silicon Macs and I’ll also write about getting Embree up and running on ARM, creating Universal Binaries, and some other miscellaneous topics.</p>
<div id="motivation"></div>
<p><strong>Motivation</strong></p>
<p>So first, a bit of a preamble: why port to arm64 at all?
Today, basically most, if not all, of the animation/VFX industry renders on x86-64 machines (and a vast majority of those machines are likely running Linux), so pretty much all contemporary production rendering development happens on x86-64.
However, this has not always been true!
A long long time ago, much of the computer graphics world was based on MIPS hardware running SGI’s IRIX Unix variant; in the early 2000s, as SGI’s custom hardware began to fall behind the performance-per-dollar, performance-per-watt, and even absolute performance that commodity x86-based machines could offer, the graphics world undertook a massive migration to the current x86 world that we live in today.
Apple undertook a massive migration from PowerPC to x86 in the mid/late 2000s for similar reasons.</p>
<p>At this point, an ocean of text has been written about why it is that x86 (and by (literal) extension x86-64) became the dominant ISA in desktop computing and in the server space.
One common theory that I like is that x86’s dominance was a classic example of <a href="https://en.wikipedia.org/wiki/Disruptive_innovation#Disruptive_technology">disruptive innovation</a> from the low end.
A super short summary of disruptive innovation from the low end is that sometimes, a new player enters an existing market with a product that is much less capable but also much cheaper than existing competing products.
By being so much cheaper, the new product can generate a new, larger market that existing competing products can’t access due to their higher cost or different set of requirements or whatever.
As a result, the new product gets massive investment since the new product is the only thing that can capture this new larger market, and in turn this massive influx of investment allows the new player to iterate faster and rapidly grow its product in capabilities until the new player becomes capable of overtaking the old market as well.
This theory maps well to x86; x86-based desktop PCs started off being much cheaper but also much less capable than specialized hardware such as SGI machines, but the investment that poured into the desktop PC space allowed x86 chips to rapidly grow in absolute performance capability until they were able to overtake specialized hardware in basically every comparable metric.
At that point, moving to x86 became a no-brainer for many industries, including the computer graphics realm.</p>
<p>I think that ARM is following the same disruptive innovation path that x86 did, only this time the starting “low end” point is smartphones and tablets, which is an even lower starting point than desktop PCs were.
More importantly, I think we’re now at a tipping point for ARM.
For many years now, ARM chips have offered better performance-per-dollar and performance-per-watt than any x86-64 chip from Intel or AMD, and the point where arm64 chips can overtake x86-64 chips in absolute performance seems plausibly within sight over the next few years.
Notably, Amazon’s in-house Graviton2 arm64 CPU and Apple’s M1 arm64-based Apple Silicon chip are both already highly competitive in absolute performance terms with high end consumer x86-64 CPUs, while consuming less power and costing less.
Actually, I think that this trend should have been obvious to anyone paying attention to Apple’s A-series chips since the A9 chip was released in 2015.</p>
<p>In cases of disruptive innovation from the low end, the outer edge of the absolute high end is often the last place where the disruption reaches.
One of the interesting things about the high-end rendering field is that high-end rendering is one of a relatively small handful of applications that sits at the absolute outer edge of high end compute performance.
All of the major animation and VFX studios have render farms (either on-premises or in the cloud) with core counts somewhere in the tens of thousands of cores; these render farms have more similarities with supercomputers than they do with a regular consumer desktop or laptop.
I don’t know that anyone has actually tried this, but my guess is that if someone benchmarked any major animation or VFX studio’s render farm using the <a href="https://en.wikipedia.org/wiki/LINPACK_benchmarks">LINPACK supercomputer benchmark</a>, the score would sit very respectably somewhere in the upper half of the <a href="https://www.top500.org">TOP500 supercomputer list</a>.
With the above in mind, the fact that the fastest supercomputer in the world is now an arm64-based system should be an interesting indicator of where ARM is now in the process of catching up to x86-64 and how seriously all of us in high-end computer graphics should be when contemplating the possibility of an ARM-based future.</p>
<p>So all of the above brings me to why I undertook porting Takua to arm64.
The reason is because I think we can now plausibly see a potential near future in which the fastest, most efficient, and most cost effective chips in the world are based on arm64 instead of x86-64, and the moment this potential future becomes reality, high-performance software that hasn’t already made the jump will face growing pressure to port to arm64.
With Apple’s in-progress shift to arm64-based Apple Silicon Macs, we may already be at this point.
I can’t speak for any animation or VFX studio in particular; everything I have written here is purely personal opinion and personal conjecture, but I’d like to be ready in the event that a move to arm64 becomes something we have to face as an industry, and what better way is there to prepare than to try with my own hobby renderer first!
Also, for several years now I’ve thought that Apple eventually moving Macs to arm64 was obvious given the progress the A-series Apple chips were making, and since macOS is my primary personal daily use platform, I figured I’d have to port Takua to arm64 eventually anyway.</p>
<p><strong>Porting to arm64 Linux</strong></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/takua-on-arm-pt1/takua_fedora_arm64.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/May/takua-on-arm-pt1/takua_fedora_arm64.jpg" alt="Figure 1: Takua Renderer running on arm64 Fedora 32, on a Raspberry Pi 4B." /></a></p>
<p>I actually first attempted an ARM port of Takua several years ago, when Fedora 27 became the first version of Fedora to support arm64 single-board computers (SBCs) such as the Raspberry Pi 3B or the Pine A64.
I’ve been a big fan of the Raspberry Pi basically since the original first came out, and the thought of porting Takua to run on a Raspberry Pi as an experiment has been with me basically since 2012.
However, Takua is written very much with 64-bit in mind, and the first two generations of Raspberry Pis only had 32-bit ARM processors.
I actually backed the original Pine A64 on Kickstarter in 2015 precisely because it was one of the very first 64-bit ARMv8 boards on the market, and if I remember correctly, I also ordered the Raspberry Pi 3B the week it was announced in 2016 because it was the first 64-bit ARMv8 Raspberry Pi.
However, my Pine A64 and Raspberry Pi 3B mostly just sat around not doing much because I was working on a bunch of other stuff, but that actually wound up working out because by the time I got back around to tinkering with SBCs in late 2017, Fedora 27 had just been released.
Thanks to a ton of work from <a href="https://nullr0ute.com/">Peter Robinson</a> at Red Hat, Fedora 27 added native arm64 support that basically worked out-of-the-box on both the Raspberry Pi 3B and the Pine A64, which was ideal for me since my Linux distribution of choice for personal hobby projects is Fedora.
Since I already had Takua building and running on Fedora on x86-64, being able to use Fedora as the target distribution for arm64 as well meant that I could eliminate different compiler and system library versions as a variable factor; I “just” had to move everything in my Fedora x86-64 build over to Fedora arm64.
However, back in 2017, I found that a lot of the foundational libraries that Takua depends on just weren’t quite ready on arm64 yet.
The problem usually wasn’t with the actual source code itself, since anything written in pure C++ without any intrinsics or inline assembly should just compile directly on any platform with a supported compiler; instead, the problem was usually just in build scripts not knowing how to handle small differences in where system libraries were located or stuff like that.
At the time I was focused on other stuff, so I didn’t try particularly hard to diagnose and work around the problems I ran into; I kind of just shrugged and put it all aside to revisit some other day.</p>
<p>Fast forward to early 2020, when rumors started circulating of a potential macOS transition to 64-bit ARM.
As the rumors grew, I figured that this was a good time to return to porting Takua to arm64 Fedora in preparation for if a macOS transition actually happened.
I had also recently bought a Raspberry Pi 4B with 4 GB of RAM; the 4 GB of RAM made actually building and running complex code on-device a lot easier than with the Raspberry Pi 3B/3B+’s 1 GB of RAM.
By this point, the arm64 build support level for Takua’s dependencies had improved dramatically.
I think that as arm64 devices like the iPhone and iPad Pro have gotten more and more powerful processors over the last few years and enabled more and more advanced and complex iOS / iPadOS apps (and similarly with Android devices and Android apps), more and more open source libraries have seen adoption on ARM-based platforms and have seen ARM support improve as a result.
Almost everything just built and worked out-of-the-box on arm64, including (to my enormous surprise) Intel’s TBB library!
I had assumed that TBB would be x86-64-only since TBB is an Intel project, but it turns out that over the years, the community has contributed support for ARMv7 and arm64 and even PowerPC to TBB.
The only library that didn’t work out-of-the-box or with minor changes was Embree, which relies heavily on SSE and AVX intrinsics and has small amounts of inline x86-64 assembly.
To get things up and running initially, I just disabled Takua’s Embree-based traversal backend and fell back to my own custom BVH traversal backend.
My own custom BVH traversal backend isn’t nearly as fast as Embree and is instead meant to serve as a reference implementation and fallback for when Embree isn’t available, but for the time being since the goal was just to get Takua working at all, losing performance due to not having Embree was fine.
As you can see by the “Traverser: Embree” label in Takua Renderer’s UI in Figure 1, I later got Embree up and running on arm64 using Syoyo Fujita’s embree-aarch64 port, but I’ll write more about that in the next post.
To be honest, the biggest challenge with getting everything compiled and running was just the amount of patience that was required.
I never seem to be able to get cross-compilation for a different architecture right because I always forget something, so instead of cross-compiling for arm64 from my nice big powerful x86-64 Fedora workstation, I just compiled for arm64 directly on the Raspberry Pi 4B.
While the Raspberry Pi 4B is much faster than the Raspberry Pi 3B, it’s still nowhere near as fast as a big fancy dual-Xeon workstation, so some libraries took forever to compile locally (especially Boost, which I wish I didn’t have to have a dependency on, but I have to since OpenVDB depends on Boost).
Overall getting a working build of Takua up and running on arm64 was very fast; from deciding to undertake the port to getting a first image back took only about a day’s worth of work, and most of that time was just waiting for stuff to compile.</p>
<p>However, getting code to <em>build</em> is a completely different question from getting code to <em>run correctly</em> (unless you’re using one of those fancy proof-solver languages I guess).
The first test renders I did with Takua on arm64 Fedora looked fine to my eye, but when I diff’d them against reference images rendered on x86-64, I found some subtle differences; the source of these differences took me a good amount of digging to understand!
Chasing this problem down led down some interesting rabbit holes exploring important differences between x86-64 and arm64 that need to be considered when porting code between the two platforms; just because code is written in portable C++ does not necessarily mean that it is always actually as portable as one might think!</p>
<p><strong>Floating Point Consistency (or lack thereof) on Different Systems</strong></p>
<p>Takua has two different types of image comparison based regression tests: the first type of test renders out to high samples-per-pixel numbers and does comparisons with near-converged images, while the second type of test renders out and does comparisons using a single sample-per-pixel.
The reason for these two different types of tests is because of how difficult it is to get floating point calculations to match across different compilers / platforms / processors.
Takua’s single-sample-per-pixel tests are only meant to catch regressions on the same compiler / platform / processor, while Takua’s longer tests are meant to test overall correctness of converged renders.
Because of differences in how floating point operations come out on different compilers / platforms / processors, Takua’s convergence tests don’t require an exact match; instead, the tests use small, predefined difference thresholds that comparisons must stay within to pass.
The difference thresholds are basically completely ad-hoc; I picked them to be at a level where I can’t perceive any difference when flipping between the images, since I put together my testing system before image differencing systems that formally factor in perception <a href="https://doi.org/10.1145/3406183">[Andersson et al. 2020]</a> were published.
A large part of the differences between Takua’s test results on x86-64 versus arm64 come from these problems with floating point reproducibility across different systems.
Because of how commonplace this issue is and how often this issue is misunderstood by programmers who haven’t had to deal with it, I want to spend a few paragraphs talking about floating point numbers.</p>
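<p>Just to illustrate the basic shape of these thresholded comparisons, here is a minimal sketch; Takua’s actual test harness operates on full OpenEXR renders and does a bunch more bookkeeping, but at its core the comparison is just a per-channel absolute difference check against a small, ad-hoc threshold (with the same-platform single-sample-per-pixel tests effectively using a threshold of zero):</p>
<pre><code class="language-cpp">#include &lt;cmath&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

// A hypothetical image representation: packed RGB floats, row-major.
struct Image {
    int width = 0;
    int height = 0;
    std::vector&lt;float&gt; pixels;  // width * height * 3 floats
};

// Returns true if every channel of every pixel differs by at most the given
// ad-hoc threshold. Converged-render tests use a small nonzero threshold, while
// same-platform single-spp tests can demand an exact (zero-threshold) match.
bool imagesMatchWithinThreshold(const Image&amp; a, const Image&amp; b, float threshold) {
    if (a.width != b.width || a.height != b.height) { return false; }
    for (size_t i = 0; i &lt; a.pixels.size(); i++) {
        if (std::fabs(a.pixels[i] - b.pixels[i]) &gt; threshold) { return false; }
    }
    return true;
}
</code></pre>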
<p>A lot of programmers that don’t have to routinely deal with floating point calculations might not realize that even though floating point numbers are standardized through the <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE754 standard</a>, in practice reproducibility is not at all guaranteed when carrying out the same set of floating point calculations using different compilers / platforms / processors!
In fact, starting with the same C++ floating point code, determinism is only really guaranteed for successive runs using binaries generated using the same compiler, with the same optimizations enabled, on the same processor family; sometimes running on the same operating system is also a requirement for guaranteed determinism.
There are three main reasons <a href="http://yosefk.com/blog/consistency-how-to-defeat-the-purpose-of-ieee-floating-point.html">[Kreinin 2008]</a> why reproducing exactly the same results from the same set of floating point calculations across different systems is so inconsistent: compiler optimizations, processor implementation details, and different implementations of built-in “complex” functions like sine and cosine.</p>
<p>The first reason above is pretty easy to understand: operations like addition and multiplication are commutative and associative, meaning that mathematically they can be regrouped and carried out in any order, and a compiler’s optimization passes will often choose to reorder such math operations.
However, as anyone who has dealt extensively with floating point numbers knows, due to how floating point numbers are represented <a href="https://doi.org/10.1145/103162.103163">[Goldberg 1991]</a> the commutative and associative properties of addition and multiplication do not actually hold true for floating point numbers; not even for IEEE754 floating point numbers!
Sometimes reordering floating point math is expressly permitted by the language, and sometimes doing this is not actually allowed by the language but happens anyway in the compiler because the user has specified flags like <code class="language-plaintext highlighter-rouge">-ffast-math</code>, which tells the compiler that it is allowed to sacrifice strict IEEE754 and language math requirements in exchange for additional optimization opportunities.
Sometimes the compiler can just have implementation bugs too; <a href="https://lists.llvm.org/pipermail/llvm-dev/2020-June/142697.html">here is an example</a> that I found on the llvm-dev mailing lists describing a bug with loop vectorization that impacts floating point consistency!
The end result of all of the above is that the same floating point source code can produce subtly different results depending on which compiler is used and which compiler optimizations are enabled within that compiler.
Also, while some compiler optimization passes operate purely on the AST built from the parser or operate purely on the compiler’s intermediate representation, there can also be optimization passes that take into account the underlying target instruction set and choose to carry out different optimizations depending on what’s available in the target processor architecture.
These architecture-specific optimizations mean that even the same floating point source code compiled using the same compiler can still produce different results on different processor architectures!
Architecture-specific optimizations are one reason why floating point results on x86-64 versus arm64 can be subtly different.
Also, another fun fact: the C++ specification doesn’t actually specify a binary representation for floating point numbers, so in principle a C++ compiler could outright ignore IEEE754 and use something else entirely, although in practice this is basically never the case since all modern compilers like GCC, Clang, and MSVC use IEEE754 floats.</p>
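<p>Here is a tiny, self-contained example of the non-associativity problem described above; mathematically the two expressions below are identical, but evaluated as written in 32-bit floats they produce different answers, which is exactly why a compiler that reorders floating point math (or is told that it may, via flags like <code class="language-plaintext highlighter-rouge">-ffast-math</code>) can change a program’s results:</p>
<pre><code class="language-cpp">#include &lt;cstdio&gt;

int main() {
    float a = 1e20f;
    float b = -1e20f;
    float c = 1.0f;
    // (a + b) + c: a and b cancel exactly, leaving 1.0.
    float groupedLeft = (a + b) + c;
    // a + (b + c): c is far smaller than one ulp of b, so b + c rounds back to b,
    // and then a and b cancel, leaving 0.0.
    float groupedRight = a + (b + c);
    std::printf("(a + b) + c = %f\n", groupedLeft);  // prints 1.000000
    std::printf("a + (b + c) = %f\n", groupedRight); // prints 0.000000
    return 0;
}
</code></pre>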
<p>The second reason floating point math is so hard to reproduce exactly across different systems is in how floating point math is implemented in the processor itself.
Differences at this level is a huge source of floating point differences between x86-64 and arm64.
In both x86-64 and arm64, at the assembly level individual arithmetic instructions such as add, subtract, multiply, and divide all adhere strictly to the IEEE754 standard.
However, the IEEE754 standard is itself… surprisingly loosely specified in some areas!
For example, the IEEE754 standard specifies that intermediate results should be as precise as possible, but this means that two different implementations of a floating point addition instruction, both adhering to IEEE754, can actually produce different results for the same input <em>if they use different levels of increased precision internally</em>.
Here’s a bit of a deprecated example that is still useful to know for historical reasons: everyone knows that an IEEE754 floating point number is 32 bits, but older 32-bit x86 specifies that internal calculations be done using <em>80-bit precision</em>, which is a holdover from the <a href="https://en.wikipedia.org/wiki/Intel_8087">Intel 8087</a> math coprocessor.
Every x86 (and by extension x86-64) processor when using x87 FPU instructions actually does floating point math using 80 bit internal precision and then rounds back down to 32 bit floats in hardware; the 80 bit internal representation is known as the <a href="https://en.wikipedia.org/wiki/Extended_precision#x86_extended_precision_format">x86 extended precision format</a>.
But even within <em>the same</em> x86 processor, we can still get different floating point results depending on whether the compiler has output x87 FPU instructions or SSE instructions; SSE stays within 32 bits at all times, which means SSE and x87 on the same processor doing the same floating point math aren’t guaranteed to produce the exact same answer.
Of course, modern x86-64 generally uses SSE for floating point math instead of x87, but different amounts of precision truncation can still happen depending on what order values are loaded into SSE registers and back into other non-SSE registers.
Furthermore, SSE is sufficiently under-specified that the actual implementation details can differ, which is why the same SSE floating point instructions can produce different results on Intel versus AMD processors.
Similarly, the ARM architecture doesn’t actually specify a particular FPU implementation at all; the internals of the FPU are left up to each processor designer; for example, the VFP/NEON floating point units that ship on the Raspberry Pi 4B’s Cortex-A72-based CPU use up to 64 bits of internal precision <a href="https://embeddedartistry.com/blog/2017/10/11/demystifying-arm-floating-point-compiler-options/">[Johnston 2020]</a>.
So, while the x87, SSE on Intel, SSE on AMD, and VFP/NEON FPU implementations are IEEE754-compliant, because of their internal maximum precision differences they can still all produce different results from each other.
There are many more examples of areas where IEEE754 leaves in wiggle room for different implementations to do different things <a href="https://www.appinf.com/download/FPIssues.pdf">[Obiltschnig 2006]</a>, and in practice different CPUs do use this wiggle room to do things differently from each other.
For example, this wiggle room is why, for floating point operations at the extreme ends of the IEEE754 float range, Intel’s x86-64 versus AMD’s x86-64 versus arm64 can produce results with minor differences from each other in the last bits of the mantissa.</p>
<p>Finally, the third reason floating point math can vary across different systems is because of transcendental functions such as sine and cosine.
Transcendental functions like sine and cosine have exact, precise mathematical definitions, but unfortunately these precise mathematical definitions can’t be implemented exactly in hardware.
Think back to high school trigonometry; the exact answer for a given input to functions like sine and cosine has to be determined using something like a <a href="https://en.wikipedia.org/wiki/Taylor_series">Taylor series</a>, but actually implementing a Taylor series in hardware is not at all practical nor performant.
Instead, modern processors typically use some form of a <a href="https://en.wikipedia.org/wiki/CORDIC">CORDIC algorithm</a> to approximate functions like sine and cosine, often to reasonably high levels of accuracy.
However, the level of precision to which any given processor approximates sine and cosine is completely unspecified by either IEEE754 or any language standard; as a result, these approximations can and do vary widely between different hardware implementations on different processors!
However, how much this reason actually matters in practice is complicated and compiler/language dependent.
As an example using cosine, the standard library could choose to implement cosine in software using a variety of different methods, or it could just pass through to the hardware’s cosine implementation.
To illustrate how much the actual execution path depends on the compiler: I originally wanted to include a simple small example using cosine that you, the reader, could go and compile and run yourself on an x86-64 machine and then on an arm64 machine to see the difference, but I had so much difficulty convincing different compilers on different platforms to reliably compile the cosine function (even using intrinsics like <code class="language-plaintext highlighter-rouge">__builtin_cos</code>!) down to a hardware instruction that I wound up having to abandon the idea.</p>
<p>One of the things that makes all of the above even more difficult to reason about is that which specific factors are applicable at any given moment depends heavily on what the compiler is doing, what compiler flags are in use, and what the compiler’s defaults are.
Actually getting floating point determinism across different systems is a notoriously difficult problem <a href="https://gafferongames.com/post/floating_point_determinism/">[Fiedler 2010]</a> that volumes of stuff has been written about!
On top of that, while in principle getting floating point code to produce consistent results across many different systems is possible (hard, but possible) by disabling compiler optimizations and by relying entirely on software implementations of floating point operations to ensure strict, identical IEEE754 compliance on all systems, actually doing all of the above comes with major trade-offs.
The biggest trade-off is simply performance: all of the changes necessary to make floating point code consistent across different systems (and especially across different processor architectures like x86-64 versus arm64) will likely also make the floating point code considerably slower.</p>
<p>All of the above reasons mean that modern usage of floating point code basically falls into three categories.
The first category is: just don’t use floating point code at all.
Included in this first category are applications that require absolute precision and absolute consistency and determinism across all implementations; examples are banking and financial industry code, which tend to store monetary values entirely using only integers.
The second category is applications that absolutely must use floats but also must ensure absolute consistency; good examples of applications in this category are high-end scientific simulations that run on supercomputers.
For applications in this second category, the difficult work and the performance sacrifices that have to be made in favor of consistency are absolutely worthwhile.
Also, tools do exist that can help with ensuring floating point consistency; for example, <a href="https://herbie.uwplse.org">Herbie</a> is a tool that can detect potentially inaccurate floating point expressions and suggest more accurate replacements.
The last category is applications where the requirement for consistency is not necessarily absolute, and the requirement for performance may weigh heavier.
This is the space that things like game engines and renderers and stuff live in, and here the trade-offs become more nuanced and situation-dependent.
A single-player game may choose absolute performance over any kind of cross-platform guaranteed floating point consistency, whereas a multi-player multi-platform game may choose to sacrifice some performance in order to guarantee that physics and gameplay calculations produce the same result for all players regardless of platform.</p>
<p>Takua Renderer lives squarely in the third category, and historically the point in the trade-off space that I’ve chosen for Takua Renderer is to favor performance over cross-platform floating point consistency.
I have a couple of reasons for choosing this trade-off, some of which are good and some of which are… just laziness, I guess!
As a hobby renderer, I’ve never had shipping Takua as a public release in any form in mind, and so consistency across many platforms has never really mattered to me.
I know exactly which systems Takua will be run on, because I’m the only one running Takua on anything, and to me having Takua run slightly faster at the cost of minor noise differences on different platforms seems worthwhile.
As long as Takua is converging to the correct image, I’m happy, and for my purposes, I consider converged images that are perceptually indistinguishable when compared with a known correct reference to also be correct.
I do keep determinism within the same platform as a major priority though, since determinism within each platform is important for being able to reliably reproduce bugs and is important for being able to reason about what’s going on in the renderer.</p>
<p>Here is a concrete example of the noise differences I get on x86-64 versus on arm64.
This scene is the iced tea scene I originally created for my <a href="https://blog.yiningkarlli.com/2019/05/nested-dielectrics.html">Nested Dielectrics</a> post; I picked this scene for this comparison purely because it has a small memory footprint and therefore fits in the relatively constrained 4 GB memory footprint of my Raspberry Pi 4B, while also being slightly more interesting than a Cornell Box.
Here is a comparison of a single sample-per-pixel render using bidirectional path tracing on a dual-socket Xeon E5-2680 x86-64 system versus on a Raspberry Pi 4B with a Cortex-A72 based arm64 processor.
The scene actually appears somewhat noisier than it normally would be coming out of Takua renderer because for this demonstration, I disabled low-discrepancy sampling and had the renderer fall back to purely random <a href="https://www.pcg-random.org/index.html">PCG-based</a> sample sequences, with the goal of trying to produce more noticeable noise differences:</p>
<div class="embed-container">
<iframe src="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 2: A single-spp render demonstrating noise pattern differences between x86-64 (left) versus arm64 (right). Differences are most noticeable on rim of the cup, especially on the left near the handle. For a full screen comparison, <a href="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison.html">click here.</a></div>
<p>The noise differences are actually relatively minimal!
The most noticeable noise differences are on the rim of the cup; note the left of the rim near the handle.
Since the noise differences can be fairly difficult to see in the full render on a small screen, here is a 2x zoomed-in crop:</p>
<div class="embed-container">
<iframe src="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_crop_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 3: A zoomed-in crop of Figure 2 showing noise pattern differences between x86-64 (left) versus arm64 (right). For a full screen comparison, <a href="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_crop.html">click here.</a></div>
<p>The differences are still kind of hard to see even in the zoomed-in crop!
So, here’s the absolute difference between the x86-64 and arm64 renders, created by just subtracting the images from each other and taking the absolute value of the difference at each pixel.
Black pixels indicate pixels where the absolute difference is zero (or at least, so close to zero so as to be completely imperceptible).
Brighter pixels indicate greater differences between the x86-64 and arm64 renders; from where the bright pixels are, we can see that most of the differences occur on the rim of the cup, on ice cubes in the cup, and in random places mostly in the caustics cast by the cup.
There’s also a faint horizontal line of small differences across the background; that area lines up with where the seamless white cyclorama backdrop starts to curve upwards:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/takua-on-arm-pt1/noise_difference.png"><img src="https://blog.yiningkarlli.com/content/images/2021/May/takua-on-arm-pt1/noise_difference.png" alt="Figure 4: Absolute difference between the x86-64 and arm64 renders from Figure 2. Black indicates identical pixels, while brighter values indicate greater differences in pixel values between x86-64 and arm64." /></a></p>
<p>Understanding why the areas with the highest differences are where they are requires thinking about how light transport is functioning in this specific scene and how differences in floating point calculations impact that light transport.
This scene is lit fairly simply; the only light sources are two rect lights and a skydome.
Basically everything is illuminated through direct lighting, meaning that for most areas of the scene, a ray starting from the camera is directly hitting the diffuse background cyclorama and then sampling a light source, and a ray starting from the light is directly hitting the diffuse background cyclorama and then immediately sampling the camera lens.
So, even with bidirectional path tracing, the total path length for a lot of the scene is just two path segments, or one bounce.
That’s not a lot of path length over which differences in floating point calculations can accumulate.
On the flip side, most of the areas with the greatest differences are areas where a lot of paths pass through the glass tea cup.
For paths that go through the glass tea cup, the path lengths can be very long, especially if a path gets caught in total internal reflection within the glass walls of the cup.
As the path lengths get longer, the floating point calculation differences at each bounce accumulate until the entire path begins to diverge significantly between the x86-64 and arm64 versions of the render.
Fortunately, these differences basically eventually “integrate out” thanks to the magic of Monte Carlo integration; by the time the renders are near converged, the x86-64 and arm64 results are basically perceptually indistinguishable from each other:</p>
<div class="embed-container">
<iframe src="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_nearconverged_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 5: The same cup scene from Figure 1, but now much closer to convergence (2048 spp), rendered using x86-64 (left) and arm64 (right). Note how differences between the x86-64 and arm64 renders are now basically imperceptible to the eye; these are in fact two different images! For a full screen comparison, <a href="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_nearconverged.html">click here.</a></div>
<p>Below is the absolute difference between the two images above.
To the naked eye the absolute difference image looks completely black, because the differences between the two images are so small that they’re basically below the threshold of normal perception.
So, to confirm that there are in fact differences, I’ve also included below a version of the absolute difference exposed up 10 stops, or made 1024 times brighter.
Much like in the single spp renders in Figure 2, the areas of greatest difference are in the areas where the path lengths are the longest, which in this scene are areas where paths refract through the glass cup, the tea, and the ice cubes.
The differences between individual paths for the same sample across x86-64 and arm64 simply become tiny to the point of insignificance once averaged across 2048 samples-per-pixel:</p>
<div class="embed-container">
<iframe src="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_diff_nearconverged_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 6: Left: Absolute difference between the x86-64 and arm64 renders from Figure 2. Right: Since the absolute difference image basically looks completely black to the eye, I've also included a version of the absolute difference exposed up 10 stops (made 1024 times brighter) to make the differences more visible. For a full screen comparison, <a href="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_diff_nearconverged.html">click here.</a></div>
<p>For many extremely precise scientific applications, the level of differences above would still likely be unacceptable, but for our purposes in just making pretty pictures, I’ll call this good enough!
In fact, many rendering teams only target perceptually indistinguishable for the purposes of calling things deterministic enough, as opposed to aiming for absolute binary-level determinism; great examples include Pixar’s RenderMan XPU, Disney Animation’s Hyperion, and DreamWorks Animation’s MoonRay.</p>
<p>Eventually maybe I’ll get around to putting more work into trying to get Takua Renderer’s per-path results to be completely consistent even across different systems and processor architectures and compilers, but for the time being I’m fine with keeping that goal as a fairly low priority relative to everything else I want to work on, because as you can see, once the renders are converged, the difference doesn’t really matter!
Floating point calculations accounted for most of the differences I was finding when comparing renders on x86-64 versus renders on arm64, but only most.
The remaining source of differences turned out… to be an actual bug!</p>
<p><strong>Weak Memory Ordering in arm64 and Atomic Bugs in Takua</strong></p>
<p>Multithreaded programming with atomics and locks has a reputation for being one of the relatively more challenging skills for programmers to master, and for good reason.
Since different processor architectures often have different semantics and guarantees and rules around multithreading-related things like memory reordering, porting between different architectures is often a great way to expose subtle multithreading bugs.
The remaining source of major differences between the x86-64 and arm64 renders I was getting turned out to be caused by a memory reordering-related bug in some old multithreading code that I wrote a long time ago and forgot about.</p>
<p>In addition to outputting the main render, Takua Renderer is also able to generate some additional render outputs, including some useful diagnostic images.
One of the diagnostic render outputs is a sample heatmap, which shows how many pixel samples were used for each pixel in the image.
I originally added the sample heatmap render output to Takua when I was <a href="https://blog.yiningkarlli.com/2015/03/adaptive-sampling.html">implementing adaptive sampling</a>, and since then the sample heatmap render output has been a useful tool for understanding how much time Takua is spending on different parts of the image.
The sample heatmap render output has also served as a simple sanity check that Takua’s multithreaded work dispatching system is functioning correctly.
For a render where the adaptive sampler is disabled, the sample heatmap should contain exactly the same value for every single pixel in the entire image, since without adaptive sampling, every pixel is just being rendered to the target samples-per-pixel of the entire render.
So, in some of my tests, I have the renderer scripted to always output the sample heatmap, and the test system checks that the sample heatmap is completely uniform after the render as a sanity check to make sure that the renderer has rendered everything that it was supposed to.
To my surprise, sometimes on arm64, a test would fail because the sample heatmap for a render without adaptive sampling would come back as nonuniform!
Specifically, the sample heatmap would come back indicating that some pixels had received one fewer sample than the total target sample-per-pixel count across the whole render.
These pixels were always in square blocks corresponding to a specific tile, or multithreaded work dispatch unit.
The specific bug was in how Takua Renderer dispatches rendering work to each thread; to provide the relevant context and explain what I mean by a “tile”, I’ll first have to quickly describe how Takua Renderer is multithreaded.</p>
<p>In university computer graphics courses, path tracing is often taught as being trivially simple to parallelize: since a path tracer traces individual paths in a depth-first fashion, individual paths don’t have dependencies on other paths, so just assign each path that has to be traced to a separate thread.
The easiest way to implement this simple parallelization scheme is to just run a <code class="language-plaintext highlighter-rouge">parallel_for</code> loop over all of the paths that need to be traced for a given set of samples, and to just repeat this for each set of samples until the render is complete.
However, in reality, parallelizing a modern production-grade path tracing renderer is often not as simple as the classic “embarrassingly parallel” approach.
Modern advanced path tracers often are written to take into account factors such as cache coherency, memory access patterns and memory locality, NUMA awareness, optimal SIMD utilization, and more.
Also, advanced path tracers often make use of various complex data structures such as out-of-core texture caches, photon maps, path guiding trees, and more.
Making sure that these data structures can be built, updated, and accessed on-the-fly by multiple threads simultaneously and efficiently often introduces complex lock-free data structure design problems.
On top of that, path tracers that use a wavefront or breadth-first architecture instead of a depth-first approach are far from trivial to parallelize, since various sorting and batching operations and synchronization points need to be accounted for.</p>
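<p>For reference, here’s roughly what the classic “embarrassingly parallel” scheme described above looks like when written directly with <code class="language-plaintext highlighter-rouge">tbb::parallel_for</code>; this is just an illustrative sketch (the callable is a stand-in for whatever actually traces a path and accumulates it into the framebuffer), not how Takua actually dispatches work:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <tbb/parallel_for.h>

// A sketch of the classic "embarrassingly parallel" approach: for each
// progressive pass, trace one camera path per pixel, with paths assigned to
// threads by a flat parallel_for over the whole image. TracePixel is any
// callable that traces a single path for pixel (x, y) at the given sample
// number and accumulates the result.
template <typename TracePixel>
void renderNaive(const int width, const int height, const int totalSpp,
                 TracePixel&& tracePixel) {
    for (int spp = 0; spp < totalSpp; spp++) {
        tbb::parallel_for(0, width * height, [&](const int pixelIndex) {
            const int x = pixelIndex % width;
            const int y = pixelIndex / width;
            tracePixel(x, y, spp);
        });
    }
}
</code></pre></div></div>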
<p>Even for relatively straightforward depth-first architectures like the one Takua has used for the past six years, the direct <code class="language-plaintext highlighter-rouge">parallel_for</code> approach can be improved upon in some simple ways.
Before progressive rendering became the standard modern approach, many renderers used an approach called “bucket” rendering <a href="https://www.racoon-artworks.de/cgbasics/bucket_progressive.php">[Geupel 2018]</a>, where the image plane was divided up into a bunch of small tiles, or buckets.
Each thread would be assigned a single bucket, and each thread would be responsible for rendering that bucket to completion before being assigned another bucket.
For offline, non-interactive rendering, bucket rendering often ends up being faster than just a simple <code class="language-plaintext highlighter-rouge">parallel_for</code> because bucket rendering allows for a higher degree of memory access coherency and cache coherency within each thread since each thread is always working in roughly the same area of space (at least for the first few bounces).
Even with progressive rendering as the standard approach for renderers running in an interactive mode, many (if not most) renderers still use a bucketed approach when dispatched to a renderfarm today.
For CPU path tracers today, the number of pixels that need to be rendered for a typical image is much much larger than the number of hardware threads available on the CPU.
As a result, the basic locality idea that bucket rendering utilizes also ends up being applicable to progressive, interactive rendering in CPU path tracers (for GPU path tracing though, the GPU’s completely different, wavefront-based SIMT threading model means a bit of a different approach is necessary).
RenderMan, Arnold, and Vray in interactive progressive mode all still render pixels in a bucket-like order, although instead of having each thread render all samples-per-pixel to completion in each bucket all at once, each thread just renders a single sample-per-pixel for each bucket and then the renderer loops over the entire image plane for each sample-per-pixel number.
To differentiate using buckets in a progressive mode from using buckets in a batch mode, I will refer to buckets in progressive mode as “tiles” for the rest of this post.</p>
<p>Takua Renderer also supports using a tiled approach for assigning work to individual threads.
At renderer startup, Takua precalculates a work assignment order, which can be in a tiled fashion, or can use a more naive <code class="language-plaintext highlighter-rouge">parallel_for</code> approach; the tiled mode is the default.
When using a tiled work assignment order, the specific order of tiles supports several different options; the default is a spiral starting from the center of the image.
Here’s a short screen recording demonstrating what this tiling work assignment looks like:</p>
<video autoplay="" muted="" loop="" playsinline="">
<source src="https://blog.yiningkarlli.com/content/images/2021/May/takua-on-arm-pt1/buckets.mp4" type="video/mp4" />
Your browser does not support the video tag.
</video>
<div class="figcaption">Figure 7: A short video showing Takua Renderer's tile assignment system running in spiral mode; each red outlined square represents a single tile. This video was captured on an arm64 M1 Mac Mini running macOS Big Sur instead of on a Raspberry Pi 4B because trying to screen record on a Raspberry Pi 4B while also running the renderer was not a good time. To see this video in a full window, <a href="/content/images/2021/May/takua-on-arm-pt1/buckets.mp4">click here.</a></div>
<p>As threads free up, the work assignment system hands each free thread a tile to render; each thread then renders a single sample-per-pixel for every pixel in its assigned tile and then goes back to the work assignment system to request more work.
Once the number of remaining tiles for the current samples-per-pixel number drops below the number of available threads, the work assignment system starts allowing multiple threads to team up on a single tile.
In general, the additional cache coherency and more localized memory access patterns from using a tiled approach give Takua Renderer a minimum 3% speed improvement compared to using a naive <code class="language-plaintext highlighter-rouge">parallel_for</code> to assign work to each thread; sometimes the speed improvement can be even higher if the scene is heavily dependent on things like texture cache access or reading from a photon map.</p>
<p>The reason the work assignment system actually hands out tiles one by one upon request instead of just running a <code class="language-plaintext highlighter-rouge">parallel_for</code> loop over all of the tiles is because using something like <code class="language-plaintext highlighter-rouge">tbb::parallel_for</code> means that the tiles won’t actually be rendered in the correct specified order.
Actually, Takua does have an “I don’t care what order the tiles are in” mode, which does in fact just run a <code class="language-plaintext highlighter-rouge">tbb::parallel_for</code> over all of the tiles and lets <code class="language-plaintext highlighter-rouge">tbb</code>’s underlying scheduler decide what order the tiles are dispatched in; rendering tiles in a specific order doesn’t actually matter for correctness.
However, maintaining a specific tile ordering does make user feedback a bit nicer.</p>
<p>Implementing a work dispatcher that can still maintain a specific tile ordering requires some mechanism internally to track what the next tile that should be dispatched is; Takua does so using an atomic integer inside of the work dispatcher.
This atomic is where the memory-reordering bug comes in that led to Takua occasionally dropping a single spp for a single tile on arm64.
Here’s some pseudocode for how threads are launched and how they ask the work dispatcher for tiles to render; this is highly simplified and condensed from how the actual code in Takua is written (specifically, I’ve inlined together code from both individual threads and from the work dispatcher and removed a bunch of other unrelated stuff), but preserves all of the important details necessary to illustrate the bug:</p>
<div id="listing1"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int nextTileIndex = 0;
std::atomic<bool> nextTileSoftLock(false);
tbb::parallel_for(int(0), numberOfTilesToRender, [&](int /*i*/) {
    bool gotNewTile = false;
    int tile = -1;
    while (!gotNewTile) {
        bool expected = false;
        if (nextTileSoftLock.compare_exchange_strong(expected, true, std::memory_order_relaxed)) {
            tile = nextTileIndex++;
            nextTileSoftLock.store(false, std::memory_order_relaxed);
            gotNewTile = true;
        }
    }
    if (tileIsInRange(tile)) {
        renderTile(tile);
    }
});
</code></pre></div></div>
<div class="codecaption">Listing 1: Simplified pseudocode for the not-very-good work scheduling mechanism Takua used to assign tiles to threads. This version of the scheduler resulted in tiles occasionally being missed on arm64, but not on x64-64.</div>
<p>If you remember your memory ordering rules, you already know what’s wrong with the code above; this code is really really bad!
In my defense, this code is an ancient part of Takua’s codebase; I wrote it back in college and haven’t really revisited it since, and back when I wrote it, I didn’t have the strongest grasp of memory ordering rules and how they apply to concurrent programming yet.
First off, why does this code use an atomic bool as a makeshift mutex so that multiple threads can increment a non-atomic integer, as opposed to just using an atomic integer?
Looking through the commit history, the earliest version of this code that I first prototyped (some eight years ago!) actually relied on a full-blown <code class="language-plaintext highlighter-rouge">std::mutex</code> to protect from race conditions around incrementing <code class="language-plaintext highlighter-rouge">nextTileIndex</code>; I must have prototyped this code completely single-threaded originally and then done a quick-and-dirty multithreading adaptation by just wrapping a mutex around everything, and then replaced the mutex with a cheaper atomic bool as an incredibly lazy port to a lock-free implementation instead of properly rewriting things.
I haven’t had to modify it since then because it worked well enough, so over time I must have just completely forgotten about how awful this code is.</p>
<p>Anyhow, the fix for the code above is simple enough: just replace the first <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> in line 8 with <code class="language-plaintext highlighter-rouge">std::memory_order_acquire</code> and replace the second <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> in line 10 with <code class="language-plaintext highlighter-rouge">std::memory_order_release</code>.
An even better fix though is to just outright replace the combination of an atomic bool and a non-atomic integer with a single atomic integer that gets incremented atomically, which is what I actually did.
But, going back to the original code, why exactly does using <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> produce correctly functioning code on x86-64, but produces code that occasionally drops tiles on arm64?
Well, first, why did I use <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> in the first place?
My commit comments from eight years ago indicate that I chose <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> because I thought it would compile down to something cheaper than if I had chosen some other memory ordering flag; I really didn’t understand this stuff back then!
I wasn’t entirely wrong, although not for the reasons that I thought at the time.
On x86-64, acquire and release memory order flags don’t actually change the loads and stores that the compiler emits, since x86-64’s strong memory model already guarantees acquire/release ordering in hardware (the flags do still constrain compiler reordering, though).
On arm64, using <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> instead of <code class="language-plaintext highlighter-rouge">std::memory_order_acquire</code>/<code class="language-plaintext highlighter-rouge">std::memory_order_release</code> does indeed produce simpler and faster arm64 assembly, but the simpler and faster arm64 assembly is also <em>wrong</em> for what the code is supposed to do.
Understanding why the above happens on arm64 but not on x86-64 requires understanding what a <em>weakly ordered</em> CPU is versus what a <em>strongly ordered</em> CPU is; arm64 is a weakly ordered architecture, whereas x86-64 is a strongly ordered architecture.</p>
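<p>Before getting into the weak versus strong memory model details, here’s a quick sketch of the simpler fix mentioned above: the atomic bool and non-atomic integer from Listing 1 are replaced with a single atomic integer, and each thread grabs a unique tile index using an atomic fetch-and-add. As with Listing 1, this is simplified pseudocode rather than the actual code in Takua:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>std::atomic<int> nextTileIndex(0);
tbb::parallel_for(int(0), numberOfTilesToRender, [&](int /*i*/) {
    // fetch_add atomically returns the previous value and then increments, so
    // every thread receives a unique tile index and no tile is ever skipped;
    // the default memory order here is std::memory_order_seq_cst
    const int tile = nextTileIndex.fetch_add(1);
    if (tileIsInRange(tile)) {
        renderTile(tile);
    }
});
</code></pre></div></div>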
<p>One of the best resources on diving deep into weak versus strong memory orderings is the well-known series of articles <a href="https://preshing.com">by Jeff Preshing</a> on the topic (parts <a href="https://preshing.com/20120515/memory-reordering-caught-in-the-act/">1</a>, <a href="https://preshing.com/20120612/an-introduction-to-lock-free-programming/">2</a>, <a href="https://preshing.com/20120625/memory-ordering-at-compile-time/">3</a>, <a href="https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/">4</a>, <a href="https://preshing.com/20120913/acquire-and-release-semantics/">5</a>, <a href="https://preshing.com/20120930/weak-vs-strong-memory-models/">6</a>, and <a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">7</a>).
Actually, while I was going back through the Preshing on Programming series in preparation to write this post, I noticed that by hilarious coincidence the older code in Takua represented by Listing 1, once boiled down to what it is fundamentally doing, is extremely similar to the canonical example used in Preshing on Programming’s “<a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">This Is Why They Call It a Weakly-Ordered CPU</a>” article.
If only I had read the Preshing on Programming series a year before implementing Takua’s work assignment system instead of a few years after!
I’ll do my best to quickly recap what the Preshing on Programming series covers about weak versus strong memory orderings here, but if you have not read Jeff Preshing’s articles before, I’d recommend taking some time later to do so.</p>
<p>One of the single most important things that lock-free multithreaded code needs to take into account is the potential for memory reordering.
Memory reordering is when the compiler and/or the processor decides to optimize code by changing the ordering of instructions that access and modify memory.
Memory reordering is always carried out in such a way that the behavior of a single-threaded program never changes, and multithreaded code using locks such as mutexes forces the compiler and processor to not reorder instructions across the boundaries defined by locks.
However, lock-free multithreaded code basically gives the compiler and processor free rein to do whatever they want; even though memory reordering is carried out for each individual thread in a way that keeps that thread’s apparent behavior the same as before, this rule does not take into account the interactions <em>between</em> threads, so reorderings that preserve behavior within each thread in isolation can still produce very different overall multithreaded behavior.</p>
<p>The easiest way to disable any kind of memory reordering at compile time is to just… disable all compiler optimizations.
However, in practice we never actually want to do this, because disabling compiler optimizations means all of our code will run slower (sometimes a lot slower).
Even setting performance aside, disabling compiler optimizations isn’t a complete solution, because we still need to contend with potential memory reordering at runtime from the CPU.</p>
<p>Memory reordering in multithreaded code happens on the CPU because of how CPUs access memory: modern processors have a series of caches (L1, L2, sometimes L3, etc) sitting between the actual registers in each CPU core and main memory.
Some of these cache levels (usually L1, and often L2) are private to each CPU core, and some of these cache levels (usually the last level) are shared across some or all cores.
The lower the cache level number, the faster and also smaller that cache level typically is, and the higher the cache level number, the slower and larger that cache level is.
When a CPU wants to read a particular piece of data, it will check for it in cache first, and if the value is not in cache, then the CPU must make a fetch request to main memory for the value; fetching from main memory is obviously much slower than fetching from cache.
Where these caches get tricky is how data is propagated from a given CPU core’s registers and caches back to main memory and then eventually up again into the L1 caches for other CPU cores.
This propagation can happen… whenever!
A variety of different possible implementation strategies exist for <a href="https://en.wikipedia.org/wiki/CPU_cache#Policies">when caches update from and write back to main memory</a>, with the end result being that by default we as programmers have no reliable way of guessing when data transfers between cache and main memory will happen.</p>
<p>Imagine that we have some multithreaded code written such that one thread writes, or stores, to a value, and then a little while later, another thread reads, or loads, that same value.
We would expect the store on the first thread to always precede the load on the second thread, so the second thread should always pick up whatever value the first thread wrote.
However, if we implement this code just using a normal int or float or bool or whatever, what can actually happen at runtime is our first thread writes the value to L1 cache, and then eventually the value in L1 cache gets written back to main memory.
However, before the value manages to get propagated from L1 cache back to main memory, the second thread reads the value out of main memory.
In this case, from the perspective of main memory, the second thread’s load out of main memory takes place <em>before</em> the first thread’s store has rippled back down to main memory.
This case is an example of <em>StoreLoad</em> reordering, so named because a store has been reordered with a later load.
There are also <em>LoadStore</em>, <em>LoadLoad</em>, and <em>StoreStore</em> reorderings that are possible.
Jeff Preshing’s “<a href="https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/">Memory Barriers are Like Source Control</a>” article does a great job of describing these four possible reordering scenarios in detail.</p>
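<p>To make StoreLoad reordering a bit more concrete, here’s a tiny standalone example in the spirit of the examples from those articles. With the code below, it’s entirely legal (and on real hardware occasionally observable, since StoreLoad reordering is permitted even under x86-64’s strong memory model) for <em>both</em> <code class="language-plaintext highlighter-rouge">r1</code> and <code class="language-plaintext highlighter-rouge">r2</code> to end up as 0, which could never happen if each thread’s store actually completed before the other thread’s load:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <cstdio>
#include <thread>

// A tiny StoreLoad reordering demonstration: each thread stores to one variable
// and then loads the other. With relaxed ordering, the load may effectively be
// reordered ahead of the store, so a single run can print r1 = 0, r2 = 0.
// (Whether that actually happens on any given run is timing dependent.)
int main() {
    std::atomic<int> x(0);
    std::atomic<int> y(0);
    int r1 = -1;
    int r2 = -1;
    std::thread thread1([&]() {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread thread2([&]() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    thread1.join();
    thread2.join();
    std::printf("r1 = %d, r2 = %d\n", r1, r2);
    return 0;
}
</code></pre></div></div>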
<p>Different CPU architectures make different guarantees about which types of memory reordering can and can’t happen on that particular architecture at the hardware level.
A processor that guarantees absolutely no memory reordering of any kind is said to have a <em>sequentially consistent</em> memory model.
Few, if any, modern processor architectures provide a guaranteed sequentially consistent memory model.
Some processors don’t guarantee absolutely sequential consistency, but do guarantee that at least when a CPU core makes a series of writes, other CPU cores will see those writes in the same sequence that they were made; CPUs that make this guarantee have a <em>strong</em> memory model.
Strong memory models effectively guarantee that StoreLoad reordering is the only type of reordering allowed; x86-64 has a strong memory model.
Finally, CPUs that allow for any type of memory reordering at all are said to have a <em>weak</em> memory model.
The arm64 architecture uses a weak memory model, although arm64 at least guarantees that if we read a value through a pointer, the value read will be at least as new as the pointer itself.</p>
<p>So, how can we possibly hope to be able to reason about multithreaded code when both the compiler and the processor can happily reorder our memory access instructions between threads whenever they want for whatever reason they want?
The answer is in memory barriers and fence instructions; these tools allow us to specify boundaries that the compiler cannot reorder memory access instructions across and allow us to force the CPU to make sure that values are flushed to main memory before being read.
In C++, specifying barriers and fences can be done by using compiler intrinsics that map to specific underlying assembly instructions, but the easier and more common way of doing this is by using <a href="https://en.cppreference.com/w/cpp/atomic/memory_order"><code class="language-plaintext highlighter-rouge">std::memory_order</code></a> flags in combination with atomics.
Other languages have similar concepts; for example, <a href="https://doc.rust-lang.org/nomicon/atomics.html">Rust’s atomic access flags</a> are very similar to the C++ memory ordering flags.</p>
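<p>As a tiny illustration of the two styles, the following sketch publishes a value from one thread to another first using an explicit fence and then using a memory ordering flag attached directly to the atomic store; both forms prevent the write to <code class="language-plaintext highlighter-rouge">data</code> from being reordered after the flag store:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>

int data = 0;
std::atomic<bool> ready(false);

// Style 1: an explicit fence between the data write and the flag store.
void publishWithFence() {
    data = 42;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Style 2: a memory ordering flag attached directly to the atomic store.
void publishWithFlag() {
    data = 42;
    ready.store(true, std::memory_order_release);
}

// The consumer side pairs with either publisher: an acquire load of the flag
// guarantees that if ready is seen as true, the write to data is visible too.
bool tryConsume(int& out) {
    if (ready.load(std::memory_order_acquire)) {
        out = data;
        return true;
    }
    return false;
}
</code></pre></div></div>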
<p><code class="language-plaintext highlighter-rouge">std::memory_order</code> flags specify how memory accesses for all operations surrounding an atomic are to be ordered; the impacted surrounding operations include all non-atomics.
There are a whole bunch of <code class="language-plaintext highlighter-rouge">std::memory_order</code> flags; we’ll examine the few that are relevant to the specific example in Listing 1.
The heaviest hammer of all of the flags is <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code>, which enforces absolute sequential consistency at the cost of potentially being more expensive due to potentially needing more loads and/or stores.
For example, on x86-64, <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code> is often implemented using slower <code class="language-plaintext highlighter-rouge">xchg</code> or paired <code class="language-plaintext highlighter-rouge">mov</code>/<code class="language-plaintext highlighter-rouge">mfence</code> instructions instead of a single <code class="language-plaintext highlighter-rouge">mov</code> instruction, and on arm64, the overhead is even greater due to arm64’s weak memory model.
Using <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code> also potentially disallows the CPU from reordering unrelated, longer-running instructions to start (and therefore finish) earlier, potentially causing even more slowdowns.
In C++, atomic operations default to using <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code> if no memory ordering flag is explicitly specified.
Contrast with <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code>, which is the exact opposite of <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code>.
<code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> enforces no synchronization or ordering constraints whatsoever; on an architecture like x86-64, using <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> can be faster than using <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code> if your memory ordering requirements are already met in hardware by x86-64’s strong memory model.
However, being sloppy with <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> can result in some nasty nondeterministic bugs on arm64 if your code requires specific memory ordering guarantees, due to arm64’s weak memory model.
The above is the exact reason why the code in Listing 1 occasionally resulted in dropped tiles in Takua on arm64!</p>
<p>Without any kind of memory ordering constraints, with arm64’s weak memory ordering, the code in Listing 1 can sometimes execute in such a way that one thread sets <code class="language-plaintext highlighter-rouge">nextTileSoftLock</code> to true, but another thread attempts to check <code class="language-plaintext highlighter-rouge">nextTileSoftLock</code> before the first thread’s new value propagates back to main memory and to all of the other threads.
As a result, two threads can end up in a race condition, trying to both increment the non-atomic <code class="language-plaintext highlighter-rouge">nextTileIndex</code> at the same time.
When this happens, two threads can end up working on the same tile at the same time or a tile can get skipped!
We could fix this problem by just removing the memory ordering flags entirely from Listing 1, allowing everything to default back to <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code>.
However, as just mentioned above, we can do better than using <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code> if we know specifically what memory ordering requirements we need for the code to work correctly.</p>
<p>Enter <code class="language-plaintext highlighter-rouge">std::memory_order_acquire</code> and <code class="language-plaintext highlighter-rouge">std::memory_order_release</code>, which represent <em>acquire</em> semantics and <em>release</em> semantics respectively and, when used correctly, always come in a pair.
Acquire semantics apply to load (read) operations and prevent memory reordering of the load operation with any subsequent read or write operation.
Release semantics apply to store (write) operations and prevent memory reordering of the store operation with any preceding read or write operation.
In other words, <code class="language-plaintext highlighter-rouge">std::memory_order_acquire</code> tells the compiler to issue instructions that prevent LoadLoad and LoadStore reordering from happening, and <code class="language-plaintext highlighter-rouge">std::memory_order_release</code> tells the compiler to issue instructions that prevent LoadStore and StoreStore reordering from happening.
Using acquire and release semantics allows Listing 1 to work correctly on arm64, while being ever so slightly cheaper compared with enforcing absolute sequential consistency everywhere.</p>
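<p>Concretely, the acquire/release version of Listing 1 looks like the following sketch; only the two memory ordering flags change, but that’s enough to keep the increment of <code class="language-plaintext highlighter-rouge">nextTileIndex</code> properly contained within the makeshift lock on a weakly ordered CPU:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int nextTileIndex = 0;
std::atomic<bool> nextTileSoftLock(false);
tbb::parallel_for(int(0), numberOfTilesToRender, [&](int /*i*/) {
    bool gotNewTile = false;
    int tile = -1;
    while (!gotNewTile) {
        bool expected = false;
        // acquire: nothing after this exchange can be reordered before it
        if (nextTileSoftLock.compare_exchange_strong(expected, true, std::memory_order_acquire)) {
            tile = nextTileIndex++;
            // release: nothing before this store can be reordered after it
            nextTileSoftLock.store(false, std::memory_order_release);
            gotNewTile = true;
        }
    }
    if (tileIsInRange(tile)) {
        renderTile(tile);
    }
});
</code></pre></div></div>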
<p>What is the takeaway from this long tour through memory reordering and weak and strong memory models and memory ordering constraints?
The takeaway is that when writing multithreaded code that needs to be portable across architectures with different memory ordering guarantees, such as x86-64 versus arm64, we need to be very careful with thinking about how each architecture’s memory ordering guarantees (or lack thereof) impact any lock-free cross-thread communication we need to do!
Atomic code often can be written more sloppily on x86-64 than on arm64 and still have a good chance of working, whereas arm64’s weak memory model means there’s much less room for being sloppy.
If you want a good way to smoke out potential bugs in your lock-free atomic code, porting to arm64 is a good way to find out!</p>
<p><strong>A Deep Dive on x86-64 versus arm64 Through the Lens of Compiling <code class="language-plaintext highlighter-rouge">std::atomic::compare_exchange_weak()</code></strong></p>
<p>While I was looking for the source of the memory reordering bug, I found a separate interesting bug in Takua’s atomic framebuffer… or at least, I thought it was a bug.
The thing I found turned out to not be a bug at all in the end, but at the time I thought that there was a bug in the form of a race condition in an atomic compare-and-exchange loop.
I figured that the renderer must be just running correctly <em>most</em> of the time instead of <em>all</em> of the time, but as I’ll explain in a little bit, the renderer actually provably runs correctly 100% of the time.
Understanding what was going on here led me to dive into the compiler’s assembly output, and wound up being an interesting case study in comparing how the same exact C++ source code compiles to x86-64 versus arm64.
In order to provide the context for the not-a-bug and what I learned about arm64 from it, I need to first briefly describe what Takua’s atomic framebuffer is and how it is used.</p>
<p>Takua supports multiple threads writing to the same pixel in the framebuffer at the same time.
There are two major use cases for this capability: first, integration techniques that use light tracing will connect back to the camera completely arbitrarily, resulting in splats to the framebuffer that are completely unpredictable and possibly overlapping on the same pixels.
Second, adaptive sampling techniques that redistribute sample allocation within a single iteration (meaning launching a single set of pixel samples) can result in multiple samples for the same pixel in the same iteration, which means multiple threads can be calculating paths starting from the same pixel and therefore multiple threads need to write to the same framebuffer pixel.
In order to support multiple threads writing simultaneously to the same pixel in the framebuffer, there are three possible implementation options.
The first option is to just keep a separate framebuffer per thread and merge afterwards, but this approach obviously requires potentially a huge amount of memory.
The second option is to never write to the framebuffer directly, but instead keep queues of framebuffer write requests that occasionally get flushed to the framebuffer by a dedicated worker thread (or some variation thereof).
The third option is to just make each pixel in the framebuffer support exclusive operations through atomics (a mutex per pixel works too, but obviously this would involve much more overhead and might be slower); this option is the atomic framebuffer.
I actually implemented the second option in Takua a long time ago, but the added complexity and performance impact of needing to flush the queue led me to eventually replace the whole thing with an atomic framebuffer.</p>
<p>The tricky part of implementing an atomic framebuffer in C++ is the need for atomic floats.
Obviously each pixel in the framebuffer has to store at the very least accumulated radiance values for the base RGB primaries, along with potentially other AOV values, and accumulated radiance values and many common AOVs all have to be represented with floats.
Modern C++ has standard library support for atomic types through std::atomic, and std::atomic works with floats.
However, pre-C++20, std::atomic only provides atomic arithmetic operations for integer types.
C++20 adds <code class="language-plaintext highlighter-rouge">fetch_add()</code> and <code class="language-plaintext highlighter-rouge">fetch_sub()</code> implementations for <code class="language-plaintext highlighter-rouge">std::atomic<float></code>, but I wrote Takua’s atomic framebuffer way back when C++11 was still the latest standard.
So, pre-C++20, if you want atomic arithmetic operations for <code class="language-plaintext highlighter-rouge">std::atomic<float></code>, you have to implement it yourself.
Fortunately, pre-C++20 does provide <code class="language-plaintext highlighter-rouge">compare_and_exchange()</code> implementations for all atomic types, and that’s all we need to implement everything else we need ourselves.</p>
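<p>(As a quick aside: on a C++20 toolchain, none of the machinery below is needed for simple accumulation, since <code class="language-plaintext highlighter-rouge">std::atomic<float></code> gains a built-in <code class="language-plaintext highlighter-rouge">fetch_add()</code>:)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>

// On C++20 and later, std::atomic<float> provides fetch_add() directly, so
// atomically accumulating radiance into a framebuffer channel is a one-liner.
void accumulateRadiance(std::atomic<float>& pixelChannel, const float radiance) {
    pixelChannel.fetch_add(radiance);
}
</code></pre></div></div>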
<p>Implementing <code class="language-plaintext highlighter-rouge">fetch_add()</code> for atomic floats is fairly straightforward.
Let’s say we want to add a value <code class="language-plaintext highlighter-rouge">f1</code> to an atomic float <code class="language-plaintext highlighter-rouge">f0</code>.
The basic idea is to do an atomic load from <code class="language-plaintext highlighter-rouge">f0</code> into some temporary variable <code class="language-plaintext highlighter-rouge">oldval</code>.
A standard <code class="language-plaintext highlighter-rouge">compare_and_exchange()</code> implementation compares some input value with the current value of the atomic float, and if the two are equal, replaces the current value of the atomic float with a second input value; C++ provides implementations in the form of <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> and <code class="language-plaintext highlighter-rouge">compare_exchange_strong()</code>.
So, all we need to do is run <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> on <code class="language-plaintext highlighter-rouge">f0</code> where the value we use for the comparison test is <code class="language-plaintext highlighter-rouge">oldval</code> and the replacement value is <code class="language-plaintext highlighter-rouge">oldval + f1</code>; if <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> succeeds, we return <code class="language-plaintext highlighter-rouge">oldval</code>, otherwise, loop and repeat until <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> succeeds.
Here’s an example implementation:</p>
<div id="listing2"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    do {
        float oldval = f0.load();
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}
</code></pre></div></div>
<div class="codecaption">Listing 2: Example implementation of atomic float addition.</div>
<p>Seeing why the above implementation works should be very straightforward: imagine two threads are calling the above implementation at the same time.
We want each thread to reload the atomic float on each iteration because we never want a situation where a first thread loads from <code class="language-plaintext highlighter-rouge">f0</code>, a second thread succeeds in adding to <code class="language-plaintext highlighter-rouge">f0</code>, and then the first thread also succeeds in writing its value to <code class="language-plaintext highlighter-rouge">f0</code>, because upon the first thread writing, the value of <code class="language-plaintext highlighter-rouge">f0</code> that the first thread used for the addition operation is out of date!</p>
<p>Well, here’s the implementation that has actually been in Takua’s atomic framebuffer implementation for most of the past decade.
This implementation is very similar to Listing 2, except that Lines 2 and 3 are swapped from where they should be; I likely swapped these two lines through a simple copy/paste error or something when I originally wrote it.
This is the implementation that I suspected was a bug upon revisiting it during the arm64 porting process:</p>
<div id="listing3"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    float oldval = f0.load();
    do {
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}
</code></pre></div></div>
<div class="codecaption">Listing 3: What I thought was an incorrect implementation of atomic float addition.</div>
<p>In the Listing 3 implementation, note how the atomic load of <code class="language-plaintext highlighter-rouge">f0</code> only ever happens once outside of the loop.
The following is what I thought was going on and why, at the time, I thought this implementation was wrong:
Think about what happens if a first thread loads from <code class="language-plaintext highlighter-rouge">f0</code> and then a second thread’s call to <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> succeeds before the first thread gets to <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code>; in this race condition scenario, the first thread should get stuck in an infinite loop.
Since the value of <code class="language-plaintext highlighter-rouge">f0</code> has now been updated by the second thread, but the first thread never reloads the value of <code class="language-plaintext highlighter-rouge">f0</code> inside of the loop, the first thread <em>should have no way of ever succeeding at the</em> <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> <em>call</em>!
However, in reality, with the Listing 3 implementation, Takua never actually gets stuck in an infinite loop, even when multiple threads are writing to the same pixel in the atomic framebuffer.
I initially thought that I must have just been getting really lucky every time and multiple threads, while attempting to accumulate to the same pixel, just never happened to produce the specific <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> call ordering that would cause the race condition and infinite loop.
But then I repeatedly tried a simple test where I had 32 threads simultaneously call <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> for the same atomic float a million times per thread, and… still an infinite loop never occurred.
So, the situation appeared to be that what I thought was <em>incorrect code</em> was always behaving as if it had been written <em>correctly</em>, and furthermore, this held true on both x86-64 <em>and</em> on arm64, across both compiling with Clang on macOS and compiling with GCC on Linux.</p>
<p>If you are well-versed in the C++ specifications, you already know which crucial detail I had forgotten that explains why Listing 3 is actually completely correct and functionally equivalent to Listing 2.
Under the hood, <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak(T& expected, T desired)</code> requires doing an atomic load of the target value in order to compare the target value with <code class="language-plaintext highlighter-rouge">expected</code>.
What I had forgotten was that if the comparison fails, <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak()</code> doesn’t just return a false bool; the function <em>also replaces</em> <code class="language-plaintext highlighter-rouge">expected</code> with the result of the atomic load on the target value!
So really, there isn’t only a single atomic load of <code class="language-plaintext highlighter-rouge">f0</code> in Listing 3; there’s actually an atomic load of <code class="language-plaintext highlighter-rouge">f0</code> in every loop as part of <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code>, and in the event that the comparison fails, the equivalent of <code class="language-plaintext highlighter-rouge">oldval = f0.load()</code> happens.
Of course, I didn’t actually correctly remember what <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> does in the comparison failure case, and I stupidly didn’t double check <a href="https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange">cppreference</a>, so it took me much longer to figure out what was going on.</p>
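<p>Here’s a tiny standalone snippet that demonstrates the detail I had forgotten; I’m using <code class="language-plaintext highlighter-rouge">compare_exchange_strong()</code> here just so the failure is guaranteed to come from the value mismatch rather than from a spurious failure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <cassert>

// On comparison failure, compare_exchange_weak()/compare_exchange_strong()
// write the atomic's current value back into "expected" instead of just
// returning false.
int main() {
    std::atomic<float> f0(5.0f);
    float expected = 1.0f; // deliberately wrong; f0 actually holds 5.0f
    const bool succeeded = f0.compare_exchange_strong(expected, 2.0f);
    assert(!succeeded);        // the comparison failed...
    assert(expected == 5.0f);  // ...and expected was replaced with f0's value
    assert(f0.load() == 5.0f); // ...while f0 itself was left unchanged
    return 0;
}
</code></pre></div></div>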
<p>So, still missing the key piece of knowledge that I had forgotten and assuming that <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> didn’t modify any inputs upon comparison failure, my initial guess was that perhaps the compiler was inlining <code class="language-plaintext highlighter-rouge">f0.load()</code> wherever <code class="language-plaintext highlighter-rouge">oldval</code> was being used as an optimization, which would produce a result that should prevent the race condition from ever happening.
However, after a bit more thought, I concluded that this optimization was very unlikely, since it both changes the written semantics of what the code should be doing by effectively moving an operation from outside a loop to the inside of the loop, and also inlining <code class="language-plaintext highlighter-rouge">f0.load()</code> wherever <code class="language-plaintext highlighter-rouge">oldval</code> is used is not actually a safe code transformation and can produce a different result from the originally written code, since having two atomic loads from <code class="language-plaintext highlighter-rouge">f0</code> introduces the possibility that another thread can do an atomic write to <code class="language-plaintext highlighter-rouge">f0</code> in between the current thread’s two atomic loads.</p>
<p>Things got even more interesting when I tried adding in an additional bit of indirection around the atomic load of <code class="language-plaintext highlighter-rouge">f0</code> into <code class="language-plaintext highlighter-rouge">oldval</code>.
Here is an actually incorrect implementation that I thought should be functionally equivalent to the implementation in Listing 3:</p>
<div id="listing4"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    const float oldvaltemp = f0.load();
    do {
        float oldval = oldvaltemp;
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}
</code></pre></div></div>
<div class="codecaption">Listing 4: An actually incorrect implementation of atomic float addition that might appear to be semantically identical to the implementation in Listing 3 if you've forgotten a certain very important detail about std::compare_exchange_weak().</div>
<p>Creating the race condition and subsequent infinite loop is extremely easy and reliable with Listing 4.
So, to summarize where I was at this point: Listing 2 is a correctly written implementation that produces a correct result in reality, Listing 4 is an incorrectly written implementation that, as expected, produces an incorrect result in reality, and Listing 3 is what I thought was an incorrectly written implementation that I thought was <em>semantically identical</em> to Listing 4, but actually produces the same correct result in reality as Listing 2!</p>
<p>So, left with no better ideas, I decided to just go look directly at the compiler’s output assembly.
To make things a bit easier, we’ll look at and compare the x86-64 assembly for the Listing 2 and Listing 3 C++ implementations first, and explain what important detail I had missed that led me down this wild goose chase.
Then, we’ll look at and compare the arm64 assembly, and we’ll discuss some interesting things I learned along the way by comparing the x86-64 and arm64 assembly for the same C++ function.</p>
<p>Here is the corresponding x86-64 assembly for the correct C++ implementation in Listing 2, compiled with Clang 10.0.0 using -O3.
For readers who are not very used to reading assembly, I’ve included annotations as comments in the assembly code to describe what the assembly code is doing and how it corresponds back to the original C++ code:</p>
<div id="listing5"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float): # f0 is dword ptr [rdi], f1 is xmm0
.LBB0_1:
mov eax, dword ptr [rdi] # eax = *arg0 = f0.load()
movd xmm1, eax # xmm1 = eax = f0.load()
movdqa xmm2, xmm1 # xmm2 = xmm1 = eax = f0.load()
addss xmm2, xmm0 # xmm2 = (xmm2 + xmm0) = (f0 + f1)
movd ecx, xmm2 # ecx = xmm2 = (f0 + f1)
lock cmpxchg dword ptr [rdi], ecx # if eax == *arg0 { ZF = 1; *arg0 = arg1 }
# else { ZF = 0; eax = *arg0 };
# "lock" means all done exclusively
jne .LBB0_1 # if ZF == 0 goto .LBB0_1
movdqa xmm0, xmm1 # return f0 value from before cmpxchg
ret
</code></pre></div></div>
<div class="codecaption">Listing 5: x86-64 assembly corresponding to the implementation in Listing 2, with my annotations in the comments. Compiled using armv8-a Clang 10.0.0 using -O3. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:9,endLineNumber:5,positionColumn:9,positionLineNumber:5,selectionStartColumn:9,selectionStartLineNumber:5,startColumn:9,startLineNumber:5),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++do+%7B%0A++++++++float+oldval+%3D+f0.load()%3B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'x86-64+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>Here is the corresponding x86-64 assembly for the C++ implementation in Listing 3; again, this is the version that produces the same correct result as Listing 2.
Just like with Listing 5, this was compiled using Clang 10.0.0 using -O3, and descriptive annotations are in the comments:</p>
<div id="listing6"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float): # f0 is dword ptr [rdi], f1 is xmm0
mov eax, dword ptr [rdi] # eax = *arg0 = f0.load()
.LBB0_1:
movd xmm1, eax # xmm1 = eax = f0.load()
movdqa xmm2, xmm1 # xmm2 = xmm1 = eax = f0.load()
addss xmm2, xmm0 # xmm2 = (xmm2 + xmm0) = (f0 + f1)
movd ecx, xmm2 # ecx = xmm2 = (f0 + f1)
lock cmpxchg dword ptr [rdi], ecx # if eax == *arg0 { ZF = 1; *arg0 = arg1 }
# else { ZF = 0; eax = *arg0 };
# "lock" means all done exclusively
jne .LBB0_1 # if ZF == 0 goto .LBB0_1
movdqa xmm0, xmm1 # return f0 value from before cmpxchg
ret
</code></pre></div></div>
<div class="codecaption">Listing 6: x86-64 assembly corresponding to the implementation in Listing 3, with my annotations in the comments. Compiled using armv8-a Clang 10.0.0 using -O3. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:1,endLineNumber:3,positionColumn:1,positionLineNumber:3,selectionStartColumn:1,selectionStartLineNumber:3,startColumn:1,startLineNumber:3),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++float+oldval+%3D+f0.load()%3B%0A++++do+%7B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'x86-64+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>The compiled x86-64 assembly in Listing 5 and Listing 6 is almost identical; the only difference is that in Listing 5, copying data from the address stored in register <code class="language-plaintext highlighter-rouge">rdi</code> to register <code class="language-plaintext highlighter-rouge">eax</code> happens after label <code class="language-plaintext highlighter-rouge">.LBB0_1</code> and in Listing 6 the copy happens before label <code class="language-plaintext highlighter-rouge">.LBB0_1</code>.
Comparing the x86-64 assembly with the C++ code, we can see that this difference corresponds directly to where <code class="language-plaintext highlighter-rouge">f0</code>’s value is atomically loaded into <code class="language-plaintext highlighter-rouge">oldval</code>.
We can also see that <code class="language-plaintext highlighter-rouge">std::atomic<float>::compare_exchange_weak()</code> compiles down to a single <code class="language-plaintext highlighter-rouge">cmpxchg</code> instruction, which, as the instruction name suggests, is a compare-and-exchange operation.
The <code class="language-plaintext highlighter-rouge">lock</code> instruction prefix in front of <code class="language-plaintext highlighter-rouge">cmpxchg</code> ensures that the current CPU core has exclusive ownership of the corresponding cache line for the duration of the <code class="language-plaintext highlighter-rouge">cmpxchg</code> operation, which is how the operation is made atomic.</p>
<p>This is the point where I eventually realized what I had missed.
I didn’t notice immediately; in fact, figuring out what I had missed took me several more days!
The thing that finally made me realize what I had missed and made me understand why Listing 3 / Listing 6 don’t actually result in an infinite loop and instead match the behavior of Listing 2 / Listing 5 lies in <code class="language-plaintext highlighter-rouge">cmpxchg</code>.
Let’s take a look at the official <a href="https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4.html">Intel 64 and IA-32 Architectures Software Developer’s Manual</a>’s description <a href="https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4.html">[Intel 2021]</a> of what <code class="language-plaintext highlighter-rouge">cmpxchg</code> does:</p>
<blockquote>
<p>Compares the value in the AL, AX, EAX, or RAX register with the first operand (destination operand). If the two values are equal, the second operand (source operand) is loaded into the destination operand. Otherwise, the destination operand is loaded into the AL, AX, EAX or RAX register. RAX register is available only in 64-bit mode.</p>
<p>This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)</p>
</blockquote>
<p>If the compare part of <code class="language-plaintext highlighter-rouge">cmpxchg</code> fails, <em>the first operand is loaded into the EAX register</em>!
After thinking about this property of <code class="language-plaintext highlighter-rouge">cmpxchg</code> for a bit, I finally had my head-smack moment and remembered that <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak(T& expected, T desired)</code> replaces <code class="language-plaintext highlighter-rouge">expected</code> with the result of the atomic load in the event of comparison failure.
This property of <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak()</code> is why <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak()</code> can be compiled down to a single <code class="language-plaintext highlighter-rouge">cmpxchg</code> instruction on x86-64 in the first place.
We can actually see the compiler being clever here in Listing 6 and exploiting the fact that <code class="language-plaintext highlighter-rouge">cmpxchg</code> comparison failure mode writes into the <code class="language-plaintext highlighter-rouge">eax</code> register: the compiler chooses to use <code class="language-plaintext highlighter-rouge">eax</code> as the target for the <code class="language-plaintext highlighter-rouge">mov</code> instruction in Line 1 instead of using some other register so that a second move from <code class="language-plaintext highlighter-rouge">eax</code> into some other register isn’t necessary after <code class="language-plaintext highlighter-rouge">cmpxchg</code>.
If anything, the implementation in Listing 3 / Listing 6 is actually slightly <em>more</em> efficient than the implementation in Listing 2 / Listing 5, since there is one fewer <code class="language-plaintext highlighter-rouge">mov</code> instruction needed in the loop.</p>
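<p>To make this behavior concrete, here is a minimal standalone C++ snippet (illustrative only, not from Takua) showing the property in question: when the comparison fails, <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> overwrites <code class="language-plaintext highlighter-rouge">expected</code> with the value it actually observed, which is exactly why the explicit reload inside the loop is unnecessary:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <iostream>

int main() {
    std::atomic<int> value(5);
    int expected = 3;                // deliberately stale; does not match value
    // The exchange fails because expected != value, but on failure
    // compare_exchange_weak() writes the observed value of `value`
    // back into `expected`.
    bool succeeded = value.compare_exchange_weak(expected, 10);
    std::cout << "succeeded: " << succeeded        // prints 0
              << ", expected is now: " << expected // prints 5
              << std::endl;
    // Because `expected` now holds the up-to-date value, a retry loop can
    // simply recompute the desired value and try again without reloading.
    return 0;
}
</code></pre></div></div>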
<p>So what does this have to do with learning about arm64?
Well, while I was in the process of looking at the x86-64 assembly to try to understand what was going on, I also tried the implementation in Listing 3 on my Raspberry Pi 4B just to sanity check if things worked the same on arm64.
At that point I hadn’t realized that the code in Listing 3 was actually correct yet, so I was beginning to consider possibilities like a compiler bug or weird platform-specific considerations that I hadn’t thought of, so to rule those more exotic explanations out, I decided to see if the code worked the same on x86-64 and arm64.
Of course the code worked exactly the same on both, so the next step was to also examine the arm64 assembly in addition to the x86-64 assembly.
Comparing the same code’s corresponding assembly for x86-64 and arm64 at the same time proved to be a very interesting exercise in getting to better understand some low-level and general differences between the two instruction sets.</p>
<p>Here is the corresponding arm64 assembly for the implementation in Listing 2; this is the arm64 assembly that is the direct counterpart to the x86-64 assembly in Listing 5.
This arm64 assembly was also compiled with Clang 10.0.0 using -O3.
I’ve included annotations here as well, although admittedly my arm64 assembly comprehension is not as good as my x86-64 assembly comprehension, since I’m relatively new to compiling for arm64.
If you’re well versed in arm64 assembly and see a mistake in my annotations, feel free to send me a correction!</p>
<div id="listing7"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float):
b .LBB0_2 // goto .LBB0_2
.LBB0_1:
clrex // clear this thread's record of exclusive lock
.LBB0_2:
ldar w8, [x0] // w8 = *arg0 = f0, non-atomically loaded
ldaxr w9, [x0] // w9 = *arg0 = f0.load(), atomically
// loaded (get exclusive lock on x0), with
// implicit synchronization
fmov s1, w8 // s1 = w8 = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
cmp w9, w8 // compare non-atomically loaded f0 with atomically
// loaded f0 and store result in N
b.ne .LBB0_1 // if N==0 { goto .LBB0_1 }
fmov w8, s2 // w8 = s2 = (f0 + f1)
stlxr w9, w8, [x0] // if this thread has the exclusive lock,
// { *arg0 = w8 = (f0 + f1), release lock },
// store whether or not succeeded in w9
cbnz w9, .LBB0_2 // if w9 says exclusive lock failed { goto .LBB0_2}
mov v0.16b, v1.16b // return f0 value from ldaxr
ret
</code></pre></div></div>
<div class="codecaption">Listing 7: arm64 assembly corresponding to Listing 2, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:9,endLineNumber:5,positionColumn:9,positionLineNumber:5,selectionStartColumn:9,selectionStartLineNumber:5,startColumn:9,startLineNumber:5),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++do+%7B%0A++++++++float+oldval+%3D+f0.load()%3B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:armv8-clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'armv8-a+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>I should note here that the specific version of arm64 that Listing 7 was compiled for is <a href="https://developer.arm.com/documentation/ddi0487/ga">ARMv8.0-A</a>, which is what Clang and GCC both default to when compiling for arm64; this detail will become important a little bit later in this post.
When we compare Listing 7 with Listing 5, we can immediately see some major differences between the arm64 and x86-64 instruction sets, aside from superficial stuff like how registers are named.
The arm64 version is just under twice as long as the x86-64 version, and examining the code, we can see that most of the additional length comes from how the atomic compare-and-exchange is implemented.
Actually, the rest of the code is very similar; it is just moving stuff around to support the addition operation and to deal with setting up and jumping to the top of the loop.
In the compare and exchange code, we can see that the arm64 version does not have a single instruction to implement the atomic compare-and-exchange!
While the x86-64 version can compile <code class="language-plaintext highlighter-rouge">std::atomic<float>::compare_exchange_weak()</code> down into a single <code class="language-plaintext highlighter-rouge">cmpxchg</code> instruction, ARMv8.0-A has no equivalent instruction, so the arm64 version instead must use three separate instructions to implement the complete functionality: <code class="language-plaintext highlighter-rouge">ldaxr</code> to do an exclusive load, <code class="language-plaintext highlighter-rouge">stlxr</code> to do an exclusive store, and <code class="language-plaintext highlighter-rouge">clrex</code> to reset the current thread’s record of exclusive access requests.</p>
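<p>To make the three-instruction scheme a little more concrete, here is a rough C++-flavored sketch of how a compare-and-exchange can be assembled from an exclusive load/store pair; the <code class="language-plaintext highlighter-rouge">load_exclusive()</code>, <code class="language-plaintext highlighter-rouge">store_exclusive()</code>, and <code class="language-plaintext highlighter-rouge">clear_exclusive()</code> helpers below are hypothetical stand-ins naming the hardware behavior of <code class="language-plaintext highlighter-rouge">ldaxr</code>, <code class="language-plaintext highlighter-rouge">stlxr</code>, and <code class="language-plaintext highlighter-rouge">clrex</code>, not real intrinsics:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>

// Hypothetical stand-ins for the exclusive-access instructions:
uint32_t load_exclusive(uint32_t* addr);           // ldaxr: load and acquire an
                                                   // exclusive monitor on addr
bool store_exclusive(uint32_t* addr, uint32_t v);  // stlxr: store succeeds only
                                                   // if the monitor is still held
void clear_exclusive();                            // clrex: give up the monitor

// Sketch of a compare-and-exchange built from the pair above. The store can
// fail even when the comparison succeeded (another core touched addr between
// the exclusive load and the exclusive store), which is why Listing 7 loops
// back on the result of stlxr via cbnz.
bool compareAndExchange(uint32_t* addr, uint32_t expected, uint32_t desired) {
    uint32_t observed = load_exclusive(addr);
    if (observed != expected) {
        clear_exclusive();   // comparison failed: release the exclusive monitor
        return false;
    }
    return store_exclusive(addr, desired);
}
</code></pre></div></div>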
<p>This difference speaks directly to x86-64 being a <a href="https://en.wikipedia.org/wiki/Complex_instruction_set_computer">CISC architecture</a> and arm64 being a <a href="https://en.wikipedia.org/wiki/Reduced_instruction_set_computer">RISC architecture</a>.
x86-64’s CISC nature calls for the ISA to have a large number of instructions carrying out complex, often multi-step operations, and this design philosophy is what allows x86-64 to encode complex multi-step operations like a compare-and-exchange as a single instruction.
Conversely, arm64’s RISC nature means a design consisting of fewer, simpler operations <a href="https://doi.org/10.1145/641914.641917">[Patterson and Ditzel 1980]</a>; for example, the RISC design philosophy mandates that memory access be done through specific single-cycle instructions instead of as part of a more complex instruction such as compare-and-exchange.
These differing design philosophies mean that in arm64 assembly, we will often see many instructions used to implement what would be a single instruction in x86-64; given this difference, compiling Listing 2 produces surprisingly similar structure in the output x86-64 (Listing 5) and arm64 (Listing 7) assembly.
However, if we take the implementation of <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> in Listing 3 and compile it for arm64’s ARMv8.0-A revision, structural differences between the x86-64 and arm64 output become far more apparent:</p>
<div id="listing8"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float):
ldar w9, [x0] // w9 = *arg0 = f0, non-atomically loaded
ldaxr w8, [x0] // w8 = *arg0 = f0.load(), atomically
// loaded (get exclusive lock on x0), with
// implicit synchronization
fmov s1, w9 // s1 = w9 = f0
cmp w8, w9 // compare non-atomically loaded f0 with atomically
// loaded f0 and store result in N
b.ne .LBB0_3 // if N==0 { goto .LBB0_3 }
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
fmov w9, s2 // w9 = s2 = (f0 + f1)
stlxr w10, w9, [x0] // if this thread has the exclusive lock,
// { *arg0 = w9 = (f0 + f1), release lock },
// store whether or not succeeded in w10
cbnz w10, .LBB0_4 // if w10 says exclusive lock failed { goto .LBB0_4 }
mov w9, #1 // w9 = 1 (???)
tbz w9, #0, .LBB0_8 // if bit 0 of w9 == 0 { goto .LBB0_8 }
b .LBB0_5 // goto .LBB0_5
.LBB0_3:
clrex // clear this thread's record of exclusive lock
.LBB0_4:
mov w9, wzr // w9 = 0
tbz w9, #0, .LBB0_8 // if bit 0 of w9 == 0 { goto .LBB0_8 }
.LBB0_5:
mov v0.16b, v1.16b // return f0 value from ldaxr
ret
.LBB0_6:
clrex // clear this thread's record of exclusive lock
.LBB0_7:
mov w10, wzr // w10 = 0
mov w8, w9 // w8 = w9
cbnz w10, .LBB0_5 // if w10 is not zero { goto .LBB0_5 }
.LBB0_8:
ldaxr w9, [x0] // w9 = *arg0 = f0.load(), atomically
// loaded (get exclusive lock on x0), with
// implicit synchronization
fmov s1, w8 // s1 = w8 = f0
cmp w9, w8 // compare non-atomically loaded f0 with atomically
// loaded f0 and store result in N
b.ne .LBB0_6 // if N==0 { goto .LBB0_6 }
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
fmov w8, s2 // w8 = s2 = (f0 + f1)
stlxr w10, w8, [x0] // if this thread has the exclusive lock,
// { *arg0 = w8 = (f0 + f1), release lock },
// store whether or not succeeded in w10
cbnz w10, .LBB0_7 // if w10 says exclusive lock failed { goto .LBB0_7 }
mov w10, #1 // w10 = 1
mov w8, w9 // w8 = w9 = f0.load()
cbz w10, .LBB0_8 // if w10==0 { goto .LBB0_8 }
b .LBB0_5 // goto .LBB0_5
</code></pre></div></div>
<div class="codecaption">Listing 8: arm64 assembly corresponding to Listing 3, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:1,endLineNumber:12,positionColumn:1,positionLineNumber:12,selectionStartColumn:1,selectionStartLineNumber:12,startColumn:1,startLineNumber:12),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++float+oldval+%3D+f0.load()%3B%0A++++do+%7B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:armv8-clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'armv8-a+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>Moving the atomic load out of the loop in Listing 3 resulted in a single line change between Listing 5 and Listing 6’s x86-64 assembly, but causes the arm64 version to explode in size and radically change in structure between Listing 7 and Listing 8!
The key difference between Listing 7 and Listing 8 is that in Listing 8, the entire first iteration of the while loop is lifted out into its own code segment, which can then either directly return out of the function or go into the main body of the loop afterwards.
I initially thought that Clang’s decision to lift out the first iteration of the loop was surprising, but it turns out that GCC 10.3 and MSVC v19.28’s respective arm64 backends also similarly decide to lift the first iteration of the loop out as well.
The need to lift the entire first iteration out of the loop likely comes from the need to use an <code class="language-plaintext highlighter-rouge">ldaxr</code> instruction to carry out the initial atomic load of <code class="language-plaintext highlighter-rouge">f0</code>.
Compared with GCC 10.3 and MSVC v19.28, though, Clang 10.0.0’s arm64 output does seem to do a bit more jumping around (see <code class="language-plaintext highlighter-rouge">.LBB0_4</code> through <code class="language-plaintext highlighter-rouge">.LBB0_7</code>).
Also, admittedly I’m not entirely sure why register <code class="language-plaintext highlighter-rouge">w9</code> gets set to 1 and then immediately compared with 0 in lines 16/17 and lines 47/49; maybe that’s just a convenient way to clear the <code class="language-plaintext highlighter-rouge">z</code> bit of the <code class="language-plaintext highlighter-rouge">CPSR</code> (Current Program Status Register; this is analogous to <code class="language-plaintext highlighter-rouge">EFLAGS</code> on x86-64)?
But anyhow, compared with Listing 7, the arm64 assembly in Listing 8 is much longer in terms of code length, but is actually only slightly less efficient in terms of total instructions executed.
The slight additional inefficiency comes from some of the additional setup work needed to manage all of the jumping and the split loop.
However, the fact that Listing 8 is less efficient compared with Listing 7 is interesting when we compare with what Listing 3 does to the x86-64 assembly; in the case of x86-64, pulling the initial atomic load out of the loop makes the output x86-64 assembly slightly <em>more</em> efficient, as opposed to slightly <em>less</em> efficient as we have here with arm64.</p>
<p>As a very loose general rule of thumb, arm64 assembly tends to be longer than the equivalent x86-64 assembly for the same high-level code because CISC architectures simply tend to encode a lot more <em>stuff</em> per instruction compared with RISC architectures <a href="https://doi.org/10.1109/ICCD.2009.5413117">[Weaver and McKee 2009]</a>.
However, compiled x86-64 binaries having fewer instructions doesn’t actually mean that x86-64 binaries necessarily run faster than equivalent, less “instruction-dense” compiled arm64 binaries.
x86-64 instructions are variable length, requiring more complex logic in the processor’s <a href="https://en.wikibooks.org/wiki/Microprocessor_Design/Instruction_Decoder">instruction decoder</a>, and also since x86-64 instructions are more complex, they can take many more cycles per instruction to execute.
Contrast with arm64, in which instructions are fixed length.
RISC architectures generally feature fixed-length instructions, although this generalization isn’t a hard rule; the <a href="https://en.wikipedia.org/wiki/SuperH">SuperH</a> architecture (famously used in the Sega Saturn and Sega Dreamcast) is notably a RISC architecture with variable-length instructions.
Fixed-length instructions allow arm64 chips to have simpler decoding logic, and arm64 instructions also tend to take many fewer cycles each to execute (often, but not always, as low as one or two cycles per instruction).
The end result is that even though compiled arm64 binaries have lower instruction-density than compiled x86-64 binaries, arm64 processors tend to be able to retire more instructions per cycle than comparable x86-64 processors, allowing arm64 as an architecture to make up for the difference in code density.</p>
<p>…except, of course, all of the above is only loosely true today!
While the x86-64 instruction set is still definitively a CISC instruction set today and the arm64 instruction set is still clearly a RISC instruction set today, a lot of the details have gotten fuzzier over time.
Processors today rarely directly implement the instruction set that they run; basically all modern x86-64 processors today feed x86-64 instructions into a huge hardware decoder block that breaks down individual x86-64 instructions into lower-level <a href="https://en.m.wikipedia.org/wiki/Micro-operation">micro-operations, or μops</a>.
Compared with older x86 processors from decades ago that directly implemented x86, these modern micro-operation-based x86-64 implementations are often much more RISC-like internally.
In fact, if you were to examine all of the parts of a modern Intel and AMD x86-64 processor that take place after the instruction decoding phase, without knowing what processor you were looking at beforehand, you likely would not be able to determine if the processor implemented a CISC or a RISC ISA <a href="https://www.researchgate.net/publication/235960679_The_Architecture_of_the_Nehalem_Processor_and_Nehalem-EP_SMP_Platforms">[Thomadakis 2011]</a>.</p>
<p>The same is true going the other way; while modern x86-64 is a CISC architecture that in practical implementation is often more RISC-like, modern arm64 is a RISC architecture that sometimes has surprisingly CISC-like elements if you look closely.
Modern arm64 processors often <em>also</em> decode individual instructions into smaller micro-operations <a href="https://developer.arm.com/documentation/uan0015/b/">[ARM 2016]</a>, although the extent to which modern arm64 processors do this is a lot less intensive than what modern x86-64 does <a href="https://superuser.com/a/934755">[Castellano 2015]</a>.
Modern arm64 instruction decoders usually rely on simple <a href="https://en.wikipedia.org/wiki/Control_unit#Hardwired_control_unit">hardwired control</a> to break instructions down into micro-operations, whereas modern x86-64 must use a <a href="https://en.wikipedia.org/wiki/Microcode">programmable ROM containing advanced microcode</a> to store mappings from x86-64 instructions to micro-instructions.</p>
<p>Another way that arm64 has slowly gained some CISC-like characteristics is that arm64 over time has gained some surprisingly specialized complex instructions!
Remember the important note I made earlier about Listing 7 and Listing 8 being generated specifically for the ARMv8.0-A revision of arm64?
Well, the specific <code class="language-plaintext highlighter-rouge">ldaxr</code>/<code class="language-plaintext highlighter-rouge">stlxr</code> combination in Listings 7 and 8 that is needed to implement an atomic compare-and-exchange (and generally any kind of atomic load-and-conditional-store operation) is a specific area where a more complex single-instruction implementation generally can perform better than an implementation using several instructions.
As discussed earlier, one complex instruction is not necessarily always faster than several simpler instructions due to how the instructions actually have to be decoded and executed, but in this case, one atomic instruction allows for a faster implementation than several instructions combined since a single atomic instruction can take advantage of more available information at once <a href="https://cpufun.substack.com/p/atomics-in-aarch64">[Cownie 2021]</a>.
Accordingly, the <a href="https://developer.arm.com/documentation/ddi0557/">ARMv8.1-A revision</a> of arm64 introduces a collection of new single-instruction atomic operations.
Of interest to our particular example here is the new <code class="language-plaintext highlighter-rouge">casal</code> instruction, which performs a compare-and-exchange to memory with acquire and release semantics; this new instruction is a direct analog to the x86-64 <code class="language-plaintext highlighter-rouge">cmpxchg</code> instruction with the <code class="language-plaintext highlighter-rouge">lock</code> prefix.
<p>We can actually use these new ARMv8.1-A single-instruction atomic operations today; while GCC and Clang both target ARMv8.0-A by default today, ARMv8.1-A support can be enabled using the <code class="language-plaintext highlighter-rouge">-march=armv8.1-a</code> flag starting in GCC 10.1 and starting in Clang 9.0.0.
Actually, Clang’s support might go back even earlier; Clang 9.0.0 was the furthest back I was able to test.
Here’s what Listing 2 compiles to using the <code class="language-plaintext highlighter-rouge">-march=armv8.1-a</code> flag to enable the <code class="language-plaintext highlighter-rouge">casal</code> instruction:</p>
<div id="listing9"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float):
.LBB0_1:
ldar w8, [x0] // w8 = *arg0 = f0, non-atomically loaded
fmov s1, w8 // s1 = w8 = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
mov w9, w8 // w9 = w8 = f0
fmov w10, s2 // w10 = s2 = (f0 + f1)
casal w9, w10, [x0] // atomically read the contents of the address stored
// in x0 (*arg0 = f0) and compare with w9;
// if [x0] == w9:
// atomically set the contents of the
// [x0] to the value in w10
// else:
// w9 = value loaded from [x0]
cmp w9, w8 // compare w9 and w8 and store result in N
cset w8, eq // if previous instruction's compare was true,
// set w8 = 1
cmp w8, #1 // compare if w8 == 1 and store result in N
b.ne .LBB0_1 // if N==0 { goto .LBB0_1 }
mov v0.16b, v1.16b // return f0 value from ldar
ret
</code></pre></div></div>
<div class="codecaption">Listing 9: arm64 revision ARMv8.1-A assembly corresponding to Listing 2, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3 and also -march=armv8.1-a. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:34,endLineNumber:5,positionColumn:34,positionLineNumber:5,selectionStartColumn:34,selectionStartLineNumber:5,startColumn:34,startLineNumber:5),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++do+%7B%0A++++++++float+oldval+%3D+f0.load()%3B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:armv8-clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3+-march%3Darmv8.1-a',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'armv8-a+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>If we compare Listing 9 with the ARMv8.0-A version in Listing 7, we can see that Listing 9 is only slightly shorter in terms of total instructions used, but the need for separate <code class="language-plaintext highlighter-rouge">ldaxr</code>, <code class="language-plaintext highlighter-rouge">stlxr</code>, and <code class="language-plaintext highlighter-rouge">clrex</code> instructions has been completely replaced with a single <code class="language-plaintext highlighter-rouge">casal</code> instruction.
Interestingly, Listing 9 is now structurally very similar to its x86-64 counterpart in Listing 5.
My guess is that if someone who was familiar with x86-64 assembly but had never seen arm64 assembly before was given Listing 5 and Listing 9 to compare side-by-side, they’d be able to figure out almost immediately what each line in Listing 9 does.</p>
<p>Now let’s see what Listing 3 compiles to using the <code class="language-plaintext highlighter-rouge">-march=armv8.1-a</code> flag:</p>
<div id="listing10"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float):
ldar w9, [x0] // w9 = *arg0 = f0, non-atomically loaded
fmov s1, w9 // s1 = w9 = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
mov w8, w9 // w8 = w9 = f0
fmov w10, s2 // w10 = s2 = (f0 + f1)
casal w8, w10, [x0] // atomically read the contents of the address stored
// in x0 (*arg0 = f0) and compare with w8;
// if [x0] == w8:
// atomically set the contents of the
// [x0] to the value in w10
// else:
// w8 = value loaded from [x0]
cmp w8, w9 // compare w8 and w9 and store result in N
b.eq .LBB0_3 // if N==1 { goto .LBB0_3 }
mov w9, w8 // w9 = w8 = value previously loaded from [x0]
.LBB0_2:
fmov s1, w8 // s1 = w8 = value previously loaded from [x0] = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
fmov w10, s2 // w10 = s2 = (f0 + f1)
casal w9, w10, [x0] // atomically read the contents of the address stored
// in x0 (*arg0 = f0) and compare with w9;
// if [x0] == w9:
// atomically set the contents of the
// [x0] to the value in w10
// else:
// w9 = value loaded from [x0]
cmp w9, w8 // compare w9 and w8 and store result in N
cset w8, eq // if previous instruction's compare was true,
// set w8 = 1
cmp w8, #1 // compare if w8 == 1 and store result in N
mov w8, w9 // w8 = w9 = value previously loaded from [x0] = f0
b.ne .LBB0_2 // if N==0 { goto .LBB0_2 }
.LBB0_3:
mov v0.16b, v1.16b // return f0 value from ldar
ret
</code></pre></div></div>
<div class="codecaption">Listing 10: arm64 revision ARMv8.1-A assembly corresponding to Listing 3, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3 and also -march=armv8.1-a. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:30,endLineNumber:4,positionColumn:30,positionLineNumber:4,selectionStartColumn:30,selectionStartLineNumber:4,startColumn:30,startLineNumber:4),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++float+oldval+%3D+f0.load()%3B%0A++++do+%7B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:armv8-clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3+-march%3Darmv8.1-a',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'armv8-a+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>Here, the availability of the <code class="language-plaintext highlighter-rouge">casal</code> instruction makes a huge difference in the compactness of the output assembly!
Listing 10 is nearly half the length of Listing 8, and more importantly, Listing 10 is also structurally much simpler than Listing 8.
In Listing 10, the compiler still decided to unroll the first iteration of the loop, but the amount of setup and jumping around in between iterations of the loop is significantly reduced, which should make Listing 10 a bit more performant than Listing 8 even before we take into account the performance improvements from using <code class="language-plaintext highlighter-rouge">casal</code>.</p>
<p>By the way, remember our discussion of weak versus strong memory models in the previous section?
As you may have noticed, Takua’s implementation of <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> uses <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak()</code> instead of <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_strong()</code>.
The difference between the weak and strong versions of <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_*()</code> is that the weak version is allowed to sometimes report a failed comparison even if the values are actually equal (that is, the weak version is allowed to spuriously report a false negative), while the strong version guarantees always accurately reporting the outcome of the comparison.
On x86-64, there is no difference between using the weak and strong versions of <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_*()</code> because x86-64 always provides strong memory ordering (in other words, on x86-64 the weak version is allowed to report a false negative by the spec but never actually does).
However, on arm64, the weak version actually does report false negatives in practice.
The reason I chose to use the weak version is because when the compare-and-exchange is attempted repeatedly in a loop, if the underlying processor actually has weak memory ordering, using the weak version is usually faster than the strong version.
To see why, let’s take a look at the arm64 ARMv8.0-A assembly corresponding to Listing 2, but with <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_strong()</code> swapped in instead of <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak()</code>:</p>
<div id="listing11"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float):
.LBB0_1:
ldar w8, [x0] // w8 = *arg0 = f0, non-atomically loaded
fmov s1, w8 // s1 = w8 = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
fmov w9, s2 // w9 = s2 = (f0 + f1)
.LBB0_2:
ldaxr w10, [x0] // w10 = *arg0 = f0.load(), atomically
// loaded (get exclusive lock on x0), with
// implicit synchronization
cmp w10, w8 // compare non-atomically loaded f0 with atomically
// loaded f0 and store result in N
b.ne .LBB0_4 // if N==0 { goto .LBB0_4 }
stlxr w10, w9, [x0] // if this thread has the exclusive lock,
// { *arg0 = w9 = (f0 + f1), release lock },
// store whether or not succeeded in w10
cbnz w10, .LBB0_2 // if w10 says exclusive lock failed { goto .LBB0_2}
b .LBB0_5 // goto .LBB0_5
.LBB0_4:
clrex // clear this thread's record of exclusive lock
b .LBB0_1 // goto .LBB0_1
.LBB0_5:
mov v0.16b, v1.16b // return f0 value from ldaxr
ret
</code></pre></div></div>
<div class="codecaption">Listing 11: arm64 revision ARMv8.0-A assembly corresponding to Listing 2 but using <br /><code class="language-plaintext highlighter-rouge">std::atomic::compare_exchange_strong()</code> instead of <code class="language-plaintext highlighter-rouge">std::atomic::compare_exchange_weak()</code>, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3 and also -march=armv8.1-a. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:39,endLineNumber:7,positionColumn:39,positionLineNumber:7,selectionStartColumn:39,selectionStartLineNumber:7,startColumn:39,startLineNumber:7),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++do+%7B%0A++++++++float+oldval+%3D+f0.load()%3B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_strong(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:armv8-clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3+',selection:(endColumn:12,endLineNumber:19,positionColumn:12,positionLineNumber:19,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'armv8-a+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>If we compare Listing 11 with Listing 7, we can see that just changing the compare and exchange to a strong version instead of a weak version causes a major restructuring of the arm64 assembly and the addition of a bunch more jumps.
In Listing 7, loads from <code class="language-plaintext highlighter-rouge">[x0]</code> (corresponding to reads of <code class="language-plaintext highlighter-rouge">f0</code> in the C++ code) happen together at the top of the loop and the loaded values are reused through the rest of the loop.
However, Listing 11 is restructured such that loads from <code class="language-plaintext highlighter-rouge">[x0]</code> happen immediately before the instruction that uses the loaded value from <code class="language-plaintext highlighter-rouge">[x0]</code> to do a comparison or other operation.
This change means that there is less time for another thread to change the value at <code class="language-plaintext highlighter-rouge">[x0]</code> while this thread is still doing stuff.
Interestingly, if we compile using ARMv8.1-A, the availability of single-instruction atomic operations means that just like on x86-64, the difference between the strong and weak versions of the compare and exchange goes away, and both end up compiling to the same arm64 assembly.</p>
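<p>The practical rule of thumb that falls out of all of this is illustrated by the following small sketch (again, standalone example code rather than Takua’s actual implementation): inside a retry loop a spurious failure only costs one extra iteration, so the weak version is the natural fit, whereas a one-shot attempt where a false negative would change program behavior is where the strong version earns its keep:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>

// Retry loop: a spurious failure just means one more trip around the loop,
// so compare_exchange_weak() is the usual choice (this mirrors the structure
// of addAtomicFloat()).
void fetchMultiply(std::atomic<int>& a, int factor) {
    int oldval = a.load();
    while (!a.compare_exchange_weak(oldval, oldval * factor)) {
        // on failure, oldval has been refreshed with the current value of a,
        // so the loop just recomputes the product and tries again
    }
}

// One-shot attempt: a false negative here would incorrectly report the slot
// as already claimed, so compare_exchange_strong() is the appropriate choice.
bool claimSlot(std::atomic<int>& slot, int ownerId) {
    int expected = 0;  // 0 means unclaimed
    return slot.compare_exchange_strong(expected, ownerId);
}
</code></pre></div></div>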
<p>At this point in the process of porting Takua to arm64, I only had a couple of Raspberry Pis, as Apple Silicon Macs hadn’t even been announced yet.
Unfortunately, the Raspberry Pi 3B’s Cortex-A53-based CPU and the Raspberry Pi 4B’s Cortex-A72-based CPU only implement ARMv8.0-A, which means I couldn’t actually test and compare the versions of the compiled assembly with and without <code class="language-plaintext highlighter-rouge">casal</code>.
Fortunately though, we can still compile the code such that if the processor the code is running on implements ARMv8.1-A, the code will use <code class="language-plaintext highlighter-rouge">casal</code> and other ARMv8.1-A single-instruction atomic operations, and otherwise if only ARMv8.0-A is implemented, then the code will fall back to using <code class="language-plaintext highlighter-rouge">ldaxr</code>, <code class="language-plaintext highlighter-rouge">stlxr</code>, and <code class="language-plaintext highlighter-rouge">clrex</code>.
We can get the compiler to automatically do the above by using the <code class="language-plaintext highlighter-rouge">-moutline-atomics</code> compiler flag, which Richard Henderson of Linaro contributed into GCC 10.1 <a href="https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/making-the-most-of-the-arm-architecture-in-gcc-10">[Tkachov 2020]</a> and which also recently was added to Clang 12.0.0 in April 2021.
The <code class="language-plaintext highlighter-rouge">-moutline-atomics</code> flag tells the compiler to generate a runtime helper function and stub the runtime helper function into the atomic operation call-site instead of directly generating atomic instructions; this helper function then does a runtime check for what atomic instructions are available on the current processor and dispatches to the best possible implementation given the available instructions.
This runtime check is cached to make subsequent calls to the helper function faster.
Using this flag means that if a future Raspberry Pi 5 or something comes out hopefully with support for something newer than ARMv8.0-A, Takua should be able to automatically take advantage of faster single-instruction atomics without me having to reconfigure Takua’s builds per processor.</p>
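<p>Conceptually, the outlined helper boils down to a cached runtime feature check followed by a dispatch; the snippet below is only a rough illustration of that idea (it is not the helper the compilers actually generate), and it assumes Linux’s <code class="language-plaintext highlighter-rouge">getauxval()</code> interface and the aarch64 <code class="language-plaintext highlighter-rouge">HWCAP_ATOMICS</code> capability bit:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <sys/auxv.h>   // getauxval(), AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ATOMICS on aarch64 Linux

// Ask the kernel once whether the CPU implements the ARMv8.1-A LSE atomic
// instructions (casal and friends), cache the answer, and let callers branch
// to either the single-instruction path or the ldaxr/stlxr/clrex fallback.
static bool cpuHasLSEAtomics() {
    static const bool hasLSE = (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
    return hasLSE;
}
</code></pre></div></div>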
<p><strong>Performance Testing</strong></p>
<p>So, now that I have Takua up and running on arm64 on Linux, how does it actually perform?
Here are some comparisons, although there are some important caveats.
First, at this stage in the porting process, the only arm64 hardware I had that could actually run reasonably sized scenes was a Raspberry Pi 4B with 4 GB of memory.
The Raspberry Pi 4B’s CPU is a Broadcom BCM2711, which has 4 Cortex-A72 cores; these cores aren’t exactly fast, and even though the Raspberry Pi 4B came out in 2019, the Cortex-A72 core actually dates back to 2015.
So, for the x86-64 comparison point, I’m using my early 2015 MacBook Air, which also has only 4 GB of memory and has an Intel Core i5-5250U CPU with 2 cores / 4 threads.
Also, as an extremely unfair comparison point, I also ran the comparisons on my workstation, which has 128 GB of memory and dual Intel Xeon E5-2680 CPUs with 8 cores / 16 threads each, for 16 cores / 32 threads in total.
The three scenes I used were the Cornell Box seen in Figure 1, the glass teacup seen in Figure 2, and the bedroom scene from my <a href="http://blog.yiningkarlli.com/2020/02/shadow-terminator-in-takua.html">shadow terminator blog post</a>; these scenes were chosen because they fit in under 4 GB of memory.
All scenes were rendered to 16 samples-per-pixel, because I didn’t want to wait forever.
The Cornell Box and Bedroom scenes are rendered using unidirectional path tracing, while the tea cup scene is rendered using VCM.
The Cornell Box scene is rendered at 1024x1024 resolution, while the Tea Cup and Bedroom scenes are rendered at 1920x1080 resolution.</p>
<p>Here are the results:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">CORNELL BOX</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1024x1024, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">440.627 s</td>
<td style="text-align: left">approx 1762.51 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">272.053 s</td>
<td style="text-align: left">approx 1088.21 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">36.6183 s</td>
<td style="text-align: left">approx 1139.79 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">TEA CUP</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, VCM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">2205.072 s</td>
<td style="text-align: left">approx 8820.32 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">2237.136 s</td>
<td style="text-align: left">approx 8948.56 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">174.872 s</td>
<td style="text-align: left">approx 5593.60 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">BEDROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">5653.66 s</td>
<td style="text-align: left">approx 22614.64 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">4900.54 s</td>
<td style="text-align: left">approx 19602.18 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">310.35 s</td>
<td style="text-align: left">approx 9931.52 s</td>
</tr>
</tbody>
</table>
<p>In the results above, “wall time” refers to how long the render took to complete in real-world time as if measured by a clock on the wall, while “core-seconds” is a measure of how long the render would have taken completely single-threaded.
Both values are separately tracked by the renderer; “wall time” is just a timer that starts when the renderer begins working on its first sample and stops when the very last sample is finished, while “core-seconds” is tracked by using a separate timer per thread and adding up how much time each thread has spent rendering.</p>
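<p>For a sense of how simple this bookkeeping can be, here is a minimal sketch (illustrative only, not Takua’s actual code) of tracking wall time with a single timer while accumulating core-seconds from a per-thread timer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::atomic<double> coreSeconds(0.0);
    auto wallStart = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; i++) {
        workers.emplace_back([&coreSeconds]() {
            auto threadStart = std::chrono::steady_clock::now();
            // ... this thread's share of the rendering work goes here ...
            std::chrono::duration<double> spent =
                std::chrono::steady_clock::now() - threadStart;
            // accumulate into the shared total using a compare-and-exchange
            // loop, since fetch_add on atomic<double> requires C++20
            double expected = coreSeconds.load();
            while (!coreSeconds.compare_exchange_weak(expected,
                                                      expected + spent.count())) {}
        });
    }
    for (auto& worker : workers) { worker.join(); }

    std::chrono::duration<double> wall =
        std::chrono::steady_clock::now() - wallStart;
    std::cout << "wall time: " << wall.count() << " s, core-seconds: "
              << coreSeconds.load() << " s" << std::endl;
    return 0;
}
</code></pre></div></div>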
<p>The results are interesting!
The Raspberry Pi 4B and 2015 MacBook Air are both just completely outclassed by the dual-Xeon workstation in absolute wall time, but that should come as a surprise to absolutely nobody.
What’s more surprising is that the multiplier by which the dual-Xeon workstation is faster than the Raspberry Pi 4B in wall time is much higher than the multiplier in core-seconds.
For the Cornell Box scene, the dual-Xeon is 12.033x faster than the Raspberry Pi 4B in wall time, but is only 1.546x faster in core-seconds.
For the Tea Cup scene, the dual-Xeon is 12.61x faster than the Raspberry Pi 4B in wall time, but is only 1.577x faster in core-seconds.
For the Bedroom scene, the dual-Xeon is 18.217x faster than the Raspberry Pi 4B in wall time, but is only 2.277x faster in core-seconds.
This difference in wall time multiplier versus core-seconds multiplier indicates that the Raspberry Pi 4B and dual-Xeon workstation are shockingly close in <em>single-threaded</em> performance; the dual-Xeon workstation only has such a crushing lead in wall clock time because it just has way more cores and threads available than the Raspberry Pi 4B.</p>
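<p>(These multipliers fall straight out of the tables above; for example, for the Cornell Box scene, 440.627 s / 36.6183 s ≈ 12.03x in wall time, while 1762.51 s / 1139.79 s ≈ 1.55x in core-seconds.)</p>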
<p>When we compare the Raspberry Pi 4B to the 2015 MacBook Air, the results are even more interesting.
Between these two machines, the times are actually relatively close; for the Cornell Box and Bedroom scenes, the Raspberry Pi 4B is within striking distance of the 2015 MacBook Air, and for the Tea Cup scene, the Raspberry Pi 4B is <em>actually faster</em> than the 2015 MacBook Air.
The reason the Raspberry Pi 4B is faster than the 2015 MacBook Air at the Tea Cup scene is likely because the Tea Cup scene was rendered using VCM; VCM requires the construction of a photon map, and from previous profiling I know that Takua’s photon map builder works better with more actual physical cores.
The Raspberry Pi 4B has four physical cores, whereas the 2015 MacBook Air only has two physical cores and gets to four threads using hyperthreading; my photon map builder doesn’t scale well with hyperthreading.</p>
<p>So, overall, the Raspberry Pi 4B’s arm64 processor intended for phones got handily beat by a dual-Xeon workstation but came very close to a 2015 MacBook Air.
The thing here to remember though, is that the Raspberry Pi 4B’s arm64-based processor has a TDP of just 4 watts!
Contrast with the MacBook Air’s Intel Core i5-5250U, which has a 15 watt TDP, and with the dual Xeon E5-2680 in my workstation, which have a 130 watt TDP each for a combined <em>260 watt TDP</em>.
For this comparison, I think using the max TDP of each processor is a relatively fair thing to do, since Takua Renderer pushes each CPU to 100% utilization for sustained periods of time.
So, the real story here from an energy perspective is that the Raspberry Pi 4B was between 12 and 18 times slower than the dual-Xeon workstation, but the Raspberry Pi 4B also has a TDP that is <em>65x lower</em> than the dual-Xeon workstation.
Similarly, the Raspberry Pi 4B nearly matches the 2015 MacBook Air, but with a TDP that is 3.75x lower!</p>
<p>When factoring in energy utilization, the numbers get even more interesting once we look at total energy used across the whole render.
We can get the total energy used for each render by multiplying the wall clock render time with the TDP of each processor (again, we’re assuming 100% processor utilization during each render); this gives us total energy used in watt-seconds, which we divide by 3600 seconds per hour to get watt-hours:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">CORNELL BOX</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1024x1024, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">0.4895 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">1.1336 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">2.6450 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">TEA CUP</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, VCM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">2.4500 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">9.3214 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">12.6297 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">BEDROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">6.2819 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">20.4189 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">22.4142 Wh</td>
</tr>
</tbody>
</table>
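<p>As a quick worked example of the calculation above: the Raspberry Pi 4B’s Cornell Box render took 440.627 seconds at a 4 watt TDP, which works out to (440.627 s × 4 W) / 3600 ≈ 0.4895 Wh, matching the first entry in the Cornell Box table.</p>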
<p>From the numbers above, we can see that even though the Raspberry Pi 4B is a lot slower than the dual-Xeon workstation in wall clock time, the Raspberry Pi 4B absolutely crushes both the 2015 MacBook Air and the dual-Xeon workstation in terms of energy efficiency.
To render the same image, the Raspberry Pi 4B used between approximately 3.5x to 5.5x <em>less</em> energy overall than the dual-Xeon workstation, and used between approximately 2.3x to 3.8x less energy than the 2015 MacBook Air.
It’s also worth noting that the 2015 MacBook Air cost $899 when it first launched (and the processor had a recommended price from Intel of $315), and the dual-Xeon workstation cost… I don’t actually know.
I bought the dual-Xeon workstation used for a pittance when my employer retired it, so I don’t know how much it actually cost new.
But, I do know that the processors in the dual-Xeon had a recommended price from Intel of $1723… <em>each</em>, for a total of $3446 when they were new.
In comparison, the Raspberry Pi 4B with 4 GB of RAM costs about $55 for the entire computer, and the processor cost… well, the actual price for most ARM processors is not ever publicly disclosed, but since a baseline Raspberry Pi 4B costs only $35, the processor can’t have cost more than a few dollars at most, possibly even under a dollar.</p>
<p>I think the main takeaway from these performance comparisons is that even back with 2015 technology, even though most arm64 processors were slower in absolute terms compared to their x86-64 counterparts, the single-threaded performance was already shockingly close, and arm64 energy usage per compute unit and price already were leaving x86-64 in the dust.
Fast forward to the present day in 2021, where we have seen Apple’s arm64-based M1 chip take the absolute performance crown in its category from all x86-64 competitors, at both a lower energy utilization level and a lower price.
The even wilder thing is: the M1 is likely the slowest desktop arm64 chip that Apple will ever ship, and arm64 processors from NVIDIA and Samsung and Qualcomm and Broadcom won’t be far behind in the consumer space while Amazon and Ampere and other companies are also introducing enormous, extremely powerful arm64 chips in the high end server space.
Intel and (especially) AMD aren’t sitting still in the x86-64 space either though.
The next few years are going to be very interesting; no matter what happens, on x86-64 or on arm64, Takua Renderer is now ready to be there!</p>
<p><strong>Conclusion to Part 1</strong></p>
<p>Through the process of porting to arm64 on Linux, I learned a lot about the arm64 architecture and how it differs from x86-64, and I also found a couple of good reminders about topics like memory ordering and how floating point works.
Originally I thought that my post on porting Takua to arm64 would be nice, short, and fast to write, but instead here we are some 17,000 words later and I have not even gotten to porting Takua to arm64 on macOS and Apple Silicon yet!
So, I think we will stop here for now and save the rest for an upcoming Part 2.
In Part 2, I’ll write about the process of porting to arm64 on macOS and Apple Silicon, about how to create Universal Binaries, and I’ll examine Apple’s Rosetta 2 system for running x86-64 binaries on arm64.
Also, in Part 2 we’ll examine how Embree works on arm64 and compare arm64’s NEON vector extensions with x86-64’s SSE vector extensions, and we’ll finish with some additional miscellaneous differences between x86-64 and arm64 that need to be considered when writing C++ code for both architectures.</p>
<p><strong>Acknowledgements</strong></p>
<p>Thanks so much to <a href="http://rgba32.blogspot.com">Mark Lee</a> and <a href="http://rendering-memo.blogspot.com">Wei-Feng Wayne Huang</a> for puzzling through some of the <code class="language-plaintext highlighter-rouge">std::compare_exchange_weak()</code> stuff with me.
Thanks a ton to <a href="https://twitter.com/superfunc">Josh Filstrup</a> for proofreading and giving feedback and suggestions on this post pre-release!
Josh was the one who told me about the <a href="https://herbie.uwplse.org">Herbie</a> tool mentioned in the floating point section, and he made an interesting suggestion about using <a href="https://egraphs-good.github.io">e-graph analysis</a> to better understand floating point behavior.
Also Josh pointed out SuperH as an example of a variable width RISC architecture, which of course he would because he knows all there is to know about the Sega Dreamcast.
Finally, thanks to my wife, <a href="http://harmonymli.com">Harmony Li</a>, for being patient with me while I wrote up this monster of a blog post and for also puzzling through some of the technical details with me.</p>
<p><strong>References</strong></p>
<p>Pontus Andersson, Jim Nilsson, Tomas Akenine-Möller, Magnus Oskarsson, Kalle Åström, and Mark D. Fairchild. 2020. <a href="https://doi.org/10.1145/3406183">FLIP: A Difference Evaluator for Alternating Images</a>. <em>ACM Transactions on Graphics</em>. 3, 2 (2020), 15:1-15:23.</p>
<p>ARM Holdings. 2016. <a href="https://developer.arm.com/documentation/uan0015/b/">Cortex-A57 Software Optimization Guide</a>. Retrieved May 12, 2021.</p>
<p>ARM Holdings. 2021. <a href="https://developer.arm.com/documentation/ddi0487/ga">Arm Architecture Reference Manual Armv8, for Armv8-A Architecture Profile, Version G.a</a>. Retrieved May 14, 2021.</p>
<p>ARM Holdings. 2021. <a href="https://developer.arm.com/documentation/ddi0557/latest/">Arm Architecture Reference Manual Supplement ARMv8.1, for ARMv8-A Architecture Profile, Version: A.b</a>. Retrieved May 14, 2021.</p>
<p>Brandon Castellano. 2015. <a href="https://superuser.com/a/934755">SuperUser Answer to “Do ARM Processors like Cortex-A9 Use Microcode?”</a>. Retrieved May 12, 2021.</p>
<p>Jim Cownie. 2021. <a href="https://cpufun.substack.com/p/atomics-in-aarch64">Atomics in AArch64</a>. In <em>CPU Fun</em>. Retrieved May 14, 2021.</p>
<p>CppReference. 2021. <a href="https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange"><code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak</code></a>. Retrieved April 02, 2021.</p>
<p>CppReference. 2021. <a href="https://en.cppreference.com/w/cpp/atomic/memory_order"><code class="language-plaintext highlighter-rouge">std::memory_order</code></a>. Retrieved March 20, 2021.</p>
<p>Intel Corporation. 2021. <a href="https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4.html">Intel 64 and IA-32 Architectures Software Developer’s Manual</a>. Retrieved April 02, 2021.</p>
<p>Bruce Dawson. 2020. <a href="https://randomascii.wordpress.com/2020/11/29/arm-and-lock-free-programming/">ARM and Lock-Free Programming</a>. In <em>Random ASCII</em>. Retrieved April 15, 2021.</p>
<p>Glenn Fiedler. 2008. <a href="https://gafferongames.com/post/floating_point_determinism/">Floating Point Determinism</a>. In <em>Gaffer on Games</em>. Retrieved April 20, 2021.</p>
<p>David Goldberg. 1991. <a href="https://doi.org/10.1145/103162.103163">What Every Computer Scientist Should Know About Floating-Point Arithmetic</a>. <em>ACM Computing Surveys</em>. 23, 1 (1991), 5-48.</p>
<p>Martin Geupel. 2018. <a href="https://www.racoon-artworks.de/cgbasics/bucket_progressive.php">Bucket and Progressive Rendering</a>. In <em>CG Basics</em>. Retrieved May 12, 2021.</p>
<p>Phillip Johnston. 2020. <a href="https://embeddedartistry.com/blog/2017/10/11/demystifying-arm-floating-point-compiler-options/">Demystifying ARM Floating Point Compiler Options</a>. In <em>Embedded Artistry</em>. Retrieved April 20, 2021.</p>
<p>Yossi Kreinin. 2008. <a href="http://yosefk.com/blog/consistency-how-to-defeat-the-purpose-of-ieee-floating-point.html">Consistency: How to Defeat the Purpose of IEEE Floating Point</a>. In <em>Proper Fixation</em>. Retrieved April 20, 2021.</p>
<p>Günter Obiltschnig. 2006. <a href="https://www.appinf.com/download/FPIssues.pdf">Cross-Platform Issues with Floating-Point Arithmetics in C++</a>. In <em>ACCU Conference 2006</em>.</p>
<p>David A. Patterson and David R. Ditzel. 1980. <a href="https://doi.org/10.1145/641914.641917">The Case for the Reduced Instruction Set Computer</a>. <em>ACM SIGARCH Computer Architecture News</em>. 8, 6 (1980), 25-33.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120515/memory-reordering-caught-in-the-act/">Memory Reordering Caught in the Act</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120612/an-introduction-to-lock-free-programming/">An Introduction to Lock-Free Programming</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120625/memory-ordering-at-compile-time/">Memory Ordering at Compile Time</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/">Memory Barriers Are Like Source Control Operations</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120913/acquire-and-release-semantics/">Acquire and Release Semantics</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120930/weak-vs-strong-memory-models/">Weak vs. Strong Memory Models</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">This Is Why They Call It a Weakly-Ordered CPU</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>The Rust Team. 2021. <a href="https://doc.rust-lang.org/nomicon/atomics.html">Atomics</a>. In <em>The Rustonomicon</em>. Retrieved March 20, 2021.</p>
<p>Michael E. Thomadakis. 2011. <a href="https://www.researchgate.net/publication/235960679_The_Architecture_of_the_Nehalem_Processor_and_Nehalem-EP_SMP_Platforms">The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms</a>. JFE Technical Report. Texas A&M University.</p>
<p>Kyrylo Tkachov. 2020. <a href="https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/making-the-most-of-the-arm-architecture-in-GCC-10">Making the Most of the Arm Architecture with GCC 10</a>. In <em>ARM Tools, Software, and IDEs Blog</em>. Retrieved May 14, 2021.</p>
<p>Vincent M. Weaver and Sally A. McKee. 2009. <a href="https://doi.org/10.1109/ICCD.2009.5413117">Code Density Concerns for New Architectures</a>. In <em>2009 IEEE International Conference on Computer Design</em>. 459-464.</p>
<p>WikiBooks. 2021. <a href="https://en.wikibooks.org/wiki/Microprocessor_Design/Instruction_Decoder">Microprocessor Design: Instruction Decoder</a>. Retrieved May 12, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Complex_instruction_set_computer">Complex Instruction Set Computer</a>. Retrieved April 05, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/CPU_cache#Policies">CPU Cache</a>. Retrieved March 20, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Extended_precision#x86_extended_precision_format">Extended Precision</a>. Retrieved April 20, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Control_unit#Hardwired_control_unit">Hardwired Control Unit</a>. Retrieved May 12, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE 754</a>. Retrieved April 20, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Intel_8087">Intel 8087</a>. Retrieved April 20, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Microcode">Micro-Code</a>. Retrieved May 12, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.m.wikipedia.org/wiki/Micro-operation">Micro-Operation</a>. Retrieved May 10, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Reduced_instruction_set_computer">Reduced Instruction Set Computer</a>. Retrieved April 05, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/SuperH">SuperH</a>. Retrieved June 02, 2021.</p>
https://blog.yiningkarlli.com/2021/05/responsive-layout.html
New Responsive Layout and Blog Plans
2021-05-18T00:00:00+00:00
2021-05-18T00:00:00+00:00
Yining Karl Li
<p>I recently noticed that my blog and personal website’s layout looked really bad on mobile devices and in smaller browser windows.
When I originally created the current layout for this blog and for my personal website back in 2013, I didn’t really design the layout with mobile in mind whatsoever.
Back in 2013, responsive web design had only just started to take off, and being focused entirely on renderer development and computer graphics, I wasn’t paying much attention to the web design world!
I then proceeded to not notice at all how bad the layout on mobile and in small windows was because… well, I don’t really visit my own website and blog very much, because why would I?
I know everything that’s on them already!</p>
<p>Well, I finally visited my site on my iPhone, and immediately noticed how terrible the layout looked.
On an iPhone, the layout was just the full desktop browser layout shrunk down to an unreadable size!
So, last week, I spent two evenings extending the current layout to incorporate responsive web design principles.
Responsive web design principles call for a site’s layout to adjust itself according to the device and window size such that the site renders in a way that is maximally readable in a variety of different viewing contexts.
Generally this means that content and images and stuff should resize so that it’s always at a readable size, and elements on the page should be laid out on a fluid grid that can reflow instead of sitting at fixed locations.</p>
<p>Here is how the layout used by my blog and personal site used to look on an iPhone 11 display, compared with how the layout looks now with modern responsive web design principles implemented:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/mobile_before_after.png"><img src="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/preview/mobile_before_after.png" alt="Figure 1: Old layout (left) vs. new responsive layout (right) in Safari on an iPhone 11." /></a></p>
<p>So why did I bother with implementing these improvements to my blog and personal site now, some eight years after I first deployed the current layout and current version of the blog?
To answer this (self-asked) question, I want to first write a bit about how the purpose of this blog has evolved over the years.
I originally started this blog back when I first started college, and it originally didn’t have any clear purpose.
If anything, starting a blog really was just an excuse to rewrite and expand a custom content management system that I had written in PHP 5 back in high school.
Sometime in late 2010, as I got more interested in computer graphics, this blog became something of a personal journal to document my progress in exploring computer graphics.
Around this time I also decided that I wanted to focus all of my attention on computer graphics, so I dropped most of the web-related projects I had at the time and moved this blog from my own custom CMS to Blogger.
In grad school, I started to experiment with writing longer-form posts; for the first time for this blog, these posts were written primarily with a reader other than my future self in mind.
In other words, this is the point where I actually started to write posts intended for an external audience.
At this point I also moved the blog from Blogger to running on Jekyll hosted through Github Pages, and that’s when the first iterations of the current layout were put into place.</p>
<p>Fast forward to today; I’ve now been working at Disney Animation for six years, and (to my constant surprise) this blog has picked up a small but steady readership in the computer graphics field!
The purpose I see for this blog now is to provide high quality, in-depth writeups of whatever projects I find interesting, with the hope that 1. my friends and colleagues and other folks in the field will find the posts similarly interesting and 2. that the posts I write can be informative and inspiring for aspiring students that might stumble upon this blog.
When I was a student, I drew a lot of inspiration from reading a lot of really cool computer graphics and programming blogs, and I want to be able to give back the same to future students!
Similarly, my personal site, which uses an extended version of the blog’s layout, now serves primarily as a place to collect and showcase projects that I’ve worked on with an eye towards hopefully inspiring other people, as opposed to serving as a tool to get recruited.</p>
<p>I post much less frequently now than when I was in school, but that’s because I put far more thought and effort into each post; while the rate at which new posts appear has slowed down, I like to think that I’ve vastly improved both the quality and quantity of content within each post.
I recently ran <code class="language-plaintext highlighter-rouge">wc -w</code> on the blog’s archives, which yielded some interesting numbers.
From 2014 to now, I’ve only written 38 posts, but these 38 posts total a bit over 96,000 words (which averages to roughly 2,500 words per post).
Contrast with 2010 through the end of 2013, when I wrote 78 posts that together total only about 28,000 words (which averages to roughly 360 words per post)!
Those early posts came frequently, but a lot of those early posts are basically garbage; I only leave them there so that new students can see that my stuff wasn’t very good when I started either.</p>
<p>When I put the current layout into place eight years ago, I wanted the layout to have as little clutter as possible and focus on presenting a clear, optimized reading experience.
I wanted computer graphics enthusiasts that come to read this blog to be able to focus on the content and imagery with as little distraction from the site’s layout as possible, and that meant keeping the layout as simple and minimal as possible while still looking good.
Since the main topic this blog focuses on is computer graphics, and obviously computer graphics is all about pictures and the code that generates those pictures (hence the name of the blog being “Code & Visuals”), I wanted the layout to allow for large, full-width images.
The focus on large full-width images is why the blog is single-column with no sidebars of any sort; in many ways, the layout is actually more about the images than the text, hence why text never wraps around an image either.
Over the years I have also added additional capabilities to the layout in support of computer graphics content, such as MathJax integration so that I can embed beautiful LaTeX math equations, and an embedded sliding image comparison tool so that I can show before/after images with a wiping interface.</p>
<p>So with all of the above in mind, the reason for finally making the layout responsive is simple: I want the blog to be as clear and as readable as I can reasonably make it, and that means clear and readable on <em>any</em> device, not just in a desktop browser with a large window!
I think a lot of modern “minimal” designs tend to use <em>too</em> much whitespace and sacrifice information and text density; a key driving principle behind my layout is to maintain a clean and simple look while still maintaining a reasonable level of information and text density.
However, the old non-responsive layout’s density in smaller viewports was just ridiculous; nothing could be read without zooming in a lot, which on phones then meant a lot of swiping both up/down and left/right just to read a single sentence.
For the new responsive improvements, I wanted to make everything readable in small viewports without any zooming or swiping left/right.
I think the new responsive version of the layout largely accomplishes this goal; here’s an animation of how the layout resizes as the content window shrinks, as applied to the landing page of my personal site:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/scaling.gif"><img src="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/preview/scaling.gif" alt="Figure 2: Animation of how the new layout changes as the window changes size." /></a></p>
<p>Adapting my layout to be responsive was surprisingly easy and straightforward!
My blog and personal site use the same layout design, but the actual implementations are a bit different.
The blog’s layout is a highly modified version of an old layout called <a href="https://github.com/kezzbracey/N-Coded">N-Coded</a>, which in turn is an homage to what <a href="https://ghost.org">Ghost</a>’s default <a href="https://github.com/TryGhost/Casper">Casper</a> layout looked like back in 2014 (Casper looks completely different today).
Since the blog’s layout inherited some bits of responsive functionality from the layout that I forked from, getting most things working just required updating, fixing, and activating some already existing but inactive parts of the CSS.
My personal site, on the other hand, reimplements the same layout using completely hand-written CSS instead of using the same CSS as the blog; the reason for this difference is because my personal site extends the design language of the layout for a number of more customized pages such as project pages, publication pages, and more.
Getting my personal site’s layout updated with responsive functionality required writing more new CSS from scratch.</p>
<p>I used to be fairly well versed in web stuff back in high school, but obviously the web world has moved on considerably since then.
I’ve forgotten most of what I knew back then anyway since it’s been well over a decade, so I kind of had to relearn a lot of things.
However, I guess a lot of things in programming are similar to riding a bicycle: once you learn, you never fully forget!
Relearning what I had forgotten was pretty easy, and I quickly figured out that the only really new thing I needed to understand for implementing responsive stuff was the CSS <code class="language-plaintext highlighter-rouge">@media</code> rule, which was introduced in 2009 but only gained full support across all major browsers in 2012.
For those totally unfamiliar with web stuff: the <code class="language-plaintext highlighter-rouge">@media</code> rule allows for checking things like the width and height and resolution of the current viewport and allows for specifying CSS rule overrides per media query.
Obviously this capability is super useful for responsive layouts; implementing responsive layouts really boils down to just making sure that positions are specified as percentages or relative positions instead of fixed positions and then using <code class="language-plaintext highlighter-rouge">@media</code> rules to make larger adjustments to the layout as the viewport size reaches different thresholds.
For example, I use <code class="language-plaintext highlighter-rouge">@media</code> rules to determine when to reorganize from a two-column layout into stacked single-column layout, and I also use <code class="language-plaintext highlighter-rouge">@media</code> rules to determine when to adjust font sizes and margins and stuff.
The other important part to implementing a responsive layout is to instruct the browser to set the width of the page to follow the screen-width of the viewing device on mobile.
The easiest way to implement this requirement by far is to just insert the following into every page’s HTML headers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><meta name="viewport" content="width=device-width">
</code></pre></div></div>
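<p>As a rough illustration of how these pieces fit together, here is a minimal sketch of a media query that collapses a wide, centered desktop column into a full-width single column on small viewports; the class name and breakpoint are made up for this example and are not the actual CSS this site uses:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Desktop and larger windows: content sits in a centered column. */
.content-column {
    width: 60%;
    margin: 0 auto;
}

/* Small viewports: collapse to a single full-width column and
   tighten margins so nothing requires horizontal scrolling. */
@media (max-width: 700px) {
    .content-column {
        width: 100%;
        margin: 0;
        padding: 0 1em;
    }
}
</code></pre></div></div>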
<p>For the most part, the new responsive layout actually doesn’t really noticeably change how my blog and personal site look on full desktop browsers and in large windows much, aside from some minor cleanups to spacing and stuff.
However, there is one big noticeable change: I got rid of the shrinking pinned functionality for the navbar.
Previously, as a user scrolled down, the header for my blog and personal site would shrink and gradually transform into a more compact version that would then stay pinned to the top of the browser window:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/old_header.gif"><img src="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/preview/old_header.gif" alt="Figure 3: Animation of how the old shrinking, pinned navbar worked." /></a></p>
<p>The shrinking pinned navbar functionality was implemented by using a small piece of JavaScript to read how far down the user had scrolled and dynamically adjusting the CSS for the navbar accordingly.
This feature was actually one of my favorite things that I implemented for my blog and site layout!
However, I decided to get rid of it because on mobile, space in the layout is already at a premium, and taking up space that otherwise could be used for content with a pinned navbar just to have my name always at the top of the browser window felt wasteful.
I thought about changing the navbar so that as the user scrolled down, the nav links would turn into a hidden menu accessible through a <a href="https://en.wikipedia.org/wiki/Hamburger_button">hamburger button</a>, but I personally don’t actually really like the additional level of indirection and complexity that hamburger buttons add.
So, the navbar is now just fixed and scrolls just like a normal element of each page:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/new_header.gif"><img src="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/preview/new_header.gif" alt="Figure 4: Animation of how the new fixed navbar works." /></a></p>
<p>I think a fixed navbar is fine for now; I figure that if someone is already reading a post on my blog or something on my personal site, they’ll already know where they are and don’t need a big pinned banner with my name on it as a reminder.
However, if I start to find that scrolling up to reach nav links is getting annoying, I guess I’ll put some more thought into if I can come up with a design that I like for a smaller pinned navbar that doesn’t take up too much space in smaller viewports.</p>
<p>While I was in the code, I also made a few other small improvements to both the blog and my personal site.
On the blog, I made a small improvement for embedded code snippets: embedded code snippets now include line numbers on the side!
The line numbers are generated with a small bit of JavaScript but drawn entirely through CSS, so they don’t interfere with selecting and copying text out of the embedded code snippets; a rough sketch of the general technique follows after this paragraph.
On my personal site, removing the shrinking/pinning aspect of the navbar actually allowed me to completely remove almost all JavaScript includes on the site, aside from some analytics code.
On the blog, JavaScript is still present for some small things like the code line numbers, some caption features, MathJax, and analytics, but otherwise is at a bare minimum.</p>
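<p>The general idea behind the line numbers (the following is just a minimal sketch of the technique with made-up class names, not the blog’s exact CSS or markup) is to have the JavaScript wrap each code line in its own element, and then let CSS counters and ::before pseudo-elements draw the numbers; since pseudo-element content isn’t part of the document text, the numbers don’t get picked up when selecting and copying code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Assumes JavaScript wraps each line in <span class="code-line">
   inside <pre class="numbered">. */
pre.numbered { counter-reset: line; }
pre.numbered span.code-line { counter-increment: line; }
pre.numbered span.code-line::before {
    content: counter(line);  /* drawn by CSS, not part of the copyable text */
    display: inline-block;
    width: 2.5em;
    padding-right: 1em;
    text-align: right;
    color: #999;
    user-select: none;
}
</code></pre></div></div>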
<p>Over time I’d like to pare back the includes my layout uses even further to help improve load times even more.
One of the big motivators for moving my blog from Blogger to Jekyll was simply for page loading speed; under the hood Blogger is a big fancy dynamic CMS, whereas Jekyll just serves up static pages that are pre-generated once from Markdown files.
A few years ago, I similarly moved my personal site from using a simple dynamic templating engine I had written in PHP to instead be entirely 100% static; I now just write each page on my personal site directly as simple HTML and serve everything statically as well.
As a result, my personal site loads extremely fast!
My current layout definitely still has room for optimization though; currently, I use fonts from TypeKit because I like nice typography and having nice fonts like Futura and Proxima Nova is a big part of the overall “look” of the layout.
Fonts can add a lot of weight if not optimized carefully though, so maybe down the line I’ll need to streamline how fonts work in my layout.
Also, since the blog has a ton of images, I think updating the blog to use native browser lazy loading of images through the <code class="language-plaintext highlighter-rouge">loading="lazy"</code> attribute on <code class="language-plaintext highlighter-rouge">img</code> tags should help a lot with load speeds, but not all major browsers support this attribute yet.
Some day I’d like to get my site down to something as minimal and lightweight as <a href="https://macwright.com/2016/05/03/the-featherweight-website.html">Tom MacWright’s blog</a>, but still, for now I think things are in decent shape.</p>
<p>If for some reason you’re curious to see how all of the improvements mentioned in this post are implemented, the source code for both my blog and my personal site are available on my Github.
Please feel free to either steal any bits of the layout that you may find useful, or if you want, feel free to even fork the entire layout to use as a basis for your own site.
Although, if you do fork the entire layout, I would suggest and really prefer that you put some effort into personalizing the layout and really making it your own instead of just using it exactly as how I have it!</p>
<p>Hopefully this is the last time for a very long while that I’ll write a blog post about the blog itself; I’m an excruciatingly slow writer these days, but I currently have the largest number of posts simultaneously nearing completion that I’ve had in a long time, and I’ll be posting them soon.
As early as later this week I’ll be posting the first part of a two-part series about porting Takua Renderer to 64-bit ARM; that post will include a deep dive into some fun concurrency and atomics-related problems at the x86-64 and arm64 assembly level.
The second part of this series should come soon too, and over the summer I’m also hoping to finish posts about hex-tiling in Takua and on implementing/using different light visibility modes.
Staying at home during the pandemic has also given me time to slowly chip away at the long-delayed second and third parts of what was supposed to be a series on mipmapped tiled texture caching, so with some luck maybe those posts will finally appear this year too.
Beyond that, I’ve started some very initial steps on new next-generation from-the-ground-up reimplementations of Takua in CUDA/Optix and in Metal, and I’ve started to dip my toes into Rust as well, so who knows, maybe I’ll have stuff to write about that too in the future!</p>
https://blog.yiningkarlli.com/2021/04/magic-shop-renderman-challenge.html
Magic Shop RenderMan Art Challenge
2021-04-12T00:00:00+00:00
2021-04-12T00:00:00+00:00
Yining Karl Li
<div>
<p>Last fall, I participated in my third Pixar’s RenderMan Art Challenge, “Magic Shop”!
I wasn’t initially planning on participating this time around due to not having as much free time on my hands, but after taking a look at the provided assets for this challenge, I figured that it looked fun and that I could learn some new things, so why not?
Admittedly participating in this challenge is why some technical content I had planned for this blog in the fall wound up being delayed, but in exchange, here’s another writeup of some fun CG art things I learned along the way!
This RenderMan Art Challenge followed the same format as usual: Pixar <a href="https://renderman.pixar.com/magic-shop-asset">supplied some base models</a> without any uvs, texturing, shading, lighting, etc, and participants had to start with the supplied base models and come up with a single final image.
Unlike in previous challenges though, this time around Pixar also provided a rigged character in the form of the popular open-source <a href="https://www.facebook.com/mathildarig">Mathilda Rig</a>, to be incorporated into the final entry somehow.
Although my day job involves rendering characters all of the time, I have really limited experience with working with characters in my personal projects, so I got to try some new stuff!
Considering that my time spent on this project was far more limited than on previous RenderMan Art Challenges, and considering that I didn’t really know what I was doing with the character aspect, I’m pretty happy that my final entry <a href="https://renderman.pixar.com/news/renderman-magic-shop-art-challenge-final-results">won third place in the contest</a>!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/magicshop_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/magicshop_full.jpg" alt="Figure 1: My entry to Pixar's RenderMan Magic Shop Art Challenge, titled "Books are Magic". Click for 4K version. Mathilda model by Xiong Lin and rig by Leon Sooi. Pixar models by Eman Abdul-Razzaq, Grace Chang, Ethan Crossno, Siobhán Ensley, Derrick Forkel, Felege Gebru, Damian Kwiatkowski, Jeremy Paton, Leif Pedersen, Kylie Wijsmuller, and Miguel Zozaya © Disney / Pixar - RenderMan "Magic Shop" Art Challenge." /></a></p>
<p><strong>Character Explorations</strong></p>
<p>I originally wasn’t planning on entering this challenge, but I downloaded the base assets anyway because I was curious about playing with the rigged character a bit.
I discovered really quickly that the Mathilda rig is reasonably flexible, but the flexibility meant that the rig can go off model really fast, and also the face can get really creepy really fast.
I think part of the problem is just the overall character design; the rig is based on a young Natalie Portman’s character from the movie Léon: The Professional, and the character in that movie is… something of an unusual character, to say the least.
The model itself has a head that’s proportionally a bit on the large side, and the mouth is especially large, which is part of why the facial rig gets so creepy so fast.
One of the first things I discovered was that I had to scale down the rig’s mouth and teeth a bit just to bring things back into more normal proportions.</p>
<p>After playing with the rig for a few evenings, I started thinking about what I should make if I did enter the challenge after all.
I’ve gotten a lot busier recently with personal life stuff, so I knew I wasn’t going to have as much time to spend on this challenge, which meant I needed to come up with a relatively straightforward simple concept and carefully choose what aspects of the challenge I was going to focus on.
I figured that most of the other entries into the challenge were going to use the provided character in more or less its default configuration and look, so I decided that I’d try to take the rig further away from its default look and instead use the rig as a basis for a bit of a different character.
The major changes I wanted to make to take the rig away from its default look were to add glasses, completely redo the hair, simplify the outfit, and shade the outfit completely differently from its default appearance.</p>
<p>With this plan in mind, the first problem I tackled was creating a completely new hairstyle for the character.
The last time I did anything with making CG hair was about a decade ago, and I did a terrible job back then, so I wanted to figure out how to make passable CG hair first because I saw the hair as basically a make-or-break problem for this entire project.
To make the hair in this project, I chose to use Maya’s XGen plugin, which is a generator for arbitrary primitives, including but not limited to curves for things like hair and fur.
I chose to use XGen in part because it’s built into Maya, and also because I already have some familiarity with XGen thanks to my day job at Disney Animation.
XGen was originally developed at Disney Animation <a href="https://dl.acm.org/doi/10.1145/965400.965411">[Thompson et al. 2003]</a> and is used extensively on Disney Animation feature films; Autodesk licensed XGen from Disney Animation and incorporated XGen into Maya’s standard feature set in 2011.
XGen’s origins as a Disney Animation technology explain why XGen’s authoring workflow uses Ptex <a href="https://doi.org/10.1111/j.1467-8659.2008.01253.x">[Burley and Lacewell 2008]</a> for maps and SeExpr <a href="https://wdas.github.io/SeExpr/">[Walt Disney Animation Studios 2011]</a> for expressions.
Of course, since 2011, the internal Disney Animation version of XGen has developed along its own path and gained capabilities and features <a href="https://dl.acm.org/citation.cfm?id=2927466">[Palmer and Litaker 2016]</a> beyond Autodesk’s version of XGen, but the basics are still similar enough that I figured I wouldn’t have too difficult of a time adapting.</p>
<p>I found a great intro to XGen course from <a href="https://jesusfc.net">Jesus FC</a>, which got me up and running with guides/splines XGen workflow.
I eventually found that the workflow that worked best for me was to actually model sheets of hair using just regular polygonal modeling tools, and then use the modeled polygonal sheets as a base surface to help place guide curves on to drive the XGen splines.
After a ton of trial and error and several restarts from scratch, I finally got to something that… admittedly still was not very good, but at least was workable as a starting point.
One of the biggest challenges I kept running into was making sure that different “planes” of hair didn’t intersect each other; intersections produce grooms that look okay at first glance but look unnatural upon anything more than a moment’s inspection.
Here are some early drafts of the custom hair groom:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/hair_test.003.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/hair_test.003.jpg" alt="Figure 2: Early iteration of a custom hair groom for the character, with placeholder glasses." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/hair_test.004.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/hair_test.004.jpg" alt="Figure 3: Another early iteration of a custom hair groom for the character, with pose test and with placeholder glasses." /></a></p>
<p>To shade the hair, I used RenderMan’s PxrMarschnerHair shader, driven using RenderMan’s PxrHairColor node.
PxrHairColor implements d’Eon et al. <a href="https://doi.org/10.1111/j.1467-8659.2011.01976.x">[2011]</a>, which allows for realistic hair colors by modeling melanin concentrations in hair fibers, and PxrMarschnerHair <a href="http://graphics.pixar.com/library/PxrMaterialsCourse2017/index.html">[Hery and Ling 2017]</a> implements a version of the classic Marschner et al. <a href="https://doi.org/10.1145/882262.882345">[2003]</a> hair model improved using adaptive importance sampling <a href="https://graphics.pixar.com/library/DataDrivenHairScattering/">[Pekelis et al. 2015]</a>.
In order to really make hair look good, some amount of randomization and color variation between different strands is necessary; PxrHairColor supports randomization and separately coloring stray flyaway hairs based on primvars.
In order to use the randomization features, I had to remember to check off the “id” and “stray” boxes under the “Primitive Shader Parameters” section of XGen’s Preview/Output tab.
Overall I found the PxrHairColor/PxrMarschnerHair system a little bit difficult to use; figuring out how a selected melanin color maps to a final rendered look isn’t exactly 1-to-1 and requires some getting used to.
This difference in authored hair color and final rendered hair color happens because the authored hair color is the color of a single hair strand, whereas the final rendered hair color is the result of multiple scattering between many hair strands combined with azimuthal roughness.
Fortunately, hair shading should get easier in future versions of RenderMan, which are supposed to ship with an implementation of Disney Animation’s artist-friendly hair model <a href="https://doi.org/10.1111/cgf.12830">[Chiang et al. 2016]</a>.
The Chiang model uses a color re-parameterization that allows for the final rendered hair color to closely match the desired authored color by remapping the authored color to account for multiple scattering and azimuthal roughness; this hair model is what we use in Disney’s Hyperion Renderer of course, and is also implemented in Redshift and is the basis of VRay’s modern VRayHairNextMtl shader.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/hair_test.006.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/hair_test.006.jpg" alt="Figure 4: More progressed iteration of a custom hair groom for the character, with final glasses." /></a></p>
<p><strong>Skin Shading and Subsurface Scattering</strong></p>
<p>For shading the character’s skin, the approach I took was to use the rig’s default textures as a starting point, modify heavily to get the textures that I actually wanted, and then use the modified textures to author new materials using PxrSurface.
The largest changes I made to the supplied skin textures are in the maps for subsurface; I basically had to redo everything to provide better inputs to subsurface color and mean free path to get the look that I wanted, since I used PxrSurface’s subsurface scattering set to exponential path-traced mode.
I generally like the controllability and predictability that path-traced SSS brings, but RenderMan 23’s PxrSurface implementation includes a whole bunch of different subsurface scattering modes, and the reason for this is interesting and worth briefly discussing.</p>
<p>Subsurface scattering models how light penetrates the surface of a translucent object, bounces around and scatters inside of the object, and exits at a different surface point from where it entered; this effect is exhibited by almost all organic and non-conductive materials to some degree.
Subsurface scattering has been supported in renderers for a long time; strong subsurface scattering support was actually a standout feature for RenderMan as early as 2002/2003ish <a href="https://graphics.pixar.com/library/RMan2003/">[Hery 2003]</a>, when RenderMan was still a REYES rasterization renderer.
Instead of relying on brute-force path tracing, earlier subsurface scattering implementations relied on diffusion approximations, which approximate the effect of light scattering around inside of an object by modeling the aggregate behavior of scattered light over a simplified surface.
One popular way of implementing diffusion is through dipole diffusion <a href="https://dl.acm.org/doi/10.1145/383259.383319">[Jensen et al. 2001, </a> <a href="http://www.eugenedeon.com/project/a-better-dipole/">d’Eon 2012,</a> <a href="https://graphics.pixar.com/library/TexturingBetterDipole/">Hery 2012]</a> and another popular technique is through the normalized diffusion model <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015, </a> <a href="https://graphics.pixar.com/library/ApproxBSSRDF">Christensen and Burley 2015]</a> that was originally developed at Disney Animation for Hyperion.
These models are implemented in RenderMan 23’s PxrSurface as the “Jensen and d’Eon Dipoles” subsurface model and the “Burley Normalized” subsurface model, respectively.</p>
<p>Diffusion models were the state-of-the-art for a long time, but diffusion models require a number of simplifying assumptions to work; one of the fundamental key simplifications universal to all diffusion models is an assumption that subsurface scattering is taking place on a semi-infinite slab of material.
Thin geometry breaks this fundamental assumption, and as a result, diffusion-based subsurface scattering tends to lose more energy than it should in thin geometry.
This energy loss means that thin parts of geometry rendered with diffusion models tend to look darker than one would expect in reality.
Along with other drawbacks, this thin-geometry energy loss is one of the major reasons why most renderers have moved to brute-force path-traced subsurface scattering in the past half decade; avoiding diffusion’s artifacts is exactly the controllability and predictability that I mentioned earlier.
Subsurface scattering is most accurately simulated by brute-force path tracing within a translucent object, but brute-force path-traced subsurface scattering has only really become practical for production in the past 5 or 6 years; the two major obstacles were computational cost and the (until recently) lack of an intuitive, artist-friendly parameterization for apparent color and scattering distance.
Much like how the final color of a hair model is really the result of the color of individual hair fibers <em>and</em> the aggregate multiple scattering behavior between many hair strands, the final color result of subsurface scattering arises from a complex interaction between single-scattering albedo, mean free path, and numerous multiple scattering events.
So, much like how an artist-friendly, controllable hair model requires being able to invert an artist-specified final apparent color to produce internally-used scattering albedos (this process is called <em>albedo inversion</em>), subsurface scattering similarly requires an albedo inversion step to allow for artist-friendly controllable parameterizations.
The process of albedo inversion for diffusion models is relatively straightforward and can be computed using nice closed-form analytical solutions, but the same is not true for path-traced subsurface scattering.
A major key breakthrough to making path-traced subsurface scattering practical was the development of a usable data-fitted albedo inversion technique <a href="https://dl.acm.org/doi/10.1145/2897839.2927433">[Chiang et al. 2016]</a> that allows path-traced subsurface scattering and diffusion subsurface scattering to use the same parameterization and controls.
This technique was first developed at Disney Animation for Hyperion, and this technique was modified by Wrenninge et al. <a href="https://graphics.pixar.com/library/PathTracedSubsurface/">[2017]</a> and combined with additional support for anisotropic scattering and non-exponential free flight to produce the “Multiple Mean Free Paths” and “path-traced” subsurface models in RenderMan 23’s PxrSurface.</p>
<p>In my initial standalone lookdev test setup, something that took a while was dialing the subsurface back from looking too gummy while at the same time trying to preserve something of a glow-y look, since the final scene I had in mind would be very glow-y.
From both personal and production experience, I’ve found that one of the biggest challenges in moving from diffusion or point-based subsurface scattering solutions to brute-force path-traced subsurface scattering often is in having to readjust mean free paths to prevent characters from looking too gummy, especially in areas where the geometry gets relatively thin, because of the aforementioned thin geometry problem that diffusion models suffer from.
In order to compensate for energy loss and produce a more plausible result, parameters and texture maps for diffusion-based subsurface scattering are often tuned to overcompensate for energy loss in thin areas.
However, applying these same parameters to an accurate brute-force path tracing model that already models subsurface scattering in thin areas correctly results in overly bright thin areas, hence the gummier look.
Since I started with the supplied skin textures for the character model, and the original skin shader for the character model was authored for a different renderer that used diffusion-based subsurface scattering, the adjustments I had to make were specifically to fight this overly glow-y gummy look in path-traced mode when using parameters authored for diffusion.</p>
<p><strong>Clothes and Fuzz</strong></p>
<p>For the character’s clothes and shoes, I wanted to keep the outfit geometry to save time, but I also wanted to completely re-texture and re-shade the outfit to give it my own look.
I had a lot of trouble posing the character without getting lots of geometry interpenetration in the provided jacket, so I decided to just get rid of the jacket entirely.
For the shirt, I picked a sort of plaid flannel-y look for no other reason than I like plaid flannel.
The character’s shorts come with this sort of crazy striped pattern, which I opted to replace with a much more simplified denim shorts look.
I used Substance Painter for texturing the clothes; Substance Painter comes with a number of good base fabric materials that I heavily modified to get to the fabrics that I wanted.
I also wound up redoing the UVs for the clothing completely; my idea was to lay out the UVs similar to how the sewing patterns for each piece of clothing might work if they were made in reality; doing the UVs this way allowed for quickly getting the textures to meet up and align properly as if the clothes were actually sewn together from fabric panels.
A nice added bonus is that Substance Painter’s smart masks and smart materials often use UV seams as hints for effects like wear and darkening, and all of that basically just worked out of the box perfectly with sewing pattern styled UVs.</p>
<p>Bringing everything back into RenderMan though, I didn’t feel that the flannel shirt looked convincingly soft and fuzzy and warm.
I tried using PxrSurface’s fuzz parameter to get more of that fuzzy look, but the results still didn’t really hold up.
The reason the flannel wasn’t looking right ultimately has to do with what the fuzz lobe in PxrSurface is meant to do, and where the fuzzy look in real flannel fabric comes from.
PxrSurface’s fuzz lobe can only really approximate the look of fuzzy surfaces from a distance, where the fuzz is small enough relative to the viewing position that they can essentially be captured as an aggregate microfacet effect.
Even specialized cloth BSDFs really only hold up at a relatively far distance from the camera, since they all attempt to capture cloth’s appearance as an aggregated microfacet effect; an enormous body of research exists on this topic <a href="https://doi.org/10.1111/j.1467-8659.2011.01987.x">[Schröder et al. 2011</a>, <a href="https://doi.org/10.1145/2185520.2185571">Zhao et al. 2012</a>, <a href="https://doi.org/10.1145/2897824.2925932">Zhao et al. 2016</a>, <a href="https://doi.org/10.1111/cgf.13222">Allaga et al. 2017</a>, <a href="https://dl.acm.org/citation.cfm?id=3085024">Deshmukh et al. 2017</a>, <a href="https://doi.org/10.1145/3414685.3417777">Montazeri et al. 2020]</a>.
However, up close, the fuzzy look in real fabric isn’t really a microfacet effect at all- the fuzzy look really arises from multiple scattering happening between individual flyaway fuzz fibers on the surface of the fabric; while these fuzz fibers are very small to the naked eye, they are still a macro-scale effect when compared to microfacets.
The way feature animation studios such as Disney Animation and Pixar have made fuzzy fabric look really convincing over the past half decade is to… just actually cover fuzzy fabric geometry with actual fuzz fiber geometry <a href="https://dl.acm.org/citation.cfm?id=3214787">[Crow et al. 2018]</a>.
In the past few years, Disney Animation and Pixar and others have actually gone even further.
On Frozen 2, embroidery details and lace and such were built out of actual curves instead of displacement on surfaces <a href="https://dl.acm.org/doi/10.1145/3388767.3407360">[Liu et al. 2020]</a>.
On Brave, some of the clothing made from very coarse fibers was rendered entirely as ray-marched woven curves instead of as subdivision surfaces and shaded using a specialized volumetric scheme <a href="https://drive.google.com/file/d/1bNSwpPusRmRmGfPwe11tjtloCP96WN1P/view?usp=sharing">[Child 2012]</a>, and on Soul, many of the hero character outfits (including ones made of finer woven fabrics) are similarly rendered as brute-force path-traced curves instead of as subdivision surfaces <a href="http://graphics.pixar.com/library/CurveCloth/">[Hoffman et al. 2020]</a>.
Animal Logic similarly renders hero cloth as actual woven curves <a href="https://dl.acm.org/citation.cfm?id=3214781">[Smith 2018]</a>, and I wouldn’t be surprised if most VFX shops use a similar technique now.</p>
<p>Anyhow, in the end I decided to just bite the bullet in terms of memory and render speed and cover the flannel shirt in bazillions of tiny little actual fuzz fibers, instanced and groomed using XGen.
The fuzz fibers are shaded using PxrMarschnerHair and colored to match the fabric surface beneath.
I didn’t actually go as crazy as replacing the entire cloth surface mesh with woven curves; I didn’t have nearly enough time to write all of the custom software that would require, but fuzzy curves on top of the cloth surface mesh is a more-than-good-enough solution for the distance that I was going to have the camera at from the character.
The end result instantly looked vastly better, as seen in this comparison of before and after adding fuzz fibers:</p>
<div class="embed-container">
<iframe src="/content/images/2021/Apr/magicshop/comparisons/shirt_fuzznofuzzcompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 5: Shirt before (left) and after (right) XGen fuzz. For a full screen comparison, <a href="/content/images/2021/Apr/magicshop/comparisons/shirt_fuzznofuzzcompare.html">click here.</a></div>
<p>Putting fuzz geometry on the shirt actually worked well enough that I proceeded to do the same for the character’s shorts and socks as well.
For the socks especially having actual fuzz geometry really helped sell the overall look.
I also added fine peach fuzz geometry to the character’s skin as well, which may sound a bit extreme, but has actually been standard practice in the feature animation world for several years now; Disney Animation began adding fine peach fuzz on all characters on Moana <a href="https://www.yiningkarlli.com/projects/ptcourse2017.html">[Burley et al. 2017]</a>, and Pixar started doing so on Coco.
Adding peach fuzz to character skin ends up being really useful for capturing effects like rim lighting without the need for dedicated lights or weird shader hacks to get that distinct bright rim look; the rim lighting effect instead comes entirely from multiple scattering through the peach fuzz curves.
Since I wanted my character to be strongly backlit in my final scene, I knew that having good rim lighting was going to be super important, and using actual peach fuzz geometry meant that it all just worked!
Here is a comparison of my final character texturing/shading/look, backlit without and with all of the geometric fuzz.
The lighting setup is exactly the same between the two renders; the only difference is the presence of fuzz causing the rim effect.
This effect doesn’t happen when using only the fuzz lobe of PxrSurface!</p>
<div class="embed-container-square">
<iframe src="/content/images/2021/Apr/magicshop/comparisons/character_backlightcompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 6: Character backlit without and with fuzz. The rim lighting effect is created entirely by backlighting scattering through XGen fuzz on the character and the outfit. For a full screen comparison, <a href="/content/images/2021/Apr/magicshop/comparisons/character_backlightcompare.html">click here.</a> Click <a href="/content/images/2021/Apr/magicshop/character.003.jpg">here</a> and <a href="/content/images/2021/Apr/magicshop/character.004.jpg">here</a> to see the full 4K renders by themselves.</div>
<p>I used SeExpr expressions instead of using XGen’s guides/splines workflow to control all of the fuzz; the reason for using expressions was because I only needed some basic noise and overall orientation controls for the fuzz instead of detailed specific grooming.
Of course, adding geometric fuzz to all of a character’s skin and clothing does increase memory usage and render times, but not by as much as one might expect!
According to RenderMan’s stats collection system, adding geometric fuzz increased overall memory usage for the character by about 20%, and for the renders in Figure 8, adding geometric fuzz increased render time by about 11%.
Without the geometric fuzz, there are 40159 curves on the character, and with geometric fuzz the curve count increases to 1680364.
Even though there was a 41x increase in the number of curves, the total render time didn’t really increase by too much, thanks to logarithmic scaling of ray tracing with respect to input complexity.
In a rasterizer, adding 41x more geometry would slow the render down to a crawl due to the linear scaling of rasterization, but ray tracing makes crazy things like actual geometric fuzz not just possible, but downright practical.
Of course all of this can be made to work in a rasterizer with sufficiently clever culling and LOD and such, but in a ray tracer it all just works out of the box!</p>
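<p>As a rough back-of-the-envelope sanity check (assuming, as a simplification, that acceleration structure traversal cost grows roughly with the logarithm of the primitive count and ignoring everything else that contributes to render time), the expected growth in per-ray traversal cost from adding all of the fuzz is only about:</p>
<p>\[ \frac{\log_2(1680364)}{\log_2(40159)} \approx \frac{20.7}{15.3} \approx 1.35 \]</p>
<p>In other words, a 41x increase in curve count translates into only something like a third more traversal work per ray, which is in the same ballpark as the modest render time increase I actually measured and nowhere near a 41x slowdown.</p>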
<p>Here are a few closeup test renders of all of the fuzz:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/woolysocks.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/woolysocks.jpg" alt="Figure 7: Closeup test render of the fuzz on the woolly socks, along with the character's shoes." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/fuzzcloseup.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/fuzzcloseup.jpg" alt="Figure 8: Closeup test render of fuzz on the shirt and peach fuzz on the character's skin." /></a></p>
<p><strong>Layout, Framing, and Building the Shop</strong></p>
<p>After completing all of the grooming and re-shading work on the character, I finally reached a point where I felt confident enough in being able to make an okay looking character that I was willing to fully commit into entering this RenderMan Art Challenge.
I got to this decision really late in the process relative to previous challenges!
Getting to this point late meant that I had actually not spent a whole lot of time thinking about the overall set yet, aside from a vague notion that I wanted backlighting and an overall bright and happy sort of setting.
For whatever reason, “magic shop” and “gloomy dark place” are often associated with each other (and looking at many of the other competitors’ entries, that association definitely seemed to hold on this challenge too).
I wanted to steer away from “gloomy dark place”, so I decided I instead wanted more of a sunny magic bookstore with lots of interesting props and little details to tell an overall story.</p>
<p>To build my magic bookstore set, I wound up remixing the provided assets fairly extensively; I completely dismantled the entire provided magic shop set and used the pieces to build a new corner set that would emphasize sunlight pouring in through windows.
I initially was thinking of placing the camera up somewhere in the ceiling of the shop and showing a sort of overhead view of the entire shop, but I abandoned the overhead idea pretty quickly since I wanted to emphasize the character more (especially after putting so much work into the character).
Once I decided that I wanted a more focused shot of the character with lots of bright sunny backlighting, I arrived at an overall framing and even set dressing that largely stayed the same throughout the rest of the project, albeit with minor adjustments here and there.
Almost all of the props are taken from the original provided assets, with a handful of notable exceptions: in the final scene, the table and benches, telephone, and neon sign are my own models.
Figuring out where to put the character took some more experimentation; I originally had the character up front and center and sitting such that her side is facing the camera.
However, having the character up front and center made her feel not particularly integrated with the rest of the scene, so I eventually placed her behind the big table and changed her pose so that she’s sitting facing the camera.</p>
<p>Here are some major points along the progression of my layout and set dressing explorations:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress_frame.018.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress_frame.018.jpg" alt="Figure 9: First layout test with set dressing and posed character." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress_frame.026.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress_frame.026.jpg" alt="Figure 10: Rotating the character and moving her behind the table for better integration into the overall scene." /></a></p>
<p>One interesting change that I think had a huge impact on how the scene felt overall actually had nothing to do with the set dressing at all, but instead had to do with the camera itself.
At some point I tried pulling the camera back further from the character and using a much narrower lens, which had the overall effect of pulling the entire frame much closer and tighter on the character and giving everything an ever-so-slightly more orthographic feel.
I really liked how this lensing worked; to me it made the overall composition feel much more focused on the character.
Also around this point is when I started integrating the character with completed shading and texturing and fuzz into the scene, and I was really happy to see how well the peach fuzz and clothing fuzz worked out with the backlighting:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress_frame.032.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress_frame.032.jpg" alt="Figure 11: Focusing on the character by using a narrower lens on a camera placed further back. Also at this point I integrated the reshaded/retextured outfit and fuzz elements in." /></a></p>
<p>Once I had the overall blocking of the scene and rough set dressing done, the next step was to shade and texture everything!
Since my scene is set indoors, I knew that global illumination coming off of the walls and floor and ceiling of the room itself was going to play a large role in the overall lighting and look of the final image, so I started the lookdev process with the room’s structure itself.</p>
<p>The first decision to tackle was whether or not to have glass in the big window thing behind the character.
I didn’t really want to put glass in the window, since most of the light for the scene was coming through the window and having to sample the primary light source through glass was going to be really bad for render times.
Instead, I decided that the window was going to be an <em>interior</em> window opening up into some kind of sunroom on the other side, so that I could get away with not putting glass in.
The story I made up in my head was that the sunroom on the other side, being a sunroom, would be bright enough that I could just blow it out entirely to white in the final image.
To help sell the idea, I thought it would be fun to have some ivy or vines growing through the window’s diamond-shaped sections; maybe they’re coming from a giant potted plant or something in the sunroom on the other side.</p>
<p>I initially tried creating the ivy vines using SpeedTree, but I hadn’t really used SpeedTree extensively before and the vines toolset was completely unfamiliar to me.
Since I didn’t have a whole lot of time to work on this project overall, I wound up tabling SpeedTree on this project and instead opted to fall back on a (much) older but more familiar tool: <a href="http://ivy-generator.com">Thomas Luft’s standalone Ivy Generator program</a>.
After several iterations to get an ivy growth pattern that I liked, I textured and shaded the vines and ivy leaves using some atlases from Quixel Megascans.
The nice thing about adding in the ivy was that it helped break up how overwhelmingly bright the entire window was:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress_frame.037.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress_frame.037.jpg" alt="Figure 12: Scene with ivy vines integrated in to break up the giant background window. Also, at this point I had adjusted the camera lensing again to arrive at what was basically my final layout." /></a></p>
<p>For the overall look of the room, I opted for a sort-of Mediterranean look inspired by the architecture of the tower that came with the scene (despite the fact that the tower isn’t actually in my image).
Based on the Mediterranean idea, I wanted to make the windows out of a fired terracotta brick sort of material and, after initially experimenting with brick walls, I decided to go with stone walls.
To help sell the look of a window made out of stacked fired terracotta blocks, I added a bit more unevenness to the window geometry, and I used fired orange terracotta clay flower pots as a reference for what the fired terracotta material should look like.
To help break up how flat the window geometry is and to help give the blocks a more handmade look, I added unique color unevenness per block and also added a bunch of swirly and dimply patterns to the material’s displacement:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/window_terracotta.003.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/window_terracotta.003.jpg" alt="Figure 13: Lookdev test for the fired terracotta window blocks. All of the unevenness and swirly patterns are coming from roughness and displacement." /></a></p>
<p>To create the stone walls, I just (heavily) modified a preexisting stone material that I got off of Substance Source; the final look relies very heavily on displacement mapping since the base geometry is basically just a flat plane.
I made only the back wall a stone wall; I decided to make the side wall on the right out of plaster instead just so I wouldn’t have to figure out how to make two stone walls meet up at a corner.
I also wound up completely replacing the stone floor with a parquet wood floor, since I wanted some warm bounce coming up from the floor onto the character.
Each plank in the parquet wood floor is a piece of individual geometry.
Putting it all together, here’s what the shading for the room structure looks like:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/window_terracotta.004.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/window_terracotta.004.jpg" alt="Figure 14: Putting the room all together. The rock walls rely entirely on displacement, while the parquet floor uses individually modeled floorboards instead of displacement." /></a></p>
<p>The actual materials in my final image don’t look nearly as diffuse as they do in the above test render; my lookdev test setup’s lighting is relatively soft and diffuse, which turned out not to be a great predictor for my actual scene, since the actual scene’s lighting ended up being strongly backlit.
Also, note how all of the places where different walls meet each other and where the walls meet the floor are super janky; I didn’t bother putting much effort in there since I knew that those areas were either going to be outside of the final frame or were going to be hidden behind props and furniture.</p>
<p><strong>So Many Props!</strong></p>
<p>With the character and room completed, all that was left to do for texturing and shading was just lots and lots of props.
This part was both the easiest and most difficult part of the entire project- easy because all of the miscellaneous props were relatively straightforward to texture and shade, but difficult simply because there were a <em>lot</em> of props.
However, the props were also the funnest part of the whole project!
Thinking about how to make each prop detailed and interesting and unique was an interesting exercise, and I also had fun sneaking in a lot of little easter eggs and references to things I like here and there.</p>
<p>My process for texturing and shading props was a very straightforward workflow that is basically completely unchanged from the workflow I settled into on the previous Shipshape RenderMan Art Challenge: use Substance Painter for texturing, UDIM tiles for high resolution textures, and PxrSurface as the shader for everything.
The only difference from previous projects was that I used a far lazier UV mapping process: almost every prop was just auto-UV’d with some minor adjustments here and there.
I relied on auto-UVs this time simply because I didn’t have much time on this project and couldn’t afford to do precise, careful, high-quality UVs by hand for everything; since all of the props would be relatively small in image space in the final frame, I figured I could get away with hiding seams from the crappy UVs by just exporting really high-resolution textures from Substance Painter.
Yes, this approach is extremely inefficient, but it worked well enough considering how little time I had.</p>
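<p>As a rough illustration of just how lazy this pass was, an auto-UV sweep like this can be scripted in a few lines of Maya Python. This is only a sketch, assuming the props live under a hypothetical group named “props_grp”; it is not the exact set of steps I took:</p>
<pre><code>import maya.cmds as cmds

# Minimal sketch of a lazy auto-UV pass over every prop mesh.
# Assumes the props are parented under a group named "props_grp" (hypothetical).
for shape in cmds.listRelatives("props_grp", allDescendents=True, type="mesh") or []:
    prop = cmds.listRelatives(shape, parent=True)[0]
    # Automatic projection UVs; seams get hidden later by exporting
    # high-resolution textures from Substance Painter.
    cmds.polyAutoProjection(prop, layoutMethod=1, scaleMode=1, percentageSpace=0.2)
</code></pre>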
<p>Since a lot of bounce lighting on the character’s face was going to have to come from the table, the first props I textured and shaded were the table and accompanying benches.
I tried to make the table and bench match each other; they both use a darker wood for the support legs and have metal bits in the frame, and have a lighter wood for the top.
I think I got a good amount of interesting wear and stuff on the benches on my first attempt, but getting the right amount of wear on the table’s top took a couple of iterations to get right.
Again, due to how diffuse my lookdev test setup on this project was, the detail and wear in the table’s top showed up better in my final scene than in these test renders:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/bench.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/bench.jpg" alt="Figure 15: Bench with dark wood legs, metal diagonal braces, and lighter wood top." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/table.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/table.jpg" alt="Figure 16: Main table with chiseled dark wood legs, a metal underframe, a lighter wood top, and gold inlaid runes on the side." /></a></p>
<p>To have a bit of fun and add a slight tiny hint of mystery and magic into the scene, I put some inlaid gold runes into the side of the table.
The runes are a favorite scifi/fantasy quote of mine, which is an inversion of Clarke’s third law.
They read: “any sufficiently rigorously defined magic is indistinguishable from technology”; this quote became something of a driving theme for the props in the scene.
I wanted to give a sense that this shop is a bookshop specializing in books about magic, but the magic of this world is not arbitrary and random; instead, this world’s magic has been studied and systematized into almost another branch of science.</p>
<p>A lot of the props did require minor geometric modifications to make them more plausible.
For example, the cardboard box was originally made entirely out of single-sided surfaces with zero thickness; I had to extrude the surfaces of the box in order to have enough thickness to seem convincing.
There’s not a whole lot else interesting to write about with the cardboard box; it’s just corrugated cardboard.
Although, I do have to say that I am pretty happy with how convincingly cardboard the cardboard boxes came out!
Similarly, the scrolls just use a simple paper texture and, as one would expect with paper, use some diffuse transmission as well.
Each of the scrolls has a unique design, which provided an opportunity for some fun personal easter eggs.
Two of the scrolls have some SIGGRAPH paper abstracts translated into the same runes that the inlay on the table uses.
One of the scrolls has a wireframe schematic of the wand prop that sits on the table in the final scene; my idea was that this scroll is one of the technical schematics that the character used to construct her wand.
To fit with this technical schematic idea, the two sheets of paper in the background on the right wall use the same paper texture as the scrolls and similarly have technical blueprints on them for the record player and camera props.
The last scroll in the box is a city map made using <a href="https://github.com/watabou/TownGeneratorOS">Oleg Dolya’s wonderful Medieval Fantasy City Generator</a> tool, which is a fun little tool that does exactly what the name suggests and with which I’ve wasted more time than I’d like to admit generating and daydreaming about made up little fantasy towns.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/box.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/box.jpg" alt="Figure 17: Corrugated cardboard box containing technical magic scrolls and maps." /></a></p>
<p>The next prop I worked on was the mannequin, which was even more straightforward than the cardboard box and scrolls.
For the mannequin’s wooden components, I relied entirely on triplanar projections in Substance Painter oriented such that the grain of the wood would flow correctly along each part.
The wood material is just a modified version of a default Substance Painter smart material, with additional wear and dust and stuff layered on top to give everything a bit more personality:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/mannequin.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/mannequin.jpg" alt="Figure 18: Mannequin prop made from wood and metal." /></a></p>
<p>The record player was a fun prop to texture and shade, since there were a lot of components and a lot of room for adding little details and touches.
I found a bunch of reference online for briefcase record players and, based off of the reference, I chose to make the actual record player part of the briefcase out of metal, black leather, and black plastic.
The briefcase itself is made from a sort of canvas-like material stretched over a hard shell, with brass hardware for the clasps and corner reinforcements and stuff.
For the speaker openings, instead of going with a normal grid-like dot pattern, I put in an interesting swirly design.
The inside of the briefcase lid uses a red fabric, with a custom gold imprinted logo for an imaginary music company that I made up for this project: “SeneTone”.
I don’t know why, but my favorite details to make when texturing and shading props are things like logos and labels; I think it’s always the labels and markings you’d expect to see in real life that really help make something CG believable.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/recordplayer.001.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/recordplayer.001.jpg" alt="Figure 19: Record player briefcase prop, wide view." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/recordplayer.002.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/recordplayer.002.jpg" alt="Figure 20: Close-up of the actual record player part of the briefcase." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/recordplayer.003.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/recordplayer.003.jpg" alt="Figure 21: Close-up of the red fabric briefcase liner and gold "SeneTone" logo." /></a></p>
<p>Figuring out what to do with the camera prop took some time, mostly because I wasn’t actually sure at first whether it was a camera or a projector!
While this prop looks like an old hand-cranked movie camera, the size of the prop in the scene that Pixar provided threw me off; the prop is way larger than any reference I could find for hand-cranked movie cameras.
I eventually decided that the size could probably be handwaved away by explaining the camera as some sort of really large-format camera.
I decided to model the look of the camera prop after professional film equipment from roughly the 1960s, when high-end cameras and stuff were almost uniformly made out of steel or aluminum housings with black leather or plastic grips.
Modern high-end camera gear also tends to be made from metal, but in modern gear the metal is usually completely covered in plastic or a colored powder coating, whereas the equipment from the 1960s I saw had a lot of exposed silvery-grey metal finishes, with covering materials only in areas that a user would be expected to touch or hold.
So, I decided to give the camera prop an exposed gunmetal finish, with black leather and black plastic grips.
I also reworked the lens and what I think is a rangefinder to include actual optical elements, so that they would look right when viewed from a straight-on angle.
As an homage to old film cinema, I made a little “Super 35” logo for the camera (even though the Super 35 film format is a bit anachronistic for a 1960s era camera).
The “Senecam” typemark is inspired by how camera companies often put their own typemark right across the top of the camera over the lens mount.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/camera.001.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/camera.001.jpg" alt="Figure 22: Camera prop front view. Note all of the layers of refraction and reflection in the lens." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/camera.002.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/camera.002.jpg" alt="Figure 23: Top view of the camera." /></a></p>
<p>The crystal was really interesting to shade.
I wanted to give the internals of the crystal some structure, and I didn’t want the crystal to refract a uniform color throughout.
To get some interesting internal structure, I wound up just shoving a bunch of crumpled up quads inside of the crystal mesh.
The internal crumpled up geometry refracts a couple of different variants of blue and light blue, and the internal geometry has a small amount of emission as well to get a bit of a glowy effect.
The outer shell of the crystal refracts mostly pink and purple; this dual-color scheme gives the internals of the crystal a lot of interesting depth.
The back-story in my head was that this crystal came from a giant geode or something, so I made the bottom of the crystal have bits of a more stony surface to suggest where the crystal was once attached to the inside of a stone geode.
The displacement on the crystal is basically just a bunch of rocky displacement patterns piled on top of each other using triplanar projections in Substance Painter; I think the final look is suitably magical!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/crystal_inside.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/crystal_inside.png" alt="Figure 24: Wireframe of the crystal's internal geometry with crumpled up quads." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/crystal.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/crystal.jpg" alt="Figure 25: Final magical glowy look of the crystal." /></a></p>
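<p>If you want to play with the crumpled-quad trick yourself, something like the following Maya Python snippet produces a similar pile of randomized internal geometry. To be clear, this is a hypothetical sketch for experimenting with the idea, not the exact way I built the crystal’s innards:</p>
<pre><code>import random
import maya.cmds as cmds

# Hypothetical sketch: scatter a pile of "crumpled" quads inside a crystal-sized volume.
random.seed(7)
crumples = []
for i in range(40):
    plane = cmds.polyPlane(width=1, height=1, subdivisionsX=4, subdivisionsY=4)[0]
    # Jitter every vertex a little to crumple the quad.
    for vtx in cmds.ls(plane + ".vtx[*]", flatten=True):
        cmds.move(random.uniform(-0.15, 0.15),
                  random.uniform(-0.15, 0.15),
                  random.uniform(-0.15, 0.15), vtx, relative=True)
    # Random placement, orientation, and scale inside the rough bounds of the crystal.
    cmds.xform(plane,
               translation=(random.uniform(-0.5, 0.5),
                            random.uniform(0.0, 2.0),
                            random.uniform(-0.5, 0.5)),
               rotation=(random.uniform(0, 360),
                         random.uniform(0, 360),
                         random.uniform(0, 360)),
               scale=(random.uniform(0.3, 1.0),) * 3)
    crumples.append(plane)
cmds.group(crumples, name="crystal_internal_geo_grp")
</code></pre>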
<p>Originally the crystal was going to be on one of the back shelves, but I liked how the crystal turned out so much that I decided to promote it to a foreground prop and put it on the foreground table.
I then filled the crystal’s original location on the back shelf with a pile of books.</p>
<p>I liked the crystal look so much that I decided to make the star on the magic wand out of the same crystal material.
The story I came up with in my head is that in this world, magic requires these crystals as a sort of focusing or transmitting element.
The magic wand’s star is shaded using the same technique as the crystal: the inside has a bunch of crumpled up refractive geometry to produce all of the interesting color variation and appearance of internal fractures and cracks, and the outer surface’s displacement is just a bunch of rocky patterns randomly stacked on top of each other.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/wand.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/wand.jpg" alt="Figure 26: Magic wand star made from the same material as the crystal." /></a></p>
<p>The flower-shaped lamps hanging above the table are also made from the same crystal material, albeit a much more simplified version.
The lamps are polished completely smooth and don’t have all of the crumpled up internal geometry since I wanted the lamps to be crack-free.</p>
<p>The potted plant on top of the stack of record crates was probably one of the easiest props to texture and shade.
The pot itself uses the same orange fired terracotta material as the main windows, but with displacement removed and with a bit less roughness.
The leaves and bark on the branches are straight from Quixel Megascans.
The displacement for the branches is actually slightly broken in both the test render below and in the final render, but since it’s a background prop and relatively far from the camera, I actually didn’t really notice until I was writing this post.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/pottedherb.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/pottedherb.jpg" alt="Figure 27: Potted plant prop; the pot reuses the same fired terracotta material as the windows." /></a></p>
<p>The reason that the character in my scene is talking on an old-school rotary dial phone is… actually, there isn’t a strong reason.
I originally was tinkering with a completely different idea that did have a strong story reason for the phone, but I abandoned that idea very early on.
Somehow the phone always stayed in my scene though!
Since the setting of my final scene is a <em>magic</em> bookshop, I figured that maybe the character is working at the shop and maybe she’s casting a spell over the phone!</p>
<p>The phone itself is kit-bashed together from a stock model that I had in my stock model library.
I did have to create the cord from scratch, since the cord needed to stretch from the main phone set to the receiver in the character’s hand.
I modeled the cord in Maya by first creating a guide curve that described the path the cord was supposed to follow, and then making a helix and making it follow the guide curve using Animate -> Motion Paths -> Flow Path Object tool.
The Flow Path Object tool puts a lattice deformer around the helix and makes the lattice deformer follow the guide curve, which in turn deforms the helix to follow as well.</p>
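<p>For reference, here is a rough Maya Python sketch of that curve-plus-helix-plus-Flow-Path-Object setup; the curve points and coil counts below are placeholder values rather than the ones used for the actual prop:</p>
<pre><code>import maya.cmds as cmds

# Rough sketch of the curly cord setup; point positions and counts are placeholders.
# 1. Guide curve describing the path from the phone set to the receiver.
guide = cmds.curve(degree=3,
                   point=[(0, 0, 0), (2, 1, 0), (4, 0.5, 1), (6, 2, 1)])

# 2. A straight helix that will become the coiled cord.
helix = cmds.polyHelix(coils=40, height=6, width=0.4, radius=0.05)[0]

# 3. Attach the helix to the guide curve as a motion path...
cmds.pathAnimation(helix, curve=guide, fractionMode=True, follow=True)

# 4. ...then wrap it in a lattice that follows the curve (the Flow Path Object tool),
#    which in turn deforms the helix along the guide.
cmds.flow(helix, divisions=(2, 60, 2))
</code></pre>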
<p>As with everything else in the scene, all of the shading and texturing for the phone is my own.
The phone is made from a simple red Bakelite plastic with some scuffs and scratches and fingerprints to make it look well used, while the dial and hook switch are made of a simple metal material.
I noticed that in some of the references images of old rotary phones that I found, the phones sometimes had a nameplate on them somewhere with the name of the phone company that provided the phone, so I made up yet another fictional logo and stuck it on the front of the phone.
The fictional phone company is “Senecom”; all of the little references to a place called Seneca hint that maybe this image is set in the same world as my entry for the previous RenderMan Art Challenge.
You can’t actually see the Senecom logo in the final image, but again, at least I know it’s there!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/phone_set.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/phone_set.jpg" alt="Figure 28: "Senecom" phone set, with custom modeled curly cord." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/phone_receiver.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/phone_receiver.jpg" alt="Figure 29: Phone handset, made from red plastic." /></a></p>
<p><strong>Signs and Records and Books</strong></p>
<p>While looking up bookstore reference with shading the books in mind, I came across an image of a sign reading “Books are Magic” from a Brooklyn bookstore of the same name.
Seeing that sign provided a good boost of inspiration for how I proceeded with theming my bookstore set, and I liked the sign so much that I decided to make a bit of an homage to it in my scene.
I wasn’t entirely sure how to make a neon sign though, so I had to do some experimentation.
I started by laying out curves in Adobe Illustrator and bringing them into Maya.
I then made each glass tube by just extruding a cylinder along each curve, and then I extruded a narrower cylinder along the same curve for the glowy part inside of the glass tube.
Each glass tube has a glass shader with colored refraction and uses the thin glass option, since real neon glass tubes are hollow.
The glowy part inside is a mesh light.
To make the renders converge more quickly, I actually duplicated each mesh light; one mesh light is white, is visible to camera, and has thin shadows disabled to provide the look of the glowy neon core, while the second mesh light is red, invisible to camera, and has thin shadows enabled to cast colored glow outside of the glass tubes without introducing tons of noise.
Inside of Maya, this setup looks like the following:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/neonsign_maya.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/neonsign_maya.png" alt="Figure 30: Neon sign setup in Maya." /></a></p>
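<p>For anyone curious about scripting the tube-building step, here is a minimal Maya Python sketch of sweeping a glass tube plus a narrower glow core along each imported curve. It assumes the Illustrator curves were brought in under a hypothetical group named “neon_curves_grp”, and it leaves the glass shader and mesh light assignments as separate steps:</p>
<pre><code>import maya.cmds as cmds

# Sketch: sweep glass tubes and glow cores along imported neon curves.
# Assumes the curves live under a group named "neon_curves_grp" (hypothetical).
for curve in cmds.listRelatives("neon_curves_grp", children=True, type="transform") or []:
    # Outer glass tube: sweep a circle profile along the curve.
    glass_profile = cmds.circle(radius=0.15, sections=16)[0]
    glass_tube = cmds.extrude(glass_profile, curve, extrudeType=2,
                              fixedPath=True, useComponentPivot=1,
                              useProfileNormal=True)[0]
    # Narrower inner tube that later becomes the white mesh light for the glowy core.
    core_profile = cmds.circle(radius=0.05, sections=16)[0]
    core_tube = cmds.extrude(core_profile, curve, extrudeType=2,
                             fixedPath=True, useComponentPivot=1,
                             useProfileNormal=True)[0]
</code></pre>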
<p>After all of this setup work, I gave the neon tubes a test render, and to my enormous surprise and relief, it looked promising!
This was the first test render of the neon tubes; when I saw this, I knew that the neon sign was going to work out after all:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/neonsign_1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/neonsign_1.jpg" alt="Figure 31: First neon sign render test." /></a></p>
<p>After getting the actual neon tubes part of the neon sign working, I added in a supporting frame and wires and stuff.
In the final scene, the neon sign is held onto the back wall using screws (which I actually modeled as well, even though as usual for all of the tiny things that I put way too much effort into, you can’t really see them).
Here is the neon sign on its frame:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/neonsign_2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/neonsign_2.jpg" alt="Figure 32: Final neon sign prop with frame and wires." /></a></p>
<p>The single most time consuming prop in the entire project wound up being the stack of record crates behind the character to the right; I don’t know why I decided to make a stack of record crates, considering how many unique records I wound up having to make to give the whole thing a plausible feel.
In the end I made around twenty different custom album covers; the titles are borrowed from stuff I had recently listened to at the time, but all of the artwork is completely custom to avoid any possible copyright problems with using real album artwork.
The sharp-eyed long-time blog reader may notice that a lot of the album covers reuse renders that I’ve previously posted on this blog before!
For the record crates themselves, I chose a layered laminated wood, which I figured in real life is a sturdy but relatively inexpensive material.
Of course, instead of making all of the crates identical duplicates of each other, I gave each crate a unique wood grain pattern.
The vinyl records that are sticking out here and there have a simple black glossy plastic material with bump mapping for the grooves; I was pleasantly surprised at how well the grooves catch light given that they’re entirely done through bump mapping.</p>
<p>Coming up with all of the different album covers was pretty fun!
Different covers have different neat design elements; some have metallic gold leaf text, others have embossed designs, there are a bunch of different paper varieties, etc.
The common design element tying all of the album covers together is that they all have a “SeneTone” logo on them, to go with the “SeneTone” record player prop.
To make the album covers, I created the designs in Photoshop with separate masks for different elements like metallic text and whatnot, and then used the masks to drive different layers in Substance Painter.
In Substance Painter, I actually created different paper finishes for different albums; some have a matte paper finish, some have a high gloss magazine-like finish, some have rough cloth-like textured finishes, some have smooth finishes, and more.
I guess none of this really matters from a distance, but it was fun to make, and more importantly to myself, I know that all of those details are there!
After randomizing which records get which album covers, here’s what the record crates look like:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/recordcrates_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/recordcrates.jpg" alt="Figure 33: Record crates stack with randomized, custom album covers. Click through for a high-res 4K render if you want to see all of the little details." /></a></p>
<p>The various piles of books sitting around the scene also took a ton of time, for similar reasons to why the records took so much time: I wanted each book to be unique.
Much like the records, I don’t know why I chose to have so many books, because it sure took a long time to make around twenty different unique books!
My idea was to have a whole bunch of the books scattered around suggesting that the main character has been teaching herself how to build a magic wand and cast spells and such- quite literally “books are magic”, because the books are textbooks for various magical topics.
Here is one of the textbooks- this one about casting spells over the telephone, since the character is on the phone.
Maybe she’s trying to charm whoever is on the other end!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/spellbook.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/spellbook.jpg" alt="Figure 34: Hero "Casting Spells over Telephone" book prop. This book was also the prototype for all of the other books!" /></a></p>
<p>I wound up significantly modifying the provided book model; I created several different basic book variants and also a few open book variants, for which I had to also model some pages and stuff.
Because of how visible the books are in my framing, I didn’t want to have any obvious repeats in the books, so I textured every single one of them to be unique.
I also added in some little sticky-note bookmarks into the books, to make it look like they’re being actively read and referenced.</p>
<p>Creating all of the different books with completely different cover materials and bindings and page styles was a lot of fun!
Some of the most interesting covers to create were the ones with intricate gold or silver foil designs on the front; for many of these, I found pictures of really old books and did a bunch of Photoshop work to extract and clean up the cover design for use as a layer mask in Substance Painter.
Here are some of the books I made:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/books1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/books1.jpg" alt="Figure 35: Each one of these textbooks is a play on something I have on my home bookshelf." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/books3.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/books3.jpg" alt="Figure 36: Test render of various different types of pages, along with sticky notes." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/books4.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/books4.jpg" alt="Figure 37: Another test render of different types of pages and of pages sticking out." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/books6.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/books6.jpg" alt="Figure 38: A bunch more books, including a Seneca book!" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/books7.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/books7.jpg" alt="Figure 39: Even more books. Did you notice the copy of PBRTv3 in the background?" /></a></p>
<p>One fun part of making all of these books was that they were a great opportunity for sneaking in a bunch of personal easter eggs.
Many of the book titles are references to computer graphics and rendering concepts.
Some of the book authors are just completely made up or pulled from whatever book caught my eye off of my bookshelf at the moment, but also included among the authors are all of the names of the Hyperion team’s current members at the time that I did this project.
There is also, of course, a book about Seneca, and there’s a book referencing Minecraft.
The green book titled “The Compleat Atlas of the House and Immediate Environs” is a reference to Garth Nix’s “Keys to the Kingdom” series, which my brother and I loved when we were growing up and which had a significant influence on the type of kind-of-a-science magic I like in fantasy settings.
Also, of course, as is obligatory since I am a rendering engineer, there is a copy of <a href="http://www.pbr-book.org">Physically Based Rendering 3rd Edition</a> hidden somewhere in the final scene; see if you can spot it!</p>
<p><strong>Putting Everything Together</strong></p>
<p>At this point, with all extra modeling completed and everything textured and shaded, the time came for final touches and lighting!
Since one of the books I made is about levitation enchantments, I decided to use that to justify making one of the books float in mid-air in front of the character.
To help sell that floating-in-air enchantment, I made some magical glowy pixie dust particles coming from the wand; the pixie dust is just some basic nParticles following a curve.
The pixie dust is shaded using PxrSurface’s glow parameter.
I used the particleId primvar to drive a PxrVary node, which in turn is used to randomize the pixie dust colors and opacity.
Putting everything together at this point looked like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress.075.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress.075.jpg" alt="Figure 40: Putting everything together for the first time with everything textured and shaded." /></a></p>
<p>I originally wanted to add some cobwebs in the corners of the room and stuff, but at this point I had so little time remaining that I had to move on directly to final shot lighting.
I did however have time for two small last-minute tweaks: I adjusted the character’s pose a slight amount to tilt her head towards the phone more, which is closer to how people actually talk on the phone, and I also moved up the overhead lamps a bit to try not to crowd out her head.</p>
<p>The final shot lighting is not actually that far of a departure from the lighting I had already roughed in at this point; mostly the final lighting just consisted of tweaks and adjustments here and there.
I added a bunch of PxrRodFilters to take down hot spots and help shape the lighting overall a bit more.
The rods I added were used to bring down the overhead lamps and prevent them from blowing out, to slightly brighten up some background shelf books, to knock down a hot spot on a foreground book, and to knock down hot spots on the floor and on the bench.
I also brought down the brightness of the neon sign a bit, since the brightness of the sign should be lower relative to how incredibly bright the windows were.
Here is what my Maya viewport looked like with all of the rods; everything green in this screenshot is a rod:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/rods.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/rods.png" alt="Figure 41: Maya viewport with rods highlighted in green." /></a></p>
<p>One of the biggest/trickiest changes I made to the lighting setup was actually for technical reasons instead of artistic reasons: the back window was originally so bright that the brightness was starting to break pixel filtering for any pixel that partially overlapped the back window.
To solve this problem, I split the dome light outside of the window into two dome lights; the two new lights added up to the same intensity as the old one, but the two lights split the energy such that one light had 85% of the energy and was not visible to camera while the other light had 15% of the energy and was visible to camera.
This change had the effect of preserving the overall illumination in the room while knocking down the actual whites seen through the windows to a level low enough that pixel filtering no longer broke as badly.</p>
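<p>To make the energy bookkeeping concrete, here is the split written out as a tiny calculation; the fractions are the ones mentioned above, and the actual intensity value here is arbitrary:</p>
<pre><code># Splitting one dome light into two while preserving total illumination.
original_intensity = 1.0  # whatever the single original dome light was set to

visible_intensity   = 0.15 * original_intensity  # visible to camera
invisible_intensity = 0.85 * original_intensity  # hidden from camera

# Total room illumination is unchanged, but the whites seen directly through
# the window are only 15% as bright, which keeps pixel filtering well behaved.
assert abs((visible_intensity + invisible_intensity) - original_intensity) < 1e-9
</code></pre>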
<p>At this point I arrived at my final main beauty pass.
In previous RenderMan Art Challenges, I broke out lights into several different render passes so that I could adjust them separately in comp before recombining, but for this project, I just rendered out everything on a single pass:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress.083.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress.083.jpg" alt="Figure 42: Final render, beauty pass." /></a></p>
<p>Here is a comparison of the final beauty pass with the initial putting-everything-together render from Figure 40.
Note how the overall lighting is actually not too different, but there are many small adjustments and tweaks:</p>
<div class="embed-container">
<iframe src="/content/images/2021/Apr/magicshop/comparisons/beforeafterlighting_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 43: Before (left) and after (right) final lighting. For a full screen comparison, <a href="/content/images/2021/Apr/magicshop/comparisons/beforeafterlighting.html">click here.</a></div>
<p>To help shape the lighting a bit more, I added a basic atmospheric volume pass.
Unlike in previous RenderMan Art Challenges where I used fancy VDBs and whatnot to create complex atmospherics and volumes, for this scene I just used a simple homogeneous volume box.
My main goal with the atmospheric volume pass was to capture some subtle godray-like lighting effects coming from the back windows:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress.083.volumes.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress.083.volumes.jpg" alt="Figure 44: Final render, volumes pass." /></a></p>
<p>For the final composite, I used the same Photoshop and Lightroom workflow that I used for the previous two RenderMan Art Challenges.
For future personal art projects I’ll be moving to a DaVinci Resolve/Fusion compositing workflow, but this time around I reached for what I already knew since I was so short on time.
Just like last time, I used basically only exposure adjustments in Photoshop, flattened out, and brought the image into Lightroom for final color grading.
In Lightroom I further brightened things a bit, made the scene warmer, and added just a bit more glowy-ness to everything.
Figure 45 is a gif that visualizes the compositing steps I took for the final image.
Figure 46 shows what all of the lighting, comp, and color grading looks like applied to a 50% grey clay shaded version of the scene, and Figure 47 repeats what the final image looks like so that you don’t have to scroll all the way back to the top of this post.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/composite_breakdown.gif"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/composite_breakdown.gif" alt="Figure 45: Animated breakdown of compositing layers." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/greyshaded_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/greyshaded.jpg" alt="Figure 46: Final lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/magicshop_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/magicshop_full.jpg" alt="Figure 47: Final image. Click for 4K version." /></a></p>
<p><strong>Conclusion</strong></p>
<p>Despite having much less free time to work on this RenderMan Art Challenge, and despite not having really intended to even enter the contest initially, I think things turned out okay!
I certainly wasn’t expecting to actually win a placed position again!
I learned a ton about character shading, which I think is a good step towards filling a major hole in my areas of experience.
For all of the props and stuff, I was pretty happy to find that my Substance Painter workflow is now sufficiently practiced and refined that I was able to churn through everything relatively efficiently.
At the end of the day, stuff like art simply requires practice to get better at, and this project was a great excuse to practice!</p>
<p>Here is a progression video I put together from all of the test and in-progress renders that I made throughout this entire project:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/486289496" frameborder="0">Magic Shop Art Challenge Progression Reel</iframe></div>
<div class="figcaption">Figure 48: Progression reel made from test and in-progress renders leading up to my final image.</div>
<p>As usual with these art projects, I owe an enormous debt of gratitude to my wife, Harmony Li, both for giving invaluable feedback and suggestions (she has a much better eye than I do!), and also for putting up with me going off on another wild time-consuming art adventure.
Also, as always, Leif Pederson from Pixar’s RenderMan group provided lots of invaluable feedback, notes, and encouragement, as did everyone else in the RenderMan Art Challenge community.
Seeing everyone else’s entries is always super inspiring, and being able to work side by side with such amazing artists and such friendly people is a huge honor and very humbling.
If you would like to see more about my contest entry, check out the <a href="https://renderman.pixar.com/answers/challenge/19140/call-me-maybe.html?page=1&pageSize=10&sort=oldest">work-in-progress thread I kept on Pixar’s Art Challenge forum</a>, and I also have an <a href="https://www.artstation.com/artwork/ykRWVK">Artstation post</a> for this project.</p>
<p>Finally, here’s a bonus alternate angle render of my scene. I made this alternate angle render for fun after the project and out of curiosity to see how well things held up from a different angle, since I very much “worked to camera” for the duration of the entire project.
I was pleasantly surprised that everything held up well from a different angle!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/altangle_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/altangle.jpg" alt="Figure 49: Bonus image: alternate camera angle. Click for 4K version." /></a></p>
<p><strong>References</strong></p>
<p>Carlos Aliaga, Carlos Castillo, Diego Gutierrez, Miguel A. Otaduy, Jorge López-Moreno, and Adrian Jarabo. 2017. <a href="https://doi.org/10.1111/cgf.13222">An Appearance Model for Textile Fibers</a>. <em>Computer Graphics Forum</em>. 36, 4 (2017), 35-45.</p>
<p>Brent Burley and Dylan Lacewell. 2008. <a href="https://doi.org/10.1111/j.1467-8659.2008.01253.x">Ptex: Per-face Texture Mapping for Production Rendering</a>. <em>Computer Graphics Forum</em>. 27, 4 (2008), 1155-1164.</p>
<p>Brent Burley. 2015. <a href="https://doi.org/10.1145/2776880.2787670">Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering</a>. In <a href="https://blog.selfshadow.com/publications/s2015-shading-course"><em>ACM SIGGRAPH 2015 Course Notes: Physically Based Shading in Theory and Practice</em></a>.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2017. <a href="https://www.yiningkarlli.com/projects/ptcourse2017.html">Recent Advances in Disney’s Hyperion Renderer</a>. In <a href="http://dx.doi.org/10.1145/3084873.3084904"><em>ACM SIGGRAPH 2017 Course Notes: Path Tracing in Production Part 1</em></a>, 26-34.</p>
<p>Matt Jen-Yuan Chiang, Benedikt Bitterli, Chuck Tappan, and Brent Burley. 2016. <a href="https://doi.org/10.1111/cgf.12830">A Practical and Controllable Hair and Fur Model for Production Path Tracing</a>. <em>Computer Graphics Forum</em>. 35, 2 (2016), 275-283.</p>
<p>Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. <a href="https://dl.acm.org/doi/10.1145/2897839.2927433">Practical and Controllable Subsurface Scattering for Production Path Tracing</a>. In <em>ACM SIGGRAPH 2016 Talks</em>, 49:1-49:2.</p>
<p>Philip Child. 2012. <a href="https://drive.google.com/file/d/1bNSwpPusRmRmGfPwe11tjtloCP96WN1P/view?usp=sharing">Ill-Loom-inating Brave’s Handmade Fabric</a>. In <em>ACM SIGGRAPH 2012, Talks</em>.</p>
<p>Per H. Christensen and Brent Burley. 2015. <a href="https://graphics.pixar.com/library/ApproxBSSRDF">Approximate Reflectance Profiles for Efficient Subsurface Scattering</a>. <em>Pixar Technical Memo #15-04</em>.</p>
<p>Trent Crow, Michael Kilgore, and Junyi Ling. 2018. <a href="https://dl.acm.org/citation.cfm?id=3214787">Dressed for Saving the Day: Finer Details for Garment Shading on Incredibles 2</a>. In <em>ACM SIGGRAPH 2018 Talks</em>, 6:1-6:2.</p>
<p>Priyamvad Deshmukh, Feng Xie, and Eric Tabellion. 2017. <a href="https://dl.acm.org/citation.cfm?id=3085024">DreamWorks Fabric Shading Model: From Artist Friendly to Physically Plausible</a>. In <em>ACM SIGGRAPH 2017 Talks</em>. 38:1-38:2.</p>
<p>Eugene d’Eon. 2012. <a href="http://www.eugenedeon.com/project/a-better-dipole/">A Better Dipole</a>. <a href="http://www.eugenedeon.com/project/a-better-dipole/"><em>http://www.eugenedeon.com/project/a-better-dipole/</em></a></p>
<p>Eugene d’Eon, Guillaume Francois, Martin Hill, Joe Letteri, and Jean-Marie Aubry. 2011. <a href="https://doi.org/10.1111/j.1467-8659.2011.01976.x">An Energy-Conserving Hair Reflectance Model</a>. <em>Computer Graphics Forum</em>. 30, 4 (2011), 1181-1187.</p>
<p>Christophe Hery. 2003. <a href="https://graphics.pixar.com/library/RMan2003/">Implementing a Skin BSSRDF</a>. In <em>ACM SIGGRAPH 2003 Course Notes: RenderMan, Theory and Practice</em>. 73-88.</p>
<p>Christophe Hery. 2012. <a href="https://graphics.pixar.com/library/TexturingBetterDipole/">Texture Mapping for the Better Dipole Model</a>. <em>Pixar Technical Memo #12-11</em>.</p>
<p>Christophe Hery and Junyi Ling. 2017. <a href="http://graphics.pixar.com/library/PxrMaterialsCourse2017/index.html">Pixar’s Foundation for Materials: PxrSurface and PxrMarschnerHair</a>. In <a href="https://blog.selfshadow.com/publications/s2017-shading-course/"><em>ACM SIGGRAPH 2017 Course Notes: Physically Based Shading in Theory and Practice</em></a>.</p>
<p>Jonathan Hoffman, Matt Kuruc, Junyi Ling, Alex Marino, George Nguyen, and Sasha Ouellet. 2020. <a href="http://graphics.pixar.com/library/CurveCloth/">Hypertextural Garments on Pixar’s <em>Soul</em></a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 75:1-75:2.</p>
<p>Henrik Wann Jensen, Steve Marschner, Marc Levoy, and Pat Hanrahan. 2001. <a href="https://dl.acm.org/doi/10.1145/383259.383319">A Practical Model for Subsurface Light Transport</a>. In <em>Proceedings of SIGGRAPH 2001</em>. 511-518.</p>
<p>Ying Liu, Jared Wright, and Alexander Alvarado. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3407360">Making Beautiful Embroidery for “Frozen 2”</a>. In <em>ACM SIGGRAPH 2020 Talks</em>, 73:1-73:2.</p>
<p>Steve Marschner, Henrik Wann Jensen, Mike Cammarano, Steve Worley, and Pat Hanrahan. 2003. <a href="https://doi.org/10.1145/882262.882345">Light Scattering from Human Hair Fibers</a>. <em>ACM Transactions on Graphics</em>. 22, 3 (2003), 780-791.</p>
<p>Zahra Montazeri, Søren B. Gammelmark, Shuang Zhao, and Henrik Wann Jensen. 2020. <a href="https://doi.org/10.1145/3414685.3417777">A Practical Ply-Based Appearance Model of Woven Fabrics</a>. <em>ACM Transactions on Graphics</em>. 39, 6 (2020), 251:1-251:13.</p>
<p>Sean Palmer and Kendall Litaker. 2016. <a href="https://dl.acm.org/citation.cfm?id=2927466">Artist Friendly Level-of-Detail in a Fur-Filled World</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 32:1-32:2.</p>
<p>Leonid Pekelis, Christophe Hery, Ryusuke Villemin, and Junyi Ling. 2015. <a href="https://graphics.pixar.com/library/DataDrivenHairScattering/">A Data-Driven Light Scattering Model for Hair</a>. <em>Pixar Technical Memo #15-02</em>.</p>
<p>Kai Schröder, Reinhard Klein, and Arno Zinke. 2011. <a href="https://doi.org/10.1111/j.1467-8659.2011.01987.x">A Volumetric Approach to Predictive Rendering of Fabrics</a>. <em>Computer Graphics Forum</em>. 30, 4 (2011), 1277-1286.</p>
<p>Brian Smith, Roman Fedetov, Sang N. Le, Matthias Frei, Alex Latyshev, Luke Emrose, and Jean Pascal leBlanc. 2018. <a href="https://dl.acm.org/citation.cfm?id=3214781">Simulating Woven Fabrics with Weave</a>. In <em>ACM SIGGRAPH 2018 Talks</em>. 12:1-12:2.</p>
<p>Thomas V. Thompson, Ernest J. Petti, and Chuck Tappan. 2003. <a href="https://dl.acm.org/doi/10.1145/965400.965411">XGen: Arbitrary Primitive Generator</a>. In <em>ACM SIGGRAPH 2003 Sketches and Applications</em>.</p>
<p>Walt Disney Animation Studios. 2011. <a href="https://wdas.github.io/SeExpr/">SeExpr</a>.</p>
<p>Magnus Wrenninge, Ryusuke Villemin, and Christophe Hery. 2017. <a href="https://graphics.pixar.com/library/PathTracedSubsurface/">Path Traced Subsurface Scattering using Anisotropic Phase Functions and Non-Exponential Free Flights</a>. <em>Pixar Technical Memo #17-07</em>.</p>
<p>Shuang Zhao, Wenzel Jakob, Steve Marschner, and Kavita Bala. 2012. <a href="https://doi.org/10.1145/2185520.2185571">Structure-Aware Synthesis for Predictive Woven Fabric Appearance</a>. <em>ACM Transactions on Graphics</em>. 31, 4 (2012), 75:1-75:10.</p>
<p>Shuang Zhao, Fujun Luan, and Kavita Bala. 2016. <a href="https://doi.org/10.1145/2897824.2925932">Fitting Procedural Yarn Models for Realistic Cloth Rendering</a>. <em>ACM Transactions on Graphics</em>. 35, 4 (2016), 51:1-51:11.</p>
</div>
https://blog.yiningkarlli.com/2021/03/raya-and-the-last-dragon.html
Raya and the Last Dragon
2021-03-05T00:00:00+00:00
2021-03-05T00:00:00+00:00
Yining Karl Li
<p>After a break in 2020, <a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a> has two films lined up for release in 2021!
The first of these is <a href="https://www.disneyanimation.com/films/">Raya and the Last Dragon</a>, which is simultaneously out in theaters and available on <a href="http://www.disneyplus.com/">Disney+ Premiere Access</a> on the day this post is being released.
I’ve been working on Raya and the Last Dragon in some form or another since early 2018, and Raya and the Last Dragon is the first original film I’ve worked on at Disney Animation that I was able to witness from the very earliest idea all the way through to release; every other project I’ve worked on up until now was either based on a previous idea or began before I started at the studio.
Raya and the Last Dragon was an incredibly difficult film to make, in every possible aspect.
The story took time to really get right, the technology side of things saw many challenges and changes, and the main production of the film ran headfirst into the Covid-19 pandemic.
Just as production was getting into the swing of things last year, the Covid-19 pandemic forced the physical studio building to temporarily shut down, and the studio’s systems/infrastructure teams had to scramble and go to heroic lengths to get production back up and running again from around 400 different homes.
As a result, Raya and the Last Dragon is the first Disney Animation film made entirely from our homes instead of from the famous “hat building”.</p>
<p>In the end though, all of the trials and tribulations this production saw were more than worthwhile; Raya and the Last Dragon is the most beautiful film we’ve ever made, and the movie has a message and story about trust that is deeply relevant for the present time.
The Druun as a concept and villain in Raya and the Last Dragon actually long predate the Covid-19 pandemic; they’ve been a part of every version of the movie going back years, but the Druun’s role in the movie’s plot meant that the onset of the pandemic suddenly lent extra weight to this movie’s core message.
Also, as someone of Asian descent, I’m so so proud that Raya and the Last Dragon’s basis is found in diverse Southeast Asian cultures.
Early in the movie’s conceptualization, before the movie even had a title or a main character, the movie’s producers and directors and story team reached out to all of the people in the studio of Asian descent and engaged us in discussing how the Asian cultures we came from shaped our lives and our families.
These discussions continued for years throughout the production process, and throughlines from those discussions can be seen everywhere in the movie, from major thematic elements like the importance of food and sharing meals in the world of Kumandra, all the way down to tiny details like young Raya taking off her shoes when entering the Dragon Gem chamber.
The way I get to contribute to our films is always in the technical realm, but thanks to Fawn Veerasunthorn, Scott Sakamoto, Adele Lim, Osnat Shurer, Paul Briggs, and Dean Wellins, this is the first time where I feel like I maybe made some small, tiny, but important contribution creatively too!
Raya and the Last Dragon has spectacular fight scenes with real combat, and the fighting styles aren’t just made up- they’re directly drawn from Thailand, Malaysia, Cambodia, Laos, and Vietnam.
Young Raya’s fighting sticks are Filipino Arnis sticks, the food in the film is recognizably dishes like fish amok, tom yam, chicken satay and more, Raya’s main mode of transport is her pet Tuk Tuk, who has the same name as those motorbike carriages that can be found all over Southeast Asia; the list goes on and on.</p>
<p>From a rendering technology perspective, Raya and the Last Dragon in a lot of ways represents the culmination of a huge number of many-year-long initiatives that began on previous films.
Water is a huge part of Raya and the Last Dragon, and the water in the film looks so incredible because we’ve been able to build even further upon the water authoring pipeline <a href="https://dl.acm.org/citation.cfm?id=3085067">[Palmer et al. 2017]</a> that we first built on <a href="https://blog.yiningkarlli.com/2016/11/moana.html">Moana</a> and improved on <a href="https://blog.yiningkarlli.com/2019/11/froz2.html">Frozen 2</a>.
One small bit of rendering tech I worked on for this movie was further improving the robustness and stability of the water levelset meshing system that we first developed on Moana.
Other elements of the film, such as being able to render convincing darker skin and black hair, along with the colorful fur of the dragons, are the result of multi-year efforts to productionize path traced subsurface scattering <a href="https://doi.org/10.1145/2897839.2927433">[Chiang et al. 2016b]</a> (first deployed on <a href="https://blog.yiningkarlli.com/2018/11/wir2.html">Ralph Breaks the Internet</a>) and a highly artistically controllable principled hair shading model <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12830">[Chiang et al. 2016a]</a> (first deployed on <a href="https://blog.yiningkarlli.com/2016/02/zootopia.html">Zootopia</a>).
The huge geometric complexity challenges that we’ve had to face on all of our previous projects prepared us for rendering Raya and the Last Dragon’s setting, the vast world of Kumandra.
Even more niche features, such as our adaptive photon mapping system <a href="https://dl.acm.org/citation.cfm?id=3182159">[Burley et al. 2018]</a>, proved to be really useful on this movie, and even saw new improvements- Joe Schutte added support for more geometry types to the photon mapping system to allow for caustics to be cast on Sisu whenever Sisu was underwater.
Raya and the Last Dragon also contains a couple of more stylized sequences that look almost 2D, but even these sequences were rendered using Hyperion!
These more stylized sequences build upon the 3D-2D hybrid stylization experience that Disney Animation has gained over the years from projects such as <a href="https://www.disneyanimation.com/shorts/paperman/">Paperman</a>, <a href="https://www.disneyanimation.com/shorts/feast/">Feast</a>, and many of the <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Short Circuit shorts</a> <a href="https://dl.acm.org/doi/10.1145/3388767.3409267">[Newfield and Staub 2020]</a>.
I think all of the above is really what makes a production renderer a <em>production</em> renderer- years and years of accumulated research, development, and experience over a variety of challenging projects forging a powerful, reliable tool custom tailored to our artists’ work and needs.
Difficult problems are still difficult, but they’re no longer scary, because now, we’ve seen them before!</p>
<p>For this movie though, the single biggest rendering effort by far was on volume rendering.
After encountering many volume rendering challenges on Moana, our team undertook an effort to replace Hyperion’s previous volume rendering system <a href="https://doi.org/10.1145/3084873.3084907">[Fong et al. 2017]</a> with a brand new, from scratch implementation based on new research we had conducted <a href="https://doi.org/10.1145/3072959.3073665">[Kutz et al. 2017]</a>.
The new system first saw wide deployment on Ralph Breaks the Internet, but all things considered, the volume use cases on Ralph Breaks the Internet didn’t actually push us into the types of difficult cases we ran into on Moana, such as ocean foam and spray.
Frozen 2 was really the show where we got a second chance at tackling the ocean foam, spray, and dense white cloud cases that we had first encountered on Moana, and new challenges on Frozen 2 with thin volumes gave my teammate Wayne Huang the opportunity to make the new volume rendering system even better.
Raya and the Last Dragon is the movie where I feel like all of the past few years of development on our modern volume rendering system came together- this movie threw every single imaginable type of volume rendering problem at us, often in complex combinations with each other.
On top of that, Raya and the Last Dragon has volumes in basically every single shot; the highly atmospheric, naturalistic cinematography on this film demanded more volumes than we’ve ever had on any past movie.
Wayne really was our MVP in the volume rendering arena; Wayne worked with our lighters to introduce a swath of powerful new tools to give artists unprecedented control and artistic flexibility in our modern volume rendering system <a href="https://doi.org/10.1145/3450623.3464676">[Bryant et al. 2021]</a>, and Wayne also made huge improvements in the volume rendering system’s overall performance and efficiency <a href="https://doi.org/10.1145/3450623.3464644">[Huang et al. 2021]</a>.
We now have a single unified volume integrator that can robustly handle basically every volume you can think of: fog, thin atmospherics, fire, smoke, thick white clouds, sea foam, and even highly stylized effects such as the dragon magic <a href="https://doi.org/10.1145/3450623.3464652">[Navarro & Rice 2021]</a> and the chaotic Druun characters <a href="https://doi.org/10.1145/3450623.3464647">[Rice 2021]</a> in Raya and the Last Dragon.</p>
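<p>For readers who want a more concrete picture of what a null-collision-based volume integrator is built around, below is a tiny Python sketch of classic delta (Woodcock) tracking, the textbook scheme that spectral and decomposition tracking <a href="https://doi.org/10.1145/3072959.3073665">[Kutz et al. 2017]</a> generalize. This is purely illustrative and is not Hyperion’s actual implementation; the function names and the toy density at the end are mine.</p>
<pre><code>import math
import random

def sample_free_flight(sigma_t, sigma_maj, t_max):
    """Delta (Woodcock) tracking: sample a collision distance through a
    heterogeneous medium whose extinction sigma_t(t) is bounded above by the
    majorant sigma_maj. Returns the distance of a real collision, or None if
    the ray exits the medium at t_max without colliding."""
    t = 0.0
    while True:
        # step as if the medium were homogeneous with extinction sigma_maj
        t -= math.log(1.0 - random.random()) / sigma_maj
        if t >= t_max:
            return None
        # accept as a real collision with probability sigma_t / sigma_maj;
        # otherwise it is a fictitious "null" collision and we keep marching
        if random.random() * sigma_maj < sigma_t(t):
            return t

# toy example: fog whose density ramps up along the ray, with majorant 1.0
distance = sample_free_flight(lambda t: 0.2 + 0.8 * min(t / 10.0, 1.0), 1.0, 10.0)
</code></pre>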
<p>A small fun new thing I got to do for this movie was to add support for arbitrarily custom texture-driven camera aperture shapes.
Raya and the Last Dragon’s cinematography makes extensive use of shallow depth-of-field, and one idea the film’s art directors had early on was to stylize bokeh shapes to resemble the Dragon Gem.
Hyperion has long had extensive support for fancy physically-based lensing features such as uniformly bladed apertures and cateye bokeh, but the request for a stylized bokeh required much more art-directability than we previously had in this area.
The texture-driven camera aperture feature I added to Hyperion is not necessarily anything innovative (similar features can be found on many commercial renderers), but iterating with artists to define and refine the feature’s controls and behavior was a lot of fun.
There were also a bunch of fun nifty little details to solve, such as making sure that importance sampling ray directions based on an arbitrary textured aperture didn’t mess up stratified sampling and Sobol distributions; repurposing hierarchical sample warping <a href="https://dl.acm.org/doi/10.1145/1073204.1073328">[Clarberg et al. 2005]</a> wound up being super useful here.</p>
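<p>To make the sampling idea a bit more concrete, here is a small Python sketch of hierarchical sample warping applied to a square, power-of-two greyscale aperture texture. This is just the basic scheme from the Clarberg et al. paper written in the most naive way possible; it is not Hyperion’s implementation, and the function and variable names are all mine. A single well-stratified 2D sample is pushed down a sum pyramid one mip level at a time, so the warped samples inherit the stratification of the input samples:</p>
<pre><code>import numpy as np

def build_sum_pyramid(aperture):
    """Build a sum pyramid (coarsest level first) from a square,
    power-of-two-resolution greyscale aperture texture."""
    levels = [aperture.astype(np.float64)]
    while levels[-1].shape[0] > 1:
        f = levels[-1]
        levels.append(f[0::2, 0::2] + f[0::2, 1::2] + f[1::2, 0::2] + f[1::2, 1::2])
    return levels[::-1]

def warp_to_aperture(u1, u2, pyramid):
    """Warp a uniform sample (u1, u2) in [0,1)^2 towards the aperture's
    brightness distribution, making one 2x2 decision per mip level."""
    x = y = 0
    for level in pyramid[1:]:  # skip the 1x1 root
        x, y = 2 * x, 2 * y
        a, b = level[y, x], level[y, x + 1]
        c, d = level[y + 1, x], level[y + 1, x + 1]
        total = a + b + c + d
        p_top = (a + b) / total if total > 0.0 else 0.5
        if u2 < p_top:
            u2 /= p_top
            p_left = a / (a + b) if (a + b) > 0.0 else 0.5
        else:
            u2 = (u2 - p_top) / (1.0 - p_top)
            y += 1
            p_left = c / (c + d) if (c + d) > 0.0 else 0.5
        if u1 < p_left:
            u1 /= p_left
        else:
            u1 = (u1 - p_left) / (1.0 - p_left)
            x += 1
    res = pyramid[-1].shape[0]
    # reuse the leftover sample to jitter within the chosen texel
    return (x + u1) / res, (y + u2) / res

# toy example: a 4x4 "aperture" that only passes light through its upper-left quadrant
aperture = np.zeros((4, 4))
aperture[0:2, 0:2] = 1.0
print(warp_to_aperture(0.3, 0.7, build_sum_pyramid(aperture)))
</code></pre>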
<p>There are a ton more really cool technical advancements that were made for Raya and the Last Dragon, and there were also several really ambitious, inspiring, and potentially revolutionary projects that just barely missed being deployed in time for this movie.
One extremely important point I want to highlight is that, as cool as all of the tech that we develop at Disney Animation is, at the end of the day our tech and tools are only as good as the artists that use them every day to handcraft our films.
Hyperion only renders amazing films because the artists using Hyperion are some of the best in the world; I count myself as super lucky to be able to work with my teammates and with our artists every day.
At SIGGRAPH 2021, most of the talks about Raya and the Last Dragon are actually from our artists, not our engineers!
Our artists had to come up with new crowd simulation techniques for handling the huge crowds seen in the movie <a href="https://doi.org/10.1145/3450623.3464650">[Nghiem 2021</a>, <a href="https://doi.org/10.1145/3450623.3464648">Luceño Ros et al. 2021]</a>, new cloth simulation techniques for all of the beautiful, super complex outfits worn by all of the characters <a href="https://doi.org/10.1145/3450623.3464660">[Kaur et al. 2021</a>, <a href="https://doi.org/10.1145/3450623.3464659">Kaur & Coetzee 2021]</a>, and even new effects techniques to simulate cooking delicious Southeast Asia-inspired food <a href="https://doi.org/10.1145/3450623.3464651">[Wang et al. 2021]</a>.</p>
<p>Finally, here are a bunch of stills from the movie, 100% rendered using Hyperion.
Normally I post somewhere between 40 to 70 stills per film, but I had so many favorite images from Raya and the Last Dragon that for this post, there are considerably more.
You may notice what looks like noise in the stills below- it’s not noise!
The actual renders are super clean thanks to Wayne’s volumes work and David Adler’s continued work on our Disney-Research-tech-based deep learning denoising system <a href="https://dl.acm.org/citation.cfm?id=3328150">[Dahlberg et al. 2019</a>, <a href="https://doi.org/10.1145/3197517.3201388">Vogels et al. 2018]</a>, but the film’s cinematography style called for adding film grain back in after rendering.</p>
<p>I’ve pulled these from marketing materials, trailers, and Disney+; as usual, I’ll try to update this post with higher quality stills once the film is out on Bluray.
Of course, the stills here are just a few of my favorites, and represent just a tiny fraction of the incredible imagery in this film.
If you like what you see here, I’d strongly encourage seeing the film on Disney+ or on Blu-Ray; whichever way, I suggest watching on the biggest screen you have available to you!</p>
<p>To try to help avoid spoilers, the stills below are presented in no particular order; however, if you want to avoid spoilers entirely, then please go watch the movie first and then come back here to be able to appreciate each still on its own!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_007.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_007.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_001.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_001.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_043.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_043.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_109.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_109.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_024.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_024.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_061.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_061.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_068.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_068.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_107.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_107.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_016.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_016.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_038.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_038.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_113.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_113.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_029.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_029.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_053.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_053.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_076.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_076.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_027.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_027.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_078.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_078.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_095.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_095.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_101.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_101.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_074.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_074.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_066.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_066.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_015.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_015.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_018.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_018.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_063.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_063.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_084.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_084.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_093.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_093.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_119.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_119.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_087.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_087.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_110.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_110.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_099.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_099.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_077.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_077.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_081.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_081.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_060.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_060.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_032.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_032.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_004.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_004.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_013.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_013.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_011.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_011.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_012.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_012.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_047.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_047.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_050.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_050.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_055.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_055.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_056.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_056.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_064.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_064.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_071.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_071.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_089.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_089.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_091.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_091.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_116.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_116.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_124.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_124.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_121.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_121.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_082.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_082.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_083.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_083.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_096.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_096.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_017.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_017.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_040.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_040.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_041.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_041.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_048.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_048.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_057.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_057.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_069.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_069.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_086.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_086.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_092.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_092.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_125.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_125.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_105.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_105.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_034.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_034.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_045.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_045.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_006.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_006.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_023.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_023.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_031.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_031.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_039.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_039.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_021.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_021.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_037.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_037.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_042.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_042.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_005.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_005.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_020.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_020.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_002.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_002.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_052.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_052.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_062.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_062.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_103.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_103.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_070.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_070.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_075.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_075.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_033.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_033.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_072.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_072.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_079.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_079.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_085.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_085.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_051.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_051.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_035.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_035.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_014.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_014.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_104.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_104.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_114.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_114.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_115.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_115.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_022.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_022.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_028.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_028.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_046.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_046.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_054.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_054.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_100.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_100.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_067.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_067.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_112.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_112.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_123.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_123.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_073.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_073.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_065.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_065.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_122.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_122.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_080.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_080.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_003.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_003.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_025.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_025.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_036.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_036.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_049.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_049.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_008.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_008.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_059.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_059.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_030.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_030.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_117.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_117.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_118.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_118.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_120.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_120.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_088.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_088.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_102.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_102.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_090.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_090.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_106.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_106.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_044.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_044.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_009.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_009.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_026.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_026.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_058.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_058.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_098.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_098.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_010.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_010.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_019.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_019.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_097.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_097.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_108.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_108.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_111.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_111.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_094.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_094.jpg" alt="" /></a></p>
<p>Here is the credits frame for Disney Animation’s rendering and visualization teams! The rendering and visualization teams are separate teams, but seeing them grouped together in the credits is very appropriate- we all are dedicated to making the best pixels possible for our films!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_credits.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_credits.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p>Also, one more thing: in theaters (and also on Disney+ starting in the summer), Raya and the Last Dragon is accompanied by our first new theatrical short in 5 years, called Us Again.
Us Again is one of my favorite shorts Disney Animation has ever made; it’s a joyous, visually stunning celebration of life and dance and music.
I’ll probably dedicate a separate post to Us Again once it’s out on Disney+.</p>
<p><strong>References</strong></p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182159">The Design and Evolution of Disney’s Hyperion Renderer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 33:1-33:22.</p>
<p>Marc Bryant, Ryan DeYoung, Wei-Feng Wayne Huang, Joe Longson, and Noel Villegas. 2021. <a href="https://doi.org/10.1145/3450623.3464676">The Atmosphere of Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 51:1-51:2.</p>
<p>Matt Jen-Yuan Chiang, Benedikt Bitterli, Chuck Tappan, and Brent Burley. 2016. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12830">A Practical and Controllable Hair and Fur Model for Production Path Tracing</a>. <em>Computer Graphics Forum</em>. 35, 2 (2016), 275-283.</p>
<p>Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. <a href="https://doi.org/10.1145/2897839.2927433">Practical and Controllable Subsurface Scattering for Production Path Tracing</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 49:1-49:2.</p>
<p>Petrik Clarberg, Wojciech Jarosz, Tomas Akenine-Möller, and Henrik Wann Jensen. 2005. <a href="https://dl.acm.org/doi/10.1145/1073204.1073328">Wavelet Importance Sampling: Efficiently Evaluating Products of Complex Functions</a>. <em>ACM Transactions on Graphics</em>. 24, 3 (2005), 1166-1175.</p>
<p>Henrik Dahlberg, David Adler, and Jeremy Newlin. 2019. <a href="https://dl.acm.org/citation.cfm?id=3328150">Machine-Learning Denoising in Feature Film Production</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 21:1-21:2.</p>
<p>Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. <a href="https://doi.org/10.1145/3084873.3084907">Production Volume Rendering</a>. In <em>ACM SIGGRAPH 2017 Courses</em>.</p>
<p>Wei-Feng Wayne Huang, Peter Kutz, Yining Karl Li, and Matt Jen-Yuan Chiang. 2021. <a href="https://doi.org/10.1145/3450623.3464644">Unbiased Emission and Scattering Importance Sampling for Heterogeneous Volumes</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 3:1-3:2.</p>
<p>Avneet Kaur and Johann Francois Coetzee. 2021. <a href="https://doi.org/10.1145/3450623.3464659">Wrapped Clothing on Disney’s Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 28:1-28:2.</p>
<p>Avneet Kaur, Erik Eulen, and Johann Francois Coetzee. 2021. <a href="https://doi.org/10.1145/3450623.3464660">Creating Diversity and Variety in the People of Kumandra for Disney’s Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 58:1-58:2.</p>
<p>Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. <a href="https://doi.org/10.1145/3072959.3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 111:1-111:16.</p>
<p>Alberto Luceño Ros, Kristin Chow, Jack Geckler, Norman Moses Joseph, and Nicolas Nghiem. 2021. <a href="https://doi.org/10.1145/3450623.3464648">Populating the World of Kumandra: Animation at Scale for Disney’s Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 39:1-39:2.</p>
<p>Mike Navarro and Jacob Rice. 2021. <a href="https://doi.org/10.1145/3450623.3464652">Stylizing Volumes with Neural Networks</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 54:1-54:2.</p>
<p>Jennifer Newfield and Josh Staub. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3409267">How Short Circuit Experiments: Experimental Filmmaking at Walt Disney Animation Studios</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 72:1-72:2.</p>
<p>Nicolas Nghiem. 2021. <a href="https://doi.org/10.1145/3450623.3464650">Mathematical Tricks for Scalable and Appealing Crowds in Walt Disney Animation Studios’ Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 38:1-38:2.</p>
<p>Sean Palmer, Jonathan Garcia, Sara Drakeley, Patrick Kelly, and Ralf Habel. 2017. <a href="https://dl.acm.org/citation.cfm?id=3085067">The Ocean and Water Pipeline of Disney’s Moana</a>. In <em>ACM SIGGRAPH 2017 Talks</em>. 29:1-29:2.</p>
<p>Jacob Rice. 2021. <a href="https://doi.org/10.1145/3450623.3464647">Weaving the Druun’s Webbing</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 32:1-32:2.</p>
<p>Thijs Vogels, Fabrice Rousselle, Brian McWilliams, Gerhard Röthlin, Alex Harvill, David Adler, Mark Meyer, and Jan Novák. 2018. <a href="https://doi.org/10.1145/3197517.3201388">Denoising with Kernel Prediction and Asymmetric Loss Functions</a>. <em>ACM Transactions on Graphics</em>. 37, 4 (2018), 124:1-124:15.</p>
<p>Cong Wang, Dale Mayeda, Jacob Rice, Thom Whicks, and Benjamin Huang. 2021. <a href="https://doi.org/10.1145/3450623.3464651">Cooking Southeast Asia-Inspired Soup in Animated Film</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 35:1-35:2.</p>
https://blog.yiningkarlli.com/2020/07/shipshape-renderman-challenge.html
Shipshape RenderMan Art Challenge
2020-07-31T00:00:00+00:00
2020-07-31T00:00:00+00:00
Yining Karl Li
<div>
<p>Last year, I <a href="https://blog.yiningkarlli.com/2019/11/woodville-renderman-challenge.html">participated in one of Pixar’s RenderMan Art Challenges</a> as a way to learn more about modern RenderMan <a href="https://dl.acm.org/citation.cfm?id=3182162">[Christensen et al. 2018]</a> and as a way to get some exposure to tools outside of my normal day-to-day toolset (Disney’s Hyperion Renderer professionally, Takua Renderer as a hobby and learning exercise).
I had a lot of fun, and wound up doing better in the “Woodville” art challenge contest than I expected to!
Recently, I entered another one of <a href="https://renderman.pixar.com/news/renderman-shipshape-art-challenge">Pixar’s RenderMan Art Challenges, “Shipshape”</a>.
This time around I entered just for fun; since I had so much fun last time, I figured why not give it another shot!
That being said though, I want to repeat the main point I made in my post about the previous “Woodville” art challenge: I believe that for rendering engineers, there is enormous value in learning to use tools and renderers that aren’t the ones we work on ourselves.
Our field is filled with brilliant people on every major rendering team, and I find both a lot of useful information/ideas and a lot of joy in seeing the work that friends and peers across the field have put into commercial renderers such as RenderMan, Arnold, Vray, Corona, and others.</p>
<p>As usual for the RenderMan Art Challenges, Pixar <a href="https://renderman.pixar.com/shipshape-pup-asset">supplied some base models</a> without any UVs, texturing, shading, lighting, or anything else, and challenge participants had to start with the base models and come up with a single compelling image for a final entry.
I had a lot of fun spending evenings and weekends throughout the duration of the contest to create my final image, which is below.
I got to explore and learn a lot of new things that I haven’t tried before, which this post will go through.
To my enormous surprise, this time around my entry <a href="https://renderman.pixar.com/news/renderman-shipshape-art-challenge-final-results">won first place in the contest</a>!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/shipshape_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/shipshape_full.jpg" alt="Figure 1: My entry to Pixar's RenderMan Shipshape Art Challenge, titled "Oh Good, The Bus is Here". Click for 4K version. Base ship, robot, and sextant models are from Pixar; all shading, lighting, additional modeling, and environments are mine. Ship concept by Ian McQue. Robot concept by Ruslan Safarov. Models by Cheyenne Chapel, Aliya Chen, Damian Kwiatkowski, Alyssa Minko, Anthony Muscarella, and Miguel Zozaya © Disney / Pixar - RenderMan "Shipshape" Art Challenge." /></a></p>
<p><strong>Initial Explorations</strong></p>
<p>For this competition, Pixar provided five models: a futuristic scifi ship based on an Ian McQue concept, a robot based on a Ruslan Safarov concept, an old wooden boat, a butterfly, and a sextant.
The fact that one of the models was based on an Ian McQue concept was enough to draw me in; I’ve been a big fan of Ian McQue’s work for many years now!
I like to start these challenges by just rendering the provided assets as-is from a number of different angles, to try to get a sense of what I like about the assets and how I will want to showcase them in my final piece.
I settled pretty quickly on wanting to focus on the scifi ship and the robot, and leave the other three models aside.
I did find an opportunity to bring in the sextant in my final piece as well, but wound up dropping the old wooden boat and the butterfly altogether.
Here are some simple renders showing what was provided out of the box for the scifi ship and the robot:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/scifiship_base.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/scifiship_base.jpg" alt="Figure 2: Scifi ship base model provided by Pixar, rendered against a white cyclorama background using a basic skydome." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/robot_base.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/robot_base.jpg" alt="Figure 3: Robot base model provided by Pixar, rendered against a white cyclorama background using a basic skydome." /></a></p>
<p>I initially had a lot of trouble settling on a concept and idea for this project; I actually started blocking out an entirely different idea before pivoting to the idea that eventually became my final image.
My initial concept included the old wooden boat in addition to the scifi ship and the robot; this initial concept was called “River Explorer”.
My initial instinct was to try to show the scifi ship from a top-down view, in order to get a better view of the deck-boards and the big VG engine and the crane arm.
I liked the idea of putting the camera at roughly forest canopy height, since canopy height is a bit of an unusual perspective for most photographs: too high off the ground for people to shoot from, but too low for helicopters or drones to be practical.
My initial idea was about a robot-piloted flying patrol boat exploring an old forgotten river in a forest; the ship would be approaching the old sunken boat in the river water.
With this first concept, I got as far as initial compositional blocking and initial time-of-day lighting tests:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_012.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/progress_012.jpg" alt="Figure 4: Initial "River Explorer" concept, daylight lighting test." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_013.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/progress_013.jpg" alt="Figure 5: Initial "River Explorer" concept, dusk lighting test." /></a></p>
<p>If you’ve followed my blog for a while now, those pine trees might look familiar.
They’re actually the same trees from <a href="https://blog.yiningkarlli.com/2018/10/bidirectional-mipmap.html">the forest scene I used a while back</a>, ported from Takua’s shading system to RenderMan’s PxrSurface shader.</p>
<p>I wasn’t ever super happy with the “River Explorer” concept; I think the overall layout was okay, but it lacked a sense of dynamism and overall just felt very static to me, and the robot on the flying scifi ship felt kind of lost in the overall composition.
Several other contestants wound up also going for similar top-down-ish views, which made me worry about getting lost in a crowd of similar-looking images.
After a week of trying to get the “River Explorer” concept to work better, I started to play with some completely different ideas; I figured that this early in the process, a better idea was worth more than a week’s worth of sunk time.</p>
<p><strong>Layout and Framing</strong></p>
<p>I had started UV unwrapping the ship already, and whilst tumbling around the ship unwrapping all of the components one-by-one, I got to see a lot more of the ship and a lot more interesting angles, and I suddenly came up with a completely different idea for my entry.
The idea that popped into my head was to have a bunch of the little robots waiting to board one of the flying ships at a quay or something of the sort.
I wanted to convey a sense of scale between the robots and the flying scifi ship, so I tried putting the camera far away and zooming in using a really long lens.
Since long lenses have the effect of flattening perspective a bit, using a long lens helped make the ships feel huge compared to the robots.
At this point I was just doing very rough, quick, AO render “sketches”.
This is the AO sketch where my eventual final idea started:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_015.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_015.jpg" alt="Figure 6: Rough AO render "sketch" that eventually evolved into my final idea." /></a></p>
<p>I’ve always loved the idea of the mundane fantastical; the flying scifi ship model is fairly fantastical, which led me to want to do something more everyday with it.
I thought it would be fun to texture the scifi ship model as if it was just part of a regular metro system that the robots use to get around their world.
My wife, Harmony, suggested a fun idea: set the entire scene in drizzly weather and give two of the robots umbrellas, but give the third robot a briefcase instead and have the robot use the briefcase as a makeshift umbrella, as if it had forgotten its umbrella at home.
The umbrella-less robot’s reaction to seeing the ship arriving provided the title for my entry- “Oh Good, The Bus Is Here”.
Harmony also pointed out that the back of the ship has a lot more interesting geometric detail compared to the front of the ship, and suggested placing the focus of the composition more on the robots than on the ships.
To incorporate all of these ideas, I played more with the layout and framing until I arrived at the following image, which is broadly the final layout I used:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_019.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_019.jpg" alt="Figure 7: Rough AO render "sketch" of my final layout." /></a></p>
<p>I chose to put an additional ship in the background flying away from the dock for two main reasons.
First, I wanted to be able to showcase more of the ship, since the front ship is mostly obscured by the foreground dock.
Second, the background ship helps fill out and balance the right side of the frame more, which would otherwise have been kind of empty.</p>
<p>In both this project and the previous Art Challenge, my workflow for assembling the final scene relies heavily on Maya’s referencing capabilities.
Each separate asset is kept in its own .ma file, and all of the .ma files are referenced into the main scene file.
The only things the main scene file contains are references to assets, along with scene-level lighting, overrides, and global-scale effects such as volumes and, in the case of this challenge, the rain streaks.
So, even though the flying scifi ship appears in my scene twice, it is actually just the same .ma file referenced into the main scene twice instead of two separate ships.</p>
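<p>For reference, the same referencing setup can be written as a few lines of maya.cmds; this is just a hypothetical sketch with made-up file paths and namespaces, not my actual scene assembly setup:</p>
<pre><code>import maya.cmds as cmds

# reference the ship asset twice under different namespaces; both references point
# at the same source .ma file but can be placed and overridden independently
for namespace in ("scifiShipFore", "scifiShipBack"):
    cmds.file("assets/scifiShip/scifiShip.ma", reference=True, namespace=namespace)
</code></pre>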
<p>The idea of a rainy scene largely drove the later lighting direction of my entry; from this point I basically knew that the final scene was going to have to be overcast and drizzly, with a heavy reliance on volumes to add depth separation into the scene and to bring out practical lights on the ships.
I had a lot of fun modeling out the dock and gangway, and may have gotten slightly carried away.
I modeled every single bolt and rivet that you would expect to be there in real life, and I also added lampposts to use later as practical light sources for illuminating the dock and the robots.
Once I had finished modeling the dock and had made a few more layout tweaks, I arrived at a point where I was happy to start with shading and initial light blocking.
Zoom in if you want to see all of the rivets and bolts and stuff on the dock:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_032.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_032.jpg" alt="Figure 8: AO render of my layout going into shading and lighting. Check out all of the crazy detail on the dock that I modeled!" /></a></p>
<p><strong>UV Unwrapping</strong></p>
<p>UV unwrapping the ship took a ton of time.
For the last challenge, I relied on a combination of manual UV unwrapping by hand in Maya and using <a href="https://www.sidefx.com/tutorials/houdini-game-dev-tools-auto-uvs/">Houdini’s Auto UV SOP</a>, but I found that the Auto UV SOP didn’t work as well on this challenge due to the ship and robot having a lot of strange geometry with really complex topology.
On the treehouse in the last challenge, everything was more or less some version of a cylinder or a rectangular prism, with some morphs and warps and extra bits and bobs applied.
Almost every piece of the ship aside from the floorboards is a complex shape that isn’t easy to find good seams for, so the Auto UV SOP wound up making a lot of choices for UV cuts that I didn’t like.
As a result, I basically manually UV unwrapped this entire challenge in Maya.</p>
<p>A lot of the complex undercarriage type stuff around the back thrusters on the ship was really insane to unwrap.
The muffler manifold and mechanical parts of the crane arm were difficult too.
Fortunately though, the models came with subdivision creases, and a lot of the subd crease tags wound up being useful hints for where to place UV edge cuts.
I also found that the new and improved UV tools in Maya 2020 performed way better than the UV tools in Maya 2019.
For some meshes, I manually placed UV cuts and then used the unfold tool in Maya 2020, which I found generally worked a lot better than Maya 2019’s version of the same tool.
For other meshes, Maya 2020’s auto unwrap actually often provided a useful starting place as long as I rotated the piece I was unwrapping into a more-or-less axis-aligned orientation and froze its transform.
After using the auto-unwrap tool, I would then transfer the UVs back onto the piece in its original orientation using Maya’s Mesh Transfer Attributes tool.
The auto unwrap tended to cut meshes into too many UV islands, so I would then re-stitch islands together and place new cuts where appropriate.</p>
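<p>Transfer steps like this can also be scripted; as a rough maya.cmds sketch of the same idea (the mesh names are made up, and the flag values reflect my understanding of the Transfer Attributes options rather than anything authoritative), copying UVs from an unwrapped mesh onto another mesh with identical topology looks something like:</p>
<pre><code>import maya.cmds as cmds

def copy_uvs_by_topology(source, targets):
    """Copy all UV sets from source onto meshes that share its topology."""
    for target in targets:
        cmds.transferAttributes(source, target,
                                transferUVs=2,   # all UV sets
                                sampleSpace=5)   # sample by topology, not position
        # bake the result so the transfer node doesn't linger in construction history
        cmds.delete(target, constructionHistory=True)

copy_uvs_by_topology('hullPlate_unwrapped', ['hullPlate_copy_01', 'hullPlate_copy_02'])
</code></pre>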
<p>When UV unwrapping, a good test of how well the resultant UVs turned out is to assign some sort of checkerboard grid texture to the model and look for distortion in the checkerboard pattern.
Overall I think I did an okay job here; not terrible, but could be better.
I think I managed to hide the vast majority of seams pretty well, and the total distortion isn’t too bad (if you look closely, you’ll be able to pick out some less than perfect areas, but it was mostly okay).
I wound up with a high degree of variability in the grid size between different areas, but I wasn’t too worried about that since my plan was to adjust texture resolutions to match.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/ship_uvs.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/ship_uvs.jpg" alt="Figure 9: Checkerboard test for my UV unwrapping of the scifi ship." /></a></p>
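<p>Any grid texture works for this test; if you don’t have one handy, a procedural checker is nearly a one-liner. A minimal sketch of the idea (the parameter names are mine):</p>
<pre><code>import math

def uv_checker(u, v, squares=16):
    """Alternating 0/1 checker over UV space. On the model, stretched or sheared
    squares reveal UV distortion, and breaks in the pattern reveal seams."""
    return (int(math.floor(u * squares)) + int(math.floor(v * squares))) % 2
</code></pre>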
<p>After UV unwrapping the ship, UV unwrapping the robot proved to be a lot easier in comparison.
Many parts of the robot turn out to be the same mesh just duplicated and then squashed, stretched, scaled, or rotated, which means that they share the same underlying topology.
For all parts that share the same topology, I was able to just UV unwrap one of them, and then copy the UVs to all of the others.
One great example is the robot’s fingers; most components across all fingers shared the same topology.
Here’s the checkerboard test applied to my final UVs for the robot:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/robot_uvs.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/robot_uvs.jpg" alt="Figure 10: Checkerboard test for my UV unwrapping of the robot." /></a></p>
<p><strong>Texturing the Ship</strong></p>
<p>After trying out Substance Painter for the previous RenderMan Art Challenge and getting fairly good results, I went with Substance Painter again on this project.
The overall texturing workflow I used on this project was actually a lot simpler compared with the workflow I used for the previous Art Challenge.
Last time I tried to leave a lot of final decisions about saturation and hue and whatnot as late as possible, which meant moving those decisions into the shader so that they could be changed at render-time.
This time around, I decided to make those decisions upfront in Substance Painter; doing so makes the workflow much simpler, since it means I can just paint colors directly like a normal person would, as opposed to painting greyscale or desaturated maps that are expected to be modulated in the shader later.
Also, because of the nature of the objects in this project, I actually used very little displacement mapping; most detail was brought in through normal mapping, which makes more sense for hard surface metallic objects.
Not having to worry about any kind of displacement mapping simplified the Substance Painter workflow a bit more too, since that was one fewer texture map type to manage.</p>
<p>On the last challenge I relied on a lot of Quixel Megascans surfaces as starting points for texturing, but this time around I (unintentionally) found myself relying more on Substance smart materials as starting points.
One thing I like about Substance Painter is how it comes with a number of good premade smart materials, and there are even more good smart materials on Substance Source.
Importantly though, I believe that smart materials should only serve as a starting point; smart materials can look decent out-of-the-box, but to really make texturing shine, a lot more work is required on top of that out-of-the-box result in order to create story, character, and a unique look.
I don’t like when I see renders online where a smart material was applied and left in its out-of-the-box state; something gets lost when I can tell which default smart material was used at a glance!
For every place that I used a smart material in this project, I used a smart material (or several smart materials layered and kitbashed together) as a starting point, but then heavily customized on top with custom paint layers, custom masking, decals, additional layers, and often even heavy custom modifications to the smart material itself.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/substance_screenshot.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/substance_screenshot.jpg" alt="Figure 11: Texturing the main piece of the ship's hull in Substance Painter." /></a></p>
<p>I was originally planning on using a UDIM workflow for bringing the ship into Substance Painter, but I wound up with so many UDIM tiles that things quickly became unmanageable and Substance Painter ground to a halt with a gigantic file containing 80 (!!!) 4K UDIM tiles.
To work around this, I broke up the ship into a number of smaller groups of meshes and brought each group into Substance Painter separately.
Within each group I was able to use a UDIM workflow with usually between 5 to 10 tiles.</p>
<p>I had a lot of fun creating custom decals to apply to various parts of the ships and to some of the robots; even though a lot of the details and decals aren’t very visible in the final image, I still put a good amount of time into making them simply to keep things interesting for myself.
All of the decals were made in Photoshop and Illustrator and then brought in to Substance Painter along with opacity masks and applied to surfaces using Substance Painter’s projection mode, either in world space or in UV space depending on situation.
In Substance Painter, I created a new layer with a custom paint material and painted the base color for the paint material by projecting the decal, and then masked the decal layer with the opacity mask I made using the same projection as the base color.
The “Seneca” logo seen throughout my scene has <a href="https://blog.yiningkarlli.com/2016/07/minecraft-in-renderman-ris.html">shown up on my blog before</a>!
A few years ago on a Minecraft server that I played a lot on, a bunch of other players and I had a city named Seneca; ever since then, I’ve tried to sneak in little references to Seneca in projects here and there as a small easter egg.</p>
<p>Many of the buses around where I live have an orange and silver color scheme, and while I was searching the internet for reference material, I also found pictures of the Glasgow Subway’s trains, which have an orange and black and white color scheme.
Inspired by the above, I picked an orange and black color scheme for the ship’s Seneca Metro livery.
I like orange as a color, and I figured that orange would bring a nice pop of color to what was going to be an overall relatively dark image.
I made the upper part of the hull orange but kept the lower part of the hull black since the black section was going to be the backdrop that the robots would be in front of in the final image; the idea was that keeping that part of the hull darker would allow the robots to pop a bit more visually.</p>
<p>One really useful trick I used for masking different materials was to just follow edgeloops that were already part of the model.
Since everything in this scene is very mechanical anyway, following straight edges in the UVs helps give surfaces a manufactured, mechanical look.
For example, Figure 12 shows how I used Substance Painter’s Polygon Fill tool to mask out the black paint from the back metal section of the ship’s thrusters.
In some other cases, I added new edgeloops to the existing models just so I could follow the edgeloops while masking different layers.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/substance_uvmask.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/substance_uvmask.jpg" alt="Figure 12: Masking in the metal section of the ship's thrusters by following existing edgeloops using Substance Painter's Polygon Fill tool." /></a></p>
<p><strong>Shading the Ship</strong></p>
<p>For the previous Art Challenge, I used a combination of PxrDisney and PxrSurface shaders; this time around, in order to get a better understanding of how PxrSurface works, I opted to go all-in on using PxrSurface for everything in the scene.
Also, for the rain streaks effect (discussed later in this post), I needed some features that are available in the extended Disney Bsdf model <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015]</a> and in PxrSurface <a href="http://graphics.pixar.com/library/PxrMaterialsCourse2017/index.html">[Hery and Ling 2017]</a>, but RenderMan 23 only implements the base Disney Brdf <a href="https://doi.org/10.1145/2343483.2343493">[Burley 2012]</a> without the extended Bsdf features; this basically meant I had to use PxrSurface.</p>
<p>One of the biggest differences I had to adjust to was how metallic color is controlled in PxrSurface.
The Disney Bsdf drives the diffuse color and metallic color using the same base color parameter and shifts energy between the diffuse/spec and metallic lobes using a “metallic” parameter, but PxrSurface separates the diffuse and metallic colors entirely.
PxrSurface uses a “Specular Face Color” parameter to directly drive the metallic lobe and has a separate “Specular Edge Color” control; this parameterization reminds me a lot of Framestore’s artist-friendly metallic fresnel parameterization <a href="http://jcgt.org/published/0003/04/03/">[Gulbrandsen 2014]</a>, but I don’t know if this is actually what PxrSurface is doing under the hood.
PxrSurface also has two different modes for its specular controls: an “artistic” mode and a “physical” mode; I only used the artistic mode.
To be honest, while PxrSurface’s extensive controls are extremely powerful and offer an enormous degree of artistic control, I found trying to understand what every control did and how they interacted with each other to be kind of overwhelming.
I wound up paring the set of controls I used back to a small subset that I could mentally map to what the Disney Bsdf or VRayMtl or Autodesk Standard Surface <a href="https://autodesk.github.io/standard-surface/">[Georgiev et al. 2019]</a> models do.</p>
<p>Fortunately, converting from the Disney Bsdf’s baseColor/metallic parameterization to PxrSurface’s diffuse/specFaceColor is very easy:</p>
<div>\[ diffuse = baseColor * (1 - metallic) \\ specFaceColor = baseColor * metallic \]</div>
<p>The only gotcha to look out for is that everything needs to be in linear space first.
Alternatively, Substance Painter already has an output template for PxrSurface as well.
Once I had the maps in the right parameterization, for the most part all I had to do was plug the right maps into the right parameters in PxrSurface and then make minor manual adjustments to dial in the look.
In addition to two different specular parameterization modes, PxrSurface also supports choosing from a few different microfacet models for the specular lobes; by default PxrSurface is set to use the Beckmann model <a href="https://us.artechhouse.com/The-Scattering-of-Electromagnetic-Waves-from-Rough-Surfaces-P257.aspx">[Beckmann and Spizzichino 1963]</a>, but I selected the GGX model <a href="http://dx.doi.org/10.2312/EGWR/EGSR07/195-206">[Walter et al. 2007]</a> for everything in this scene since GGX is what I’m more used to.</p>
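<p>To make that conversion concrete, here’s a rough standalone sketch in Python of remapping a single baseColor/metallic texel, including the linearization step mentioned above; this is just an illustrative snippet, not part of my actual Substance Painter to RenderMan workflow.</p>
<pre><code>
# Illustrative sketch only: convert a Disney-style baseColor/metallic texel
# into PxrSurface's diffuse/specular face color parameterization.

def srgb_to_linear(c):
    # Standard sRGB decoding; apply per channel before doing any of the math.
    if c > 0.04045:
        return ((c + 0.055) / 1.055) ** 2.4
    return c / 12.92

def disney_to_pxrsurface(base_color_srgb, metallic):
    # base_color_srgb: (r, g, b) in 0-1 sRGB; metallic: 0-1 scalar.
    base_linear = tuple(srgb_to_linear(c) for c in base_color_srgb)
    diffuse = tuple(c * (1.0 - metallic) for c in base_linear)
    spec_face_color = tuple(c * metallic for c in base_linear)
    return diffuse, spec_face_color

# Example: a half-metallic orange paint swatch.
diffuse, spec_face = disney_to_pxrsurface((0.9, 0.45, 0.1), 0.5)
</code></pre>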
<p>For the actual look of the ship, I didn’t want to go with the dilapidated look that a lot of the other contestants went with.
Instead, I wanted the ship to look like it was a well maintained working vehicle, but with all of the grime and scratches that build up over daily use.
So, there are scratches and dust and dirt streaks on the boat, but nothing is actually rusting.
I also modeled some glass for the windows at the top of the tower superstructure, and added some additional lamps to the top of the ship’s masts and on the tower superstructure for use in lighting later.
After getting everything dialed in, here is the “dry” look of the ship:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/ship_shading_progress_angle3_31.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/ship_shading_progress_angle3_31.jpg" alt="Figure 13: Fully shaded "dry" look for the ship." /></a></p>
<p>Here’s a close-up render of the back engine section of the ship, which has all kinds of interesting bits and bobs on it.
The engine exhaust kind of looks like it could be a volume, but it’s not.
I created the engine exhaust by making a bunch of cards, arranging them into a truncated cone, and texturing them with a blue gradient in the diffuse slot and a greyscale gradient in PxrSurface’s “presence” slot.
The glow effect is done using the glow parameter in PxrSurface.
The nice thing about using this more cheat-y approach instead of a real volume is that it’s way faster to render!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/ship_shading_progress_angle2_23.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/ship_shading_progress_angle2_23.jpg" alt="Figure 14: Fully shaded "dry" look for the back engine area of the ship." /></a></p>
<p>Most of the ship’s metal components are covered over using a black, semi-matte paint material, but in areas that I thought would be subjected to high temperatures, such as exhaust vents or the inside of the thrusters or the many floodlights on the ship, I chose to use a beaten copper material instead.
Basically wherever I wound up placing a practical light, the housing around the practical light is made of beaten copper.
Well, I guess it’s actually some kind of high-temperature copper alloy or copper-colored composite material, since real copper’s melting point is lower than real steel’s melting point.
The copper color had an added nice effect of making practical lights look more yellow-orange, which I think helps sell the look of engine thrusters and hot exhaust vents more.</p>
<p>Each exhaust vent and engine thruster actually contains two practical lights: one extremely bright light near the back of the vent or thruster pointing into the vent or thruster, and one dimmer but more saturated light pointing outwards.
This setup produces a nice effect where areas deeper into the vent or thruster look brighter and yellower, while areas closer to the outer edge of the vent or thruster look a bit dimmer and more orange.
The light pointing outwards also casts light outside of the vent or thruster, providing some neat illumination on nearby surfaces or volumes.
Later in this post, I’ll write more about how I made use of this in the final image.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/ship_shading_progress_angle4_07.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/ship_shading_progress_angle4_07.jpg" alt="Figure 15: Wide view of the back of the ship, showing the practical lights in the ship's various engine thrusters and exhaust vents." /></a></p>
<p>Here’s a turntable video of the ship, showcasing all of the texturing and shading that I did.
I had a lot of fun taking care of all of the tiny details that are part of the ship, even though many of them aren’t actually visible in my final image.
The dripping wet rain effect is discussed later in this post.</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/433151006?loop=1" frameborder="0">Shipshape Art Challenge Ship Turntable</iframe></div>
<div class="figcaption">Figure 16: Turntable of the ship showing both dry and wet variants.</div>
<p><strong>Shading and Texturing the Robots</strong></p>
<p>For the robots, I used the same Substance Painter based texturing workflow and the same PxrSurface based shading workflow that I used for the ship.
However, since the robot has far fewer components than the ship, I was able to bring all of the robot’s UDIM tiles into Substance Painter at once.
The main challenge with the robots wasn’t the sheer quantity of parts that had to be textured, but instead was in the variety of robot color schemes that had to be made.
In order to populate the scene and give my final image a sense of life, I wanted to have a lot of robots on the ships, and I wanted all of the robots to have different paint and color schemes.</p>
<p>I knew from an early point that I wanted the robot carrying the suitcase to be yellow, and I knew I wanted a robot in some kind of conductor’s uniform, but aside from that, I didn’t much pre-planned for the robot paint schemes.
As a result, coming up with different robot paint schemes was a lot of fun and involved a lot of just goofing around and improvisation in Substance Painted until I found ideas that I liked.
To help unify how all of the robots looked and to help with speeding up the texturing process, I came up with a base metallic look for the robot’s legs and arms and various functional mechanical parts.
I alternated between steel and copper parts to help bring some visual variety to all of the mechanical parts.
The metallic parts are the same across all of the robots; the parts that vary between robots are the body shell and various outer casing parts on the arms:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/yellow_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/yellow_bot.jpg" alt="Figure 17: Robot with steel and copper mechanical parts and yellow outer shell." /></a></p>
<p>I wanted very different looks for the other two robots that are on the dock with the yellow robot.
I gave one of them a more futuristic looking white glossy shell with a subtle hexagon imprint pattern and red accents.
The hexagon imprint pattern is created using a hexagon pattern in the normal map.
The red stripes use the same edgeloop-following technique that I used for masking some layers on the ship.
I made the other robot a matte green color, and I thought it would be fun to make him into a sports fan.
He’s wearing the logo and colors of the local in-world sports team, the Seneca Senators!
Since the robots don’t wear clothes per se, I guess maybe the sports team logo and numbers are some kind of temporary sticker?
Or maybe this robot is such a big fan that he had the logo permanently painted on… I don’t know!
Since I knew these two robots would be seen from the back in the final image, I made sure to put all of the interesting stuff on their sides and back.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/white_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/white_bot.jpg" alt="Figure 18: Futuristic robot with glossy white outer shell and red accents." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/green_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/green_bot.jpg" alt="Figure 19: Sports fan robot wearing the colors of the in-world team, the Seneca Senators." /></a></p>
<p>For the conductor robot, I chose a blue and gold color scheme based on real world conductor uniforms I’ve seen before.
I made the conductor robot overall a bit more cleaned up compared to the other robots, since I figured the conductor robot should look a bit more crisp and professional.
I also gave the conductor robot a gold mustache, for a bit of fun!
To complete the look, I modeled a simple conductor’s hat for the conductor robot to wear.
I also made a captain robot, which has a white/black/gold color scheme derived from the conductor robot.
The white/black/gold color scheme is based on old-school ship’s captain uniforms.
The captain robot required a bit of a different hat from the conductor hat; I made the captain hat a little bigger and a little bit more elaborate, complete with gold stitching on the front around the Seneca Metro emblem.
In the final scene you don’t really see the captain robots, since they wound up inside of the wheelhouse at the top of the ship’s tower superstructure, but hey, at least the captain robots were fun to make, and at least I know that they’re there!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/conductor_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/conductor_bot.jpg" alt="Figure 20: Conductor robot with a blue and gold color scheme and a hat!" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/captain_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/captain_bot.jpg" alt="Figure 21: Captain robot with a white and black and gold color scheme and an even fancier hat." /></a></p>
<p>As a bit of a joke, I tried making a poncho for one of the robots.
I thought it would look very silly, which for me was all the more reason to try!
To make the poncho, I made a big flat disc in Maya and turned it into nCloth, and just let it fall onto the robot with the robot’s geometry acting as a static collider.
This approach basically worked out-of-the-box, although I made some manual edits to the geometry afterwards just to get the poncho to billow a bit more on the bottom.
The poncho’s shader is a simple glass PxrSurface shader, with the bottom frosted section and smooth diamond-shaped window section both driven using just roughness.
The crinkly plastic sheet appearance is achieved entirely through a wrinkle normal map.
The poncho bot is also not really visible in the final image, but somewhere in the final image, this robot is in the background on the deck of the front ship behind some other robots!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/poncho_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/poncho_bot.jpg" alt="Figure 22: Robot wearing a clear plastic poncho." /></a></p>
<p>Don’t worry, I didn’t forget about the fact that the robots have antennae!
For the poncho robot, I modeled a hole into the poncho for the antenna to pass through, and I modeled similar holes into the captain robot and conductor robot’s hats as well.
Again, this is a detail that isn’t visible in the final image at all, but is there mostly just so that I can know that it’s there:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/poncho_antenna_hole.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/poncho_antenna_hole.jpg" alt="Figure 23: Antenna pass-through hole modeled into the poncho." /></a></p>
<p>In total I created 12 different unique robot variants, which some variants duplicated in the final image.
All 12 variants are actually present in the scene!
Most of them are in the background (and a few variants are only on the background ship), so most of them aren’t very visible in the final image.
You, the reader, have probably noticed a theme in this post now where I put a lot of effort into things that aren’t actually visible in the final image… for me, a large part of this project wasn’t necessarily about the final image and was instead just about having fun and getting some practice with the tools and workflows.</p>
<p>Here is a turntable showcasing all 12 robot variants.
In the turntable, only the yellow robot has both a wet and dry variant, since all of the other robots in the scene remembered their umbrellas and were therefore able to stay dry.
The green sports fan robot does have a variant with a wet right arm though, since in the final image the green sports fan robot’s right arm is extended beyond the umbrella to wave at the incoming ship.</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/433151137?loop=1" frameborder="0">Shipshape Art Challenge Robots Turntable</iframe></div>
<div class="figcaption">Figure 24: Turntable of the robots, with all 12 robot variants.</div>
<p><strong>The Wet Shader</strong></p>
<p>Going into the shading process, the single problem that worried me the most was how I was going to make everything in the rain look wet.
Having a good wet look is extremely important for selling the overall look of a rainy scene.
I actually wasn’t too worried about the base dry shading, since hard metal/plastic surfaces are one of the things that CG is really good at by default.
By contrast, getting a good wet rainy look took an enormous amount of experimentation and effort, and wound up even involving some custom tools.</p>
<p>From a cursory search online, I found some techniques for creating a wet rainy look that basically work by modulating the primary specular lobe and applying a normal map to the base normal of the surface.
However, I didn’t really like how this looked; in some cases, this approach basically makes it look like the underlying surface itself has rivulets and dots in it, not like there’s water running on top of the surface.
My hunch was to use PxrSurface’s clearcoat lobe instead, since from a physically motivated perspective, water streaks and droplets behave more like an additional transparent refractive coating layer on top of a base surface.
A nice bonus of using the clearcoat lobe is that PxrSurface supports a different normal map for each specular lobe; this way, I could plug a dedicated water droplets and streaks normal map into the clearcoat lobe’s bump normal parameter without disturbing whatever normal map was already plugged into the bump normal parameter for the base diffuse and primary specular lobes.
My idea was to create a single shading graph for creating the wet rainy look, and then plug this graph into the clearcoat lobe parameters for any PxrSurface that I wanted a wet appearance for.
Here’s what the final graph looked like:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/wetshader_graph.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/wetshader_graph.png" alt="Figure 25: Shading graph for creating the wet rainy look. This graph plugs into the clearcoat parameters of any shader that I wanted to have a wet appearance." /></a></p>
<p>In the graph above, note how the input textures are fed into PxrRemap nodes for ior, edge color, thickness, and roughness; this is so I can rescale the 0-1 range inputs from the textures to whatever they need to be for each parameter.
The node labeled “mastercontrol” allows for disabling the entire wet effect by feeding 0.0 into the clearcoat edge color parameter, which effectively disables the clearcoat lobe.</p>
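<p>As a rough illustration of what those PxrRemap nodes and the “mastercontrol” switch are doing conceptually, here’s a small Python sketch; the specific output ranges below are made up for illustration and aren’t the exact values from my scene.</p>
<pre><code>
# Illustrative sketch only: rescale 0-1 wet-map values into per-parameter
# ranges for the clearcoat lobe, with a master control that can zero out the
# clearcoat edge color to disable the whole wet effect.

def remap(value, out_min, out_max):
    # Conceptually what a PxrRemap node does for an input assumed to be in [0, 1].
    return out_min + value * (out_max - out_min)

def wet_clearcoat_params(wet_mask, master_control=1.0):
    # wet_mask: 0-1 greyscale sample from the rivulet/droplet maps.
    # The parameter names and ranges below are illustrative placeholders.
    return {
        "clearcoatEdgeColor": remap(wet_mask, 0.0, 1.0) * master_control,
        "clearcoatRoughness": remap(wet_mask, 0.25, 0.02),  # wetter = glossier
        "clearcoatThickness": remap(wet_mask, 0.0, 0.6),    # wetter = darker base
        "clearcoatIor": 1.33,                                # water-ish IOR
    }

# Setting master_control to 0.0 feeds black into the edge color, which
# effectively switches the clearcoat lobe (and therefore the wet look) off.
dry = wet_clearcoat_params(0.7, master_control=0.0)
</code></pre>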
<p>Having to manually connect this graph into all of the clearcoat parameters in each PxrSurface shader I used was a bit of a pain.
Ideally I would have preferred if I could have just plugged all of the clearcoat parameters into a PxrLayer, disabled all non-clearcoat lobes in the PxrLayer, and then plugged the PxrLayer into a PxrLayerSurface on top of underlying base layers.
Basically, I wish PxrLayerSurface supported enabling/disabling layers on a per-lobe basis, but this ability currently doesn’t exist in RenderMan 23.
In Disney’s Hyperion Renderer, we support this functionality for sparsely layering Disney Bsdf parameters <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015]</a>, and it’s really really useful.</p>
<p>There are only four input maps required for the entire wet effect: a greyscale rain rivulets map, a corresponding rain rivulets normal map, a greyscale droplets map, and a corresponding droplets normal map.
The rivulets maps are used for the sides of a PxrRoundCube projection node, while the droplets maps are used for the top of the PxrRoundCube projection node; this makes the wet effect look more like rain drop streaks the more vertical a surface is, and more like droplets splashing on a surface the more horizontal a surface is.
Even though everything in my scene is UV mapped, I chose to use PxrRoundCube to project the wet effect on everything in order to make the wet effect as automatic as possible; to make sure that repetitions in the wet effect textures weren’t very visible, I used a wide transition width for the PxrRoundCube node and made sure that the PxrRoundCube’s projection was rotated around the Y-axis to not be aligned with any model in the scene.</p>
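<p>The effect of putting the rivulet maps on the sides of the projection and the droplet maps on the top can be thought of as a blend driven by how upward-facing the surface is. Here’s a toy Python sketch of that idea; it’s an approximation of the behavior, not how PxrRoundCube is actually implemented.</p>
<pre><code>
# Illustrative sketch only: blend between the droplet map and the rivulet map
# based on surface orientation, so horizontal surfaces get splashing droplets
# and vertical surfaces get rain streaks.

def wet_map_blend(normal, droplet_sample, rivulet_sample, transition=0.35):
    # normal: unit surface normal (x, y, z) with y pointing up.
    up_facing = max(0.0, normal[1])
    t = min(1.0, up_facing / transition)  # soft transition near horizontal
    return rivulet_sample * (1.0 - t) + droplet_sample * t

# A deck plank (normal straight up) gets pure droplets:
deck = wet_map_blend((0.0, 1.0, 0.0), droplet_sample=0.8, rivulet_sample=0.3)
# A hull plate (normal pointing sideways) gets pure streaks:
hull = wet_map_blend((1.0, 0.0, 0.0), droplet_sample=0.8, rivulet_sample=0.3)
</code></pre>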
<p>To actually create the maps, I used a combination of Photoshop and a custom tool that I originally wrote for Takua Renderer.
I started in Photoshop by kit-bashing together stuff I found online and hand-painting on top to produce a 1024 by 1024 pixel square example map with all of the characteristics I wanted.
While in Photoshop, I didn’t worry about making sure that the example map could tile; tiling comes in the next step.
After initial work in Photoshop, this is what I came up with:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/painted_wetmask.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/painted_wetmask.jpg" alt="Figure 26: Initial kit-bashed / hand-painted exemplars for streak and droplet wet maps." /></a></p>
<p>Next, to make the maps repeatable and much larger, I used a custom tool I previously wrote that implements a practical form of histogram-blending hex tiling <a href="http://jcgt.org/published/0008/04/02/">[Burley 2019]</a>.
Hex tiling with histogram preserving blending, originally introduced by <a href="https://doi.org/10.1145/3233304">Heitz and Neyret [2018]</a>, is one of the closest things to actual magic in recent computer graphics research; using hex tiling instead of normal rectilinear tiling basically completely hides obvious repetitions in the tiling from the human eye, and the histogram preserving blending makes sure that hex tile boundaries blend in a way that makes them completely invisible as well.
I’ll write more about hex tiling and make my implementation publicly available in a future post.
What matters for this project is that hex tiling allowed me to convert my exemplar map from Photoshop into a much larger 8K seamlessly repeatable texture map with no visible repetition patterns.
Below is a cropped section from each 8K map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/hextiled_wetmask.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/hextiled_wetmask.jpg" alt="Figure 27: Crops from the 8K wet maps generated from the exemplar maps using my custom implementation of histogram-blending hex tiling." /></a></p>
<p>For the previous Art Challenge, I also made some custom textures that had to be tileable.
Last time though, I used Substance Designer to make the textures tileable, which required setting up a big complicated node graph and produced results where obvious repetition was still visible.
Conversely, hex tiling basically works automatically and doesn’t require any kind of manual setup or complex graphs or anything.</p>
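<p>For the curious, the core trick that makes histogram-preserving blending work is surprisingly compact. Below is a minimal Python sketch of just the variance-preserving blend of the three hex-grid samples; the full technique also involves the hex lattice lookup and, for many textures, transforming into and out of a Gaussian domain first, neither of which is shown here.</p>
<pre><code>
# Illustrative sketch only: the variance-preserving blend used by
# histogram-preserving tiling. A plain weighted average of the three samples
# washes out contrast toward the texture's mean; dividing the deviation from
# the mean by the norm of the weights restores it.

import math

def variance_preserving_blend(samples, weights, texture_mean):
    # samples: three texel values fetched at randomly-offset hex lattice points.
    # weights: barycentric weights of the shading point within its hex triangle
    #          (assumed to sum to 1).
    linear_blend = sum(w * s for w, s in zip(weights, samples))
    weight_norm = math.sqrt(sum(w * w for w in weights))
    return (linear_blend - texture_mean) / weight_norm + texture_mean

# Near the center of a triangle the naive blend would flatten toward the mean;
# the corrected blend keeps roughly the original contrast.
blended = variance_preserving_blend([0.9, 0.2, 0.4], [0.34, 0.33, 0.33], 0.5)
</code></pre>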
<p>To generate the normal maps, I used Photoshop’s “Generate Normal Map” filter, which is found under “Filter > 3D”.
For generating normal maps from simple greyscale heightmaps, this Photoshop feature works reasonably well.
Because the hex tiling implementation is deterministic, though, I could have also generated normal maps from the greyscale exemplars first and then fed those normal map exemplars through the hex tiling tool with the same parameters that I used for the greyscale maps, and I would have gotten the same result as below.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/hextiled_normals.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/hextiled_normals.jpg" alt="Figure 28: Crops from the 8K wet map normals generated using Photoshop's "Generate Normal Map" filter tool." /></a></p>
<p>For the wet effect’s clearcoat lobe, I chose to use the physical mode instead of the artistic mode (unlike for the base dry shaders, where I only used the artistic mode).
The reason I used the physical mode for the wet effect is the layer thickness control, which darkens the underlying base shader according to how thick the clearcoat layer is supposed to be. I wanted this effect, since wet surfaces appear darker than their dry counterparts in real life.
Using the greyscale wet map, I modulated the layer thickness control according to how much water there was supposed to be at each part of the surface.</p>
<p>Finally, after wiring everything together in Maya’s HyperShade editor, everything just worked!
I think the wet look my approach produces looks reasonably convincing, especially from the distances that everything is from the camera in my final piece.
Up close the effect still holds up okay, but isn’t as convincing as using real geometry for the water droplets with real refraction and caustics driven by manifold next event estimation <a href="http://dx.doi.org/10.1111/cgf.12681">[Hanika et al. 2015]</a>.
In the future, if I need to do close up water droplets, I’ll likely try an MNEE based approach instead; fortunately, RenderMan 23’s PxrUnified integrator already comes with an MNEE implementation as an option, along with various other strategies for handling caustic cases <a href="http://graphics.pixar.com/library/BiDir/">[Hery et al. 2016]</a>.
However, the approach I used for this project is far cheaper from a render time perspective compared to using geometry and MNEE, and from a mid to far distance, I’m pretty happy with how it turned out!</p>
<p>Below are some comparisons of the ship and robot with and without the wet effect applied.
The ship renders are from the same camera angles as in Figures 13, 14, and 15; drag the slider left and right to compare:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jul/shipshape/comparisons/wideship_wetdrycompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 29: Wide view of the ship with (left) and without (right) the wet shader applied. For a full screen comparison, <a href="/content/images/2020/Jul/shipshape/comparisons/wideship_wetdrycompare.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jul/shipshape/comparisons/backship_wetdrycompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 30: Back view of the ship with (left) and without (right) the wet shader applied. For a full screen comparison, <a href="/content/images/2020/Jul/shipshape/comparisons/backship_wetdrycompare.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jul/shipshape/comparisons/sideship_wetdrycompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 31: Side view of the ship with (left) and without (right) the wet shader applied. For a full screen comparison, <a href="/content/images/2020/Jul/shipshape/comparisons/sideship_wetdrycompare.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jul/shipshape/comparisons/robot_wetdrycompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 32: Main yellow robot with (left) and without (right) the wet shader applied. For a full screen comparison, <a href="/content/images/2020/Jul/shipshape/comparisons/robot_wetdrycompare.html">click here.</a></div>
<div>
<p><strong>Additional Props and Set Elements</strong></p>
<p>In addition to texturing and shading the flying scifi ship and robot models, I had to create from scratch several other elements to help support the story in the scene.
By far the single largest new element that had to be created was the entire dock structure that the robots stand on top of.
As mentioned earlier, I wound up modeling the dock to a fairly high level of detail; the dock model contains every single bolt and rivet and plate that would be necessary for holding together a similar real steel frame structure.
Part of this level of detail is justifiable by the fact that the dock structure is in the foreground and therefore relatively close to camera, but part of having this level of detail is just because I could and I was having fun while modeling.
To model the dock relatively quickly, I used a modular approach where I first modeled a toolkit of basic reusable elements like girders, connection points, bolts, and deckboards.
Then, from these basic elements, I assembled larger pieces such as individual support legs and crossbeams and such, and then I assembled these larger pieces into the dock itself.</p>
<p>Shading the dock was relatively fast and straightforward; I created a basic galvanized metal material and applied it using a PxrRoundCube projection.
To get a bit more detail and break up the base material a bit, I added a dirt layer on top that is basically just low-frequency noise multiplied by ambient occlusion.
I did have to UV map the gangway section of the dock in order to add the yellow and black warning stripe at the end of the gangway; however, since the dock is made up almost entirely of essentially rectangular prisms oriented at 90 degree angles to each other, Maya’s automatic UV unwrapping provided something good enough to use as-is.
The yellow and black warning stripe uses the same thick worn paint material that the warning stripes on the ship uses.
On top of all of this, I then applied my wet shader clearcoat lobe.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/dock_wide.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/dock_wide.jpg" alt="Figure 33: Shading test for the dock, with wet effect applied. The lampposts are in a different orientation compared to where they are in the final scene." /></a></p>
<p>The metro sign on the dock is just a single rectangular prism with a dark glass material applied.
The glowing text is a color texture map plugged into PxrSurface’s glow parameter; wherever there is glowing text, I also made the material diffuse instead of glass, with the diffuse color matching the glow color.
To balance the intensity of the glow, I had to cheat a bit; turning the intensity of the glow down enough so that the text and colors read well means that the glow is no longer bright enough to show up in reflections or cast enough light to show up in a volume.
My solution was to turn down the glow in the PxrSurface shader, and then add a PxrRectLight immediately in front of the metro sign driven by the same texture map.
The PxrRectLight is set to be invisible to the camera.
I suppose I could have done this in post using light path expressions, but cheating it this way was simpler and allowed for everything to just look right straight out of the render.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/dock_closeup.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/dock_closeup.jpg" alt="Figure 34: Closeup test of the metro sign on the dock." /></a></p>
<p>The suitcase was a really simple prop to make.
Basically it’s just a rounded cube with some extra bits stuck on to it for the handles and latch; the little rivets are actually entirely in shading and aren’t part of the geometry at all.
I threw on a basic burlap material for the main suitcase, multiplied on some noise to make it look a bit dirtier and worn, and applied basic brass and leather materials to the latch and handle, and that was pretty much it.
Since the suitcase was going to serve as the yellow robot’s makeshift umbrella, making sure that the suitcase looked good with the wet effect applied turned out to be really important.
Here’s a lookdev test render of the suitcase, with and without the wet effect applied (slide left and right to compare):</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jul/shipshape/comparisons/suitcase_wetdrycompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 35: Suitcase with (left) and without (right) the wet shader applied. For a full screen comparison, <a href="/content/images/2020/Jul/shipshape/comparisons/suitcase_wetdrycompare.html">click here.</a></div>
<div>
<p>From early on, I was fairly worried about making the umbrellas look good; I knew that making sure the umbrellas looked convincingly wet was going to be really important for selling the overall rainy day setting.
I originally was going to make the umbrellas opaque, but realized that opaque umbrellas were going to cast a lot of shadows and block out a lot of parts of the frame.
Switching to transparent umbrellas made out of clear plastic helped a lot with brightening up parts of the frame and making sure that large parts of the ship weren’t completely blocked out in the final image.
As a bonus, I think the clear umbrellas also help the overall setting feel slightly more futuristic.
I modeled the umbrella canopy as a single-sided mesh, so the “thin” setting in PxrSurface’s glass parameters was really useful here.
Since the umbrella canopy is transparent with refraction roughness, having the wet effect work through the clearcoat lobe proved really important here since doing so allowed for the rain droplets and rivulets to have sharp specular highlights while simultaneously preserving the more blurred refraction in the underlying umbrella canopy material.
In the end, lighting turned out to be really important for selling the look of the wet umbrella as well; I found that having tons of little specular highlights coming from all of the rain drops helped a lot.</p>
<p>As a bit of an aside, settling on a final umbrella canopy shape took a surprising amount of time!
I started with a much flatter umbrella canopy, but eventually made it more bowed after looking at various umbrellas I have sitting around at home.
Most clear umbrella references I found online are of these Japanese bubble umbrellas which are actually far more bowed than a standard umbrella, but I wanted a shape that more closely matched a standard opaque umbrella.</p>
<p>One late addition I made to the umbrella was the small lip at the bottom edge of the umbrella canopies; for much of the development process, I didn’t have this small lip and kept feeling like something was off about the umbrellas.
I eventually realized that some real umbrellas have a bit of a lip to help catch and guide water runoff; adding this feature to the umbrellas helped them feel a bit more correct.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/umbrella.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/umbrella.jpg" alt="Figure 36: Lookdev test of the umbrella, with wet effect applied." /></a></p>
<p>Shortly before the due date for the final image, I made a last-minute addition to my scene: I took the sextant that came with Pixar’s base models and made the white/red robot on the dock hold it.
Since the green and yellow robots were both doing something a bit more dynamic than just standing around, I wanted the middle white/red robot to be doing something as well.
Maybe the white/red robot is going to navigation school!
I did a very quick-and-dirty shading job on the sextant using Maya’s automatic UVs; overall the sextant prop is not shaded to the same level of detail as most of the other elements in my scene, but considering how small the sextant is in the final image, I think it holds up okay.
I still tried to add a plausible amount of wear and age to the metal materials on the sextant, but I didn’t have time to put in carved numbers and decals and grippy textures and stuff.
There are also a few small areas where you can see visible texture stretching at UV seams, but again, in the final image, it didn’t matter too much.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/sextant.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/sextant.jpg" alt="Figure 37: Quick n' dirty lookdev test of the sextant. Model is by Aliyah Chen and was provided by Pixar as one of the contest's base models." /></a></p>
<p><strong>Rain FX</strong></p>
<p>Having a good wet surface look was one half of getting my scene to look convincingly rainy; the other major problem to solve was making the rain itself!
My initial, extremely naive plan was to simulate all of the rainfall as one enormous FLIP sim in Houdini.
However, I almost immediately realized what a bad idea that was, due to the scale of the scene.
Instead, I opted to simulate the rain as nParticles in Maya.</p>
<p>To start, I first duplicated all of the geometry that I wanted the rain to interact with, combined it all into one single huge mesh, and then decimated the mesh heavily and simplified as much as I could.
This single mesh acted as a proxy for the full scene for use as a passive collider in the nParticles network.
Using a decimated proxy for the collider instead of the full scene geometry was very important for making sure that the sim ran fast enough for me to be able to get in a good number of different iterations and attempts to find the look that I wanted.
I mostly picked geometry that was upward facing for use in the proxy collider:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/rain_proxygeo.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/rain_proxygeo.png" alt="Figure 38: The proxy mesh I used for the rain nParticles sim. This is an earlier version of the proxy mesh before I settled on final scene geometry; the final sim was run with an updated proxy mesh made from the final scene geometry." /></a></p>
<p>Next, I set up a huge volume nParticle emitter node above the scene, covering the region visible in the camera frustum.
The only forces I set up were gravity and a small amount of wind, and then I ran the nParticles system and let it run until rain had filled all parts of the scene visible to the camera.
To give the impression of fast moving motion-blurred rain droplets, I set the rendering mode of the nParticles to ‘multistreak’, which makes each particle look like a set of lines with lengths varying according to velocity.
I had to play with the collider proxy mesh’s properties a bit to get the right amount of raindrops bouncing off of surfaces and to dial in how high raindrops bounced.
I initially tried allowing particles to collide with each other as well, but this slowed the entire sim down to basically a halt, so for the final scene I have particle-to-particle collision disabled.</p>
<p>After a couple of rounds of iteration, I started getting something that looked reasonably like rain!
Using the proxy collision geometry was really useful for creating “rain shadows”, which are areas where rain isn’t present because it has been blocked by something else.
I also tuned the wind speed a lot in order to get rain particles bouncing off of the umbrellas to look like they were being blown aside in the wind.
After getting a sim that I liked, I baked out the frame of the sim that I wanted for my final render using Maya’s nCache system, which caches the nParticle simulation to disk so that it can be rapidly loaded up later without having to re-run the entire simulation.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/rain_viewport.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/rain_viewport.jpg" alt="Figure 39: Closeup of a work-in-progress version of the rain sim. Note how the umbrellas properly block rain from falling on the robots under the umbrellas." /></a></p>
<p>To add just an extra bit of detail and storytelling, near the end of the competition period I revisited my original idea for making the rain in Houdini using a FLIP solver.
I wanted to add in some “hero” rain drops around the foreground robots, running off of their umbrellas and suitcases and stuff.
To create these “hero” droplets, I brought the umbrella canopies and suitcase into Houdini and built a basic FLIP simulation, meshed the result, and brought it back into Maya to integrate back into the scene.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/houdini_rainsim.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/houdini_rainsim.png" alt="Figure 40: Using a FLIP simulation in Houdini to create some "hero" rain droplets running off of the umbrella canopies and suitcase." /></a></p>
<p>Dialing in the look of the rain required a lot of playing with both the width of the rain drop streaks and with the rain streak material.
I was initially very wary of making the rain in my scene heavy, since I was concerned about how much a heavy rain look would prevent me from being able to pull good detail and contrast from the ships.
However, after some successful initial tests, I felt a bit more confident about a heavier rain look.
Starting from one of those initial heavier-rain tests, I tried increasing the amount of rain by around 10x.
I originally started working on the sim with only around a million particles, but by the end I had bumped up the particle count to around 10 million.
In order to prevent the increased amount of rain from completely washing out the scene, I made each rain drop streak on the thinner and shorter side, and also tweaked the material to be slightly more forward scattering.
My rain material is basically a mix of rough glass and grey diffuse: the rain needs a glass component since rain is water, but because the rain droplet streaks are meant to look motion blurred, mixing in some diffuse helps them show up better in camera. Making the rain material more forward scattering in this case just means shifting the glass/diffuse ratio towards more glass.
I eventually arrived at a ratio of 60% diffuse light grey to 40% glass, which I found helped the rain show up in the camera and catch light a bit better.
I also used the “presence” parameter (which is really just opacity) in PxrSurface to make final adjustments to balance how visible the rain was against how much it was washing out other details.
For the “hero” droplets, I used a completely bog-standard glass material.</p>
<p>Figuring out how to simulate the rain and make it look good was by far the single largest source of worries for me in this whole project, so I was incredibly relieved at the end when it all came together and started looking good.
Here’s a 2K crop from my final image showing the “hero” droplets and all of the surrounding rain streaks around the foreground robots.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/raindrops_crop.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/raindrops_crop.jpg" alt="Figure 41: 2K crop showing "hero" droplets and rain streaks." /></a></p>
<p><strong>Lighting and Compositing</strong></p>
<p>Lighting this scene proved to be very interesting and very different from what I did for the previous challenge!
Looking back, I think I actually may have “overlit” the scene in the previous challenge; I tend to prefer a slightly more naturalistic look, but while in the thick of lighting, it’s easy to get carried away and push things far beyond the point of looking naturalistic.
Another aspect of this scene that made it very different from anything I’ve tried before is both the sheer number of practical lights in the scene and the fact that practical lights are the primary source of all lighting in this scene!</p>
<p>The key lighting in this scene is provided by the overhead lampposts on the dock, which illuminate the foreground robots.
I initially had a bunch of additional invisible PxrRectLights providing additional illumination and shaping on the robots, but I got rid of all of them and in the final image I relied only on the actual lights on the lampposts.
To prevent the visible light surfaces themselves from blowing out and aliasing, I used two lights for every lamppost: one visible-to-camera PxrRectLight set to a low intensity that wouldn’t alias in the render, and one invisible-to-camera PxrRectLight set to a relatively higher intensity for providing the actual lighting.
The visible-to-camera PxrRectLight is rendered out as the only element on a separate render layer, which can then be added back in to the main key lighting render layer.</p>
<p>To better light the ships, I added a number of additional floodlights to the ship that weren’t part of the original model; you can see these additional floodlights mounted on top of the various masts of the ships and also on the sides of the tower superstructure.
These additional floodlights illuminate the decks of the ships and help provide specular highlights to all of the umbrellas on the deck of the foreground ship, which enhances the rainy water droplet covered look.
For the foreground robots on the dock, the ship floodlights also act as something of a rim light.
Each of the ship floodlights is modeled as a visible-to-camera PxrDiscLight behind a glass lens with a second invisible-to-camera PxrDiscLight in front of the glass lens. The light behind the glass lens is usually lower in intensity and is there to provide the in-camera look of the physical light, while the invisible light in front of the lens is usually higher in intensity and provides the actual illumination in the scene.</p>
<p>In general, one of the major lessons I learned on this project was that when lighting using practical lights that have to be visible in camera, a good approach is to use two different lights: one visible-to-camera and one invisible-to-camera.
This approach allows for separating how the light itself looks versus what kind of lighting it provides.</p>
<p>The overall fill lighting and time of day is provided by the skydome, which is of an overcast sky at dusk.
I waffled back and forth for a while between a more mid-day setting and a dusk setting, but eventually settled on the dusk skydome since the overall darker time of day allows the practical lights to stand out more.
I think allowing the background trees to fade almost completely to black actually helps a lot in keeping the focus of the image on the main story elements in the foreground.
One feature of RenderMan 23 that really helped in quickly testing different lighting setups and iterating on ideas was RenderMan’s IPR mode, which has come a long way since RenderMan first moved to path tracing.
In fact, throughout this whole project, I used the IPR mode extensively for both shading tests and for the lighting process.
I have a lot of thoughts about the huge, compelling improvements to artist workflows that will be brought by even better interactivity (RenderMan XPU is very exciting!), but writing all of those thoughts down is probably better material for a different blog post in the future.</p>
<p>In total I had five lighting render layers: the key from the lampposts, the foreground rim and background fill from the floodlights, overall fill from the skydome, and two practicals layers for the visible-to-camera parts of all of the practical lights.
Below are my lighting render layers, although with the two practicals layers merged:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_key.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_key.jpg" alt="Figure 42: Final render, lampposts key lighting pass." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_floods.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_floods.jpg" alt="Figure 43: Final render, floodlights lighting pass." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_sky.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_sky.jpg" alt="Figure 44: Final render, sky fill lighting pass." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_practicals.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_practicals.jpg" alt="Figure 45: Final render, practical lights lighting pass." /></a></p>
<p>I used a number of PxrRodLightFilters to knock down some distractingly bright highlights in the scene (especially on the foreground robots’ umbrellas in the center of the frame).
As a rendering engineer, rod light filters are a constant source of annoyance due to the sampling problems they introduce; rods allow for arbitrarily increasing or decreasing the amount of light going through an area, which throws off energy conservation, which can mess up importance sampling strategies that depend on a degree of energy conservation.
However, as a user, rod light filters have become one of my favorite go-to tools for shaping and adjusting lighting on a local basis, since they offer an enormous amount of localized artistic control.</p>
<p>To convey the humidity of a rainstorm and to provide volumetric glow around all of the practical lights in the scene, I made extensive use of volume rendering on this project as well.
Every part of the scene visible in-camera has some sort of volume in it!
There are generally two types of volumes in this scene: a group of thinner, less dense volumes to provide atmospherics, and then a group of thicker, denser “hero” volumes that provide some of the more visible mist below the foreground ship and swirling around the background ship.
All of these volumes are heterogeneous volumes brought in as VDB files.</p>
<p>One odd thing I found with volumes was some major differences in sampling behavior between RenderMan 23’s PxrPathtracer and PxrUnified integrators.
I found that by default, whenever I had a light that was embedded in a volume, areas in the volume near the light were extremely noisy when rendered using PxrUnified but rendered normally when using PxrPathtracer.
I don’t know enough about the details of how PxrUnified and PxrPathtracer’s volume integration <a href="https://doi.org/10.1145/3084873.3084907">[Fong et al. 2017]</a> approaches differ, but it almost looks to me like PxrPathtracer is correctly using RenderMan’s equiangular sampling implementation <a href="http://dx.doi.org/10.1111/j.1467-8659.2012.03148.x">[Kulla and Fajardo 2012]</a> in these areas and PxrUnified for some reason is not.
As a result, for rendering all volume passes I relied on PxrPathtracer, which did a great job with quickly converging on all passes.</p>
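<p>For readers who haven’t run into equiangular sampling before, the basic idea is to choose distances along a ray through a volume so that the inverse-square falloff from a nearby light is importance sampled, instead of picking distances uniformly or purely by transmittance. Here’s a minimal textbook-style Python sketch of the sampling step; this is just to illustrate the idea and is not RenderMan’s actual implementation.</p>
<pre><code>
# Illustrative sketch only: equiangular sampling of a distance t along a ray
# segment [t_near, t_far] with respect to a point light, following
# Kulla and Fajardo [2012].

import math
import random

def sample_equiangular(ray_origin, ray_dir, light_pos, t_near, t_far, u):
    # Distance along the ray to the point closest to the light, and the
    # perpendicular distance from the light to the ray. (A real implementation
    # also needs to handle the degenerate case where the light sits on the ray.)
    to_light = [lp - ro for lp, ro in zip(light_pos, ray_origin)]
    delta = sum(a * b for a, b in zip(to_light, ray_dir))
    closest = [ro + delta * d for ro, d in zip(ray_origin, ray_dir)]
    dist = math.sqrt(sum((lp - c) ** 2 for lp, c in zip(light_pos, closest)))

    theta_a = math.atan2(t_near - delta, dist)
    theta_b = math.atan2(t_far - delta, dist)

    # Warp the uniform random number u so that samples cluster near the light.
    t = delta + dist * math.tan(theta_a + u * (theta_b - theta_a))
    pdf = dist / ((theta_b - theta_a) * (dist * dist + (t - delta) ** 2))
    return t, pdf

t, pdf = sample_equiangular((0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
                            (0.5, 0.0, 5.0), 0.0, 50.0, random.random())
</code></pre>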
<p>An interesting unintended side effect of filling the scene with volumes was in how the volumes interacted with the orange thruster and exhaust vent lights.
I had originally calibrated the lights in the thrusters and exhaust vents to provide an indication of heat coming from those areas of the ship without being so bright as to distract from the rest of the image, but the orange glows these lights produced in the volumes made the entire bottom of the image orange, which was distracting anyway.
As a result, I had to re-adjust the orange thruster and exhaust vent lights to be considerably dimmer than I had originally had them, so that when interacting with the volumes, everything would be brought up to the apparent image-wide intensity that I had originally wanted.</p>
<p>In total I had eight separate render passes for volumes; each of the consolidated lighting passes from above had two corresponding volume passes.
Within the two volume passes for each consolidated lighting pass, one volume pass was for the atmospherics and one was for the heavier mist and fog.
Below are the volume passes consolidated into four images, with each image showing both the atmospherics and mist/fog in one image:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_key.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_key.jpg" alt="Figure 46: Final render, lampposts key volumes combined passes." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_floods.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_floods.jpg" alt="Figure 47: Final render, floodlights volumes combined passes." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_sky.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_sky.jpg" alt="Figure 48: Final render, sky fill volumes combined passes." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_practicals.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_practicals.jpg" alt="Figure 49: Final render, practical lights volumes combined passes." /></a></p>
<p>One final detail I added in before final rendering was to adjust the bokeh shape to something more interesting than a uniform circle.
RenderMan 23 offers a variety of controls for customizing the camera’s aperture shape, which in turn controls the bokeh shape when using depth of field.
All of the depth of field in my final image is in-render, and because of all of the tiny specular hits from all of the raindrops and from the wet shader, there is a lot of visible bokeh going on.
I wanted to make sure that all of this bokeh was interesting to look at!
I picked a rounded 5-bladed aperture with a significant amount of non-uniform density (that is, the outer edges of the bokeh are much brighter than the center core).</p>
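<p>As an aside, one simple way to picture a bladed aperture with a brighter rim is as a rejection sampler: keep only candidate points that fall inside the blade polygon, and accept them with a probability that grows toward the edge. The toy Python sketch below illustrates that idea; it isn’t how RenderMan’s aperture controls are implemented, just a way to visualize what the parameters are doing.</p>
<pre><code>
# Illustrative sketch only: rejection-sample a rounded, bladed aperture whose
# sample density increases toward the rim, producing bokeh with bright edges.

import math
import random

def sample_bladed_aperture(num_blades=5, edge_bias=2.0, rounding=0.2):
    while True:
        # Uniform candidate point in the unit disk.
        r = math.sqrt(random.random())
        phi = 2.0 * math.pi * random.random()
        x, y = r * math.cos(phi), r * math.sin(phi)

        # The candidate must lie on the inner side of every blade edge; adding
        # "rounding" pushes the edges outward, blending the polygon toward the
        # bounding circle.
        inside = True
        for i in range(num_blades):
            angle = 2.0 * math.pi * (i + 0.5) / num_blades
            if x * math.cos(angle) + y * math.sin(angle) > math.cos(math.pi / num_blades) + rounding:
                inside = False
                break
        if not inside:
            continue

        # Non-uniform density: accept with probability r^edge_bias, which makes
        # the outer ring of the bokeh brighter than the center core.
        if random.random() > r ** edge_bias:
            continue
        return x, y
</code></pre>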
<p>For final compositing, I used a basic Photoshop and Lightroom workflow like I did in the previous challenge, mostly because Photoshop is a tool I already know extremely well and I don’t have Nuke at home.
I took a relatively light-handed approach to compositing this time around; adjustments to layers were limited to just exposure adjustments.
All of the layers shown above already have the exposure adjustments I made baked in.
After making adjustments in Photoshop and flattening out to a single layer, I then brought the image into Lightroom for final color grading.
For the final color grade, I tried to push the overall look to be a bit moodier and a bit more contrast-y, with the goal of having the contrast further draw the viewer’s eye to the foreground robots where the main story is.
Figure 50 is a gif that visualizes the compositing process for my final image by showing how all of the successive layers are added on top of each other.
Figure 51 shows what all of the lighting, comp, and color grading looks like applied to a 50% grey clay shaded version of the scene, and if you don’t want to scroll all the way back to the top of this post to see the final image, I’ve included it again as Figure 52.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/final_layers_lossy.gif"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/final_layers_lossy.gif" alt="Figure 50: Animated breakdown of compositing layers." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/clayrender_graded_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/clayrender_graded.jpg" alt="Figure 51: Final lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/shipshape_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/shipshape_full.jpg" alt="Figure 52: Final image. Click for 4K version." /></a></p>
<p><strong>Conclusion</strong></p>
<p>On a whole, I’m happy with how this project turned out!
I think a lot of what I did on this project represents a decent evolution over what I did for the previous RenderMan Art Challenge, and applies a lot of the lessons I learned from that project.
I started this project mostly as an excuse to just have fun, but along the way I still learned a lot more, and going forward I’m definitely hoping to be able to do more pure art projects alongside my main programming and technical projects.</p>
<p>Here is a progression video I put together from all of the test and in-progress renders that I made throughout this entire project:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/433150588" frameborder="0">Shipshape Art Challenge Progression Reel</iframe></div>
<div class="figcaption">Figure 53: Progression reel made from test and in-progress renders leading up to my final image.</div>
<p>My wife, Harmony Li, deserves an enormous amount of thanks on this project.
First off, the final concept I went with is just as much her idea as it is mine, and throughout the entire project she provided valuable critiques and suggestions and direction.
As usual with the RenderMan Art Challenges, Leif Pederson from Pixar’s RenderMan group provided a lot of useful tips, advice, feedback, and encouragement as well.
Many other entrants in the Art Challenge also provided a ton of support and encouragement; the community that has built up around the Art Challenges is really great and a fantastic place to be inspired and encouraged.
Finally, I owe an enormous thanks to all of the judges for this RenderMan Art Challenge, because they picked my image for first place!
Winning first place in a contest like this is incredibly humbling, especially since I’ve never really considered myself as much of an artist.
Various friends have since pointed out that with this project, I no longer have the right to deny being an artist!
If you would like to see more about my contest entry, check out the <a href="https://renderman.pixar.com/answers/challenge/15577/river-patrol.html">work-in-progress thread I kept on Pixar’s Art Challenge forum</a>, and I also made <a href="https://www.artstation.com/artwork/WK2OJv">an Artstation post</a> for this project.</p>
<p>As a final bonus image, here’s a daylight version of the scene.
My backup plan in case I wasn’t able to pull off the rainy look was to just go for a plain daylight setup; I figured that the lighting would be a lot more boring, but the additional visible detail would be an okay consolation prize for myself.
Thankfully, the rainy look worked out and I didn’t have to go to my backup plan!
After the contest wrapped up, I went back and made a daylight version out of curiosity:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/daylight_comp_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/daylight_comp.jpg" alt="Figure 54: Bonus image: daylight version. Click for 4K version." /></a></p>
</div>
https://blog.yiningkarlli.com/2020/02/shadow-terminator-in-takua.html
Shadow Terminator in Takua
2020-02-09T00:00:00+00:00
2020-02-09T00:00:00+00:00
Yining Karl Li
<div>
<p>I recently implemented two techniques in Takua for solving the harsh shadow terminator problem; I implemented both the Disney Animation solution <a href="https://www.yiningkarlli.com/projects/shadowterminator.html">[Chiang et al. 2019]</a> that we published at SIGGRAPH 2019, and the Sony Imageworks technique <a href="https://link.springer.com/chapter/10.1007/978-1-4842-4427-2_12">[Estevez et al. 2019]</a> published in Ray Tracing Gems.
We didn’t show too many comparisons between the two techniques (which I’ll refer to as the Chiang and Estevez approaches, respectively) in our SIGGRAPH 2019 presentation, and we didn’t show comparisons on any actual “real-world” scenes, so I thought I’d do a couple of my own renders using Takua as a bit of a mini-followup and share a handful of practical implementation tips.
For a recap of the harsh shadow terminator problem, please see either the Estevez paper or the slides from the Chiang talk, which both do excellent jobs of describing the problem and why it happens in detail.
Here’s a small scene that I made for this post, thrown together using some Evermotion assets that I had sitting around:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jan/shadowterminator/bedroom.chiang.pt.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jan/shadowterminator/preview/bedroom.chiang.pt.0.jpg" alt="Figure 1: A simple bedroom scene, rendered in Takua Renderer. This image was rendered using the Chiang 2019 shadow terminator solution." /></a></p>
<p>In this scene, all of the blankets and sheets and pillows on the bed use a fabric material that uses extremely high-frequency, high-resolution normal maps to achieve the fabric-y fiber-y look.
Because of these high-frequency normal maps, the bedding is susceptible to the harsh shadow terminator problem.
All of the bedding also has diffuse transmission and a very slight amount of high roughness specularity to emulate the look of a sheen lobe, making the material (and therefore this comparison) overall more interesting than just a single diffuse lobe.</p>
<p>Since the overall scene is pretty brightly lit and the bed is lit from all directions either by direct illumination from the window or bounce lighting from inside of the room, the shadow terminator problem is not as apparent in this scene; it’s still there, but it’s much more subtle than in the examples we showed in our talk.
Below are some interactive comparisons between renders using Chiang 2019, Estevez 2019, and no shadow terminator fix; drag the slider left and right to compare:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_chiang_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 2: The bedroom scene rendered in Takua Renderer using Chiang 2019 (left) and no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_chiang_nofix.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_chiang_estevez_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 3: The bedroom scene rendered in Takua Renderer using Chiang 2019 (left) and Estevez 2019 (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_chiang_estevez.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_diffuse_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 4: The bedroom scene rendered in Takua Renderer using no normal mapping (left) and normal mapping with no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_diffuse_nofix.html">click here.</a></div>
<div>
<p>If you would like to compare the 4K renders directly, they are located here: <a href="/content/images/2020/Jan/shadowterminator/bedroom.chiang.pt.0.jpg">Chiang 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bedroom.estevez.pt.0.jpg">Estevez 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bedroom.none.pt.0.jpg">No Fix</a>, <a href="/content/images/2020/Jan/shadowterminator/bedroom.diffuse.pt.0.jpg">No Normal Mapping</a>.
As mentioned above, due to this scene being brightly lit, differences between the two techniques and not having any harsh shadow terminator fix at all will be a bit more subtle.
However, differences are still visible, especially in brighter areas of the blanket and white pillows.
Note that in this scenario, the difference between Chiang 2019 and Estevez 2019 is fairly small, while the difference between using either shadow terminator fix and not having a fix is more apparent.
Also note how both Chiang 2019 and Estevez 2019 produce results that come pretty close to matching the reference image with no normal mapping; this is good, since we would expect fix techniques to match the reference image more closely than not having a fix!</p>
<p>If we remove the bedroom set and put the bed onto more of a studio lighting setup with two area lights and a seamless grey backdrop, we can start seeing more prominent differences between the two techniques and between either technique and no fix.
Seeing how everything plays out in this type of a lighting setup is useful, since this is the type of render that one often sees as part of a standard lookdev department’s workflow:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_chiang_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 5: The bed in a studio lighting setup, rendered in Takua Renderer using Chiang 2019 (left) and no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_chiang_nofix.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_chiang_estevez_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 6: The bed in a studio lighting setup, rendered in Takua Renderer using Chiang 2019 (left) and Estevez 2019 (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_chiang_estevez.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_diffuse_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 7: The bed in a studio lighting setup, rendered in Takua Renderer using no normal mapping (left) and normal mapping with no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_diffuse_nofix.html">click here.</a></div>
<div>
<p>If you would like to compare the 4K renders directly for the studio lighting setup, they are located here: <a href="/content/images/2020/Jan/shadowterminator/bed.chiang.pt.0.jpg">Chiang 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bed.estevez.pt.0.jpg">Estevez 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bed.none.pt.0.jpg">No Fix</a>, <a href="/content/images/2020/Jan/shadowterminator/bed.diffuse.pt.0.jpg">No Normal Mapping</a>.
In this setup, we can now see differences between the four images much more clearly.
Compared to the no normal mapping reference, the render with no fix produces considerably more darkening on silhouettes, and the harsh sudden transition from bright to shadowed areas is much more apparent.
In the render with no fix, the bedding suddenly looks a lot less soft and starts to look a little more like a hard solid surface instead of like fabric.</p>
<p>Chiang 2019 and Estevez 2019 both restore more of the soft fabric look by softening out the harsh shadow terminator areas, but the differences between Chiang 2019 and Estevez 2019 become more apparent and interesting in this setting.
Chiang 2019 produces an overall softer look that has shadow terminators that more closely match the reference with no normal mapping, but Chiang 2019 produces a slightly darker look overall compared to Estevez 2019.
Estevez 2019 doesn’t match the reference’s shadow terminators quite as closely as Chiang 2019, but manages to preserve more of the overall energy.
In Figure 5 in the Chiang 2019 paper, we explain where this difference comes from: for small shading normal deviations, Estevez 2019 produces less shadowing than our method, whereas for larger shading normal deviations, Estevez 2019 produces more shadowing than our method.
As a result, Estevez 2019 generally produces a higher contrast look compared to Chiang 2019.</p>
<p>All of these differences are more apparent in a close-up crop of the full 4K render.
Here are comparisons of the same studio lighting setup from above, but cropped in; pay close attention to slightly right of center of the image, where the white blanket overhangs the edge of the bed:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_chiang_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 8: Crop of the studio lighting setup render from earlier, using Chiang 2019 (left) and no harsh shadow terminator fix (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_chiang_nofix.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_chiang_estevez_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 9: Crop of the studio lighting setup render from earlier, using Chiang 2019 (left) and Estevez 2019 (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_chiang_estevez.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_diffuse_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 10: Crop of the studio lighting setup render from earlier, using no normal mapping (left) and normal mapping with no harsh shadow terminator fix (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_diffuse_nofix.html">click here.</a></div>
<div>
<p>Of course, the scenario that makes the harsh shadow terminator problem the most apparent is when there is a single strong light source and we are viewing the scene from an angle from which we can see areas where the light hits surfaces at a glancing angle.
These types of lighting setups are often used for checking silhouettes and backlighting and whatnot in modeling and lookdev turntable renders.
In the comparisons below, the differences are most noticeable in the folds and on the shadowed sides of all of the bedding:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_chiang_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 11: The bed lit with a single very bright light, rendered in Takua Renderer using Chiang 2019 (left) and no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_chiang_nofix.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_chiang_estevez_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 12: The bed lit with a single very bright light, rendered in Takua Renderer using Chiang 2019 (left) and Estevez 2019 (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_chiang_estevez.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_diffuse_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 13: The bed lit with a single very bright light, rendered in Takua Renderer using no normal mapping (left) and normal mapping with no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_diffuse_nofix.html">click here.</a></div>
<div>
<p>If you would like to compare the 4K renders directly for the single light source renders, they are located here: <a href="/content/images/2020/Jan/shadowterminator/bed_singlelight.pt.chiang.0.jpg">Chiang 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bed_singlelight.pt.estevez.0.jpg">Estevez 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bed_singlelight.pt.none.0.jpg">No Fix</a>, <a href="/content/images/2020/Jan/shadowterminator/bed_singlelight.pt.diffuse.0.jpg">No Normal Mapping</a>.
With a single light source, the differences between the four images are now very clear, since a single light setup produces strong contrast between the lit and shadowed parts of the image.
The harsh shadow terminator problem is especially visible in the folds of the blanket, where we can see one side of the fold fully lit and one side of the fold in shadow (although because the bedding all has diffuse transmission, the harsh shadow terminator is still not as prevalent as it would be for a purely diffuse reflecting surface).
Something else that is interesting is how the bedding with no shadow terminator fix overall appears slightly brighter than the bedding with no normal mapping; this is because the shading normals “bend” more light towards the light source.
Chiang 2019 restores the overall brightness of the bedding back to something closer to the reference with no normal mapping but softens out more of the fine detail from the normal mapping, while Estevez 2019 preserves more of the fine details but has a brightness level closer to the render with no fix.</p>
<p>Just like in the studio lighting renders, differences become more apparent in close-up crops of the full 4K render.
Here are some cropped in comparisons, this time centered more on the top of the bed than on the edge.
In these crops, the glancing light angles make the shadow terminators more apparent in the folds of the blankets and such:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_chiang_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 14: Crop of the single light render from earlier, using Chiang 2019 (left) and no harsh shadow terminator fix (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_chiang_nofix.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_chiang_estevez_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 15: Crop of the single light render from earlier, using Chiang 2019 (left) and Estevez 2019 (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_chiang_estevez.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_diffuse_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 16: Crop of the single light render from earlier, using no normal mapping (left) and normal mapping with no harsh shadow terminator fix (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_diffuse_nofix.html">click here.</a></div>
<div>
<p>In the end, I don’t think either approach is better than the other, and from a physical basis there really isn’t a “right” answer since nothing about shading normals is physical to begin with; I think it’s up to a matter of personal preference and the requirements of the art direction on a given project.
Our artists at Walt Disney Animation Studios generally prefer the look of Chiang 2019 because of the lighting setups they usually work with, but I know that other artists prefer the look of Estevez 2019 because they have different requirements to meet.</p>
<p>Fortunately, Chiang 2019 and Estevez 2019 are both really easy to implement!
Both techniques can be implemented in a handful of lines of code, and are easy to apply to any modern physically based shading model.
We didn’t actually include source code in our SIGGRAPH talk, mostly because we figured that translating the math from our short paper into code should be very straightforward and thus, including source code that is basically a direct transcription of the math into C++ would almost be insulting to the intelligence of the reader.
However, since then, I’ve gotten a surprising number of emails asking for source code, so here’s the math and the corresponding C++ code from my implementation in Takua Renderer.
Let G’ be the additional shadow terminator term that we will multiply the Bsdf result with:</p>
<div>\[ G = \min\bigg[1, \frac{\langle\omega_g,\omega_i\rangle}{\langle\omega_s,\omega_i\rangle\langle\omega_g,\omega_s\rangle}\bigg] \]</div>
<div>\[ G' = - G^3 + G^2 + G \]</div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float calculateChiang2019ShadowTerminatorTerm(const vec3& outputDirection,
const vec3& shadingNormal,
const vec3& geometricNormal) {
float NDotL = max(0.0f, dot(shadingNormal, outputDirection));
float NGeomDotL = max(0.0f, dot(geometricNormal, outputDirection));
float NGeomDotN = max(0.0f, dot(geometricNormal, shadingNormal));
if (NDotL == 0.0f || NGeomDotL == 0.0f || NGeomDotN == 0.0f) {
return 0.0f;
} else {
float G = NGeomDotL / (NDotL * NGeomDotN);
if (G <= 1.0f) {
float smoothTerm = -(G * G * G) + (G * G) + G; // smoothTerm is G' in the math
return smoothTerm;
}
}
return 1.0f;
}
</code></pre></div> </div>
<p>That’s all there is to it!
<a href="https://github.com/Apress/ray-tracing-gems/blob/master/Ch_12_A_Microfacet-Based_Shadowing_Function_to_Solve_the_Bump_Terminator_Problem/terminator.cpp">Source code for Estevez 2019</a> is provided as part of the Ray Tracing Gems Github repository, but for the sake of completeness, my implementation is included below.
My implementation is just the sample implementation streamlined into a single function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float calculateEstevez2019ShadowTerminatorTerm(const vec3& outputDirection,
const vec3& shadingNormal,
const vec3& geometricNormal) {
float cos_d = min(abs(dot(geometricNormal, shadingNormal)), 1.0f);
float tan2_d = (1.0f - cos_d * cos_d) / (cos_d * cos_d);
float alpha2 = clamp(0.125f * tan2_d, 0.0f, 1.0f);
float cos_i = max(abs(dot(geometricNormal, outputDirection)), 1e-6f);
float tan2_i = (1.0f - cos_i * cos_i) / (cos_i * cos_i);
float spi_shadow_term = 2.0f / (1.0f + sqrt(1.0f + alpha2 * tan2_i));
return spi_shadow_term;
}
</code></pre></div> </div>
<p>Finally, I have a handful of small implementation notes.
First, to apply either Chiang 2019 or Estevez 2019 to your existing physically based shading model, just multiply the additional shadow terminator term with the contribution for each lobe that needs adjusting.
Technically speaking G’ is an adjustment to the G shadowing term in a standard microfacet model, but multiplying there versus multiplying with the overall lobe contribution works out to be the same thing.
If your Bsdf supports multiple shading normals for different specular lobes, you’ll need to calculate a separate shadow terminator term for each shading normal.
Second, note that both Chiang 2019 and Estevez 2019 are described with respect to unidirectional path tracing from the camera.
This frame of reference is very important; both techniques work specifically based on the outgoing direction being the direction towards a potential light source, meaning that this technique actually isn’t reciprocal by default.
The Estevez 2019 paper found that the shadow terminator term can be made reciprocal by just applying the term to both incoming and outgoing directions, but they also found that this adjustment can make edges too dark.
Instead, in order to make both techniques compatible with bidirectional path tracing integrators, I add in a check for whether the incoming or outgoing direction is pointed at a light, and feed the appropriate direction into the shadow terminator function.
Doing this check is enough to make my bidirectional renders match my unidirectional ones; intuitively this approach is similar to the check one has to carry out when applying adjoint Bsdf adjustments <a href="https://graphics.stanford.edu/papers/non-symmetric/">[Veach 1996]</a> for shading normals and refraction.</p>
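<p>To make the first note above a bit more concrete, here is a small sketch of what applying the terminator term per lobe might look like. To be clear, this is not Takua’s actual Bsdf interface; the Lobe struct and its members are hypothetical stand-ins (and vec3 is assumed to be the same small vector type used in the snippets above). The sketch only shows where the terminator term gets multiplied in and how the direction-towards-the-light requirement factors in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <vector>

// Hypothetical stand-in for one lobe of a layered Bsdf; not Takua's real interface.
struct Lobe {
    vec3 shadingNormal;                                            // this lobe's (possibly bent) shading normal
    vec3 evaluate(const vec3& towardsLight, const vec3& towardsCamera) const;
};

vec3 evaluateBsdfWithTerminatorFix(const std::vector<Lobe>& lobes,
                                   const vec3& towardsLight,   // must point towards the light
                                   const vec3& towardsCamera,
                                   const vec3& geometricNormal) {
    vec3 result(0.0f);
    for (const Lobe& lobe : lobes) {
        // One terminator term per shading normal; in a bidirectional integrator, make sure
        // the direction fed in here is whichever of the two directions points at the light.
        float term = calculateChiang2019ShadowTerminatorTerm(towardsLight,
                                                             lobe.shadingNormal,
                                                             geometricNormal);
        result += lobe.evaluate(towardsLight, towardsCamera) * term;
    }
    return result;
}
</code></pre></div> </div>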
<p>That’s pretty much it!
If you want the details for how these two techniques are derived and why they work, I strongly encourage reading the Estevez 2019 chapter in Ray Tracing Gems and reading through both the short paper and the presentation slides / notes for the Chiang 2019 SIGGRAPH talk.</p>
<p><strong>References</strong></p>
<p>Matt Jen-Yuan Chiang, Yining Karl Li, and Brent Burley. 2019. <a href="https://dl.acm.org/citation.cfm?doid=3306307.3328172">Taming the Shadow Terminator</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 71:1–71:2.</p>
<p>Alejandro Conty Estevez, Pascal Lecocq, and Clifford Stein. 2019. <a href="https://link.springer.com/chapter/10.1007/978-1-4842-4427-2_12">A Microfacet-Based Shadowing Function to Solve the Bump Terminator Problem</a>. <em>Ray Tracing Gems</em> (2019), 149-158.</p>
<p>Eric Veach. 1996. <a href="https://graphics.stanford.edu/papers/non-symmetric/">Non-Symmetric Scattering in Light Transport Algorithms</a>. In <em>Rendering Techniques 1996 (Proceedings of the 7th Eurographics Workshop on Rendering)</em>. 82-91.</p>
<p><strong>Errata</strong></p>
<p>Thanks to Matt Pharr for noticing and pointing out a minor bug in the calculateChiang2019ShadowTerminatorTerm() implementation; the code has been updated with a fix.</p>
</div>
https://blog.yiningkarlli.com/2019/11/woodville-renderman-challenge.html
Woodville RenderMan Art Challenge
2019-11-30T00:00:00+00:00
2019-11-30T00:00:00+00:00
Yining Karl Li
<p>Every once in a while, I make a <a href="https://blog.yiningkarlli.com/2016/07/minecraft-in-renderman-ris.html">point of spending some significant personal time</a> working on a personal project that uses tools outside of the stuff I’m used to working on day-to-day (Disney’s Hyperion renderer professionally, Takua Renderer as a hobby).
A few times each year, Pixar’s RenderMan group holds an art challenge contest where Pixar provides an un-shaded, un-UV’d base model and contestants are responsible for layout, texturing, shading, lighting, additional modeling of supporting elements and surrounding environment, and producing a final image.
I thought the <a href="https://renderman.pixar.com/news/renderman-woodville-art-challenge">most recent RenderMan art challenge, “Woodville”</a>, would make a great excuse for playing with RenderMan 22 for Maya; here’s the final image I came up with:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_full.jpg" alt="Figure 1: My entry to Pixar's RenderMan Woodville Art Challenge, titled "Morning Retreat". Base treehouse model is from Pixar; all shading, lighting, additional modeling, and environments are mine. Concept by Vasylina Holod. Model by Alex Shilt © Disney / Pixar - RenderMan "Woodville" Art Challenge." /></a></p>
<p>One big lesson I have learned since entering the rendering world is that there is no such thing as the absolute best overall renderer; there are only renderers that are the best suited for particular workflows, tasks, environments, people, etc.
Every in-house renderer is the best renderer in the world for the particular studio that built that renderer, and every commercial renderer is the best renderer in the world for the set of artists that have chosen that renderer as their tool of choice.
Another big lesson that I have learned is that even though the Hyperion team at Disney Animation has some of the best rendering engineers in the world, so do all of the other major rendering teams, both commercial and in-house.
These lessons are humbling to learn, but also really cool and encouraging if you think about it: these lessons mean that for any given problem that arises in the rendering world, as an academic field and as an industry, we get multiple attempts to solve it from many really brilliant minds from a variety of backgrounds and a variety of different contexts and environments!</p>
<p>As a result, something I’ve come to strongly believe is that for rendering engineers, there is enormous value in learning to use outside renderers that are not the one we work on day-to-day ourselves.
At any given moment, I try to have at least a working familiarity with the latest versions of Pixar’s <a href="https://renderman.pixar.com">RenderMan</a>, Solid Angle (Autodesk)’s <a href="https://www.arnoldrenderer.com">Arnold</a>, and Chaos Group’s <a href="https://www.chaosgroup.com">Vray</a> and <a href="https://corona-renderer.com">Corona</a> renderers.
All of these renderers are excellent, cutting edge tools, and when new artists join our studio, these are the most common commercial renderers that new artists tend to know how to use.
Therefore, knowing how these four renderers work and what vocabulary is associated with them tends to be useful when teaching new artists how to use our in-house renderer, and for providing a common frame of reference when we discuss potential improvements and changes to our in-house renderer.
All of the above is the mindset I went into this project with, so this post is meant to be something of a breakdown of what I did, along with some thoughts and observations made along the way.
This was a really fun exercise, and I learned a lot!</p>
<p><strong>Layout and Framing</strong></p>
<p>For this art challenge, Pixar <a href="https://renderman.pixar.com/woodville-pup-asset">supplied a base model</a> without any sort of texturing or shading or lighting or anything else.
The model is by Alex Shilt, based on a concept by Vasylina Holod.
Here is a simple render showing what is provided out of the box:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville_base_wide.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_base_wide.jpg" alt="Figure 2: Base model provided by Pixar, rendered against a white cyclorama background using a basic skydome." /></a></p>
<p>I started with just scouting for some good camera angles.
Since I really wanted to focus on high-detail shading for this project, I decided from close to the beginning to pick a close-up camera angle that would allow for showcasing shading detail, at the trade-off of not depicting the entire treehouse.
A nice (lazy) bonus is that picking a close-up camera angle meant that I didn’t need to shade the entire treehouse; just the parts in-frame.
Instead of scouting using just the GL viewport in Maya, I tried using RenderMan for Maya 22’s IPR mode, which replaces the Maya viewport with a live RenderMan render.
This mode wound up being super useful for scouting; being able to interactively play with depth of field settings and see even basic skydome lighting helped a lot in getting a feel for each candidate camera angle.
Here are a couple of different white clay test renders I did while trying to find a good camera position and framing:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_02.jpg" alt="Figure 3: Candidate camera angle with a close-up focus on the entire top of the treehouse." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_04.jpg" alt="Figure 4: Candidate camera angle with a close-up focus on a specific triangular A-frame treehouse cabin." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_03.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_03.jpg" alt="Figure 5: Candidate camera angle looking down from the top of the treehouse." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_01.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_01.jpg" alt="Figure 6: Candidate camera angle with a close-up focus on the lower set of treehouse cabins." /></a></p>
<p>I wound up deciding to go with the camera angle and framing in Figure 6 for several reasons.
First off, there are just a lot of bits that looked fun to shade, such as the round tower cabin on the left side of the treehouse.
Second, I felt that this angle would allow me to limit how expansive of an environment I would need to build around the treehouse.
I decided around this point to put the treehouse in a big mountainous mixed coniferous forest, with the reasoning being that tree trunks as large as the ones in the treehouse could only come from huge redwood trees, which only grow in mountainous coniferous forests.
With this camera angle, I could make the background environment a single mountainside covered in trees and not have to build a wider vista.</p>
<p><strong>UVs and Geometry</strong></p>
<p>The next step that I took was to try to shade the main tree trunks, since the scale of the tree trunks worried me the most about the entire project.
Before I could get to texturing and shading though, I first had to UV-map the tree trunks, and I quickly discovered that before I could even UV-map the tree trunks, I would have to retopologize the meshes themselves, since the tree trunk meshes came with some really messy topology that was basically un-UV-able.
I retopologized the mesh in ZBrush and exported it lower res than the original mesh, and then brought it back into Maya, where I used a shrink-wrap deformer to conform the lower res retopologized mesh back onto the original mesh.
The reasoning here was that a lower resolution mesh would be easier to UV unwrap and that displacement later would restore missing detail.
Figure 7 shows the wireframe of the original mesh on the left, and the wireframe of my retopologized mesh on the right:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/trunk_wireframe.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/trunk_wireframe.jpg" alt="Figure 7: Original mesh wireframe on the left, my retopologized version on the right." /></a></p>
<p>In previous projects, I’ve found a lot of success in using <a href="https://github.com/wjakob/instant-meshes">Wenzel Jakob’s Instant Meshes</a> application to retopologize messy geometry, but this time around I used <a href="http://docs.pixologic.com/user-guide/3d-modeling/topology/zremesher/">ZBrush’s ZRemesher tool</a> since I wanted as perfect a quad grid as possible (at the expense of losing some mesh fidelity) to make UV unwrapping easier.
I UV-unwrapped the remeshed tree trunks by hand; the general approach I took was to slice the tree trunks into a series of stacked cylinders and then unroll each cylinder into as rectangular of a UV shell as I could.
For texturing, I started with some photographs of redwood bark I found online, turned them greyscale in Photoshop and adjusted levels and contrast to produce height maps, and then took the height maps and source photographs into Substance Designer, where I made the maps tile seamlessly and also generated normal maps.
I then took the tileable textures into Substance Painter and painted the tree trunks using a combination of triplanar projections and manual painting.
At this point, I had also blocked in a temporary forest in the background made from just instancing two or three tree models all over the place, which I found useful for being able to help get a sense of how the shading on the treehouse was working in context:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress016.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress016.jpg" alt="Figure 8: In-progress test render with shaded tree trunks and temporary background forest blocked in." /></a></p>
<p>Next up, I worked on getting base shading done for the cabins and various bits and bobs on the treehouse.
The general approach I took for the entire treehouse was to do base texturing and shading in Substance Painter, and then add wear and tear, aging, and moss in RenderMan through procedural <a href="https://rmanwiki.pixar.com/display/REN22/PxrLayerSurface">PxrLayerSurface</a> layers driven by a combination of procedural <a href="https://rmanwiki.pixar.com/display/REN22/PxrRoundCube">PxrRoundCube</a> and <a href="https://rmanwiki.pixar.com/display/REN22/PxrDirt">PxrDirt</a> nodes and hand-painted dirt and wear masks.
First though, I had to UV-unwrap all of the cabins and stuff.
I tried using <a href="https://www.sidefx.com/tutorials/houdini-game-dev-tools-auto-uvs/">Houdini’s Auto UV SOP</a> that comes with Houdini’s Game Tools package… the result (for an example, see Figure 9) was really surprisingly good!
In most cases I still had to do a lot of manual cleanup work, such as re-stitching some UV shells together and re-laying-out all of the shells, but the output from Houdini’s Auto UV SOP provided a solid starting point.
For each cabin, I grouped surfaces that were going to have a similar material into a single UDIM tile, and sometimes I split similar materials across multiple UDIM tiles if I wanted more resolution.
This entire process was… not really fun… it took a lot of time and was basically just busy-work.
I vastly prefer being able to paint Ptex instead of having to UV-unwrap and lay out UDIM tiles, but since I was using Substance Painter, Ptex wasn’t an option on this project.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/houdini-auto-uv.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/houdini-auto-uv.jpg" alt="Figure 9: Example of one of the cabins run through Houdini's Auto UV SOP. The cabin is on the left; the output UVs are on the right." /></a></p>
<p><strong>Texturing in Substance Painter and Shading</strong></p>
<p>In Substance Painter, the general workflow I used was to start with multiple triplanar projections of (heavily edited) Quixel Megascans surfaces masked and oriented to different sections of a surface, and then paint on top.
Through this process, I was able to get bark to flow with the curves of each log and whatnot.
Then, in RenderMan for Maya, I took all of the textures from Substance Painter and used them to drive the base layer of a PxrLayerSurface shader.
All of the textures were painted to be basically greyscale or highly desaturated, and then in Maya I used PxrColorCorrect and PxrVary nodes to add in color.
This way, I was able to iteratively play with and dial in colors in RenderMan’s IPR mode without having to roundtrip back to Substance Painter too much.
Since the camera in my frame is relatively close to the treehouse, having lots of detail was really important.
I put high-res displacement and normal maps on almost everything, which I found helpful for getting that extra detail in.
I found that setting the dicing rate finer than the default of one micropolygon per pixel (that is, a micropolygon length of less than a pixel) was useful for getting extra detail in with displacement, at the cost of a bit more memory usage (which was perfectly tolerable in my case).</p>
<p>One of the unfortunate things about how I chose to UV-unwrap the tree trunks is that UV seams cut across parts of the tree trunks that are visible to the camera; as a result, if you zoom into the final 4K renders, you can see tiny line artifacts in the displacement where UV seams meet.
These artifacts arise from displacement values not interpolating smoothly across UV seams when texture filtering is in play; this problem can sometimes be avoided by very carefully hiding UV seams, but sometimes there is no way.
The problem in my case is somewhat reduced by expanding displacement values beyond the boundaries of each UV shell in the displacement textures (most applications like Substance Painter can do this natively), but again, this doesn’t completely solve the problem, since expanding values beyond boundaries can only go so far until you run into another nearby UV shell and since texture filtering widths can be variable.
This problem is one of the major reasons why we use Ptex so heavily at Disney Animation; Ptex’s robust cross-face filtering functionality sidesteps this problem entirely.
I really wish Substance Painter could output Ptex!</p>
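<p>As a side note, the “expand displacement values beyond the boundaries of each UV shell” trick mentioned above is usually called edge padding or dilation. Applications like Substance Painter handle this natively, so the following is purely an illustrative sketch of the idea rather than any particular tool’s implementation: empty texels bordering a UV shell repeatedly copy (here, average) the values of their covered neighbors, growing a border of plausible data around each shell so that texture filtering near seams pulls in reasonable values instead of garbage:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <vector>

// values: single-channel texture, mask: true where a texel is covered by a UV shell.
// Each pass grows the covered region outwards by one texel.
void dilateUvShells(std::vector<float>& values, std::vector<bool>& mask,
                    int width, int height, int numPasses) {
    for (int pass = 0; pass < numPasses; pass++) {
        std::vector<bool> newMask = mask;
        std::vector<float> newValues = values;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (mask[y * width + x]) continue; // already covered by a shell
                float sum = 0.0f;
                int count = 0;
                // Average the covered texels among the 4-connected neighbors.
                const int offsets[4][2] = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };
                for (const auto& o : offsets) {
                    int nx = x + o[0], ny = y + o[1];
                    if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
                    if (mask[ny * width + nx]) {
                        sum += values[ny * width + nx];
                        count++;
                    }
                }
                if (count > 0) {
                    newValues[y * width + x] = sum / float(count);
                    newMask[y * width + x] = true; // texel is now part of the padded border
                }
            }
        }
        mask.swap(newMask);
        values.swap(newValues);
    }
}
</code></pre></div> </div>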
<p>For dialing in the colors of the base wood shaders, I created versions of the wood shader base color textures that looked like newer wood and older sun-bleached wood, and then I used a PxrBlend node in each wood shader to blend between the newer and older looking wood, along with procedural wear to make sure that the blend wasn’t totally uniform.
Across all of the various wood shaders in the scene, I tied all of the blend values to a single PxrToFloat node, so that I could control how aged all wood across the entire scene looks with a single value.
For adding moss to everything, I used a PxrRoundCube triplanar to set up a base mask for where moss should go.
The triplanar mask was set up so that moss appears heavily on the underside of objects, less on the sides, and not at all on top.
The reasoning for making moss appear on undersides is because in the type of conifer forest I set my scene in, moss tends to grow where moisture and shade are available, which tends to be on the underside of things.
The moss itself was also driven by a triplanar projection and was combined into each wood shader as a layer in PxrLayerSurface.
I also did some additional manual mask painting in Substance Painter to get moss into some more crevices and corners and stuff on all of the wooden sidings and the wooden doors and whatnot.
Finally, the overall amount of moss across all of the cabins is modulated by another single PxrToFloat node, allowing me to control the overall amount of moss using another single value.
Figure 10 shows how I could vary the age of the wood on the cabins, along with the amount of moss.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/cabin_shading_progress.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/cabin_shading_progress.jpg" alt="Figure 10: Example of age and moss controllability on one of the cabins. The top row shows, going from left to right, 0% aged, 50% aged, and 100% aged. The bottom row shows, going from left to right, 0% moss, 50% moss, and 100% moss. The final values used were close to 60% for both age and moss." /></a></p>
<p>The spiral staircase initially made me really worried; I originally thought I was going to have to UV unwrap the whole thing, and stuff like the railings are really not easy to unwrap.
But then, after a bit of thinking, I realized that the spiral staircase is likely a fire escape staircase, and so it could be wrought iron or something.
Going with a wrought iron look allowed me to handle the staircase mostly procedurally, which saved a lot of time.
Going along with the idea of the spiral staircase being a fire escape, I figured that the actual main way to access all of the different cabins in the treehouse must be through staircases internal to the tree trunks.
This idea informed how I handled that long skinny window above the front door; I figured it must be a window into a stairwell.
So, I put a simple box inside the tree behind that window, with a light at the top.
That way, a hint of inner space would be visible through the window:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/lower_window_maya.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/lower_window_maya.jpg" alt="Figure 11: Simple box inside the tree behind the lower window, to give a hint of inner space." /></a></p>
<p>In addition to shading everything, I also had to make some modifications to the provided treehouse geometry.
I noticed that in the provided model, the satellite dish floats above its support pole without any actual connecting geometry, so I modeled a little connecting bit for the satellite dish.
Also, I thought it would be fun to put some furniture in the round cabin, so I decided to make the walls into plate glass.
Once I made the walls into plate glass, I realized that I needed to make a plausible interior for the round cabin.
Since the only way into the round cabin must be through a staircase in the main tree trunk, I modeled a new door in the back of the round cabin.
With everything shaded and the geometric modifications in place, here is how everything looked at this point:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress085_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/progress085.jpg" alt="Figure 12: In-progress test render with initial fully shaded treehouse, along with geoemtric modifications. Click for 4K version." /></a></p>
<p><strong>Set Dressing the Treehouse</strong></p>
<p>The next major step was adding some story elements.
I wanted the treehouse to feel lived in, like the treehouse is just somebody’s house (a very unusual house, but a house nonetheless).
To help convey that feeling, my plan was to rely heavily on set dressing to hint at the people living here.
So the goal was to add stuff like patio furniture, potted plants, laundry hanging on lines, furniture visible through windows, the various bits and bobs of life, etc.</p>
<p>I started by adding a nice armchair and a lamp to the round tower thing.
Of course the chair is an Eames Lounge Chair, and to match, the lamp is a modern style tripod floor lamp type thing.
I went with a chair and a lamp because I think that round tower would be a lovely place to sit and read and look out the window at the surrounding nature.
I thought it would be kind of fun to make all of the furniture kind of modern and stylish, but have all of the modern furniture be inside of a more whimsical exterior.
Next, I extended the front porch part of the main cabin, so that I could have some room to place furniture and props and stuff.
Of course any good front porch should have some nice patio furniture, so I added some chairs and a table.
I also put in a hanging round swing chair type thing with a big poofy blue cushion; this entire area should be a fun place to sit around and talk in.
Since the entire treehouse sits on the edge of a pond, I figured that maybe the people living here like to sit out on the front porch, relax, shoot the breeze, and fish from the pond.
Since my scene is set in the morning, I figured maybe it’s late in the morning and they’ve set up some fishing lines to catch some fish for dinner later.
To help sell the idea that it’s a lazy fishing morning, I added a fishing hat on one of the chairs and put a pitcher of ice tea and some glasses on the table.
I also added a clothesline with some hanging drying laundry, along with a bunch of potted and hanging plants, just to add a bit more of that lived-in feel.
For the plants and several of the furniture pieces that I knew I would want to tweak later, I built in controls to their shading graphs using PxrColorCorrect nodes to allow me to adjust hue and saturation later.
Many of the furniture, plant, and prop models are highly modified, kitbashed, re-textured versions of assets from Evermotion and CGAxis, although some of them (notably the Eames Lounge Chair) are entirely my own.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress096_crop1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/progress096_crop1.jpg" alt="Figure 13: In-progress test render closeup crop of the lower main cabin, with furniture and plants and props." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress096_crop2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/progress096_crop2.jpg" alt="Figure 14: In-progress test render closeup crop of the glass round cabin and the upper smaller cabin, with furniture and plants and props." /></a></p>
<p><strong>Building the Background Forest</strong></p>
<p>The last step before final lighting was to build a more proper background forest, as a replacement for the temporary forest I had used up until this point for blocking purposes.
For this step, I relied heavily on Maya’s MASH toolset, which I found to provide a great combination of power and ease-of-use; for use cases involving tons of instanced geometry, I certainly found it much easier than Maya’s older Xgen toolset.
MASH felt a lot more native to Maya, as opposed to Xgen, which requires a bunch of specific external file paths and file formats and whatnot.
I started with just getting some kind of reasonable base texturing down onto the groundplane.
In all of the in-progress renders up until this point, the ground plane was just white… you can actually tell if you look closely enough!
I eventually got to a place I was happy with using a bunch of different PxrRoundCubes with various rotations, all blended on top of each other using various noise projections.
I also threw in some rocks from Quixel Megascans, just to add a bit of variety.
I then laid down some low-level ground vegetation, which was meant to peek through the larger trees in various areas.
The base vegetation was made up of various ferns, shrubs, and small sapling-ish young conifers placed using Maya’s MASH Placer node:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/forest_progress029.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/forest_progress029.jpg" alt="Figure 15: In-progress test render of the forest floor and under-canopy vegetation." /></a></p>
<p>In the old temporary background forest, the entire forest is made up of only three different types of trees, and it really shows; there was a distinct lack of color variation or tree diversity.
So, for the new forest, I decided to use a lot more types of trees.
Here is a rough lineup (not necessarily to scale with each other) of how all of the new tree species looked:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/trees_lineup.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/trees_lineup.jpg" alt="Figure 16: Test render of a lineup of the trees used in the final forest." /></a></p>
<p>For the main forest, I hand-placed trees onto the mountain slope as instances.
One cool thing I built into the forest was PxrColorCorrect nodes in all of the tree shading graphs, with all controls wired up to a single set of master hue/saturation/value controls so that I could shift the entire forest’s colors easily if necessary.
This tool proved to be very useful for tuning the overall vegetation colors later while still maintaining a good amount of variation.
I also intentionally left gaps in the forest around the rock formations to give some additional visual variety.
Building up the entire under-layer of shrubs and saplings and stuff also paid off, since a lot of that stuff wound up peeking through various gaps between the larger trees:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/forest_progress050.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/forest_progress050.jpg" alt="Figure 17: In-progress test render of the background forest." /></a></p>
<p>The last step for the main forest was adding some mist and fog, which is common in Pacific Northwest type mountainous conifer forests in the morning.
I didn’t have extensive experience working with volumes in RenderMan before this, so there was definitely something of a learning curve for me, but overall it wasn’t too hard to learn!
I made the mist by just having a Maya Volume Noise node plug into the density field of a PxrVolume; this isn’t anything fancy, but it provided a great start for the mist/fog:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/forest_progress051.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/forest_progress051.jpg" alt="Figure 18: In-progress test render of the background forest with an initial version of mist and fog." /></a></p>
<p><strong>Lighting and Compositing</strong></p>
<p>At this point, I think the entire image together was starting to look pretty good, although, without any final shot lighting, the overall vibe felt more like a spread out of an issue of National Geographic than a more cinematic still out of a film.
Normally my instinct is to go with a more naturalistic look, but since part of the objective for this project was to learn to use RenderMan’s lighting toolset for more cinematic applications, I wanted to push the overall look of the image beyond this point:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress099.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/progress099.jpg" alt="Figure 19: In-progress test render with everything together, before final shot lighting." /></a></p>
<p>From this point onwards, following <a href="https://www.youtube.com/watch?v=PWFU-QIljRI">a tutorial made by Jeremy Heintz</a>, I broke out the volumetric mist/fog into a separate layer and render pass in Maya, which allowed for adjusting the mist/fog in comp without having to re-render the entire scene.
This strategy proved to be immensely useful and a huge time saver in final lighting.
Before starting final lighting, I made a handful of small tweaks, which included reworking the moss on the front cabin’s lower support frame to get rid of some visible repetition, tweaking and adding dirt on all of the windows, and dialing in saturation and hue on the clothesline and potted plants a bit more.
I also changed the staircase to have aged wooden steps instead of all black cast iron, which helped blend the staircase into the overall image a bit more, and added some dead trees in the background forest.
Finally, in a last-minute change, I wound up upgrading a lot of the moss on the main tree trunk and on select parts of the cabins to use instanced geometry instead of just being a shading effect.
The geometric moss used atlases from Quixel Megascans, which I bunched into little moss patches and then hand-scattered using Maya’s MASH Placer tool.
Upgrading to geometric moss provided only a subtle change to the overall image, but I think it helped enormously in selling some of the realism and detail; I find it interesting how small visual details like this often can have an outsized impact on selling an overall image.</p>
<p>For final lighting, I added an additional uniform atmospheric haze pass to help visually separate the main treehouse from the background forest a bit more.
I also added a spotlight fog pass to provide some subtle godrays; the spotlight is a standard PxrRectLight oriented to match the angle of the sun, with the cone modifier enabled to provide the spot effect and a <a href="https://rmanwiki.pixar.com/display/REN22/PxrCookieLightFilter">PxrCookieLightFilter</a> with a cucoloris pattern applied to provide the breakup effect that godrays shining through a forest canopy should have.
To provide a stronger key light, I rotated the skydome until I found something I was happy with, and then I split out the sun from the skydome into separate passes.
I split out the sun by painting the sun out of the skydome texture and then creating a PxrDistantLight with an exposure, color, and angle matched to what the sun had been in the skydome.
Splitting out the sun then allowed me to increase the size of the sun (and decrease the exposure correspondingly to maintain the same overall brightness), which helped soften some otherwise pretty harsh, sharp shadows.
I also used a good number of <a href="https://rmanwiki.pixar.com/display/REN22/PxrRodLightFilter">PxrRodLightFilters</a> to help take down highlights in some areas, lighten shadows in others, and provide overall light shaping to areas like the right hand side of the right tree trunk.
I’ve conceptually known why artists like rods for some time now (especially since rods are a heavily used feature in Hyperion at my day job at Disney Animation), but I think this project helped me really understand at a more hands-on level why rods are so great for hitting specific art direction.</p>
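<p>One small detail from the sun-splitting step that is worth making concrete: since a light’s solid angle grows with the square of its angular size, every doubling of the sun’s apparent size needs roughly a two-stop reduction in exposure to keep the overall brightness the same. Here is a quick back-of-the-envelope sketch; the specific angles are made-up example numbers, not the values I actually used:</p>
<pre><code>import math

# Angular sizes in degrees; these are made-up example numbers.
original_angle = 0.53   # roughly the real sun's angular diameter
enlarged_angle = 5.0    # a much bigger sun for softer shadows

scale = enlarged_angle / original_angle
# Solid angle grows with the square of the angular size, so compensate by
# -2 * log2(scale) stops to keep the total emitted energy roughly constant.
compensation_stops = -2.0 * math.log2(scale)
print(f"scale the sun {scale:.1f}x, compensate exposure by {compensation_stops:.2f} stops")
</code></pre>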
<p>After much iteration, here is the final set of render passes I wound up with going into final compositing:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_sun_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_sun.jpg" alt="Figure 19: Final render, sun (key) pass. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_sky_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_sky.jpg" alt="Figure 20: Final render, sky (fill) pass. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_practical_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_practical.jpg" alt="Figure 21: Final render, practical lights pass. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_volumes_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_volumes.jpg" alt="Figure 22: Final render, mist/fog pass. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_atmos_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_atmos.jpg" alt="Figure 23: Final render, atmospheric pass. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_spot_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_spot.jpg" alt="Figure 24: Final render, spotlight pass. Click for 4K version." /></a></p>
<p>In final compositing, since I had everything broken out into separate passes, I was able to quickly make a number of adjustments that otherwise would have been much slower to iterate on if I had done them in-render.
I tinted the sun pass to be warmer (which is equivalent to changing the sun color in-render and re-rendering), and I tweaked the exposure of the sun pass up and the exposures of some of the volumetric passes down to balance out the overall image.
I also applied a cooler color tint to the mist/fog pass, which would have been very slow to experiment with if I had changed the actual fog color in-render.
I did all of the compositing in Photoshop, since I don’t have a Nuke license at home.
Not having a node-based compositing workflow was annoying, so next time I’ll probably try to learn DaVinci Resolve Fusion (which I hear is pretty good).</p>
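<p>To give a rough sense of what these comp operations boil down to mathematically, here is a small NumPy sketch of additive pass recombination with per-pass tints and exposure adjustments. The tint and exposure values below are purely illustrative placeholders, not the numbers I actually used:</p>
<pre><code>import numpy as np

def expose(img, stops):
    # Scale a linear-light image by a number of photographic stops.
    return img * (2.0 ** stops)

def tint(img, rgb):
    # Multiply a linear-light image by a per-channel color.
    return img * np.asarray(rgb, dtype=img.dtype)

# In practice sun_pass, sky_pass, mist_pass, etc. would be loaded from the
# rendered EXR passes; flat placeholder images keep the sketch self-contained.
h, w = 270, 480
sun_pass = np.full((h, w, 3), 0.8, dtype=np.float32)
sky_pass = np.full((h, w, 3), 0.2, dtype=np.float32)
mist_pass = np.full((h, w, 3), 0.05, dtype=np.float32)

comp = tint(expose(sun_pass, 0.3), (1.05, 1.0, 0.9))             # warmer, slightly brighter key
comp = comp + sky_pass
comp = comp + tint(expose(mist_pass, -0.5), (0.9, 0.95, 1.05))   # cooler, dimmer mist
</code></pre>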
<p>For color grading, I mostly just fiddled around in Lightroom.
I also added in a small amount of bloom by just duplicating the sun pass, clipping it to only really bright highlight values by adjusting levels in Photoshop, applying a Gaussian blur, exposing down, and adding back over the final comp.
Finally, I adjusted the gamma by 0.8 and exposed up by half a stop to give some additional contrast and saturation, which helped everything pop a bit more and feel a bit more moody and warm.
Figure 26 shows what all of the lighting, comp, and color grading looks like applied to a 50% grey clay-shaded version of the scene, and if you don’t want to scroll all the way back to the top of this post to see the final image, I’ve included it again as Figure 27.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_grey_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_grey.jpg" alt="Figure 25: Final lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_full.jpg" alt="Figure 26: Final image. Click for 4K version." /></a></p>
<p><strong>Conclusion</strong></p>
<p>Overall, I had a lot of fun on this project, and I learned an enormous amount!
This project was probably the most complex and difficult art project I’ve ever done.
I think working on this project has shed a lot of light for me on why artists like certain workflows, which is an incredibly important set of insights for my day job as a rendering engineer.
I won’t grumble as much about having to support rods in production rendering now!</p>
<p>Here is a neat progression video I put together from all of the test and in-progress renders that I saved throughout this entire project:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/376059761" frameborder="0">Woodville Art Challenge Progression</iframe></div>
<p>I owe several people an enormous debt of thanks on this project.
My wife, Harmony Li, deserves all of my gratitude for her patience with me during this project, and also for being my art director and overall sanity checker.
My coworker at Disney Animation, lighting supervisor Jennifer Yu, gave me a lot of valuable critiques, advice, and suggestions, and acted as my lighting director during the final lighting and compositing stage.
Leif Pederson from Pixar’s RenderMan group provided a lot of useful tips and advice on the RenderMan contest forum as well.</p>
<p>Finally, my final image somehow managed to score an honorable mention in <a href="https://renderman.pixar.com/news/renderman-woodville-art-challenge-final-results">Pixar’s Art Challenge Final Results</a>, which was a big, unexpected, pleasant surprise, especially given how amazing all of the other entries in the contest are!
Since the main purpose of this project for me was to serve as a learning exercise, doing well in the actual contest was a nice bonus, and it makes me think I’ll likely give the next RenderMan Art Challenge a shot too, with a more serious focus on putting up a good showing.
If you’d like to see more about my contest entry, check out the <a href="https://renderman.pixar.com/answers/idea/10201/morning-retreat.html">work-in-progress thread I kept up in Pixar’s Art Challenge forum</a>; some of the text for this post was adapted from updates I made in my forum thread.</p>
https://blog.yiningkarlli.com/2019/11/froz2.html
Frozen 2
2019-11-14T00:00:00+00:00
2019-11-14T00:00:00+00:00
Yining Karl Li
<p>The 2019 film from <a href="http://www.disneyanimation.com">Walt Disney Animation Studios</a> is, of course, <a href="http://www.disneyanimation.com/projects/frozen2">Frozen 2</a>, which really does not need any additional introduction.
Instead, here is a brief personal anecdote.
I remember seeing the first Frozen in theaters the day it came out, and at some point halfway through the movie, it dawned on me that what was unfolding on the screen was really something special.
By the end of the first Frozen, I was convinced that I had to somehow get myself a job at Disney Animation some day.
Six years later, here we are, with Frozen 2’s release imminent, and here I am at Disney Animation.
Frozen 2 is my fourth credit at Disney Animation, but somehow seeing my name in the credits at the wrap party for this film was even more surreal than seeing my name in the credits on my first film.
Working with everyone on Frozen 2 was an enormous privilege and thrill; I’m incredibly proud of the work we have done on this film!</p>
<p>Under team lead Dan Teece’s leadership, for Frozen 2 we pushed Disney’s <a href="http://www.disneyanimation.com/technology/innovations/hyperion">Hyperion Renderer</a> harder and further than ever before, and I think the result really shows in the final film.
Frozen 2 is stunningly beautiful to look at; seeing it for the first time in its completed form was a humbling experience, since there were many moments where I realized I honestly had no idea how our artists had managed to push the renderer as far as they did.
During the production of Frozen 2, we also welcomed three superstar rendering engineers to the rendering team: <a href="http://rgba32.blogspot.com">Mark Lee</a>, <a href="https://schuttejoe.github.io">Joe Schutte</a>, and <a href="http://rendering-memo.blogspot.com">Wei-Feng Wayne Huang</a>; their contributions to our team and to Frozen 2 simply cannot be overstated!</p>
<p>On Frozen 2, I got to play a part on several fun and interesting initiatives!
Hyperion’s modern volume rendering system saw a number of major improvements and advancements for Frozen 2, mostly centered around rendering optically thin volumes.
Hyperion’s modern volume rendering system is <a href="https://blog.yiningkarlli.com/2017/07/spectral-and-decomposition-tracking.html">based on null-collision tracking theory</a> <a href="https://dl.acm.org/citation.cfm?id=3073665">[Kutz et al. 2017]</a>, which is exceptionally well suited for dense volumes dominated by high-order scattering (such as clouds and snow).
However, as anyone with experience developing a volume rendering system knows, optically thin volumes (such as mist and fog) are a major weak point for null-collision techniques.
Wayne was responsible for a number of major advancements that allowed us to efficiently render mist and fog on Frozen 2 using the modern volume rendering system, and Wayne was kind enough to allow me to play something of an advisory / consulting role on that project.
Also, Frozen 2 is the first feature film on which we’ve deployed Hyperion’s path guiding implementation into production; this project was the result of some very tight collaboration between Disney Animation and <a href="https://studios.disneyresearch.com">Disney Research Studios</a>.
Last summer, I worked with Peter Kutz, our summer intern <a href="http://omnigraphica.com">Laura Lediaev</a>, and with <a href="https://research.nvidia.com/person/thomas-mueller">Thomas Müller</a> from ETH Zürich / Disney Research Studios to prototype an implementation of <a href="https://tom94.net/pages/publications/mueller17practical-erratum">Practical Path Guiding</a> <a href="https://doi.org/10.1111/cgf.13227">[Müller et al. 2017]</a> in Hyperion.
Joe Schutte then took on the massive task (as one of his first tasks on the team, no less!) of turning the prototype into a production-quality feature, and Joe worked with Thomas to develop a number of improvements to the original paper <a href="https://tom94.net/data/courses/vorba19guiding/vorba19guiding.pdf">[Müller 2019]</a>.
Finally, I worked on some lighting / shading improvements for Frozen 2, which included developing a new spot light implementation for theatrical lighting, and, with Matt Chiang and Brent Burley, a <a href="https://www.yiningkarlli.com/projects/shadowterminator.html">solution to the long-standing normal / bump mapped shadow terminator problem</a> <a href="https://dl.acm.org/citation.cfm?id=3328172">[Chiang et al. 2019]</a>.
We also benefited from more improvements in our denoising tech <a href="https://doi.org/10.1145/3306307.3328150">[Dahlberg et al. 2019]</a> which arose as a joint effort between our own David Adler, ILM, Pixar and the Disney Research Studios rendering team.</p>
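<p>For readers who haven’t encountered null-collision tracking before, here is a textbook-style sketch of the basic free-flight sampling loop that this family of techniques is built on (generic illustrative Python, not Hyperion code): distances are sampled against a constant majorant density, and each tentative collision is probabilistically classified as a real or a null collision.</p>
<pre><code>import math
import random

def sample_free_flight(density_at, majorant, max_distance):
    # Sample a tentative collision distance against the constant majorant,
    # then probabilistically classify it as a real or a null collision.
    # Returns the distance of a real collision, or None if the ray escapes.
    t = 0.0
    while True:
        t -= math.log(1.0 - random.random()) / majorant   # exponential free flight
        if t >= max_distance:
            return None                                    # escaped the medium
        if density_at(t) >= random.random() * majorant:
            return t                                       # real collision
        # Otherwise this was a null collision; keep marching.

# Example: a thin, smoothly varying density field bounded by its majorant.
density = lambda t: 0.02 * (1.0 + math.sin(t))
hit = sample_free_flight(density, majorant=0.04, max_distance=100.0)
</code></pre>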
<p>I think Frozen projects provide an interesting window into how far rendering has progressed at Disney Animation over the past six years.
We’ve basically had some Frozen project going on every few years, and each Frozen project upon completion has represented the most cutting edge rendering capabilities we’ve had at the time.
The original Frozen in 2013 was the studio’s last project rendered using RenderMan, and also the studio’s last project to not use path tracing.
Frozen Fever in 2015, by contrast, was one of the first projects (alongside Big Hero 6) to use Hyperion and full path traced global illumination.
The jump in visual quality between Frozen and Frozen Fever was enormous, especially considering that they were released only a year and a half apart.
Olaf’s Frozen Adventure, which I’ve <a href="https://blog.yiningkarlli.com/2017/11/olafs-frozen-adventure.html">written about before</a>, served as the testbed for a number of enormous changes and advancements that were made to Hyperion in preparation for Ralph Breaks the Internet.
Frozen 2 represents the full extent of what Hyperion can do today, now that Hyperion is a production-hardened, mature renderer backed by a team that is now very experienced.
The original Frozen looked decent when it first came out, but since it was the last non-path-traced film we made, it looked dated visually just a few years later.
Comparing the original Frozen with Frozen 2 is like night and day; I’m very confident that Frozen 2 will still look visually stunning and hold up well long into the future.
A great example is in all of the clothing in Frozen 2; when watching the film, take a close look at all of the embroidery on all of the garments.
In the original Frozen, a lot of the embroidery work is displacement mapped or even just normal mapped, but in Frozen 2, all of the embroidery is painstakingly constructed from actual geometric curves <a href="https://dl.acm.org/doi/10.1145/3388767.3407360">[Liu et al. 2020]</a>, and as a result every bit of embroidery is rendered in incredible detail!</p>
<p>One particular thing in Frozen 2 that makes me especially happy is how all of the water looks in the film, and especially how the water looks in the dark seas sequence.
On Moana, we really struggled with getting whitewater and foam to look appropriately bright and white.
Since that bright white effect comes from high-order scattering in volumes and at the time we were still using our old volume rendering system that couldn’t handle high-order scattering well, the artists on Moana wound up having to rely on a lot of ingenious trickery to get whitewater and foam to look just okay.
I think Moana is a staggeringly beautiful film, but if you know where to look, you may be able to tell that the foam looks just a tad bit off.
On Frozen 2, however, we were able to do high-order scattering, and as a result, all of the whitewater and foam in the dark seas sequence looks just absolutely amazing.
No spoilers, but all I’ll say is that there’s another part in the movie that isn’t in any trailer where my jaw was just on the floor in terms of water rendering; you’ll know it when you see it.
A similar effect has been done before in a previous CG Disney Animation movie, but the effect in Frozen 2 is on a far grander, far more impressive, far more amazing scale <a href="https://dl.acm.org/doi/10.1145/3388767.3407333">[Tollec et al. 2020]</a>.</p>
<p>In addition to the rendering tech advancements we made on Frozen 2, there are a bunch of other cool technical initiatives that I’d recommend reading about!
Each of our films has its own distinct world and look, and the style requirements on Frozen 2 often required really cool close collaborations between the lighting and look departments and the rendering team; the “Show Yourself” sequence near the end of the film was a great example of the amazing work these collaborations can produce <a href="https://doi.org/10.1145/3388767.3407388">[Sathe et al. 2020]</a>.
Frozen 2 had a lot of characters that were actually complex effects, such as the Wind Spirit <a href="https://dl.acm.org/doi/10.1145/3388767.3407346">[Black et al. 2020]</a> and the Nokk water horse <a href="https://dl.acm.org/doi/10.1145/3388767.3407345">[Hutchins et al. 2020]</a>; these characters required tight collaborations between a whole swath of departments ranging from animation to simulation to look to effects to lighting.
Even the forest setting of the film required new tech advancements; we’ve made plenty of forests before, but integrating huge-scale effects into the forest resulted in some cool new workflows and techniques <a href="https://dl.acm.org/doi/10.1145/3388767.3409320">[Joseph et al. 2020]</a>.</p>
<p>To give a sense of just how gorgeous Frozen 2 looks, below are some stills from the movie, in no particular order, 100% rendered using Hyperion.
If you love seeing cutting edge rendering in action, I strongly encourage going to see Frozen 2 on the biggest screen you can find!
The film has wonderful songs, a fantastic story, and developed, complex, funny characters, and of course there is not a single frame in the movie that isn’t stunningly beautiful.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_40.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_40.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_12.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_12.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_24.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_24.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_37.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_37.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_68.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_68.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_77.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_77.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_54.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_54.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_01.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_01.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_23.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_23.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_43.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_43.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_02.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_27.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_27.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_21.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_21.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_03.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_03.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_04.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_17.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_17.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_22.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_22.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_28.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_28.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_05.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_05.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_41.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_41.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_06.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_06.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_20.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_20.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_11.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_11.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_13.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_13.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_14.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_14.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_15.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_15.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_16.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_16.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_18.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_18.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_19.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_19.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_25.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_25.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_10.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_10.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_26.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_26.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_29.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_29.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_30.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_30.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_07.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_07.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_31.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_31.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_32.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_32.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_08.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_08.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_52.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_52.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_33.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_33.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_34.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_34.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_35.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_35.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_36.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_36.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_63.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_63.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_09.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_09.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_38.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_38.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_39.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_39.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_72.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_72.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_42.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_42.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_60.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_60.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_44.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_44.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_46.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_46.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_47.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_47.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_48.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_48.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_49.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_49.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_50.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_50.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_64.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_64.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_51.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_51.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_45.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_45.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_53.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_53.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_56.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_56.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_57.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_57.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_58.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_58.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_59.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_59.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_61.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_61.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_62.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_62.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_65.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_65.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_66.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_66.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_67.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_67.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_69.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_69.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_71.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_71.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_73.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_73.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_74.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_74.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_75.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_75.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_55.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_55.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_76.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_76.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_70.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_70.jpg" alt="" /></a></p>
<p>Here is the part of the credits with Disney Animation’s rendering team, kindly provided by Disney!
I always encourage sitting through the credits for movies, since everyone in the credits put so much hard work and passion into what you see onscreen, but I especially recommend it for Frozen 2 since there’s also a great post-credits scene.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_credits.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_credits.png" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p><strong>References</strong></p>
<p>Cameron Black, Trent Correy, and Benjamin Fiske. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3407346">Frozen 2: Creating the Wind Spirit</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 22:1-22:2.</p>
<p>Matt Jen-Yuan Chiang, Yining Karl Li, and Brent Burley. 2019. <a href="https://dl.acm.org/citation.cfm?id=3328172">Taming the Shadow Terminator</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 71:1-71:2.</p>
<p>Henrik Dahlberg, David Adler, and Jeremy Newlin. 2019. <a href="https://dl.acm.org/citation.cfm?id=3328150">Machine-Learning Denoising in Feature Film Production</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 21:1-21:2.</p>
<p>David Hutchins, Cameron Black, Marc Bryant, Richard Lehmann, and Svetla Radivoeva. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3407345">“Frozen 2”: Creating the Water Horse </a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 23:1-23:2.</p>
<p>Norman Moses Joseph, Vijoy Gaddipati, Benjamin Fiske, Marie Tollec, and Tad Miller. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3409320">Frozen 2: Effects Vegetation Pipeline</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 7:1-7:2.</p>
<p>Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. <a href="https://doi.org/10.1145/3072959.3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 111:1-111:16.</p>
<p>Ying Liu, Jared Wright, and Alexander Alvarado. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3407360">Making Beautiful Embroidery for “Frozen 2”</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 73:1-73:2.</p>
<p>Thomas Müller. <a href="https://cgg.mff.cuni.cz/~jaroslav/papers/2019-path-guiding-course/index.htm">Practical Path Guiding in Production</a>. 2019. In <em>ACM SIGGRAPH 2019 Course Notes: <a href="https://cgg.mff.cuni.cz/~jaroslav/papers/2019-path-guiding-course/index.htm">Path Guiding in Production</a></em>. 37-50.</p>
<p>Thomas Müller, Markus Gross, and Jan Novák. 2017. <a href="https://doi.org/10.1111/cgf.13227">Practical Path Guiding for Efficient Light-Transport Simulation</a>. <em>Computer Graphics Forum</em>. 36, 4 (2017), 91-100.</p>
<p>Amol Sathe, Lance Summers, Matt Jen-Yuan Chiang, and James Newland. 2020. <a href="https://doi.org/10.1145/3388767.3407388">The Look and Lighting of “Show Yourself” in “Frozen 2”</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 71:1-71:2.</p>
<p>Marie Tollec, Sean Jenkins, Lance Summers, and Charles Cunningham-Scott. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3407333">Deconstructing Destruction: Making and Breaking of ”Frozen 2”’s Dam</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 24:1-24:2.</p>
https://blog.yiningkarlli.com/2019/08/taming-the-shadow-terminator.html
SIGGRAPH 2019 Talk- Taming the Shadow Terminator
2019-08-01T00:00:00+00:00
2019-08-01T00:00:00+00:00
Yining Karl Li
<p>This year at SIGGRAPH 2019, Matt Jen-Yuan Chiang, Brent Burley, and I had a talk that presents a technique for smoothing out the harsh shadow terminator problem that often arises when high-frequency bump or normal mapping is used in ray tracing.
We developed this technique as part of general development on <a href="https://www.disneyanimation.com/technology/innovations/hyperion">Disney’s Hyperion Renderer</a> for the production of Frozen 2.
This work is mostly Matt’s; Matt was very kind in allowing me to help out and play a small role on this project.</p>
<p>This work is contemporaneous with the recent work on the same shadow terminator problem that was <a href="https://link.springer.com/chapter/10.1007/978-1-4842-4427-2_12">carried out by Estevez et al. from Sony Pictures Imageworks</a> and published in <a href="https://www.realtimerendering.com/raytracinggems/">Ray Tracing Gems</a>.
We actually found out about the Estevez et al. technique at almost exactly the same time that we submitted our SIGGRAPH talk, which proved to be very fortunate, since after our talk was accepted, we were then able to update our short paper with additional comparisons between Estevez et al. and our technique.
I think this is a great example of how having multiple rendering teams in the field tackling similar problems and sharing results provides a huge benefit to the field as a whole; we now have two different, really good solutions to what used to be a big shading problem!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Aug/header.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Aug/preview/header.jpg" alt="A higher-res version of Figure 1 from the paper: (left) <a href="https://blog.yiningkarlli.com/content/images/2019/Aug/header_shadingnormals.png">shading normals</a> exhibiting the harsh shadow terminator problem, (center) <a href="https://blog.yiningkarlli.com/content/images/2019/Aug/header_chiang.png">our technique</a>, and (right) <a href="https://blog.yiningkarlli.com/content/images/2019/Aug/header_estevez.png">Estevez et al.'s technique</a>." /></a></p>
<p>Here is the paper abstract:</p>
<p><em>A longstanding problem with the use of shading normals is the discontinuity introduced into the cosine falloff where part of the hemisphere around the shading normal falls below the geometric surface.
Our solution is to add a geometrically derived shadowing function that adds minimal additional shadowing while falling smoothly to zero at the terminator.
Our shadowing function is simple, robust, efficient and production proven.</em></p>
<p>The paper and related materials can be found at:</p>
<ul>
<li><a href="https://www.yiningkarlli.com/projects/shadowterminator.html">Project Page (Author’s Version and Presentation Slides)</a></li>
<li><a href="https://dl.acm.org/doi/10.1145/3306307.3328172">Official Print Version (ACM Library)</a></li>
</ul>
<p>Matt Chiang presented the paper at SIGGRAPH 2019 in Los Angeles as part of the “Lucy in the Sky with Diamonds - Processing Visuals” Talks session.
A PDF version of the presentation slides, along with presenter notes, is available on my project page for the paper.
I’d also recommend getting the author’s version of the short paper instead of the official version, since the author’s version includes some typo fixes made after the official version was published.</p>
<p>Work on this project started early in the production of Frozen 2, when our look artists started to develop the shading of the dresses and costumes in Frozen 2.
Because intricate woven fabrics and patterns are an important part of the Scandinavian culture that Frozen 2 is inspired by, the shading in Frozen 2 pushed high-resolution, high-frequency displacement and normal mapping further than we ever had before with Hyperion in order to make convincing-looking textiles.
Because of how high-frequency the normal mapping was pushed, the bump/normal mapped shadow terminator problem became worse and worse and proved to be a major pain point for our look and lighting artists.
In the past, our look and lighting artists have worked around shadow terminator issues using a combination of techniques, such as falling back to full displacement, or using larger area lights to try to soften the shadow terminator.
However, these techniques can be problematic when they are in conflict with art direction, and force artists to think about an additional technical dimension when they otherwise would rather be focused on the artistry.</p>
<p>Our search for a solution began with Peter Kutz looking at <a href="https://dl.acm.org/doi/10.1145/3130800.3130806">“Microfacet-based Normal Mapping for Robust Monte Carlo Path Tracing” by Schüssler et al.</a>, which focused on addressing energy loss when rendering shading normals.
The Schüssler et al. 2017 technique solved the energy loss problem by constructing a microfacet surface made up of <em>two</em> facets per shading point, instead of the usual one.
The secondary facet is used to account for things like inter-reflections between the primary and secondary facets.
However, the Schüssler et al. 2017 technique wound up not solving the shadow terminator problems we were facing; using their shadowing function produced a look that was too flat.</p>
<p>Matt Chiang then realized that the secondary microfacet approach could be used to solve the shadow terminator problem using a different secondary microfacet configuration; instead of using a vertical second facet as in Schüssler, Matt made the secondary facet perpendicular to the shading normal.
By making the secondary facet perpendicular, as a light source slowly moves towards the grazing angle relative to the microfacet surface, peak brightness is maintained when the light is parallel to the shading normal, while additional shadowing is introduced beyond the parallel angle.
This solution worked extremely well, and is the technique presented in our talk / short paper.</p>
<p>The final piece of the puzzle was addressing a visual discontinuity produced by Matt’s technique when the light direction reaches and moves beyond the shading normal.
Instead of falling smoothly to zero, the shape of the shadow terminator undergoes a hard shift from a cosine fall-off, formed by the dot product of the shading normal and light direction, to a linear fall-off.
Matt and I played with a number of different interpolation schemes to smooth out this transition, and eventually settled on a custom smooth-step function.
During this process, I made the observation that whatever blending function we used needed to introduce C1 continuity in order to remove the visual discontinuity.
This observation led Brent Burley to realize that instead of a complex custom smooth-step function, a simple Hermite interpolation would be enough; this Hermite interpolation is the one presented in the talk / short paper.</p>
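<p>To make the shape of the final shadowing term a little more concrete, here is a small illustrative sketch of the kind of function described above: a geometrically derived shadowing factor built from the geometric normal, the shading normal, and the light direction, eased into 1 with a cubic Hermite so that the transition at the shading normal is C1-continuous. This is my own shorthand reconstruction from the description in this post, not the verbatim published formulation; see the short paper and slides for the real thing:</p>
<pre><code>import numpy as np

def hermite_blend(g):
    # Cubic with f(0)=0, f(1)=1, f'(0)=1, f'(1)=0, so the extra shadowing
    # eases into "no extra shadowing" with zero slope instead of a hard kink.
    return -g**3 + g**2 + g

def terminator_shadowing(n_geom, n_shade, light_dir):
    # Extra shadowing factor in [0, 1], multiplied on top of the BSDF result.
    ng_l = max(float(np.dot(n_geom, light_dir)), 0.0)
    ns_l = max(float(np.dot(n_shade, light_dir)), 0.0)
    ng_ns = max(float(np.dot(n_geom, n_shade)), 0.0)
    g = min(1.0, ng_l / max(ns_l * ng_ns, 1e-6))
    return hermite_blend(g)
</code></pre>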
<p>For a much more in-depth view at all of the above, complete with diagrams and figures and examples, I highly recommend looking at Matt’s presentation slides and presenter notes.</p>
<p>Here is a test render of the Iduna character’s costume from Frozen 2, from before we had this technique implemented in Hyperion.
The harsh shadow terminator produces an illusion that makes her arms and torso look boxier than the actual underlying geometry is:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Aug/iduna_hardterminator.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Aug/iduna_hardterminator.png" alt="Iduna's costume without our shadow terminator technique. Note how boxy the arms and torso look." /></a></p>
<p>…and here is the same test render, but now with our soft shadow terminator fix implemented and enabled.
Note how her arms and torso now look properly rounded, instead of boxy!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Aug/iduna_softterminator.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Aug/iduna_softterminator.png" alt="Iduna's costume with our shadow terminator technique. The arms and torso look correctly rounded now." /></a></p>
<p>This technique is now enabled by default across the board in Hyperion, and any article of clothing or costume you see in Frozen 2 is using this technique.
So, through this project, we got to play a small role in making Elsa, Anna, Kristoff, and everyone else look like themselves!</p>
https://blog.yiningkarlli.com/2019/07/hyperion-papers.html
Hyperion Publications
2019-07-30T00:00:00+00:00
2019-07-30T00:00:00+00:00
Yining Karl Li
<p>Every year at SIGGRAPH (and sometimes at other points in the year), members of the Hyperion team inevitably get asked if there is any publicly available information about <a href="https://www.disneyanimation.com/technology/hyperion/">Disney’s Hyperion Renderer</a>.
The answer is: yes, there is actually a lot of publicly available information!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Jul/FirstPagesv5.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Jul/FirstPagesv5_prev.png" alt="Figure 1: Previews of the first page of every Hyperion-related publication from Disney Animation, Disney Research Studios, and other research partners." /></a></p>
<p>One amazing aspect of working at Walt Disney Animation Studios is the huge amount of support and encouragement we get from our managers and the wider studio for publishing and sharing our work with the wider academic world and industry.
As part of this sharing, the Hyperion team has had the opportunity to publish a number of papers over the years detailing various interesting techniques used in the renderer.</p>
<p>I think it’s very important to mention here that another one of my favorite aspects of working on the Hyperion team is the deep collaboration we get to engage in with our sister rendering team at <a href="https://studios.disneyresearch.com">Disney Research Studios</a> (formerly known as Disney Research Zürich).
The vast majority of the Hyperion team’s publications are joint works with Disney Research Studios, and I personally think it’s fair to say that all of Hyperion’s most interesting advanced features are just as much the result of research and work from Disney Research Studios as they are the result of our team’s own work.
Without a doubt, Hyperion, and by extension, our movies, would not be what they are today without Disney Research Studios.
Of course, we also collaborate closely with our sister rendering teams at <a href="https://www.pixar.com">Pixar Animation Studios</a> and <a href="https://www.ilm.com">Industrial Light & Magic</a> as well, and there are numerous examples where collaboration between all of these teams has advanced the state of the art in rendering for the whole industry.</p>
<p>So without further ado, below are all of the papers that the Hyperion team has published or worked on or had involvement with over the years, either by ourselves or with our counterparts at Disney Research Studios, Pixar, ILM, and other research groups.
If you’ve ever been curious to learn more about Disney’s Hyperion Renderer, here are 43 publications with a combined 441 pages of material!
For each paper, I’ll link to a preprint version, link to the official publisher’s version, and link any additional relevant resources for the paper.
I’ll also give the citation information, give a brief description, list the teams involved, and note how the paper is relevant to Hyperion.
This post is meant to be a living document; I’ll come back and update it down the line with future publications. Publications are listed in chronological order.</p>
<ol>
<li>
<p><strong>Ptex: Per-Face Texture Mapping for Production Rendering</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a> and <a href="https://www.linkedin.com/in/dylanlacewell/">Dylan Lacewell</a>. Ptex: Per-face Texture Mapping for Production Rendering. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2008)</em>, 27(4), June 2008.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1EdMYHhs4h_ICcSgGfA4GzZoRNI_yVryA">Preprint Version</a></li>
<li><a href="https://doi.org/10.1111/j.1467-8659.2008.01253.x">Official Publisher’s Version</a></li>
<li><a href="http://ptex.us">Open Source Project</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes per-face textures, a UV-free way of texture mapping. Ptex is the texturing system used in Hyperion for all non-procedural texture maps. Every Disney Animation film made using Hyperion is textured entirely with Ptex. Ptex is also available in many commercial renderers, such as <a href="https://renderman.pixar.com">Pixar’s RenderMan</a>!</p>
</li>
<li>
<p><strong>Physically-Based Shading at Disney</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Physically Based Shading at Disney. In <em>ACM SIGGRAPH 2012 Course Notes: Practical Physically-Based Shading in Film and Game Production</em>, August 2012.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1SwEWQadyMPo5m49kIACoFq2R6q0bZJz7">Preprint Version</a> (Updated compared to official version)</li>
<li><a href="https://doi.org/10.1145/2343483.2343493">Official Publisher’s Version</a></li>
<li><a href="https://blog.selfshadow.com/publications/s2012-shading-course/">Physically Based Shading SIGGRAPH 2012 Course</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the Disney BRDF, a physically principled shading model with an artist-friendly parameterization and layering system. The Disney BRDF is the basis of Hyperion’s entire shading system. The Disney BRDF has also gained widespread industry adoption as the basis of a wide variety of physically based shading systems, and has influenced the design of shading systems in a number of other production renderers. Every Disney Animation film made using Hyperion is shaded using the Disney BSDF (an extended version of the Disney BRDF, described in a later paper).</p>
</li>
<li>
<p><strong>Sorted Deferred Shading for Production Path Tracing</strong></p>
<p><a href="https://www.linkedin.com/in/christian-eisenacher-477ab983/">Christian Eisenacher</a>, <a href="https://www.linkedin.com/in/gregory-nichols/">Gregory Nichols</a>, <a href="http://www.andyselle.com">Andrew Selle</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Sorted Deferred Shading for Production Path Tracing. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2013)</em>, 32(4), June 2013.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1zha14cniwtvy8Xkn2Jv9jE5Y1T50VSJS">Preprint Version</a></li>
<li><a href="https://doi.org/10.1111/cgf.12158">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. Won the Best Paper Award at EGSR 2013! This paper describes the sorted deferred shading architecture that is at the very core of Hyperion. Along with the previous two papers in this list, this is one of the foundational papers for Hyperion; every film rendered using Hyperion is rendered using this architecture.</p>
</li>
<li>
<p><strong>Residual Ratio Tracking for Estimating Attenuation in Participating Media</strong></p>
<p><a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, <a href="http://www.andyselle.com">Andrew Selle</a>, and <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>. Residual Ratio Tracking for Estimating Attenuation in Participating Media. <em>ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2014)</em>, 33(6), November 2014.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1b1RkW3eFAgM-i6IZ_m0jmQcPfr7cOLEz">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2661229.2661292">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/RRTracking/">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper described a pair of new, complementary techniques for evaluating transmittance in heterogeneous volumes. These two techniques made up the core of Hyperion’s first and second generation volume rendering implementations, used from <em>Big Hero 6</em> up through <em>Moana</em>.</p>
</li>
<li>
<p><strong>Visualizing Building Interiors using Virtual Windows</strong></p>
<p><a href="https://www.linkedin.com/in/normanmosesjoseph/">Norman Moses Joseph</a>, <a href="https://www.imdb.com/name/nm0009853/">Brett Achorn</a>, <a href="https://www.linkedin.com/in/sean-jenkins-a1352062/">Sean D. Jenkins</a>, and <a href="https://www.linkedin.com/in/hank-driskill-1a7140165/">Hank Driskill</a>. Visualizing Building Interiors using Virtual Windows. In <em>ACM SIGGRAPH Asia 2014 Technical Briefs</em>, December 2014.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1ugDBwIxmYKGCMOyfNSF2fwMhRX6BjR_g">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2669024.2669029">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes Hyperion’s “hologram shader”, which is used for creating the appearance of parallaxed, fully shaded, detailed building interiors without adding additional geometric complexity to a scene. This technique was developed for <em>Big Hero 6</em>. Be sure to check out the supplemental materials on the publisher site for a cool video breakdown of the technique.</p>
</li>
<li>
<p><strong>Path-space Motion Estimation and Decomposition for Robust Animation Filtering</strong></p>
<p><a href="https://graphics.ethz.ch/~hzimmer/">Henning Zimmer</a>, <a href="https://research.nvidia.com/person/fabrice-rousselle">Fabrice Rousselle</a>, <a href="http://rgl.epfl.ch/people/wjakob/">Wenzel Jakob</a>, <a href="http://zurich.disneyresearch.com/~owang/">Oliver Wang</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>, <a href="http://igl.ethz.ch/people/sorkine/">Olga Sorkine-Hornung</a>, and <a href="http://www.ahornung.net/">Alexander Sorkine-Hornung</a>. Path-space Motion Estimation and Decomposition for Robust Animation Filtering. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2015)</em>, 34(4), June 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=19Me6xkA9jBIlydMMgEeC9Uor93MqBhtW">Preprint Version</a></li>
<li><a href="http://doi.org/10.1111/cgf.12685">Official Publisher’s Version</a></li>
<li><a href="https://cs.dartmouth.edu/~wjarosz/publications/zimmer15path.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios, ETH Zürich, and Disney Animation. This paper describes a denoising technique suitable for animated sequences. Not directly used in Hyperion’s denoiser, but both inspired by and influential towards Hyperion’s first generation denoiser.</p>
</li>
<li>
<p><strong>Portal-Masked Environment Map Sampling</strong></p>
<p><a href="https://benedikt-bitterli.me">Benedikt Bitterli</a>, <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, and <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>. Portal-Masked Environment Map Sampling. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2015)</em>, 34(4), June 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=14zzjee1MAhUPsQ2vUzPQyZFo-J8Gud7b">Preprint Version</a></li>
<li><a href="http://doi.org/10.1111/cgf.12674">Official Publisher’s Version</a></li>
<li><a href="https://benedikt-bitterli.me/pmems.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper describes an efficient method for importance sampling environment maps. This paper was actually derived from the technique Hyperion uses for importance sampling lights with IES profiles, which has been used on all films rendered using Hyperion.</p>
</li>
<li>
<p><strong>A Practical and Controllable Hair and Fur Model for Production Path Tracing</strong></p>
<p><a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://benedikt-bitterli.me">Benedikt Bitterli</a>, <a href="https://www.linkedin.com/in/chuck-tappan-40762450/">Chuck Tappan</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. A Practical and Controllable Hair and Fur Model for Production Path Tracing. In <em>ACM SIGGRAPH 2015 Talks</em>, August 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=19k6mnZMJXmgDSwy1Hcb7fFdMstALjTto">Preprint Version</a></li>
<li><a href="http://doi.org/10.1145/2775280.2792559">Official Publisher’s Version</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This short paper gives an overview of Hyperion’s fur and hair model, originally developed for use on <em>Zootopia</em>; a full paper with more details was published later. This model remains Hyperion’s fur/hair model today, used on every film from <em>Zootopia</em> to present.</p>
</li>
<li>
<p><strong>Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering. In <em>ACM SIGGRAPH 2015 Course Notes: Physically Based Shading in Theory and Practice</em>, August 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1KJgmVRZqEI7rCdSSeT6_lZJerTTQ0AiH">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2776880.2787670">Official Publisher’s Version</a></li>
<li><a href="https://blog.selfshadow.com/publications/s2015-shading-course">Physically Based Shading SIGGRAPH 2015 Course</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the full Disney BSDF (sometimes referred to in the wider industry as Disney BRDF v2) used in Hyperion, and also describes a novel subsurface scattering technique called normalized diffusion subsurface scattering. The Disney BSDF is the shading model for everything ever rendered using Hyperion, and normalized diffusion was Hyperion’s subsurface model from <em>Big Hero 6</em> up through <em>Moana</em>. For a public open-source implementation of the Disney BSDF, check out <a href="https://github.com/mmp/pbrt-v3">PBRT v3</a>’s implementation. Also, check out <a href="https://renderman.pixar.com">Pixar’s RenderMan</a> for an implementation in a commercial renderer!</p>
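<p>For a concrete sense of what normalized diffusion looks like, here is the core reflectance profile (albedo factor omitted) as a tiny Python function; this is my own sketch for illustration, not Hyperion’s code, and <code>d</code> is the per-channel shaping distance:</p>
<pre><code>import math

def normalized_diffusion_R(r, d):
    # Normalized diffusion reflectance profile, albedo factor omitted.
    # Assumes r greater than 0; the 1/r singularity at the origin is
    # integrable, and the full profile integrates to 1 over the plane,
    # which is what "normalized" refers to.
    return (math.exp(-r / d) + math.exp(-r / (3.0 * d))) / (8.0 * math.pi * d * r)
</code></pre>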
</li>
<li>
<p><strong>Approximate Reflectance Profiles for Efficient Subsurface Scattering</strong></p>
<p><a href="https://www.seanet.com/~myandper/per.htm">Per H Christensen</a> and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Approximate Reflectance Profiles for Efficient Subsurface Scattering. <em>Pixar Technical Memo</em>, #15-04, August 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1kJfJId-I5DjhUnHH-Q6fgsrNc7ZW1MIq">Preprint Version</a></li>
<li><a href="http://graphics.pixar.com/library/ApproxBSSRDF/">Official Pixar Research Version and Project Page</a></li>
<li><a href="https://www.seanet.com/~myandper/abstract/memo1504.htm">Updates and Errata</a></li>
</ul>
<p>Joint project between Pixar and Disney Animation. This paper presents several useful parameterizations for the normalized diffusion subsurface scattering model presented in the previous paper in this list. These parameterizations are used for the normalized diffusion implementation in <a href="https://rmanwiki.pixar.com/display/REN/PxrSurface">Pixar’s RenderMan 21</a> and later.</p>
</li>
<li>
<p><strong>Big Hero 6: Into the Portal</strong></p>
<p><a href="https://www.linkedin.com/in/david-hutchins-21a9507/">David Hutchins</a>, <a href="https://www.linkedin.com/in/olun-riley/">Olun Riley</a>, <a href="https://www.linkedin.com/in/popsopdop/">Jesse Erickson</a>, <a href="http://alexey.stomakhin.com">Alexey Stomakhin</a>, <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, and <a href="https://www.linkedin.com/in/michael-kaschalk-49b7683/">Michael Kaschalk</a>. Big Hero 6: Into the Portal. In <em>ACM SIGGRAPH 2015 Talks</em>, August 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1cCDmWf6pKDaIarDRK0YkARhm5_kQlF4_">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2775280.2792521">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This short paper describes some interesting volume rendering challenges that Hyperion faced during the production of <em>Big Hero 6</em>’s climax sequence, set in a volumetric fractal portal world.</p>
</li>
<li>
<p><strong>Level-of-Detail for Production-Scale Path Tracing</strong></p>
<p><a href="https://www.lgdv.tf.fau.de/person/magdalena-martinek">Magdalena Martinek</a>, <a href="https://www.linkedin.com/in/christian-eisenacher-477ab983/">Christian Eisenacher</a>, and <a href="https://www.lgdv.tf.fau.de/person/marc-stamminger/">Marc Stamminger</a>. Level-of-Detail for Production-Scale Path Tracing. In <em>VMV 2015: Proceedings of the 20th International Symposium on Vision, Modeling, and Visualization</em>, October 2015.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1Z5OFw1liYDwV9-w-SngKnXBEqyFsEHeh/view?usp=sharing">Preprint Version</a></li>
<li><a href="https://doi.org/10.2312/vmv.20151260">Official Publisher’s Version</a></li>
</ul>
<p>Joint project between Disney Animation and the University of Erlangen-Nuremberg. This paper gives an overview of an SVO-based level-of-detail system for use in production path tracing. This system was originally prototyped in an early version of Hyperion and informed the automatic shading level-of-detail system that was used on <em>Big Hero 6</em>; automatic shading level-of-detail has since been removed from Hyperion.</p>
</li>
<li>
<p><strong>A Practical and Controllable Hair and Fur Model for Production Path Tracing</strong></p>
<p><a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://benedikt-bitterli.me">Benedikt Bitterli</a>, <a href="https://www.linkedin.com/in/chuck-tappan-40762450/">Chuck Tappan</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. A Practical and Controllable Hair and Fur Model for Production Path Tracing. <em>Computer Graphics Forum (Proceedings of Eurographics 2016)</em>, 35(2), May 2016.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1cVxBWddi2yClj_A_bca_emRduPJ6GN8Q">Preprint Version</a></li>
<li><a href="https://doi.org/10.1111/cgf.12830">Official Publisher’s Version</a></li>
<li><a href="https://benedikt-bitterli.me/pchfm/">Project Page</a></li>
<li><a href="https://www.pbrt.org/hair.pdf">Implementation Guide by Matt Pharr</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper gives an overview of Hyperion’s fur and hair model, originally developed for use on <em>Zootopia</em>. This fur/hair model is Hyperion’s fur/hair model today, used on every film beginning with <em>Zootopia</em> to present. This paper is now also implemented in the open source <a href="https://github.com/mmp/pbrt-v3/blob/master/src/materials/hair.h">PBRT v3</a> renderer, and also serves as the basis of the hair/fur shader in Chaos Group’s <a href="https://www.chaosgroup.com/blog/v-ray-next-the-science-behind-the-new-hair-shader">V-Ray Next</a> renderer.</p>
</li>
<li>
<p><strong>Subdivision Next-Event Estimation for Path-Traced Subsurface Scattering</strong></p>
<p><a href="https://www.linkedin.com/in/david-koerner-41233611">David Koerner</a>, <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, <a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, and <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>. Subdivision Next-Event Estimation for Path-Traced Subsurface Scattering. In <em>Proceedings of EGSR 2016, Experimental Ideas & Implementations</em>, June 2016.
2016-06-24,</p>
<ul>
<li><a href="https://drive.google.com/open?id=1iMwNqPr-l-_xTViWqxXYIuP8S_he7t8k">Preprint Version</a></li>
<li><a href="https://doi.org/10.2312/sre.20161214">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/SNEE/index.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios, University of Stuttgart, Dartmouth College, and Disney Animation. This paper describes a method for accelerating brute force path traced subsurface scattering; this technique was developed during early experimentation in making path traced subsurface scattering practical for production in Hyperion.</p>
</li>
<li>
<p><strong>Nonlinearly Weighted First-Order Regression for Denoising Monte Carlo Renderings</strong></p>
<p><a href="https://benedikt-bitterli.me">Benedikt Bitterli</a>, <a href="https://research.nvidia.com/person/fabrice-rousselle">Fabrice Rousselle</a>, <a href="http://sglab.kaist.ac.kr/~bcmoon/">Bochang Moon</a>, <a href="http://www.j4lley.com/">José A. Iglesias-Guitian</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="http://www.disneyresearch.com/people/kenny-mitchel/">Kenny Mitchell</a>, <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>, and <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>. Nonlinearly Weighted First-Order Regression for Denoising Monte Carlo Renderings. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2016)</em>, 35(4), July 2016.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1cwtHef8gq5m-oKbc2yKDY3jwnbJB1iLQ">Preprint Version</a></li>
<li><a href="https://doi.org/10.1111/cgf.12954">Official Publisher’s Version</a></li>
<li><a href="https://benedikt-bitterli.me/nfor/">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios, Edinburgh Napier University, Dartmouth College, and Disney Animation. This paper describes a high-quality, stable denoising technique based on a thorough analysis of previous techniques. This technique came out of a larger project to develop a state-of-the-art successor to Hyperion’s first generation denoiser.</p>
</li>
<li>
<p><strong>Practical and Controllable Subsurface Scattering for Production Path Tracing</strong></p>
<p><a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Practical and Controllable Subsurface Scattering for Production Path Tracing. In <em>ACM SIGGRAPH 2016 Talks</em>, July 2016.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1YzdsAbG60dCUkq6xo_HH8nBseILfHfZW">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2897839.2927433">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This short paper describes the novel parameterization and multi-wavelength sampling strategy used to make path traced subsurface scattering practical for production. Both of these are implemented in Hyperion’s path traced subsurface scattering system and have been in use on all shows beginning with <em>Olaf’s Frozen Adventure</em> to present.</p>
</li>
<li>
<p><strong>Efficient Rendering of Heterogeneous Polydisperse Granular Media</strong></p>
<p><a href="https://tom94.net">Thomas Müller</a>, <a href="https://graphics.ethz.ch/~mpapas/">Marios Papas</a>, <a href="https://la.disneyresearch.com/people/markus-gross/">Markus Gross</a>, <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>, and <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>. Efficient Rendering of Heterogeneous Polydisperse Granular Media. <em>ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2016)</em>, 35(6), November 2016.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1qFwr6_JL29uextdtyNurOFQId0CahvVc">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2980179.2982429">Official Publisher’s Version</a></li>
<li><a href="https://cs.dartmouth.edu/~wjarosz/publications/muller16efficient.html">Project Page</a></li>
</ul>
<p>External project from Disney Research Studios, ETH Zürich, and Dartmouth College, inspired in part by production problems encountered at Disney Animation related to rendering things like sand, snow, etc. This technique uses shell transport functions to accelerate path traced rendering of massive assemblies of grains. <a href="https://tom94.net">Thomas Müller</a> implemented an experimental version of this technique in Hyperion, along with an interesting extension for applying the shell transport theory to volume rendering.</p>
</li>
<li>
<p><strong>Practical Path Guiding for Efficient Light-Transport Simulation</strong></p>
<p><a href="https://tom94.net">Thomas Müller</a>, <a href="https://la.disneyresearch.com/people/markus-gross/">Markus Gross</a>, and <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>. Practical Path Guiding for Efficient Light-Transport Simulation. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2017)</em>, 36(4), July 2017.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1xJeK76y7BjWHMHpIL31o08f9eJzytGNU">Preprint Version</a> (Updated compared to official version)</li>
<li><a href="https://doi.org/10.1111/cgf.13227">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/PathGuide/index.html">Project Page</a></li>
</ul>
<p>External joint project between Disney Research Studios and ETH Zürich, inspired in part by challenges with handling complex light transport efficiently in Hyperion. Won the Best Paper Award at EGSR 2017! This paper describes a robust, unbiased technique for progressively learning complex indirect illumination in a scene during a render and intelligently guiding paths to better sample difficult indirect illumination effects. Implemented in Hyperion, along with a number of interesting improvements documented in a later paper. In use on <em>Frozen 2</em> and future films.</p>
</li>
<li>
<p><strong>Kernel-predicting Convolutional Networks for Denoising Monte Carlo Renderings</strong></p>
<p><a href="http://www.ece.ucsb.edu/~sbako/">Steve Bako</a>, <a href="https://tvogels.nl/">Thijs Vogels</a>, <a href="https://www.inf.ethz.ch/personal/mcbrian/">Brian McWilliams</a>, <a href="http://graphics.pixar.com/people/mmeyer/">Mark Meyer</a>, <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, <a href="https://graphics.pixar.com/library/indexAuthorAlex_Harvill.html">Alex Harvill</a>, <a href="http://www.ece.ucsb.edu/~psen/">Pradeep Sen</a>, <a href="http://graphics.pixar.com/people/derose/">Tony DeRose</a>, and <a href="https://research.nvidia.com/person/fabrice-rousselle">Fabrice Rousselle</a>. Kernel-predicting Convolutional Networks for Denoising Monte Carlo Renderings. <em>ACM Transactions on Graphics (Proceedings of SIGGRAPH 2017)</em>, 36(4), July 2017.</p>
<ul>
<li><a href="https://drive.google.com/open?id=18jrs2MPiZ5UUqNiSlrerzzb5JX1ba_zM">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3072959.3073708">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/KPCN/index.html">Project Page</a></li>
</ul>
<p>External joint project between University of California Santa Barbara, Disney Research Studios, ETH Zürich, and Pixar, carried out as part of the larger effort to develop a successor to Hyperion’s first generation denoiser. This paper describes a supervised learning approach for denoising filter kernels using deep convolutional neural networks. This technique is the basis of the modern Disney-Research-developed second generation deep-learning denoiser in use by the rendering teams at Pixar and ILM, and by the Hyperion team at Disney Animation.</p>
</li>
<li>
<p><strong>Production Volume Rendering</strong></p>
<p><a href="https://www.linkedin.com/in/jfong">Julian Fong</a>, <a href="http://magnuswrenninge.com">Magnus Wrenninge</a>, <a href="https://fpsunflower.github.io/ckulla/">Christopher Kulla</a>, and <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>. Production Volume Rendering. In <em>ACM SIGGRAPH 2017 Courses</em>, July 2017.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1eFr_4IKzt796Ns4Iv3OjR3ni0Y7QigP5/view?usp=drivesdk">Preprint Version</a> (Updated compared to official version)</li>
<li><a href="https://doi.org/10.1145/3084873.3084907">Official Publisher’s Version</a></li>
<li><a href="https://graphics.pixar.com/library/ProductionVolumeRendering/index.html">Production Volume Rendering SIGGRAPH 2017 Course</a></li>
</ul>
<p>Joint publication from Pixar, Sony Pictures Imageworks, and Disney Animation. This course covers volume rendering in modern path tracing renderers, from basic theory all the way to practice. The last chapter details the inner workings of Hyperion’s first and second generation transmittance estimation based volume rendering system, used from <em>Big Hero 6</em> up through <em>Moana</em>.</p>
</li>
<li>
<p><strong>Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</strong></p>
<p><a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>. Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes. <em>ACM Transactions on Graphics (Proceedings of SIGGRAPH 2017)</em>, 36(4), July 2017.</p>
<ul>
<li><a href="https://drive.google.com/file/d/198A1h93ZE7SuKEidx7FCwspJkAWqYG7e/view?usp=drivesdk">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3072959.3073665">Official Publisher’s Version</a></li>
<li><a href="https://www.yiningkarlli.com/projects/specdecomptracking.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper describes two complementary new null-collision tracking techniques: decomposition tracking and spectral tracking. The paper also introduces to computer graphics an extended integral formulation of null-collision algorithms, originally developed in the field of reactor physics. These two techniques are the basis of Hyperion’s modern third generation null-collision tracking based volume rendering system, in use beginning on <em>Olaf’s Frozen Adventure</em> to present.</p>
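<p>As background for the null-collision framework that decomposition and spectral tracking extend, here is a minimal single-wavelength delta-tracking distance sampler. This is my own illustrative Python, not the paper’s estimators; <code>sigma_t_at</code> and <code>sigma_majorant</code> are hypothetical stand-ins for the heterogeneous extinction lookup and its bounding majorant:</p>
<pre><code>import math, random

def delta_tracking_distance(sigma_t_at, sigma_majorant, t_max):
    # March through the medium with majorant-distributed free-flight jumps;
    # each tentative collision is accepted as a real collision with
    # probability sigma_t / sigma_majorant, otherwise it is a null collision.
    # Decomposition tracking and spectral tracking can be thought of as
    # lower-variance and spectral generalizations of this basic loop.
    t = 0.0
    while True:
        t -= math.log(1.0 - random.random()) / sigma_majorant
        if t >= t_max:
            return None  # the ray escapes the medium without a real collision
        if sigma_t_at(t) > random.random() * sigma_majorant:
            return t     # real collision at distance t
        # otherwise: null collision, keep marching
</code></pre>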
</li>
<li>
<p><strong>The Ocean and Water Pipeline of Disney’s Moana</strong></p>
<p><a href="https://www.linkedin.com/in/seanpalmer/">Sean Palmer</a>, <a href="https://www.imdb.com/name/nm3376120/">Jonathan Garcia</a>, <a href="https://www.linkedin.com/in/sara-drakeley-37290/">Sara Drakeley</a>, <a href="https://www.linkedin.com/in/patrick-kelly-1424b86/">Patrick Kelly</a>, and <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>. The Ocean and Water Pipeline of Disney’s Moana. In <em>ACM SIGGRAPH 2017 Talks</em>, July 2017.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1q4dum1dBhKTBK6fDqiIX-Bm9JrppAamm/view?usp=drivesdk">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3084363.3085067">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This short paper describes the water pipeline developed for <em>Moana</em>, including the level set compositing and rendering system that was implemented in Hyperion. This system has since found additional usage on shows since <em>Moana</em>.</p>
</li>
<li>
<p><strong>Recent Advancements in Disney’s Hyperion Renderer</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, <a href="https://www.linkedin.com/in/patrick-kelly-1424b86/">Patrick Kelly</a>, <a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="https://www.linkedin.com/in/daniel-teece-2650358/">Daniel Teece</a>. Recent Advancements in Disney’s Hyperion Renderer. In <em>ACM SIGGRAPH 2017 Course Notes: Path Tracing in Production Part 1</em>, August 2017.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1kFpp_7I8vH8LHsf1Si94pqMkHxwinMSU/view?usp=drivesdk">Preprint Version</a> (Updated compared to official version)</li>
<li><a href="https://doi.org/10.1145/3084873.3084904">Official Publisher’s Version</a></li>
<li><a href="https://jo.dreggn.org/path-tracing-in-production/2017/index.html">Path Tracing in Production SIGGRAPH 2017 Course</a></li>
</ul>
<p>Publication from Disney Animation. This paper describes various advancements in Hyperion since <em>Big Hero 6</em> up through <em>Moana</em>, with a particular focus towards replacing multiple scattering approximations with true, brute-force path-traced solutions for both better artist workflows and improved visual quality.</p>
</li>
<li>
<p><strong>Denoising with Kernel Prediction and Asymmetric Loss Functions</strong></p>
<p><a href="https://tvogels.nl/">Thijs Vogels</a>, <a href="https://research.nvidia.com/person/fabrice-rousselle">Fabrice Rousselle</a>, <a href="https://www.inf.ethz.ch/personal/mcbrian/">Brian McWilliams</a>, <a href="https://la.disneyresearch.com/people/gerhard-rothlin/">Gerhard Rothlin</a>, <a href="https://graphics.pixar.com/library/indexAuthorAlex_Harvill.html">Alex Harvill</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="http://graphics.pixar.com/people/mmeyer/">Mark Meyer</a>, and <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>. Denoising with Kernel Prediction and Asymmetric Loss Functions. <em>ACM Transactions on Graphics (Proceedings of SIGGRAPH 2018)</em>, 37(4), August 2017.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1qAu5DTDfxPPCFyGGzyoG4ggnz7877BEB">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3197517.3201388">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/KPAL/index.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios, Pixar, and Disney Animation. This paper describes a variety of improvements and extensions made to the 2017 <em>Kernel-predicting Convolutional Networks for Denoising Monte Carlo Renderings</em> paper; collectively, these improvements comprise the modern Disney-Research-developed second generation deep-learning denoiser in use in production at Pixar, ILM, and Disney Animation. At Disney Animation, used experimentally on <em>Ralph Breaks the Internet</em> and in full production beginning with <em>Frozen 2</em>.</p>
</li>
<li>
<p><strong>Plausible Iris Caustics and Limbal Arc Rendering</strong></p>
<p><a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a> and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Plausible Iris Caustics and Limbal Arc Rendering. <em>ACM SIGGRAPH 2018 Talks</em>, August 2018.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1Wibzqi9JIb4-DvXUyYKVfrbfrhu1bpQs">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3214745.3214751">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes a technique for rendering realistic, physically based eye caustics using manifold next-event estimation combined with a plausible procedural geometric eye model. This realistic eye model is implemented in Hyperion and used on all projects beginning with <em>Encanto</em>.</p>
</li>
<li>
<p><strong>The Design and Evolution of Disney’s Hyperion Renderer</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://www.linkedin.com/in/hank-driskill-1a7140165/">Hank Driskill</a>, <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, <a href="https://www.linkedin.com/in/patrick-kelly-1424b86/">Patrick Kelly</a>, <a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="https://www.linkedin.com/in/daniel-teece-2650358/">Daniel Teece</a>. The Design and Evolution of Disney’s Hyperion Renderer. <em>ACM Transactions on Graphics</em>, 37(3), August 2018.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1RbRr_rMJ1CIpcGsGWO4iuZKZ76utgMcd">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3182159">Official Publisher’s Version</a></li>
<li><a href="https://www.yiningkarlli.com/projects/hyperiondesign.html">Project Page</a></li>
</ul>
<p>Publication from Disney Animation. This paper is a systems architecture paper for the entirety of Hyperion. The paper describes the history of Disney’s Hyperion Renderer, the internal architecture, various systems such as shading, volumes, many-light sampling, emissive geometry, path simplification, fur rendering, photon-mapped caustics, subsurface scattering, and more. The paper also describes various challenges that had to be overcome for practical production use and artistic controllability. This paper covers everything in Hyperion beginning from <em>Big Hero 6</em> up through <em>Ralph Breaks the Internet</em>.</p>
</li>
<li>
<p><strong>Clouds Data Set</strong></p>
<p><a href="https://www.disneyanimation.com">Walt Disney Animation Studios</a>. Clouds Data Set, August 2018.</p>
<ul>
<li><a href="https://www.disneyanimation.com/resources/clouds/">Official Page</a></li>
<li><a href="https://disney-animation.s3.amazonaws.com/uploads/production/data_set_asset/6/asset/License_Cloud.pdf">License</a></li>
</ul>
<p>Publicly released data set for rendering research, by Disney Animation. This data set was produced by our production artists as part of the development process for Hyperion’s modern third generation null-collision tracking based volume rendering system.</p>
</li>
<li>
<p><strong><em>Moana</em> Island Scene Data Set</strong></p>
<p><a href="https://www.disneyanimation.com">Walt Disney Animation Studios</a>. <em>Moana</em> Island Scene Data Set, August 2018.</p>
<ul>
<li><a href="https://www.disneyanimation.com/resources/moana-island-scene/">Official Page</a></li>
<li><a href="https://disney-animation.s3.amazonaws.com/uploads/production/data_set_asset/4/asset/License_Moana.pdf">License</a></li>
</ul>
<p>Publicly released data set for rendering research, by Disney Animation.
This data set is an actual production scene from <em>Moana</em>, originally rendered using Hyperion and ported to PBRT v3 for the public release. This data set gives a sense of the typical scene complexity and rendering challenges that Hyperion handles every day in production.</p>
</li>
<li>
<p><strong>Denoising Deep Monte Carlo Renderings</strong></p>
<p><a href="https://rgl.epfl.ch/people/dvicini">Delio Vicini</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, <a href="https://research.nvidia.com/person/fabrice-rousselle">Fabrice Rousselle</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Denoising Deep Monte Carlo Renderings. <em>Computer Graphics Forum</em>, 38(1), February 2019.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1n904HlzXQx_ahiRruyCh9KTQjCLZ9lDM/view?usp=sharing">Preprint Version</a></li>
<li><a href="https://doi.org/10.1111/cgf.13533">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/DeepZDenoising/index.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper presents a technique for denoising deep (meaning images with multiple depth bins per pixel) renders, for use with deep-compositing workflows. This technique was developed as part of general denoising research from Disney Research Studios and the Hyperion team.</p>
</li>
<li>
<p><strong>The Challenges of Releasing the <em>Moana</em> Island Scene</strong></p>
<p><a href="https://www.linkedin.com/in/rasmus-tamstorf-22835a1/">Rasmus Tamstorf</a> and <a href="https://www.linkedin.com/in/heather-pritchett-8067592/">Heather Pritchett</a>. The Challenges of Releasing the <em>Moana</em> Island Scene. In <em>Proceedings of EGSR 2019, Industry Track</em>, July 2019.</p>
<ul>
<li><a href="https://drive.google.com/open?id=18jLb3XNqXCvi2R7Yyb2E2aCdJ26zBBF7">Preprint Version</a></li>
<li><a href="https://doi.org/10.2312/sr.20191223">Official Publisher’s Version</a></li>
</ul>
<p>Short paper from Disney Animation’s research department, discussing some of the challenges involved in preparing a production Hyperion scene for public release. The Hyperion team provided various support and advice to the larger studio effort to release the <em>Moana</em> Island Scene.</p>
</li>
<li>
<p><strong>Practical Path Guiding in Production</strong></p>
<p><a href="https://tom94.net">Thomas Müller</a>. Practical Path Guiding in Production. In <em>ACM SIGGRAPH 2019 Course Notes: Path Guiding in Production</em>, July 2019.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1Dxa2Wm4j2Hv40KIUK3K_yg_v-acOU9rt">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3305366.3328091">Official Publisher’s Version</a></li>
<li><a href="https://jo.dreggn.org/path-tracing-in-production/2019/index.html">Path Guiding in Production SIGGRAPH 2019 Course</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper presents a number of improvements and extensions made to <em>Practical Path Guiding</em>, developed in Hyperion by <a href="https://tom94.net">Thomas Müller</a> and the Hyperion team. In use in production on <em>Frozen 2</em>.</p>
</li>
<li>
<p><strong>Machine-Learning Denoising in Feature Film Production</strong></p>
<p><a href="https://henrikdahlberg.github.io">Henrik Dahlberg</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, and <a href="https://www.linkedin.com/in/jeremy-newlin-07a87946/">Jeremy Newlin</a>. Machine-Learning Denoising in Feature Film Production. In <em>ACM SIGGRAPH 2019 Talks</em>, July 2019.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1CdUC9caWNSShHNvIj4kge7BWQczXWr79">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3306307.3328150">Official Publisher’s Version</a></li>
</ul>
<p>Joint publication from Pixar, Industrial Light & Magic, and Disney Animation. Describes how the modern Disney-Research-developed second generation deep-learning denoiser was deployed into production at Pixar, ILM, and Disney Animation.</p>
</li>
<li>
<p><strong>Taming the Shadow Terminator</strong></p>
<p><a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Taming the Shadow Terminator. In <em>ACM SIGGRAPH 2019 Talks</em>, August 2019.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1Yb6GUP3pIuNiH9Xgq2P0L99V3JAQ7emM">Preprint Version</a> (Updated compared to official version)</li>
<li><a href="https://doi.org/10.1145/3306307.3328172">Official Publisher’s Version</a></li>
<li><a href="https://www.yiningkarlli.com/projects/shadowterminator.html">Project Page</a></li>
</ul>
<p>Internal project from Disney Animation. This short paper describes a solution to the long-standing “shadow terminator” problem associated with using shading normals. The technique in this paper is implemented in Hyperion and has been in use in production starting on <em>Frozen 2</em> through present.</p>
</li>
<li>
<p><strong>On Histogram-Preserving Blending for Randomized Texture Tiling</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. On Histogram-Preserving Blending for Randomized Texture Tiling. <em>Journal of Computer Graphics Techniques</em>, 8(4), November 2019.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1kiMQUCcX_tEyQXWtsAPWZLVZTQt6OL_i">Preprint Version</a></li>
<li><a href="http://www.jcgt.org/published/0008/04/02/">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes some modifications to the histogram-preserving hex-tiling algorithm of Heitz and Neyret; these modifications make implementing the Heitz and Neyret technique easier and more efficient. The paper also documents Hyperion’s implementation of the technique, in use in production starting on <em>Frozen 2</em> through present.</p>
</li>
<li>
<p><strong>The Look and Lighting of “Show Yourself” in “Frozen 2”</strong></p>
<p><a href="https://dl.acm.org/author/Sathe,%20Amol">Amol Sathe</a>, <a href="https://dl.acm.org/author/Summers,%20Lance">Lance Summers</a>, <a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, and <a href="https://dl.acm.org/author/Newland,%20James">James Newland</a>. The Look and Lighting of “Show Yourself” in “Frozen 2”. In <em>ACM SIGGRAPH 2020 Talks</em>, August 2020.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1XVyhzCP_RDusyrfrsKlR8hIuq0fs_WJF">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3388767.3407388">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the process that went into achieving the final look and lighting of the “Show Yourself” sequence in <em>Frozen 2</em>, including a new tabulation-based approach implemented in Hyperion to maintain energy conservation in rough dielectric reflection and transmission.</p>
</li>
<li>
<p><strong>Practical Hash-based Owen Scrambling</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Practical Hash-based Owen Scrambling. <em>Journal of Computer Graphics Techniques</em>, 9(4), December 2020.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1-avUab_y8UZaM9UlbX95OcXZyMysKFKH">Preprint Version</a></li>
<li><a href="http://www.jcgt.org/published/0009/04/01/">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes a new version of Owen scrambling for the Sobol sequence that is simple to implement, efficient to evaluate, and broadly applicable to various problems.</p>
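<p>To give a sense of what Owen scrambling does, here is a deliberately slow, generic bitwise version in Python; this is my own illustration and not the hash-based construction from the paper, whose contribution is collapsing the per-bit loop below into a single fast, carefully designed hash:</p>
<pre><code>def owen_scramble_bits(x, seed, num_bits=32):
    # Generic (slow) Owen scrambling of a num_bits-wide integer sample:
    # flip each bit based on a pseudorandom function of all of its
    # more-significant bits, so points that share a dyadic prefix are
    # scrambled consistently.
    result = 0
    for i in range(num_bits - 1, -1, -1):    # from MSB down to LSB
        prefix = x >> (i + 1)                # the more-significant bits
        flip = hash((prefix, i, seed)) & 1   # pseudorandom flip decision
        bit = (x >> i) & 1
        result = result * 2 + (bit ^ flip)
    return result
</code></pre>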
</li>
<li>
<p><strong>Unbiased Emission and Scattering Importance Sampling For Heterogeneous Volumes</strong></p>
<p><a href="http://rendering-memo.blogspot.com/">Wei-Feng Wayne Huang</a>, <a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>. Unbiased Emission and Scattering Importance Sampling For Heterogeneous Volumes. In <em>ACM SIGGRAPH 2021 Talks</em>, August 2021.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1YTBp11HBC-TbrRCu_Aoq42eiFVdxFaYy">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3450623.3464644">Official Publisher’s Version</a></li>
<li><a href="https://www.yiningkarlli.com/projects/emissionscattervolumes.html">Project Page</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes a pair of new unbiased distance-sampling methods for production volume path tracing, with a specific focus on sampling emission and scattering. First used on <em>Raya and the Last Dragon</em>.</p>
</li>
<li>
<p><strong>The Atmosphere of Raya and the Last Dragon</strong></p>
<p><a href="https://dl.acm.org/author/Bryant,%20Marc">Marc Bryant</a>, <a href="https://dl.acm.org/author/DeYoung,%20Ryan">Ryan DeYoung</a>, <a href="http://rendering-memo.blogspot.com/">Wei-Feng Wayne Huang</a>, <a href="https://dl.acm.org/author/Longson,%20Joe">Joe Longson</a>, and <a href="https://dl.acm.org/author/Villegas,%20Noel">Noel Villegas</a>. The Atmosphere of Raya and the Last Dragon. In <em>ACM SIGGRAPH 2021 Talks</em>, August 2021.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1ucK1j2mgJpoFf3hvt3-QyAiZyppkqf6p">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3450623.3464676">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the various rendering and workflow improvements that went into rendering atmospheric volumes to produce the highly atmospheric lighting in <em>Raya and the Last Dragon</em>.</p>
</li>
<li>
<p><strong>Practical Multiple-Scattering Sheen Using Linearly Transformed Cosines</strong></p>
<p><a href="https://tizianzeltner.com">Tizian Zeltner</a>, <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, and <a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>. Practical Multiple-Scattering Sheen Using Linearly Transformed Cosines. In <em>ACM SIGGRAPH 2022 Talks</em>, August 2022.</p>
<ul>
<li><a href="https://drive.google.com/file/d/13LDVa5pYckJMRnHE9ZxIbdSfRriHlPW9/view?usp=sharing">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3532836.3536240">Official Publisher’s Version</a></li>
<li><a href="https://tizianzeltner.com/projects/Zeltner2022Practical/">Project Page</a></li>
</ul>
<p>Joint project between École Polytechnique Fédérale de Lausanne (EPFL) and Disney Animation. This paper describes the new multiple-scattering sheen model used in the Disney Principled BSDF starting with the production of <em>Strange World</em>.</p>
</li>
<li>
<p><strong>“Encanto” - Let’s Talk About Bruno’s Visions</strong></p>
<p><a href="https://www.linkedin.com/in/corey-butler-96aa492/">Corey Butler</a>, <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, <a href="http://rendering-memo.blogspot.com/">Wei-Feng Wayne Huang</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="https://www.linkedin.com/in/benjamin-min-huang-94b3011/">Benjamin Huang</a>. “Encanto” - Let’s Talk About Bruno’s Visions. In <em>ACM SIGGRAPH 2022 Talks</em>, August 2022.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1IZOeJrZYciqWaIfQLJr7WOt9AAxH6jzi/view?usp=sharing">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3532836.3536269">Official Publisher’s Version</a></li>
<li><a href="https://www.yiningkarlli.com/projects/teleportshader.html">Project Page</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the process of creating the holographic prophecy shards from <em>Encanto</em>, including a new teleportation shader in Hyperion that was developed specifically to support this effect.</p>
</li>
<li>
<p><strong>Fracture-Aware Tessellation of Subdivision Surfaces</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a> and <a href="https://www.linkedin.com/in/fjrodriguez/">Francisco Rodriguez</a>. Fracture-Aware Tessellation of Subdivision Surfaces. In <em>ACM SIGGRAPH 2022 Talks</em>, August 2022.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1MS8XehTmdHNHPwHm19t776QjB7owoO12/view?usp=sharing">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3532836.3536262">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes a new tessellation algorithm for fractured subdivision surfaces, used as part of Disney Animation’s destruction FX pipeline and implemented in Hyperion. First used in production on <em>Encanto</em>.</p>
</li>
<li>
<p><strong>Progressive Null-Tracking for Volumetric Rendering</strong></p>
<p><a href="https://www.linkedin.com/in/zackary-misso/">Zackary Misso</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, <a href="https://www.linkedin.com/in/daniel-teece-2650358/">Daniel Teece</a>, and <a href="https://cs.dartmouth.edu/~wjarosz/index.html">Wojciech Jarosz</a>. Progressive Null Tracking for Volumetric Rendering. <em>SIGGRAPH ‘23: ACM SIGGRAPH 2023 Conference Proceedings</em>. Article 31, August 2023.</p>
<ul>
<li><a href="https://drive.google.com/file/d/11YsHMnJvUhINBpTabGFi48-j69A47Iw_/view?usp=sharing">Preprint Version</a></li>
<li><a href="http://doi.org/10.1145/3588432.3591557">Official Publisher’s Version</a></li>
<li><a href="https://cs.dartmouth.edu/~wjarosz/publications/misso23progressive.html">Project Page</a></li>
</ul>
<p>Joint project between Dartmouth College and Disney Animation. This paper describes a new method to progressively learn bounding majorants when using null-tracking techniques to perform unbiased rendering of general heterogeneous volumes with unknown bounding majorants.</p>
</li>
<li>
<p><strong>Splat: Developing a ‘Strange’ Shader</strong></p>
<p><a href="https://www.linkedin.com/in/klitaker/">Kendall Litaker</a>, <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, <a href="https://www.linkedin.com/in/dan-lipson-2ab84916b/">Dan Lipson</a>, and <a href="https://www.linkedin.com/in/mason-khoo-3b490562/">Mason Khoo</a>. Splat: Developing a ‘Strange’ Shader. In <em>ACM SIGGRAPH 2023 Talks</em>, August 2023.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1FY7H-7JmBVL5ZsINGXMP0ourN-fHLBTT/view?usp=share_link">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3587421.3595424">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the unusual challenges encountered when developing the unique shading and look for the Splat character from <em>Strange World</em>.</p>
</li>
</ol>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Jul/hyperion_logo.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Jul/hyperion_logo.png" alt="Figure 2: Hyperion logo, modeled by Disney Animation artist Chuck Tappan and rendered in Disney's Hyperion Renderer." /></a></p>
<p>Again, this post is meant to be a living document; any new publications with involvement from the Hyperion team will be added here.
Of course, the Hyperion team is not the only team at Disney Animation that regularly publishes; for a full list of publications from Disney Animation, see the <a href="https://www.disneyanimation.com/technology/publications">official Disney Animation publications page</a>.
The <a href="https://www.technology.disneyanimation.com">Disney Animation Technology website</a> is also worth keeping an eye on if you want to keep up on what our engineers and TDs are working on!</p>
<p>If you’re just getting started and want to learn more about rendering in general, the must-read text that every rendering engineer has on their desk or bookshelf is <a href="http://www.pbr-book.org">Physically Based Rendering 3rd Edition</a> by Matt Pharr, Wenzel Jakob, and Greg Humphreys (now available online completely for free!).
Also, the de-facto standard beginner’s text today is the <a href="https://www.amazon.com/gp/product/B01B5AODD8">Ray Tracing in One Weekend</a> series by Peter Shirley, which provides a great, gentle, practical introduction to ray tracing, and is also available completely for free.
Also take a look at <a href="http://www.realtimerendering.com/book.html">Real-Time Rendering 4th Edition</a>, <a href="http://www.realtimerendering.com/raytracinggems/">Ray Tracing Gems</a> (also available online for free), <a href="http://graphicscodex.com">The Graphics Codex</a> by Morgan McGuire, and Eric Haines’s <a href="http://www.realtimerendering.com/raytracing.html">Ray Tracing Resources page</a>.</p>
<p>Many other amazing rendering teams at both large studios and commercial vendors also publish regularly, and I highly recommend keeping up with all of their work too!
For a good starting point into exploring the wider world of production rendering, check out the <a href="https://dl.acm.org/citation.cfm?id=3243123">ACM Transactions on Graphics Special Issue on Production Rendering</a>, which is edited by Matt Pharr and contains extensive, detailed systems papers on <a href="https://dl.acm.org/citation.cfm?id=3182162">Pixar’s RenderMan</a>, <a href="https://dl.acm.org/citation.cfm?id=3182161">Weta Digital’s Manuka</a>, <a href="https://dl.acm.org/citation.cfm?id=3182160">Solid Angle (Autodesk)’s Arnold</a>, <a href="https://dl.acm.org/citation.cfm?id=3180495">Sony Picture Imageworks’ Arnold</a>, and of course <a href="https://dl.acm.org/citation.cfm?id=3182159">Disney Animation’s Hyperion</a>.
A sixth paper that I would group with the five above is the High Performance Graphics 2017 paper detailing the architecture of <a href="http://doi.org/10.1145/3105762.3105768">DreamWorks Animation’s MoonRay</a>.</p>
<p>For even further exploration, extensive course notes are available from SIGGRAPH courses every year. Particularly good recurring courses to look at from past years are the Path Tracing in Production course (<a href="https://jo.dreggn.org/path-tracing-in-production/2017/index.html">2017</a>, <a href="https://jo.dreggn.org/path-tracing-in-production/2018/index.html">2018</a>, <a href="https://jo.dreggn.org/path-tracing-in-production/2019/index.html">2019</a>), the absolutely legendary Physically Based Shading course (<a href="http://renderwonk.com/publications/s2010-shading-course/">2010</a>, <a href="https://blog.selfshadow.com/publications/s2012-shading-course">2012</a>, <a href="https://blog.selfshadow.com/publications/s2013-shading-course">2013</a>, <a href="https://blog.selfshadow.com/publications/s2014-shading-course">2014</a>, <a href="https://blog.selfshadow.com/publications/s2015-shading-course">2015</a>, <a href="https://blog.selfshadow.com/publications/s2016-shading-course">2016</a>, <a href="https://blog.selfshadow.com/publications/s2017-shading-course/">2017</a>), the various incarnations of a volume rendering course (<a href="https://magnuswrenninge.com/productionvolumerendering">2011</a>, <a href="https://graphics.pixar.com/library/ProductionVolumeRendering/">2017</a>, <a href="https://cs.dartmouth.edu/~wjarosz/publications/novak18monte-sig.html">2018</a>), and now due to the dawn of ray tracing in games, <a href="http://advances.realtimerendering.com">Advances in Real-Time Rendering</a> and <a href="https://openproblems.realtimerendering.com">Open Problems in Real-Time Rendering</a>.
Also, Stephen Hill typically collects links to all publicly available course notes, slides, source code, and more for SIGGRAPH each year after the conference on <a href="https://blog.selfshadow.com">his blog</a>; both his blog and the blogs listed on the sidebar of his website are essentially mandatory reading in the rendering world.
Also, interesting rendering papers are always being published in journals and at conferences.
The major journals to check are <a href="https://tog.acm.org">ACM Transactions on Graphics (TOG)</a>, <a href="https://www.eg.org/wp/eurographics-publications/cgf/">Computer Graphics Forum (CGF)</a>, and the <a href="http://www.jcgt.org">Journal of Computer Graphics Techniques (JCGT)</a>; the major academic conferences where rendering stuff appears are SIGGRAPH, SIGGRAPH Asia, EGSR (Eurographics Symposium on Rendering), HPG (High Performance Graphics), MAM (Workshop on Material Appearance Modeling), EUROGRAPHICS, and i3D (ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games); three other industry conferences where interesting work often appears are DigiPro, GDC (Game Developers Conference), and GTC (GPU Technology Conference).
A complete listing of the contents for all of these conferences every year, along with links to preprints, is <a href="http://kesen.realtimerendering.com">compiled by Ke-Sen Huang</a>.</p>
<p>A large number of people have contributed directly to Hyperion’s development since the beginning of the project, in a variety of capacities ranging from core developers to TDs and support staff and all the way to notable interns. In no particular order, including both present and past: Daniel Teece, Brent Burley, David Adler, Yining Karl Li, Mark Lee, Charlotte Zhu, Brian Green, Andrew Bauer, Lea Reichardt, Mackenzie Thompson, Wei-Feng Wayne Huang, Matt Jen-Yuan Chen, Joe Schutte, Andrew Gartner, Jennifer Yu, Peter Kutz, Ralf Habel, Patrick Kelly, Gregory Nichols, Andrew Selle, Christian Eisenacher, Jan Novák, Ben Spencer, Doug Lesan, Lisa Young, Tami Valdez, Andrew Fisher, Noah Kagan, Benedikt Bitterli, Thomas Müller, Tizian Zeltner, Zackary Misso, Magdalena Martinek, Mathijs Molenaar, Laura Lediav, Guillaume Loubet, David Koerner, Simon Kallweit, Gabor Liktor, Ulrich Muller, Norman Moses Joseph, Stella Cheng, Marc Cooper, Tal Lancaster, and Serge Sretschinsky.
Our closest research partners at Disney Research Studios, Pixar Animation Studios, Industrial Light & Magic, and elsewhere include (in no particular order): Marios Papas, Marco Manzi, Tiziano Portenier, Rasmus Tamstorf, Gerhard Roethlin, Per Christensen, Julian Fong, Mark Meyer, André Mazzone, Wojciech Jarosz, Fabrice Rousselle, Christophe Hery, Ryusuke Villemin, and Magnus Wrenninge.
Invaluable support from studio leadership over the years has been provided by (again, in no particular order): Nick Cannon, Munira Tayabji, Bettina Martin, Laura Franek, Collin Larkins, Golriz Fanai, Rajesh Sharma, Chuck Tappan, Sean Jenkins, Darren Robinson, Alex Nijmeh, Hank Driskill, Kyle Odermatt, Adolph Lusinsky, Ernie Petti, Kelsey Hurley, Tad Miller, Mark Hammel, Mohit Kallianpur, Brian Leach, Josh Staub, Steve Goldberg, Scott Kersavage, Andy Hendrickson, Dan Candela, Ed Catmull, and many others.
Of course, beyond this enormous list, there is an even more enormous list of countless artists, technical directors, production supervisors, and other technology development teams at Disney Animation who motivated Hyperion, participated in its development, and contributed to its success.
If anything in this post has caught your interest, keep an eye out for open position listings on <a href="https://www.disneyanimation.com/careers">DisneyAnimation.com</a>; maybe these lists can one day include you!</p>
<p>Finally, here is a list of all publicly released and announced projects to date made using Disney’s Hyperion Renderer:</p>
<p><strong>Feature Films</strong>: <a href="https://www.disneyplus.com/movies/big-hero-6/4AozFbXy3sPw">Big Hero 6</a> (2014), <a href="https://www.disneyplus.com/movies/zootopia/1QOxldhm1sKg">Zootopia</a> (2016), <a href="https://www.disneyplus.com/movies/moana/70GoJHflgHH9">Moana</a> (2016), <a href="https://www.disneyplus.com/movies/ralph-breaks-the-internet/33T1xWWWLhFR">Ralph Breaks the Internet</a> (2018), <a href="https://www.disneyplus.com/movies/frozen-2/28vdy71kJrjb">Frozen 2</a> (2019), <a href="https://www.disneyplus.com/movies/raya-and-the-last-dragon/6dyengbx3iYK">Raya and the Last Dragon</a> (2021), <a href="https://www.disneyplus.com/movies/encanto/33q7DY1rtHQH">Encanto</a> (2021), <a href="https://www.disneyplus.com/movies/strange-world/1OVzv6hnhOFm">Strange World</a> (2022), <a href="https://movies.disney.com/wish">Wish</a> (2023)</p>
<p><strong>Shorts and Featurettes</strong>: <a href="https://www.disneyplus.com/movies/feast/3LXsUWltFatX">Feast</a> (2014), <a href="https://www.disneyplus.com/movies/frozen-fever/5xsCGQz3eJRq">Frozen Fever</a> (2015), <a href="https://www.disneyplus.com/movies/inner-workings/2am4tRzFOOXl">Inner Workings</a> (2016), <a href="https://www.imdb.com/title/tt6467284/">Gone Fishing</a> (2017), <a href="https://www.disneyplus.com/movies/olafs-frozen-adventure/5zrFDkAANpLi">Olaf’s Frozen Adventure</a> (2017), <a href="https://www.disneyplus.com/movies/myth-a-frozen-tale/1N00Fn9eajzi">Myth: A Frozen Tale</a><sup>1</sup> (2019), <a href="https://www.disneyplus.com/movies/once-upon-a-snowman/2tBSdZv6bB4L">Once Upon a Snowman</a> (2020), <a href="https://www.disneyplus.com/movies/us-again/3KPeVueXrxck">Us Again</a> (2021), <a href="https://www.disneyplus.com/movies/far-from-the-tree/4LKsV18kWS9G">Far From the Tree</a> (2021), <a href="https://www.disneyplus.com/movies/once-upon-a-studio/2lskBMjkAn3w">Once Upon A Studio</a> (2023)</p>
<p><strong>Animated Series</strong>: <a href="https://www.youtube.com/playlist?list=PLxnVeUnlga-Eg3hSTyV2GXjiJYdjQl2nt">At Home With Olaf</a> (2020), <a href="https://www.disneyplus.com/series/olaf-presents/6nKDva3ZVCvC">Olaf Presents</a> (2021), <a href="https://www.disneyplus.com/series/baymax/1D141qnxDHLI">Baymax!</a> (2022), <a href="https://www.disneyplus.com/series/zootopia/2CB7CKG729Ou">Zootopia+</a> (2022)</p>
<p><strong>Short Circuit Shorts</strong>: <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Exchange Student</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Just a Thought</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Jing Hua</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Elephant in the Room</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Puddles</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Lightning in a Bottle</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Zenith</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Drop</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Fetch</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Downtown</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Hair-Jitsu</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">The Race</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Lucky Toupée</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Cycles</a><sup>2</sup> (2020), <a href="https://twitter.com/disneyanimation/status/1149743115130920960?lang=en">A Kite’s Tale</a><sup>2</sup> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Going Home</a> (2021), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Crosswalk</a> (2021), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Songs to Sing in the Dark</a> (2021), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">No. 2 to Kettering</a> (2021)</p>
<p><strong>Intern Shorts</strong>: <a href="https://ohmy.disney.com/insider/2017/10/19/you-must-watch-this-beautiful-short-created-by-walt-disney-animation-interns/">Ventana</a> (2017), <a href="https://ohmy.disney.com/news/2018/12/05/voila-walt-disney-animation-studios-interns/">Voilà</a> (2018), <a href="https://ohmy.disney.com/movies/2019/09/19/watch-maestro-a-beautiful-short-from-this-years-walt-disney-animation-studios-interns/">Maestro</a> (2019), <a href="https://twitter.com/DisneyAnimJobs/status/1448007879257067520">June Bug</a> (2021)</p>
<p><strong>Filmmaker Co-op Shorts</strong>: <a href="https://www.imdb.com/title/tt7592274/">Weeds</a> (2017)</p>
<p><sup>1</sup> VR project running on Unreal Engine, with shading and textures baked out of Disney’s Hyperion Renderer.</p>
<p><sup>2</sup> VR project running on Unity, with shading and textures baked out of Disney’s Hyperion Renderer.</p>
https://blog.yiningkarlli.com/2019/05/nested-dielectrics.html
Nested Dielectrics
2019-05-21T00:00:00+00:00
2019-05-21T00:00:00+00:00
Yining Karl Li
<p>A few years ago, I wrote <a href="https://blog.yiningkarlli.com/2015/06/attenuated-transmission.html">a post about attenuated transmission</a> and what I called “deep attenuation” at the time: refraction and transmission through multiple mediums embedded inside of each other, a.k.a. what is usually called nested dielectrics.
What I called “deep attenuation” in that post is, in its essence, just pure interface tracking using a stack.
This post is meant as a revisit and update of that post; I’ll talk about the problems with the ad-hoc pure interface tracking technique I came up with in that previous post and discuss the proper priority-based nested dielectric technique <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2002.10487555">[Schmidt and Budge 2002]</a> that Takua uses today.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_ice.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/nested_ice.0.jpg" alt="Figure 1: Ice cubes floating in tea inside of a glass teacup, rendered in Takua Renderer using priority-based nested dielectrics." /></a></p>
<p>In my 2015 post, I included a diagram showing the overlapping boundaries required to model ice cubes in a drink in a glass, but I didn’t actually include a render of that scenario!
In retrospect, the problems with the 2015 post would have become obvious to me more quickly if I had actually done a render like that diagram.
Figure 1 shows an actual “ice cubes in a drink in a glass” scene, rendered correctly using Takua Renderer’s implementation of priority-based nested dielectrics.
For comparison, Figure 2 shows what Takua produces using the approach in the 2015 post; there are a number of obvious bizarre problems!
In Figure 2, the ice cubes don’t properly refract the tea behind and underneath them, and the ice cubes under the liquid surface aren’t visible at all.
Also, where the surface of the tea interfaces with the glass teacup, there is an odd bright ring.
Conversely, Figure 1 shows a correct liquid-glass interface without a bright ring, shows proper refraction through the ice cubes, and correctly shows the ice cubes under the liquid surface.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_ice_old.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/nested_ice_old.0.jpg" alt="Figure 2: The same scene as in Figure 1, rendered using Takua's old interface tracking system. A number of bizarre physically inaccurate problems are present." /></a></p>
<p><strong>Problems with only Interface Tracking</strong></p>
<p>So what exactly is wrong with using only interface tracking without priorities?
First, let’s quickly summarize how my old interface tracking implementation worked.
Note that here we refer to the side of a surface a ray is currently on as the <em>incident</em> side, and the other side of the surface as the <em>transmit</em> side.
For each path, keep a stack of which Bsdfs the path has encountered; a short code sketch of this bookkeeping follows the list below:</p>
<ul>
<li>When a ray enters a surface, push the encountered surface onto the stack.</li>
<li>When a ray exits a surface, scan the stack from the top down and pop the first instance of a surface in the stack matching the encountered surface.</li>
<li>When hitting the front side of a surface, the incident properties come from the top of the stack (or from the empty default if the stack is empty), and the transmit properties come from the surface being intersected.</li>
<li>When hitting the back side of a surface, the incident properties come from the surface being intersected, and the transmit properties come from the top of the stack (or from the empty default if the stack is empty).</li>
<li>Only push/pop onto the stack when a refraction/transmission event occurs.</li>
</ul>
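<p>To make the bookkeeping above concrete, here’s a minimal sketch of what that stack might look like in C++. The Bsdf type and the nullptr-as-air convention here are just hypothetical placeholders for illustration, not Takua’s actual interfaces:</p>
<pre><code>#include <vector>

struct Bsdf; // opaque stand-in for whatever data describes a surface

struct InterfaceStack {
    std::vector<const Bsdf*> stack;

    // On a refraction/transmission event that enters a surface.
    void push(const Bsdf* surface) { stack.push_back(surface); }

    // On a refraction/transmission event that exits a surface: scan from the
    // top down and pop the first entry matching the encountered surface.
    void pop(const Bsdf* surface) {
        for (int i = int(stack.size()) - 1; i >= 0; --i) {
            if (stack[i] == surface) {
                stack.erase(stack.begin() + i);
                return;
            }
        }
    }

    // Top of the stack, or nullptr for the empty default (e.g. air).
    const Bsdf* top() const { return stack.empty() ? nullptr : stack.back(); }
};
</code></pre>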
<p>Next, as an example, imagine a case where it is ambiguous which surface a ray is currently inside of.
A common example of this case is when two surfaces are modeled as being slightly overlapping, as is often done when modeling liquid inside of a glass since modeling perfectly coincident surfaces in CG is either extremely difficult or impossible due to floating point precision problems.
Even if we could model perfectly coincident surfaces, rendering perfectly coincident surfaces without artifacts is similarly extremely difficult or impossible, also due to floating point precision problems.
Figure 3 shows a diagram of how a glass containing water and ice cubes is commonly modeled; in Figure 3, the ambiguous regions are where the water surface is inside of the glass and inside of the ice cube.
When a ray enters this overlapping region, it is not clear whether we should treat the ray as being inside the water or inside of the glass (or ice)!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_diagram_old.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/nested_diagram_old.png" alt="Figure 3: A diagram of a path through a glass containing water and ice cubes, using only interface tracking without priorities." /></a></p>
<p>Using the pure interface tracking algorithm from my old blog post, below is what happens at each path vertex along the path illustrated in Figure 3.
In this example, we define the empty default to be air.</p>
<ol>
<li>Enter Glass.
<ul>
<li>Incident/transmit IOR: Air/Glass.</li>
<li>Push Glass onto stack. Stack after event: (Glass).</li>
</ul>
</li>
<li>Enter Water.
<ul>
<li>Incident/transmit IOR: Glass/Water.</li>
<li>Push Water onto stack. Stack after event: (Water, Glass).</li>
</ul>
</li>
<li>Exit Glass.
<ul>
<li>Incident/transmit IOR: Glass/Water.</li>
<li>Remove Glass from stack. Stack: (Water).</li>
</ul>
</li>
<li>Enter Ice.
<ul>
<li>Incident/transmit IOR: Water/Ice.</li>
<li>Push Ice onto stack. Stack: (Ice, Water).</li>
</ul>
</li>
<li>Exit Water.
<ul>
<li>Incident/transmit IOR: Water/Ice.</li>
<li>Remove Water from stack. Stack: (Ice).</li>
</ul>
</li>
<li>Exit Ice.
<ul>
<li>Incident/transmit IOR: Ice/Air.</li>
<li>Remove Ice from stack. Stack: empty.</li>
</ul>
</li>
<li>Enter Water.
<ul>
<li>Incident/transmit IOR: Air/Water.</li>
<li>Push Water onto stack. Stack after event: (Water).</li>
</ul>
</li>
<li>Enter Glass.
<ul>
<li>Incident/transmit IOR: Water/Glass.</li>
<li>Push Glass onto stack. Stack after event: (Glass, Water).</li>
</ul>
</li>
<li>Reflect off Water.
<ul>
<li>Incident/transmit IOR: Water/Glass.</li>
<li>No change to stack. Stack after event: (Glass, Water).</li>
</ul>
</li>
<li>Reflect off Glass.
<ul>
<li>Incident/transmit IOR: Glass/Glass.</li>
<li>No change to stack. Stack after event: (Glass, Water).</li>
</ul>
</li>
<li>Exit Water.
<ul>
<li>Incident/transmit IOR: Water/Glass.</li>
<li>Remove Water from stack. Stack after event: (Glass).</li>
</ul>
</li>
<li>Exit Glass.
<ul>
<li>Incident/transmit IOR: Glass/Air.</li>
<li>Remove Glass from stack. Stack after event: empty.</li>
</ul>
</li>
</ol>
<p>Observe events 3 and 5, where the same index of refraction boundary is encountered as in the previous event.
These double events are where some of the weirdness in Figure 2 comes from; specifically the bright ring at the liquid-glass surface interface and the incorrect refraction through the ice cube.
These double events are not actually physically meaningful; in reality, a ray could never be both inside of a glass surface and inside of a water surface simultaneously.
Figure 4 shows a simplified version of the tea cup example above, without ice cubes; even then, the double event still causes a bright ring at the liquid-glass surface interface.
Also note how when following the rules from my old blog post, event 10 becomes a nonsense event where the incident and transmit IOR are the same.
The fix for this case is to modify the rules so that when a ray exits a surface, the transmit properties come from the first surface on the stack that isn’t the same as the incident surface, but even with this fix, the reflection at event 10 is still physically impossible.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_old.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/nested_old.0.jpg" alt="Figure 4: Tea inside of a glass cup, rendered using Takua Renderer's old interface tracking system. Note the bright ring at the liquid-glass surface interface, produced by a physically incorrect double-refraction event." /></a></p>
<p>Really what we want is to model overlapping surfaces, but then in overlapping areas, be able to specify which surface a ray should think it is actually inside of.
Essentially, this functionality would make overlapping surfaces behave like boolean operators; we would be able to specify that the ice cubes in Figure 3 “cut out” a space from the water they overlap with, and the glass cut out a space from the water as well.
This way, the double events never occur since rays wouldn’t see the second event in each pair of double events.
One solution that immediately comes to mind is to simply consider whatever surface is at the top of the interface tracking stack as being the surface we are currently inside, but this causes an even worse problem: the order of surfaces that a ray thinks it is in becomes dependent on what surfaces a ray encounters first, which depends on the direction and location of each ray!
This produces an inconsistent view of the world across different rays.
Instead, a better solution is provided by priority-based nested dielectrics <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2002.10487555">[Schmidt and Budge 2002]</a>.</p>
<p><strong>Priority-Based Nested Dielectrics</strong></p>
<p>Priority-based nested dielectrics work by assigning priority values to geometry, with the priority values determining which piece of geometry “wins” when a ray is in a region of space where multiple pieces of geometry overlap.
A priority value is just a single number assigned as an attribute to a piece of geometry or to a shader; the convention established by the paper is that lower numbers indicate higher priority.
The basic algorithm in <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2002.10487555">[Schmidt and Budge 2002]</a> works using an <em>interior list</em>, which is conceptually similar to an interface tracking stack.
The interior list is exactly what it sounds like: a list of all of the surfaces that a path has entered but not exited yet.
Unlike the interface tracking stack though, the interior list doesn’t necessarily have to be a stack or have any particular ordering, although implementing it as a list always sorted by priority may provide some minor practical advantages.
When a ray hits a surface during traversal, the following rules apply:</p>
<ul>
<li>If the surface has a higher or equal priority (so lower or equal priority number) than anything else on the interior list, the result is a <em>true hit</em> and an intersection has occurred. Proceed with regular shading and Bsdf evaluation.</li>
<li>If the surface has a lower priority (so higher priority number) than the highest-priority value on the interior list, the result is a <em>false hit</em> and no intersection has occurred. Ignore the intersection and continue with ray traversal.</li>
<li>If the hit is a false hit OR if the hit is a true hit that results in a refraction/transmission event:
<ul>
<li>Add the surface to the interior list if the ray is entering the surface.</li>
<li>Remove the surface from the interior list if the ray is exiting the surface.</li>
</ul>
</li>
<li>For a true hit that produces a reflection event, don’t add the surface to the interior list.</li>
</ul>
<p>Note that this approach only works with surfaces that are enclosed manifolds; that is, every surface defines a finite volume.
When a ray exits a surface, the surface it is exiting must already be in the interior list; if not, then the interior list can become corrupted and the renderer may start thinking that paths are in surfaces that they are not actually in (or vice versa).
Also note that a ray can only ever enter into a higher-priority surface through finding a true hit, and can only enter into a lower-priority surface by exiting a higher-priority surface and removing the higher-priority surface from the interior list.
At each true hit, we can figure out the properties of the incident and transmit sides by examining the interior list.
If hitting the front side of a surface, before we update the interior list, the surface we just hit provides the transmit properties and the highest-priority surface on the interior list provides the incident properties.
If hitting the back side of a surface, before we update the interior list, the surface we just hit provides the incident properties and the second-highest-priority surface on the interior list provides the transmit properties.
Alternatively, if the interior list only contains one surface, then the transmit properties come from the empty default.
Importantly, if a ray hits a surface with no priority value set, that surface should always count as a true hit.
This way, we can embed non-transmissive objects inside of transmissive objects and have everything work automatically.</p>
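<p>Here’s a rough sketch of how those rules might translate into code. Everything below is a hypothetical stand-in rather than any renderer’s real implementation: I’m assuming a Surface type that carries an optional priority number and an IOR, and I’m treating the empty default (air) as the absence of any entry in the interior list:</p>
<pre><code>#include <algorithm>
#include <climits>
#include <vector>

struct Surface {
    bool hasPriority = false;
    int priority = INT_MAX; // lower number = higher priority
    float ior = 1.0f;
};

struct InteriorList {
    std::vector<const Surface*> entries; // surfaces entered but not yet exited

    // Lowest priority number present; an empty list behaves like the
    // infinite-priority-number empty default (air).
    int highestPriority() const {
        int best = INT_MAX;
        for (const Surface* s : entries) best = std::min(best, s->priority);
        return best;
    }
    // Highest-priority surface, or nullptr for the empty default.
    const Surface* highestPrioritySurface() const {
        const Surface* best = nullptr;
        for (const Surface* s : entries) {
            if (best == nullptr || s->priority < best->priority) best = s;
        }
        return best;
    }
    void add(const Surface* s) { entries.push_back(s); }
    void remove(const Surface* s) {
        auto it = std::find(entries.begin(), entries.end(), s);
        if (it != entries.end()) entries.erase(it);
    }
};

// First two rules: classify the hit. A surface with no priority value set
// always counts as a true hit, which is what lets opaque objects be embedded
// inside transmissive ones.
bool isTrueHit(const InteriorList& interior, const Surface* s) {
    if (!s->hasPriority) return true;
    return s->priority <= interior.highestPriority();
}

// Remaining rules: false hits always update the interior list; true hits
// only update it when the Bsdf produces a refraction/transmission event.
void updateInteriorList(InteriorList& interior, const Surface* s, bool entering) {
    if (entering) interior.add(s);
    else interior.remove(s);
}

// Incident/transmit IORs at a true hit, evaluated before the interior list
// is updated for this hit.
void incidentTransmitIor(const InteriorList& interior, const Surface* s,
                         bool frontSide, float& incidentIor, float& transmitIor) {
    const float airIor = 1.0f; // the empty default
    if (frontSide) {
        const Surface* current = interior.highestPrioritySurface();
        incidentIor = current ? current->ior : airIor;
        transmitIor = s->ior;
    } else {
        incidentIor = s->ior;
        // The second-highest-priority surface provides the transmit side; if
        // the hit surface is the only entry, fall back to the empty default.
        InteriorList remaining = interior;
        remaining.remove(s);
        const Surface* next = remaining.highestPrioritySurface();
        transmitIor = next ? next->ior : airIor;
    }
}
</code></pre>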
<p>Figure 5 shows the same scenario as in Figure 3, but now with priority values assigned to each piece of geometry.
The path depicted in Figure 5 uses the priority-based interior list; dotted lines indicate parts of a surface that produce false hits due to being embedded within a higher-priority surface:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_diagram_new.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/nested_diagram_new.png" alt="Figure 5: The same setup as in Figure 3, but now using priority values. The path is calculated using a priority-based interior list." /></a></p>
<p>The empty default air surrounding everything is defined as having an infinitely high priority value, which means a lower priority than any surface in the scene.
Using the priority-based interior list, here are the events that occur at each intersection along the path in Figure 5:</p>
<ol>
<li>Enter Glass.
<ul>
<li>Glass priority (1) is higher than ambient air (infinite), so TRUE hit.</li>
<li>Incident/transmit IOR: Air/Glass.</li>
<li>True hit, so evaluate Bsdf and produce refraction event.</li>
<li>Interior list after event: (Glass:1). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Enter Water.
<ul>
<li>Water priority (2) is lower than highest priority in interior list (1), so FALSE hit.</li>
<li>Incident/transmit IOR: N/A.</li>
<li>False hit, so do not evaluate Bsdf and just continue straight.</li>
<li>Interior list after event: (Glass:1, Water:2). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Exit Glass.
<ul>
<li>Glass priority (1) is equal to the highest priority in interior list (1), so TRUE hit.</li>
<li>Incident/transmit IOR: Glass/Water.</li>
<li>True hit, so evaluate Bsdf and produce refraction event. Remove Glass from interior list.</li>
<li>Interior list after event: (Water:2). Inside surface after event: Water.</li>
</ul>
</li>
<li>Enter Ice.
<ul>
<li>Ice priority (0) is higher than the highest priority in interior list (2), so TRUE hit.</li>
<li>Incident/transmit IOR: Water/Ice.</li>
<li>True hit, so evaluate Bsdf and produce refraction event.</li>
<li>Interior list after event: (Water:2, Ice:0). Inside surface after event: Ice.</li>
</ul>
</li>
<li>Exit Water.
<ul>
<li>Water priority (2) is lower than highest priority in interior list (0), so FALSE hit.</li>
<li>Incident/transmit IOR: N/A.</li>
<li>False hit, so do not evaluate Bsdf and just continue straight. Remove Water from interior list.</li>
<li>Interior list after event: (Ice:0). Inside surface after event: Ice.</li>
</ul>
</li>
<li>Exit Ice.
<ul>
<li>Ice is the only surface left in the interior list, so TRUE hit.</li>
<li>Incident/transmit IOR: Ice/Air.</li>
<li>True hit, so evaluate Bsdf and produce refraction event. Remove Ice from interior list.</li>
<li>Interior list after event: empty. Inside surface after event: air.</li>
</ul>
</li>
<li>Enter Water.
<ul>
<li>Water priority (2) is higher than ambient air (infinite), so TRUE hit.</li>
<li>Incident/transmit IOR: Air/Water.</li>
<li>True hit, so evaluate Bsdf and produce refraction event.</li>
<li>Interior list after event: (Water:2). Inside surface after event: Water.</li>
</ul>
</li>
<li>Enter Glass.
<ul>
<li>Glass priority (1) is higher than the highest priority in interior list (2), so TRUE hit.</li>
<li>Incident/transmit IOR: Water/Glass.</li>
<li>True hit, so evaluate Bsdf and produce refraction event.</li>
<li>Interior list after event: (Water:2, Glass:1). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Exit Water.
<ul>
<li>Water priority (2) is lower than highest priority in interior list (1), so FALSE hit.</li>
<li>Incident/transmit IOR: N/A.</li>
<li>False hit, so do not evaluate Bsdf and just continue straight.</li>
<li>Interior list after event: (Glass:1). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Reflect off Glass.
<ul>
<li>Glass priority (1) is equal to the highest priority in interior list (1), so TRUE hit.</li>
<li>Incident/transmit IOR: Glass/Air.</li>
<li>True hit, so evaluate Bsdf and produce reflection event.</li>
<li>Interior list after event: (Glass:1). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Enter Water.
<ul>
<li>Water priority (2) is lower than highest priority in interior list (1), so FALSE hit.</li>
<li>Incident/transmit IOR: N/A.</li>
<li>False hit, so do not evaluate Bsdf and just continue straight.</li>
<li>Interior list after event: (Glass:1, Water:2). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Reflect off Glass.
<ul>
<li>Glass priority (1) is equal to the highest priority in interior list (1), so TRUE hit.</li>
<li>Incident/transmit IOR: Glass/Water.</li>
<li>True hit, so evaluate Bsdf and produce reflection event.</li>
<li>Interior list after event: (Glass:1, Water:2). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Exit Water.
<ul>
<li>Water priority (2) is lower than highest priority in interior list (1), so FALSE hit.</li>
<li>Incident/transmit IOR: N/A.</li>
<li>False hit, so do not evaluate Bsdf and just continue straight.</li>
<li>Interior list after event: (Glass:1). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Exit Glass.
<ul>
<li>Glass priority (1) is equal to the highest priority in interior list (1), so TRUE hit.</li>
<li>Incident/transmit IOR: Glass/Air.</li>
<li>True hit, so evaluate Bsdf and produce refraction event. Remove Glass from interior list.</li>
<li>Interior list after event: empty. Inside surface after event: air.</li>
</ul>
</li>
</ol>
<p>The entire above sequence of events is physically plausible, and produces no weird double-events!
Using priority-based nested dielectrics, Takua generates the correct images in Figure 1 and Figure 6.
Note how in Figure 6 below, the liquid appears to come right up against the glass, without any bright boundary artifacts or anything else.</p>
<p>For actually implementing priority-based nested dielectrics in a ray tracing renderer, I think there are two equally plausible places in the renderer where the implementation can take place.
The first and most obvious location is as part of the standard light transport integration or shading system.
The integrator would be in charge of checking for false hits and tracing continuation rays through false hit geometry.
A second, slightly less obvious location is actually as part of ray traversal through the scene itself.
Including handling of false hits in the traversal system can be more efficient than handling it in the integrator since the false hit checks could be done in the middle of a single BVH tree traversal, whereas handling false hits by firing continuation rays requires a new BVH tree traversal for each false hit encountered.
Also, handling false hits in the traversal system removes some complexity from the integrator.
However, the downside to handling false hits in the traversal system is that it requires plumbing all of the interior list data and logic into the traversal system, which sets up something of a weird backwards dependency between the traversal and shading/integration systems.
I wound up choosing to implement priority-based nested dielectrics in the integration system in Takua, simply to avoid having to do complex, weird plumbing back into the traversal system.
Takua uses priority-based nested dielectrics in all integrators, including unidirectional path tracing, BDPT, PPM, and VCM, and also uses the nested dielectrics system to handle transmittance along bidirectional connections through attenuating mediums.</p>
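<p>As a sketch of what the integrator-side approach can look like, false hit handling essentially becomes a small loop around the intersection call. The Ray, Hit, and Scene types below are hypothetical placeholders, and isTrueHit()/updateInteriorList() stand in for whatever interior list logic the renderer uses:</p>
<pre><code>struct Surface;
struct InteriorList;
bool isTrueHit(const InteriorList& interior, const Surface* s);
void updateInteriorList(InteriorList& interior, const Surface* s, bool entering);

struct Ray {
    // Origin and direction omitted; tMin/tMax bound the parametric range
    // that traversal is allowed to consider.
    float tMin = 0.0f;
    float tMax = 1e30f;
};

struct Hit {
    bool valid = false;
    bool frontSide = false;  // did the ray hit the front (entering) side?
    float t = 0.0f;          // parametric distance along the ray
    const Surface* surface = nullptr;
};

struct Scene {
    Hit intersect(const Ray& ray) const; // one full BVH traversal per call
};

// Keep re-intersecting until a true hit (or a miss) is found. Every false
// hit costs an additional BVH traversal, which is the efficiency argument
// for instead filtering out false hits inside the traversal system itself.
Hit traceNextTrueHit(const Scene& scene, Ray ray, InteriorList& interior) {
    const float kRayEpsilon = 1e-4f; // hypothetical offset to step past a hit
    while (true) {
        Hit hit = scene.intersect(ray);
        if (!hit.valid) return hit;
        if (isTrueHit(interior, hit.surface)) return hit;
        // False hit: record that the ray passed into or out of this surface,
        // then continue from just beyond the intersection point.
        updateInteriorList(interior, hit.surface, hit.frontSide);
        ray.tMin = hit.t + kRayEpsilon;
    }
}
</code></pre>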
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_new.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/nested_new.0.jpg" alt="Figure 6: The same tea in a glass cup scene as in Figure 4, rendered correctly using Takua's priority-based nested dielectrics implementation." /></a></p>
<p>Even though the technique has “nested <em>dielectrics</em>” in the title, this technique is not in principle limited to only dielectrics.
In Takua, I now use this technique to handle all transmissive cases, including for both dielectric surfaces and for surfaces with diffuse transmission.
Also, in addition to just determining the incident and transmit IORs, Takua uses this system to also determine things like what kind of participating medium a ray is currently inside of in order to calculate attenuation.
This technique appears to be more or less the industry standard today; implementations are available for at least <a href="https://rmanwiki.pixar.com/display/REN/Nested+Dielectrics">Renderman</a>, <a href="https://github.com/Psyop/jf-nested-dielectric">Arnold</a>, <a href="https://www.sidefx.com/docs/houdini/render/nested.html">Mantra</a>, and <a href="https://support.nextlimit.com/display/mxdocsv3/Nested+dielectrics">Maxwell Render</a>.</p>
<p>As a side note, during the course of this work, I also upgraded Takua’s attenuation system to use ratio tracking <a href="https://dl.acm.org/citation.cfm?id=2661292">[Novák et al. 2014]</a> instead of ray marching when doing volumetric lookups.
This change results in an important improvement to the attenuation system: ratio tracking provides an <em>unbiased</em> estimate of transmittance, whereas ray marching is inherently biased due to being a quadrature-based technique.</p>
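<p>For reference, here’s a tiny sketch of a basic ratio tracking transmittance estimator in the style of [Novák et al. 2014]; the extinction function and the majorant are hypothetical inputs, and this leaves out the residual part of residual ratio tracking entirely:</p>
<pre><code>#include <cmath>
#include <random>

// Unbiased estimate of transmittance from t0 to t1 along a ray through a
// heterogeneous medium, given sigmaT(t) (extinction at distance t along the
// ray) and a majorant sigmaMaj that bounds sigmaT from above everywhere.
template <typename ExtinctionFn>
float ratioTrackingTransmittance(ExtinctionFn sigmaT, float t0, float t1,
                                 float sigmaMaj, std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    float transmittance = 1.0f;
    float t = t0;
    while (true) {
        // Free-flight sample a tentative collision against the majorant.
        t -= std::log(1.0f - uniform(rng)) / sigmaMaj;
        if (t >= t1) break;
        // Weight by the probability that the tentative collision was a null
        // collision; the running product is the transmittance estimate.
        transmittance *= 1.0f - sigmaT(t) / sigmaMaj;
    }
    return transmittance;
}
</code></pre>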
<p>Figures 7 and 8 show a fancier scene of liquid pouring into a glass with some ice cubes and such.
This scene is the Glass of Water scene from <a href="https://benedikt-bitterli.me">Benedikt Bitterli</a>’s rendering resources page <a href="https://benedikt-bitterli.me/resources/">[Bitterli 2016]</a>, modified with brighter lighting on a white backdrop and with red liquid.
I also had to modify the scene so that the liquid overlaps the glass slightly; providing a clearer read for the liquid-glass interface is why I made the liquid red.
One of the neat features of this scene is the cracks modeled <em>inside</em> of the ice cubes; the cracks are non-manifold geometry.
To render them correctly, I applied a shader with glossy refraction to the crack geometry but did not set a priority value for them; this works correctly because the cracks, being non-manifold, don’t have a concept of inside or outside anyway, so they should not participate in any interior list considerations.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/waterpour.cam0.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/waterpour.cam0.0.jpg" alt="Figure 7: Cranberry juice pouring into a glass with ice cubes, rendered using Takua's priority-based nested dielectrics. The scene is from Benedikt Bitterli's rendering resources page." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/waterpour.cam1.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/waterpour.cam1.0.jpg" alt="Figure 8: A different camera angle of the scene from Figure 7. The scene is from Benedikt Bitterli's rendering resources page." /></a></p>
<p><strong>References</strong></p>
<p>Benedikt Bitterli. 2016. <a href="https://benedikt-bitterli.me/resources/">Rendering Resources</a>. Retrieved from <a href="https://benedikt-bitterli.me/resources/">https://benedikt-bitterli.me/resources/</a>.</p>
<p>Jan Novák, Andrew Selle and Wojciech Jarosz. 2014. <a href="https://dl.acm.org/citation.cfm?id=2661292">Residual Ratio Tracking for Estimating Attenuation in Participating Media</a>. <em>ACM Transactions on Graphics</em>. 33, 6 (2014), 179:1-179:11.</p>
<p>Charles M. Schmidt and Brian Budge. 2002. <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2002.10487555">Simple Nested Dielectrics in Ray Traced Images</a>. <em>Journal of Graphics Tools</em>. 7, 2 (2002), 1–8.</p>
<p><strong>Some Blog Update Notes</strong></p>
<p>For the past few years, my blog posts covering personal work have trended towards ginormous epic articles tackling huge subjects published only once or twice a year, such as with the <a href="https://blog.yiningkarlli.com/2018/10/bidirectional-mipmap.html">bidirectional mipmapping post</a> and its promised but still unfinished part 2.
Unfortunately, I’m not the fastest writer when working on huge posts, since writing those posts often involves significant learning and multiple iterations of implementation and testing on my part.
Over the next few months, I’m aiming to write more posts similar to this one, covering some relatively smaller topics, so that I can get posts coming out a bit more frequently while I continue to work on several upcoming, ginormous posts on long-promised topics.
Or at least, that’s the plan… we’ll see!</p>
https://blog.yiningkarlli.com/2018/11/wir2.html
Ralph Breaks the Internet
2018-11-15T00:00:00+00:00
2018-11-15T00:00:00+00:00
Yining Karl Li
<p>The <a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a> film for 2018 is <a href="https://disneyanimation.com/projects/ralphbreakstheinternet2">Ralph Breaks the Internet</a>, which is the sequel to 2012’s <a href="https://disneyanimation.com/projects/wreckitralph">Wreck-It Ralph</a>.
Over the past two years, I’ve been fortunate enough to work on a number of improvements to Disney’s <a href="http://www.disneyanimation.com/technology/innovations/hyperion">Hyperion Renderer</a> for Ralph Breaks the Internet; collectively, these improvements make up perhaps the biggest jump in rendering capabilities that Hyperion has seen since the original deployment of Hyperion on <a href="https://disneyanimation.com/projects/bighero6">Big Hero 6</a>.
I got my third Disney Animation credit on Ralph Breaks the Internet!</p>
<p>Over the past two years, the Hyperion team has publicly presented a number of major development efforts and research advancements.
Many of these advancements were put into experimental use on <a href="https://blog.yiningkarlli.com/2017/11/olafs-frozen-adventure.html">Olaf’s Frozen Adventure</a> last year, but Ralph Breaks the Internet is the first time we’ve put all of these new capabilities and features into full-scale production together.
I was fortunate enough to be fairly deeply involved in several of these efforts (specifically, traversal improvements and volume rendering).
One of my favorite things about working at Disney Animation is how production and technology partner together to make our films; we truly would not have been able to pull off any of Hyperion’s new advancements without production’s constant support and willingness to try new things in the name of advancing the artistry of our films.</p>
<p>Ralph Breaks the Internet is our first feature film to use Hyperion’s new spectral and decomposition tracking <a href="https://doi.org/10.1145/3072959.3073665">[Kutz et al. 2017]</a> based null-collision volume rendering system exclusively.
Originally we had planned to use the new volume rendering system side-by-side with Hyperion’s previous residual ratio tracking <a href="https://doi.org/10.1145/2661229.2661292">[Novák 2014]</a> based volume rendering system <a href="https://doi.org/10.1145/3084873.3084907">[Fong 2017]</a>, but the results from the new system were so compelling that the show decided to switch over to the new volume rendering exclusively, which in turn allowed us to deprecate and remove the old volume rendering system ahead of schedule.
This new volume rendering system is the culmination of two years of work from Ralf Habel, Peter Kutz, Patrick Kelly, and myself.
We had the enormous privilege of working with a large number of FX and lighting artists to develop, test, and refine this new system; specifically, I want to call out Jesse Erickson, Henrik Falt, and Alex Nijmeh for really championing the new volume rendering system and encouraging and supporting its development.
We also owe an enormous amount to the rest of the Hyperion development team, which gave us the time and resources to spend two years building a new volume rendering system essentially from scratch.
Finally, I want to underscore that the research that underpins our new volume rendering system was conducted jointly between us and Disney Research Zürich, and that this could not have happened without our colleagues at Disney Research Zürich (specifically, Jan Novák and Marios Papas); I think this entire project has been a huge shining example of the value and importance of having a dedicated blue-sky research division.
Every explosion and cloud and dust plume and every bit of fog and atmospherics you see in Ralph Breaks the Internet was rendered using the new volume rendering system!
Interestingly, we actually found that while the new volume rendering system is much faster and much more efficient at rendering dense volumes (and especially volumes with lots of high-order scattering) compared to the old system, the new system actually has some difficulty rendering thin volumes such as mist and atmospheric fog.
This isn’t surprising, since thin volumes benefit more from good transmittance sampling than from good distance sampling, and null-collision volume rendering is really optimized for distance sampling.
We were able to work with production to come up with workarounds for this problem on Ralph Breaks the Internet, but this area is definitely a good topic for future research.</p>
<p>Ralph Breaks the Internet is also our first feature film to move to exclusively using brute force path-traced subsurface scattering <a href="https://doi.org/10.1145/2897839.2927433">[Chiang 2016]</a> for all characters, as a replacement for Hyperion’s previous normalized diffusion based subsurface scattering <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015]</a>.
This feature was tested on Olaf’s Frozen Adventure in a limited capacity, but Ralph Breaks the Internet is the first time we’ve switched path-traced subsurface to being the default subsurface mode in the renderer.
Matt Chiang, Peter Kutz, and Brent Burley put a lot of effort into developing new sampling techniques to reduce color noise in subsurface scattering, and also into developing a new parameterization that closely matched Hyperion’s normalized diffusion parameterization, which allowed artists to basically just flip a switch between normalized diffusion and path-traced subsurface and get a predictable, similar result.
Many more details on Hyperion’s path-traced subsurface implementation are in our recent system architecture paper <a href="https://dl.acm.org/citation.cfm?id=3182159">[Burley 2018]</a>.
In addition to making characters we already know, such as Ralph and Vanellope, look better and more detailed, path-traced subsurface scattering also proved critical to hitting the required looks for new characters, such as the slug-like Double Dan character.</p>
<p>When Ralph and Vanellope first enter the world of the internet, there are several establishing shots showing vast vistas of the enormous infinite metropolis that the film depicts the internet as.
Early in production, some render tests of the internet metropolis proved to be extremely challenging due to the sheer amount of geometry in the scene.
Although instancing was used extensively, the way the scenes had to be built in our production pipeline meant that Hyperion wasn’t able to leverage the instancing in the scene as efficiently as we would have liked.
Additionally, the way the instance groups were structured made traversal in Hyperion less ideal than it could have been.
After encountering smaller-scale versions of the same problems on Moana, Peter Kutz and I had arrived at an idea that we called “multiple entry points”, which basically lets Hyperion blur the lines between top and bottom level BVHs in a two-level BVH structure.
By inserting mid-level nodes from bottom level BVHs into the top-level BVH, Hyperion can produce a much more efficient top-level BVH, dramatically accelerating rendering of large instance groups and other difficult-to-split pieces of large geometry, such as groundplanes.
This idea is very similar to BVH rebraiding <a href="https://doi.org/10.1145/3105762.3105776">[Benthin et al. 2017]</a>, but we arrived at our approach independently before the publication of BVH rebraiding.
After initial testing on Olaf’s Frozen Adventure proved promising, we enabled multiple entry points by default for the entirety of Ralph Breaks the Internet.
Additionally, Dan Teece developed a powerful automatic geometry de-duplication system, which allows Hyperion to reclaim large amounts of memory in cases where multiple instance groups are authored with separate copies of the same master geometry.
Greg Nichols and I also developed a new multithreading strategy for handling Hyperion’s ultra-wide batched ray traversal, which significantly improved Hyperion’s multithreaded scalability during traversal to near-linear scaling with number of cores.
All of these geometry and traversal improvements collectively meant that by the main production push for the show, render times for the large internet vista shots had dropped from being by far the highest in the show to being indistinguishable from any other normal shot.
These improvements also proved to be timely, since the internet set was just the beginning of massive-scale geometry and instancing on Ralph Breaks the Internet; solving the render efficiency problems for the internet set also made other large-scale instancing sequences, such as the Ralphzilla battle <a href="https://doi.org/10.1145/3306307.3328179">[Byun et al. 2019]</a> at the end of the film and the massive crowds <a href="https://doi.org/10.1145/3306307.3328185">[Richards et al. 2019]</a> in the internet, easier to render.</p>
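<p>As a purely illustrative aside (this is not Hyperion’s implementation, just a toy sketch of the general idea under assumed node types), the heart of a multiple-entry-points style approach is opening each instance’s bottom level BVH by a few levels and letting the top-level build treat the resulting subtrees as separate primitives:</p>
<pre><code>#include <vector>

struct BVHNode {
    // Bounds and leaf payload omitted for brevity in this toy sketch.
    const BVHNode* left = nullptr;
    const BVHNode* right = nullptr;
    bool isLeaf() const { return left == nullptr && right == nullptr; }
};

// A top-level primitive that points at a node *inside* a bottom-level BVH,
// rather than only ever at its root.
struct EntryPoint {
    const BVHNode* node;
    int instanceId; // which instance transform this subtree belongs to
};

// Open the bottom-level BVH up to maxDepth levels and emit one entry point
// per resulting subtree; the top-level build then gets tighter bounds to
// work with for large or awkwardly shaped instances and groundplanes.
void collectEntryPoints(const BVHNode* node, int instanceId, int maxDepth,
                        std::vector<EntryPoint>& out) {
    if (node == nullptr) return;
    if (maxDepth == 0 || node->isLeaf()) {
        out.push_back({node, instanceId});
        return;
    }
    collectEntryPoints(node->left, instanceId, maxDepth - 1, out);
    collectEntryPoints(node->right, instanceId, maxDepth - 1, out);
}
</code></pre>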
<p>Another major advancement we made on Ralph Breaks the Internet, in collaboration with Disney Research Zürich and our sister studio Pixar Animation Studios, is a new machine-learning based denoiser.
To the best of my knowledge, Disney Animation was one of the first studios with a successful widescale deployment of a production denoiser on Big Hero 6.
The Hyperion denoiser used from Big Hero 6 through Olaf’s Frozen Adventure is a hand-tuned denoiser based on and influenced by <a href="https://doi.org/10.1145/2366145.2366213">[Li et al. 2012]</a> and <a href="https://doi.org/10.1111/cgf.12219">[Rousselle et al. 2013]</a>, and has since been adopted by the Renderman team as the production denoiser that ships with Renderman today.
Midway through production on Ralph Breaks the Internet, David Adler from the Hyperion team, in collaboration with Fabrice Rousselle, Jan Novák, Gerhard Röthlin, and others from Disney Research Zürich, was able to deploy a new, next-generation machine-learning based denoiser <a href="https://doi.org/10.1145/3197517.3201388">[Vogels et al. 2018]</a>.
Developed primarily by Disney Research Zürich, the new machine-learning denoiser allowed us to cut render times by up to 75% in some cases.
This example is yet another case of basic scientific research at Disney Research leading to unexpected but enormous benefits to production in all of the wider Walt Disney Company’s various animation studios!</p>
<p>In addition to everything above, many more smaller improvements were made in all areas of Hyperion for Ralph Breaks the Internet. Dan Teece developed a really cool “edge” shader module, which was used to create all of the silhouette edge glows in the internet world, and Dan also worked closely with FX artists to develop render-side support for various fracture and destruction workflows <a href="https://doi.org/10.1145/3214745.3214814">[Harrower et al. 2018]</a>. Brent Burley developed several improvements to Hyperion’s depth of field support, including a realistic cat’s eye bokeh effect.
Finally, as always, the production of Ralph Breaks the Internet has inspired many more future improvements to Hyperion that I can’t write about yet, since they haven’t been published yet.</p>
<p>The original Wreck-It Ralph is one of my favorite modern Disney movies, and I think Ralph Breaks the Internet more than lives up to the original.
The film is smart and hilarious while maintaining the depth that made the first Wreck-It Ralph so good.
Ralph and Vanellope are just as lovable as before and grow further as characters, and all of the new characters are really awesome (Shank and Yesss and the film’s take on the Disney princesses are particular favorites of mine).
More importantly for a rendering blog though, the film is also just gorgeous to look at.
With every film, the whole studio takes pride in pushing the envelope even further in terms of artistry, craftsmanship, and sheer visual beauty.
The number of environments and settings in Ralph Breaks the Internet is enormous and highly varied; the internet is depicted as a massive city that pushed the limits on how much visual complexity we can render (and from our previous three feature films, we can already render an unbelievable amount!), old locations from the first Wreck-It Ralph are revisited with exponentially more visual detail and richness than before, and there’s even a full on musical number with theatrical lighting somewhere in there!</p>
<p>Below are some stills from the movie, in no particular order, 100% rendered using Hyperion.
If you want to see more, or if you just want to see a really great movie, go see Ralph Breaks the Internet on the biggest screen you can find!
There are a TON of easter eggs in the film to look out for, and I highly recommend sticking around after the credits for this one.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_00.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_00.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_01.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_01.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_02.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_03.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_03.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_04.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_05.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_05.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_06.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_06.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_07.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_07.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_08.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_08.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_09.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_09.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_37.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_37.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_10.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_10.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_11.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_11.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_12.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_12.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_46.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_46.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_13.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_13.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_14.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_14.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_15.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_15.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_41.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_41.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_16.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_16.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_17.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_17.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_28.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_28.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_29.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_29.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_32.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_32.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_31.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_31.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_18.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_18.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_19.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_19.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_20.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_20.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_22.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_22.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_30.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_30.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_23.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_23.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_24.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_24.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_38.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_38.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_25.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_25.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_26.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_26.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_21.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_21.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_33.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_33.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_48.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_48.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_45.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_45.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_34.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_34.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_35.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_35.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_36.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_36.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_39.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_39.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_40.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_40.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_42.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_42.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_43.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_43.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_44.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_44.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_47.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_47.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_27.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_27.jpg" alt="" /></a></p>
<p>Here is the part of the credits with Disney Animation’s rendering team!
Also, Ralph Breaks the Internet was my wife Harmony Li’s first credit at Disney Animation (she previously was at Pixar)!
This frame is kindly provided by Disney.
Every person you see in the credits worked really hard to make Ralph Breaks the Internet an amazing film!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_credits.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_credits.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p><strong>References</strong></p>
<p>Carsten Benthin, Sven Woop, Ingo Wald, and Attila T. Áfra. 2017. <a href="https://doi.org/10.1145/3105762.3105776">Improved Two-Level BVHs using Partial Re-Braiding</a>. In <em>HPG ‘17 (Proceedings of High Performance Graphics)</em>. 7:1-7:8.</p>
<p>Brent Burley. <a href="https://doi.org/10.1145/2776880.2787670">Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering</a>. 2015. In <em>ACM SIGGRAPH 2015 Course Notes: <a href="https://blog.selfshadow.com/publications/s2015-shading-course">Physically Based Shading in Theory and Practice</a></em>.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182159">The Design and Evolution of Disney’s Hyperion Renderer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 33:1-33:22.</p>
<p>Dong Joo Byun, Alberto Luceño Ros, Alexander Moaveni, Marc Bryant, Joyce Le Tong, and Moe El-Ali. 2019. <a href="https://doi.org/10.1145/3306307.3328179">Creating Ralphzilla: Moshpit, Skeleton Library and Automation Framework</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 66:1-66:2.</p>
<p>Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. <a href="https://doi.org/10.1145/2897839.2927433">Practical and Controllable Subsurface Scattering for Production Path Tracing</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 49:1-49:2.</p>
<p>Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. <a href="https://doi.org/10.1145/3084873.3084907">Production Volume Rendering</a>. In <em>ACM SIGGRAPH 2017 Courses</em>.</p>
<p>Will Harrower, Pete Kyme, Ferdi Scheepers, Michael Rice, Marie Tollec, and Alex Moaveni. 2018. <a href="https://doi.org/10.1145/3214745.3214814">SimpleBullet: Collaborating on a Modular Destruction Toolkit</a>. In <em>ACM SIGGRAPH 2018 Talks</em>. 79:1-79:2.</p>
<p>Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. <a href="https://doi.org/10.1145/3072959.3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 111:1-111:16.</p>
<p>Tzu-Mao Li, Yu-Ting Wu, and Yung-Yu Chuang. 2012. <a href="https://doi.org/10.1145/2366145.2366213">SURE-based Optimization for Adaptive Sampling and Reconstruction</a>. <em>ACM Transactions on Graphics</em>. 31, 6 (2012), 194:1-194:9.</p>
<p>Jan Novák, Andrew Selle, and Wojciech Jarosz. 2014. <a href="https://doi.org/10.1145/2661229.2661292">Residual Ratio Tracking for Estimating Attenuation in Participating Media</a>. <em>ACM Transactions on Graphics</em>. 33, 6 (2014), 179:1-179:11.</p>
<p>Josh Richards, Joyce Le Tong, Moe El-Ali, and Tuan Nguyen. 2019. <a href="https://doi.org/10.1145/3306307.3328185">Optimizing Large Scale Crowds in Ralph Breaks the Internet</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 65:1-65:2.</p>
<p>Fabrice Rousselle, Marco Manzi, and Matthias Zwicker. 2013. <a href="https://doi.org/10.1111/cgf.12219">Robust Denoising using Feature and Color Information</a>. <em>Computer Graphics Forum</em>. 32, 7 (2013), 121-130.</p>
<p>Thijs Vogels, Fabrice Rousselle, Brian McWilliams, Gerhard Röthlin, Alex Harvill, David Adler, Mark Meyer, and Jan Novák. 2018. <a href="https://doi.org/10.1145/3197517.3201388">Denoising with Kernel Prediction and Asymmetric Loss Functions</a>. <em>ACM Transactions on Graphics</em>. 37, 4 (2018), 124:1-124:15.</p>
https://blog.yiningkarlli.com/2018/10/bidirectional-mipmap.html
Mipmapping with Bidirectional Techniques
2018-10-25T00:00:00+00:00
2018-10-25T00:00:00+00:00
Yining Karl Li
<p>One major feature that differentiates production-capable renderers from hobby or research renderers is a texture caching system.
A well-implemented texture caching system is what allows a production renderer to render scenes with potentially many TBs of textures, but in a reasonable footprint that fits in a few dozen or a hundred-ish GB of RAM.
Pretty much every production renderer today has a robust texture caching system; Arnold famously derives a significant amount of performance from an extremely efficient texture cache implementation, and Vray/Corona/Renderman/Hyperion/etc. all have their own, similarly efficient systems.</p>
<p>In this post and the next few posts, I’ll write about how I implemented a tiled, mipmapped texture caching system in my hobby renderer, Takua Renderer.
I’ll also discuss some of the interesting challenges I ran into along the way.
This post will focus on the mipmapping part of the system.
Building a tiled mipmapping system that works well with bidirectional path tracing techniques was particularly difficult, for reasons I’ll discuss later in this post.
I’ll also review the academic literature on ray differentials and mipmapping with path tracing, and I’ll take a look at what several different production renderers do.
The scene I’ll use as an example in this post is a custom recreation of a forest scene from Evermotion’s Archmodels 182, rendered entirely using Takua Renderer (of course):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest.cam0.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest.cam0.0.jpg" alt="Figure 1: A forest scene in the morning, rendered using Takua Renderer. 6 GB of textures on disk accessed using a 1 GB in-memory texture cache." /></a></p>
<p><strong>Intro: Texture Caches and Mipmaps</strong></p>
<p>Texture caching is typically coupled with some form of a tiled, mipmapped <a href="https://dl.acm.org/citation.cfm?id=801126">[Williams 1983]</a> texture system; the texture cache holds specific tiles of an image that were accessed, as opposed to an entire texture.
These tiles are typically lazy-loaded on demand into a cache <a href="https://graphics.pixar.com/library/TOD/">[Peachey 1990]</a>, which means the renderer only needs to pay the memory storage cost for the parts of a texture that it actually accesses.</p>
<p>The remainder of this section and the next section of this post are a recap of what mipmaps are, mipmap level selection, and ray differentials for the less experienced reader.
I also discuss a bit about what techniques various production renderers are known to use today.
If you already know all of this stuff, I’d suggest skipping down a bit to the section titled “Ray Differentials and Bidirectional Techniques”.</p>
<p>Mipmapping works by creating multiple resolutions of a texture, and for a given surface, only loading the last resolution level where the frequency detail falls below the Nyquist limit when viewed from the camera.
Since textures are often much more high resolution than the final framebuffer resolution, mipmapping means the renderer can achieve huge memory savings, since for objects further away from the camera, most loaded mip levels will be significantly lower resolution than the original texture.
Mipmaps start with the original full resolution texture as “level 0”, and then each level going up from level 0 is half the resolution of the previous level.
The highest level is the level at which the texture can no longer be halved in resolution again.</p>
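<p>Put into code, the length of a full mip chain is just the number of times the resolution can be halved before both dimensions reach a single texel; here’s a quick sketch (not any particular renderer’s API):</p>
<pre><code>#include <algorithm>

int mipLevelCount(int width, int height) {
    int levels = 1; // level 0 is the original full-resolution texture
    while (width > 1 || height > 1) {
        width = std::max(width / 2, 1);
        height = std::max(height / 2, 1);
        ++levels;
    }
    return levels;
}
// For example, a 4096x4096 texture has 13 levels, from 4096x4096 down to 1x1.
</code></pre>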
<p>Below is an example of a mipmapped texture.
The texture below is the diffuse albedo texture for the fallen log that is in the front of the scene in Figure 1, blocking off the path into the woods.
On the left side of Figure 2 is level 1 of this texture (I have omitted level 0 both for image size reasons and because the original texture is from a commercial source, which I don’t have the right to redistribute in full resolution).
On the right side, going from the top on down, are levels 2 through 11 of the mipmap.
I’ll talk about the “tiled” part in a later post.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/texture_miplevels.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/texture_miplevels.jpg" alt="Figure 2: A mipmapped texture. Level 1 of the mipmap is shown on the left, levels 2 through 11 are shown on the right. Level 0 is not shown here. A bit of terminology that is often confusing: the lowest mipmap level is the highest resolution level, while the highest mipmap level is the lowest resolution level." /></a></p>
<p>Before diving into details, I need to make a major note: I’m not going to write too much about texture filtering for now, mainly because I haven’t done much with texture filtering in Takua at all.
Mipmapping was originally invented as an elegant solution to the problem of expensive texture filtering in rasterized rendering; when a texture had detail that was more high frequency than the distance between neighboring pixels in the framebuffer, aliasing would occur when the texture was sampled.
Mip levels above level 0 are typically generated with pre-computed filtering of the original full resolution texture, allowing single texture samples to appear antialiased.
For a comprehensive discussion of texture filtering, how it relates to mipmaps, and more advanced techniques, see <a href="http://www.pbr-book.org/3ed-2018/Texture/Image_Texture.html#MIPMaps">section 10.4.3 in Physically Based Rendering 3rd Edition</a> <a href="http://www.pbr-book.org">[Pharr et al. 2016]</a>.</p>
<p>For now, Takua just uses a point sampler for all texture filtering; my interest in mipmaps is mostly for memory efficiency and texture caching instead of filtering.
My thinking is that in a path tracer that is going to generate hundreds or even thousands of paths for each framebuffer pixel, the need for single-sample antialiasing becomes somewhat lessened, since we’re already basically supersampling.
Good texture filtering is still ideal of course, but being lazy and just relying on supersampling to get rid of texture aliasing in primary visibility is… not necessarily the worst short-term solution in the world.
Furthermore, relying on just point sampling means each texture sample only requires two texture lookups: one from the integer mip level above and one from the integer mip level below the continuous float mip level at a sample point (see the next section for more on this).
Using only two texture lookups per texture sample is highly efficient due to minimized memory access and minimized branching in the code.
Interestingly, the Moonray team at Dreamworks Animation arrived at more or less the same conclusion <a href="https://dl.acm.org/citation.cfm?doid=3105762.3105768">[Lee et al. 2017]</a>; they point out in their paper that geometric complexity, for all intents and purposes, has an infinite frequency, whereas pre-filtered mipmapped textures are already band limited.
As a result, the number of samples required to resolve geometric aliasing should be more than enough to also resolve any texture aliasing.
The Moonray team found that this approach works well enough to be their default mode in production.</p>
<p><strong>Mipmap Level Selection and Ray Differentials</strong></p>
<p>The trickiest part of using mipmapped textures is figuring out what mipmap level to sample at any given point.
Since the goal is to find a mipmap level with a frequency detail as close to the texture sampling rate as possible, we need to have a sense of what the texture sampling rate at a given point in space relative to the camera will be.
More precisely, we want the differential of the surface parameterization (a.k.a. how uv space is changing) with respect to the image plane.
Since the image plane is two-dimensional, we will end up with a differential for each uv axis with respect to each axis of the image plane; we call these differentials dudx/dvdx and dudy/dvdy, where u/v are uv coordinates and x/y are image plane pixel coordinates.
Calculating these differentials is easy enough in a rasterizer: for each image plane pixel, take the texture coordinates of the fragment, subtract from them the texture coordinates of the neighboring fragments to get the gradient of the texture coordinates with respect to the image plane (a.k.a. screen space), and then scale by the texture resolution.
Once we have dudx/dvdx and dudy/dvdy, for a non-fancy box filter all we have to do to get the mipmap level is take the largest of these gradients and calculate its base-2 logarithm.
A code snippet might look something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float mipLevelFromDifferentialSurface(const float dudx,
const float dvdx,
const float dudy,
const float dvdy,
const int maxMipLevel) {
float width = max(max(dudx, dvdx), max(dudy, dvdy));
float level = float(maxMipLevel) + log2(width);
return level;
}
</code></pre></div></div>
<p>Notice that the level value is a continuous float.
Usually, instead of rounding level to an integer, a better approach is to sample both of the integer mipmap levels above and below the continuous level and blend between the two values using the fractional part of level.
Doing this blending helps immensely with smoothing transitions between mipmap levels, which can become very important when rendering an animated sequence with camera movement.</p>
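<p>Putting the level calculation and the blending together, a point-sampled lookup might look something like the following sketch; here, Texture and sampleMipLevel() are hypothetical stand-ins for whatever texture representation and single-level point sampling function a renderer actually has:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: point sample a texture by blending between the two integer mip
// levels bracketing the continuous level value. sampleMipLevel() is a
// hypothetical function that point samples one integer mip level at the
// given uv coordinates.
vec4 sampleTextureWithLevelBlend(const Texture& texture, const vec2& uv, const float level) {
    float clamped = clamp(level, 0.0f, float(texture.maxMipLevel));
    int lower = int(floor(clamped));
    int upper = min(lower + 1, texture.maxMipLevel);
    float blend = clamped - float(lower);
    vec4 lowerSample = sampleMipLevel(texture, uv, lower);
    vec4 upperSample = sampleMipLevel(texture, uv, upper);
    // Linearly interpolate using the fractional part of the continuous level
    return (1.0f - blend) * lowerSample + blend * upperSample;
}
</code></pre></div></div>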
<p>In a ray tracer, however, figuring out dudx/dvdx and dudy/dvdy is not as easy as in a rasterizer.
If we are only considering primary rays, we can do something similar to the rasterization case: fire a ray from a given pixel and fire rays from the neighboring pixels, and calculate the gradient of the texture coordinates with respect to screen space (the screen space partial derivatives) by examining the hit points of each neighboring ray that hits the same surface as the primary ray.
This approach rapidly falls apart though, for the following reasons and more:</p>
<ul>
<li>If a ray hits a surface but none of its neighboring rays hit the same surface, then we can’t calculate any differentials and must fall back to point sampling the lowest mip level. This isn’t a problem in the rasterization case, since rasterization will run through all of the polygons that make up a surface, but in the ray tracing case, we only know about surfaces that we actually hit with a ray.</li>
<li>For secondary rays, we would need to trace secondary bounces not just for a given pixel’s ray, but also its neighboring rays. Doing so would be necessary since, depending on the bsdf at a given surface, the distance between the main ray and its neighbor rays can change arbitrarily. Tracing this many additional rays quickly becomes prohibitively expensive; for example, if we are considering four neighbors per pixel, we are now tracing five times as many rays as before.</li>
<li>We would also have to continue to guarantee that neighbor secondary rays continue hitting the same surface as the main secondary ray, which will become arbitrarily difficult as bxdf lobes widen or narrow.</li>
</ul>
<p>A better solution to these problems is to use <em>ray differentials</em> <a href="https://graphics.stanford.edu/papers/trd/">[Igehy 1999]</a>, which is more or less just a ray along with the partial derivative of the ray with respect to screen space.
Thinking of a ray differential as essentially similar to a ray with a width or a cone, similar to beam tracing <a href="https://dl.acm.org/citation.cfm?id=808588">[Heckbert and Hanrahan 1984]</a>, pencil tracing <a href="https://dl.acm.org/citation.cfm?id=37408">[Shinya et al. 1987]</a>, or cone tracing <a href="https://dl.acm.org/citation.cfm?id=808589">[Amanatides 1984]</a>, is not entirely incorrect, but ray differentials are a bit more nuanced than any of the above.
With ray differentials, instead of tracing a bunch of independent neighbor rays with each camera ray, the idea is to reconstruct dudx/dvdx and dudy/dvdy at each hit point using simulated offset rays that are reconstructed from the ray’s partial derivative.
Ray differentials are generated alongside camera rays; when a ray is traced from the camera, offset rays are generated for a single neighboring pixel vertically and a single neighboring pixel horizontally in the image plane.
Instead of tracing these offset rays independently, however, we always assume they are at some angular width from the main ray.
When the main ray hits a surface, we need to calculate, for later use, the differentials of the surface at the intersection point with respect to uv space, which are called dpdu and dpdv.
Different surface types will require different functions to calculate dpdu and dpdv; for a triangle in a triangle mesh, the code requires the position and uv coordinates at each vertex:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DifferentialSurface calculateDifferentialSurfaceForTriangle(const vec3& p0,
const vec3& p1,
const vec3& p2,
const vec2& uv0,
const vec2& uv1,
const vec2& uv2) {
vec2 duv02 = uv0 - uv2;
vec2 duv12 = uv1 - uv2;
float determinant = duv02[0] * duv12[1] - duv02[1] * duv12[0];
vec3 dpdu, dpdv;
vec3 dp02 = p0 - p2;
vec3 dp12 = p1 - p2;
if (abs(determinant) == 0.0f) {
vec3 ng = normalize(cross(p2 - p0, p1 - p0));
if (abs(ng.x) > abs(ng.y)) {
dpdu = vec3(-ng.z, 0, ng.x) / sqrt(ng.x * ng.x + ng.z * ng.z);
} else {
dpdu = vec3(0, ng.z, -ng.y) / sqrt(ng.y * ng.y + ng.z * ng.z);
}
dpdv = cross(ng, dpdu);
} else {
float invdet = 1.0f / determinant;
dpdu = (duv12[1] * dp02 - duv02[1] * dp12) * invdet;
dpdv = (-duv12[0] * dp02 + duv02[0] * dp12) * invdet;
}
return DifferentialSurface(dpdu, dpdv);
}
</code></pre></div></div>
<p>Calculating surface differentials does add a small bit of overhead to the renderer, but the cost can be minimized with some careful work.
The naive approach to surface differentials is to calculate them with every intersection point and return them as part of the hit point information that is produced by ray traversal.
However, this computation is wasted if the shading operation for a given hit point doesn’t actually end up doing any texture lookups.
In Takua, surface differentials are calculated on demand at texture lookup time instead of at ray intersection time; this way, we don’t have to pay the computational cost for the above function unless we actually need to do texture lookups.
Takua also supports multiple uv sets per mesh, so the above function is parameterized by uv set ID, and the function is called once for each uv set that a texture specifies.
Surface differentials are also cached within a shading operation per hit point, so if a shader does multiple texture lookups within a single invocation, the required surface differentials don’t need to be redundantly calculated.</p>
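<p>As a rough illustration of the on-demand caching idea, the surface differential calculation can be wrapped in a small per-hit-point cache keyed by uv set ID; the HitPoint and ShadingCache types in this sketch are hypothetical stand-ins rather than Takua’s actual interfaces:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: lazily compute and cache surface differentials per hit point, keyed
// by uv set ID, so repeated texture lookups within one shader invocation don't
// redo the work. HitPoint and ShadingCache are illustrative stand-ins.
struct ShadingCache {
    std::unordered_map<int, DifferentialSurface> surfaceDifferentials;
};

const DifferentialSurface& getDifferentialSurface(const HitPoint& hit,
                                                  const int uvSetID,
                                                  ShadingCache& cache) {
    auto iter = cache.surfaceDifferentials.find(uvSetID);
    if (iter != cache.surfaceDifferentials.end()) {
        return iter->second;
    }
    // Only pay for the calculation the first time a texture lookup needs it
    DifferentialSurface ds = calculateDifferentialSurfaceForTriangle(
        hit.p0, hit.p1, hit.p2,
        hit.uv0(uvSetID), hit.uv1(uvSetID), hit.uv2(uvSetID));
    return cache.surfaceDifferentials.emplace(uvSetID, ds).first->second;
}
</code></pre></div></div>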
<p>Sony Imageworks’ variant of Arnold (we’ll refer to it as SPI Arnold to disambiguate from Solid Angle’s Arnold) does something even more advanced <a href="https://dl.acm.org/citation.cfm?id=3180495">[Kulla et al. 2018]</a>.
Instead of the above explicit surface differential calculation, SPI Arnold implements an automatic differentiation system utilizing dual arithmetic <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2004.10504901">[Piponi 2004]</a>.
SPI Arnold extensively utilizes OSL for shading; this means that they are able to trace at runtime what dependencies a particular shader execution path requires, and therefore when a shader needs any kind of derivative or differential information.
The calls to the automatic differentiation system are then JITed into the shader’s execution path, meaning shader authors never have to be aware of how derivatives are computed in the renderer.
The SPI Arnold team’s decision to use dual arithmetic based automatic differentiation is influenced by lessons they had previously learned with BMRT’s finite differencing system, which required lots of extraneous shading computations for incoherent ray tracing <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.1996.10487462">[Gritz and Hahn 1996]</a>.
At least for my purposes, though, I’ve found that the simpler approach I’ve taken in Takua adds negligible overhead and code complexity, so I’ll probably skip something like the SPI Arnold approach for now.</p>
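<p>For readers unfamiliar with dual arithmetic, the core idea is small: a dual number carries a value together with its derivative, and overloaded arithmetic operators propagate both at once using the usual calculus rules. Here is a minimal sketch of the idea (this is just the textbook construction, not SPI Arnold’s actual implementation):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: a minimal dual number type for forward-mode automatic
// differentiation. Each value carries its derivative; operators apply the
// usual differentiation rules.
struct Dual {
    float value;
    float derivative;
};

Dual operator+(const Dual& a, const Dual& b) {
    return { a.value + b.value, a.derivative + b.derivative };
}

Dual operator*(const Dual& a, const Dual& b) {
    // Product rule: (ab)' = a'b + ab'
    return { a.value * b.value, a.derivative * b.value + a.value * b.derivative };
}

// Example: evaluate f(x) = x * x + x and its derivative at x = 3 by seeding
// the derivative of x with 1:
//     Dual x = { 3.0f, 1.0f };
//     Dual f = x * x + x;    // f.value == 12, f.derivative == 7
</code></pre></div></div>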
<p>Once we have the surface differential, we can then approximate the local surface geometry at the intersection point with a tangent plane, and intersect the offset rays with the tangent plane.
To find the corresponding uv coordinates for the offset ray tangent plane intersection points, dpdu/dpdv, the main ray intersection point, and the offset ray intersection points can be used to establish a linear system.
Solving this linear system leads us directly to dudx/dudy and dvdx/dvdy; for the exact mathematical details and explanation, see <a href="http://www.pbr-book.org/3ed-2018/Texture/Sampling_and_Antialiasing.html">section 10.1 in Physically Based Rendering 3rd Edition</a>.
The actual code might look something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// This code is heavily aped from PBRT v3; consult the PBRT book for details!
vec4 calculateScreenSpaceDifferential(const vec3& p, // Surface intersection point
const vec3& n, // Surface normal
const vec3& origin, // Main ray origin
const vec3& rDirection, // Main ray direction
const vec3& xorigin, // Offset x ray origin
const vec3& rxDirection, // Offset x ray direction
const vec3& yorigin, // Offset y ray origin
const vec3& ryDirection, // Offset y ray direction
const vec3& dpdu, // Surface differential w.r.t. u
const vec3& dpdv // Surface differential w.r.t. v
) {
// Compute offset-ray intersection points with tangent plane
float d = dot(n, p);
float tx = -(dot(n, xorigin) - d) / dot(n, rxDirection);
vec3 px = origin + tx * rxDirection;
float ty = -(dot(n, yorigin) - d) / dot(n, ryDirection);
vec3 py = origin + ty * ryDirection;
vec3 dpdx = px - p;
vec3 dpdy = py - p;
// Compute uv offsets at offset-ray intersection points
// Choose two dimensions to use for ray offset computations
ivec2 dim;
if (std::abs(n.x) > std::abs(n.y) && std::abs(n.x) > std::abs(n.z)) {
dim = ivec2(1,2);
} else if (std::abs(n.y) > std::abs(n.z)) {
dim = ivec2(0,2);
} else {
dim = ivec2(0,1);
}
// Initialize A, Bx, and By matrices for offset computation
mat2 A;
A[0][0] = ds.dpdu[dim[0]];
A[0][1] = ds.dpdv[dim[0]];
A[1][0] = ds.dpdu[dim[1]];
A[1][1] = ds.dpdv[dim[1]];
vec2 Bx(px[dim[0]] - p[dim[0]], px[dim[1]] - p[dim[1]]);
vec2 By(py[dim[0]] - p[dim[0]], py[dim[1]] - p[dim[1]]);
float dudx, dvdx, dudy, dvdy;
// Solve two linear systems to get uv offsets
auto solveLinearSystem2x2 = [](const mat2& A, const vec2& B, float& x0, float& x1) -> bool {
float det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
if (abs(det) < (float)constants::EPSILON) {
return false;
}
x0 = (A[1][1] * B[0] - A[0][1] * B[1]) / det;
x1 = (A[0][0] * B[1] - A[1][0] * B[0]) / det;
if (std::isnan(x0) || std::isnan(x1)) {
return false;
}
return true;
};
if (!solveLinearSystem2x2(A, Bx, dudx, dvdx)) {
dudx = dvdx = 0.0f;
}
if (!solveLinearSystem2x2(A, By, dudy, dvdy)) {
dudy = dvdy = 0.0f;
}
return vec4(dudx, dvdx, dudy, dvdy);
}
</code></pre></div></div>
<p>Now that we have dudx/dudy and dvdx/dvdy, getting the proper mipmap level works just like in the rasterization case.
The above approach is exactly what I have implemented in Takua Renderer for camera rays.
Similar to surface differentials, Takua caches dudx/dudy and dvdx/dvdy computations per shader invocation per hit point, so that multiple textures utilizing the same uv set don’t require multiple redundant calls to the above function.</p>
<p>The ray differential approach to mipmap level selection is basically the standard approach in modern production rendering today for camera rays.
PBRT <a href="http://pbrt.org">[Pharr et al. 2016]</a>, Mitsuba <a href="http://www.mitsuba-renderer.org/">[Jakob 2010]</a>, and Solid Angle’s version of Arnold <a href="https://dl.acm.org/citation.cfm?id=3182160">[Georgiev et al. 2018]</a> all use ray differential systems based on this approach for camera rays.
Renderman <a href="https://dl.acm.org/citation.cfm?id=3182162">[Christensen et al. 2018]</a> uses a simplified version of ray differentials that only tracks two floats per ray, instead of the four vectors needed to represent a full ray differential.
Renderman tracks a width at each ray’s origin, and a spread value representing the linear rate of change of the ray width over a unit distance.
This approach does not encode as much information as the full ray differential approach, but nonetheless ends up being sufficient since in a path tracer, every pixel essentially ends up being supersampled.
Hyperion <a href="https://dl.acm.org/citation.cfm?id=3182159">[Burley et al. 2018]</a> uses a similarly simplified scheme.</p>
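<p>To make the simplified representation concrete, here is a sketch of the general width/spread idea; this is just an illustration of the concept, not RenderMan’s or Hyperion’s actual code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: a simplified ray differential that tracks only a width at the ray
// origin and a spread (rate of change of the width per unit distance),
// instead of full offset-ray origins and directions.
struct RayWidth {
    float width;   // footprint width at the ray origin
    float spread;  // linear growth of the width per unit distance traveled
};

// Footprint at a hit point t units along the ray
float footprintAtHit(const RayWidth& rw, const float t) {
    return rw.width + rw.spread * t;
}

// After a scattering event, the footprint at the hit point becomes the new
// width, and the spread is updated based on the scattering event (for
// example, widened for rough or diffuse events).
RayWidth propagateThroughScatter(const RayWidth& rw, const float t, const float newSpread) {
    return { footprintAtHit(rw, t), newSpread };
}
</code></pre></div></div>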
<p>A brief side note: being able to calculate the differential for surface normals with respect to screen space is useful for bump mapping, among other things, and the calculation is directly analogous to the pseudocode above for calculateDifferentialSurfaceForTriangle() and calculateScreenSpaceDifferential(), just with surface normals substituted in for surface positions.</p>
<p><strong>Ray Differentials and Path Tracing</strong></p>
<p>We now know how to calculate filter footprints using ray differentials for camera rays, which is great, but what about secondary rays?
Without ray differentials for secondary rays, path tracing texture access behavior degrades severely, since secondary rays have to fall back to point sampling textures at the lowest mip level.
A number of different schemes exist for calculating filter footprints and mipmap levels for secondary rays; here are a few that have been presented in literature and/or are known to be in use in modern production renderers:</p>
<p><a href="https://graphics.stanford.edu/papers/trd/">Igehy [1999]</a> demonstrates how to propagate ray differentials through perfectly specular reflection and refraction events, which boil down to some simple extensions to the basic math for optical reflection and refraction.
However, we still need a means of handling glossy events (that is, events with non-zero surface roughness), which requires an extended version of ray differentials.
<em>Path differentials</em> <a href="http://graphics.cs.kuleuven.be/publications/PATHDIFF/">[Suykens and Willems 2001]</a> consider more than just partial derivatives for each screen space pixel footprint; with path differentials, partial derivatives can also be taken at each scattering event along a number of dimensions.
As an example, for handling an arbitrarily shaped BSDF lobe, new partial derivatives can be calculated along some parameter of the lobe that describes the shape of the lobe, which takes the form of a bunch of additional scattering directions around the main ray’s scattering direction.
If we imagine looking down the main scattering direction and constructing a convex hull around the additional scattering directions, the result is a polygonal footprint describing the ray differential over the scattering event.
This footprint can then be approximated by finding the major and minor axis of the polygonal footprint.
While the method is general enough to handle arbitrary factors impacting ray directions, unfortunately it can be fairly complex and expensive to compute in practice, and differentials for some types of events can be very difficult to derive.
For this reason, none of the major production renderers today actually use this approach.
However, there is a useful observation that can be drawn from path differentials: generally, in most cases, primary rays have narrow widths and secondary rays have wider widths <a href="https://diglib.eg.org/handle/10.2312/8776">[Christensen et al. 2003]</a>; this observation is the basis of the ad-hoc techniques that most production renderers utilize.</p>
<p>Recently, research has appeared that provides an entirely different, more principled approach to selecting filter footprints, based on <em>covariance tracing</em> <a href="https://dl.acm.org/citation.cfm?id=2487239">[Belcour et al. 2013]</a>.
The high-level idea behind covariance tracing is that local light interaction effects such as transport, occlusion, roughness, etc. can all be encoded as 5D covariance matrices, which in turn can be used to determine ideal sampling rates.
Covariance tracing builds an actual, implementable rendering algorithm on top of earlier, mostly theoretical work on light transport frequency analysis <a href="https://dl.acm.org/citation.cfm?id=1073320">[Durand et al. 2005]</a>.
<a href="https://dl.acm.org/citation.cfm?id=2487239">Belcour et al. [2017]</a> presents an extension to covariance tracing for calculating filter footprints for arbitrary shading effects, including texture map filtering.
The covariance-tracing based approach differs from path differentials in two key areas.
While both approaches operate in path space, path differentials are much more expensive to compute than the covariance-tracing based technique; path differential complexity scales quadratically with path length, while covariance tracing only ever carries a single covariance matrix along a path for a given effect.
Also, path differentials can only be generated starting from the camera, whereas covariance tracing works from the camera and the light; in the next section, we’ll talk about why this difference is critically important.</p>
<p>Covariance tracing based techniques have a lot of promise, and are the best known approach to date for selecting filter footprints along a path.
The original covariance tracing paper had some difficulty with handling high geometric complexity; covariance tracing requires a voxelized version of the scene for storing local occlusion covariance information, and covariance estimates can degrade severely if the occlusion covariance grid is not high resolution enough to capture small geometric details.
For huge production scale scenes, geometric complexity requirements can make covariance tracing either slow due to huge occlusion grids, or degraded in quality due to insufficiently large occlusion grids.
However, the voxelization step is not as much of a barrier to practicality as it may initially seem.
For covariance tracing based filtering, visibility can be neglected, so the entire scene voxelization step can be skipped; <a href="https://dl.acm.org/citation.cfm?id=2990495">Belcour et al. [2017]</a> demonstrates how.
Since covariance tracing based filtering can be used with the same assumptions and data as ray differentials but is both superior in quality and more generalizable than ray differentials, I would not be surprised to see more renderers adopt this technique over time.</p>
<p>As of present, however, instead of using any of the above techniques, pretty much all production renderers today use various ad-hoc methods for tracking ray widths for secondary rays.
SPI Arnold tracks accumulated roughness values encountered by a ray: if a ray either encounters a diffuse event or reaches a sufficiently high accumulated roughness value, SPI Arnold automatically goes to basically the highest available mip level <a href="https://dl.acm.org/citation.cfm?id=3180495">[Kulla et al. 2018]</a>.
This scheme produces very aggressive texture filtering, but in turn provides excellent texture access patterns.
Solid Angle Arnold similarly uses an ad-hoc microfacet-inspired heuristic for secondary rays <a href="https://dl.acm.org/citation.cfm?id=3182160">[Georgiev et al. 2018]</a>.
Renderman handles reflection and refraction using something similar to <a href="https://graphics.stanford.edu/papers/trd/">Igehy [1999]</a>, but modified for the single-float-width ray differential representation that Renderman uses <a href="https://dl.acm.org/citation.cfm?id=3182162">[Christensen et al. 2018]</a>.
For glossy and diffuse events, Renderman uses an empirically determined heuristic where higher ray width spreads are driven by lower scattering direction pdfs.
Weta’s Manuka has a unified roughness estimation system built into the shading system, which uses a mean cosine estimate for figuring out ray differentials <a href="https://dl.acm.org/citation.cfm?id=3182161">[Fascione et al. 2018]</a>.</p>
<p>Generally, roughness driven heuristics seem to work reasonably well in production, and the actual heuristics don’t actually have to be too complicated!
In an experimental branch of PBRT, Matt Pharr found that a simple heuristic that uses a ray differential covering roughly 1/25th of the hemisphere for diffuse events and 1/100th of the hemisphere for glossy events generally worked reasonably well <a href="https://www.pbrt.org/texcache.pdf">[Pharr 2017]</a>.</p>
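<p>A sketch of what such a heuristic might look like is below; the hemisphere fractions follow the rough idea from Pharr’s experiment, but the exact constants and the conversion from a solid angle fraction to a spread angle here are purely illustrative:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: an ad-hoc heuristic for choosing a secondary ray's differential
// spread based on the type of scattering event. Constants are illustrative.
float secondaryRaySpread(const bool isDiffuse, const bool isGlossy, const float incomingSpread) {
    const float pi = 3.14159265358979f;
    const float hemisphereSolidAngle = 2.0f * pi;
    float solidAngle;
    if (isDiffuse) {
        solidAngle = hemisphereSolidAngle / 25.0f;   // ~1/25th of the hemisphere
    } else if (isGlossy) {
        solidAngle = hemisphereSolidAngle / 100.0f;  // ~1/100th of the hemisphere
    } else {
        return incomingSpread;                       // specular: keep the incoming spread
    }
    // The solid angle of a cone with half-angle theta is 2*pi*(1 - cos(theta)),
    // so convert the chosen solid angle back into an approximate cone half-angle.
    float cosTheta = 1.0f - solidAngle / (2.0f * pi);
    return std::acos(cosTheta);
}
</code></pre></div></div>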
<p><strong>Ray Differentials and Bidirectional Techniques</strong></p>
<p>So far everything we’ve discussed has been for unidirectional path tracing that starts from the camera.
What about ray differentials and mip level selection for paths starting from a light, and by extension, for bidirectional path tracing techniques?
Unfortunately, nobody has a good, robust solution for calculating ray differentials for light paths!
Calculating ray differentials for light paths is fundamentally something of an ill defined problem: a ray differential has to be calculated with respect to a screen space pixel footprint, which works fine for camera paths since the first ray starts from the camera, but for light paths, the <em>last</em> ray in the path is the one that reaches the camera.
With light paths, we have something of a chicken-and-egg problem; there is no way to calculate anything with respect to a screen space pixel footprint until a light path has already been fully constructed, but the shading computations required to construct the path are the computations that want differential information in the first place.
Furthermore, even if we did have a good way to calculate a starting ray differential from a light, the corresponding path differential can’t become as wide as in the case of a camera path, since at any given moment the light path might scatter towards the camera and therefore needs to maintain a footprint no wider than a single screen space pixel.</p>
<p>Some research work has gone into this question, but more work is needed on this topic.
The previously discussed covariance tracing based technique <a href="https://dl.acm.org/citation.cfm?id=2990495">[Belcour et al. 2017]</a> does allow for calculating an ideal texture filtering width and mip level once a light path is fully constructed, but again, the real problem is that footprints need to be available during path construction, not afterwards.
With bidirectional path tracing, things get even harder.
In order to keep a bidirectional path unbiased, all connections between camera and light path vertices must be consistent in what mip level they sample; however, this is difficult since ray differentials depend on the scattering events at each path vertex.
<a href="https://dl.acm.org/citation.cfm?id=2487239">Belcour et al. [2017]</a> demonstrates how important
consistent texture filtering between two vertices is.</p>
<p>Currently, only a handful of production renderers have extensive support for bidirectional techniques; of the ones that do, the most common solution to calculating ray differentials for bidirectional paths is… simply not to at all.
Unfortunately, this means bidirectional techniques must rely on point sampling the lowest mip level, which defeats the whole point of mipmapping and destroys texture caching performance.
The Manuka team alludes to using ray differentials for photon map gather widths in VCM and notes that these ray differentials are implemented as part of their manifold next event estimation system <a href="https://dl.acm.org/citation.cfm?id=3182161">[Fascione et al. 2018]</a>, but there isn’t enough detail in their paper to be able to figure out how this actually works.</p>
<p><strong>Camera-Based Mipmap Level Selection</strong></p>
<p>Takua has implementations of standard bidirectional path tracing, progressive photon mapping, and VCM, and I wanted mipmapping to work with all integrator types in Takua.
I’m interested in using Takua to render scenes with very high complexity levels using advanced (often bidirectional) light transport algorithms, but reaching production levels of shading complexity without a mipmapped texture cache simply is not possible without crazy amounts of memory (where crazy is defined as in the range of dozens to hundreds of GB of textures or more).
However, for the reasons described above, standard ray differential based techniques for calculating mip levels weren’t going to work with Takua’s bidirectional integrators.</p>
<p>The lack of a ray differential solution for light paths left me stuck for some time, until late in 2017, when I got to read an early draft of what eventually became the Manuka paper <a href="https://dl.acm.org/citation.cfm?id=3182161">[Fascione et al. 2018]</a> in the ACM Transactions on Graphics special issue on production rendering.
I highly recommend reading all five of the production renderer system papers in the ACM TOG special issue.
However, if you’re already generally familiar with how a modern PBRT-style renderer works and only have time to read one paper, I would recommend the Manuka paper simply because Manuka’s architecture and the set of trade-offs and choices made by the Manuka team are so different from what every other modern PBRT-style production path tracer does.
What I eventually implemented in Takua is directly inspired by Manuka, although it’s not what Manuka actually does (I think).</p>
<p>The key architectural feature that differentiates Manuka from Arnold/Renderman/Vray/Corona/Hyperion/etc. is its <em>shade-before-hit</em> architecture.
I should note here that in this context, shade refers to the pattern generation part of shading, as opposed to the bsdf evaluation/sampling part of shading.
The brief explanation (you really should go read the full paper) is that Manuka completely decouples pattern generation from path construction and path sampling, as opposed to what all other modern path tracers do.
Most modern path tracers use <em>shade-on-hit</em>, which means pattern generation is lazily evaluated to produce bsdf parameters upon demand, such as when a ray hits a surface.
In a shade on hit architecture, pattern generation and path sampling are interleaved, since path sampling depends on the results of pattern generation.
Separating out geometry processing from path construction is fairly standard in most modern production path tracers, meaning subdivision/tessellation/displacement happens before any rays are traced, and displacement usually involves some amount of pattern generation.
However, no other production path tracer separates out <em>all</em> of pattern generation from path sampling the way Manuka does.
At render startup, Manuka runs geometry processing, which dices all input geometry into micropolygon grids, and then runs pattern generation on all of the micropolygons.
The result of pattern generation is a set of bsdf parameters that are baked into the micropolygon vertices.
Manuka then builds a BVH and proceeds with normal path tracing, but at each path vertex, instead of having to evaluate shading graphs and do texture lookups to calculate bsdf parameters, the bsdf parameters are looked up directly from the pre-calculated cached values baked into the micropolygon vertices.
Put another way, Manuka is a path tracer with a REYES-style shader execution model <a href="https://dl.acm.org/citation.cfm?id=37414">[Cook et al. 1987]</a> instead of a PBRT-style shader execution model; Manuka preserves the grid-based shading coherence from REYES while also giving more flexibility to path sampling and light transport, which no longer have to worry about pattern generation making shading slow.</p>
<p>So how does any of this relate to the bidirectional path tracing mip level selection problem?
The answer is: in a shade-before-hit architecture, by the time the renderer is tracing light paths, there is no need for mip level selection because <em>there are no texture lookups required anymore during path sampling</em>.
During path sampling, Manuka evaluates bsdfs at each hit point using pre-shaded parameters that are bilinearly interpolated from the nearest micropolygon vertices; all of the texture lookups were already done in the pre-shade phase of the renderer!
In other words, at least in principle, a Manuka-style renderer can entirely sidestep the bidirectional path tracing mip level selection problem (although I don’t know if Manuka actually does this or not).
Also, in a shade-before-hit architecture, there are no concerns with biasing bidirectional path tracing from different camera/light path vertex connections seeing different mip levels.
Since all mip level selection and texture filtering decisions take place before path sampling, the view of the world presented to path sampling is always consistent.</p>
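<p>To make the contrast with shade-on-hit concrete, here is a rough sketch of what a bsdf parameter lookup at path sampling time could look like in a shade-before-hit world; this is purely illustrative and is not how Manuka is actually implemented (BsdfParameters, BakedVertex, and lerp() here are hypothetical):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: in a shade-before-hit architecture, bsdf parameters are baked into
// micropolygon vertices during a pre-shade phase; at path sampling time, a hit
// point just interpolates the baked parameters instead of evaluating shading
// graphs or doing texture lookups. All names here are illustrative stand-ins.
struct BakedVertex {
    BsdfParameters params;  // computed once during the pre-shade phase
};

BsdfParameters lookupBsdfParameters(const BakedVertex& v00, const BakedVertex& v10,
                                    const BakedVertex& v01, const BakedVertex& v11,
                                    const float u, const float v) {
    // Bilinear interpolation across the four nearest micropolygon vertices;
    // no texture access happens during path sampling at all.
    BsdfParameters a = lerp(v00.params, v10.params, u);
    BsdfParameters b = lerp(v01.params, v11.params, u);
    return lerp(a, b, v);
}
</code></pre></div></div>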
<p>Takua is not a shade-before-hit renderer though, and for a variety of reasons, I don’t plan on making it one.
Shade-before-hit presents a number of tradeoffs which are worthwhile in Manuka’s case because of the problems and requirements the Manuka team aimed to solve and meet, but Takua is a hobby renderer aimed at something very different from Manuka.
The largest drawback of shade-before-hit is the startup time associated with having to pre-shade the entire scene; this startup time can be quite large, but in exchange, the total render time can be faster as path sampling becomes more efficient.
However, in a number of workflows, the time to a full render is not nearly as important as the time to a minimum sample count at which point an artistic decision can be made on a noisy image; beyond this point, full render time is less important as long as it is within a reasonable ballpark.
Takua currently has a fast startup time and reaches a first set of samples quickly, and I wanted to keep this behavior.
As a result, the question then became: in a shade-on-hit architecture, is there a way to emulate shade-before-hit’s consistent view of the world, where texture filtering decisions are separated from path sampling?</p>
<p>The approach I arrived at is to drive mip level selection based on only a world-space distance-to-camera metric, with no dependency at all on the incoming ray at a given hit point.
This approach is… not even remotely novel; in a way, this approach is probably the most obvious solution of all, but it took me a long time and a circuitous path to arrive at for some reason.
Here’s the high-level overview of how I implemented a camera-based mip level selection technique:</p>
<ol>
<li>At render startup time, calculate a ray differential for each pixel in the camera’s image plane. The goal is to find the narrowest differential in each screen space dimension x and y. Store this piece of information for later.</li>
<li>At each ray-surface intersection point, calculate the differential surface.</li>
<li>Create a ‘fake’ ray going from the camera’s origin position to the current intersection point, with a ray differential equal to the minimum differential in each direction found in step 1.</li>
<li>Calculate dudx/dudy and dvdx/dvdy using the usual method presented above, but using the fake ray from step 3 instead of the actual ray.</li>
<li>Calculate the mip level as usual from dudx/dudy and dvdx/dvdy.</li>
</ol>
<p>The rationale for using the narrowest differentials in step 1 is to guarantee that texture frequency remains sub-pixel for all pixels in screen space, even if that means we might sometimes sample at a higher resolution mip level than strictly necessary for whatever screen space pixel we’re accumulating radiance to.
In this case, being overly conservative with our mip level selection is preferable to visible texture blurring from picking a mip level that is too low resolution.</p>
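<p>Here is a condensed sketch of steps 3 through 5; the function names roughly mirror the snippets earlier in this post, but the interfaces are simplified compared to what Takua actually does (the camera basis vectors and the precomputed minimum differentials are passed in explicitly here for illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: camera-based mip level selection. Instead of using the actual
// incoming ray at the hit point, construct a fake ray from the camera origin
// to the hit point, attach the narrowest per-pixel differentials found at
// startup, and then run the usual screen space differential machinery.
float cameraBasedMipLevel(const vec3& cameraOrigin,
                          const vec3& cameraRight,      // camera basis vectors
                          const vec3& cameraUp,
                          const float minDiffX,         // narrowest x differential from step 1
                          const float minDiffY,         // narrowest y differential from step 1
                          const vec3& hitPoint,
                          const vec3& hitNormal,
                          const DifferentialSurface& ds,
                          const int maxMipLevel) {
    // Step 3: 'fake' ray from the camera origin to the current hit point
    vec3 direction = normalize(hitPoint - cameraOrigin);
    vec3 rxDirection = normalize(direction + minDiffX * cameraRight);
    vec3 ryDirection = normalize(direction + minDiffY * cameraUp);
    // Step 4: the usual screen space differential calculation, using the fake ray
    vec4 d = calculateScreenSpaceDifferential(hitPoint, hitNormal,
                                              cameraOrigin, direction,
                                              cameraOrigin, rxDirection,
                                              cameraOrigin, ryDirection,
                                              ds.dpdu, ds.dpdv);
    // Step 5: the usual mip level calculation from dudx/dvdx and dudy/dvdy
    return mipLevelFromDifferentialSurface(d[0], d[1], d[2], d[3], maxMipLevel);
}
</code></pre></div></div>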
<p>Takua uses the above approach for all path types, including light paths in the various bidirectional integrators.
Since the mip level selection is based entirely on distance-to-camera, as far as the light transport integrators are concerned, their view of the world is entirely consistent.
As a result, Takua is able to sidestep the light path ray differential problem in much the same way that a shade-before-hit architecture is able to.
There are some particular implementation details that are slightly complicated by Takua having support for multiple uv sets per mesh, but I’ll write about multiple uv sets in a later post.
Also, there is one notable failure scenario, which I’ll discuss more in the results section.</p>
<p><strong>Results</strong></p>
<p>So how well does camera-based mipmap level selection work compared to a more standard approach based on path differentials or ray widths from the incident ray?
Typically in a production renderer, mipmaps work in conjunction with tiled textures, where tiles are a fixed size (unless a tile is in a mipmap level with a total resolution smaller than the tile resolution).
Therefore, the useful metric to compare is how many texture tiles each approach access throughout the course of a render; the more an approach accesses higher mipmap levels (meaning lower resolution mipmap levels), the fewer tiles in total should be accessed since lower resolution mipmap levels have fewer tiles.</p>
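<p>For what it’s worth, the statistic itself is cheap to gather; a sketch of the bookkeeping (the tuple encoding of a tile’s identity here is just illustrative) looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: track every unique (texture, mip level, tile) touched during the
// render; afterwards, divide the set size by the total number of tiles across
// all mip levels of all textures. The tuple encoding is purely illustrative.
std::set<std::tuple<int, int, int>> accessedTiles;

void recordTileAccess(const int textureIndex, const int mipLevel, const int tileIndex) {
    accessedTiles.insert(std::make_tuple(textureIndex, mipLevel, tileIndex));
}

float accessedTileFraction(const int totalTileCount) {
    return float(accessedTiles.size()) / float(totalTileCount);
}
</code></pre></div></div>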
<p>For unidirectional path tracing from the camera, we can reasonably expect the camera-based approach to perform less well than a path differential or ray width technique (which I’ll call simply ‘ray-based’).
In the camera-based approach, every texture lookup has to use a footprint corresponding to approximately a single screen space pixel footprint, whereas in a more standard ray-based approach, footprints get wider with each successive bounce, leading to access to higher mipmap levels.
Depending on how aggressively ray widths are widened at diffuse and glossy events, ray-based approaches can quickly reach the highest mipmap levels and essentially spend the majority of the render only accessing high mipmap levels.</p>
<p>For bidirectional integrators though, the camera-based technique has the major advantage of being able to provide reasonable mipmap levels for both camera and light paths, whereas the more standard ray-based approaches have to fall back to point sampling the lowest mipmap level for light paths.
As a result, for bidirectional paths we can expect that a ray-based approach should perform somewhere in between how a ray-based approach performs in the unidirectional case and how point sampling only the lowest mipmap level performs in the unidirectional case.</p>
<p>As a baseline, I also implemented a ray-based approach with a relatively aggressive widening heuristic for glossy and diffuse events.
For the forest scene from Figure 1, I got the following results at 1920x1080 resolution with 16 samples per pixel.
I compared unidirectional path tracing from the camera and standard bidirectional path tracing; statistics are presented as total number of texture tiles accessed divided by total number of texture tiles across all mipmap levels.
The lower the percentage, the better:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>16 SPP 1920x1080 Unidirectional (PT)
No mipmapping: 314439/745394 tiles (42.18%)
Ray-based level selection: 103206/745394 tiles (13.84%)
Camera-based level selection: 104764/745394 tiles (14.05%)
16 SPP 1920x1080 Bidirectional (BDPT)
No mipmapping: 315452/745394 tiles (42.32%)
Ray-based level selection: 203491/745394 tiles (27.30%)
Camera-based level selection: 104858/745394 tiles (14.07%)
</code></pre></div></div>
<p>As expected, in the unidirectional case, the camera-based approach accesses slightly more tiles than the ray-based approach, and both approaches significantly outperform point sampling the lowest mipmap level.
In the bidirectional case, the camera-based approach accesses slightly more tiles than in the unidirectional case, while the ray-based approach performs somewhere between the ray-based approach in unidirectional and point sampling the lowest mipmap level in unidirectional.
What surprised me is how closely the camera-based approach performed to the ray-based approach in the unidirectional case, especially since I chose a fairly aggressive widening heuristic (essentially a more aggressive version of the same heuristic that Matt Pharr uses in the texture cache branch of PBRTv3).</p>
<p>To help with visualizing what mipmap levels are being accessed, I implemented a new AOV in Takua that assigns colors to surfaces based on what mipmap level is accessed.
With camera-based mipmap level selection, this AOV shows simply what mipmap level is accessed by all rays that hit a given point on a surface.
Each mipmap level is represented by a different color, with support up to 12 mipmap levels.
The following two images show the accessed mipmap levels at 1080p and 2160p (4K); note how the 2160p render accesses lower mipmap levels more frequently than the 1080p render.
The pixel footprints in the higher resolution render are smaller when projected into world space, since more pixels have to pack into the same field of view.
The key below each image shows what mipmap level each color corresponds to:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_texcache.1080.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_texcache.1080.jpg" alt="Figure 3: Mipmap levels accessed for the forest scene from Figure 1, rendered at 1920x1080 resolution." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_texcache.4k.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_texcache.4k.jpg" alt="Figure 4: Mipmap levels accessed for the forest scene from Figure 2, rendered at 3840x2160 resolution. Note how since the render is higher resolution and therefore pixel footprints are smaller for the same field of view, lower mipmap levels are accessed more frequently compared to Figure 3." /></a></p>
<p>In general, everything looks as we would expect it to look in a working mipmapping system!
Surface points farther away from the camera are generally accessing higher mipmap levels, and surface points closer to the camera are generally accessing lower mipmap levels.
The ferns in the front of the frame access higher mipmap levels than the big fallen log in the center of the frame, even though the ferns are closer to the camera, because the textures for each leaf are extremely high resolution while the fern leaves are very small in screen space.
Surfaces that are viewed at highly glancing angles from the camera tend to access higher mipmap levels than surfaces that are camera-facing; this effect is easiest to see on the rocks in bottom front of the frame.
The interesting sudden shift in mipmap level on some of the tree trunks comes from the tree trunks using two different uv sets; the lower part of each tree trunk uses a different texture than the upper part, and the two main textures are blended using a mask in a different uv space from the main textures. Since the differential surface depends in part on the uv parameterization, different uv sets can result in different mipmap level selection behavior.</p>
<p>I also added a debug mode to Takua that tracks mipmap level access per texture sample.
In this mode, for a given texture, the renderer splats into an image the lowest accessed mipmap level for each texture sample.
The result is sort of a heatmap that can be overlaid on the original texture’s lowest mipmap level to see what parts of the texture are sampled at what resolution.
Figure 5 shows one of these heatmaps for the texture on the fallen log in the center of the frame:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/texture_rawaccess.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/texture_rawaccess.png" alt="Figure 5: Mipmap level access patterns for the texture in Figure 2. Colors correspond to mipmap levels using the same key as in Figures 3 and 4. Dark grey indicates areas of the texture that were not sampled at all." /></a></p>
<p>Just like in Figures 3 and 4, we can see that renders at higher resolutions will tend to access lower mipmap levels more frequently.
Also, we can see that the vast majority of the texture is never sampled at all; with a tiled texture caching system where tiles are loaded on demand, this means there are a large number of texture tiles that we never bother to load at all.
In cases like Figure 5, not loading unused tiles provides enormous memory savings compared to if we just loaded an entire non-mipmapped texture.</p>
<p>So far using a camera-based approach to mipmap level selection combined with just point sampling at each texture sample has held up very well in Takua!
In fact, the <a href="https://blog.yiningkarlli.com/2018/02/scandinavian-room-scene.html">Scandinavian Room</a> scene from earlier this year was rendered using the mipmap approach described in this post as well.
There is, however, a relatively simple type of scene that Takua’s camera-based approach fails badly at handling: refraction near the camera.
If a lens is placed directly in front of the camera that significantly magnifies part of the scene, a purely world-space metric for filter footprints can result in choosing mipmap levels that are too high, which translates to visible texture blurring or pixelation.
I don’t have anything implemented to handle this failure case right now.
One possible solution I’ve thought about is to initially trace a set of rays from the camera using traditional ray differential propagation for specular objects, and cache the resultant mipmap levels in the scene.
Then, during the actual renders, the renderer could compare the camera-based metric against the nearest N cached metrics to infer whether a lower mipmap level is needed than what the camera-based metric produces.
However, such a system would add significant cost to the mipmap level selection logic, and there are a number of implementation complications to consider.
I do wonder how Manuka handles the “lens in front of a camera” case as well, since the shade-before-hit paradigm also fails on this scenario for the same reasons.</p>
<p>Long term, I would like to spend more time looking into (and perhaps implementing) a covariance tracing based approach.
While Takua currently gets by with just point sampling, filtering becomes much more important for other effects, such as glinty microfacet materials, and covariance tracing based filtering seems to be the best currently known solution for these cases.</p>
<p>In an upcoming post, I’m aiming to write about how Takua’s texture caching system works in conjunction with the mipmapping system described in this post.
As mentioned earlier, I’m also planning a (hopefully) short-ish post about supporting multiple uv sets, and how that impacts a mipmapping and texture caching system.</p>
<p><strong>Additional Renders</strong></p>
<p>Finally, since this has been a very text-heavy post, here are some bonus renders of the same forest scene under different lighting conditions.
When I was setting up this scene for Takua, I tried a number of different lighting conditions and settled on the one in Figure 1 for the main render, but some of the alternatives were interesting too.
In a future post, I’ll show a bunch of interesting renders of this scene from different camera angles, but for now, here is the forest at different times of day:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_overcast.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_overcast.0.jpg" alt="Figure 6: The forest early on an overcast morning." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_morning.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_morning.0.jpg" alt="Figure 7: The forest later on a sunnier morning." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_noon.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_noon.0.jpg" alt="Figure 8: The forest at noon on a sunny blue sky day." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_sunset.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_sunset.0.jpg" alt="Figure 9: The forest at sunset." /></a></p>
<p><strong>References</strong></p>
<p>John Amanatides. 1984. <a href="https://dl.acm.org/citation.cfm?id=808589">Ray Tracing with Cones</a>. <em>Computer Graphics (Proceedings of SIGGRAPH)</em> 18, 3 (1984), 129-135.</p>
<p>Laurent Belcour, Cyril Soler, Kartic Subr, Nicolas Holzschuch, and Frédo Durand. 2013. <a href="https://dl.acm.org/citation.cfm?id=2487239">5D Covariance Tracing for Efficient Defocus and Motion Blur</a>. <em>ACM Transactions on Graphics</em>. 32, 3 (2013), 31:1–31:18.</p>
<p>Laurent Belcour, Ling-Qi Yan, Ravi Ramamoorthi, and Derek Nowrouzezahrai. 2017. <a href="https://dl.acm.org/citation.cfm?id=2990495">Antialiasing Complex Global Illumination Effects in Path-Space</a>. <em>ACM Transactions on Graphics</em>. 36, 1 (2017), 9:1–9:13.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182159">The Design and Evolution of Disney’s Hyperion Renderer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 33:1-33:22.</p>
<p>Per Christensen, Julian Fong, Jonathan Shade, Wayne Wooten, Brenden Schubert, Andrew Kensler, Stephen Friedman, Charlie Kilpatrick, Cliff Ramshaw, Marc Bannister, Brenton Rayner, Jonathan Brouillat, and Max Liani. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182162">RenderMan: An Advanced Path-Tracing Architecture for Movie Rendering</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 30:1–30:21.</p>
<p>Per Christensen, David M. Laur, Julian Fong, Wayne Wooten, and Dana Batali. 2003. <a href="https://diglib.eg.org/handle/10.2312/8776">Ray Differentials and Multiresolution Geometry Caching for Distribution Ray Tracing in Complex Scenes</a>. <em>Computer Graphics Forum</em>. 22, 3 (2003), 543-552.</p>
<p>Robert L. Cook, Loren Carpenter, and Edwin Catmull. 1987. <a href="https://dl.acm.org/citation.cfm?id=37414">The Reyes Image Rendering Architecture</a>. <em>Computer Graphics (Proceedings of SIGGRAPH)</em> 21, 4 (1987), 95-102.</p>
<p>Frédo Durand, Nicolas Holzschuch, Cyril Soler, Eric Chan, and François X Sillion. 2005. <a href="https://dl.acm.org/citation.cfm?id=1073320">A Frequency Analysis of Light Transport</a>. <em>ACM Transactions on Graphics</em>. 24, 3 (2005), 1115-1126.</p>
<p>Luca Fascione, Johannes Hanika, Mark Leone, Marc Droske, Jorge Schwarzhaupt, Tomáš Davidovič, Andrea Weidlich and Johannes Meng. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182161">Manuka: A Batch-Shading Architecture for Spectral Path Tracing in Movie Production</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 31:1–31:18.</p>
<p>Iliyan Georgiev, Thiago Ize, Mike Farnsworth, Ramón Montoya-Vozmediano, Alan King, Brecht van Lommel, Angel Jimenez, Oscar Anson, Shinji Ogaki, Eric Johnston, Adrien Herubel, Declan Russell, Frédéric Servant, and Marcos Fajardo. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182160">Arnold: A Brute-Force Production Path Tracer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 32:1-32:12.</p>
<p>Larry Gritz and James K. Hahn. 1996. <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.1996.10487462">BMRT: A Global Illumination Implementation of the RenderMan Standard</a>. <em>Journal of Graphics Tools</em>. 1, 3 (1996), 29-47.</p>
<p>Paul S. Heckbert and Pat Hanrahan. 1984. <a href="https://dl.acm.org/citation.cfm?id=808588">Beam Tracing Polygonal Objects</a>. <em>Computer Graphics (Proceedings of SIGGRAPH)</em> 18, 3 (1984), 119-127.</p>
<p>Homan Igehy. 1999. <a href="https://graphics.stanford.edu/papers/trd/">Tracing Ray Differentials</a>. In <em>SIGGRAPH ‘99 (Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques)</em>. 179–186.</p>
<p>Wenzel Jakob. 2010. <a href="http://www.mitsuba-renderer.org/"><em>Mitsuba Renderer</em></a>.</p>
<p>Christopher Kulla, Alejandro Conty, Clifford Stein, and Larry Gritz. 2018. <a href="https://dl.acm.org/citation.cfm?id=3180495">Sony Pictures Imageworks Arnold</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 29:1-29:18.</p>
<p>Mark Lee, Brian Green, Feng Xie, and Eric Tabellion. 2017. <a href="https://dl.acm.org/citation.cfm?doid=3105762.3105768">Vectorized Production Path Tracing</a>. In <em>HPG ‘17 (Proceedings of High Performance Graphics)</em>. 10:1-10:11.</p>
<p>Darwyn Peachey. 1990. <a href="https://graphics.pixar.com/library/TOD/"><em>Texture on Demand</em></a>. Technical Report 217. Pixar Animation Studios.</p>
<p>Matt Pharr, Wenzel Jakob, and Greg Humphreys. 2016. <a href="http://www.pbr-book.org"><em>Physically Based Rendering:
From Theory to Implementation</em></a>, 3rd ed. Morgan Kaufmann.</p>
<p>Matt Pharr. 2017. <a href="https://www.pbrt.org/texcache.pdf"><em>The Implementation of a Scalable Texture Cache</em></a>. Physically Based Rendering Supplemental Material.</p>
<p>Dan Piponi. 2004. <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2004.10504901">Automatic Differentiation, C++ Templates and Photogrammetry</a>. <em>Journal of Graphics Tools</em>. 9, 4 (2004), 41-55.</p>
<p>Mikio Shinya, Tokiichiro Takahashi, and Seiichiro Naito. 1987. <a href="https://dl.acm.org/citation.cfm?id=37408">Principles and Applications of Pencil Tracing</a>. <em>Computer Graphics (Proceedings of SIGGRAPH)</em> 21, 4 (1987), 45-54.</p>
<p>Frank Suykens and Yves. D. Willems. 2001. <a href="http://graphics.cs.kuleuven.be/publications/PATHDIFF/">Path Differentials and Applications</a>. In <em>Rendering Techniques 2001 (Proceedings of the 12th Eurographics Workshop on Rendering)</em>. 257–268.</p>
<p>Lance Williams. 1983. <a href="https://dl.acm.org/citation.cfm?id=801126">Pyramidal Parametrics</a>. <em>Computer Graphics (Proceedings of SIGGRAPH)</em> 17, 3 (1983), 1-11.</p>
https://blog.yiningkarlli.com/2018/08/hyperion-tog-paper.html
Transactions on Graphics Paper- The Design and Evolution of Disney's Hyperion Renderer
2018-08-17T00:00:00+00:00
2018-08-17T00:00:00+00:00
Yining Karl Li
<p>The August 2018 issue of <a href="https://tog.acm.org">ACM Transactions on Graphics</a> (Volume 37 Issue 3) is partially a special issue on production rendering, featuring five systems papers describing notable, major production renderers in use today.
I got to contribute to one of these papers as part of the Hyperion team at Walt Disney Animation Studios!
Our paper, titled “The Design and Evolution of Disney’s Hyperion Renderer”, discusses exactly what the title suggests.
We present a detailed look inside how Hyperion is designed today, discuss the decisions that went into its current design, and examine how Hyperion has evolved since the original EGSR 2013 “<a href="https://disney-animation.s3.amazonaws.com/uploads/production/publication_asset/70/asset/Sorted_Deferred_Shading_For_Production_Path_Tracing.pdf">Sorted Deferred Shading for Production Path Tracing</a>” paper that was the start of Hyperion.
A number of Hyperion developers contributed to this paper as co-authors, along with Hank Driskill, who was the technical supervisor on Big Hero 6 and Moana and was one of the key supporters of Hyperion’s early development and deployment.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Aug/design_of_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Aug/preview/design_of_hyperion.jpg" alt="Image from paper Figure 1: Production frames from Big Hero 6 (upper left), Zootopia (upper right), Moana (bottom left), and Olaf’s Frozen Adventure (bottom right), all rendered using Disney’s Hyperion Renderer." /></a></p>
<p>Here is the paper abstract:</p>
<p><em>Walt Disney Animation Studios has transitioned to path-traced global illumination as part of a progression of brute-force physically based rendering in the name of artist efficiency. To achieve this without compromising our geometric or shading complexity, we built our Hyperion renderer based on a novel architecture that extracts traversal and shading coherence from large, sorted ray batches. In this article, we describe our architecture and discuss our design decisions. We also explain how we are able to provide artistic control in a physically based renderer, and we demonstrate through case studies how we have benefited from having a proprietary renderer that can evolve with production needs.</em></p>
<p>The paper and related materials can be found at:</p>
<ul>
<li><a href="https://www.yiningkarlli.com/projects/hyperiondesign.html">Project Page (Author’s Version)</a></li>
<li><a href="https://dl.acm.org/citation.cfm?doid=3243123.3182159">Official Print Version (ACM Library)</a></li>
</ul>
<p>We owe a huge thanks to <a href="https://pharr.org/matt/">Matt Pharr</a>, who came up with the idea for a TOG special issue on production rendering and coordinated the writing of all of the papers, and <a href="http://www.cs.cornell.edu/~kb/">Kavita Bala</a>, who as editor-in-chief of TOG supported all of the special issue papers.
This issue has actually been in the works for some time; Matt Pharr contacted us over a year ago about putting together a special issue, and we began work on our paper in May 2017.
Matt and Kavita generously gave all of the contributors to the special issue a significant amount of time to write, and Matt provided a lot of valuable feedback and suggestions to all five of the final papers.
The end result is, in my opinion, something special indeed.
The five rendering teams that contributed papers in the end were Solid Angle’s Arnold, Sony Imageworks’ Arnold, Weta Digital’s Manuka, Pixar’s Renderman, and ourselves.
All five of the papers in the special issue are fascinating, well-written, highly technical rendering systems papers (as opposed to just marketing fluff), and absolutely worth a read!</p>
<p>Something important that I want to emphasize here is that the author lists for all five papers are somewhat deceptive.
One might think that the author lists represent all of the people responsible for each renderers’ success; this idea is, of course, inaccurate.
For Hyperion, the authors on this paper represent just a small fraction of all of the people responsible for Hyperion’s success.
Numerous engineers not on the author list have made significant contributions to Hyperion in the past, and the project relies enormously on all of the QA engineers, managers/leaders, TDs, artists, and production partners that test, lead, deploy, and use Hyperion every day.
We also owe an enormous amount to all of the researchers that we have collaborated directly with, or who we haven’t collaborated directly with but have used their work.
The success of every production renderer comes not just from the core development team, but instead from the entire community of folks that surround a production renderer; this is just as true for Hyperion as it is for Renderman, Arnold, Manuka, etc.
The following is often said in our field but nonetheless true: building an advanced production renderer in a reasonable timeframe really is only possible through a massive team effort.</p>
<p>This summer, in addition to publishing this paper, members of the Hyperion team also presented the following at SIGGRAPH 2018:</p>
<ul>
<li>Peter Kutz was on the “<a href="https://dl.acm.org/citation.cfm?id=3214901">Design and Implementation of Modern Production Renderers</a>” panel put together by Matt Pharr to discuss the five TOG production rendering papers. Originally Brent Burley was supposed to represent the Hyperion team, but due to some outside circumstances, Brent wasn’t able to make it to SIGGRAPH this year, so Peter went in Brent’s place.</li>
<li>Matt Jen-Yuan Chiang presented a talk on rendering eyes, titled “<a href="https://dl.acm.org/citation.cfm?id=3214751">Plausible Iris Caustics and Limbal Arc Rendering</a>”, in the “It’s a Material World” talks session.</li>
</ul>
https://blog.yiningkarlli.com/2018/07/disney-animation-datasets.html
Disney Animation Data Sets
2018-07-03T00:00:00+00:00
2018-07-03T00:00:00+00:00
Yining Karl Li
<p>Today at <a href="https://cg.ivd.kit.edu/egsr18/">EGSR 2018</a>, Walt Disney Animation Studios announced the release of two large, production quality/scale data sets for rendering research purposes.
The data sets are available on a new <a href="https://disneyanimation.com/data-sets/">data sets page on the official Disney Animation website</a>.
The first data set is the Cloud Data Set, which contains a large and highly detailed volumetric cloud data set that we used for our “<a href="https://blog.yiningkarlli.com/2017/07/spectral-and-decomposition-tracking.html">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>” SIGGRAPH 2017 paper, and the second data set is the Moana Island Scene, which is a full production scene from <a href="https://blog.yiningkarlli.com/2016/11/moana.html">Moana</a>.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/shotCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/shotCam_hyperion.jpg" alt="Figure 1: The Moana Island Data Set, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/wdas_cloud_hyperion_render.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/wdas_cloud_hyperion_render.jpg" alt="Figure 2: The Cloud Data Set, rendered using Disney's Hyperion Renderer." /></a></p>
<p>In this post, I’ll share some personal thoughts, observations, and notes.
The release of these data sets was announced by my teammate, Ralf Habel, at EGSR today, but this release has been in the works for a very long time now, and is the product of the collective effort of an enormous number of people across the studio.
A number of people deserve to be highlighted: Rasmus Tamstorf spearheaded the entire effort and was instrumental in getting the resources and legal approval needed for the Moana Island Scene.
Heather Pritchett is the TD that did the actual difficult work of extracting the Moana Island Scene out of Disney Animation’s production pipeline and converting it from proprietary data formats into usable, industry-standard data formats.
Sean Palmer and Jonathan Garcia also helped in resurrecting the data from Moana.
Hyperion developers Ralf Habel and Peter Kutz led the effort to get the Cloud Data Set approved and released; the cloud itself was made by artists Henrik Falt and Alex Nijmeh.
On the management side of things, technology manager Rajesh Sharma and Disney Animation CTO, <a href="https://twitter.com/ncannon?lang=en">Nick Cannon</a>, provided crucial support and encouragement.
Matt Pharr has been crucial in collaborating with us to get these data sets released.
Matt was highly accommodating in helping us get the Moana Island Scene into a PBRT scene; I’ll talk a bit more about this later.
Intel’s Embree team also gave significant feedback.
My role was actually quite small; along with other members of the Hyperion development team, I just provided some consultation throughout the whole process.</p>
<p>Please note the licenses that the data sets come with.
The Cloud Data Set is licensed under a <a href="https://disney-animation.s3.amazonaws.com/uploads/production/data_set_asset/6/asset/License_Cloud.pdf">Creative Commons Attribution ShareAlike 3.0 Unported License</a>; the actual cloud is based on a photograph by Kevin Udy on his <a href="https://coclouds.com/436/cumulus/%202012-07-26/">Colorado Clouds Blog</a>, which is also licensed under the same Creative Commons license.
The Moana Island Scene is licensed under a more restrictive, custom Disney Enterprises <a href="https://disney-animation.s3.amazonaws.com/uploads/production/data_set_asset/4/asset/License_Moana.pdf">research license</a>.
This is because the Moana Island Scene is a true production scene; it was actually used to produce frames in the final film.
As such, the data set is being released only for pure research and development purposes; it’s not meant for use in artistic projects.
Please stick to and follow the licenses these data sets are released under; if people end up misusing these data sets, then it makes releasing more data sets into the community in the future much harder for us.</p>
<p>This entire effort was sparked two years ago at SIGGRAPH 2016, when Matt Pharr made an appeal to the industry to provide representative production-scale data sets to the research community.
I don’t know how many times I’ve had conversations about how well new techniques or papers or technologies will scale to production cases, only to have further discussion stymied by the lack of any true production data sets that the research community can test against.
We decided as a studio to answer Matt’s appeal, and last year at SIGGRAPH 2017, Brent Burley and Rasmus Tamstorf announced our intention to release both the Cloud and Moana Island data sets.
It’s taken nearly a year from announcement to release because the process has been complex, and it was very important to the studio to make sure the release was done properly.</p>
<p>One of the biggest challenges was getting all of the data out of the production pipeline and our various proprietary data formats into something that the research community can actually parse and make use of.
Matt Pharr was extremely helpful here; over the past year, Matt has added support for <a href="http://ptex.us">Ptex</a> textures and implemented the <a href="http://blog.selfshadow.com/publications/s2015-shading-course/burley/s2015_pbs_disney_bsdf_notes.pdf">Disney Bsdf</a> in <a href="https://github.com/mmp/pbrt-v3">PBRT v3</a>.
Having Ptex and the Disney Bsdf available in PBRT v3 made PBRT v3 the natural target for an initial port to a renderer other than Hyperion, since internally all of Hyperion’s shading uses the Disney Bsdf, and all of our texturing is done through Ptex.
Our texturing also relies heavily on procedural <a href="https://www.disneyanimation.com/technology/seexpr.html">SeExpr</a> expressions; all of the expression-driven texturing had to be baked down into Ptex for the final release.</p>
<p>Both the Cloud and Moana Island data sets are, quite frankly, enormous.
The Cloud data set contains a single OpenVDB cloud that weighs in at 2.93 GB; the data set also provides versions of the VDB file scaled down to half, quarter, eighth, and sixteenth scale resolutions.
The Moana Island data set comes in three parts: a base package containing raw geometry and texture data, an animation package containing animated stuff, and a PBRT package containing a PBRT scene generated from the base package.
These three packages combined, uncompressed, weigh in at well over 200 GB of disk space; the uncompressed PBRT package alone weighs in at around 38 GB.</p>
<p>For the Moana Island Scene, the provided PBRT scene requires a minimum of around 90 GB of RAM to render.
This may seem enormous for consumer machines, because it is.
However, this is also what we mean by “production scale”; for Disney Animation, 90 GB is actually a fairly mid-range memory footprint for a production render.
On a 24-core, dual-socket Intel Xeon Gold 6136 system, the PBRT scene took me a little over an hour and 15 minutes to render from the ‘shotCam’ camera.
Hyperion renders the scene faster, but I would caution against using this data set to do performance shootouts between different renderers.
I’m certain that within a short period of time, enthusiastic members of the rendering community will end up porting this scene to Renderman and Arnold and Vray and Cycles and every other production renderer out there, which will be very cool!
But keep in mind, this data set was authored very specifically around Hyperion’s various capabilities and constraints, which naturally will be very different from how one might author a complex data set for other renderers.
Every renderer works a bit differently, so the most optimal way to author a data set for every renderer will be a bit different; this data set is no exception.
So if you want to compare renderers using this data set, make sure you understand how the way this data set is structured impacts the performance of whatever renderers you are comparing.</p>
<p>For example, Hyperion subdivides/tessellates/displaces everything to as close to sub-poly-per-pixel as it can get while still fitting within computational resources.
This means our scenes are usually very heavily subdivided and tessellated.
However, the PBRT version of the scene doesn’t come with any subdivision; as a result, silhouettes in the following comparison images don’t fully match in some areas.
Similarly, PBRT’s lights and lighting model differ from Hyperion’s, and Hyperion has various artistic controls that are unique to Hyperion, meaning the renders produced by PBRT versus Hyperion differ in many ways:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/shotCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/shotCam_hyperion.jpg" alt="Figure 3a: 'shotCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/shotCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/shotCam_pbrt.jpg" alt="Figure 3b: 'shotCam' camera angle, rendered using PBRT v3." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/beachCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/beachCam_hyperion.jpg" alt="Figure 4a: 'beachCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/beachCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/beachCam_pbrt.jpg" alt="Figure 4b: 'beachCam' camera angle, rendered using PBRT v3." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/dunesACam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/dunesACam_hyperion.jpg" alt="Figure 5a: 'dunesACam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/dunesACam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/dunesACam_pbrt.jpg" alt="Figure 5b: 'dunesACam' camera angle, rendered using PBRT v3. Some of the plants are in slightly different locations than the Hyperion render; this was just a small change that happened in data conversion to the PBRT scene." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/flowersCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/flowersCam_hyperion.jpg" alt="Figure 6a: 'flowersCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/flowersCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/flowersCam_pbrt.jpg" alt="Figure 6b: 'flowersCam' camera angle, rendered using PBRT v3. Note that the silhouette of the flowers is different compared to the Hyperion render because the Hyperion render subdivides the flowers, whereas the PBRT render displays the base cage." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/grassCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/grassCam_hyperion.jpg" alt="Figure 7a: 'grassCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/grassCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/grassCam_pbrt.jpg" alt="Figure 7b: 'grassCam' camera angle, rendered using PBRT v3. The sand dune in the background looks particularly different from the Hyperion render due to subdivision and displacement." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/palmsCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/palmsCam_hyperion.jpg" alt="Figure 8a: 'palmsCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/palmsCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/palmsCam_pbrt.jpg" alt="Figure 8b: 'palmsCam' camera angle, rendered using PBRT v3. The palm leaves look especially different due to differences in artistic lighting shaping and curve shading differences. Most notably, the look in Hyperion depends heavily on attributes that vary along the length of the curve, which is something PBRT doesn't support yet. Some more work is needed here to get the palm leaves to look more similar between the two renders." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/rootsCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/rootsCam_hyperion.jpg" alt="Figure 9a: 'rootsCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/rootsCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/rootsCam_pbrt.jpg" alt="Figure 9b: 'rootsCam' camera angle, rendered using PBRT v3. Again, the significant difference in appearance in the rocks is probably just due to subdivision/tesselation/displacement." /></a></p>
<p>Another example of a major difference between the Hyperion renders and the PBRT renders is in the water, which Hyperion renders using photon mapping to get the caustics.
The provided PBRT scenes use unidirectional pathtracing for everything including the water, hence the very different caustics.
Similarly, the palm trees in the ‘palmsCam’ camera angle look very different between PBRT and Hyperion because Hyperion’s lighting controls are very different from PBRT; Hyperion’s lights include various artistic controls for custom shaping and whatnot, which aren’t necessarily fully physical.
Also, the palm leaves are modeled using curves, and the shading depends on varying colors and attributes along the length and width of the curve, which PBRT doesn’t support yet (getting the palm leaves to match is actually the top priority if more resources are freed up to improve the data set release).
These differences between renderers don’t necessarily mean that one renderer is better than the other; they simply mean that the renderers are different.
This will be true for any pair of renderers that one wants to compare.</p>
<p>The Cloud Data Set includes an example render from Hyperion, which implements our Spectral and Decomposition Tracking paper in its volumetric rendering system to efficiently render the cloud with thousands of bounces.
This render contains no post-processing; what you see in the provided image is exactly what Hyperion outputs.
The VDB file expresses the cloud as a field of heterogeneous densities.
Also provided is an example <a href="https://www.mitsuba-renderer.org">Mitsuba</a> scene, renderable using the <a href="https://github.com/zhoub/mitsuba-vdb">Mitsuba-VDB plugin that can be found on Github</a>.
Please consult the README file for some modifications in Mitsuba that are necessary to render the cloud.
Also, please note that the Mitsuba example will take an extremely long time to render, since Mitsuba isn’t really meant to render high-albedo heterogeneous volumes.
With proper acceleration structures and algorithms, rendering the cloud only takes us a few minutes using Hyperion, and should be similarly fast in any modern production renderer.</p>
<p>One might wonder just why production data sets in general are so large.
This is an interesting question; the short answer across the industry basically boils down to “artist time is more expensive and valuable than computer hardware”.
We could get these scenes to fit into much smaller footprints if we were willing to make our artists spend a lot of time aggressively optimizing assets and scenes and whatnot so that we could fit these scenes into smaller disk, memory, and compute footprints.
However, this isn’t actually always a good use of artist time; computer hardware is cheap compared to wasting artist time, which often could be better spent elsewhere making the movie better.
Throwing more memory and whatnot at huge data sets is also simply more scalable than using more artist resources, relatively speaking.</p>
<p>Both data sets come with detailed README documents; the Moana Island Scene’s documentation in particular is quite extensive and contains a significant amount of information about how assets are authored and structured at Disney Animation, and how renders are lit, art-directed, and assembled at Disney Animation.
I highly recommend reading all of the documentation carefully if you plan on working with these data sets, or just if you are generally curious about how production scenes are built at Disney Animation.</p>
<p>Personally, I’m very much looking forward to seeing what the rendering community (and the wider computer graphics community at large) does with these data sets!
I’m especially excited to see what the realtime world will be able to do with this data; seeing the Moana Island Scene in its full glory in Unreal Engine 4 or Unity would be something indeed, and I think these data sets should provide a fantastic challenge to research into light transport and ray tracing speed as well.
If you do interesting things with these data sets, please write to us at the email addresses in the provided README files!</p>
<p>Also, Matt Pharr <a href="http://pharr.org/matt/blog/2018/07/08/moana-island-pbrt-1.html">has written on his blog</a> about how the Moana Island Scene has further driven the development of PBRT v3.
I highly recommend giving Matt’s blog a read!</p>
https://blog.yiningkarlli.com/2018/02/scandinavian-room-scene.html
Scandinavian Room Scene
2018-02-23T00:00:00+00:00
2018-02-23T00:00:00+00:00
Yining Karl Li
<p>Almost three years ago, I rendered a small <a href="https://blog.yiningkarlli.com/2015/05/complex-room-renders.html">room interior scene</a> to test an indoor, interior illumination scenario.
Since then, a lot has changed in Takua, so I thought I’d revisit an interior illumination test with a much more complex, difficult scene.
I don’t have much time to model stuff anymore these days, so instead I bought <a href="https://evermotion.org/shop/show_product/archinteriors-vol-48/14307">Evermotion’s Archinteriors Volume 48</a> collection, which is labeled as Scandinavian interior room scenes (I don’t know what’s particularly Scandinavian about these scenes, but that’s what the label said) and ported one of the scenes to Takua’s scene format.
Instead of simply porting the scene as-is, I modified and added various things in the scene to make it feel a bit more customized.
See if you can spot what they are:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam0.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam0.0.jpg" alt="Figure 1: A Scandinavian room interior, rendered in Takua a0.8 using VCM." /></a></p>
<p>I had a lot of fun adding all of my customizations!
I brought over some props from the old complex room scene, such as the purple flowers and vase, a few books, and the Utah teapot tea set, and also added a few new fun models, such as the MacBook Pro in the back and the copy of Physically Based Rendering 3rd Edition in the foreground.
The black and white photos on the wall are crops of my <a href="https://blog.yiningkarlli.com/2016/07/minecraft-in-renderman-ris.html">Minecraft renders</a>, and some of the books against the back wall have fun custom covers and titles.
Even all of the elements that came with the original scene are re-shaded.
The original scene came with Vray’s standard VrayMtl as the shader for everything; Takua’s base shader parameterization draws some influence from Vray, but also from the Disney Bsdf and Arnold’s AlShader, and as a result is sufficiently different that I wound up just re-shading everything instead of trying to write a conversion tool.
For the most part I was able to re-use the textures that came with the scene to drive various shader parameters.
The skydome is from the noncommercial version of <a href="https://www.viz-people.com/shop/hdri-v1/">VizPeople’s HDRi v1 collection</a>.</p>
<p>Speaking of the skydome… the main source of illumination in this scene comes from the sun in the skydome, which presented a huge challenge for efficient light sampling.
Takua has had domelight/environment map importance sampling using CDF inversion sampling for a long time now, which helps a lot, but the indoor nature of this scene still made sampling the sun difficult.
Sampling the sun in an outdoor scene is fairly efficient since most rays will actually reach the sun, but in indoor scenes, importance sampling the sun becomes inefficient without taking occlusion into account since only rays that actually make it outdoors through windows can reach the sun.
The best known method currently for handling domelight importance sampling through windows in an indoor scene is <a href="https://benedikt-bitterli.me/PMEMS.pdf">Portal Masked Environment Map Sampling (PMEMS) by Bitterli et al</a>.
I haven’t actually implemented PMEMS yet though, so the renders in this post all wound up requiring a huge number of samples per pixel to render; I intend on implementing PMEMS at some point in the near future.</p>
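<p>As a rough illustration of the CDF inversion approach mentioned above (this is a self-contained sketch, not Takua’s actual implementation), the basic idea is to build a marginal CDF over the rows of the latitude-longitude environment map and a conditional CDF within each row, both weighted by luminance (and by sin(theta) to account for texel solid angle shrinking towards the poles), and then invert both CDFs with a pair of uniform random numbers; mapping the chosen texel back to a direction and pdf is omitted here for brevity:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of environment map importance sampling via CDF inversion. The input
// is a flattened height x width array of per-texel luminance values from a
// latitude-longitude environment map.
struct EnvMapSampler {
    int width, height;
    std::vector<float> marginalCdf;                 // one entry per row
    std::vector<std::vector<float>> conditionalCdf; // one CDF per row

    EnvMapSampler(const std::vector<float>& luminance, int w, int h)
        : width(w), height(h), marginalCdf(h), conditionalCdf(h, std::vector<float>(w)) {
        const float pi = 3.14159265358979f;
        float marginalTotal = 0.0f;
        for (int y = 0; y < h; y++) {
            // sin(theta) weighting accounts for texel solid angle near the poles
            const float sinTheta = std::sin(pi * (y + 0.5f) / float(h));
            float rowTotal = 0.0f;
            for (int x = 0; x < w; x++) {
                rowTotal += luminance[y * w + x] * sinTheta;
                conditionalCdf[y][x] = rowTotal;
            }
            const float rowNorm = std::max(rowTotal, 1e-8f);
            for (int x = 0; x < w; x++) {
                conditionalCdf[y][x] /= rowNorm;
            }
            marginalTotal += rowTotal;
            marginalCdf[y] = marginalTotal;
        }
        for (int y = 0; y < h; y++) {
            marginalCdf[y] /= std::max(marginalTotal, 1e-8f);
        }
    }

    // Invert the marginal CDF to pick a row, then the row's conditional CDF to
    // pick a column; brighter texels are picked proportionally more often.
    void sample(float u1, float u2, int* x, int* y) const {
        *y = int(std::lower_bound(marginalCdf.begin(), marginalCdf.end(), u1) - marginalCdf.begin());
        *y = std::min(*y, height - 1);
        const std::vector<float>& row = conditionalCdf[*y];
        *x = int(std::lower_bound(row.begin(), row.end(), u2) - row.begin());
        *x = std::min(*x, width - 1);
    }
};
</code></pre></div></div>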
<p>Apart from the skydome, this scene also contains several other practical light sources, such as the lamp’s bulb, the MacBook Pro’s screen, and the MacBook Pro’s glowing Apple logo on the back of the screen (which isn’t even visible to camera, but is still enabled since it provides a tiny amount of light against the back wall!).
In addition to choosing where on a single light to sample, choosing which light to sample is also an extremely important and difficult problem.
Until rendering this scene, I hadn’t really put any effort into efficiently selecting which light to sample.
Most of my focus has been on the integration part of light transport, so Takua’s light selection has just been uniform random selection.
Uniform random selection is terrible for scenes that contain multiple lights with highly varying emission between different lights, which is absolutely the case for this scene.
Like any other importance sampling problem, the ideal solution is to send rays towards lights with a probability proportional to the amount of illumination we expect each light to contribute to each ray origin point.</p>
<p>I implemented a light selection strategy where the probability of selecting each light is weighted by the total emitted power of each light; essentially this boils down to estimating the total emitted power of each light according to the light’s surface texture and emission function, building a CDF across all of the lights using the total emission estimates, and then using standard CDF inversion sampling to pick lights.
This strategy works significantly better than uniform random selection and made a huge difference in render speed for this scene, as seen in Figures 2 through 4.
Figure 2 uses uniform random light selection with 128 spp; note how the area lit by the wall-mounted lamp is well sampled, but the image overall is really noisy.
Figure 3 uses power-weighted light selection with the same spp as Figure 2; the lamp area is more noisy than in Figure 2, but the render is less noisy overall.
Notably, Figure 3 also took a third of the time compared to Figure 2 for the same sample count; this is because in this scene, sending rays towards the lamp is significantly more expensive due to heavier geometry than sending rays towards the sun, even when rays towards the sun get occluded by the walls.
Figure 4 uses power-weighted light selection again, but is equal-time to Figure 2 instead of equal-spp; note the significant noise reduction:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.0.uniform.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.0.uniform.jpg" alt="Figure 2: The same frame from Figure 1, 128 spp using uniform random light selection. Average pixel RMSE compared to Figure 1: 0.439952." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.0.power.equalsample.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.0.power.equalsample.jpg" alt="Figure 3: Power-weighted light selection, with equal spp to Figure 2. Average pixel RMSE compared to Figure 1: 0.371441." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.0.power.equaltime.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.0.power.equaltime.jpg" alt="Figure 4: Power-weighted light selection again, but this time with equal time instead of equal spp to Figure 2. Average pixel RMSE compared to Figure 1: 0.315465." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room_sampling_crops.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/room_sampling_crops.png" alt="Figure 5: Zoomed crops of Figures 2 through 4. From left to right: uniform random sampling, equal sample power-weighted sampling, and equal time power-weighted sampling." /></a></p>
<p>However, power-weighted light selection still is not even close to being the most optimal technique possible; this technique completely ignores occlusion and distance, which are extremely important.
Unfortunately, because occlusion and distance to each light varies for each point in space, creating a light selection strategy that takes occlusion and distance into account is extremely difficult and is a subject of continued research in the field.
In Hyperion, we use a cache point system, which we described on page 97 of our <a href="https://graphics.pixar.com/library/ProductionVolumeRendering/paper.pdf">SIGGRAPH 2017 Production Volume Rendering course notes</a>.
Other published research on the topic includes <a href="https://cgl.ethz.ch/publications/papers/paperMue17a.php">Practical Path Guiding for Efficient Light-Transport Simulation</a> by Muller et al, <a href="http://cgg.mff.cuni.cz/~jaroslav/papers/2014-onlineis/">On-line Learning of Parametric Mixture Models for Light Transport Simulation</a> by Vorba et al, <a href="http://cgg.mff.cuni.cz/~jaroslav/papers/2016-productis/2016-productis-paper.pdf">Product Importance Sampling for Light Transport Path Guiding</a> by Herholz et al, <a href="https://arxiv.org/abs/1701.07403">Learning Light Transport the Reinforced Way</a> by Dahm et al, and more.
At some point in the future I’ll revisit this topic.</p>
<p>For a long time now, Takua has also had a simple interactive mode where the camera can be moved around in a non-shaded/non-lit view; I used this mode to interactively scout out some interesting and fun camera angles for some more renders.
Being able to interactively scout in the same renderer used for final rendering is an extremely powerful tool; instead of guessing at depth of field settings and such, I was able to directly set and preview depth of field with immediate feedback.
Unfortunately some of the renders below are noisier than I would like, due to the previously mentioned light sampling difficulties.
All of the following images are rendered using Takua a0.8 with VCM:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam1.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam1.0.jpg" alt="Figure 6: A MacBook Pro running Takua Renderer to produce Figure 1." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam2.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam2.0.jpg" alt="Figure 7: Physically Based Rendering Third Edition sitting on the coffee table." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam3.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam3.0.jpg" alt="Figure 8: Closeup of the same purple flowers from the old Complex Room scene." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam4.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam4.0.jpg" alt="Figure 9: Utah Teapot tea set on the coffee table." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam5.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam5.0.jpg" alt="Figure 10: A glass globe with mirror-polished metal continents, sitting in the sunlight from the window." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam6.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam6.0.jpg" alt="Figure 11: Close-up of two glass and metal mugs filled with tea." /></a></p>
<p>Beyond difficult light sampling, generally complex and difficult light transport with lots of subtle caustics also wound up presenting major challenges in this scene.
For example, note the subtle caustics on the wall in the upper right hand part of Figure 10; those caustics are actually visibly not fully converged, even though the sample count across Figure 10 was in the thousands of spp!
I intentionally did not use adaptive sampling in any of these renders; instead, I wanted to experiment with a common technique used in a lot of modern production renderers for noise reduction: in-render firefly clamping.
My adaptive sampler is already capable of detecting firefly pixels and driving more samples at fireflies in the hopes of accelerating variance reduction on firefly pixels, but firefly clamping is a much more crude, biased, but nonetheless effective technique.
The idea is to detect, as each new sample arrives at a pixel, whether the returned sample is an outlier relative to all of the previously accumulated samples for that pixel, and discard or clamp the sample if it is in fact an outlier.
Picking what threshold to use for outlier detection is a very manual process; even Arnold provides a <a href="https://support.solidangle.com/display/AFMUG/Clamping">tuning max-value parameter</a> for firefly clamping.</p>
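<p>For a sense of what such an outlier test can look like in practice, here is a simplified sketch of per-pixel firefly clamping built around a plain running-mean pixel estimate; the threshold factor is the manually tuned knob discussed above, and this is just an illustrative sketch rather than Takua’s actual accumulation code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <algorithm>
#include <glm/glm.hpp>

// Simplified per-pixel firefly clamping. A new sample whose luminance exceeds
// the running pixel estimate by more than thresholdFactor is scaled down so
// that its luminance sits at the threshold; this introduces bias, but
// suppresses high-energy outlier samples.
struct PixelAccumulator {
    glm::vec3 accum = glm::vec3(0.0f);
    int count = 0;

    static float luminance(const glm::vec3& c) {
        return 0.2126f * c.x + 0.7152f * c.y + 0.0722f * c.z;
    }

    void addSample(const glm::vec3& sample, float thresholdFactor) {
        glm::vec3 accepted = sample;
        if (count > 0) {
            const float estimateLum = luminance(accum / float(count));
            const float sampleLum = luminance(sample);
            const float threshold = thresholdFactor * std::max(estimateLum, 1e-4f);
            if (sampleLum > threshold) {
                // Clamp the outlier sample down to the threshold luminance
                accepted = sample * (threshold / sampleLum);
            }
        }
        accum += accepted;
        count++;
    }

    glm::vec3 estimate() const {
        return (count > 0) ? accum / float(count) : glm::vec3(0.0f);
    }
};
</code></pre></div></div>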
<p>I wanted to be able to directly compare the render with and without firefly clamping, so I implemented firefly clamping on top of Takua’s AOV system.
When enabled, firefly clamping mode produces two images for a single render: one output with firefly clamping enabled, and one with clamping disabled.
I tried re-rendering Figure 10 using unidirectional pathtracing and a relatively low spp count to produce as many fireflies as I could, for a clearer comparison.
For this test, I set the firefly threshold to be samples that are at least 250 times brighter than the estimated pixel value up to that sample.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam5.fireflies.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam5.fireflies.jpg" alt="Figure 12: The same render as Figure 10, but rendered with a lower sample count and using unidirectional pathtracing instead of VCM to draw out more fireflies." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam5.nofireflies.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam5.nofireflies.jpg" alt="Figure 13: From the same run of Takua Renderer as Figure 12, but the firefly-clamped render output instead of the raw render." /></a></p>
<p>Note how Figure 13 appears to be completely firefly-free compared to Figure 12, and how Figure 13 doesn’t have visible caustic noise on the walls compared to Figure 10.
However, notice how Figure 13 is also missing significant illumination in some areas, such as in the corner of the walls near the floor behind the wooden step ladder, or in the deepest parts of the purple flower bunch.
Finding a threshold that eliminates all fireflies without losing significant illumination in other areas is very difficult or, in some cases, impossible, since some of these types of light transport essentially manifest as firefly-like high energy samples that only smooth out over time.
For the final renders in Figure 1 and Figures 6 through 11, I wound up not actually using any firefly clamping.
While biased noise-reduction techniques are a necessary evil in actual production, I expect that I’ll try to avoid relying on firefly clamping in the vast majority of what I do with Takua, since Takua is meant to just be a brute-force, hobby kind of thing anyway.</p>
https://blog.yiningkarlli.com/2017/12/lambo-renders-revisited.html
Aventador Renders Revisited
2017-12-03T00:00:00+00:00
2017-12-03T00:00:00+00:00
Yining Karl Li
<p>A long time ago, I made <a href="http://blog.yiningkarlli.com/2013/03/stratified-versus-uniform-sampling.html">some</a> <a href="http://blog.yiningkarlli.com/2013/03/first-progress-on-new-pathtracing-core.html">posts</a> that featured a cool Lamborghini Aventador model.
Recently, I revisited that model and made some new renders using the current version of Takua, mostly just for fun.
To me, one of the most important parts of writing a renderer has always been being able to actually use the renderer to make fun images.
The last time I rendered this model was something like four years ago, and back then Takua was still in a very basic state; the renders in those old posts don’t even have any shading beyond 50% grey lambertian surfaces!
The renders in this post utilize a lot of advanced features that I’ve added since then, such as a proper complex layered Bsdf and texturing system, advanced bidirectional light transport techniques, huge speed improvements to ray traversal, advanced motion blur and generalized time capabilities, and more.
I’m way behind in writing up many of these features and capabilities, but in the meantime, I thought I’d post some for-fun rendering projects I’ve done with Takua.</p>
<p>All of the renders in this post are directly from Takua, with a basic white balance and conversion from HDR EXR to LDR PNG being the only post-processing steps.
Each render took about half a day to render (except for the wireframe render, which was much faster) on a 12-core workstation at 2560x1440 resolution.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_orangered.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_orangered.jpg" alt="Figure 1: An orange-red Lamborghini Aventador, rendered in Takua a0.7 using VCM." /></a></p>
<p>Shading the Aventador model was a fun, interesting exercise.
I went for an orange-red paint scheme since, well, Lamborghinis are supposed to look outrageous and orange-red is a fairly exotic paint scheme (I suppose I could have picked green or yellow or something instead, but I like orange-red).
I ended up making a triple-lobe shader with a metallic base, a dielectric lobe, and a clear-coat lobe on top of that.
The base lobe uses a GGX microfacet metallic Brdf.
Takua’s shading system implements a proper metallic Fresnel model for conductors, where the Fresnel model includes both an <em>Nd</em> component representing the refractive index and a <em>k</em> component representing the extinction coefficient for when an electromagnetic wave propagates through a material.
For conductors, the final Fresnel index of refraction for each wavelength of light is defined by a complex combination of <em>Nd</em> and <em>k</em>.
For the base metallic lobe, most of the color wound up coming from the <em>k</em> component.
The dielectric lobe is meant to simulate paint on top of a car’s metal body; the dielectric lobe is where most of the orange-red color comes from.
The dielectric lobe is again a GGX microfacet Brdf, but with a dielectric Fresnel model, which has a much simpler index of refraction calculation than the metallic Fresnel model does.
I should note that Takua’s current standard material implementation actually only supports a single primary specular lobe and an additional single clear-coat lobe, so for shaders authored with both a metallic and dielectric component, Takua takes a blend weight between the two components and for each shading evaluation stochastically selects between the two lobes according to the blend weight.
The clear-coat layer on top has just a slight amount of extinction to provide a bit more of the final orange look, but is otherwise mostly clear.</p>
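<p>For reference, here is the common approximate form of the unpolarized Fresnel reflectance for a conductor at a single wavelength, written in terms of the <em>Nd</em> (here, eta) and <em>k</em> components discussed above; this is the standard textbook approximation assuming the incident medium has an index of refraction of 1, and isn’t necessarily the exact formulation Takua uses:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Approximate unpolarized Fresnel reflectance for a conductor at a single
// wavelength. eta is the refractive index component (Nd) and k is the
// extinction coefficient; the incident medium is assumed to have an index of 1.
float fresnelConductor(float cosThetaI, float eta, float k) {
    const float cos2 = cosThetaI * cosThetaI;
    const float etaK2 = eta * eta + k * k;

    // Squared reflectance for light polarized parallel and perpendicular to
    // the plane of incidence
    const float rParl2 = (etaK2 * cos2 - 2.0f * eta * cosThetaI + 1.0f) /
                         (etaK2 * cos2 + 2.0f * eta * cosThetaI + 1.0f);
    const float rPerp2 = (etaK2 - 2.0f * eta * cosThetaI + cos2) /
                         (etaK2 + 2.0f * eta * cosThetaI + cos2);

    // Unpolarized light: average of the two polarizations
    return 0.5f * (rParl2 + rPerp2);
}
</code></pre></div></div>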
<p>All of the window glass in the render is tinted slightly dark through extinction instead of through a fixed refraction color.
Using proper extinction to tint glass is more realistic than using a fixed refraction color.
Similarly, the red and yellow glass used in the head lights and tail lights are colored through extinction.
The brake disks use an extremely high resolution bump map to get the brushed metal look.
The branding and markings on the tire walls are done through a combination of bump mapping and adjusting the roughness of the microfacet Brdf; the tire treads are made using a high resolution normal map.
There’s no <a href="http://blog.yiningkarlli.com/2017/05/subdivision-and-displacement.html">displacement mapping</a> at all, although in retrospect the tire treads probably should be displacement mapped if I want to put the camera closer to them.
Also, I actually didn’t really shade the interior of the car much, since I knew I was going for exterior shots only.</p>
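<p>As a quick aside on what tinting through extinction means in practice: transmittance through an absorbing medium follows Beer-Lambert falloff, so the tint deepens with the distance light travels through the glass instead of being a fixed refraction color. A tiny sketch, with a hypothetical helper for deriving the absorption coefficient from a target color at a chosen reference thickness:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <glm/glm.hpp>

// Beer-Lambert transmittance through an absorbing medium: longer paths
// through the glass absorb more light, so the tint deepens with thickness.
glm::vec3 transmittance(const glm::vec3& sigmaA, float distance) {
    return glm::exp(-sigmaA * distance);
}

// One common way to author sigmaA: specify the color the glass should appear
// at some reference thickness, then solve the Beer-Lambert equation for sigmaA.
glm::vec3 absorptionFromColor(const glm::vec3& targetColor, float referenceDistance) {
    return -glm::log(glm::max(targetColor, glm::vec3(1e-4f))) / referenceDistance;
}
</code></pre></div></div>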
<p>Eventually I’ll get around to implementing a proper car paint Bsdf in Takua, but until then, the approach I took here seems to hold up reasonably well as long as the camera doesn’t get super close up to the car.</p>
<p>I lit the scene using two lights: an HDR skydome from <a href="http://hdri-skies.com">HDRI-Skies</a>, and a single long, thin rectangular area light above the car.
The skydome provides the overall soft-ish lighting that illuminates the entire scene, and the rectangular area light provides the long, interesting highlights on the car body that help with bringing out the car’s shape.</p>
<p>For all of the renders in this post, I used my VCM integrator, since the scene contains a lot of subtle caustics and since the inside of the car is lit entirely through glass.
I also wound up modifying my <a href="http://blog.yiningkarlli.com/2015/03/adaptive-sampling.html">adaptive sampler</a>; it’s still the same adaptive sampler that I’ve had for a few years now, but with an important extension.
Instead of simply reducing the total number of paths per iteration as areas reach convergence, the adaptive sampler now keeps the number of paths the same and instead reallocates paths from completed pixels to high-variance pixels.
The end result is that the adaptive sampler is now much more effective at eliminating fireflies and targeting caustics and other noisy areas.
In the above render, some pixels wound up with as few as 512 samples, while a few particularly difficult pixels finished with as many as 20000 samples.
Here is the adaptive sampling heatmap for Figure 1 above; brighter areas indicate more samples. Note how the adaptive sampler found a number of areas that we’d expect to be challenging, such as the interior through the car’s glass windows, and parts of the body with specular inter-reflections.</p>
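<p>To illustrate the reallocation idea in the simplest possible terms (this is a toy sketch, not how Takua’s adaptive sampler is actually structured): keep the per-iteration path budget fixed, drop converged pixels from the active set, and hand out the whole budget to the remaining pixels in proportion to their variance estimates:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include <vector>

// Toy sketch of reallocating a fixed per-iteration path budget away from
// converged pixels. 'variance' holds a per-pixel variance estimate; pixels at
// or below the convergence threshold receive no further samples, and the
// entire budget is redistributed proportionally to variance among the rest.
std::vector<uint32_t> allocateSamples(const std::vector<float>& variance,
                                      float convergenceThreshold,
                                      uint64_t totalPathBudget) {
    std::vector<uint32_t> samples(variance.size(), 0);
    double activeVarianceSum = 0.0;
    for (float v : variance) {
        if (v > convergenceThreshold) {
            activeVarianceSum += v;
        }
    }
    if (activeVarianceSum <= 0.0) {
        return samples; // everything has converged; nothing left to allocate
    }
    for (size_t i = 0; i < variance.size(); i++) {
        if (variance[i] > convergenceThreshold) {
            const double share = variance[i] / activeVarianceSum;
            samples[i] = uint32_t(share * double(totalPathBudget));
        }
    }
    return samples;
}
</code></pre></div></div>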
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_sampleMask.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_sampleMask.jpg" alt="Figure 2: Adaptive sampling heatmap for Figure 1. Brighter areas indicate more samples." /></a></p>
<p>I recently implemented support for arbitrary camera shutter curves, so I thought doing a motion blurred render would be fun.
After all, Lamborghinis are supposed to go fast!
I animated the Lamborghini driving forward in Maya; the animation was very basic, with the main body just translating forward and the wheels both translating and rotating.
Of course Takua has proper rotational motion blur.
The motion blur here is effectively multi-segment motion blur; generating multi-segment motion blur from an animated sequence in Takua is very easy due to how Takua handles and understands time.
I actually think that Takua’s concept of time is one of the most unique things in Takua; it’s very different from how every other renderer I’ve used and seen handles time.
I intend to write more about this later.
Instead of an instantaneous shutter, I used a custom cosine-based shutter curve that places many more time samples near the center of the shutter interval than towards the shutter open and close.
Using a shutter shape like this wound up being important to getting the right look to the motion blur; even though the car is moving extremely quickly, the overall form of the car is still clearly distinguishable and the front and back of the car appear more motion-blurred than the main body.</p>
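<p>The key operation behind a non-uniform shutter is warping uniform random numbers into time samples whose density follows the shutter curve. For a curve proportional to sin(pi*t) over a normalized shutter interval [0, 1], which peaks at the center of the exposure and falls to zero at shutter open and close, the inverse-CDF warp has a simple closed form; the specific curve here is an assumed stand-in rather than the exact curve I used:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>

// Warp a uniform random number u in [0,1) into a shutter time t in [0,1]
// distributed proportionally to sin(pi * t). The CDF of this curve is
// (1 - cos(pi * t)) / 2, so inverting it gives the expression below; most
// time samples land near the middle of the shutter interval.
float sampleShutterTime(float u) {
    const float pi = 3.14159265358979f;
    return std::acos(1.0f - 2.0f * u) / pi;
}
</code></pre></div></div>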
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_orangered_motionblur.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_orangered_motionblur.jpg" alt="Figure 3: Motion blurred render, using multi-segment motion blur with a cosine-based shutter curve." /></a></p>
<p>Since Takua has a procedural wireframe texture now, I also did a wireframe render.
I mentioned my procedural wireframe texture in a previous post, but I didn’t write about how it actually works.
For triangles and quads, the wireframe texture is simply based on the distance from the hitpoint to the nearest edge.
If the distance to the nearest edge is smaller than some threshold, draw one color, otherwise, draw some other color.
The nearest edge calculation can be done as follows (the variable names should be self-explanatory):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float calculateMinDistance(const Poly& p, const Intersection& hit) const {
float md = std::numeric_limits<float>::infinity();
const int verts = p.isQuad() ? 4 : 3;
for (int i = 0; i < verts; i++) {
const glm::vec3& cur = p[i].m_position;
const glm::vec3& next = p[(i + 1) % verts].m_position;
const glm::vec3 d1 = glm::normalize(next - cur);
const glm::vec3 d2 = hit.m_point - cur;
const float l = glm::length((cur + d1 * glm::dot(d1, d2) - hit.m_point));
md = glm::min(md, l * l);
}
return md;
};
</code></pre></div></div>
<p>The topology of the meshes is pretty strange, since the car model came as a triangle mesh, which I then subdivided:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_wireframe.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_wireframe.jpg" alt="Figure 4: Procedural wireframe texture." /></a></p>
<p>The material in the wireframe render only uses the lambertian diffuse lobe in Takua’s standard material; as such, the adaptive sampling heatmap for the wireframe render is interesting to compare to Figure 2.
Overall the sample distribution is much more even, and areas where diffuse inter-reflections are present got more samples:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_wireframe_sampleMask.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_wireframe_sampleMask.jpg" alt="Figure 5: Adaptive sampling heatmap for Figure 4. Brighter areas indicate more samples. Compare with Figure 2." /></a></p>
<p>Takua’s shading model supports layering different materials through parameter blending, similar to how the <a href="https://disney-animation.s3.amazonaws.com/library/s2012_pbs_disney_brdf_notes_v2.pdf">Disney Brdf</a> (and, at this point, <a href="http://blog.selfshadow.com/publications/s2017-shading-course/walster/s2017_pbs_volumetric_notes.pdf">most</a> <a href="http://blog.selfshadow.com/publications/s2017-shading-course/dreamworks/s2017_pbs_dreamworks_notes.pdf">other</a> <a href="http://blog.selfshadow.com/publications/s2017-shading-course/pixar/s2017_pbs_pixar_notes.pdf">shading</a> <a href="http://blog.selfshadow.com/publications/s2017-shading-course/imageworks/s2017_pbs_imageworks_slides.pdf">systems</a>) handles material layering.
I wanted to make an even more outrageous looking version of the Aventador than the orange-red version, so I used the procedural wireframe texture as a layer mask to drive parameter blending between a black paint and a metallic gold paint:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_gold.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_gold.jpg" alt="Figure 6: An outrageous Aventador paint scheme using a procedural wireframe texture to blend between black and metallic gold car paint." /></a></p>
https://blog.yiningkarlli.com/2017/11/olafs-frozen-adventure.html
Olaf's Frozen Adventure
2017-11-16T00:00:00+00:00
2017-11-16T00:00:00+00:00
Yining Karl Li
<p>After an amazing 2016, <a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a> is having a bit of a break year this year.
Disney Animation doesn’t have a feature film this year; instead, we made a half-hour featurette called <a href="https://www.disneyanimation.com/projects/olafsfrozenadventure">Olaf’s Frozen Adventure</a>, which will be released in front of Pixar’s <a href="https://www.pixar.com/feature-films/coco#coco-main">Coco</a> during Thanksgiving.
I think this is the first time a Disney Animation short/featurette has accompanied a Pixar film.
Olaf’s Frozen Adventure is a fun little holiday story set in the world of Frozen, and I had the privilege of getting to play a small role in making Olaf’s Frozen Adventure!
I got an official credit as part of a handful of engineers that did some specific, interesting technology development for Olaf’s Frozen Adventure.</p>
<p>Olaf’s Frozen Adventure is really really funny; because Olaf is the main character, the entire story takes on much more of a self-aware, at times somewhat absurdist tone.
The featurette also has a bunch of new songs- there are six new songs in total, which is somehow pretty close to the original film’s count of eight songs, but in a third of the runtime.
Olaf’s Frozen Adventure was originally announced as a TV special, but the wider Walt Disney Company was so happy with the result that they decided to give Olaf’s Frozen Adventure a theatrical release instead!</p>
<p>Something I personally find fascinating about Olaf’s Frozen Adventure is comparing it visually with the original Frozen.
Olaf’s Frozen Adventure is rendered entirely with Disney’s <a href="http://www.disneyanimation.com/technology/innovations/hyperion">Hyperion Renderer</a>, compared with Frozen, which was rendered using pre-RIS Renderman.
While both films used our Disney BRDF <a href="https://doi.org/10.1145/2343483.2343493">[Burley 2012]</a> and Ptex <a href="https://doi.org/10.1111/j.1467-8659.2008.01253.x">[Burley and Lacewell 2008]</a>, Olaf’s Frozen Adventure benefits from all of the improvements and advancements that have been made during Big Hero 6, Zootopia, and Moana.
The original Frozen used dipole subsurface scattering, radiosity caching, and generally had fairly low geometric complexity relative to Hyperion-era films.
In comparison, Olaf’s Frozen Adventure uses brute force subsurface scattering, uses path-traced global illumination, uses the full Disney BSDF (which is significantly extended from the Disney BRDF) <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015]</a>, uses our advanced fur/hair shader developed during Zootopia <a href="https://doi.org/10.1111/cgf.12830">[Chiang et al. 2016]</a>, and has much greater geometric complexity.
A great example of the greater geometric complexity is the knitted scarf sequence <a href="https://doi.org/10.1145/3214745.3214817">[Staub et al. 2018]</a>, where 2D animation was brought into Hyperion as a texture map to drive the colors on a knitted scarf that was modeled and rendered down to the fiber level.
Some shots even utilize an extended version of the photon mapped caustics we developed during Moana; the photon mapped caustics system on Moana only supported distant lights as a photon source, but for Olaf’s Frozen Adventure, the photon mapping system was extended to support all of Hyperion’s existing light types as photon sources.
This extension to our photon mapping system is one of the things I worked on for Olaf’s Frozen Adventure, and it was used for lighting the ice crystal tree that Elsa creates at the end of the film.
Even the water in Arendelle Harbor looks way better than in Frozen, since the FX artists were able to make use of the incredible water systems developed for Moana <a href="https://doi.org/10.1145/3084363.3085067">[Palmer et al. 2017]</a>.
Many of these advancements are discussed in our SIGGRAPH 2017 Course Notes <a href="http://www.yiningkarlli.com/projects/ptcourse2017.html">[Burley et al. 2017]</a>.</p>
<p>One of the huge advantages to working on an in-house production rendering team in a vertically integrated studio is being able to collaborate and partner closely with productions on executing long-term technical visions.
Because of the show leadership’s confidence in our long-term development efforts targeted at later shows, the artists on Olaf’s Frozen Adventure were willing to take on and try out early versions of a number of new features in Hyperion that were originally targeted at later shows.
Some of these “preview” features wound up making a big difference on Olaf’s Frozen Adventure, and lessons learned on Olaf’s Frozen Adventure were instrumental in making these features much more robust and complete on Ralph Breaks the Internet.</p>
<p>One major feature was brute force path-traced subsurface scattering; Peter Kutz, Matt Chiang, and Brent Burley had originally started development during Moana’s production on brute force path-traced subsurface scattering <a href="https://doi.org/10.1145/2897839.2927433">[Chiang 2016]</a> as a replacement for Hyperion’s existing normalized diffusion based subsurface scattering <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015]</a>.
This feature wasn’t completed in time for use on Moana (although some initial testing was done using Moana assets), but was far enough along by the time Olaf’s Frozen Adventure was in production that artists started to experiment with it.
If I remember correctly, the characters in Olaf’s Frozen Adventure are still using normalized diffusion, but path-traced subsurface wound up finding extensive use in rendering all of the snow in the show, since the additional detail that path-traced subsurface brings out helped highlight the small granular details in the snow.
A lot of lessons learned from using path-traced subsurface scattering on the snow were then applied to making path-traced subsurface scattering more robust and easier to use and control.
These experiences gave us the confidence to go ahead with full-scale deployment on Ralph Breaks the Internet, which uses path-traced subsurface scattering for everything including characters.</p>
<p>Another major development effort that found experimental use on Olaf’s Frozen Adventure were some large overhauls to Hyperion’s ray traversal system.
During the production of Moana, we started running into problems with how large instance groups are structured in Hyperion.
Moana’s island environments featured vast quantities of instanced vegetation geometry, and because of how the instancing was authored, Hyperion’s old strategy for grouping instances in the top-level BVH wound up producing heavily overlapping BVH leaves, which in extreme cases could severely degrade traversal performance.
On Moana, the solution to this problem was to change how instances were authored upstream in the pipeline, but the way that the renderer wanted instances organized was fairly different from how artists and our pipeline like to think about instances, which made authoring more difficult.
This problem motivated Peter Kutz and I to develop a new traversal system that would be less sensitive to how instance groups were authored; the system we came up with allows Hyperion to internally break up top-level BVH nodes with large overlapping bounds into smaller, tighter subbounds based on the topology of the lower-level BVHs.
It turns out this system is conceptually essentially identical to BVH rebraiding <a href="https://doi.org/10.1145/3105762.3105776">[Benthin et al. 2017]</a>, but we developed and deployed this system independently before Benthin 2017 was published.
As part of this effort, we also wound up revisiting Hyperion’s original cone-based packet traversal strategy <a href="https://doi.org/10.1111/cgf.12158">[Eisenacher et al. 2013]</a> and, motivated by extensive testing and statistical performance analysis, developed a new, simpler, higher performance multithreading strategy for handling Hyperion’s ultra-wide batched ray traversal.
Olaf’s Frozen Adventure has a sequence where Olaf and Sven are being pulled down a mountainside through a forest by a burning sled; the enormous scale of the groundplane and large quantities of instanced trees proved to be challenging for Hyperion’s old traversal system.
We were able to partner with the artists to deploy a mid-development prototype of our new traversal system on this sequence, and were able to cut traversal times by close to an order of magnitude in some cases.
As a result, the artists were able to render this sequence with reasonable render times, and we were able to field-test the new traversal system prior to studio-wide deployment and iron out various kinks that were found along the way.</p>
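<p>To give a rough sense of what breaking up top-level nodes based on the lower-level BVH topology means, here is a heavily simplified sketch of the rebraiding idea: instances whose bounds are large relative to the scene are “opened” so that the top-level BVH can be built over subtrees of their lower-level BVHs instead of over whole instances.
This sketch is not Hyperion’s implementation, and the opening heuristic below is just a placeholder; see Benthin et al. 2017 for a production-quality formulation.</p>
<pre><code>#include <cstddef>
#include <vector>

struct Bounds {
    float min[3], max[3];
    float surfaceArea() const {
        float dx = max[0] - min[0], dy = max[1] - min[1], dz = max[2] - min[2];
        return 2.0f * (dx * dy + dy * dz + dz * dx);
    }
};

struct BLASNode {                 // a node in a lower-level (per-mesh) BVH
    Bounds bounds;                // assumed to already be in world space here
    const BLASNode* children[2];  // both null for leaf nodes
};

struct TopLevelPrim {             // what the top-level BVH gets built over
    const BLASNode* entryPoint;   // a subtree of some lower-level BVH
    std::size_t instanceId;       // transforms/shader bindings live elsewhere
};

// Turn a list of instances into top-level build primitives, opening up any
// subtree whose bounds are large relative to the whole scene. A real
// implementation would use a smarter, overlap-aware opening heuristic.
std::vector<TopLevelPrim> rebraid(const std::vector<const BLASNode*>& instanceRoots,
                                  float sceneSurfaceArea) {
    std::vector<TopLevelPrim> prims;
    for (std::size_t i = 0; i < instanceRoots.size(); i++) {
        std::vector<const BLASNode*> stack = {instanceRoots[i]};
        while (!stack.empty()) {
            const BLASNode* node = stack.back();
            stack.pop_back();
            bool open = node->bounds.surfaceArea() > 0.1f * sceneSurfaceArea;
            if (open && node->children[0] && node->children[1]) {
                stack.push_back(node->children[0]);   // descend into the
                stack.push_back(node->children[1]);   // lower-level BVH
            } else {
                prims.push_back({node, i});           // reference this subtree
            }
        }
    }
    return prims;
}
</code></pre>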
<p>The last major mid-development Hyperion feature that saw early experimental use on Olaf’s Frozen Adventure was our new, next-generation spectral and decomposition tracking <a href="https://doi.org/10.1145/3072959.3073665">[Kutz et al. 2017]</a> based null-collision volume rendering system, which was written with the intention of eventually completely replacing Hyperion’s existing residual ratio tracking <a href="https://doi.org/10.1145/2661229.2661292">[Novák et al. 2014]</a> based volume rendering system <a href="https://doi.org/10.1145/3084873.3084907">[Fong et al. 2017]</a>.
Artists on Olaf’s Frozen Adventure ran into some difficulties with rendering loose, fluffy white snow, where the bright white appearance is the result of high-order scattering requiring large numbers of bounces.
We realized that this problem is essentially identical to the problem of rendering white puffy clouds, which also have an appearance dominated by energy from high-order scattering.
Since null-collision volume integration is specifically very efficient at handling high-order scattering, we gave the artists an early prototype version of Hyperion’s new volume rendering system to experiment with rendering loose fluffy snow as a volume.
The initial results looked great; I’m not sure if this approach wound up being used in the final film, but this experiment gave both us and the artists a lot of confidence in the new volume rendering system and provided valuable feedback.</p>
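<p>For a bit of background on the “null-collision” terminology: the classic example of a null-collision method is delta (Woodcock) tracking, which samples free paths through a heterogeneous medium by padding the medium with fictitious “null” particles up to a bounding majorant.
A minimal sketch of that baseline technique (the starting point that the new system generalizes, not the new system itself) looks something like this:</p>
<pre><code>#include <cmath>
#include <functional>
#include <random>

// Sample a free-flight distance along a ray through a heterogeneous medium
// using delta/Woodcock tracking. extinctionAt(t) returns sigma_t at distance t
// along the ray, and majorant must bound sigma_t everywhere along the ray.
// Returns the sampled collision distance, or tMax if the ray escapes.
double deltaTrack(const std::function<double(double)>& extinctionAt,
                  double majorant, double tMax, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double t = 0.0;
    while (true) {
        // Tentative collision against the homogenized (majorant) medium.
        t -= std::log(1.0 - uniform(rng)) / majorant;
        if (t >= tMax) return tMax;  // escaped without a real collision
        // Accept as a real collision with probability sigma_t / majorant;
        // otherwise the collision was with a fictitious null particle.
        if (uniform(rng) < extinctionAt(t) / majorant) return t;
    }
}
</code></pre>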
<p>As usual with Disney Animation projects I get to work on, here are some stills from the film, in no particular order.
Even though Olaf’s Frozen Adventure was originally meant for TV, the whole studio still put the same level of effort into it that goes into full theatrical features, and I think it shows!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_01.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_01.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_02.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_02.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_03.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_03.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_04.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_04.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_05.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_05.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_06.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_06.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_07.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_07.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_08.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_08.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_09.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_09.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_10.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_10.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_11.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_11.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_12.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_12.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_13.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_13.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_14.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_14.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_15.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_15.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_17.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_17.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_18.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_18.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_19.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_19.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_20.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_20.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_22.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_22.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_23.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_23.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_26.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_26.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_24.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_24.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_25.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_25.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_21.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_21.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_27.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_27.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_16.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_16.jpg" alt="" /></a></p>
<p>Here is a credits frame with my name! I wasn’t actually expecting to get a credit on Olaf’s Frozen Adventure, but because I had spent a lot of time supporting the show and working with artists on deploying experimental Hyperion features to solve particularly difficult shots, the show decided to give me a credit! I was very pleasantly surprised by that; my teammate Matt Chiang got a credit as well for similar reasons.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_credits.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_credits.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p><strong>References</strong></p>
<p>Carsten Benthin, Sven Woop, Ingo Wald, and Attila T. Áfra. 2017. <a href="https://doi.org/10.1145/3105762.3105776">Improved Two-Level BVHs using Partial Re-Braiding</a>. In <em>HPG ‘17 (Proceedings of High Performance Graphics)</em>. 7:1-7:8.</p>
<p>Brent Burley. 2012. <a href="https://doi.org/10.1145/2343483.2343493">Physically Based Shading at Disney</a>. In <em>ACM SIGGRAPH 2012 Course Notes: <a href="https://blog.selfshadow.com/publications/s2012-shading-course/">Practical Physically-Based Shading in Film and Game Production</a></em>.</p>
<p>Brent Burley. 2015. <a href="https://doi.org/10.1145/2776880.2787670">Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering</a>. In <em>ACM SIGGRAPH 2015 Course Notes: <a href="https://blog.selfshadow.com/publications/s2015-shading-course">Physically Based Shading in Theory and Practice</a></em>.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2017. <a href="https://www.yiningkarlli.com/projects/ptcourse2017.html">Recent Advances in Disney’s Hyperion Renderer</a>. <em><a href="http://dx.doi.org/10.1145/3084873.3084904">Path Tracing in Production Part 1</a>, ACM SIGGRAPH 2017 Course Notes</em>.</p>
<p>Brent Burley and Dylan Lacewell. 2008. <a href="https://doi.org/10.1111/j.1467-8659.2008.01253.x">Ptex: Per-face Texture Mapping for Production Rendering</a>. <em>Computer Graphics Forum</em>. 27, 4 (2008), 1155-1164.</p>
<p>Matt Jen-Yuan Chiang, Benedikt Bitterli, Chuck Tappan, and Brent Burley. 2016. <a href="https://doi.org/10.1111/cgf.12830">A Practical and Controllable Hair and Fur Model for Production Path Tracing</a>. <em>Computer Graphics Forum</em>. 35, 2 (2016), 275-283.</p>
<p>Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. <a href="https://doi.org/10.1145/2897839.2927433">Practical and Controllable Subsurface Scattering for Production Path Tracing</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 49:1-49:2.</p>
<p>Christian Eisenacher, Gregory Nichols, Andrew Selle, and Brent Burley. 2013. <a href="https://doi.org/10.1111/cgf.12158">Sorted Deferred Shading for Production Path Tracing</a>. <em>Computer Graphics Forum</em>. 32, 4 (2013), 125-132.</p>
<p>Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. <a href="https://doi.org/10.1145/3084873.3084907">Production Volume Rendering</a>. In <em>ACM SIGGRAPH 2017 Courses</em>.</p>
<p>Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. <a href="https://doi.org/10.1145/3072959.3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 111:1-111:16.</p>
<p>Jan Novák, Andrew Selle, and Wojciech Jarosz. 2014. <a href="https://doi.org/10.1145/2661229.2661292">Residual Ratio Tracking for Estimating Attenuation in Participating Media</a>. <em>ACM Transactions on Graphics</em>. 33, 6 (2014), 179:1-179:11.</p>
<p>Sean Palmer, Jonathan Garcia, Sara Drakeley, Patrick Kelly, and Ralf Habel. 2017. <a href="https://doi.org/10.1145/3084363.3085067">The Ocean and Water Pipeline of Disney’s Moana</a>. In <em>ACM SIGGRAPH 2017 Talks</em>. 29:1-29:2.</p>
<p>Josh Staub, Alessandro Jacomini, and Dan Lund. 2018. <a href="https://doi.org/10.1145/3214745.3214817">The Handiwork Behind “Olaf’s Frozen Adventure”</a>. In <em>ACM SIGGRAPH 2018 Talks</em>. 26:1-26:2.</p>
https://blog.yiningkarlli.com/2017/08/recent-advances-in-hyperion.html
SIGGRAPH 2017 Course Notes- Recent Advances in Disney's Hyperion Renderer
2017-08-04T00:00:00+00:00
2017-08-04T00:00:00+00:00
Yining Karl Li
<p>This year at SIGGRAPH 2017, Luca Fascione and Johannes Hanika from Weta Digital organized a Path Tracing in Production course.
The course was split into two halves: a first half about production renderers, and a second half about using production renderers to make movies.
Brent Burley presented our recent work on Disney’s Hyperion Renderer as part of the first half of the course.
To support Brent’s section of the course, the entire Hyperion team worked together to put together some course notes describing recent work in Hyperion done for Zootopia, Moana, and upcoming films.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/course_notes_zootopia.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/course_notes_zootopia.jpg" alt="Image from course notes Figure 8: a production frame from Zootopia, rendered using Disney's Hyperion Renderer." /></a></p>
<p>Here is the abstract for the course notes:</p>
<p><em>Path tracing at Walt Disney Animation Studios began with the Hyperion renderer, first used in production on Big Hero 6. Hyperion is a custom, modern path tracer using a unique architecture designed to efficiently handle complexity, while also providing artistic controllability and efficiency.
The concept of physically based shading at Disney Animation predates the Hyperion renderer. Our history with physically based shading significantly influenced the development of Hyperion, and since then, the development of Hyperion has in turn influenced our philosophy towards physically based shading.</em></p>
<p>The course notes and related materials can be found at:</p>
<ul>
<li><a href="https://jo.dreggn.org/path-tracing-in-production/2017/index.html">Official Course Resources Page (Full course notes and supplemental materials)</a></li>
<li><a href="https://www.yiningkarlli.com/projects/ptcourse2017.html">Project Page (Author’s Version)</a></li>
<li><a href="https://dl.acm.org/citation.cfm?doid=3084873.3084904">Official Print Version (ACM Library)</a></li>
</ul>
<p>The course wasn’t recorded due to proprietary content from various studios, but the course notes cover everything that was presented.
The major theme of our part of the course notes (and Brent’s presentation) is replacing multiple scattering approximations with accurate brute-force path-traced solutions.
Interestingly, the main motivator for this move is primarily a desire for better, more predictable and intuitive controls for artists, as opposed to simply wanting better visual quality.
In the course notes, we specifically discuss fur/hair, path-traced subsurface scattering, and volume rendering.</p>
<p>The Hyperion team also had two other presentations at SIGGRAPH 2017:</p>
<ul>
<li>Ralf Habel presented several sections of the “<a href="https://graphics.pixar.com/library/ProductionVolumeRendering/">Production Volume Rendering</a>” course, which was jointly put together by Julian Fong and Magnus Wrenninge from Pixar Animation Studios, Christopher Kulla from Sony Imageworks, and Ralf Habel from Walt Disney Animation Studios.</li>
<li>Peter Kutz presented our “<a href="https://blog.yiningkarlli.com/2017/07/spectral-and-decomposition-tracking.html">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>” technical paper in the “Rendering Volumes” papers session.</li>
</ul>
https://blog.yiningkarlli.com/2017/07/spectral-and-decomposition-tracking.html
SIGGRAPH 2017 Paper- Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes
2017-07-25T00:00:00+00:00
2017-07-25T00:00:00+00:00
Yining Karl Li
<p>Some recent work I was part of at Walt Disney Animation Studios has been published in the July 2017 issue of ACM Transactions on Graphics as part of SIGGRAPH 2017!
The paper is titled “<a href="http://dl.acm.org/citation.cfm?id=3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>”, and the project was a collaboration between the Hyperion development team at <a href="http://disneyanimation.com">Walt Disney Animation Studios</a> (WDAS) and the rendering group at <a href="http://www.disneyresearch.com/research-labs/disney-research-zurich">Disney Research Zürich</a> (DRZ).
From the WDAS side, the authors are <a href="http://peterkutz.com">Peter Kutz</a> (who was at Penn at the same time as me), <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, and myself.
On the DRZ side, our collaborator was <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, the head of DRZ’s rendering research group.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/color_explosion.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/color_explosion.jpg" alt="Image from paper Figure 12: a colorful explosion with chromatic extinction rendered using spectral tracking." /></a></p>
<p>Here is the paper abstract:</p>
<p><em>We present two novel unbiased techniques for sampling free paths in heterogeneous participating media. Our decomposition tracking accelerates free-path construction by splitting the medium into a control component and a residual component and sampling each of them separately. To minimize expensive evaluations of spatially varying collision coefficients, we define the control component to allow constructing free paths in closed form. The residual heterogeneous component is then homogenized by adding a fictitious medium and handled using weighted delta tracking, which removes the need for computing strict bounds of the extinction function. Our second contribution, spectral tracking, enables efficient light transport simulation in chromatic media. We modify free-path distributions to minimize the fluctuation of path throughputs and thereby reduce the estimation variance. To demonstrate the correctness of our algorithms, we derive them directly from the radiative transfer equation by extending the integral formulation of null-collision algorithms recently developed in reactor physics. This mathematical framework, which we thoroughly review, encompasses existing trackers and postulates an entire family of new estimators for solving transport problems; our algorithms are examples of such. We analyze the proposed methods in canonical settings and on production scenes, and compare to the current state of the art in simulating light transport in heterogeneous participating media.</em></p>
<p>The paper and related materials can be found at:</p>
<ul>
<li><a href="https://www.disneyanimation.com/technology/publications/96">Official WDAS Project Page (Preprint paper and supplemental materials)</a></li>
<li><a href="http://www.yiningkarlli.com/projects/specdecomptracking.html">Project Page (Author’s Version)</a></li>
<li><a href="http://dl.acm.org/citation.cfm?doid=3072959.3073665">Official Print Version (ACM Library)</a></li>
</ul>
<p>Peter Kutz will be presenting the paper at <a href="http://s2017.siggraph.org">SIGGRAPH 2017</a> in Los Angeles as part of the <a href="http://s2017.siggraph.org/technical-papers/sessions/rendering-volumes">Rendering Volumes</a> Technical Papers session.</p>
<p>Instead of repeating the contents of the paper here (which is pointless since the paper already says everything we want to say), I thought instead I’d use this blog post to talk about some of the process we went through while writing this paper.
Please note that all opinions and thoughts stated in this post are my own, not Disney’s.</p>
<p>This project started over a year ago, when we began an effort to significantly overhaul and improve Hyperion’s volume rendering system.
Around the same time that we began to revisit volume rendering, we heard a lecture from a visiting professor on multilevel Monte Carlo (MLMC) methods.
Although the final paper has nothing to do with MLMC methods, the genesis of this project was in initial conversations we had about how MLMC methods might be applied to volume rendering.
We concluded that MLMC could be applicable, but weren’t entirely sure how.
However, these conversations eventually gave Peter the idea to develop the technique that would eventually become decomposition tracking (importantly, decomposition tracking does not actually use MLMC though).
Further conversations about weighted delta tracking then led to Peter developing the core ideas behind what would become spectral tracking.
After testing initial implementations of these prototype versions of decomposition and spectral tracking, Peter, Ralf, and I shared the techniques with Jan.
Around the same time, we also shared the techniques with our sister teams, Pixar’s RenderMan development group in Seattle and the Pixar Research Group in Emeryville, who were able to independently implement and verify our techniques.
Being able to share research between Walt Disney Animation Studios, Disney Research, the Renderman group, Pixar Animation Studios, Industrial Light & Magic, and Imagineering is one of the reasons why Disney is such an amazing place to be for computer graphics folks.</p>
<p>At this point we had initial rudimentary proofs for why decomposition and spectral tracking worked separately, but we still didn’t have a unified framework that could be used to explain and combine the two techniques.
Together with Jan, we began by deep-diving into the origins of delta/Woodcock tracking in neutron transport and reactor physics papers from the 1950s and 1960s and working our way forward to the present.
All of the key papers we dug up during this deep-dive are cited in our paper.
Some of these early papers were fairly difficult to find.
For example, the original delta tracking paper, “Techniques used in the GEM code for Monte Carlo neutronics calculations in reactors and other systems of complex geometry” (Woodcock et al. 1965), is often cited in graphics literature, but a cursory Google search doesn’t provide any links to the actual paper itself.
We eventually managed to track down a copy of the original paper in the archives of the United States Department of Commerce, which for some reason hosts a lot of archive material from Argonne National Laboratory.
Since the original Woodcock paper has been in the public domain for some time now but is fairly difficult to find, I’m hosting a <a href="http://yiningkarlli.com/projects/specdecomptracking/references/Woodcock1965.pdf">copy here</a> for any researchers that may be interested.</p>
<p>Several other papers we were only able to obtain by requesting archival microfilm scans from several university libraries.
I won’t host copies here, since the public domain status for several of them isn’t clear, but if you are a researcher looking for any of the papers that we cited and can’t find it, feel free to contact me.
One particularly cool find was “The Relativistic Doppler Problem” (Zerby et al. 1961), which Peter obtained by writing to the Oak Ridge National Laboratory’s research library.
Their staff were eventually able to find the paper in their records/archives, and subsequently scanned and uploaded the paper online.
The paper is now <a href="https://www.osti.gov/scitech/biblio/4836227">publicly available here</a>, on the United States Department of Energy’s Office of Scientific and Technical Information website.</p>
<p>Eventually, through significant effort from Jan, we came to understand Galtier et al.’s 2013 paper, “<a href="https://www.researchgate.net/publication/258211025_Integral_formulation_of_null-collision_Monte_Carlo_algorithms">Integral Formulation of Null-Collision Monte Carlo Algorithms</a>”, and were able to import the integral formulation into computer graphics and demonstrate how to derive both decomposition and spectral tracking directly from the radiative transfer equation using the integral formulation.
This step also allowed Peter to figure out how to combine spectral and decomposition tracking into a single technique.
With all of these pieces in place, we had the framework for our SIGGRAPH paper.
We then put significant effort into working out remaining details, such as finding a good mechanism for bounding the free-path-sampling coefficient in spectral tracking.
Producing all of the renders, results, charts, and plots in the paper also took an enormous amount of time; it turns out that producing all of this stuff can take significantly longer than the amount of time originally spent coming up with and implementing the techniques in the first place!</p>
<p>One major challenge we faced in writing the final paper was finding the best order in which to present the three main pieces of the paper: decomposition tracking, spectral tracking, and the integral formulation of null-collision algorithms.
At one point, we considered first presenting decomposition tracking, since on a general level decomposition tracking is the easiest of the three contributions to understand.
Then, we planned to use the proof of decomposition tracking to expand out into the integral formulation of the RTE with null collisions, and finally derive spectral tracking from the integral formulation.
The idea was essentially to introduce the easiest technique first, expand out to the general mathematical framework, and then demonstrate the flexibility of the framework by deriving the second technique.
However, this approach in practice felt disjointed, especially with respect to the body of prior work we wanted to present, which underpinned the integral framework but wound up being separated by the decomposition tracking section.
So instead, we arrived at the final presentation order, where we first present the integral framework and derive prior techniques such as delta tracking from it, and then demonstrate how to derive the new decomposition and spectral tracking techniques from the same framework.
We hope that presenting the paper in this way will encourage other researchers to adopt the integral framework and derive other, new techniques from the framework.
For Peter’s presentation at SIGGRAPH, however, Peter chose to go with the original order since it made for a better presentation.</p>
<p>Since our final paper was already quite long, we had to move some content into a separate supplemental document.
Although the supplemental content isn’t necessary for implementing the core algorithms presented, I think the supplemental content is very useful for gaining a better understanding of the techniques.
The supplemental content contains, among other things, an extended proof of the minimum-of-exponents mechanism that decomposition tracking is built on, various proofs related to choosing bounds for the local collision weight in spectral tracking, and various additional results and further analysis.
We also provide a nifty interactive viewer for comparing our techniques against vanilla delta tracking; the interactive viewer framework was originally developed by <a href="http://zurich.disneyresearch.com/~fabricer/">Fabrice Rousselle</a>, Jan Novák and <a href="https://benedikt-bitterli.me">Benedikt Bitterli</a> at Disney Research Zürich.</p>
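<p>Without repeating the paper’s derivations, here is a cartoon of what the minimum-of-exponents idea looks like in its simplest, purely analog form: split the extinction into a homogeneous control component plus a heterogeneous residual, sample a free path through each component separately, and keep the nearer of the two collisions.
This sketch leaves out everything that makes the published algorithm efficient and general (weighted tracking, spectral handling, bounding details, early termination), so please treat it as an illustration rather than the paper’s algorithm.</p>
<pre><code>#include <cmath>
#include <functional>
#include <random>

// sigma_t(x) = sigmaC + sigmaR(x): a homogeneous control component plus a
// heterogeneous residual. Sample a free path through each component and take
// the minimum; the result is distributed according to the full medium.
double sampleControl(double sigmaC, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    return -std::log(1.0 - uniform(rng)) / sigmaC;   // closed-form free path
}

double sampleResidual(const std::function<double(double)>& sigmaRAt,
                      double residualMajorant, double tMax, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double t = 0.0;
    while (true) {   // plain delta tracking on the residual component
        t -= std::log(1.0 - uniform(rng)) / residualMajorant;
        if (t >= tMax) return tMax;
        if (uniform(rng) < sigmaRAt(t) / residualMajorant) return t;
    }
}

double decompositionTrack(double sigmaC,
                          const std::function<double(double)>& sigmaRAt,
                          double residualMajorant, double tMax,
                          std::mt19937& rng) {
    double tControl = sampleControl(sigmaC, rng);
    double tResidual = sampleResidual(sigmaRAt, residualMajorant, tMax, rng);
    // Minimum of exponents: the nearer of the two collisions wins.
    return std::fmin(std::fmin(tControl, tResidual), tMax);
}
</code></pre>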
<p>One of the major advantages of doing rendering research at a major animation or VFX studio is the availability of hundreds of extremely talented artists, who are always eager to try out new techniques and software.
Peter, Ralf, and I worked closely with a number of artists at WDAS to test our techniques and produce interesting scenes with which to generate results and data for the paper.
Henrik Falt and Alex Nijmeh had created a number of interesting clouds in the process of testing our general volume rendering improvements, and worked with us to adapt a cloud dataset for use in Figure 11 of our paper.
The following is one of the renders from Figure 11:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/single_cloud.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/single_cloud.jpg" alt="Image from paper Figure 11: an optically thick cloud rendered using decomposition tracking." /></a></p>
<p>Henrik and Alex also constructed the cloudscape scene used as the banner image on the first page of the paper.
After we submitted the paper, Henrik and Alex continued iterating on this scene, which eventually resulted in the more detailed version seen in our SIGGRAPH Fast Forward video.
The version of the cloudscape used in our paper is reproduced below:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/beauty_clouds.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/beauty_clouds.jpg" alt="Image from paper Figure 1: a cloudscape rendered using spectral and decomposition tracking." /></a></p>
<p>To test out spectral tracking, we wanted an interesting, dynamic, colorful dataset.
After describing spectral tracking to Jesse Erickson, we arrived at the idea of a color explosion similar in spirit to certain visuals used in recent <a href="https://www.youtube.com/watch?v=WVPRkcczXCY">Apple</a> and <a href="https://www.youtube.com/watch?v=BzMLA8YIgG0">Microsoft</a> ads, which in turn were inspired by the <a href="https://en.wikipedia.org/wiki/Holi">Holi festival</a> celebrated in India and Nepal.
Jesse authored the color explosion in Houdini and provided a set of VDBs for each color section, which we were then able to shade, light, and render using Hyperion’s implementation of spectral tracking.
The final result was the color explosion from Figure 12 of the paper, seen at the top of this post.
We were honored to learn that the color explosion figure was chosen to be one of the pictures on the back cover of this year’s conference proceedings!</p>
<p>At one point we also remembered that brute force path-traced subsurface scattering is just volume rendering inside of a bounded surface, which led to the translucent heterogeneous Stanford dragon used in Figure 15 of the paper:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/sss_dragon.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/sss_dragon.jpg" alt="Image from paper Figure 15: a subsurface scattering heterogeneous Stanford dragon rendered using spectral and decomposition tracking." /></a></p>
<p>For our video for the SIGGRAPH 2017 Fast Forward, we were able to get a lot of help from a number of artists.
Alex and Henrik and a number of other artists significantly expanded and improved the cloudscape scene, and we also rendered out several more color explosion variants.
The final fast forward video contains work from Alex Nijmeh, Henrik Falt, Jesse Erickson, Thom Wickes, Michael Kaschalk, Dale Mayeda, Ben Frost, Marc Bryant, John Kosnik, Mir Ali, Vijoy Gaddipati, and Dimitre Berberov.
The awesome title effect was thought up and created by Henrik.
The final video is a bit noisy since we were severely constrained on available renderfarm resources (we were basically squeezing our renders in between actual production renders), but I think the end result is still really great:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/229503895" frameborder="0">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes- SIGGRAPH 2017 Fast Forward Video</iframe></div>
<p>Here are a couple of cool stills from the fast forward video:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/fastforward_01.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/fastforward_01.jpg" alt="An improved cloudscape from our SIGGRAPH Fast Forward video." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/fastforward_02.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/fastforward_02.jpg" alt="An orange-purple color explosion from our SIGGRAPH Fast Forward video." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/fastforward_03.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/fastforward_03.jpg" alt="A green-yellow color explosion from our SIGGRAPH Fast Forward video." /></a></p>
<p>We owe an enormous amount of thanks to fellow Hyperion teammate Patrick Kelly, who played an instrumental role in designing and implementing our overall new volume rendering system, and who discussed with us extensively throughout the project.
Hyperion teammate David Adler also helped out a lot in profiling and instrumenting our code.
We also must thank Thomas Müller, Marios Papas, Géraldine Conti, and David Adler for proofreading, and Brent Burley, Michael Kaschalk, and Rajesh Sharma for providing support, encouragement, and resources for this project.</p>
<p>I’ve worked on a <a href="http://blog.yiningkarlli.com/2014/11/sky-paper.html">SIGGRAPH Asia paper</a> before, but working on a large scale publication in the context of a major animation studio instead of in school was a very different experience.
The support and resources we were given and the amount of talent and help that we were able to tap into made this project possible.
This project is also an example of the incredible value that comes from companies maintaining in-house industrial research labs; this project absolutely would not have been possible without all of the collaboration from DRZ, in both the form of direct collaboration from Jan and indirect collaboration from all of the DRZ researchers that provided discussions and feedback.
Everyone worked really hard, but overall the whole process was immensely intellectually satisfying and fun, and seeing our new techniques in use by talented, excited artists makes all of the work absolutely worthwhile!</p>
https://blog.yiningkarlli.com/2017/05/subdivision-and-displacement.html
Subdivision Surfaces and Displacement Mapping
2017-05-14T00:00:00+00:00
2017-05-14T00:00:00+00:00
Yining Karl Li
<p>Two standard features that every modern production renderer supports are <a href="https://en.wikipedia.org/wiki/Subdivision_surface">subdivision surfaces</a> and some form of <a href="https://en.wikipedia.org/wiki/Displacement_mapping">displacement mapping</a>.
As we’ll discuss a bit later in this post, these two features are usually very closely linked to each other in both usage and implementation.
Subdivision and displacement are crucial tools for representing detail in computer graphics; from both a technical and authorship point of view, being able to represent more detail than is actually present in a mesh is advantageous.
Applying detail at runtime allows for geometry to take up less disk space and memory than would be required if all detail was baked into the geometry, and artists often like the ability to separate broad features from high frequency detail.</p>
<p>I recently added support for subdivision surfaces and both scalar and vector displacement to Takua; Figure 1 shows an ocean wave rendered in Takua using vector displacement.
The ocean surface is entirely displaced from just a single plane!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/displacement_ocean_0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/displacement_ocean_0.jpg" alt="Figure 1: An ocean surface modeled as a flat plane and rendered using vector displacement mapping." /></a></p>
<p>Both subdivision and displacement originally came from the world of rasterization rendering, where on-the-fly geometry generation was historically both easier to implement and more practical/plausible to use.
In rasterization, geometry is streamed through the renderer and drawn to screen, so each individual piece of geometry could be subdivided, tessellated, displaced, splatted to the framebuffer, and then discarded to free up memory.
The old REYES-based RenderMan was famously efficient at rendering subdivision surfaces and displaced surfaces for precisely this reason.
However, in naive ray tracing, rays can intersect geometry at any moment in any order.
Subdividing and displacing geometry on the fly for each ray and then discarding the geometry is insanely expensive compared to processing geometry once across an entire framebuffer.
The simplest solution to this problem is to just subdivide and displace everything up front and keep it all around in memory during ray tracing.
Historically though, just caching everything was never a practical solution since computers simply didn’t have enough memory to keep that much data around.
As a result, past research work put significant effort into more intelligent ray tracing architectures that made on-the-fly subdivision/displacement affordable again; notable advancements include geometry caching for ray tracing <a href="http://graphics.stanford.edu/papers/displace">[Pharr and Hanrahan 1996]</a>, direct ray tracing of displacement mapped triangles <a href="https://doi.org/10.1007/978-3-7091-6303-0_28">[Smits et al. 2000]</a>, reordered ray tracing <a href="https://jo.dreggn.org/home/2010_rayes.pdf">[Hanika et al. 2010]</a>, and GPU ray traced vector displacement <a href="https://www.crcpress.com/GPU-Pro-6-Advanced-Rendering-Techniques/Engel/p/book/9781482264616">[Harada 2015]</a>.</p>
<p>In the past five years or so though, the story on ray traced displacement has changed.
We now have machines with gobs and gobs of memory (at a number of studios, renderfarm nodes with 256 GB of memory or more are not unusual anymore).
As a result, ray traced renderers don’t need to be nearly as clever anymore about managing displaced geometry; a combination of camera-adaptive tessellation and a simple geometry cache with a least-recently-used eviction strategy is often enough to make ray traced displacement practical.
Heavy displacement is now common in the workflows for a number of production pathtracers, including Arnold, RenderMan/RIS, V-Ray, Corona, Hyperion, Manuka, etc.
With the above in mind, I tried to implement subdivision and displacement in Takua as simply as I possibly could.</p>
<p>Takua doesn’t have any concept of an eviction strategy for cached tessellated geometry; the hope is to just fit in memory and be as efficient as possible with what memory is available.
Admittedly, since Takua is just my hobby renderer instead of a fully in-use production renderer, and I have personal machines with 48 GB of memory, I didn’t think particularly hard about cases where things don’t fit in memory.
Instead of tessellating on-the-fly per ray or anything like that, I simply pre-subdivide and pre-displace everything upfront during the initial scene load.
Meshes are loaded, subdivided, and displaced in parallel with each other.
If Takua discovers that all of the subdivided and displaced geometry isn’t going to fit in the allocated memory budget, the renderer simply quits.</p>
<p>I should note that Takua’s scene format distinguishes between a mesh and a geom; a mesh is the raw vertex/face/primvar data that makes up a surface, while a geom is an object containing a reference to a mesh along with transformation matrices, shader bindings, and so on and so forth.
This separation between the mesh data and the geometric object allows for some useful features in the subdivision/displacement system.
Takua’s scene file format allows for binding subdivision and displacement modifiers either on the shader level, or per each geom.
Bindings at the geom level override bindings on the shader level, which is useful for authoring since a whole bunch of objects can share the same shader but then have individual specializations for different subdivision rates and different displacement maps and displacement settings.
During scene loading, Takua analyzes what subdivisions/displacements are required for which meshes by which geoms, and then de-duplicates and aggregates any cases where different geoms want the same subdivision/displacement for the same mesh.
This de-duplication even works for instances (I should write a separate post about Takua’s approach to instancing someday…).</p>
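<p>As an illustration of what this de-duplication amounts to (sketched with made-up structures, not Takua’s actual ones): each geom effectively requests a (mesh, subdivision level, displacement settings) combination, and requests that resolve to the same key share a single tessellated result.</p>
<pre><code>#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <vector>

struct Mesh { /* raw vertex/face/primvar data */ };
struct TessellatedMesh { /* subdivided + displaced result */ };

struct Geom {
    std::string meshName;
    int subdivisionLevel = 0;
    std::string displacementMap;       // empty string means no displacement
    float displacementScale = 1.0f;
    std::shared_ptr<TessellatedMesh> tessellated;
};

// Stand-in for the actual subdivision/displacement code.
std::shared_ptr<TessellatedMesh> subdivideAndDisplace(const Mesh&, int,
                                                      const std::string&, float) {
    return std::make_shared<TessellatedMesh>();
}

// Geoms requesting the same mesh with the same settings share one result.
void resolveTessellation(const std::map<std::string, Mesh>& meshes,
                         std::vector<Geom>& geoms) {
    using Key = std::tuple<std::string, int, std::string, float>;
    std::map<Key, std::shared_ptr<TessellatedMesh>> cache;
    for (Geom& geom : geoms) {
        Key key{geom.meshName, geom.subdivisionLevel,
                geom.displacementMap, geom.displacementScale};
        auto found = cache.find(key);
        if (found == cache.end()) {
            found = cache.emplace(key, subdivideAndDisplace(
                        meshes.at(geom.meshName), geom.subdivisionLevel,
                        geom.displacementMap, geom.displacementScale)).first;
        }
        geom.tessellated = found->second;   // shared across matching geoms
    }
}
</code></pre>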
<p>Once Takua has put together a list of all meshes that require subdivision, meshes are subdivided in parallel.
For Catmull-Clark subdivision <a href="https://www.sciencedirect.com/science/article/abs/pii/0010448578901100">[Catmull and Clark 1978]</a>, I rely on <a href="https://graphics.pixar.com/opensubdiv/docs/intro.html">OpenSubdiv</a> for calculating subdivision <a href="https://graphics.pixar.com/opensubdiv/docs/far_overview.html#far-stenciltable">stencil tables</a> <a href="https://dl.acm.org/doi/10.1145/166117.166121">[Halstead et al. 1993]</a> for feature adaptive subdivision <a href="https://dl.acm.org/doi/10.1145/2077341.2077347">[Nießner et al. 2012]</a>, evaluating the stencils, and final tessellation.
As far as I can tell, stencil calculation in OpenSubdiv is single threaded, so it can get fairly slow on really heavy meshes.
Stencil evaluation and final tessellation is super fast though, since OpenSubdiv provides a number of <a href="https://graphics.pixar.com/opensubdiv/docs/osd_overview.html#limit-stencil-evaluation">parallel evaluators</a> that can run using a variety of backends ranging from TBB on the CPU to CUDA or OpenGL compute shaders on the GPU.
Takua currently relies on OpenSubdiv’s TBB evaluator.
One really neat thing about the stencil implementation in OpenSubdiv is that the stencil calculation is dependent on only the topology of the mesh and not individual primvars, so a single stencil calculation can then be reused multiple times to interpolate many different primvars, such as positions, normals, uvs, and more.
Currently Takua doesn’t support creases; I’m planning on adding crease support later.</p>
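<p>The property mentioned above, that stencils depend only on topology, is easy to see from what a stencil actually is: each refined vertex is just a weighted sum of control vertices, so the same index/weight lists can be applied to positions, normals, uvs, or any other primvar.
Here is a generic sketch of stencil evaluation (not OpenSubdiv’s actual API):</p>
<pre><code>#include <cstddef>
#include <vector>

struct Stencil {
    std::vector<std::size_t> controlIndices;
    std::vector<float> weights;    // same length as controlIndices
};

// Apply a stencil table to one primvar channel with `width` floats per element
// (e.g. width = 3 for positions or normals, width = 2 for uvs).
std::vector<float> evaluateStencils(const std::vector<Stencil>& stencils,
                                    const std::vector<float>& controlValues,
                                    std::size_t width) {
    std::vector<float> refined(stencils.size() * width, 0.0f);
    for (std::size_t i = 0; i < stencils.size(); i++) {
        const Stencil& s = stencils[i];
        for (std::size_t j = 0; j < s.controlIndices.size(); j++) {
            for (std::size_t c = 0; c < width; c++) {
                refined[i * width + c] +=
                    s.weights[j] * controlValues[s.controlIndices[j] * width + c];
            }
        }
    }
    return refined;
}
</code></pre>
<p>The same evaluation routine can then be run once per primvar channel (or across channels in parallel), which is exactly why computing the stencils once per topology is such a win.</p>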
<p>No writing about subdivision surfaces is complete without a picture of a cube being subdivided into a sphere, so Figure 2 shows a render of a cube with subdivision levels 0, 1, 2, and 3, going from left to right.
Each subdivided cube is rendered with a procedural wireframe texture that I implemented to help visualize what was going on with subdivision.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/subdcube.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/subdcube.jpg" alt="Figure 2: A cube with 0, 1, 2, and 3 subdivision levels, going from left to right." /></a></p>
<p>Each subdivided mesh is placed into a new mesh; base meshes that require multiple subdivision levels for multiple different geoms get one new subdivided mesh per subdivision level.
After all subdivided meshes are ready, Takua then runs displacement.
Displacement is parallelized both by mesh and within each mesh.
Also, Takua supports both on-the-fly displacement and fully cached displacement, which can be specified per shader or per geom.
If a mesh is marked for full caching, the mesh is fully displaced, stored as a separate mesh from the undisplaced subdivision mesh, and then a BVH is built for the displaced mesh.
If a mesh is marked for on-the-fly displacement, the displacement system calculates each displaced face, then calculates the bounds for that face, and then discards the face.
The displaced bounds are then used to build a tight BVH for the displaced mesh without actually having to store the displaced mesh itself; instead, just a reference to the undisplaced subdivision mesh has to be kept around.
When a ray traverses the BVH for an on-the-fly displacement mesh, each BVH leaf node specifies which triangles on the undisplaced mesh need to be displaced to produce final polys for intersection and then the displaced polys are intersected and discarded again.
For the scenes in this post, on-the-fly displacement seems to be about twice as slow as fully cached displacement, which is to be expected, but if the same mesh is displaced multiple different ways, then there are correspondingly large memory savings.
After all displacement has been calculated, Takua goes back and analyzes which base meshes and undisplaced subdivision meshes are no longer needed, and frees those meshes to reclaim memory.</p>
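<p>To make the on-the-fly path a little more concrete, the bounds-only pass boils down to something like the following sketch (my own simplified illustration, not Takua’s actual code): displace each face, grow a bounding box around the result, and throw the displaced vertices away, keeping only the boxes for the BVH build.</p>
<pre><code>#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

struct Bounds {
    Vec3 min{ 1e30f,  1e30f,  1e30f};
    Vec3 max{-1e30f, -1e30f, -1e30f};
    void expand(Vec3 p) {
        min = {std::min(min.x, p.x), std::min(min.y, p.y), std::min(min.z, p.z)};
        max = {std::max(max.x, p.x), std::max(max.y, p.y), std::max(max.z, p.z)};
    }
};

// Stand-in for evaluating the displacement map and offsetting a vertex.
Vec3 displaceVertex(Vec3 undisplaced) {
    return {undisplaced.x, undisplaced.y + 0.1f, undisplaced.z};
}

// One bounding box per triangle; the displaced vertices themselves are
// discarded, so only the undisplaced mesh plus these bounds stay in memory.
std::vector<Bounds> displacedFaceBounds(
        const std::vector<Vec3>& vertices,
        const std::vector<std::array<std::size_t, 3>>& faces) {
    std::vector<Bounds> bounds(faces.size());
    for (std::size_t f = 0; f < faces.size(); f++) {
        for (std::size_t k = 0; k < 3; k++) {
            bounds[f].expand(displaceVertex(vertices[faces[f][k]]));
        }
    }
    return bounds;   // feed these to the BVH builder as leaf bounds
}
</code></pre>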
<p>I implemented support for both scalar displacement via regular grayscale texture maps, and vector displacement from OpenEXR textures.
The ocean render from the start of this post uses vector displacement applied to a single plane.
Figure 3 shows another angle of the same vector displaced ocean:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/displacement_ocean_1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/displacement_ocean_1.jpg" alt="Figure 3: Another view of the vector displaced ocean surface from Figure 1. The ocean surface has a dielectric refractive material complete with colored attenuated transmission. A shallow depth of field is used to lend added realism." /></a></p>
<p>For both ocean renders, the vector displacement OpenEXR texture is borrowed from Autodesk, who generously provide it as part of an <a href="http://area.autodesk.com/learning/rendering-an-ocean-with-displacement1">article</a> about vector displacement in Arnold.
The renders are lit with a skydome using <a href="http://hdri-skies.com/shop/hdri-sky-193/">hdri-skies.com’s HDRI Sky 193</a> texture.</p>
<p>For both scalar and vector displacement, the displacement amount from the displacement texture can be controlled by a single scalar value.
Vector displacement maps are assumed to be in a local tangent space; which axis is used as the basis of the tangent space can be specified per displacement map.
Figure 4 shows three dirt shaderballs with varying displacement scaling values.
The leftmost shaderball has a displacement scale of 0, which effectively disables displacement.
The middle shaderball has a displacement scale of 0.5 of the native displacement values in the vector displacement map.
The rightmost shaderball has a displacement scale of 1.0, which means just use the native displacement values from the vector displacement map.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/displacementscales.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/displacementscales.jpg" alt="Figure 4: Dirt shaderballs with displacement scales of 0.0, 0.5, and 1.0, going from left to right." /></a></p>
<p>Figure 5 shows a closeup of the rightmost dirt shaderball from Figure 4.
The base mesh for the shaderball is relatively low resolution, but through subdivision and displacement, a huge amount of geometric detail can be added in-render.
In this case, the shaderball is tessellated to a point where each individual micropolygon is at a subpixel size.
The model for the shaderball is based on <a href="http://bertrand-benoit.com/blog/free-mat-test-scene/">Bertrand Benoit</a>’s shaderball.
The displacement map and other textures for the dirt shaderball are from <a href="https://megascans.se">Quixel’s Megascans</a> library.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/dirtsphere.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/dirtsphere.jpg" alt="Figure 5: Closeup of the dirt shaderball from Figure 4. In this render, the shaderball is tessellated and displaced to a subpixel resolution." /></a></p>
<p>One major challenge with displacement mapping is cracking.
Cracking occurs when adjacent polygons displace the same shared vertices different ways for each polygon.
This can happen when the normals across a surface aren’t continuous, or if there is a discontinuity either in how the displacement texture is mapped to the surface or in the displacement texture itself.
I implemented an optional, somewhat brute-force solution to displacement cracking.
If crack removal is enabled, Takua analyzes the mesh at displacement time and records how many different ways each vertex in the mesh has been displaced by different faces, along with which faces want to displace that vertex.
After an initial displacement pass, the crack remover then goes back and for every vertex that is displaced more than one way, all of the displacements are averaged into a single displacement, and all faces that use that vertex are updated to share the same averaged result.
This approach requires a fair amount of bookkeeping and pre-analysis of the displaced mesh, but it seems to work well.
Figure 6 is a render of two cubes with geometric normals assigned per face.
The two cubes are displaced using the same checkerboard displacement pattern, but the cube on the left has crack removal disabled, while the cube on the right has crack removal enabled:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/crackedcube.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/crackedcube.jpg" alt="Figure 6: Displaced cubes with and without crack elimination." /></a></p>
<p>In most cases, the crack removal system seems to work pretty well.
However, the system isn’t perfect; sometimes, stretching artifacts can appear, especially with surfaces with a textured base color.
This stretching happens because the crack removal system basically stretches micropolygons to cover the crack.
This texture stretching can be seen in some parts of the shaderballs in Figures 5, 7, and 8 in this post.</p>
<p>Takua automatically recalculates normals for subdivided/displaced polygons.
By default, Takua simply uses the geometric normal as the shading normal for displaced polygons; however, an option exists to calculate smooth normals for the shading normals as well.
I chose to use geometric normals as the default with the hope that for subpixel subdivision and displacement, a different shading normal wouldn’t be as necessary.</p>
<p>In the future, I may choose to implement my own subdivision library, and I should probably also put more thought into some kind of proper combined tessellation cache and eviction strategy for better memory efficiency.
For now though, everything seems to work well and renders relatively efficiently; the non-ocean renders in this post all have sub-pixel subdivision with millions of polygons and each took several hours to render at 4K (3840x2160) resolution on a machine with dual Intel Xeon X5675 CPUs (12 cores total).
The two ocean renders I let run overnight at 1080p resolution; they took longer to converge mostly due to the depth of field.
All renders in this post were shaded using a new, vastly improved shading system that I’ll write about at a later point.
Takua can now render a lot more complexity than before!</p>
<p>In closing, I rendered a few more shaderballs using various displacement maps from the Megascans library, seen in Figures 7 and 8.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/shaderspheres_0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/shaderspheres_0.jpg" alt="Figure 7: A pebble sphere and a leafy sphere. Note the overhangs on the leafy sphere, which are only possible using vector displacement." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/shaderspheres_1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/shaderspheres_1.jpg" alt="Figure 8: A compacted sand sphere and a stone sphere. Unfortunately, there is some noticeable texture stretching on the compacted sand sphere where crack removal occured." /></a></p>
<p><strong>References</strong></p>
<p>Edwin E. Catmull and James H. Clark. 1978. <a href="https://www.sciencedirect.com/science/article/abs/pii/0010448578901100">Recursively Generated B-spline Surfaces on Arbitrary Topological Meshes</a>. <em>Computer-Aided Design</em>. 10, 6 (1978), 350-355.</p>
<p>Mark Halstead, Michael Kass, and Tony DeRose. 1993. <a href="https://dl.acm.org/doi/10.1145/166117.166121">Efficient, Fair Interpolation using Catmull-Clark Surfaces</a>. In <em>SIGGRAPH 1993: Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques</em>. 35-44.</p>
<p>Johannes Hanika, Alexander Keller, and Hendrik P A Lensch. 2010. <a href="https://dl.acm.org/citation.cfm?id=1839241">Two-Level Ray Tracing with Reordering for Highly Complex Scenes</a>. In <em>GI 2010 (Proceedings of the 2010 Conference on Graphics Interfaces)</em>. 145-152.</p>
<p>Takahiro Harada. 2015. <a href="https://www.crcpress.com/GPU-Pro-6-Advanced-Rendering-Techniques/Engel/p/book/9781482264616">Rendering Vector Displacement Mapped Surfaces in a GPU Ray Tracer</a>. In <em>GPU Pro 6</em>. 459-474.</p>
<p>Matthias Nießner, Charles Loop, Mark Meyer, and Tony DeRose. 2012. <a href="https://dl.acm.org/doi/10.1145/2077341.2077347">Feature Adaptive GPU Rendering of Catmull-Clark Subdivision Surfaces</a>. <em>ACM Transactions on Graphics</em>. 31, 1 (2012), 6:1-6:11.</p>
<p>Matt Pharr and Pat Hanrahan. 1996. <a href="http://graphics.stanford.edu/papers/displace/">Geometry Caching for Ray-Tracing Displacement Maps</a>. In <em>Rendering Techniques 1996 (Proceedings of the 7th Eurographics Workshop on Rendering)</em>. 31-40.</p>
<p>Brian Smits, Peter Shirley, and Michael M. Stark. 2000. <a href="https://doi.org/10.1007/978-3-7091-6303-0_28">Direct Ray Tracing of Displacement Mapped Triangles</a>. In <em>Rendering Techniques 2000 (Proceedings of the 11th Eurographics Workshop on Rendering)</em>. 307-318.</p>
https://blog.yiningkarlli.com/2016/11/moana.html
Moana
2016-11-17T00:00:00+00:00
2016-11-17T00:00:00+00:00
Yining Karl Li
<p>2016 is the first year ever that <a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a> is releasing two CG animated films. We released <a href="http://www.disneyanimation.com/projects/moanaopia">Zootopia</a> back in March, and next week, we will be releasing our newest film, <a href="http://www.disneyanimation.com/projects/moana">Moana</a>. I’ve spent the bulk of the last year and a half working as part of Disney’s <a href="http://www.disneyanimation.com/technology/innovations/hyperion">Hyperion Renderer</a> team on a long list of improvements and new features for Moana. Moana is the first film I have an official credit on, and I couldn’t be more excited for the world to see what we have made!</p>
<p>We’re all incredibly proud of Moana; the story is fantastic, the characters are fresh and deep and incredibly appealing, and the music is an instant classic. Most important for a rendering guy though, I think Moana is flat out the best looking animated film anyone has ever made. Every single department on this film really outdid themselves. The technology that we had to develop for this film was staggering; we have a whole new distributed fluid simulation package for the endless oceans in the film, we added advanced new lighting capabilities to Hyperion that have never been used in an animated film before to this extent (to the best of my knowledge), we made huge advances in our animation technology for characters such as Maui; the list goes on and on and on. Something like 85% of the shots in this movie have significant FX work in them, which is unheard of for animated features.</p>
<p>Hyperion gained a number of major new capabilities in support of making Moana.
Rendering the ocean was a major concern on Moana, so much of Hyperion’s development during Moana revolved around features related to rendering water.
Our lighters wanted caustics in all shots with shallow water, such as shots set at the beach or near the shoreline; faking caustics was quickly ruled out as an option since setting up lighting rigs with fake caustics that looked plausible and visually pleasing proved to be difficult and laborious.
We found that providing real caustics was vastly preferable to faking things, both from a visual quality standpoint and an artist workflow standpoint, so we wound up adding a photon mapping system to Hyperion.
The design of the photon mapping system is highly optimized around handling sun-water caustics, which allows for some major performance optimizations, such as an adaptive photon distribution system that makes sure that photons are not wasted on off-camera parts of the scene.
Most of the photon mapping system was written by Peter Kutz; I also got to work on the photon mapping system a bit.</p>
<p>Water is in almost every shot in the film in some form, and the number of water effects was extremely varied, ranging from the ocean surface going out for dozens of miles in every direction, to splashes and boat wakes <a href="https://dl.acm.org/citation.cfm?id=3073597">[Stomakhin and Selle 2017]</a> and other finely detailed effects.
Water had to be created using a host of different techniques, from relatively simple procedural wave functions <a href="https://dl.acm.org/citation.cfm?id=3005379">[Garcia et al. 2016]</a>, to hand-animatable rigged wave systems <a href="https://dl.acm.org/citation.cfm?doid=3084363.3085056">[Byun and Stomakhin 2017]</a>, all the way to huge complex fluid simulations using Splash, a custom in-house APIC-based fluid simulator <a href="https://dl.acm.org/citation.cfm?id=2766996">[Jiang et al. 2015]</a>.
We even had to support water as a straight up rigged character <a href="https://dl.acm.org/citation.cfm?id=3085091">[Frost et al. 2017]</a>!
In order to bring the results of all of these techniques together into a single renderable water surface, an enormous amount of effort was put into building a level-set compositing system, in which all water simulation results would be converted into signed distance fields that could then be combined and converted into a watertight mesh.
Having a single watertight mesh was important, since the ocean often also contained a homogeneous volume to produce physically correct scattering.
This is where all of the blues and the greens in ocean water come from.
This entire system could be run by Hyperion at rendertime, or could be run offline beforehand to generate a cached result that Hyperion could load; a whole complex pipeline had to be built to support this capability <a href="https://dl.acm.org/citation.cfm?id=3085067">[Palmer et al. 2017]</a>.
Building this level-set compositing and meshing system involved a large number of TDs and engineers; on the Hyperion side, this project was led by Ralf Habel, Patrick Kelly, and Andy Selle.
Peter and I also helped out at various points.</p>
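<p>To make the level-set compositing idea a bit more concrete, here is a heavily simplified sketch of the most basic version of the operation: several water surfaces, each sampled as a signed distance field on a shared dense grid, are unioned into a single field by taking the per-voxel minimum, after which the composited field can be meshed (for example with marching cubes) into one watertight surface. The grid layout and names here are invented for illustration; the production system described above handles far more than a simple union on a dense grid.</p>

```cpp
// Simplified sketch of level-set compositing via per-voxel union (minimum).
// Illustrative only; assumes all inputs share the same grid dimensions.
#include <algorithm>
#include <cstddef>
#include <vector>

struct LevelSetGrid {
    std::size_t nx, ny, nz;
    std::vector<float> values;  // signed distance per voxel, negative = inside the water
};

// The union of signed distance fields is the pointwise minimum of their values.
LevelSetGrid compositeUnion(const std::vector<LevelSetGrid>& inputs) {
    LevelSetGrid out = inputs.front();
    for (std::size_t i = 1; i < inputs.size(); ++i) {
        for (std::size_t v = 0; v < out.values.size(); ++v) {
            out.values[v] = std::min(out.values[v], inputs[i].values[v]);
        }
    }
    return out;  // ready to be meshed into a single watertight surface
}
```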
<p>At one point early on in the film’s production, we noticed that our lighters were having a difficult time getting specular glints off of the ocean surface to look right.
For artistic controllability reasons, our lighters prefer to keep the sun and the skydome as two separate lights; the skydome is usually an image-based light that is either painted or is from photography with the sun painted out, and the sun is usually a distant infinite light that subtends some solid angle.
After a lot of testing, we found that the look of specular glints on the ocean surface comes partially from the sun itself, but also partially from the atmospheric scattering that makes the sun look hazy and larger in the sky than it actually is.
To get this look, I added a system to analytically add a Mie-scattering halo around our distant lights; we called the result the “halo light”.</p>
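<p>Purely as an illustration of the structure of such a light (the actual analytic halo Hyperion uses is described in the Hyperion TOG paper mentioned in the addendum below, not here), a “sun plus halo” distant light can be thought of as a radiance function of the angle between the query direction and the sun direction: inside the solar disk you return the sun’s radiance, and outside you return an analytic haze term standing in for forward Mie scattering. The exponential falloff below is just a placeholder assumption.</p>

```cpp
// Illustrative "sun + halo" distant light; the exponential halo falloff here is a
// placeholder assumption, not the actual analytic Mie halo used in Hyperion.
#include <cmath>

struct SunHaloLight {
    float sunAngularRadius;  // half-angle of the solar disk, in radians
    float sunRadiance;       // radiance returned for directions inside the disk
    float haloRadiance;      // halo radiance at the edge of the disk
    float haloFalloff;       // how quickly the halo fades with angle (assumed shape)

    // theta is the angle (in radians) between the query direction and the sun direction.
    float emittedRadiance(float theta) const {
        if (theta <= sunAngularRadius) {
            return sunRadiance;
        }
        // Placeholder analytic halo: exponential decay away from the edge of the disk.
        return haloRadiance * std::exp(-haloFalloff * (theta - sunAngularRadius));
    }
};
```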
<p>Up until Moana, Hyperion actually never had proper importance sampling for emissive meshes; we just relied on paths randomly finding their way to emissive meshes and only worried about importance sampling analytical area lights and distant infinite lights.
For shots with the big lava monster Te-Ka <a href="https://dl.acm.org/citation.cfm?id=3085076">[Bryant et al. 2017]</a>, however, most of the light in the frame came from emissive lava meshes, and most of what was being lit were complex, dense smoke volumes.
Peter added a highly efficient system for importance sampling emissive meshes into the renderer, which made Te-Ka shots go from basically un-renderable to not a problem at all.
David Adler also made some huge improvements to our denoiser’s ability to handle volumes to help with those shots.</p>
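<p>For readers curious what importance sampling an emissive mesh involves at all, here is a small sketch of one standard textbook approach: build a discrete distribution over the mesh’s triangles weighted by area times emitted power, pick a triangle proportionally to that weight, and then pick a uniform point on the chosen triangle. This is generic and is not a description of the system Peter built for Hyperion; all of the names below are invented for illustration.</p>

```cpp
// Generic sketch of emissive mesh importance sampling: choose a triangle with
// probability proportional to (area x emitted power), then sample a uniform
// point on that triangle. Illustrative only.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
inline Vec3 operator-(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator+(const Vec3& a, const Vec3& b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(const Vec3& a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline float length(const Vec3& a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

struct EmissiveTriangle { Vec3 p0, p1, p2; float emittedPower; };

struct EmissiveMeshSampler {
    std::vector<EmissiveTriangle> triangles;
    std::vector<float> cdf;  // normalized cumulative (area x power) weights

    void build() {
        cdf.resize(triangles.size());
        float total = 0.0f;
        for (std::size_t i = 0; i < triangles.size(); ++i) {
            const EmissiveTriangle& t = triangles[i];
            float area = 0.5f * length(cross(t.p1 - t.p0, t.p2 - t.p0));
            total += area * t.emittedPower;
            cdf[i] = total;
        }
        for (float& c : cdf) c /= total;
    }

    // u0 selects the triangle; (u1, u2) select a uniform point on it.
    Vec3 samplePoint(float u0, float u1, float u2) const {
        std::size_t i = static_cast<std::size_t>(
            std::lower_bound(cdf.begin(), cdf.end(), u0) - cdf.begin());
        const EmissiveTriangle& t = triangles[std::min(i, triangles.size() - 1)];
        float su1 = std::sqrt(u1);
        float b0 = 1.0f - su1;  // uniform barycentric coordinates
        float b1 = u2 * su1;
        return t.p0 * b0 + t.p1 * b1 + t.p2 * (1.0f - b0 - b1);
    }
};
```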
<p>Hyperion also saw a huge number of other improvements during Moana: Dan Teece and Matt Chiang made numerous improvements to the shading system, I reworked the ribbon curve intersection system to robustly handle Heihei’s and hawk-Maui’s feathers, Greg Nichols made our camera-adaptive tessellation more robust, and the team in general made many speed and memory optimizations.
Throughout the whole production cycle, Hyperion partnered really closely with production to make Moana the most beautiful animated film we’ve ever made.
This close partnership is what makes working at Disney Animation such an amazing, fun, and interesting experience.</p>
<p>The first section of the credits sequence in Moana showcases a number of the props that our artists made for the film. I highly recommend staying and staring at all of the eye candy; our look and modeling departments are filled with some of the most dedicated and talented folks I’ve ever met. The props in the credits have simply preposterous amounts of detail on them; every single prop has stuff like tiny little flyaway fibers or microscratches or imperfections or whatnot on them. In some of the international posters, one can see that all of the human characters are covered with fine peach fuzz (an important part of making their skin catch the sunlight correctly), which we rendered in every frame! Something that we’re really proud of is the fact that <em>none of the credit props were specially modeled for the credits</em>! Those are all the exact props we used in every frame that they show up in, which really is a testament to both how amazing our artists are and how much work we’ve put into every part of our technology. The vast majority of production for Moana happened in essentially the 9 months between Zootopia’s release in March and October of the same year; this timeline becomes even more astonishing given the sheer beauty and craftsmanship in Moana.</p>
<p>Below are a number of stills (in no particular order) from the movie, 100% rendered using Hyperion.
These stills give just a hint at how beautiful this movie looks; definitely go see it on the biggest screen you can find!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_01.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_01.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_20.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_20.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_12.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_12.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_14.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_14.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_13.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_13.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_04.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_05.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_05.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_06.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_06.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_38.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_38.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_07.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_07.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_08.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_08.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_10.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_10.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_11.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_11.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_09.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_09.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_03.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_03.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_02.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_44.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_44.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_16.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_16.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_17.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_17.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_19.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_19.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_35.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_35.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_37.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_37.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_21.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_21.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_22.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_22.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_43.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_43.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_23.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_23.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_24.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_24.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_25.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_25.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_26.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_26.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_27.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_27.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_28.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_28.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_29.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_29.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_30.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_30.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_31.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_31.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_32.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_32.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_15.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_15.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_33.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_33.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_34.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_34.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_18.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_18.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_45.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_45.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_36.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_36.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_39.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_39.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_40.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_40.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_41.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_41.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_42.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_42.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_46.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_46.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_47.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_47.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_48.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_48.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_50.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_50.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_49.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_49.jpg" alt="" /></a></p>
<p>Here is a credits frame with my name that Disney kindly provided! Most of the Hyperion team is grouped under the Rendering/Pipeline/Engineering Services (three separate teams under the same manager) category this time around, although a handful of Hyperion guys show up in an earlier part of the credits instead.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_credits.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_credits.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p><strong>Addendum 2018-08-18</strong>: A lot more detailed information about the photon mapping system, the level-set compositing system, and the halo light is now available as part of our recent TOG paper on Hyperion <a href="https://dl.acm.org/citation.cfm?id=3182159">[Burley et al. 2018]</a>.</p>
<p><strong>References</strong></p>
<p>Marc Bryant, Ian Coony, and Jonathan Garcia. 2017. <a href="https://dl.acm.org/citation.cfm?id=3085076">Moana: Foundation of a Lava Monster</a>. In <em>ACM SIGGRAPH 2017, Talks</em>. 10:1-10:2.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182159">The Design and Evolution of Disney’s Hyperion Renderer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 33:1-33:22.</p>
<p>Dong Joo Byun and Alexey Stomakhin. 2017. <a href="https://dl.acm.org/citation.cfm?doid=3084363.3085056">Moana: Crashing Waves</a>. In <em>ACM SIGGRAPH 2017, Talks</em>. 41:1-41:2.</p>
<p>Ben Frost, Alexey Stomakhin, and Hiroaki Narita. 2017. <a href="https://dl.acm.org/citation.cfm?id=3085091">Moana: Performing Water</a>. In <em>ACM SIGGRAPH 2017, Talks</em>. 30:1-30:2.</p>
<p>Jonathan Garcia, Sara Drakeley, Sean Palmer, Erin Ramos, David Hutchins, Ralf Habel, and Alexey Stomakhin. 2016. <a href="https://dl.acm.org/citation.cfm?id=3005379">Rigging the Oceans of Disney’s Moana</a>. In <em>ACM SIGGRAPH Asia 2016, Technical Briefs</em>. 30:1-30:4.</p>
<p>Chenfanfu Jiang, Craig Schroeder, Andrew Selle, Joseph Teran, and Alexey Stomakhin. 2015. <a href="https://dl.acm.org/citation.cfm?id=2766996">The Affine Particle-in-Cell Method</a>. <em>ACM Transactions on Graphics</em>. 34, 4 (2015), 51:1-51:10.</p>
<p>Sean Palmer, Jonathan Garcia, Sara Drakeley, Patrick Kelly, and Ralf Habel. 2017. <a href="https://dl.acm.org/citation.cfm?id=3085067">The Ocean and Water Pipeline of Disney’s Moana</a>. In <em>ACM SIGGRAPH 2017, Talks</em>. 29:1-29:2.</p>
<p>Alexey Stomakhin and Andy Selle. 2017. <a href="https://dl.acm.org/citation.cfm?id=3073597">Fluxed Animated Boundary Method</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 68:1-68:8.</p>
https://blog.yiningkarlli.com/2016/09/pbrtv3.html
Physically Based Rendering 3rd Edition
2016-09-30T00:00:00+00:00
2016-09-30T00:00:00+00:00
Yining Karl Li
<p>Today is the release date for the digital version of the new <a href="https://www.amazon.com/Physically-Based-Rendering-Theory-Implementation-ebook/dp/B01M013UX1/ref=mt_kindle?_encoding=UTF8&me=">Physically Based Rendering 3rd Edition</a>, by <a href="http://pharr.org/matt/">Matt Pharr</a>, <a href="https://rgl.epfl.ch/people/wjakob">Wenzel Jakob</a>, and <a href="https://twitter.com/humper">Greg Humphreys</a>.
As anyone in the rendering world knows, Physically Based Rendering is THE reference book for the field; for novices, Physically Based Rendering is the best introduction one can get to the field, and for experts, Physically Based Rendering is an invaluable reference book to consult and check.
I share a large office with three other engineers on the Hyperion team, and I think between the four of us, we actually have an average of more than one copy per person (of varying editions).
I could not recommend this book enough.
The third edition adds Wenzel Jakob as an author; Wenzel is the author of the research-oriented <a href="http://www.mitsuba-renderer.org">Mitsuba Renderer</a> and is one of the most prominent new researchers in rendering in the past decade.
There is a lot of great new light transport stuff in the third edition, which I’m guessing comes from Wenzel.
Both Wenzel’s work and the previous editions of Physically Based Rendering were instrumental in influencing my path in rendering, so of course I’ve already had the third edition on pre-order since it was announced over a year ago.</p>
<p>Each edition of Physically Based Rendering is accompanied by a major release of the <a href="https://github.com/mmp/pbrt-v3">PBRT renderer</a>, which implements the book.
The PBRT renderer is a major research resource for the community, and basically everyone I know in the field has learned something or another from looking through and taking apart PBRT.
As part of the drive towards PBRT-v3, Matt Pharr made a call for interesting scenes to provide as demo scenes with the PBRT-v3 release.
I offered Matt the PBRT-v2 scene I <a href="http://blog.yiningkarlli.com/2015/03/bsdf-system.html">made a while back</a>, because how could that scene <em>not</em> be rendered in PBRT?
I’m very excited that Matt accepted and included the scene as part of PBRT-v3’s example scenes!
You can find the example scenes <a href="http://pbrt.org/scenes-v3.html">here on the PBRT website</a>.</p>
<p>Converting the scene to PBRT’s format required a lot of manual work, since PBRT’s scene specification and shading system is very different from Takua’s.
As a result, the image that PBRT renders out looks slightly different from Takua’s version, but that’s not a big deal.
Here is the scene rendered using PBRT-v3:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Sep/pbrtv2_pbrtv3.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Sep/pbrtv2_pbrtv3.jpg" alt="Physically Based Rendering 2nd Edition, rendered using PBRT-v3." /></a></p>
<p>…and for comparison, the same scene rendered using Takua:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Sep/pbrtv2_takua.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Sep/pbrtv2_takua.png" alt="Physically Based Rendering 2nd Edition, rendered using Takua Renderer a0.5." /></a></p>
<p>Really, it’s just the lighting that is a bit different; the Takua version is slightly warmer and slightly underexposed in comparison.</p>
<p>At some point I should make an updated version of this scene using the third edition book.
I’m hoping to be able to contribute more of my Takua test scenes to the community in PBRT-v3 format in the future; giving back to such a major influence on my own career is extremely important.
As part of the process of porting the scene over to PBRT-v3, I also wrote a super-hacky render viewer for PBRT that shows the progress of the render as the renderer runs.
Unfortunately, this viewer is mega-hacky, and I don’t have time at the moment to clean it up and release it.
Hopefully at some point I’ll be able to; alternatively, if anyone else wants to take a look and give it a stab, feel free to contact me.</p>
<hr />
<p><strong>Addendum 04/28/2017</strong>: Matt was recently looking for some interesting water-sim scenes to demonstrate dielectrics and glass materials and refraction and whatnot.
I contributed a few frames from <a href="http://yiningkarlli.com/projects/arielflip.html">my PIC/FLIP fluid simulator, Ariel</a>.
Most of the data from Ariel doesn’t exist in meshed format anymore; I still have all of the raw VDBs and stuff, but the meshes took up way more storage space than I could afford at the time.
I can still regenerate all of the meshes though, and I also have a handful of frames in mesh form still from my <a href="http://blog.yiningkarlli.com/2015/06/attenuated-transmission.html">attenuated transmission blog post</a>.
The frame from the first image in that post is now also included in the PBRT-v3 <a href="http://pbrt.org/scenes-v3.html">example scene suite</a>!
The PBRT version looks very different since it is intended to demonstrate and test something very different from what I was doing in that blog post, but it still looks great!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Sep/ariel_pbrtv3.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Sep/ariel_pbrtv3.jpg" alt="A frame from my Ariel fluid simulator, rendered using PBRT-v3." /></a></p>
https://blog.yiningkarlli.com/2016/07/minecraft-in-renderman-ris.html
Rendering Minecraft in Renderman/RIS
2016-07-22T00:00:00+00:00
2016-07-22T00:00:00+00:00
Yining Karl Li
<p>The vast majority of my computer graphics time is spent developing renderers (Disney’s Hyperion renderer as a professional, Takua Renderer as a hobbyist). However, I think having experience using renderers as an artist is an important part of knowing what to focus on as a renderer developer. I also think that knowing how a variety of different renderers work and how they are used is important; a lot of artists are used to using several different renderers, and each renderer has its own vocabulary and tried and true workflows and whatnot. Finally, there are a lot of really smart people working on all of the major production renderers out there, and seeing the cool things everyone is doing is fun and interesting! Because of all of these reasons, I like putting some time aside every once in a while to tinker with other renderers. I usually don’t write about my art projects that much anymore, but this project was particularly fun and produced some nice looking images, so I thought I’d write it up. As usual, before we dive into the post, here is the final image I made, rendered using Pixar’s Photorealistic Renderman 20 in RIS mode:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_final_comp.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_final_comp.jpg" alt="A Minecraft town from the pve.nerd.nu Minecraft server, rendered in Renderman 20/RIS." /></a></p>
<p>About two years ago, Pixar’s Photorealistic Renderman got a new rendering mode called RIS. PRman was one of the first production renderers ever developed, and historically PRman has always been a <a href="http://graphics.pixar.com/library/Reyes/">REYES-style rasterization</a> renderer. Over time though, PRman has gained a whole bunch of added on features. At the time of Monsters University, PRman was actually a kind of hybrid rasterizer and raytracer; the rendering system on Monsters University used raytracing to build a <a href="http://graphics.pixar.com/library/RadiosityCaching/">multiresolution radiosity cache</a> that was then used for calculating GI contributions in the shading part of REYES rasterization. That approach worked well and produced beautiful images, but it was also really complicated and had a number of drawbacks! RIS replaces all of that with a brand new, pure pathtracing system. In fact, while RIS is marketed as a new mode in PRman, RIS is actually a completely new renderer written almost completely from scratch; it just happens to be able to read Renderman RIB files as input.</p>
<p>Recently, I wanted to try rendering a Minecraft world from a Minecraft server that I play on. There are a lot of great Minecraft rendering tools available these days (<a href="http://chunky.llbit.se/gallery.html">Chunky</a> comes to mind), but I wanted much more production-like control over the look of the render, so I decided to do everything using a normal CG production workflow instead of a prebuilt dedicated Minecraft rendering tool. I thought that I would use the project as a chance to give RIS a spin. At Cornell’s Program of Computer Graphics, Pixar was kind enough to provide us with access to the Renderman 19 beta program, which included the first version of RIS. I tinkered with the PRman 19 beta quite a lot at Cornell, and being an early beta, RIS had some bugs and incomplete bits back then. Since then though, the Renderman team has followed up PRman 19 with versions 20 and 21, which introduced a number of new features and speed/stability improvements to RIS. Since joining the Hyperion team, I’ve had the chance to meet and talk to various (really smart!) folks on the Renderman team since they are a sister team to us, but I haven’t actually had time to try the new versions of RIS. This project was a fun way to try the newest version of RIS on my own!</p>
<p>The Minecraft data for this project comes from the <a href="http://nerd.nu">Nerd.nu community Minecraft server</a>, which is run by a collective of players for free. I’ve been playing on the Nerd.nu PvE (Player versus Environment) server for years and years now, and players have built a mind-boggling number of amazing detailed creations. Every couple of months, the server is reset with a fresh map; I wanted to render a town that fellow player Avi_Dangerstein and I built on the previous map revision. Fortunately, all previous Nerd.nu map revisions are available for download in the <a href="http://mcp-dl.com/">server archives</a> (the specific map I used is labeled pve-rev17). Here is an overview of the map revision I wanted to pull data from:</p>
<p><a href="http://redditpublic.com/carto/pve/p17/carto/#/-24/64/176/-6/0/0"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/cartograph.jpg" alt="Cartograph view of Revision 17 of the Nerd.nu PvE server, located at p.nerd.nu. Click through to go to the full, zoomable cartograph." /></a></p>
<p>…and here is a zoomed in view of the part of the map that contains our town. The vast majority of the town was built by two players over the course of about 4 months. Our town is about 250 blocks long; the entire server map is a 6000 block by 6000 block square.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/cartograph_zoomed.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/cartograph_zoomed.jpg" alt="Zoomed cartograph view of our Minecraft town." /></a></p>
<p>The first problem to tackle in this project was just getting Minecraft world data into a usable format. Pixar provides a free, non-commercial version of Renderman for Maya, and I’m very familiar with Maya, so my entire workflow for this project was based around good ol’ Maya. Maya doesn’t know how to read Minecraft data though… in fact, Minecraft’s <a href="http://minecraft.gamepedia.com/Chunk_format">chunked data format</a> is a fascinating rabbit hole to read about in its own right. I briefly entertained the idea of writing my own Minecraft to Maya importer, but then I found a number of Minecraft to Obj exporters that other folks have already written. I first tried <a href="https://github.com/jmc2obj/j-mc-2-obj">jmc2obj</a>, but the section of the Minecraft world that I wanted to export was so large that jmc2obj kept running out of memory and crashing. Instead, I found that <a href="http://erich.realtimerendering.com">Eric Haines</a>’s <a href="http://www.realtimerendering.com/erich/minecraft/public/mineways/">Mineways</a> exporter was able to handle the data load well (incidentally, Eric Haines is also a Cornell Program of Computer Graphics alum; I inherited a pile of his ACM Transactions on Graphics hardcopies while at Cornell). The chunk of the world I wanted to export was pretty large; in the Mineways screenshot below, the area outlined in red is the part of the world that I wanted:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/mineways_section.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/mineways_section.jpg" alt="Section of the map for export is outlined in red." /></a></p>
<p>The area outlined above is significantly larger than the area I wound up rendering… initially I was thinking of a very different camera angle from the ground with the mountains in the background before I picked an aerial view much later. The size of the exported obj mesh was about 1.5 GB. Mineways exports the world as a single mesh, optimized to remove all completely occluded internal faces (so the final mesh is hollow instead of containing useless faces for all of the internal blocks). Each visible block face is uv’d into a corresponding square on a single texture file. This approach produces an efficient mesh, but I realized early on that I would need water in a separate mesh containing completely enclosed volumes for each body of water (Mineways only provides geometry for the top surface of water). Glass had to be handled similarly; both water and glass need special handling for the same reasons that I mentioned immediately after the first diagram in my <a href="http://blog.yiningkarlli.com/2015/06/attenuated-transmission.html">attenuated transmission blog post</a>. Mineways allows for exporting different block types as separate meshes (but still with internal faces removed), so I simply deleted the water and glass meshes after exporting. Luckily, jmc2obj allows exporting individual block types as closed meshes, so I went back to jmc2obj for just the water and glass. Since just the water and glass is a much smaller data set than the whole world, jmc2obj was able to export without a problem. Since rendering refractive interfaces correctly requires expanding out the refractive mesh slightly at the interfaces, I wrote a custom program based on Takua Renderer’s obj mesh processing library to push out all of the vertices of the water and glass meshes slightly along the average of the face normals at each vertex.</p>
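<p>The push-out step itself is conceptually simple; as a rough reconstruction of the idea (not the actual Takua-based tool), the sketch below averages the geometric normals of the faces around each vertex and then offsets the vertex slightly along that averaged direction, so that the water and glass volumes end up slightly overlapping the surrounding blocks at the refractive interfaces. All of the names here are invented for illustration.</p>

```cpp
// Rough reconstruction of the mesh "push out" step: offset every vertex along the
// average of its adjacent face normals. Illustrative only; not the actual tool.
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
inline Vec3 operator-(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator+(const Vec3& a, const Vec3& b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(const Vec3& a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline Vec3 normalize(const Vec3& a) {
    float len = std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
    return len > 0.0f ? a * (1.0f / len) : a;
}

struct Triangle { std::size_t i0, i1, i2; };

void pushOutAlongAveragedNormals(std::vector<Vec3>& vertices,
                                 const std::vector<Triangle>& faces,
                                 float offset) {
    // Accumulate the unit normal of each face onto its three vertices.
    std::vector<Vec3> accumulated(vertices.size(), Vec3{0.0f, 0.0f, 0.0f});
    for (const Triangle& f : faces) {
        Vec3 n = normalize(cross(vertices[f.i1] - vertices[f.i0],
                                 vertices[f.i2] - vertices[f.i0]));
        accumulated[f.i0] = accumulated[f.i0] + n;
        accumulated[f.i1] = accumulated[f.i1] + n;
        accumulated[f.i2] = accumulated[f.i2] + n;
    }
    // Offset each vertex slightly along its averaged (renormalized) normal.
    for (std::size_t v = 0; v < vertices.size(); ++v) {
        vertices[v] = vertices[v] + normalize(accumulated[v]) * offset;
    }
}
```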
<p>Next up was shading everything in Maya. Renderman 20 ships with an implementation of <a href="https://disney-animation.s3.amazonaws.com/library/s2012_pbs_disney_brdf_notes_v2.pdf">Disney’s Principled Brdf</a>, which I’ve gotten very familiar with using, so I went with Renderman’s PxrDisney Bxdf. The Disney Brdf allows for quickly creating very good looking materials using a fairly small parameter set. Overall I tried to stick close to the in-game aesthetic, which meant using all of the standard in-game textures instead of a custom resource pack, and I also wound up having to rein back a bit on making materials look super realistic. Everything basically has some varied roughness and specularity, and that’s pretty much it. I did add a subtle bump map to everything though; I made the bump map by simply making a black and white version of the default texture pack and messing with the brightness and contrast a bit. Below is a render of a <a href="http://www.minecraftforum.net/forums/mapping-and-modding/resource-packs/1243823-qmagnets-test-map-for-resource-packs-and-map">test world</a> created by Minecraft Forum user QMagnet specifically for testing resource packs. I lit the test scene using a single IBL (<a href="http://hdri-skies.com/shop/hdri-sky-141/">HDRI Sky 141 from the HDRI-Skies library</a>). The test render below isn’t using the final specialized water and leaf shaders I created, which I’ll describe a bit further down, and there are also some resolution problems on the alpha masks for the leaf blocks, but overall this test render gives an idea of what my final materials look like:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/materialtest.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/materialtest.jpg" alt="Final materials on a resource pack test world." /></a></p>
<p>One detail worth going into a bit more depth on is the glowing blocks. The glowstone, lantern, and various torch blocks use a trick based on something that I have seen lighters use in production. The basic idea is to decouple the direct and indirect visibility for the light. I got this decoupling to work in RIS by making all of the glowing blocks into pairs of textured PxrMeshLights. Using PxrMeshLights is necessary in order to allow for efficient light sampling; however, the actual exposures the lights are at make the textures blow out in camera. In order to make the textures discernible in camera, a second PxrMeshLight is needed for each glowing object; one of the lights is at the correct exposure but is marked visible only to indirect rays and invisible to direct camera rays, and the other light is at a much lower exposure and is only visible to direct camera rays. This trick is a totally non-physical cheaty-hack, but it allows for a believable visual appearance if the exposures are chosen carefully.</p>
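<p>Conceptually, the decoupling means the renderer sees two different emission values for the same glowing block depending on ray type; the tiny sketch below expresses that idea as plain code rather than as the actual pair of PxrMeshLights and visibility settings used in the RIS scene, and the names and structure are made up purely for illustration.</p>

```cpp
// Conceptual illustration of decoupled direct/indirect emission for a glowing
// block; this stands in for the pair of PxrMeshLights described above and is
// not how the trick is actually authored in RIS.
enum class RayType { Camera, Indirect };

struct GlowingBlockEmission {
    float fullExposureScale;    // used for indirect illumination of the scene
    float cameraExposureScale;  // much lower, used only for what the camera sees

    // textureValue is the emissive texture lookup at the hit point.
    float emittedRadiance(RayType type, float textureValue) const {
        float scale = (type == RayType::Camera) ? cameraExposureScale : fullExposureScale;
        return textureValue * scale;
    }
};
```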
<p>In the final renders a few pictures down, I also used a more specialized shader for leaves and vines and tall grass and whatnot. The leaf block shader uses a PxrLMPlastic material instead of PxrDisney; this is because the leaf block shader has a slight amount of diffuse transmission (translucency) and also has more specialized diffuse/specular roughness maps.</p>
<p>For the water shader in the final render, I used a PxrLMGlass material with an IOR of 1.325, a slightly blue tinted refraction color, and a light blue absorption color. Using slightly different colors for the refraction and absorption colors allows for the water to transition to a slightly different hue at deeper depths than at the surface (as opposed to just a more saturated version of the same color). I also added a simple water surface displacement map to get the wavy surface effect. Combined with the refractive interface stuff mentioned before, the final water looks like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/watertest.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/watertest.jpg" alt="Water test render, using a PxrLMGlass material. Unfortunately, no true caustics here..." /></a></p>
<p>Note the total lack of real caustics in the water… I wound up just using the basic pathtracer built into RIS instead of Pixar’s VCM implementation. Pixar’s VCM implementation is one of the first commercial VCM implementations out there, but as of Renderman 20, it has no adaptivity in its light path distribution whatsoever. As a result, the Renderman 20 VCM integrator is not really suitable for use on huge scenes; most of the light paths end up getting wasted on areas of the scene that aren’t even close to being in-camera, which means that all of the efficiency in rendering caustics is lost. This problem is fundamental to lighttracing-based techniques (meaning that bidirectional techniques inherit the same problem), and solving it remains a relatively open problem (Takua has some basic photon targeting mechanisms for PPM/VCM that I’ll write about at some point). Apparently, this large-scene problem was a major challenge on Finding Dory and is one of the main reasons why Pixar didn’t use VCM heavily on Dory; Dory relied mostly on projected and pre-baked caustics.</p>
<p>I should also note that Renderman 21 does away with the PxrLM and PxrDisney materials entirely and instead introduces the shader set that Christophe Hery and Ryusuke Villemin wrote for Finding Dory. I haven’t tried the Renderman 21 shading system yet, but I would be very curious to compare against our Disney Brdf.</p>
<p>The final lighting setup I used was very simple. There are two main lights in the scene: an IBL dome light for sky illumination, and a 0.5 degree distant light as a sun stand-in. The IBL is another free sky from the HDRI-Skies library; this time, I used <a href="http://hdri-skies.com/shop/hdri-sky-084/">HDRI Sky 84</a>. There is also a third spotlight used for getting long, dramatic shadows out of the fog, which I’ll talk about a bit later. Here is a lighting test with just the dome and distant lights on a grey clay version of the scene:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/clay_lighting.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/clay_lighting.jpg" alt="Grey clay render lit using the final distant and dome light setup." /></a></p>
<p>For efficiency reasons, I broke out the fog into a separate pass entirely and added it back in comp afterwards. At the time that I did this project, Renderman 20’s volume system was still relatively new (Renderman 21 introduces a significantly overhauled, much faster volume system, but Renderman 21 wasn’t out yet when I did this project), and while perfectly capable, wasn’t necessarily super fast. Iterating on the look of the fog separately from the main render was simply a more efficient workflow. Here is the raw render directly out of RIS:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_main_pass.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_main_pass.jpg" alt="Raw render of the main render pass, straight out of RIS." /></a></p>
<p>For the fog, I initially wanted to do fully simulated fog in Houdini. I experimented with using a point SOP to control wind direction and to drive a wind DOP and have fog flow through the scene, but the sheer scale of the scene made this approach impracticable on my home computers. Instead, I wound up just creating a static procedural volume noise field and dumping it out to VDB. I then brought the VDB back into Maya for RIS rendering. Initially I tried rendering the fog pass without the additional spotlight and got something that looked like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_fog_pass_old.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_fog_pass_old.jpg" alt="My initial attempt at the fog pass." /></a></p>
<p>After getting this first fog attempt rendered, I did a first pass at a final comp and color grade. I wound up using a very different color grade on this earlier attempt. This earlier version is the version that I shared in some places, such as the <a href="http://www.reddit.com/r/mcpublic">Nerd.nu subreddit</a> and on Twitter:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_final_comp_oldversion.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_final_comp_oldversion.jpg" alt="First comp and grade attempt, using old version of fog." /></a></p>
<p>This first attempt looked okay, but didn’t quite hit what I was going for. I wanted something with much more dramatic shadow beams, and I also felt that the fog didn’t really look settled into the terrain. Eventually I realized that I needed to make the fog sparser and that the fog should start thinning out after rising just a bit off of the ground. After adjusting the fog and adding in a spotlight with a bit of a cooler temperature than the sun, I got the image below. I’m pretty happy with how the fog looks like it is settling in the river valley and is pouring out of the forested hill in the upper left of the image, even though none of the fog is actually simulated!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_fog_pass_final.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_fog_pass_final.jpg" alt="Final fog pass, with extra spotlight. Note how the fog seems to sit in the lower river valley and pour out of the forest." /></a></p>
<p>Finally, I brought everything together in comp and added a color grading pass in Lightroom. The grade that I went with is much much more heavy-handed than what I usually like to use, but it felt appropriate for this image. I also added a slight amount of vignetting and grain in the final image. The final image is at the top of this post, but here it is again for convenience:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_final_comp.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_final_comp.jpg" alt="Final composite with fog, color grading, and vignetting/grain." /></a></p>
<p>I learned a lot about using RIS from this project! By my estimation, RIS is orders of magnitude easier to use than old REYES Renderman; the overall experience was fairly similar to my previous experiences with Vray and Arnold. Both Takua and Hyperion make some similar choices and some very different choices in comparison, but then again, every renderer has large similarities and large differences from every other renderer out there. Rendering a Minecraft world was a lot of fun; I definitely am looking forward to doing more Minecraft renders using this pipeline again sometime in the future.</p>
<p>Also, here’s a shameless plug for the <a href="http://nerd.nu">Nerd.nu</a> Minecraft server that this data set is from. If you like playing Minecraft and are looking for a fast, free, friendly community to build with, you should definitely come check out the Nerd.nu PvE server, located at p.nerd.nu. The little town in this post is not even close to the most amazing thing that people have built on that server.</p>
<p>A final note on the (lack of) activity on my blog recently: we’ve been extremely busy at Walt Disney Animation Studios for the past year trying to release both Zootopia and Moana in the same year. Now that we’re closing in on the release of Moana, hopefully I’ll find time to post more. I have a lot of cool posts about Takua Renderer in various states of drafting; look for them soon!</p>
<hr />
<p><strong>Addendum 10/02/2016</strong>: After I published this post, Eric Haines wrote to me with a few typo corrections and, more importantly, to tell me about a way to get completely enclosed meshes from Mineways using the <a href="http://www.realtimerendering.com/erich/minecraft/public/mineways/mineways.html#schemes">color schemes feature</a>. Serves me right for not reading the documentation completely before starting! The color schemes feature allows assigning a color and alpha value to each block type; the key part of this feature for my use case is that Mineways will delete blocks with a zero alpha value when exporting. Setting all blocks except for water to have an alpha of zero allows for exporting water as a complete enclosed mesh; the same trick applies for glass or really any other block type.</p>
<p>One of the neat things about this feature is that the Mineways UI draws the map respecting assigned alpha values from the color scheme being used. As a result, setting everything except for water to have a zero alpha produces a cool view that shows only all of the water on the map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/mineways_water_only.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/mineways_water_only.jpg" alt="Mineways map view showing only water blocks. This image shows the same exact area of the map as the other Mineways screenshot earlier in the post." /></a></p>
<p>Going forward, I’ll definitely be adopting this technique to get water meshes instead of using jmc2obj. Being able to handle all of the mesh exporting work in a single program makes for a nicer, more streamlined pipeline. Of course both jmc2obj and Mineways are excellent pieces of software, but in my testing Mineways handles large map sections much better, and I also think that Mineways produces better water meshes compared to jmc2obj. As a result, my pipeline is now entirely centered around Mineways.</p>
https://blog.yiningkarlli.com/2016/02/zootopia.html
Zootopia
2016-02-12T00:00:00+00:00
2016-02-12T00:00:00+00:00
Yining Karl Li
<p><a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a>’ newest film, <a href="http://www.disneyanimation.com/projects/zootopia">Zootopia</a>, will be releasing in the United States three weeks from today.
I’ve been working at Walt Disney Animation Studios on the core development team for Disney’s <a href="http://www.disneyanimation.com/technology/innovations/hyperion">Hyperion Renderer</a> since July of last year, and the release of Zootopia is really special for me; Zootopia is the first feature film I’ve worked on.
My actual role on Zootopia was fairly limited; so far, I’ve been spending most of my time and effort on the version of Hyperion for our next film, <a href="http://www.disneyanimation.com/projects/moana">Moana</a> (coming out November of this year).
On Zootopia I basically only did support and bugfixes for Zootopia’s version of Hyperion (and I actually don’t even have a credit in Zootopia, since I hadn’t been at the studio for very long when the credits were compiled).
Nonetheless, I’m incredibly proud of all of the work and effort that has been put into Zootopia, and I consider myself very fortunate to have been able to play even a small role in making the film!</p>
<p>Zootopia is a striking film in every way.
The story is fantastic and original and relevant, the characters are all incredibly appealing, the setting is fascinating and immensely clever, the music is wonderful.
However, on this blog, we are more interested in the technical side of things; luckily, the film is just as unbelievable in its technology.
Quite simply, Zootopia is a breathtakingly beautiful film.
In the same way that Big Hero 6 was several orders of magnitude more complex and technically advanced than Frozen in every way, Zootopia represents yet another enormous leap over Big Hero 6 (which can be hard to believe, considering how gorgeous Big Hero 6 is).</p>
<p>The technical advances made on Zootopia go far beyond what I can cover in detail here, since I don’t think I can describe them in a way that does them justice, but I think I can safely say that Zootopia is the most technically advanced animated film made to date.
The fur and cloth (and cloth on top of fur!) systems on Zootopia are beyond anything I’ve ever seen, the sets and environments are simply ludicrous in both detail and scale, and of course the shading and lighting and rendering are jaw-dropping.
In a lot of ways, many of the technical challenges that had to be solved on Zootopia can be summarized in a single word: complexity.
Enormous care had to be put into creating believable fur and integrating different furry characters into different environments <a href="https://dl.acm.org/doi/10.1145/2936733.2936736">[Burkhard et al. 2016]</a>, and the huge quantities of fur in the movie required developing new level-of-detail approaches <a href="https://dl.acm.org/citation.cfm?id=2927466">[Palmer and Litaker 2016]</a> to make the fur manageable on both the authoring and rendering sides.
The sheer number of crowds characters in the film also required developing a new crowds workflow <a href="https://dl.acm.org/doi/10.1145/2897839.2927467">[El-Ali et al. 2016]</a>, again to make both authoring and rendering tractable, and the complex jungle environments seen throughout most of the film similarly required new approaches to procedural vegetation <a href="https://dl.acm.org/citation.cfm?id=2927469">[Keim et al. 2016]</a>.
Complexity wasn’t just a problem on a large scale though; Zootopia is also incredibly rich in the smaller details.
Zootopia was the first movie that Disney Animation deployed a flesh simulation system on <a href="https://dl.acm.org/citation.cfm?id=2927390">[Milne et al. 2016]</a> in order to create convincing muscular movement under the skin and fur of the animal characters.
Even individual effects such as scooping ice cream <a href="https://dl.acm.org/citation.cfm?id=2927445">[Byun et al. 2016]</a> sometimes required innovative new CG techniques.
On the rendering side, the Hyperion team developed a brand new BSDF for shading hair and fur <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12830">[Chiang et al. 2016]</a>, with a specific focus on balancing artistic controllability, physical plausibility, and render efficiency.
Disney isn’t paying me to write this on my personal blog, and I don’t write any of this to make myself look grand either.
I played only a small role, and really the amazing quality of the film is a testament to the capabilities of the hundreds of artists that actually made the final frames.
I’m deeply humbled to see what amazing things great artists can do with the tools that my team makes.</p>
<p>Okay, enough rambling. Here are some stills from the film, 100% rendered with Hyperion, of course. Go see the film; these images only scratch the surface in conveying how gorgeous the film is.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_01.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_01.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_03.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_03.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_13.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_13.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_14.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_14.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_02.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_04.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_05.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_05.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_40.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_40.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_06.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_06.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_07.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_07.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_16.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_16.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_08.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_08.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_10.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_10.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_11.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_11.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_12.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_12.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_09.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_09.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_33.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_33.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_15.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_15.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_17.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_17.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_18.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_18.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_41.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_41.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_39.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_39.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_19.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_19.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_20.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_20.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_21.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_21.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_31.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_31.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_22.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_22.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_27.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_27.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_23.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_23.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_35.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_35.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_36.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_36.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_37.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_37.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_24.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_24.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_25.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_25.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_28.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_28.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_29.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_29.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_32.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_32.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_34.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_34.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_30.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_30.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_38.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_38.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_26.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_26.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p><strong>References</strong></p>
<p>Nicholas Burkard, Hans Keim, Brian Leach, Sean Palmer, Ernest J. Petti, and Michelle Robinson. 2016. <a href="https://dl.acm.org/doi/10.1145/2936733.2936736">From Armadillo to Zebra: Creating the Diverse Characters and World of Zootopia</a>. In <em>ACM SIGGRAPH 2016 Production Sessions</em>. 24:1-24:2.</p>
<p>Dong Joo Byun, James Mansfield, and Cesar Velazquez. 2016. <a href="https://dl.acm.org/citation.cfm?id=2927445">Delicious Looking Ice Cream Effects with Non-Simulation Approaches</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 25:1-25:2.</p>
<p>Matt Jen-Yuan Chiang, Benedikt Bitterli, Chuck Tappan, and Brent Burley. 2016. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12830">A Practical and Controllable Hair and Fur Model for Production Path Tracing</a>. <em>Computer Graphics Forum</em>. 35, 2 (2016), 275-283.</p>
<p>Moe El-Ali, Joyce Le Tong, Josh Richards, Tuan Nguyen, Alberto Luceño Ros, and Norman Moses Joseph. 2016. <a href="https://dl.acm.org/doi/10.1145/2897839.2927467">Zootopia Crowd Pipeline</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 59:1-59:2.</p>
<p>Hans Keim, Maryann Simmons, Daniel Teece, and Jared Reisweber. 2016. <a href="https://dl.acm.org/citation.cfm?id=2927469">Art-Directable Procedural Vegetation in Disney’s Zootopia</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 18:1-18:2.</p>
<p>Andy Milne, Mark McLaughlin, Rasmus Tamstorf, Alexey Stomakhin, Nicholas Burkard, Mitch Counsell, Jesus Canal, David Komorowski, and Evan Goldberg. 2016. <a href="https://dl.acm.org/citation.cfm?id=2927390">Flesh, Flab, and Fascia Simulation on Zootopia</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 34:1-34:2.</p>
<p>Sean Palmer and Kendall Litaker. 2016. <a href="https://dl.acm.org/citation.cfm?id=2927466">Artist Friendly Level-of-Detail in a Fur-Filled World</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 32:1-32:2.</p>
https://blog.yiningkarlli.com/2015/06/attenuated-transmission.html
Attenuated Transmission
2015-06-18T00:00:00+00:00
2015-06-18T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jun/fluid.2.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jun/preview/fluid.2.jpg" alt="Blue liquid in a glass box, with attenuated transmission. Simulated using PIC/FLIP in Ariel, rendered in Takua a0.5 using VCM." /></a></p>
<p>A few months ago I added attenuation to Takua a0.5’s Fresnel refraction BSDF. Adding attenuation wound up being more complex than originally anticipated because handling attenuation through refractive/transmissive mediums requires volumetric information in addition to the simple surface differential geometry. In <a href="http://blog.yiningkarlli.com/2015/03/bsdf-system.html">a previous post about my BSDF system</a>, I mentioned that the BSDF system only considered surface differential geometry information; adding attenuation meant extending my BSDF system to also consider volume properties and track more information about previous ray hits.</p>
<p>First off, what is attenuation? Within the context of rendering and light transport, attenuation is when light is progressively absorbed within a medium, which results in a decrease in light intensity as one goes further and further into a medium away from a light source. One simple example is deep water: near the surface, most of the light that has entered the water remains unabsorbed, and so the light intensity is fairly high and the water is fairly clear. Going deeper and deeper into the water though, more and more light is absorbed and the water becomes darker and darker. Clear objects gain color when light at different wavelengths is attenuated at different rates. Combined with scattering, attenuation is a major contributing property to the look of transmissive/refractive materials in real life.</p>
<p>Attenuation is described using the <a href="https://en.wikipedia.org/wiki/Beer%E2%80%93Lambert_law">Beer-Lambert Law</a>. The part of the Beer-Lambert Law we are concerned with is the definition of transmittance:</p>
<div>\[ T = \frac{\Phi_{e}^{t}}{\Phi_{e}^{i}} = e^{-\tau}\]</div>
<p>The above equation states that the transmittance of a material is equal to the transmitted radiant flux over the received radiant flux, which in turn is equal to e raised to the power of the negative of the optical depth. If we assume uniform attenuation within a medium, the Beer-Lambert law can be expressed in terms of an attenuation coefficient μ and a path length ℓ through the medium as:</p>
<div>\[ T = e^{-\mu\ell} \]</div>
<p>From these expressions, we can see that light is absorbed exponentially as distance into an absorbing medium increases. Returning to building a BSDF system, supporting attenuation therefore means having to know not just the current intersection point and differential geometry, but also the distance a ray has traveled since the <em>previous</em> intersection point. Also, if the medium’s attenuation rate is not constant, then an attenuating BSDF not only needs to know the distance since the previous intersection point, but also has to sample along the incoming ray at some stepping increment and calculate the attenuation at each step. In other words, supporting attenuation requires BSDFs to know the previous hit point in addition to the current one and also requires BSDFs to be able to ray march from the previous hit point to the current one.</p>
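<p>To make the two evaluation modes above concrete, here is a minimal, self-contained sketch of both the uniform Beer-Lambert case and the ray marched case. This is illustrative C++ rather than Takua a0.5’s actual implementation; the Vec3 type and the mu callback are assumptions for the example.</p>
<pre><code>#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

// Uniform medium: per-channel transmittance T = exp(-mu * distance).
Vec3 uniformTransmittance(const Vec3& mu, float distance) {
    return { std::exp(-mu.x * distance),
             std::exp(-mu.y * distance),
             std::exp(-mu.z * distance) };
}

// Non-uniform medium: ray march from the previous hit point to the current hit
// point, accumulating optical depth tau, then take T = exp(-tau) per channel.
// MuFunc is any callable returning the attenuation coefficient at a point.
template <typename MuFunc>
Vec3 rayMarchedTransmittance(const Vec3& start, const Vec3& dir, float distance,
                             float stepSize, MuFunc mu) {
    Vec3 tau = { 0.0f, 0.0f, 0.0f };
    for (float t = 0.0f; t < distance; t += stepSize) {
        float dt = std::min(stepSize, distance - t);
        Vec3 p = { start.x + dir.x * t, start.y + dir.y * t, start.z + dir.z * t };
        Vec3 m = mu(p);
        tau.x += m.x * dt;
        tau.y += m.y * dt;
        tau.z += m.z * dt;
    }
    return { std::exp(-tau.x), std::exp(-tau.y), std::exp(-tau.z) };
}
</code></pre>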
<p>Adding previous hit information and ray march support to my BSDF system was a very straightforward task. I also added volumetric data support to Takua, allowing for the following attenuation test with a glass Stanford Dragon filled with a checkerboard red and blue medium. The red and blue medium is ray marched through to calculate the total attenuation. Note how the thinner parts of the dragon allow more light through resulting in a lighter appearance, while thicker parts of the dragon attenuate more light resulting in a darker appearance. Also note the interesting red and blue caustics below the dragon:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jun/dragon_vcm.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jun/preview/dragon_vcm.jpg" alt="Glass Stanford Dragon filled with a red and blue volumetric checkerboard attenuating medium. Rendered in Takua a0.5 using VCM." /></a></p>
<p>Things got much more complicated once I added support for what I call “deep attenuation”- that is, attenuation through multiple mediums embedded inside of each other. A simple example is an ice cube floating in a glass of liquid, which one might model in the following way:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jun/fluid_diagram.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jun/preview/fluid_diagram_small.png" alt="Diagram of glass-fluid-ice interfaces. Arrows indicate normal directions." /></a></p>
<p>There are two things in the above diagram that make deep attenuation difficult to implement. First, note that the ice cube is modeled without a corresponding void in the liquid- that is, a ray path that travels through the ice cube records a sequence of intersection events that goes something like “enter water, enter ice cube, exit ice cube, exit water”, as opposed to “enter water, exit water, enter ice cube, exit ice cube, enter water, exit water”. Second, note that the liquid boundary is actually slightly <em>inside</em> of the inner wall of the glass. Intuitively, this may seem like a mistake or an odd property, but this is actually the correct way to model a liquid-glass interface in computer graphics- see <a href="http://adaptivesamples.com/2013/10/19/fluid-in-a-glass/">this article</a> and <a href="http://www.aversis.be/tutorials/vray/vray-20-glass-liquid-02.htm">this other article</a> for details on why.</p>
<p>So why do these two cases complicate things? As a ray enters each new medium, we need to know what medium the ray is in so that we can execute the appropriate BSDF and get the correct attenuation for that medium. We can only evaluate the attenuation once the ray <em>exits</em> the medium, since attenuation is dependent on how far through the medium the ray traveled. The easy solution is to simply remember what the BSDF is when a ray enters a medium, and then use the remembered BSDF to evaluate attenuation upon the next intersection. For example, imagine the following sequence of intersections:</p>
<ol>
<li>Intersect glass upon entering glass.</li>
<li>Intersect glass upon exiting glass.</li>
<li>Intersect water upon entering water.</li>
<li>Intersect water upon exiting water.</li>
</ol>
<p>This sequence of intersections is easy to evaluate. The evaluation would go something like:</p>
<ol>
<li>Enter glass. Store glass BSDF.</li>
<li>Exit glass. Evaluate attenuation from stored glass BSDF.</li>
<li>Enter water. Store water BSDF.</li>
<li>Exit water. Evaluate attenuation from stored water BSDF.</li>
</ol>
<p>So far so good. However, remember that in the first case, sometimes we might not have a surface intersection to mark that we’ve exited one medium before entering another. The following scenario demonstrates how this first case results in missed attenuation evaluations:</p>
<ol>
<li>Intersect water upon entering water.</li>
<li>Exit water, but no intersection!</li>
<li>Intersect ice upon entering ice.</li>
<li>Intersect ice upon exiting ice.</li>
<li>Enter water again, but no intersection either!</li>
<li>Intersect water upon exiting water.</li>
</ol>
<p>The evaluation sequence, however, does not play out correctly:</p>
<ol>
<li>Enter water. Store water BSDF.</li>
<li>Exit water, but no intersection. No BSDF evaluated.</li>
<li>Enter ice. Intersection occurs, so evaluate attenuation from stored water BSDF. Store ice BSDF.</li>
<li>Exit ice. Evaluate attenuation from stored ice BSDF.</li>
<li>Enter water again, but no intersection, so no BSDF stored.</li>
<li>Exit water… but there is no previous BSDF stored! No attenuation is evaluated!</li>
</ol>
<p>Alternatively, in step 6, instead of no previous BSDF, we might still have the ice BSDF stored and evaluate attenuation based on the ice. However, this result is still wrong, since we’re now using the ice BSDF for the water.</p>
<p>One simple solution to this problem is to keep a stack of previously seen BSDFs with each ray instead of just storing the previously seen BSDF. When the ray enters a medium through an intersection, we push a BSDF onto the stack. When the ray exits a medium through an intersection, we evaluate whatever BSDF is on the top of the stack and pop the stack. Keeping a stack works well for the previous example case:</p>
<ol>
<li>Enter water. Push water BSDF on stack.</li>
<li>Exit water, but no intersection. No BSDF evaluated.</li>
<li>Enter ice. Intersection occurs, so evaluate water BSDF from top of stack. Push ice BSDF on stack.</li>
<li>Exit ice. Evaluate ice BSDF from top of stack. Pop ice BSDF off stack.</li>
<li>Enter water again, but no intersection, so no BSDF stored.</li>
<li>Exit water. Intersection occurs, so evaluate water BSDF from top of stack. Pop water BSDF off stack.</li>
</ol>
<p>Excellent, we now have evaluated different medium attenuations in the correct order, haven’t missed any evaluations or used the wrong BSDF for a medium, and as we exit the water and ice our stack is now empty as it should be. The first case from above is now solved… what happens with the second case though? Imagine the following sequence of intersections where the liquid boundary is inside of the two glass boundaries:</p>
<ol>
<li>Intersect glass upon entering glass.</li>
<li>Intersect water upon entering water.</li>
<li>Intersect glass upon exiting glass.</li>
<li>Intersect water upon exiting water.</li>
</ol>
<p>The evaluation sequence using a stack is:</p>
<ol>
<li>Enter glass. Push glass BSDF on stack.</li>
<li>Enter water. Evaluate glass attenuation from top of stack. Push water BSDF.</li>
<li>Exit glass. Evaluate water attenuation from top of stack, pop water BSDF.</li>
<li>Exit water. Evaluate glass attenuation from top of stack, pop glass BSDF.</li>
</ol>
<p>The evaluation sequence is once again in the wrong order- we just used the glass attenuation when we were traveling through water at the end! Solving this second case requires a modification to our stack based scheme. Instead of popping the top of the stack every time we exit a medium, we should scan the stack from the top down and pop the first instance of a BSDF matching the BSDF of the surface we just exited through. This modified stack results in:</p>
<ol>
<li>Enter glass. Push glass BSDF on stack.</li>
<li>Enter water. Evaluate glass attenuation from top of stack. Push water BSDF.</li>
<li>Exit glass. Evaluate water attenuation from top of stack. Scan stack and find first glass BSDF matching the current surface’s glass BSDF and pop that BSDF.</li>
<li>Exit water. Evaluate water attenuation from top of stack. Scan stack and pop first matching water BSDF.</li>
</ol>
<p>At this point, I should mention that pushing/popping onto the stack should only occur when a ray travels <em>through</em> a surface. When the ray simply reflects off of a surface, an intersection has occurred and therefore attenuation from the top of the stack should still be evaluated, but the stack itself should not be modified. This way, we can support diffuse inter-reflections inside of an attenuating medium and get the correct diffuse inter-reflection <em>with</em> attenuation between diffuse bounces! Using this modified stack scheme for attenuation evaluation, we can now correctly handle all deep attenuation cases and embed as many attenuating mediums in each other as we could possibly want.</p>
<p>…or at least, I think so. I plan on running more tests before conclusively deciding this all works. So there may be a followup to this post later if I have more findings.</p>
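<p>For concreteness, here is a rough sketch of the modified stack bookkeeping described above. The Bsdf type and the class and function names here are stand-ins for illustration, not Takua a0.5’s actual code; the key detail is that exiting a medium scans from the top of the stack down and pops the first matching entry.</p>
<pre><code>#include <iterator>
#include <memory>
#include <vector>

struct Bsdf; // assumed to provide attenuation evaluation elsewhere

class MediumStack {
public:
    // Called when a ray passes through a surface into a medium. Push/pop only
    // happen on transmission; pure reflections leave the stack untouched.
    void enter(const std::shared_ptr<Bsdf>& bsdf) { m_stack.push_back(bsdf); }

    // The medium the ray is currently traveling through: the top of the stack.
    // Attenuation since the previous hit is evaluated against this BSDF.
    std::shared_ptr<Bsdf> current() const {
        return m_stack.empty() ? nullptr : m_stack.back();
    }

    // Called when a ray passes through a surface out of a medium. Scan from the
    // top down and pop the first entry matching the surface's BSDF, so that
    // interleaved boundaries (e.g. the glass/water overlap) resolve correctly.
    void exit(const std::shared_ptr<Bsdf>& bsdf) {
        for (auto it = m_stack.rbegin(); it != m_stack.rend(); ++it) {
            if (*it == bsdf) {
                m_stack.erase(std::next(it).base());
                return;
            }
        }
    }

private:
    std::vector<std::shared_ptr<Bsdf>> m_stack;
};
</code></pre>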
<p>A while back, I <a href="http://blog.yiningkarlli.com/2014/01/flip-simulator.html">wrote a PIC/FLIP fluid simulator</a>. However, at the time, Takua Renderer didn’t have attenuation support, so I wound up <a href="http://blog.yiningkarlli.com/2014/02/flip-meshing-pipeline.html">rendering my simulations with Vray</a>. Now that Takua a0.5 has robust deep attenuation support, I went back and used some frames from my fluid simulator as tests. The image at the top of this post is a simulation frame from my fluid simulator, rendered entirely with Takua a0.5. The water is set to attenuate red and green light more than blue light, resulting in the blue appearance of the water. In addition, the glass has a slight amount of hazy green attenuation too, much like real aquarium glass. As a result, the glass looks greenish from the ends of each glass plate, but is clear when looking through each plate, again much like real glass. Here are two more renders:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jun/fluid.0.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jun/preview/fluid.0.jpg" alt="Simulated using PIC/FLIP in Ariel, rendered in Takua a0.5 using VCM." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jun/fluid.1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jun/preview/fluid.1.jpg" alt="Simulated using PIC/FLIP in Ariel, rendered in Takua a0.5 using VCM." /></a></p>
https://blog.yiningkarlli.com/2015/05/complex-room-renders.html
Complex Room Renders
2015-05-30T00:00:00+00:00
2015-05-30T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2015/May/room_angle1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/May/preview/room_angle1.jpg" alt="Rendered in Takua a0.5 using VCM. Model credits in the post below." /></a></p>
<p>I realize I have not posted in some weeks now, which means I still haven’t gotten around to writing up Takua a0.5’s architecture and VCM integrator. I’m hoping to get to that once I’m finished with my thesis work. In the meantime, here are some more pretty pictures rendered using Takua a0.5.</p>
<p>A few months back, I made a high-complexity scene designed to test Takua a0.5’s capability for handling “real-world” workloads. The scene was also designed to have an extremely difficult illumination setup. The scene is an indoor room that is lit primarily from outside through glass windows. Yes, the windows are actually modeled as geometry with a glass BSDF! This means everything seen in these renders is being lit primarily through caustics! Of course, no real production scene would be set up in this manner, but I chose this difficult setup specifically to test the VCM integrator. There is a secondary source of light from a metal cylindrical lamp, but this light source is also difficult since the actual light is emitted from a sphere light inside of a reflective metal cylinder that blocks primary visibility from most angles.</p>
<p>The flowers and glass vase are the same ones from my earlier <a href="http://blog.yiningkarlli.com/2015/02/flower-vase-render.html">Flower Vase Renders post</a>. The original flowers and vase are by <a href="https://www.behance.net/andi_mix">Andrei Mikhalenko</a>, with custom textures of my own. The amazing, colorful Takua poster on the back wall is by my good friend <a href="http://alice-yang.tumblr.com/">Alice Yang</a>. The two main furniture pieces are by <a href="http://odesd2.com.ua/ru">ODESD2</a>, and the Braun SK4 record player model is by one of my favorite archviz artists, <a href="http://bertrand-benoit.com/">Bertrand Benoit</a>. The teapot is, of course, the famous Utah teapot. All textures, shading, and other models are my own.</p>
<p>As usual, all depth of field is completely in-camera and in-renderer. Also, all BSDFs in this scene are fairly complex; there is not a single simple diffuse surface anywhere in the scene! Instancing is used very heavily; the wicker baskets, notebooks, textbooks, chess pieces, teacups, and tea dishes are all instanced from single pieces of geometry. The floorboards are individually modeled but not instanced, since they all vary in length and slightly in width.</p>
<p>A few more pretty renders, all rendered in Takua a0.5 using VCM:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/May/room_angle4.png"><img src="https://blog.yiningkarlli.com/content/images/2015/May/preview/room_angle4.jpg" alt="Closeup of Braun SK4 record player with DOF. Rendered using VCM." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/May/room_angle6.png"><img src="https://blog.yiningkarlli.com/content/images/2015/May/preview/room_angle6.jpg" alt="Flower vase and tea set. Rendered using VCM" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/May/room_angle7.png"><img src="https://blog.yiningkarlli.com/content/images/2015/May/preview/room_angle7.jpg" alt="Floorboards, textbooks, and rough metal bin with DOF. The book covers are entirely made up. Rendered using VCM." /></a></p>
https://blog.yiningkarlli.com/2015/05/note-on-images.html
Note On Images
2015-05-23T00:00:00+00:00
2015-05-23T00:00:00+00:00
Yining Karl Li
<p>Just a quick note on images on this blog. So far, I’ve generally been embedding full resolution, losslessly compressed PNG format images in the blog. I prefer having the full resolution, lossless images available on the blog since they are the exact output from my renderer. However, full resolution lossless PNGs can get fairly large (several MB for a single 1920x1080 frame), which is dragging down the load times for the blog.</p>
<p>Going forward, I’ll be embedding lossy compressed JPG images in blog posts, but the JPGs will link through to the full resolution, lossless PNG originals. Fortunately, high quality JPG compression is quite good these days at fitting an image with nearly imperceptible compression differences into a much smaller footprint. I’ll also be going back and applying this scheme to old posts too at some point.</p>
<hr />
<p><strong>Addendum 04/08/2016</strong>: Now that I am doing some renders in 4K resolution (3840x2160), it’s time for an addendum to this policy.
I won’t be uploading full resolution lossless PNGs for 4K images, due to the overwhelming file size (>30MB for a single image, which means a post with just a handful of 4K images can easily add up to hundreds of MB).
Instead, for 4K renders, I will embed a downsampled 1080P JPG image in the post, and link through to a 4K JPG compressed to balance image quality and file size.</p>
https://blog.yiningkarlli.com/2015/04/hyperion.html
Hyperion
2015-04-24T00:00:00+00:00
2015-04-24T00:00:00+00:00
Yining Karl Li
<p>Just a quick update on future plans. Starting in July, I’m going to be working full time for <a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a> as a software engineer on their custom, in-house <a href="http://www.fxguide.com/featured/disneys-new-production-renderer-hyperion-yes-disney/">Hyperion Renderer</a>. I couldn’t be more excited about working with everyone on the Hyperion team; ever since the <a href="https://disney-animation.s3.amazonaws.com/uploads/production/publication_asset/70/asset/Sorted_Deferred_Shading_For_Production_Path_Tracing.pdf">Sorted Deferred Shading paper</a> was published two years ago, I’ve thought that the Hyperion team is doing some of the most interesting work there is in the rendering field right now.</p>
<p>I owe an enormous thanks to everyone that’s advised and supported and encouraged me to continue exploring the rendering and graphics world. Thanks, Joe, Don, Peter, Tony, Mark, Christophe, Amy, Fran, Gabriel, Harmony, and everyone else!</p>
<p>Normally as a rule I only post images to this blog that I made or have a contribution to, but this time I’ll make an exception. Here’s one of my favorite stills from Big Hero 6, rendered entirely using Hyperion and lit by Angela McBride, a friend from PUPs 2011! Images like this one are an enormous source of inspiration to me, so I absolutely can’t wait to get started at Disney and help generate more gorgeous imagery like this!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Apr/BH6_still_Baymaxhug.jpg"><img src="https://blog.yiningkarlli.com/content/images/2015/Apr/preview/BH6_still_Baymaxhug.jpg" alt="A still from Big Hero 6, rendered entirely using Hyperion. Property of Walt Disney Animation Studios." /></a></p>
https://blog.yiningkarlli.com/2015/03/bsdf-system.html
BSDF System
2015-03-23T00:00:00+00:00
2015-03-23T00:00:00+00:00
Yining Karl Li
<p>Takua a0.5’s BSDF system was particularly interesting to build, especially because in previous versions of Takua Renderer, I never really had a good BSDF system. Previously, my BSDFs were written in a pretty ad-hoc way and were somewhat hardcoded into the pathtracing integrator, which made BSDF extensibility very difficult and multi-integrator support nearly impossible without significant duplication of BSDF code. In Takua a0.5, I’ve written a new, extensible, modularized BSDF system that is inspired by <a href="http://mitsuba-renderer.org">Mitsuba</a> and <a href="https://renderman.pixar.com/resources/current/RenderMan/bxdfRef.html">Renderman 19/RIS</a>. In this post, I’ll write about how Takua a0.5’s BSDF system works and show some pretty test images generated during development with some interesting models and props.</p>
<p>First, here’s a still-life sort of render showcasing a number of models with interesting materials, all using Takua a0.5’s BSDF system and rendered using my VCM integrator. All of the renders in this post are rendered either using my BDPT integrator or my VCM integrator.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/still_life.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/still_life.jpg" alt="Still-life scene with a number of interesting, complex materials created using Takua a0.5's BSDF system. The chess pieces and notebooks make use of instancing. Rendered in Takua a0.5 using VCM." /></a></p>
<p>BSDFs in Takua a0.5 are designed to support bidirectional evaluation and importance sampling natively. Basically, this means that all BSDFs need to implement five basic functions. These five basic functions are:</p>
<ul>
<li>Evaluate, which takes input and output directions of light and a normal, and returns the BSDF weight, cosine of the angle of the input direction, and color absorption of the scattering event. Evaluate can also optionally return the probability of the output direction given the input direction, with respect to solid angle.</li>
<li>CalculatePDFW, which takes the input and output directions of light and a normal, and returns the forward probability of the output direction given the input direction. In order to make the BSDF operate bidirectionally, this function also needs to be able to return the backwards probability if the input and output are reversed.</li>
<li>Sample, which takes in an input direction, a normal, and a random number generator and returns an output direction, the BSDF weight, the forward probability of the output direction, and the cosine of the input angle.</li>
<li>IsDelta, which returns true if the BSDF’s probability distribution function is a <a href="http://en.wikipedia.org/wiki/Dirac_delta_function">Dirac delta function</a> and false otherwise. This attribute is important for allowing BDPT and VCM to handle perfectly specular BSDFs correctly, since perfectly specular BSDFs are something of a special case.</li>
<li>GetContinuationProbability, which takes in an input direction and normal and returns the probability of continuing a ray path at this BSDF. This function is used for Russian Roulette early path termination.</li>
</ul>
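<p>As a rough sketch, an interface along these lines might look something like the following in C++. The exact types and signatures here are illustrative assumptions on my part rather than Takua a0.5’s real headers, but the five entry points map one-to-one onto the list above.</p>
<pre><code>struct Vec3 { float x, y, z; };
struct Rng;         // assumed random number generator

struct BsdfSample { // assumed bundle of Sample()'s outputs
    Vec3 outputDir;
    Vec3 weight;
    float forwardPdfW;
    float cosInputAngle;
};

class Bsdf {
public:
    virtual ~Bsdf() = default;

    // BSDF weight, cosine of the input angle, and color absorption for a given
    // input/output pair; can optionally also report the solid-angle pdf.
    virtual Vec3 evaluate(const Vec3& input, const Vec3& output, const Vec3& normal,
                          float* cosInput, float* pdfW) const = 0;

    // Forward pdf of the output direction given the input direction; must also
    // be able to report the backward pdf when input and output are reversed.
    virtual float calculatePdfW(const Vec3& input, const Vec3& output,
                                const Vec3& normal, bool reverse) const = 0;

    // Draw an output direction plus its weight, forward pdf, and cosine term.
    virtual BsdfSample sample(const Vec3& input, const Vec3& normal, Rng& rng) const = 0;

    // True if the underlying pdf is a Dirac delta (perfectly specular).
    virtual bool isDelta() const = 0;

    // Russian roulette continuation probability for paths arriving here.
    virtual float getContinuationProbability(const Vec3& input, const Vec3& normal) const = 0;
};
</code></pre>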
<p>In order to be correct and bidirectional, each of these functions should return results that agree with the other functions. For example, taking the output direction generated by Sample and calling Evaluate with the Sample output direction should produce the same color absorption and forward probability and other attributes as Sample. Sample, Evaluate, and CalculatePDFW are all very similar functions and often can share a large amount of common code, but each one is tailored to a slightly different purpose. For example, Sample is useful for figuring out a new random ray direction along a ray path, while Evaluate is used for calculating BSDF weights while importance sampling light sources.</p>
<p>Small note: I wrote that these five functions all take in a normal, which is technically all they need in terms of differential geometry. However, in practice, passing in a surface point and UV and other differential geometry information is very useful since that allows for various properties to be driven by 2D and 3D textures. In Takua a0.5, I pass in a normal, surface point, UV coordinate, and a geom and primitive ID for future PTEX support, and allow every BSDF attribute to be driven by a texture.</p>
<p>One of the test props I made is the <a href="http://www.pbrt.org/">PBRT book</a>, since I thought rendering the Physically Based Rendering book with a physically based renderer and physically based shading would be amusing. The base diffuse color is driven by a texture map, and the interesting rippling and variation in the glossiness of the book cover come from driving additional gloss and specular properties with more texture maps.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/pbrt.jpg" alt="Physically Based Rendering book, rendered with my physically based renderer. Note the texture-driven gloss and specular properties. Rendered using BDPT." /></a></p>
<p>In order to be physically correct, BSDFs should also fulfill the following three properties:</p>
<ul>
<li>Positivity, meaning that the return value of the BSDF should always be positive or equal to 0.</li>
<li>Helmholtz Reciprocity, which means the return value of the BSDF should not be changed by switching the input and output directions (although switching the input and output CAN change how things are calculated internally, such as in perfectly specular refractive materials).</li>
<li>Energy Conservation, meaning the surface cannot reflect more light than arrives.</li>
</ul>
<p>At the moment, my base BSDFs are not actually the best physically based BSDFs in the world… I just have Lambertian diffuse, normalized Blinn-Phong, and Fresnel-based perfectly specular reflection/refraction. At a later point I’m planning on adding Beckmann and Disney’s Principled BSDF, and possibly others such as GGX and Ward. However, for the time being, I can still create highly complex and interesting materials because of the modular nature of Takua a0.5’s BSDF system; one of the most powerful uses of this modular system is combining base BSDFs into more complex BSDFs. For example, I have another BSDF called FresnelPhong, which internally calls normalized Blinn-Phong BSDF but also calls the Fresnel code from my Fresnel specular BSDF to account for an output direction with the Fresnel effect with glossy surfaces. Since the Fresnel specular BSDF handles refractive materials, FresnelPhong allows for creating glossy transmissive surfaces such as frosted glass (albeit not as accurate to reality as one would get with Beckmann or GGX).</p>
<p>Another one of my test props is a glass chessboard, where half of the pieces and board squares are using frosted glass. Needless to say, this scene is very difficult to render using unidirectional pathtracing. I only have one model of each chess piece type, and all of the pieces on the board are instances with varying materials per instance.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/chessboard_0.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/chessboard_0.jpg" alt="Chessboard with ground glass squares and clear glass squares. Rendered using BDPT." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/chessboard_1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/chessboard_1.jpg" alt="Chessboard with ground glass and clear glass pieces. Rendered using BDPT." /></a></p>
<p>Another interesting use of modular BSDFs and embedding BSDFs inside of other BSDFs is in implementing bump mapping. Takua a0.5 implements bump mapping as a simple BSDF wrapper that calculates the bump mapped normal and passes that normal into whatever the underlying BSDF is. This approach allows for any BSDF to have a bump map, and even allows for applying multiple bump maps to the same piece of geometry. In addition to specifying bump maps as wrapper BSDFs, Takua a0.5 also allows attaching bump maps to individual geometry so that the same BSDF can be reused with a number of different bump maps attached to a number of different geometries, but under the hood this system works exactly the same as the BSDF wrapper bump map.</p>
<p>This notebook prop’s leathery surface detail comes entirely from a BSDF wrapper bump map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/notebook.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/notebook.jpg" alt="Notebook with a leathery surface. All surface detail comes from bump mapping. Rendered using BDPT." /></a></p>
<p>Finally, one of the most useful and interesting features of Takua a0.5’s BSDF system is the layered BSDF. The layered BSDF is a special BSDF that allows arbitrary combining, layering, and mixing between different BSDFs, much like Vray’s BlendMtl or Renderman 19/RIS’s LM shader system. Any BSDF can be used as a layer in a layered BSDF, including entire other layered BSDF networks. The Takua layered BSDF consists of a base substrate BSDF, and an arbitrary number of coat layers on top of the substrate. Each coat is given a texture-driven weight which determines how much of the final output BSDF is from the current coat layer versus from all of the layers and substrate below the current coat layer. Since the weight for each coat layer must be between 0 and 1, the resulting layered BSDF maintains physical correctness as long as all of the component BSDFs are also physically correct. Practically, the layered BSDF is implemented so that with each iteration, only one of the component BSDFs is evaluated and sampled, with the particular component BSDF per iteration chosen randomly based on each component BSDF’s weighting.</p>
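<p>One plausible way to implement the per-iteration selection described above is to walk the coat layers from the top down, keeping each coat with probability equal to its texture-driven weight and otherwise falling through toward the substrate. The following is only a sketch of that reading; the names are illustrative and not Takua a0.5’s actual layered BSDF code.</p>
<pre><code>#include <memory>
#include <random>
#include <vector>

struct Bsdf;

struct CoatLayer {
    std::shared_ptr<Bsdf> bsdf;
    float weight; // in [0, 1], already looked up from its texture for this point
};

// Pick exactly one component BSDF for this iteration: each coat, from the top
// down, is chosen with probability equal to its weight; if no coat is chosen,
// the substrate underneath everything is used.
const Bsdf* selectLayer(const std::vector<CoatLayer>& coatsTopDown,
                        const Bsdf* substrate, std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    for (const CoatLayer& coat : coatsTopDown) {
        if (uniform(rng) < coat.weight) {
            return coat.bsdf.get();
        }
    }
    return substrate;
}
</code></pre>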
<p>The layered BSDF system is what allows the creation of truly interesting and complex materials, since objects in reality often have complex materials consisting of a number of different scattering event types. For example, a real object may have a diffuse base with a glossy clear coat, but there may also be dust and fingerprints on top of the clear coat contributing to the final appearance. The globe model seen in my adaptive sampling post uses a complex layered BSDF; the base BSDF is ground glass, with the continents layered on top as a perfectly specular mirror BSDF, and then an additional dirt and fingerprints layer on top made up of diffuse and varying glossy BSDFs:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_0.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/globe_0.jpg" alt="Glass globe using Takua's layered BSDF system. The globe has a base ground glass layer, a mirror layer for continents, and a dirt/fingerprints layer for additional detail. Rendered using VCM." /></a></p>
<p>Here’s an additional close-up render of the globe that better shows off some of the complex surface detail:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/globe_1.jpg" alt="Close-up of the globe. Rendered using VCM." /></a></p>
<p>Going forward, I’m planning on adding a number of better BSDFs to Takua a0.5 (as mentioned before). Since the BSDF system is so modular and extensible, adding new BSDFs should be relatively simple and should require little to no additional work to integrate into the renderer. Because of how I designed BSDF wrappers, any new BSDF I add will automatically work with the bump map BSDF wrapper and the layered BSDF system. I’m also planning on adding interesting effects to the refractive/transmission BSDF, such as absorption based on Beer’s law and spectral diffraction.</p>
<p>After I finish work on my thesis, I also intend on adding more complex materials for subsurface scattering and volume rendering. These additions will be much more involved than just adding GGX or Beckmann, but I have a rough roadmap for how to proceed and I’ve already built a lot of supporting infrastructure into Takua a0.5. The plan for now is to implement a unified SSS/volume system based on the <a href="http://www.cs.dartmouth.edu/~wjarosz/publications/krivanek14upbp.html">Unified Points, Beams, and Paths</a> presented at SIGGRAPH 2014. UPBP can be thought of as extending VCM to combine a number of different volumetric rendering techniques. I can’t wait to get started on that over the summer!</p>
https://blog.yiningkarlli.com/2015/03/adaptive-sampling.html
Adaptive Sampling
2015-03-18T00:00:00+00:00
2015-03-18T00:00:00+00:00
Yining Karl Li
<p>Adaptive sampling is a relatively small and simple but very powerful feature, so I thought I’d write briefly about how adaptive sampling works in Takua a0.5. Before diving into the details though, I’ll start with a picture. The scene I’ll be using for comparisons in this post is a globe of the Earth, made of a polished ground glass with reflective metal insets for the landmasses and with a rough scratched metal stand. The globe is on a white backdrop and is lit by two off-camera area lights. The following render is the fully converged reference baseline for everything else in the post, rendered using VCM:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_globe_baseline_vcm.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_globe_baseline_vcm.jpg" alt="Fully converged reference baseline. Rendered in Takua a0.5 using VCM." /></a></p>
<p>As <a href="http://blog.yiningkarlli.com/2015/02/bidirectional-pathtracing-integrator.html">mentioned before</a>, in pathtracing based renderers, we solve the path integral through Monte Carlo sampling, which gives us an estimate of the total integral per sample thrown. As we throw more and more samples at the scene, we get a better and better estimate of the total integral, which explains why pathtracing based integrators start out producing a noisy image but eventually converge to a nice, smooth image if enough rays are traced per pixel.</p>
<p>In a naive renderer, the number of samples traced per pixel is usually just a fixed number, equal for all pixels. However, not all parts of the image are necessarily equally difficult to sample; for example, in the globe scene, the backdrop should require fewer samples than the ground glass globe to converge, and the ground glass globe in turn should require fewer samples than the two caustics on the ground. This observation means that a fixed sampling strategy can potentially be quite wasteful. Instead, computation can be used much more efficiently if the sampling strategy can adapt and drive more samples towards pixels that require more work to converge, while driving fewer samples towards pixels that have already converged mid-render. Such a sampler can also be used to automatically stop the renderer once the sampler has detected that the entire render has converged, without needing user guesswork for how many samples to use.</p>
<p>The following image is the same globe scene as above, but limited to 5120 samples per pixel using bidirectional pathtracing and a fixed sampler. Note that most of the image is reasonably converged, but there is still noise visible in the caustics:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/fixed_globe_bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/fixed_globe_bdpt.jpg" alt="Fixed sampling, 5120 samples per pixel, BDPT." /></a></p>
<p>Since it may be difficult to see the difference between this image and the baseline image on smaller screens, here is a close-up crop of the same caustic area between the two images:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_fixed_baseline_comparison.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_fixed_baseline_comparison.png" alt="500% crop. Left: converged baseline render. Right: fixed sampling, 5120 samples per pixel, BDPT." /></a></p>
<p>The difficult part of implementing an adaptive sampler is, of course, figuring out a metric for convergence. The <a href="http://www.pbrt.org/">PBRT book</a> presents a very simple adaptive sampling strategy on page 388 of the 2nd edition: for each pixel, generate some minimum number of initial samples and record the radiances returned by each initial sample. Then, take the average of the luminances of the returned radiances, and compute the contrast between each initial sample’s radiance and the average luminance. If any initial sample has a contrast from the average luminance above some threshold (say, 0.5), generate more samples for the pixel up until some maximum number of samples per pixel is reached. If all of the initial samples have contrasts below the threshold, then the sampler can mark the pixel as finished and move onto the next pixel. The idea behind this strategy is to try to eliminate fireflies, since fireflies result from statistically improbable samples that are significantly above the true value of the pixel.</p>
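<p>A minimal sketch of that per-pixel test might look like the following, assuming the initial samples have already been reduced to scalar luminances; the function name and the way the threshold is passed in are illustrative rather than PBRT’s exact code.</p>
<pre><code>#include <cmath>
#include <vector>

// Returns true if the pixel should receive more samples: any initial sample
// whose contrast against the average luminance exceeds the threshold flags
// the pixel as not yet converged.
bool needsMoreSamples(const std::vector<float>& sampleLuminances,
                      float contrastThreshold) {
    if (sampleLuminances.empty()) {
        return true;
    }
    float average = 0.0f;
    for (float l : sampleLuminances) {
        average += l;
    }
    average /= static_cast<float>(sampleLuminances.size());
    if (average <= 0.0f) {
        return false; // all-black pixel; nothing left to resolve
    }
    for (float l : sampleLuminances) {
        float contrast = std::abs(l - average) / average;
        if (contrast > contrastThreshold) {
            return true;
        }
    }
    return false;
}
</code></pre>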
<p>The PBRT adaptive sampler works decently, but has a number of shortcomings. First, the need to draw a large number of samples per pixel simultaneously makes this approach less than ideal for progressive rendering; while this approach is well suited to a bucketed renderer, a progressive renderer prefers to draw a small number of samples per pixel per iteration, and return to each pixel to draw more samples in subsequent iterations. In theory, the PBRT adaptive sampler could be made to work with a progressive renderer if sample information was stored from each iteration until enough samples were accumulated to run an adaptive sampling check, but this approach would require storing a lot of extra information. Second, while the PBRT approach can guarantee some degree of per-pixel variance minimization, each pixel isn’t actually aware of what its neighbours look like, meaning that there still can be visual noise across the image. A better, global approach would have to take into account neighbouring pixel radiance values as a second check for whether or not a pixel is sufficiently sampled.</p>
<p>My first attempt at a global approach (the test scene in this post is a globe, but that pun was not intended) was to simply have the adaptive sampler check the contrast of each pixel with its immediate neighbours. Every N samples, the adaptive sampler would pull the accumulated radiances buffer and flag each pixel as unconverged if the pixel has a contrast greater than some threshold from at least one of its neighbours. Pixels marked unconverged are sampled for N more iterations, while pixels marked as converged are skipped for the next N iterations. After another N iterations, the adaptive sampler would go back and reflag every pixel, meaning that a pixel previously marked as converged could be reflagged as unconverged if its neighbours changed enormously. Generally N should be a rather large number (say, 128 samples per pixel), since doing convergence checks is meaningless if the image is too noisy at the time of the check.</p>
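<p>Sketched in code, the flagging pass might look something like this, with the accumulated radiances reduced to per-pixel luminances; the exact contrast metric and the four-neighbour choice here are my own illustrative assumptions rather than the renderer’s actual code.</p>
<pre><code>#include <algorithm>
#include <cmath>
#include <vector>

// Flag each pixel as unconverged if its contrast against any immediate
// neighbour exceeds the threshold. Run every N iterations over the
// accumulated radiance buffer.
std::vector<bool> flagUnconverged(const std::vector<float>& luminance,
                                  int width, int height, float threshold) {
    std::vector<bool> unconverged(luminance.size(), false);
    const int offsets[4][2] = { { 1, 0 }, { -1, 0 }, { 0, 1 }, { 0, -1 } };
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float center = luminance[y * width + x];
            for (const auto& o : offsets) {
                int nx = x + o[0];
                int ny = y + o[1];
                if (nx < 0 || nx >= width || ny < 0 || ny >= height) {
                    continue;
                }
                float neighbour = luminance[ny * width + nx];
                float mean = 0.5f * (center + neighbour);
                float contrast = std::abs(center - neighbour) / std::max(mean, 1e-6f);
                if (contrast > threshold) {
                    unconverged[y * width + x] = true;
                    break;
                }
            }
        }
    }
    return unconverged;
}
</code></pre>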
<p>Using this strategy, I got the following image, which was set to run for a maximum of 5120 samples per pixel but wound up averaging 4500 samples per pixel, or about a 12.1% reduction in samples needed:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perpixel_globe_bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_perpixel_globe_bdpt.jpg" alt="Adaptive sampling per pixel, average 4500 samples per pixel, BDPT." /></a></p>
<p>At an initial glance, this looks pretty good! However, as soon as I examined where the actual samples went, I realized that this strategy doesn’t work. The following image is a heatmap showing where samples were driven, with brighter areas indicating more samples per pixel:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perpixel.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_perpixel.jpg" alt="Sampling heatmap for adaptive sampling per pixel. Brighter areas indicate more samples." /></a></p>
<p>Generally, my per-pixel adaptive sampler did correctly identify the caustic areas as needing more samples, but a problem becomes apparent in the backdrop areas: the per-pixel adaptive sampler drove samples evenly within clustered “chunks”, but not evenly <em>across</em> different clusters. This behavior happens because while the per-pixel sampler is now taking into account variance across neighbours, it still doesn’t have any sort of global sense across the entire image! Instead, the sampler is finding localized pockets where variance seems even across pixels, but those pockets can be quite disconnected from further out areas. While the resultant render looks okay at a glance, clustered variance patterns become apparent if the image contrast is increased:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perpixel_globe_bdpt_highcontrast.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_perpixel_globe_bdpt_highcontrast.jpg" alt="Adaptive sampling per pixel, with enhanced contrast. Note the local clustering artifacts." /></a></p>
<p>Interestingly, these artifacts are reminiscent of the artifacts that show up in not-fully-converged Metropolis Light Transport renders. This similarity makes sense, since in both cases they arise from uneven localized convergence.</p>
<p>The next approach that I tried is a more global approach adapted from <a href="http://jo.dreggn.org/home/2009_stopping.pdf">Dammertz et al.’s paper, “A Hierarchical Automatic Stopping Condition for Monte Carlo Global Illumination”</a>. For the sake of simplicity, I’ll refer to the approach in this paper as Dammertz for the rest of this post. Dammertz works by considering the variance across an entire block of pixels at once and flagging the entire block as converged or unconverged, allowing for much more global analysis. At the first variance check, the only block considered is the entire image as one enormous block; if the total variance <em>e<sub>b</sub></em> in the entire block is below a termination threshold <em>e<sub>t</sub></em>, the block is flagged as converged and no longer needs to be sampled further. If <em>e<sub>b</sub></em> is greater than <em>e<sub>t</sub></em> but still less than a splitting threshold <em>e<sub>s</sub></em>, then the block will be split into two non-overlapping child blocks for the next round of variance checking after N iterations have passed. At each variance check, this process is repeated for each block, meaning the image eventually becomes split into an ocean of smaller blocks. Blocks are kept inside of a simple unsorted list, require no relational information to each other, and are removed from the list once marked as converged, making the memory requirements very simple. Blocks are split along their major axis, with the exact split point chosen to keep error as equal as possible across the two sides of the split.</p>
<p>The actual variance metric used is also very straightforward; instead of trying to calculate an estimate of variance based on neighbouring pixels, Dammertz stores two framebuffers: one buffer I for all accumulated radiances so far, and a second buffer A for accumulated radiances from every other iteration. As the image approaches full convergence, the differences between I and A should shrink, so an estimation of variance can be found simply by comparing radiance values between I and A. The specific details and formulations can be found in section 2.1 of the paper.</p>
<p>I made a single modification to the paper’s algorithm: I added a lower bound to the block size. Instead of allowing blocks to split all the way to a single pixel, I stop splitting after a block reaches 64 pixels in an 8x8 square. I found that splitting down to single pixels could sometimes cause false positives in convergence flagging, leading to missed pixels, similar to the PBRT approach. Forcing blocks to stop splitting at 64 pixels means there is a chance of false negatives for convergence, but a small amount of unnecessary oversampling is preferable to undersampling.</p>
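<p>As a simplified illustration of the I/A comparison, a per-block error estimate and block classification might look like the following; this uses a plain mean relative difference rather than the exact weighting from section 2.1 of the paper, and the buffer layout, names, and thresholds are assumptions for the example.</p>
<pre><code>#include <algorithm>
#include <cmath>
#include <vector>

struct Block {
    int x0, y0, x1, y1; // pixel bounds, [x0, x1) x [y0, y1)
};

// Estimate a block's error by comparing the full accumulation buffer I against
// the every-other-iteration buffer A (both stored here as per-pixel luminance).
// As the render converges, I and A approach each other and the error shrinks.
float blockError(const Block& b, const std::vector<float>& bufferI,
                 const std::vector<float>& bufferA, int imageWidth) {
    float error = 0.0f;
    int numPixels = 0;
    for (int y = b.y0; y < b.y1; ++y) {
        for (int x = b.x0; x < b.x1; ++x) {
            int i = y * imageWidth + x;
            error += std::abs(bufferI[i] - bufferA[i]) / std::max(bufferI[i], 1e-6f);
            ++numPixels;
        }
    }
    return numPixels > 0 ? error / static_cast<float>(numPixels) : 0.0f;
}

enum class BlockState { Converged, Split, KeepSampling };

// Below the termination threshold the block is done; between the termination
// and splitting thresholds the block is split into two children; otherwise it
// simply keeps sampling as one block.
BlockState classifyBlock(float error, float terminateThreshold, float splitThreshold) {
    if (error < terminateThreshold) return BlockState::Converged;
    if (error < splitThreshold) return BlockState::Split;
    return BlockState::KeepSampling;
}
</code></pre>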
<p>Using this per-block adaptive sampler, I got the following image, which again is superficially extremely similar to the fixed sampler result. This render was also set to run for a maximum of 5120 samples, but wound up averaging just 2920 samples per pixel, or about a 42.9% reduction in samples needed:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perblock_globe_bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_perblock_globe_bdpt.jpg" alt="Adaptive sampling per block, average 2920 samples per pixel, BDPT." /></a></p>
<p>The sample heatmap looks good too! The heatmap shows that the sampler correctly identified the caustic and highlight areas as needing more samples, and doesn’t have clustering issues in areas that needed fewer samples:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perblock.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perblock.png" alt="Sampling heatmap for adaptive sampling per block. Brighter areas indicate more samples." /></a></p>
<p>Boosting the image contrast shows that the image is free of local clustering artifacts and noise is even across the entire image, which is what we would expect:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perblock_globe_bdpt_highcontrast.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_perblock_globe_bdpt_highcontrast.jpg" alt="Adaptive sampling per block, with enhanced contrast. Note the even noise spread and lack of local clustering artifacts." /></a></p>
<p>Looking at the same 500% crop area as earlier, the adaptive per-block and fixed sampling renders are indistinguishable:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_fixed_adaptive_comparison.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_fixed_adaptive_comparison.png" alt="500% crop. Left: fixed sampling, 5120 samples per pixel, BDPT. Right: adaptive per-block sampling, average 2920 samples per pixel, BDPT." /></a></p>
<p>So with that, I think Dammertz works pretty well! Also, the computational and memory overhead required for the Dammertz approach is basically negligible relative to the actual rendering process. This approach is the one that is currently in Takua a0.5.</p>
<p>I actually have an additional adaptive sampling trick designed specifically for targeting fireflies. This additional trick works in conjunction with the Dammertz approach. However, this post is already much longer than I originally planned, so I’ll save that discussion for a later post. I’ll also be getting back to the PPM/VCM posts in my series of integrator posts shortly; I have not had much time to write on my blog since the vast majority of my time is currently focused on my thesis, but I’ll try to get something posted soon!</p>
https://blog.yiningkarlli.com/2015/02/flower-vase-render.html
Flower Vase Renders
2015-02-27T00:00:00+00:00
2015-02-27T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/flowers.cam2.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/flowers.cam2.jpg" alt="Rendered in Takua a0.5 using BDPT. Nearly a quarter of a billion triangles." /></a></p>
<p>In order to test Takua a0.5, I’ve been using my renderer on some quick little “pretty picture” projects. I recently ran across a fantastic flower vase model by artist <a href="https://www.behance.net/andi_mix">Andrei Mikhalenko</a> and used Andrei’s model as the basis for a shading exercise. The above and following images are rendered entirely in Takua a0.5 using bidirectional pathtracing. I textured and shaded everything using Takua a0.5’s layered material system, and also made some small modifications to the model (moved some flowers around, extended the stems to the bottom of the vase, and thickened the bottom of the vase). Additionally, I further subdivided the flower petals to gain additional detail and smoothness, meaning the final rendered model weighs in at nearly a quarter of a billion triangles. Obviously using such heavy models is not practical for a single prop in real world production, but I wanted to push the amount of geometry my renderer can handle. Overall, total memory usage for each of these renders hovered around 10.5 GB. All images were rendered at 1920x1080 resolution; click on each image to see the full resolution renders.</p>
<p>For the flowers, I split all of the flowers into five randomly distributed groups and assigned each group a different flower material. Each material is a two-sided material with a different BSDF assigned to each side, with side determined by the surface normal direction. For each flower, the outside BSDF has a slightly darker reflectance than the inner BSDF, which efficiently approximates the subsurface scattering effect real flowers have, but without actually having to use subsurface scattering. In this case, using a two-sided material to fake the effect of subsurface scattering is desirable since the model is so complex and heavy. Also, the stems and branches are all bump mapped.</p>
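<p>As an illustration of the two-sided idea, the per-hit side selection might look something like this; the names and the simple dot-product test are assumptions for the sketch, not Takua a0.5’s actual material system.</p>
<pre><code>struct Vec3 { float x, y, z; };
struct Bsdf;

float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Pick which side's BSDF to shade with: a ray arriving against the surface
// normal sees the outer (slightly darker) side of the petal, while a ray
// arriving along the normal sees the inner side.
const Bsdf* selectSide(const Vec3& normal, const Vec3& rayDirection,
                       const Bsdf* outerBsdf, const Bsdf* innerBsdf) {
    return dot(normal, rayDirection) < 0.0f ? outerBsdf : innerBsdf;
}
</code></pre>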
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/flowers.cam0.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/flowers.cam0.jpg" alt="Rendered in Takua a0.5 using BDPT. Note the complex caustics from the vase and water." /></a></p>
<p>This set of renders was a good test for bidirectional pathtracing because of the complex nature of the caustics in the vase and water; note that the branches inside of the vase and water cannot be efficiently rendered by unidirectional pathtracing since they are in glass and therefore cannot directly sample the light sources. The scene is lit by a pair of rectlights, one warmer and one cooler in temperature. This lighting setup, combined with the thick glass and water volume at the bottom of the vase, produces some interesting caustic on the ground beneath the vase.</p>
<p>The combination of the complex caustics and the complex geometry in the bouquet itself meant that a fairly deep maximum ray path length was required (16 bounces). Using BDPT helped immensely with resolving the complex bounce lighting inside of the bouquet, but the caustics proved to be difficult for BDPT; in all of these renders, everything except for the caustics converged within about 30 minutes on a quad-core Intel Core i7 machine, but the caustics took a few hours to converge in the top image, and a day to converge for the second image. I’ll discuss caustic performance in BDPT compared to PPM and VCM in some upcoming posts.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/flowers.cam1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/flowers.cam1.jpg" alt="Rendered in Takua a0.5 using BDPT. Depth of field and circular bokeh entirely in-camera." /></a></p>
<p>All depth of field is completely in-camera and in-renderer as well. No post processed depth of field whatsoever! For the time being, Takua a0.5 only supports circular apertures and therefore only circular bokeh, but I plan on adding custom aperture shapes after I finish my thesis work. In general, I think that testing my own renderer with plausibly real-world production quality scenes is very important. After all, having just a toy renderer with pictures of spheres is not very fun… the whole point of a renderer is to generate some really pretty pictures! For my next couple of posts, I’m planning on showing some more complex material/scene tests, and then moving onto discussing the PPM and VCM integrators in Takua.</p>
<p>Addendum: I should comment on the memory usage a bit more, since some folks have expressed interest in what I’m doing there. By default, the geometry actually weighs in closer to 30 GB in memory usage, so I had to implement some hackery to get this scene to fit in memory on a 16 GB machine. The hack is really simple: I added an optional half-float mode for geometry storage. In practice, using half-floats for geometry is usually not advisable due to precision loss, but in this particular scene, that precision loss becomes more acceptable due to a combination of depth of field hiding most alignment issues closer to camera, and sheer visual complexity making other alignment issues hard to spot without looking too closely. Additionally, for the flowers I also threw away all of the normals and recompute them on the fly at render-time. Recomputing normals on the fly results in a small performance hit, but is vastly preferable to going out of core.</p>
https://blog.yiningkarlli.com/2015/02/multiple-importance-sampling.html
Multiple Importance Sampling
2015-02-13T00:00:00+00:00
2015-02-13T00:00:00+00:00
Yining Karl Li
<p>A key tool introduced by Veach as part of his bidirectional pathtracing formulation is multiple importance sampling (MIS). As discussed in my <a href="">previous post</a>, the entire purpose of rendering from a mathematical perspective is to solve the light transport equation, which in the case of all pathtracing type renderers means solving the path integral formulation of light transport. Since the path integral does not have a closed form solution in all but the simplest of scenes, we have to estimate the full integral using various sampling techniques in path space, hence unidirectional pathtracing and bidirectional pathtracing and metropolis based techniques, etc. As we saw with the light source in glass case and with SDS paths, often a single path sampling technique is not sufficient for capturing a good estimate of the path integral. Instead, a good estimate often requires a combination of a number of different path sampling techniques; MIS is a critical mechanism for combining multiple sampling techniques in a manner that reduces total variance. Without MIS, directly combining sampling techniques through averaging can often have the opposite effect and <em>increase</em> total variance.</p>
<p>The following image is a recreation of the test scene used in the Veach thesis to demonstrate MIS. The scene consists of four glossy bars going from less glossy at the top to more glossy at the bottom, and four sphere lights of increasing size. The smallest sphere light has the highest emission intensity, and the largest sphere light has the lowest emission. I modified the scene to add in a large rectangular area light off camera on each side of the scene, and I added an additional bar to the bottom of the scene with gloss driven by a texture map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach.bdpt.jpg" alt="Figure 1: Recreation of the Veach MIS test scene. Rendered in Takua." /></a></p>
<p>The above scene is difficult to render using any single path sampling technique because of the various combinations of surface glossiness and emitter size/intensity. For combinations of a large emitter and a high-gloss (narrow) BSDF lobe, importance sampling by the BSDF tends to result in lower variance. In this case, the reason is that a given random ray direction is more likely to hit the large light than it is to fall within the narrow BSDF lobe, so matching the sample distribution to the BSDF lobe is more efficient. However, for combinations of a small emitter and a low-gloss (broad) lobe, the reverse is true. If we take the standard Veach scene and sample only by BSDF and then only by light source, we can see how each strategy fails in different cases. Both of these renders would eventually converge if left to render for long enough, but the rate of convergence in difficult areas would be extremely slow:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach_bsdfsample.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach_bsdfsample.bdpt.jpg" alt="Figure 2: BSDF sampling only, 64 iterations." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach_lightsample.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach_lightsample.bdpt.jpg" alt="Figure 3: Light sampling only, 64 iterations." /></a></p>
<p>MIS allows us to combine <em>m</em> different sampling strategies to produce a single unbiased estimator by weighting each sampling strategy by its probability distribution function (pdf). Mathematically, this is expressed as:</p>
<div>\[ \langle I_{j} \rangle_{MIS} = \sum_{i=1}^{m} \frac{1}{n_{i}} \sum_{j=1}^{n_{i}} w_{i}(X_{i,j}) \frac{f(X_{i,j})}{p_{i}(X_{i,j})} \]</div>
<p>where <em>X<sub>i,j</sub></em> are independent random variables drawn from some distribution function <em>p<sub>i</sub></em> and <em>w<sub>i</sub>(X<sub>i,j</sub>)</em> is some heuristic for weighting each sampling technique with respect to its pdf. The reason MIS is able to significantly lower variance is that a good MIS weighting function dampens contributions with low pdfs. The Veach thesis presents two good weighting heuristics, the <em>power heuristic</em> and the <em>balance heuristic</em>. The power heuristic is defined as:</p>
<div>\[ w_{i}(x) = \frac{[n_{i}p_{i}(x)]^{\beta}}{\sum_{k=1}^{m}[n_{k}p_{k}(x)]^{\beta}}\]</div>
<p>The power heuristic states that the weight for a given sampling technique should be the pdf of the sampling technique raised to a power <em>β</em> divided by the sum of the pdfs of all considered sampling techniques, with each sampling technique also raised to <em>β</em>. For the power heuristic, <em>β</em> is typically set to 2. The balance heuristic is simply the power heuristic for <em>β</em>=1. In the vast majority of cases, the balance heuristic is a near optimal weighting heuristic (and the power heuristic can cover most remaining edge cases), assuming that the base sampling strategies are decent to begin with.</p>
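<p>To make the weighting heuristics concrete, here is a small sketch of the balance and power heuristics for the common two-technique case (for example, BSDF sampling versus light sampling). This is an illustrative helper in the style found in most pathtracers, not Takua’s actual implementation:</p>
<pre><code>// Sketch of the balance heuristic and the power heuristic (beta = 2) for two
// sampling techniques. nA/nB are the number of samples drawn with each
// technique; pdfA/pdfB are the pdfs of the same sample under each technique.
inline float balanceHeuristic(float nA, float pdfA, float nB, float pdfB) {
    return (nA * pdfA) / (nA * pdfA + nB * pdfB);
}

inline float powerHeuristic(float nA, float pdfA, float nB, float pdfB) {
    float a = nA * pdfA;
    float b = nB * pdfB;
    return (a * a) / (a * a + b * b);
}
</code></pre>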
<p>For the standard Veach MIS demo scene, the best result is obtained by using MIS to combine BSDF and light sampling. The following image is the Veach scene again, this time rendered using MIS with 64 iterations. Note that all highlights are now roughly equally converged and the entire image matches the reference render above, apart from noise:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach_bothsample.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach_bothsample.bdpt.jpg" alt="Figure 4: Light and BSDF sampling combined using MIS, 64 iterations." /></a></p>
<p>By itself, BDPT does not necessarily have an improved convergence rate over vanilla unidirectional pathtracing; BDPT gains its significant edge in convergence rate only once MIS is applied, since BDPT’s efficiency comes from being able to extract a large number of path sampling techniques out of a single bidirectional path. To demonstrate the impact of MIS on BDPT, I rendered the following images using BDPT with and without MIS. The scene is a standard Cornell Box, but I replaced the back wall with a more complex scratched, glossy surface. The first image is the fully converged ground truth render, followed by with and without MIS:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/gloss_groundtruth.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/gloss_groundtruth.jpg" alt="Figure 5: Cornell Box with scratched glossy back wall. Rendered using BDPT with MIS." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/gloss_mis.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/gloss_mis.bdpt.jpg" alt="Figure 6: BDPT with MIS, 16 iterations." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/gloss_nomis.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/gloss_nomis.bdpt.jpg" alt="Figure 7: BDPT without MIS, 16 iterations." /></a></p>
<p>As seen above, the version of BDPT without MIS is significantly less converged. BDPT without MIS will still converge to the correct solution, but in practice can often be only as good as, or even worse than, unidirectional pathtracing.</p>
<p>Later on, we’ll discuss MIS beyond bidirectional pathtracing. In fact, MIS is the critical component to making VCM possible!</p>
<hr />
<p><strong>Addendum 01/12/2018</strong>: A reader noticed some brightness inconsistencies in the original versions of Figures 2 and 3, which came from bugs in Takua’s light sampling code without MIS at the time.
I have replaced the original versions of Figures 1, 2, 3, and 4 with new, correct versions rendered using the current version of Takua as of the time of writing for this addendum.</p>
<p>Because of how much noise there is in Figures 2 and 3, it might be slightly harder to see that they converge to the reference image.
To make the convergence clearer, I rendered out each sampling strategy using 1024 samples per pixel, instead of just 64:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach_bsdfsample.bdpt.hq.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach_bsdfsample.bdpt.hq.jpg" alt="Figure 8: BSDF sampling only, 1024 iterations." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach_lightsample.bdpt.hq.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach_lightsample.bdpt.hq.jpg" alt="Figure 9: Light sampling only, 1024 iterations." /></a></p>
<p>Note how Figures 8 and 9 match Figure 1 exactly, aside from noise.
In Figure 9, the reflection of the right-most sphere light on the top-most bar is still extremely noisy because of the extreme difficulty of finding a random light sample that happens to produce a valid BSDF response for the near-perfect specular lobe.</p>
<p>One last minor note: I’m leaving the main text of this post unchanged, but the updated renders use Takua’s modern shading system instead of the old one from 2015; in the new shading system, the metal bars use roughness instead of gloss, and use GGX instead of a normalized Phong variant.</p>
https://blog.yiningkarlli.com/2015/02/bidirectional-pathtracing-integrator.html
Bidirectional Pathtracing Integrator
2015-02-11T00:00:00+00:00
2015-02-11T00:00:00+00:00
Yining Karl Li
<p>As part of Takua a0.5’s complete rewrite, I implemented the <a href="https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2012/georgiev_sa2012/georgiev_sa2012.pdf">vertex connection and merging</a> (VCM) light transport algorithm. Implementing VCM was one of the largest efforts of this rewrite, and is probably the single feature that I am most proud of. Since VCM subsumes bidirectional pathtracing and progressive photon mapping, I also implemented Veach-style bidirectional pathtracing (BDPT) with multiple importance sampling (MIS) and Toshiya Hachisuka’s stochastic progressive photon mapping (SPPM) algorithm. Since each one of these integrators is fairly complex and interesting by themselves, I’ll be writing a series of posts on my BDPT and SPPM implementations before writing about my full VCM implementation. My plan is for each integrator to start with a longer post discussing the algorithm itself and show some test images demonstrating interesting properties of the algorithm, and then follow up with some shorter posts detailing specific tricky or interesting pieces and also show some pretty real-world production-plausible examples of when each algorithm is particularly useful.</p>
<p>As usual, we’ll start off with an image. Of course, all images in this post are rendered entirely using Takua a0.5. The following image is a Cornell Box lit by a textured sphere light completely encased in a glass sphere, rendered using my bidirectional pathtracer integrator. For reasons I’ll discuss a bit later in this post, this scene belongs to a whole class of scenes at which unidirectional pathtracing is absolutely abysmal; these scenes require a bidirectional integrator to converge in any reasonable amount of time:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/spherelight.bdpt.jpg" alt="Room lit with a textured sphere light enclosed in a glass sphere, converged result rendered using bidirectional pathtracing." /></a></p>
<p>To understand why BDPT is a more robust integrator than unidirectional pathtracing, we need to start by examining the light transport equation and its path integral formulation. The light transport equation was <a href="http://dl.acm.org/citation.cfm?id=15902">introduced by Kajiya</a> and is typically presented using the formulation from <a href="https://graphics.stanford.edu/papers/veach_thesis/">Eric Veach’s thesis</a>:</p>
<div>\[ L_{\text{o}}(\mathbf{x},\, \omega_{\text{o}}) \,=\, L_e(\mathbf{x},\, \omega_{\text{o}}) \ +\, \int_{\mathcal{S}^2} L_{\text{o}}(\mathbf{x}_\mathcal{M}(\mathbf{x},\, \omega_{i}),\, -\omega_{i}) \, f_s(\mathbf{x},\, \omega_{i} \rightarrow \omega_{\text{o}}) \, d \sigma_{\mathbf{x}}^{\perp} (\omega_{i}) \]</div>
<p>Put into words instead of math, the light transport equation simply states that the amount of light leaving any point is equal to the amount of light emitted at that point plus the total amount of light arriving at that point from all directions, weighted by the surface reflectance and absorption at that point. Combined with later extensions to account for effects such as volume scattering, subsurface scattering, and diffraction, the light transport equation serves as the basis for all of modern physically based rendering. In order to solve the light transport equation in a practical manner, Veach presents the path integral formulation of light transport:</p>
<div>\[ I_{j} = \int_{\Omega}^{} L_{e}(\mathbf{x}_{0})G(\mathbf{x}_{0}\leftrightarrow \mathbf{x}_{1})[\prod_{i=1}^{k-1}\rho(\mathbf{x}_{i})G(\mathbf{x}_{i}\leftrightarrow \mathbf{x}_{i+1})]W_{e}(\mathbf{x}_{k}) d\mu(\bar{\mathbf{x}}) \]</div>
<p>The path integral states that for a given pixel on an image, the amount of radiance arriving at that pixel is the integral of all radiance coming in through all paths in path space, where a path is the route taken by an individual photon from the light source through the scene to the camera/eye/sensor, and path space simply encompasses all possible paths. Since there is no closed form solution to the path integral, the goal of modern physically based ray-tracing renderers is to sample a representative subset of path space in order to produce a reasonably accurate estimate of the path integral per pixel; progressive renderers estimate the path integral piece by piece, producing a better and better estimate of the full integral with each new iteration.</p>
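<p>The progressive part is conceptually very simple; as a hedged sketch (names are illustrative, this is not Takua’s actual framebuffer code), each iteration’s per-pixel estimate just gets folded into a running average:</p>
<pre><code>// Minimal sketch of progressive estimation: each iteration produces one new
// per-pixel estimate of the path integral, and the framebuffer keeps a running
// average that converges towards the true integral as iterations accumulate.
#include <vector>
#include <glm/glm.hpp>

void accumulateIteration(std::vector<glm::vec3>& framebuffer,
                         const std::vector<glm::vec3>& iterationResult,
                         int iterationNumber) {  // 1-based iteration count
    for (size_t i = 0; i < framebuffer.size(); ++i) {
        // Incremental mean: avg_n = avg_(n-1) + (x_n - avg_(n-1)) / n
        framebuffer[i] += (iterationResult[i] - framebuffer[i]) /
                          float(iterationNumber);
    }
}
</code></pre>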
<p>At this point, we should take a brief detour to discuss the terms “unbiased” versus “biased” rendering. Within the graphics world, there’s a lot of confusion and preconceptions about what each of these terms mean. In actuality, an unbiased rendering algorithm is simply one where the expected value of each iteration’s estimate is the correct answer for the particular piece of path space being sampled, meaning the only error present is variance. A biased rendering algorithm is one where each iteration’s estimate carries some systematic error in addition to variance. However, biased algorithms are not necessarily a bad thing; a biased algorithm can still be consistent, that is, it converges in the limit to the same result as an unbiased algorithm. So in practice, we should care less about whether or not an algorithm is biased or unbiased so long as it is consistent. BDPT is an unbiased, consistent integrator whereas SPPM is a biased but still consistent integrator.</p>
<p>Going back to the path integral, we can quickly see where unidirectional pathtracing comes from once we view light transport through the path integral. The most obvious way to evaluate the path integral is to do exactly as the path integral says: trace a path starting from a light source, through the scene, and if the path eventually hits the camera, accumulate the radiance along the path. This approach is one form of unidirectional pathtracing that is typically referred to as light tracing (LT). However, since the camera is a fairly small target for paths to hit, unidirectional pathtracing is typically implemented going in reverse: start each path at the camera, and trace through the scene until each path hits a light source or goes off into empty space and is lost. This approach is called backwards pathtracing and is what people usually are referring to when they use the term pathtracing (PT).</p>
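<p>A heavily simplified sketch of the backwards pathtracing loop looks something like the following. The <code>Scene</code>, <code>Ray</code>, <code>Hit</code>, and <code>Rng</code> interfaces here are hypothetical stand-ins, and direct light importance sampling and Russian roulette are left out to keep the loop minimal:</p>
<pre><code>// Sketch of backwards (camera-first) pathtracing for one pixel sample.
#include <cmath>
#include <glm/glm.hpp>

glm::vec3 tracePath(Ray ray, const Scene& scene, Rng& rng, int maxDepth) {
    glm::vec3 L(0.0f);           // accumulated radiance
    glm::vec3 throughput(1.0f);  // running product of BSDF * cos / pdf
    for (int depth = 0; depth < maxDepth; ++depth) {
        Hit hit;
        if (!scene.intersect(ray, hit)) {
            break;  // path went off into empty space and is lost
        }
        L += throughput * hit.emittedRadiance(-ray.dir);  // hit a light source
        glm::vec3 wi;
        float pdf;
        glm::vec3 f = hit.bsdf().sample(-ray.dir, hit.n, rng, wi, pdf);
        if (pdf <= 0.0f) {
            break;
        }
        throughput *= f * std::abs(glm::dot(wi, hit.n)) / pdf;
        ray = Ray(hit.p, wi);  // continue the path in the sampled direction
    }
    return L;
}
</code></pre>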
<p>As I discussed a few years back in <a href="http://blog.yiningkarlli.com/2013/04/importance-sampled-direct-lighting.html">a previous post</a>, pathtracing with direct light importance sampling is pretty efficient at a wide variety of scene types. However, pathtracing with direct light importance sampling will fail for any type of path where the light source cannot be directly sampled; we can easily construct a number of plausible, common setups where this situation occurs. For example, imagine a case where a light source is completely enclosed within a glass container, such as a glowing filament within a glass light bulb. In this case, for any pair consisting of a point in space and a point on the light source, the direction vector to hit the light point from the surface point through glass is not just the light point minus the surface point normalized, but instead has to be at an angle such that the path hits the light point after refracting through glass. Without knowing the exact angle required to make this connection beforehand, the probability of a random direct light sample direction arriving at the glass interface at the correct angle is extremely small; this problem is compounded if the light source itself is very small to begin with.</p>
<p>Taking the sphere light in a glass sphere scene from earlier, we can compare the result of pathtracing without glass around the light versus with glass around the light. The following comparison shows 16 iterations each, and we can see that the version with glass around the light is significantly less converged:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_16_yesglass.pt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_16_yesglass.pt.png" alt="Pathtracing, 16 iterations, with glass sphere." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_16_noglass.pt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/spherelight_16_noglass.pt.jpg" alt="Pathtracing, 16 iterations, without glass sphere." /></a></p>
<p>Generally, pathtracing is terrible at resolving caustics, and the glass-in-light scenario is one where all illumination within the scene is through caustics. Conversely, light tracing is quite good at handling caustics and can be combined with direct sensor importance sampling (same idea as direct light importance sampling, just targeting the camera/eye/sensor instead of a light source). However, light tracing in turn is bad at handling certain scenarios that pathtracing can handle well, such as small distant spherical lights.</p>
<p>The following image again shows the sphere light in a glass sphere scene, but is now rendered for 16 iterations using light tracing. Note how the render is significantly more converged, for approximately the same computational cost. The glass sphere and sphere light render as black since in light tracing, the camera cannot be directly sampled from a specular surface.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_16_yesglass.lt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/spherelight_16_yesglass.lt.jpg" alt="Light tracing, 16 iterations, with glass sphere." /></a></p>
<p>Since bidirectional pathtracing subsumes both pathtracing and light tracing, I implemented pathtracing and light tracing simultaneously and used each integrator as a check on the other, since correct integrators should converge to the same result. Implementing light tracing requires BSDFs and emitters to be a bit more robust than in vanilla pathtracing; emitters have to support both emission and illumination, and BSDFs have to support bidirectional evaluation. Light tracing also requires the ability to directly sample the camera and intersect the camera’s image plane to figure out what pixel to contribute a path to; as such, I implemented a rasterize function for my thin-lens and fisheye camera models. My thin-lens camera’s rasterization function supports the same depth of field and bokeh shape capabilities that the thin-lens camera’s raycast function supports.</p>
<p>The key insight behind bidirectional pathtracing is that since light tracing and vanilla pathtracing each have certain strengths and weaknesses, combining the two sampling techniques should result in a more robust path sampling technique. In BDPT, for each pixel per iteration, a path is traced starting from the camera and a second path is traced starting from a point on a light source. The two paths are then joined into a single path, conditional on an unoccluded line of sight from the end vertices of the two paths to each other. A BDPT path of length <em>k</em> with <em>k+1</em> vertices can then be used to generate up to <em>k+2</em> path sampling techniques by connecting each vertex on each subpath to every other vertex on the other subpath. While BDPT per iteration is much more expensive than unidirectional pathtracing, the much larger number of sampling techniques leads to a significantly higher convergence rate that typically outweighs the higher computational cost.</p>
<p>Below is the same scene as above rendered with 16 iterations of BDPT, and also rendered with the same amount of computation time as the earlier pathtraced comparisons (which works out to about 5 iterations of BDPT). Note how with just 5 iterations, the BDPT result with the glass sphere has about the same level of noise as the pathtraced result for 16 iterations <em>without</em> the glass sphere. At 16 iterations, the BDPT result with the glass sphere is noticeably more converged than the pathtraced result for 16 iterations without the glass sphere.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_16_yesglass.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/spherelight_16_yesglass.bdpt.jpg" alt="BDPT, 16 iterations, with glass sphere." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_5_yesglass.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/spherelight_5_yesglass.bdpt.jpg" alt="BDPT, 5 iterations (same compute time as 16 iterations pathtracing), with glass sphere." /></a></p>
<p>A naive implementation of BDPT would be, for each pixel per iteration, to trace a full light subpath, store the result, trace a full camera subpath, store the result, and then perform the connection operations between each vertex pair. However, since this approach requires storing the entirety of both subpaths for the entire iteration, there is room for some improvement. For Takua a0.5, my implementation stores only the full light subpath. At each bounce of the camera subpath, my implementation connects the current vertex to each vertex of the stored light subpath, weights and accumulates the result, and then moves onto the next bounce without having to store previous path vertices.</p>
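<p>As a hedged sketch of that strategy (hypothetical <code>PathVertex</code>, <code>Scene</code>, and helper functions; MIS weighting and visibility testing are hidden inside <code>connectAndWeight</code>), the per-pixel BDPT sample looks roughly like this:</p>
<pre><code>// Sketch: store only the light subpath, then connect to it at every camera
// bounce without ever storing the camera subpath. Paths where the camera
// subpath hits an emitter directly are omitted here for brevity.
#include <cmath>
#include <vector>
#include <glm/glm.hpp>

glm::vec3 bdptSample(const Scene& scene, const Ray& cameraRay, Rng& rng,
                     int maxDepth) {
    // 1. Trace and keep the full light subpath for this pixel and iteration.
    std::vector<PathVertex> lightPath = traceLightSubpath(scene, rng, maxDepth);

    glm::vec3 L(0.0f);
    glm::vec3 throughput(1.0f);
    Ray ray = cameraRay;
    // 2. Walk the camera subpath one bounce at a time.
    for (int t = 1; t <= maxDepth; ++t) {
        Hit hit;
        if (!scene.intersect(ray, hit)) break;
        PathVertex cameraVertex = makeCameraVertex(hit, throughput);

        // Connect this camera vertex to every stored light subpath vertex;
        // each (s, t) pairing is one of the sampling techniques MIS combines.
        for (size_t s = 0; s < lightPath.size(); ++s) {
            L += connectAndWeight(scene, lightPath[s], cameraVertex);
        }

        // Continue the camera subpath; previous camera vertices are not kept.
        glm::vec3 wi;
        float pdf;
        glm::vec3 f = hit.bsdf().sample(-ray.dir, hit.n, rng, wi, pdf);
        if (pdf <= 0.0f) break;
        throughput *= f * std::abs(glm::dot(wi, hit.n)) / pdf;
        ray = Ray(hit.p, wi);
    }
    return L;
}
</code></pre>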
<p>The following image is another example of a scene that BDPT is significantly better at sampling than any unidirectional pathtracing technique. The scene consists of a number of diffuse spheres and spherical lights inside of a glass bunny. In this scene, everything outside of the bunny is being lit using only caustics, while diffuse surfaces inside of the bunny are being lit using a combination of direct lighting, indirect diffuse bounces, and caustics from outside of the bunny reflecting/refracting back <em>into</em> the bunny. This last type of lighting belongs to a category of paths known as <em>specular-diffuse-specular</em> (SDS) paths that are especially difficult to sample unidirectionally.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/bunnylight.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/bunnylight.bdpt.jpg" alt="Various diffuse spheres and sphere lights inside of a glass bunny, rendered using BDPT." /></a></p>
<p>Here is the same scene as above, but with the glass bunny removed just so seeing what is going on with the spheres is a bit easier:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/bunnylight_nobunny.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/bunnylight_nobunny.bdpt.jpg" alt="Same spheres as above, sans bunny. Rendered using BDPT." /></a></p>
<p>Comparing pathtracer versus BDPT performance for 16 iterations, BDPT’s vastly better performance on this scene becomes obvious:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/bunnylight_16.pt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/bunnylight_16.pt.png" alt="16 iterations, rendered using pathtracing." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/bunnylight_16.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/bunnylight_16.bdpt.jpg" alt="16 iterations, rendered using BDPT." /></a></p>
<p>In the next post, I’ll write about multiple importance sampling (MIS), how it impacts BDPT, and my MIS implementation in Takua a0.5.</p>
https://blog.yiningkarlli.com/2015/01/consistent-normal-interpolation.html
Consistent Normal Interpolation
2015-01-30T00:00:00+00:00
2015-01-30T00:00:00+00:00
Yining Karl Li
<p>I recently ran into a problem with interpolated normals. Instead of supporting sphere primitives directly, Takua Rev 5 generates polygon mesh spheres and handles them the same way as any other polygon mesh is handled. However, when I ran a test using a glass sphere, a lot of fireflies appeared:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jan/badnormals.0.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jan/badnormals.0.png" alt="Polygon mesh sphere with heavy firefly artifacts." /></a></p>
<p>The fireflies are an artifact arising from how normal interpolation interacts with specular materials. Since the sphere is a polygonal mesh, normal interpolation is required to give the sphere a smooth appearance instead of a faceted one. The interpolation scheme I was using was vanilla Phong normal interpolation: store a smoothed normal at each vertex, and then calculate the smooth shading normal at each point as the barycentric-coordinate-weighted sum of the smooth normals at each vertex of the current triangle. This works well for most cases, but a problem arises at grazing angles: since the smooth shading normal corresponds not to the actual geometry but to a “virtual” smoothed version of the geometry, sometimes outgoing specular rays will end up going below the tangent plane of the current hit point. Because of this, rays hitting a glass sphere with Phong normal interpolation at a grazing angle can sometimes go the wrong way, hence the artifacts in the above image.</p>
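<p>For reference, here is a quick illustrative sketch (not Takua’s actual code) of vanilla Phong normal interpolation, along with the dot-product test that reveals the failure case described above, where a direction generated using the shading normal actually points below the true geometric surface:</p>
<pre><code>// Sketch of Phong normal interpolation and the below-surface check. This is
// only an illustration of the problem; it is not the consistent normal
// interpolation fix from the paper discussed later in this post.
#include <glm/glm.hpp>

// Barycentric-coordinate-weighted blend of the per-vertex smooth normals.
inline glm::vec3 phongNormal(const glm::vec3& n0, const glm::vec3& n1,
                             const glm::vec3& n2, float u, float v) {
    return glm::normalize(n0 * (1.0f - u - v) + n1 * u + n2 * v);
}

// True when a direction generated from the shading normal points below the
// actual geometric surface, which is what produces wrong-way specular rays
// (and therefore fireflies) at grazing angles.
inline bool goesBelowGeometricSurface(const glm::vec3& dir,
                                      const glm::vec3& geometricNormal) {
    return glm::dot(dir, geometricNormal) <= 0.0f;
}
</code></pre>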
<p>Of course, the closer the actual geometry lines up to the virtual smoothed geometry, the less this grazing angle problem occurs. However, in order to completely eliminate artifacting, the polygon geometry needs to approach the limit of the virtual smoothed geometry. In this render, I regenerated the sphere with two more levels of subdivision. As a result, there are fewer fireflies, but fireflies are still present:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jan/badnormals.1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jan/badnormals.1.png" alt="More heavily subdivided polygon mesh sphere. Fewer but still present firefly artifacts." /></a></p>
<p>Initially I thought about just getting rid of the fireflies by checking pixel intensities and clamping intensities that were significantly brighter than their immediate neighbors, which is a fairly basic/standard firefly reduction strategy. However, since in this case the fireflies occur mostly at grazing angles and therefore on silhouettes, intensity clamping can lead to some unpleasant aliasing on silhouettes.</p>
<p>Fortunately, there was a paper by Alexander Reshetov, Alexei Soupikov, and William R. Mark at SIGGRAPH Asia 2010 about dealing with this exact problem. The paper, <a href="http://dl.acm.org/citation.cfm?id=1866168">“Consistent Normal Interpolation”</a>, presents a simple method for tweaking Phong normal interpolation to guarantee that reflected rays never go below the tangent plane. The method is based on incoming ray direction and the angle between the smooth interpolated normal and true face normal. The actual method presented in the paper is very straightforward to implement, but the derivation of the algorithm is fairly interesting and involves solving a nontrivial optimization problem to find a scaling term.</p>
<p>I implemented a slightly modified version of the algorithm presented on page 5 of the paper. The modification I made is simply to account for rays hitting polygons from below the tangent plane, as in the case of internal refraction. Now interpolated normals at grazing angles no longer produce firefly artifacts:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jan/consistentnormals.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jan/consistentnormals.png" alt="Polygon sphere with consistent normal interpolation. Note the lack of firefly artifacts." /></a></p>
<p>I’m working on writing up a lot of stuff, so more soon! Stay tuned!</p>
https://blog.yiningkarlli.com/2014/12/takua-revision-5.html
Takua Render Revision 5
2014-12-28T00:00:00+00:00
2014-12-28T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/xyzrgb_dragon.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/xyzrgb_dragon.png" alt="Rough blue metallic XYZRGB Dragon model in a Cornell Box, rendered entirely with Takua Render a0.5" /></a></p>
<p>I haven’t posted much at all this past year, but I’ve been working on some stuff that I’m really excited about! For the past year and a half, I’ve been building a new, much more advanced version of Takua Render completely from scratch. In this post, I’ll give a brief introduction and runthrough of the new version of Takua, which I’ve numbered as Revision 5 or a0.5. Since I first started exploring the world of renderer construction a few years back, I’ve learned an immense amount about every part of building a renderer, ranging from low-level architecture all the way up to light transport and surface algorithms. I’ve also been fortunate enough to be able to meet and talk to a lot of people working on professional, industry quality renderers and people from some of the best rendering research groups in the world, and so this new version of my own renderer is an attempt at applying everything I’ve learned and building a base for even further future improvement and research projects.</p>
<p>Very broadly, the two things I’m most proud of with Takua a0.5 are the internal renderer architecture and a lot of work on integrators and light transport. Takua a0.5’s internal architecture is heavily influenced by Disney’s <a href="https://disney-animation.s3.amazonaws.com/uploads/production/publication_asset/70/asset/Sorted_Deferred_Shading_For_Production_Path_Tracing.pdf">Sorted Deferred Shading</a> paper, the internal architecture of <a href="http://graphics.cs.williams.edu/papers/OptiXSIGGRAPH10/Parker10OptiX.pdf">NVIDIA’s Optix engine</a>, and the modular architecture of <a href="https://www.mitsuba-renderer.org/">Mitsuba Render</a>. In the light transport area, Takua a0.5 implements not just unidirectional pathtracing with direct light importance sampling (PT), but also correctly implements multiple importance sampled bidirectional pathtracing (BDPT), progressive photon mapping (PPM), and the relatively new <a href="https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2012/georgiev_sa2012/georgiev_sa2012.pdf">vertex connection and merging</a> (VCM) algorithm. I’m planning on writing a series of posts in the next few weeks/months that will dive in depth into Takua a0.5’s various features.</p>
<p>Takua a0.5 has also marked a pretty large shift in strategy in terms of targeted hardware. In previous versions of Takua, I did a lot of exploration with getting the entire renderer to run on CUDA-enabled GPUs. In the interest of increased architectural flexibility, Takua a0.5 does not have a 100% GPU mode anymore. Instead, Takua a0.5 is structured in such a way that certain individual modules can be accelerated by running on the GPU, but overall much of the core of the renderer is designed to make efficient use of the CPU to achieve high performance while bypassing a lot of the complexity of building a pure GPU renderer. Again, I’ll have a detailed post on this decision later down the line.</p>
<p>Here is a list of the some of the major new things in Takua a0.5:</p>
<ul>
<li>Completely modular plugin system
<ul>
<li>Programmable ray/shader queue/dispatch system</li>
<li>Natively bidirectional BSDF system</li>
<li>Multiple geometry backends optimized for different hardware</li>
<li>Plugin systems for cameras, lights, acceleration structures, geometry, viewers, materials, surface patterns, BSDFs, etc.</li>
</ul>
</li>
<li>Task-based concurrency and parallelism via Intel’s TBB library</li>
<li>Mitsuba/PBRT/Renderman 19 RIS style integrator system
<ul>
<li>Unidirectional pathtracing with direct light importance sampling</li>
<li>Lighttracing with camera importance sampling</li>
<li>Bidirectional pathtracing with multiple importance sampling</li>
<li>Progressive photon mapping</li>
<li>Vertex connection and merging</li>
<li>All integrators designed to be re-entrant and capable of deferred operations</li>
</ul>
</li>
<li>Native animation support
<ul>
<li>Renderer-wide keyframing/animation support</li>
<li>Transformational AND deformational motion blur</li>
<li>Motion blur support for all camera, material, surface pattern, light, etc. attributes</li>
<li>Animation/keyframe sequences can be instanced in addition to geometry instancing</li>
</ul>
</li>
</ul>
<p>The blue metallic XYZRGB dragon image is a render that was produced using only Takua a0.5. Since I now have access to the original physical Cornell Box model, I thought it would be fun to use a 100% measurement-accurate model of the Cornell Box as a test scene while working on Takua a0.5. All of these renders have no post-processing whatsoever. Here are some other renders made as tests during development:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/cornellbox.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/cornellbox.png" alt="Vanilla Cornell Box with measurements taken directly off of the original physical Cornell Box model." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/dragon.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/dragon.png" alt="Glass Stanford Dragon producing some interesting caustics on the floor." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/glassball.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/glassball.png" alt="Floating glass ball as another caustics test." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/mirrorcube.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/mirrorcube.png" alt="Mirror cube." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/animblur.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/animblur.png" alt="Deformational motion blur test using a glass rectangular prism with the top half twisting over time." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/uvbox.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/uvbox.png" alt="A really ugly texture test that for some reason I kind of like." /></a></p>
<p>More interesting non-Cornell Box renders coming in later posts!</p>
<p>Edit: Since making this post, I found a weighting bug that was causing a lot of energy to be lost in indirect diffuse bounces. I’ve since fixed the bug and updated this post with re-rendered versions of all of the images.</p>
https://blog.yiningkarlli.com/2014/11/sky-paper.html
SIGGRAPH Asia 2014 Paper- A Framework for the Experimental Comparison of Solar and Skydome Illumination
2014-11-19T00:00:00+00:00
2014-11-19T00:00:00+00:00
Yining Karl Li
<p>One of the projects I worked on in my first year as part of Cornell University’s <a href="http://graphics.cornell.edu/">Program of Computer Graphics</a> has been published in the November 2014 issue of ACM Transactions on Graphics and is being presented at SIGGRAPH Asia 2014! The paper is “<a href="http://dl.acm.org/citation.cfm?doid=2661229.2661259">A Framework for the Experimental Comparison of Solar and Skydome Illumination</a>”, and the team on the project was my junior advisor <a href="http://www.graphics.cornell.edu/~kiderj/">Joseph T. Kider Jr.</a>, my lab-mates <a href="http://www.danknowlton.com/">Dan Knowlton</a> and <a href="http://www.jeremynewlin.info/">Jeremy Newlin</a>, myself, and my main advisor <a href="http://www.graphics.cornell.edu/people/director.html">Donald P. Greenberg</a>.</p>
<p>The bulk of my work on this project was in implementing and testing sky models inside of <a href="http://www.mitsuba-renderer.org">Mitsuba</a> and developing the paper’s sample-driven model. Interestingly, I also did a lot of climbing onto the roof of Cornell’s Rhodes Hall building for this paper; Cornell’s facilities was kind enough to give us access to the roof of Rhodes Hall to set up our capture equipment on. This usually involved Joe, Dan, and myself hauling multiple tripods and backpacks of gear up onto the roof in the morning, and then taking it all back down in the evening. Sunny clear skies can be a rare sight in Ithaca, so getting good captures took an awful lot of attempts!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Nov/siggraphasia2014paper.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Nov/siggraphasia2014paper.png" alt="" /></a></p>
<p>Here is the paper abstract:</p>
<p><em>The illumination and appearance of the solar/skydome is critical for many applications in computer graphics, computer vision, and daylighting studies. Unfortunately, physically accurate measurements of this rapidly changing illumination source are difficult to achieve, but necessary for the development of accurate physically-based sky illumination models and comparison studies of existing simulation models.</em></p>
<p><em>To obtain baseline data of this time-dependent anisotropic light source, we design a novel acquisition setup to simultaneously measure the comprehensive illumination properties. Our hardware design simultaneously acquires its spectral, spatial, and temporal information of the skydome. To achieve this goal, we use a custom built spectral radiance measurement scanner to measure the directional spectral radiance, a pyranometer to measure the irradiance of the entire hemisphere, and a camera to capture high-dynamic range imagery of the sky. The combination of these computer-controlled measurement devices provides a fast way to acquire accurate physical measurements of the solar/skydome. We use the results of our measurements to evaluate many of the strengths and weaknesses of several sun-sky simulation models. We also provide a measurement dataset of sky illumination data for various clear sky conditions and an interactive visualization tool for model comparison analysis available at http://www.graphics.cornell.edu/resources/clearsky/.</em></p>
<p>The paper and related materials can be found at:</p>
<ul>
<li><a href="http://www.graphics.cornell.edu/resources/clearsky/index.htm">Project Page (Preprint paper, supplemental materials, and SIGGRPAGH Asia materials)</a></li>
<li><a href="http://dl.acm.org/citation.cfm?doid=2661229.2661259">Official Print Version (ACM Library)</a></li>
</ul>
<p>Joe Kider will be presenting the paper at <a href="http://sa2014.siggraph.org/en/">SIGGRAPH Asia 2014</a> in Shenzhen as part of the <a href="http://sa2014.siggraph.org/en/attendees/technical-papers.html?view=session&type=techpapers&sessionid=3">Light In, Light Out</a> Technical Papers session. Hopefully our data will prove useful to future research!</p>
<hr />
<p><strong>Addendum 04/26/2017</strong>: I added a personal project page for this paper to my website, <a href="http://www.yiningkarlli.com/projects/skydomecompare.html">located here</a>. My personal page mirrors the same content found on the main site, including an author’s version of the paper, supplemental materials, and more.</p>
https://blog.yiningkarlli.com/2014/02/flip-meshing-pipeline.html
PIC/FLIP Simulator Meshing Pipeline
2014-02-14T00:00:00+00:00
2014-02-14T00:00:00+00:00
Yining Karl Li
<p>In my last post, I gave a summary of how the core of my new PIC/FLIP fluid simulator works and gave some thoughts on the process of building OpenVDB into my simulator. In this post I’ll go over the meshing and rendering pipeline I worked out for my simulator.</p>
<p>Two years ago, when my friend <a href="http://www.danknowlton.com/">Dan Knowlton</a> and I built our semi-Lagrangian fluid simulator, we had an immense amount of trouble with finding a good meshing and rendering solution. We used a standard marching cubes implementation to construct a mesh from the fluid levelset, but the meshes we wound up with had a lot of flickering issues. The flickering was especially apparent when the fluid had to fit inside of solid boundaries, since the liquid-solid interface wouldn’t line up properly. On top of that, we rendered the fluid using Vray, but relied on an irradiance map + light cache approach that wasn’t very well suited for high motion and large amounts of refractive fluid.</p>
<p>This time around, I’ve tried to build a new meshing/rendering pipeline that resolves those problems. My new meshing/rendering pipeline produces stable, detailed meshes that fit correctly into solid boundaries, all with minimal or no flickering. The following video is the same “dambreak” test from my previous test, but fully meshed and rendered using Vray:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/87050516" frameborder="0">PIC/FLIP Simulator Dam Break Test- Final Render</iframe></div>
<p>One of the main issues with the old meshing approach was that marching cubes was run directly on the same level set we were using for the simulation, which meant that the resolution of the final mesh was effectively bound to the resolution of the fluid. In a pure semi-Lagrangian simulator, this coupling makes sense; however, in a PIC/FLIP simulator, the resolution of the simulator is dependent on the particle count and not the projection step grid resolution. This property means that even on a simulation with a grid size of 128x64x64, extremely high resolution meshes should be possible if there are enough particles, as long as the level set is constructed directly from the particles, completely independently of the projection step grid dimensions.</p>
<p>Fortunately, OpenVDB comes with an enormous toolkit that includes tools for constructing level sets from various types of geometry, including particles, and tools for adaptive level set meshing. OpenVDB also comes with a number of level set operators that allow for artistic tuning of level sets, such as tools for dilating, eroding, and smoothing level sets. At the SIGGRAPH 2013 OpenVDB course, <a href="http://www.openvdb.org/download/openvdb_dreamworks.pdf">Dreamworks had a presentation</a> on how they used OpenVDB’s level set operator tools to extract really nice looking, detailed fluid meshes from relatively low resolution simulations. I also integrated Walt Disney Animation Studios’ <a href="http://www.disneyanimation.com/technology/partio.html">Partio</a> library for exporting particle data to standard formats, so that I could output particles in addition to level sets and meshes.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/adaptivemeshing.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/adaptivemeshing.png" alt="Zero adaptive meshing (on the left) versus adaptive meshing with 0.5 adaptivity (on the right). Note the significantly lower poly count in the adaptive meshing, but also the corresponding loss of detail in the mesh." /></a></p>
<p>I started by building support for OpenVDB’s adaptive level set meshing directly into my simulator and dumping out OBJ sequences straight to disk. In order to save disk space, I enabled fairly high adaptivity in the meshing. However, upon doing a first render test, I discovered a problem: since OpenVDB’s adaptive meshing optimizes the adaptivity per frame, the result is not temporally coherent with respect to mesh resolution. By itself this property is not a big deal, but it makes reconstructing temporally coherent normals difficult, which can contribute to flickering in final rendering. So, I decided that disk space was not as big a deal and just disabled adaptivity in OpenVDB’s meshing for smaller simulations; in sufficiently large sims, the scale of the final render usually makes normal issues far less important while disk space demands become much greater, so the tradeoffs of adaptivity become more worthwhile.</p>
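<p>For reference, invoking the adaptive mesher is a single call; here is a hedged sketch against the OpenVDB API as I remember it from the version I was using (check the current OpenVDB headers for exact signatures before copying this):</p>
<pre><code>// Hedged sketch: extract a render mesh from a liquid level set using
// OpenVDB's adaptive mesher. A nonzero adaptivity collapses flat regions into
// fewer, larger polygons, which is what breaks temporal coherence per frame.
#include <vector>
#include <openvdb/openvdb.h>
#include <openvdb/tools/VolumeToMesh.h>

void meshLiquidLevelSet(const openvdb::FloatGrid& liquidSdf,
                        double adaptivity,  // 0.0 = uniform, up to 1.0
                        std::vector<openvdb::Vec3s>& points,
                        std::vector<openvdb::Vec3I>& triangles,
                        std::vector<openvdb::Vec4I>& quads) {
    // Mesh the zero isosurface of the signed distance field.
    openvdb::tools::volumeToMesh(liquidSdf, points, triangles, quads,
                                 /*isovalue=*/0.0, adaptivity);
}
</code></pre>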
<p>The next problem was getting a stable, fitted liquid-solid interface. Even with a million particles and a 1024x512x512 level set driving mesh construction, the produced fluid mesh still didn’t fit the solid boundaries of the sim precisely. The reason is simple: level set construction from particles works by treating each particle as a sphere with some radius and then unioning all of the spheres together. The first solution I thought of was to dilate the level set and then difference it with a second level set of the solid objects in the scene. Since Houdini has full OpenVDB support and I wanted to test this idea quickly with visual feedback, I prototyped this step in Houdini instead of writing a custom tool from scratch. This approach wound up not working well in practice. I discovered that in order to get a clean result, the solid level set needed to be extremely high resolution to capture all of the detail of the solid boundaries (such as sharp corners). Since the output levelset from VDB’s difference operator has to match the resolution of the highest resolution input, that meant the resultant liquid level set was also extremely high resolution. On top of that, the entire process was extremely slow, even on smaller grids.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/edgecleanup.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/edgecleanup.png" alt="The mesh on the left has a cleaned up, stable liquid-solid interface. The mesh on the right is the same mesh as the one on the left, but before going through cleanup." /></a></p>
<p>The solution I wound up using was to process the mesh instead of the level set, since the mesh represents significantly less data, and at the end of the day the mesh is what we want to have a clean liquid-solid interface. The solution is, for every vertex in the liquid mesh, to raycast to find the nearest point on the solid boundary to that vertex (this can be done either stochastically, or a level set version of the solid boundary can be used to inform a good starting direction). If the closest point on the solid boundary is within some epsilon distance of the vertex, move the vertex to be at the solid boundary. Obviously, this approach is far simpler than attempting to difference level sets, and it works pretty well. I prototyped this entire system in Houdini.</p>
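<p>The vertex cleanup step itself is only a few lines; here is an illustrative sketch (not the Houdini prototype itself), with the closest-point query passed in as a stand-in for whichever method answers it, whether stochastic raycasts or a lookup against a solid level set:</p>
<pre><code>// Sketch of the liquid-solid interface cleanup: for each liquid mesh vertex,
// find the closest point on the solid boundary, and snap the vertex onto the
// boundary if it is within some epsilon distance.
#include <functional>
#include <vector>
#include <glm/glm.hpp>

void snapVerticesToSolidBoundary(
        std::vector<glm::vec3>& liquidVertices, float epsilon,
        const std::function<glm::vec3(const glm::vec3&)>& closestPointOnSolid) {
    for (glm::vec3& v : liquidVertices) {
        glm::vec3 boundaryPoint = closestPointOnSolid(v);
        if (glm::distance(v, boundaryPoint) < epsilon) {
            v = boundaryPoint;  // fit the liquid surface flush to the solid
        }
    }
}
</code></pre>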
<p>For rendering, I used Vray’s ply2mesh utility to dump the processed fluid meshes directly to .vrmesh files and rendered the result in Vray using pure brute force pathtracing to avoid flickering from temporally incoherent irradiance caching. The final result is the video at the top of this post!</p>
<p>Here are some still frames from the same simulation. The video was rendered with motion blur, these stills do not have any motion blur.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0105.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0105.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0149.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0149.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0200.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0200.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0236.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0236.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0440.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0440.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2014/01/flip-simulator.html
New PIC/FLIP Simulator
2014-01-15T00:00:00+00:00
2014-01-15T00:00:00+00:00
Yining Karl Li
<p>Over the past month or so, I’ve been writing a brand new fluid simulator from scratch. It started as a project for a course/seminar type thing I’ve been taking with <a href="http://www.cs.cornell.edu/~djames/">Professor Doug James</a>, but I’ve kept working on it for fun since the course ended. I wanted to try out implementing the <a href="http://www.cs.ubc.ca/~rbridson/docs/zhu-siggraph05-sandfluid.pdf">PIC/FLIP method from Zhu and Bridson</a>; in industry, PIC/FLIP has more or less become the de facto standard method for fluid simulation. Houdini and Naiad both use PIC/FLIP implementations as their core fluid solvers, and I’m aware that Double Negative’s in-house simulator is also a PIC/FLIP implementation.</p>
<p>I’ve named my simulator “Ariel”, since I like Disney movies and the name seemed appropriate for a project related to water. Here’s what a “dambreak” type simulation looks like:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/87331839" frameborder="0">PIC/FLIP Simulator Dam Break Test- Ariel View</iframe></div>
<p>That “dambreak” test was run with approximately a million particles, with a 128x64x64 grid for the projection step.</p>
<p>PIC/FLIP stands for Particle-In-Cell/Fluid-Implicit Particles. PIC and FLIP are actually two separate methods that each have certain shortcomings, but when used together in a weighted sum, they produce a very stable fluid solver (my own solver uses approximately a 90% FLIP to 10% PIC ratio). PIC/FLIP is similar to SPH in that it’s fundamentally a particle based method, but instead of attempting to use external forces to maintain fluid volume, PIC/FLIP splats particle velocities onto a grid, calculates a velocity field using a projection step, and then copies the new velocities back onto the particles for each step. This difference means PIC/FLIP doesn’t suffer from the volume conservation problems SPH has. In this sense, PIC/FLIP can almost be thought of as a hybridization of SPH and semi-Lagrangian level-set based methods. From this point forward, I’ll refer to the method as just FLIP for simplicity, even though it’s actually PIC/FLIP.</p>
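<p>The per-particle velocity update that implements the PIC/FLIP blend is quite compact; here is an illustrative sketch (not Ariel’s actual code), where <code>VelocityGrid</code> and <code>interpolateGridVelocity</code> are placeholders for the MAC grid and its trilinear interpolation:</p>
<pre><code>// Sketch of the PIC/FLIP velocity update for a single particle. The PIC part
// replaces the particle velocity with the newly projected grid velocity; the
// FLIP part adds the grid's velocity *change* to the particle's old velocity.
#include <glm/glm.hpp>

glm::vec3 updateParticleVelocity(const glm::vec3& particlePos,
                                 const glm::vec3& oldParticleVel,
                                 const VelocityGrid& gridBeforeProjection,
                                 const VelocityGrid& gridAfterProjection,
                                 float flipRatio) {  // e.g. 0.90f
    glm::vec3 oldGridVel = interpolateGridVelocity(gridBeforeProjection, particlePos);
    glm::vec3 newGridVel = interpolateGridVelocity(gridAfterProjection, particlePos);

    glm::vec3 picVel  = newGridVel;
    glm::vec3 flipVel = oldParticleVel + (newGridVel - oldGridVel);
    return flipRatio * flipVel + (1.0f - flipRatio) * picVel;
}
</code></pre>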
<p>I also wanted to experiment with <a href="http://www.openvdb.org/">OpenVDB</a>, so I built my FLIP solver on top of OpenVDB. OpenVDB is a sparse volumetric data structure library open sourced by Dreamworks Animation, and now integrated into a whole bunch of systems such as Houdini, Arnold, and Renderman. I played with it two years ago during my summer at Dreamworks, but didn’t really get too much experience with it, so I figured this would be a good opportunity to give it a more detailed look.</p>
<p>My simulator uses OpenVDB’s mesh-to-levelset toolkit for constructing the initial fluid volume and solid obstacles, meaning any OBJ meshes can be used to build the starting state of the simulator. For the actual simulation grid, things get a little bit more complicated; I initially started with using OpenVDB to store the grid for the projection step, with the idea that storing the projection grid sparsely should allow for scaling the simulator to really, really large scenes. However, I quickly ran into the ever present memory-speed tradeoff of computer science. I found that while the memory footprint of the simulator stayed very small for large sims, it ran almost ten times slower compared to when the grid was stored using raw floats. The reason is that since OpenVDB under the hood is a B+tree, constant read/write operations against a VDB grid end up being really expensive, especially if the grid is not very sparse. The fact that VDB enforces single-threaded writes due to the need to rebalance the B+tree does not help at all. As a result, I’ve left in a switch that allows my simulator to run in either raw float or VDB mode; VDB mode allows for much larger simulations, but raw float mode allows for faster, multithreaded sims.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0140.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0140.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0218.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0218.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0430.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0430.png" alt="" /></a></p>
<p>Here’s a video of another test scene, this time patterned after a “waterfall” type scenario. This test was done earlier in the development process, so it doesn’t have the wireframe outlines of the solid boundaries:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/88078336" frameborder="0">PIC/FLIP Simulator Waterfall Test- Ariel View</iframe></div>
<p>In the above videos and stills, blue indicates higher density/lower velocity, while white indicates lower density/higher velocity.</p>
<p>Writing the core PIC/FLIP solver actually turned out to be pretty straightforward, and I’m fairly certain that my implementation is correct since it closely matches the result I get out of Houdini’s FLIP solver for a similar scene with similar parameters (although not exactly, since there are bound to be some differences in how I handle certain details, such as slightly jittering particle positions to prevent artifacting between steps). Figuring out a good meshing and rendering pipeline turned out to be more difficult; I’ll write about that in my next post.</p>
https://blog.yiningkarlli.com/2013/12/takua-chair-renders.html
Takua Chair Renders
2013-12-10T00:00:00+00:00
2013-12-10T00:00:00+00:00
Yining Karl Li
<p>A while back, I did some test renders with Takua a0.4 to test out the material system. The test model was a model of an Eames Lounge Chair Wood, and the materials were glossy wood and aluminum. Each render was done with a single large, importance sampled area light and took about two minutes to complete.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Dec/eames_aluminum.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Dec/eames_aluminum.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Dec/eames_wood.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Dec/eames_wood.png" alt="" /></a></p>
<p>These renders were the last tests I did with Takua a0.4 before starting the new version. More on that soon!</p>
https://blog.yiningkarlli.com/2013/11/throwback-holiday-card-2011.html
Throwback- Holiday Card 2011
2013-11-17T00:00:00+00:00
2013-11-17T00:00:00+00:00
Yining Karl Li
<p>Two years ago, I was asked to create <a href="http://cg.cis.upenn.edu/">CG@Penn</a>’s <a href="http://cg.cis.upenn.edu/HappyHolidays2011.htm">2011 Holiday Card</a>. Shortly after finishing that particular project, I started writing a breakdown post but for some reason never finished/posted it. While going through old content for the <a href="http://blog.yiningkarlli.com/2013/11/code-and-visuals-version-4.html">move to Github Pages</a>, I found some of my old unfinished posts, and I’ve decided to finish up some of them and post them over time as sort of a series of throwback posts.</p>
<p>This project is particularly interesting because almost none of the approaches I took two years ago to finish it are approaches I would bother using today. But it’s still interesting to look back on!</p>
<p>Amy and Joe wanted something wintery and nonreligious for the card, since it would be sent to a very wide and diverse audience. They suggested some sort of snowy landscape piece, so I decided to make a snow-covered forest. This particular idea meant I had to figure out three key elements:</p>
<ul>
<li>Conifer trees</li>
<li>Modeling snow ON the trees</li>
<li>Rendering snow</li>
</ul>
<p>Since the holiday card had to be just a single still frame and had to be done in just a few days, I knew right away that I could (and would have to!) cheat heavily with compositing, so I was willing to try more unknown elements than I normally would throw into a single project. Also, since the shot I had in mind would be a wide, far shot, I knew that I could get away with less up-close detail for the trees.</p>
<p>I started by creating a handful of different base conifer tree models in OnyxTree and throwing them directly into Maya/Vray (this was before I had even started working on Takua Render) just to see how they would look. Normally models directly out of OnyxTree need some hand-sculpting and tweaking to add detail for up-close shots, but here I figured if they looked good enough, I could skip those steps. The result looked okay enough to move on:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/basic_trees.jpg"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/basic_trees.jpg" alt="" /></a></p>
<p>The textures for the bark and leaves were super simple. To make the bark texture’s diffuse layer, I pulled a photograph of bark off of Google, modified it to tile in Photoshop, and adjusted the contrast and levels until it was the color I wanted. The displacement layer was simply the diffuse layer converted to black and white and with contrast and brightness adjusted. Normally this method won’t work well for up close shots, but again, since I knew the shot would be far away, I could get away with some cheating. Here’s a crop from the bark textures:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/bark.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/bark.png" alt="" /></a></p>
<p>The pine needles were also super cheatey. I pulled a photo out of one of my reference libraries, dropped an opacity mask on top, and that was all for the diffuse color. Everything else was hacked in the leaf material’s shader; since the tree would be far away, I could get away with basic transparency instead of true subsurface scattering. The diffuse map with opacity flattened to black looks like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/pineleaves.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/pineleaves.png" alt="" /></a></p>
<p>With the trees roughed in, the next problem to tackle was getting snow onto the trees. Today, I would immediately spin up Houdini to create this effect, but back then, I didn’t have a Houdini license and hadn’t played with Houdini enough to realize how quickly it could be done. Not knowing better back then, I used 3dsmax and a plugin called <a href="http://www.zwischendrin.com/en/detail/261">Snowflow</a> (I used the demo version since this project was a one-off). To speed up the process, I used a simplified, decimated version of the tree mesh for Snowflow. Any inaccuracies between the resultant snow layer and the full tree mesh were acceptable, since they would look just like branches and leaves poking through the snow:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/snowflow.jpg"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/snowflow.jpg" alt="" /></a></p>
<p>I tried a couple of different variations on snow thickness, which looked decent enough to move on with:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/snowtest.jpg"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/snowtest.jpg" alt="" /></a></p>
<p>The next step was a fast snow material that would look reasonably okay from a distance, and render quickly. I wasn’t sure if the snow should have a more powdery, almost diffuse look, or if it should have a more refractive, frozen, icy look. I wound up trying both and going with a 50-50 blend of the two:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/snowmaterialtest.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/snowmaterialtest.png" alt="From left to right: refractive frozen ice, powdery diffuse, 50-50 blend" /></a></p>
<p>The next step was to compose a shot, make a very quick, simple lighting setup, and do some test renders. After some iterating, I settled for this render as a base for comp work:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/test4.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/test4.png" alt="" /></a></p>
<p>The base render is very blueish since the lighting setup was a simple, grey-blueish dome light over the whole scene. The shadows are blotchy since I turned Vray’s irradiance cache settings all the way down for faster rendertimes; I decided that I would rather deal with the blotchy shadows in post and have a shot at making the deadline rather than wait for a very long rendertime. I wound up going with the thinner snow at the time since I wanted the trees to be more recognizable as trees, but in retrospect, that choice was probably a mistake.</p>
<p>The final step was some basic compositing. In After Effects, I applied post-processed DOF using a z-depth layer and Frischluft, color corrected the image, cranked up the exposure, and added vignetting to get the final result:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/card.jpg"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/card.jpg" alt="" /></a></p>
<p>Looking back on this project two years later, I don’t think the final result looks really great. The image looks okay for two days of rushed work, but there is enormous room for improvement. If I could go back and change one thing, I would have chosen to use the much heavier snow cover version of the trees for the final composition. Also, today I would approach this project very very differently; instead of ping-ponging between multiple programs for each component, I would favor an almost pure-Houdini pipeline. The trees could be modeled as L-systems in Houdini, perhaps with some base work done in Maya. The snow could absolutely be simmed in Houdini. For rendering and lighting, I would use either my own Takua Render or some other fast physically based renderer (Octane, or perhaps Renderman 18’s iterative pathtracing mode) to iterate extremely quickly without having to compromise on quality.</p>
<p>So that’s the throwback breakdown of the CG@Penn Holiday 2011 card! I learned a lot from this project, and looking back and comparing how I worked two years ago to how I work today is always a good thing to do.</p>
https://blog.yiningkarlli.com/2013/11/code-and-visuals-version-4.html
Code and Visuals Version 4.0
2013-11-16T00:00:00+00:00
2013-11-16T00:00:00+00:00
Yining Karl Li
<p>I’d like to introduce the newest version of my computer graphics blog, Code and Visuals! On the surface, everything has been redesigned with a new layer of polish; everywhere, the site is now simpler, cleaner, and the layout is now fully responsive. Under the hood, I’ve moved from Blogger to <a href="http://jekyllrb.com/">Jekyll</a>, hosted on <a href="http://pages.github.com/">Github Pages</a>.</p>
<p>As part of the move to Jekyll, I’ve opted to clean up a lot of old posts as well. This blog started as some combination of a devblog, doodleblog, and photoblog, but quickly evolved into a pure computer graphics blog. In the interest of keeping historical context intact, I’ve ported over most of my older non-computer graphics posts, with minor edits and touchups here and there. A handful of posts I didn’t really like I’ve chosen to leave behind, but they can still be found on the <a href="http://yiningkarlli.blogspot.com">old Blogger-based version of this blog</a>.</p>
<p>The Atom feed URL for Code and Visuals is still the same as before, so that should transition over smoothly.</p>
<p>Why the move from Blogger to Jekyll/Github Pages? Here are the main reasons:</p>
<ul>
<li>Markdown/Github support. Blogger’s posting interface is all kinds of terrible. With Jekyll/Github Pages, writing a new post is super nice: simply write a new post in a Markdown file, push to Github, and done. I love Markdown and I love Github, so it’s a great combo for me.</li>
<li>Significantly faster site. Previous versions of this blog have always been a bit pokey speed-wise, since they relied on dynamic page generators (originally my hand-rolled PHP/MySQL CMS, then Wordpress, and then Blogger). However, Jekyll is a static page generator; the site is converted from Markdown and template code into static HTML/CSS once at generation time, and then simply served as pure HTML/CSS.</li>
<li>Easier templating system. Jekyll’s templating system is built on <a href="http://liquidmarkup.org/">Liquid</a>, which made building this new theme really fast and easy.</li>
<li>Transparency. This entire blog’s source is now <a href="https://github.com/betajippity/betajippity.github.io">available on Github</a>, and the theme is separately <a href="https://github.com/betajippity/codeandvisuals-theme">available here</a>.</li>
</ul>
<p>I’ve been looking to replace Blogger for some time now. Before trying out Jekyll, I was tinkering with <a href="https://ghost.org/">Ghost</a>, and even fully built out a working version of Code and Visuals on a self-hosted Ghost instance. In fact, this current theme was originally built for Ghost and then ported to Jekyll after I decided to use Jekyll (both the Ghost and Jekyll versions of this theme are in the Github repo). However, Ghost as a platform is still extremely new and isn’t quite ready for primetime yet; while Ghost’s Markdown support and Node.js underpinnings are nice, Ghost is still missing crucial features like the ability to have an archive page. Plus, at the end of the day, Jekyll is just plain simpler; Ghost is still a CMS, Jekyll is just a collection of text files.</p>
<p>I intend to stay on a Jekyll/Github Pages based solution for a long time; I am very very happy with this system. Over time, I’ll be moving my couple of other, non-computer graphics blogs over to Jekyll as well. I’m still not sure if my main website needs to move to Jekyll though, since it already is coded up as a series of static pages and requires a slightly more complex layout on certain pages.</p>
<p>Over the past few months I haven’t posted much, since over the summer almost all of my Pixar related work was under heavy NDA (and still is and will be for the foreseeable future, with the exception of <a href="http://blog.yiningkarlli.com/2013/07/pixar-optix-lighting-preview-demo.html">our SIGGRAPH demo</a>), and a good deal of my work at Cornell’s Program for Computer Graphics is under wraps as well while we work towards paper submissions. However, I have some new personal projects I’ll write up soon, in addition to some older projects that I never posted about.</p>
<p>With that, welcome to Code and Visuals Version 4.0!</p>
https://blog.yiningkarlli.com/2013/07/pixar-optix-lighting-preview-demo.html
Pixar Optix Lighting Preview Demo
2013-07-27T00:00:00+00:00
2013-07-27T00:00:00+00:00
Yining Karl Li
<p>For the past two months or so, I’ve been working at Pixar Animation Studios as a summer intern with <a href="http://graphics.pixar.com/research/">Pixar’s Research Group</a>. The project I’m on for the summer is a realtime, GPU based lighting preview tool implemented on top of <a href="http://www.nvidia.com/object/optix.html">NVIDIA’s OptiX framework</a>, entirely inside of <a href="http://www.thefoundry.co.uk/products/katana/">The Foundry’s Katana</a>. I’m incredibly pleased to be able to say that our project was demoed at SIGGRAPH 2013 at the NVIDIA booth, and that NVIDIA has a recording of the entire demo online!</p>
<p>The demo was done by our project’s lead, Danny Nahmias, and got an overwhelmingly positive reception. Check out the recording here:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/71150839" frameborder="0">Using NVIDIA® OptiX™ for Lighting Preview in a Katana-Based Production Pipeline</iframe></div>
<p>FXGuide also did a podcast about our demo! Check it out <a href="http://www.fxguide.com/fxpodcasts/fxpodcast-258-siggraph-2013-final-report/">here</a>.</p>
<p>I’m just an intern, and the vast majority of the cool work being done on this project is from Danny Nahmias, Phillip Rideout, Mark Meyer, and others, but I’m very very proud, and consider myself extraordinarily lucky, to be part of this team!</p>
<p>Edit: I’ve replaced the original Ustream embed with a Vimeo mirror since the Ustream embed was crashing Chrome for some people. The original Ustream link is <a href="http://ustream.tv/recorded/36266865">here</a>.</p>
https://blog.yiningkarlli.com/2013/04/giant-mesh-test.html
Giant Mesh Test
2013-04-29T00:00:00+00:00
2013-04-29T00:00:00+00:00
Yining Karl Li
<p>My friend/schoolmate <a href="https://vimeo.com/user10815579">Zia Zhu</a> is an amazing modeler, and recently she was kind enough to lend me a ZBrush sculpt she did for use as a high-poly test model for Takua Render. The model is a sculpture of Venus, and is made up of slightly over a million quads, or about two million triangles once triangulated inside of Takua Render.</p>
<p>Here are some nice, pretty test renders I did. As usual, everything was rendered with Takua Render, and there has been absolutely zero post-processing:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/venus1.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/venus1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/venus21.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/venus21.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/venus31.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/venus31.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/venus41.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/venus41.png" alt="" /></a></p>
<p>Each one of these renders was lit using a single, large area light (with importance sampled direct lighting, of course). The material on the model is just standard lambert diffuse white; I’ll do another set of test renders once I’ve finished rewriting my subsurface scatter system. Each render was set to 2800 samples per pixel and took about 20 minutes to render on a single GTX480. In other words, not spectacular, but not bad either.</p>
<p>The key takeaway from this series of tests was that Takua’s performance still suffers significantly when datasets become extremely large; while the render took about 20 minutes, setup time (including memory transfer, etc) took nearly 5 minutes, which I’m not happy about. I’ll be taking some time to rework Takua’s memory manager.</p>
<p>On a happier note, KD-tree construction performed well! The KD-tree for the Venus sculpt was built out to a depth of 30 and took less than a second to build.</p>
<p>Here’s a bonus image of what the sculpt looks like in the GL preview mode:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/venus_gl.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/venus_gl.png" alt="" /></a></p>
<p>Again, all credit for the actual model goes to the incredibly talented <a href="https://vimeo.com/user10815579">Zia Zhu</a>!</p>
https://blog.yiningkarlli.com/2013/04/importance-sampled-direct-lighting.html
Importance Sampled Direct Lighting
2013-04-26T00:00:00+00:00
2013-04-26T00:00:00+00:00
Yining Karl Li
<p>Takua Render now has correct, fully working importance sampled direct lighting, supported for any type of light geometry! More importantly, the importance sampled direct lighting system is now fully integrated with the overall GI pathtracing integrator.</p>
<p>A naive, standard pathtracing implementation shoots out rays and accumulates colors until a light source is reached, upon which the total accumulated color is multiplied by the emittance of the light source and added to the framebuffer. As a result, even the simplest pathtracing integrator does account for both the indirect and direct illumination within a scene, but since sampling light sources is entirely dependent on the BRDF at each point, correctly sampling the direct illumination component in the scene is extremely inefficient. The canonical example of this inefficiency is a scene with a single very small, very intense, very far away light source. Since the probability of hitting such a small light source is so small, convergence is extremely slow.</p>
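<p>As a rough sketch of what that naive scheme looks like in code (hypothetical type and function names, not Takua’s actual code, with GLM assumed for vector math), the loop just multiplies colors along the path and only ever adds energy if the path happens to stumble into an emitter:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>glm::vec3 naivePathTrace(Ray ray, const Scene&amp; scene, Rng&amp; rng, int maxDepth) {
    glm::vec3 throughput(1.0f);
    for (int depth = 0; depth &lt; maxDepth; ++depth) {
        Hit hit;
        if (!scene.intersect(ray, hit)) {
            break;                                    // ray escaped the scene
        }
        if (hit.isEmitter) {
            return throughput * hit.emittance;        // the only place light enters the estimate
        }
        throughput *= hit.brdfCosOverPdf(ray, rng);   // accumulate color for this bounce
        ray = hit.sampleBrdfRay(ray, rng);            // pure BRDF sampling, no light sampling
    }
    return glm::vec3(0.0f);                           // the path never found a light source
}
</code></pre></div></div>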
<p>To demonstrate/test this property, I made a simple test scene with an extremely bright sun-like object illuminating the scene from a huge distance away:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/directtestscene1.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/directtestscene1.png" alt="" /></a></p>
<p>Using naive pathtracing without importance sampled direct lighting produces an image like this after 16 samples per pixel:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect16.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect16.png" alt="" /></a></p>
<p>Mathematically, the image is correct, but is effectively useless since so few contributing ray paths have actually been found. Even after 5120 samples, the image is still pretty useless:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect5120.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect5120.png" alt="" /></a></p>
<p>A much better approach is to accumulate colors just like before, but not bother waiting until a light source is hit by the ray path through pure BRDF sampling before multiplying in emittance. Instead, at each ray bounce, a new indirect ray is generated via the BRDF like before, AND a new direct ray is generated towards a randomly chosen light source via multiple importance sampling, with the accumulated color multiplied by the resultant emittance. Multiple importance sampled direct lighting works by balancing two different sampling strategies, sampling by light source and sampling by BRDF, and then weighting the two results with some sort of heuristic (such as the power heuristic described in <a href="http://graphics.stanford.edu/papers/veach_thesis/">Eric Veach’s thesis</a>).</p>
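<p>The power heuristic itself is tiny; a minimal sketch with beta = 2 (following the form in Veach’s thesis, where nf/fPdf and ng/gPdf are the sample counts and pdfs of the two strategies being combined) looks something like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Power heuristic with beta = 2: weight for strategy f when combined with strategy g.
float powerHeuristic(int nf, float fPdf, int ng, float gPdf) {
    float f = nf * fPdf;
    float g = ng * gPdf;
    return (f * f) / (f * f + g * g);
}
</code></pre></div></div>
<p>In use, the contribution from a light sample gets weighted by something like powerHeuristic(1, lightPdf, 1, brdfPdf), and the contribution from a BRDF sample by the mirrored call with the arguments swapped.</p>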
<p>Sampling by light source is the trickier part of this technique. The idea is to generate a ray that we know will hit a light source, and then weight the contribution from that ray by the probability of generating that ray to remove the bias introduced by artificially choosing a ray direction. There are a few good ways to do this: one way is to generate an evenly distributed random point on a light source as the target for the direct lighting ray, and then weight the result using the probability density function with respect to surface area, transformed into a PDF with respect to solid angle.</p>
<p>Takua Render at the moment uses a slightly different approach, for the sake of simplicity. The approach I’m using is similar to the one described in my <a href="http://blog.yiningkarlli.com/2013/04/working-towards-importance-sampled-direct-lighting.html">earlier post on the topic</a>, but with a disk instead of a sphere. The approach works like this:</p>
<ol>
<li>Figure out a bounding sphere for the light source</li>
<li>Construct a ray from the point to be lit to the center of the bounding sphere. Let’s call the direction of this ray D.</li>
<li>Find a great circle on the bounding sphere with a normal N, such that N is lined up exactly with D.</li>
<li>Move the great circle along its normal towards the point to be lit by a distance of exactly the radius of the bounding sphere</li>
<li>Treat the great circle as a disk and generate uniformly distributed random points on the disk to shoot rays towards.</li>
<li>Weight light samples by the projected solid angle of the disk on the point being lit.</li>
</ol>
<p>Alternatively, the weighting can simply be based on the normal solid angle instead of the projected solid angle, since the random points are chosen with a cosine weighted distribution.</p>
<p>The nice thing about this approach is that it allows for importance sampled direct lighting even for shapes that are difficult to sample random points on; effectively, the problem of sampling light sources is abstracted away, at the cost of a slight loss in efficiency since some percentage of rays fired at the disk have to miss the light in order for the weighting to remain unbiased.</p>
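<p>Here’s a small sketch of what the disk-based procedure above can look like in code. Everything here is hypothetical and simplified (GLM for vector math, a made-up Rng type, and an assumed concentricSampleDisk helper that returns a uniform point on the unit disk); the returned pdf is with respect to area on the disk, which the integrator then converts into a solid-angle pdf with the usual squared-distance over cosine factor before weighting the sample:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct DiskLightSample {
    glm::vec3 direction;    // direction from the shaded point towards the sampled disk point
    glm::vec3 targetPoint;  // the sampled point on the disk
    float areaPdf;          // 1 / (pi * radius^2), uniform over the disk
};

DiskLightSample sampleBoundingDisk(const glm::vec3&amp; shadePoint,
                                   const glm::vec3&amp; sphereCenter,
                                   float sphereRadius, Rng&amp; rng) {
    // Step 2: direction from the shaded point to the bounding sphere's center
    glm::vec3 D = glm::normalize(sphereCenter - shadePoint);
    // Steps 3-4: great circle with normal D, slid towards the shaded point by the radius
    glm::vec3 diskCenter = sphereCenter - D * sphereRadius;
    // Build an orthonormal basis around D so points can be placed on the disk
    glm::vec3 up = (std::abs(D.x) &lt; 0.9f) ? glm::vec3(1, 0, 0) : glm::vec3(0, 1, 0);
    glm::vec3 tangent = glm::normalize(glm::cross(up, D));
    glm::vec3 bitangent = glm::cross(D, tangent);
    // Step 5: uniform random point on the disk
    glm::vec2 d = concentricSampleDisk(rng.next2D()) * sphereRadius;
    glm::vec3 target = diskCenter + d.x * tangent + d.y * bitangent;
    // Step 6: pdf with respect to disk area; converted to solid angle by the caller
    float areaPdf = 1.0f / (3.14159265f * sphereRadius * sphereRadius);
    return { glm::normalize(target - shadePoint), target, areaPdf };
}
</code></pre></div></div>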
<p>I also started work on the surface area PDF to solid angle PDF method, so I might post about that later too. But for now, everything works! With importance sampled direct lighting, the scene from above is actually renderable in a reasonable amount of time. With just 16 samples per pixel, Takua Render now can generate this image:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/direct18.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/direct18.png" alt="" /></a></p>
<p>…and after 5120 samples per pixel, a perfectly clean render:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/direct5120.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/direct5120.png" alt="" /></a></p>
<p>The other cool thing about this scene is that most of the scene is actually being lit through pure indirect illumination. With only direct illumination and no GI, the render looks like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/directonly.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/directonly.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2013/04/quick-update-on-future-plans.html
Quick Update on Future Plans
2013-04-20T02:00:00+00:00
2013-04-20T02:00:00+00:00
Yining Karl Li
<p>Just a super quick update on my future plans:</p>
<p>Next year, starting in September, I’ll be joining <a href="http://www.graphics.cornell.edu/people/director.html">Dr. Don Greenberg</a> and <a href="http://www.graphics.cornell.edu/~kiderj/index.htm">Dr. Joseph T. Kider</a> and others at <a href="http://www.graphics.cornell.edu/">Cornell’s Program for Computer Graphics</a>. I’ll be pursuing a Master of Science in Computer Graphics there, and will most likely be working on something involving rendering (which I suppose is not surprising).</p>
<p>Between the end of school and September, I’ll be spending the summer at <a href="http://www.pixar.com/">Pixar Animation Studios</a> once again, this time as part of <a href="http://graphics.pixar.com/research/people.html">Pixar’s Research Group</a>.</p>
<p>Obviously I’m quite excited by all of this!</p>
<p>Now, back to working on my renderer.</p>
https://blog.yiningkarlli.com/2013/04/working-towards-importance-sampled-direct-lighting.html
Working Towards Importance Sampled Direct Lighting
2013-04-20T01:00:00+00:00
2013-04-20T01:00:00+00:00
Yining Karl Li
<p>I haven’t made a post in a few weeks now since I’ve been working on a number of different things, none of which are quite done yet. Since it’s been a few weeks, here’s a writeup of one of the things I’m working on and where I am with that.</p>
<p>One of the major features I’ve been working towards for the past few weeks is full multiple importance sampling, which will serve a couple of purposes. First, importance sampling the direct lighting contribution in the image should allow for significantly higher convergence rates for the same amount of compute, allowing for much smoother renders for the same render time. Second, MIS will serve as groundwork for future bidirectional integration schemes, such as Metropolis transport and photon mapping. I’ve been working with my friend <a href="http://www.linkedin.com/pub/xing-du/3a/626/a23">Xing Du</a> on understanding the math behind MIS and figuring out how exactly the math should translate into implementation.</p>
<p>So first off, some ground truth tests. All ground truth tests are rendered using brute force pathtracing with hundreds of thousands of iterations per pixel. Here is the test scene I’ve been using lately, with all surfaces reduced to lambert diffuse for the sake of simplification:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/groundtruth.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/groundtruth.png" alt="Ground truth global illumination render, representing 512000 samples per pixel. All lights sampled by BRDF only." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_montecarlo.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_montecarlo.png" alt="Ground truth for direct lighting contribution only, with all lights sampled by BRDF only." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect_only.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect_only.png" alt="Ground truth for indirect lighting contribution only." /></a></p>
<p>The motivation behind importance sampling lights by directly sampling objects with emissive materials comes from the difficulty of finding useful samples from the BRDF only; for example, for the lambert diffuse case, since sampling from only the BRDF produces outgoing rays in totally random (or, slightly better, cosine weighted random) directions, the probability of any ray coming from a diffuse surface actually hitting a light is relatively low, meaning that the contribution of each sample is likely to be low as well. As a result, finding the direct lighting contribution through BRDF sampling alone is extremely inefficient.</p>
<p>For example, here’s the direct lighting contribution only, after 64 samples per pixel with only BRDF sampling:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_test.2.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_test.2.png" alt="Direct lighting contribution only, all lights sampled by BRDF only, 64 samples per pixel." /></a></p>
<p>Instead of sampling direct lighting contribution by shooting a ray off in a random direction and hoping that maybe it will hit a light, a much better strategy would be to… shoot the ray towards the light source. This way, the contribution from the sample is guaranteed to be useful. There’s one hitch though: the weighting for a sample chosen using the BRDF is relatively simple to determine. For example, in the lambert diffuse case, since the probability of any particular random sample within a hemisphere is the same as any other sample, the weighting per sample is even with all other samples. Once we selectively choose the ray direction specifically towards the light though, the weighting per sample is no longer even. Instead, we must weight each sample by the probability of a ray going in that particular direction towards the light, which we can calculate by the solid angle subtended by the light source divided by the total solid angle of the hemisphere.</p>
<p>So, a trivial example case would be if a point was being lit by a large area light subtending exactly half of the hemisphere visible from the point. In this case, the area light subtends Pi steradians, making its total weight Pi/(2*Pi), or one half.</p>
<p>The tricky part of calculating the solid angle weighting is in calculating the fractional unit-spherical surface area projection for non-uniform light sources. In other words, figuring out what solid angle a sphere subtends is easy, but figuring out what solid angle a Stanford Bunny subtends is…. less easy.</p>
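<p>For reference, the easy sphere case boils down to the spherical cap formula: a sphere of radius r whose center is a distance d away (with d greater than r) subtends a solid angle of 2*Pi*(1 - cos(theta)), where sin(theta) = r/d. A quick sketch:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;

// Solid angle subtended by a sphere of radius r at distance d (assumes d > r).
float sphereSolidAngle(float r, float d) {
    float sinTheta = r / d;
    float cosTheta = std::sqrt(1.0f - sinTheta * sinTheta);
    return 2.0f * 3.14159265f * (1.0f - cosTheta);
}
</code></pre></div></div>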
<p>The initial approach that Xing and I arrived at was to break complex meshes down into triangles and treat each triangle as a separate light, since calculating the solid angle subtended by a triangle is once again easy. However, treating a mesh as a cloud of triangle area lights is potentially very expensive; computing the direct lighting contribution from every light in the scene for each iteration becomes potentially untenable, meaning that each iteration of the render will have to randomly select a small number of lights to directly sample.</p>
<p>As a result, we brainstormed some ideas for potential shortcuts. One shortcut idea we came up with was that instead of choosing an evenly distributed point on the surface of the light to serve as the target for our ray, we could instead shoot a ray at the bounding sphere for the target light and weight the resulting sample by the solid angle subtended not by the light itself, but by the bounding sphere. Our thinking was that this approach would dramatically simplify the work of calculating the solid angle weighting, while still maintaining mathematical correctness and unbiasedness since the number of rays fired at the bounding sphere that will miss the light should exactly offset the overweighting produced by using the bounding sphere’s subtended solid angle.</p>
<p>I went ahead and tried out this idea, and produced the following image:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_test.3.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_test.3.png" alt="Direct lighting contribution only, all lights sampled by direct sampling weighted by subtended solid angle, 64 samples per pixel." /></a></p>
<p>First off, for the most part, it works! The resulting direct illumination matches the ground truth and the BRDF-sampling render, but is significantly more converged than the BRDF-sampling render for the same number of samples. BUT, there is a critical flaw: note the black circle around the light source on the ceiling. That black circle happens to fall exactly within the bounding sphere for the light source, and results from a very simple mathematical fact: calculating the solid angle subtended by the bounding sphere for a point INSIDE of the bounding sphere is undefined. In other words, this shortcut approach will fail for any points that are too close to a light source.</p>
<p>One possible workaround I tried was to have any points inside of a light’s bounding sphere fall back to pure BRDF sampling, but this approach is also undesirable, as a highly visible discontinuity develops between the differently sampled areas due to vastly different convergence rates.</p>
<p>So, while the overall solid angle weighting approach checks out, our shortcut does not. I’m now working on implementing the first approach described above, which should produce a correct result, and will post in the next few days.</p>
https://blog.yiningkarlli.com/2013/03/stratified-versus-uniform-sampling.html
Stratified versus Uniform Sampling
2013-03-06T00:00:00+00:00
2013-03-06T00:00:00+00:00
Yining Karl Li
<p>As part of Takua Render’s new pathtracing core, I’ve implemented a system allowing for multiple sampling methods instead of just uniform sampling. The first new sampling method I’ve added in addition to uniform sampling is stratified sampling. Basically, in stratified sampling, instead of spreading samples per iteration across the entire probability region, the probability region is first divided into a number of equal sized, non-overlapping subregions, and then for each iteration, a sample is drawn with uniform probability from within a single subregion, called a stratum. The result of stratified sampling is that samples are guaranteed to be more evenly spread across the entire probability domain instead of clustered within a single area, resulting in less visible noise for the same number of samples compared to uniform sampling. At the same time, since stratified sampling still maintains a random distribution within each stratum, the aliasing problems associated with a totally even sample distribution are still avoided.</p>
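<p>As a small illustrative sketch (not Takua’s actual sampler code; GLM is assumed for the vec2 type), generating a full set of stratified samples over the unit square looks something like this, with each stratum receiving exactly one uniformly jittered sample:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;random&gt;
#include &lt;vector&gt;

// strataPerAxis^2 samples over the unit square, one per stratum; GLM assumed for vec2.
std::vector&lt;glm::vec2&gt; stratifiedSamples2D(int strataPerAxis, std::mt19937&amp; rng) {
    std::uniform_real_distribution&lt;float&gt; uniform(0.0f, 1.0f);
    std::vector&lt;glm::vec2&gt; samples;
    float invN = 1.0f / float(strataPerAxis);
    for (int y = 0; y &lt; strataPerAxis; ++y) {
        for (int x = 0; x &lt; strataPerAxis; ++x) {
            // one uniformly random sample jittered within the (x, y) stratum
            samples.push_back(glm::vec2((float(x) + uniform(rng)) * invN,
                                        (float(y) + uniform(rng)) * invN));
        }
    }
    return samples;
}
</code></pre></div></div>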
<p>Here’s a video showing a scene rendered in Takua Render with uniform and then stratified sampling. The video also shows a side-by-side comparison in its last third.</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/61209575" frameborder="0">Takua Render Sampler Methods Comparison</iframe></div>
<p>In the renders in the above video, stratified sampling is being used to choose new ray directions from diffuse surface bounces; instead of choosing a random point over the entire cosine-weighted hemisphere at an intersection point, the renderer first chooses a stratum subtending the same solid angle as every other stratum, and then chooses a random sample within that solid angle. The stratum is chosen sequentially for primary bounces, and then chosen randomly for all secondary bounces to maintain unbiased sampling over the whole render. As a result of the sequential stratum selection for primary bounces, images rendered in Takua Render will not converge to an unbiased solution until N iterations have elapsed, where N is the number of strata the probability region is divided into. The number of strata can be set by the user as a value in the scene description, which is then squared to get the total strata count. So, if a user specifies a strata level of 16, then the probability region will be divided into 256 strata and an unbiased result will not be reached until 256 or more samples per pixel have been taken.</p>
<p>Here’s the Lamborghini model from last post at 256 samples per pixel with stratified (256 strata) and uniform sampling, to demonstrate how much less perceptible noise there is with the stratified sampler. From a distance, the uniform sampler renders may seem slightly darker side by side due to the higher percentage of noise, but if you compare them using the lightbox, you can see that the lighting and brightness is the same.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_strat.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_strat.png" alt="Stratified sampling, 256 strata, 256 samples per pixel" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/uniform.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/uniform.png" alt="Uniform sampling, 256 samples per pixel" /></a></p>
<p>…and up-close crops with 400% zoom:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_strat_crop.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_strat_crop.png" alt="Stratified sampling, 256 strata, 256 samples per pixel, 400% crop" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/uniform_zoom.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/uniform_zoom.png" alt="Uniform sampling, 256 samples per pixel, 400% crop" /></a></p>
<p>At some point soon I will also be implementing Halton sequence sampling and [0,2]-sequence sampling, but for the time being, stratified sampling is already providing a huge visual boost over uniform! In fact, I have a small secret to confess: all of the renders in the last post were rendered with the stratified sampler!</p>
https://blog.yiningkarlli.com/2013/03/first-progress-on-new-pathtracing-core.html
First Progress on New Pathtracing Core
2013-03-04T00:00:00+00:00
2013-03-04T00:00:00+00:00
Yining Karl Li
<p>I’ve started work on a completely new pathtracing core to replace the one used in Rev 2. The purpose of totally rewriting the entire pathtracing integrator and brdf systems is to produce something much more modular and robust; as much as possible, I am now decoupling brdf and new ray direction calculation from the actual pathtracing loop.</p>
<p>I’m still in the earliest stages of this rewrite, but I have some test images! Each of the following images was rendered out to somewhere around 25000 samples per pixel (a lot!), at about 5/6 samples per pixel per second. I let the renders run without a hard ending point and terminated them after I walked away for a while and came back, hence the inexact but enormous samples per pixel counts. Each scene was lit with my standard studio-styled lighting setup and in addition to the showcased model, uses a smooth backdrop that consists of about 10000 triangles.</p>
<p>Approximately 100000 face Stanford Dragon:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/dragon.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/dragon.png" alt="" /></a></p>
<p>Approximately 150000 face DeLorean model:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/deloreon.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/deloreon.png" alt="" /></a></p>
<p>Approximately 250000 face Lamborghini Aventador model:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_back.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_back.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_front.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_front.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2013/03/short-stack-kd-tree-traversal.html
Short-stack KD-Tree Traversal
2013-03-01T00:00:00+00:00
2013-03-01T00:00:00+00:00
Yining Karl Li
<p>In my last post, I talked about implementing history flag based kd-tree traversal. While the history flag based stackless traverse worked perfectly fine in terms of traversing the tree and finding the nearest intersection, I discovered over the past week that its performance is… less than thrilling. Unfortunately, the history flag system results in a huge amount of redundant node visits, since the entire system is state based and therefore necessarily needs to visit every node in a branch of the tree both to move down and up the branch.</p>
<p>So instead, I decided to try out a short-stack based approach. My initial concern with short-stack based approaches was the daunting memory requirement of keeping a short stack for a few hundred threads; however, I realized that realistically, a short stack never needs to be any larger than the maximum depth of the kd-tree being traversed. Since I haven’t yet had a need to test a tree with a depth beyond 100, the memory usage required for keeping short stacks is reasonably predictable and manageable; as a precaution, however, I’ve also decided to allow for the system to fall back to a stackless traverse in the case that a tree’s depth causes short stack memory usage to become unreasonable.</p>
<p>The actual short-stack traverse I’m using is a fairly standard while-while traverse based on the <a href="http://kunzhou.net/2008/kdtree.pdf">2008 Kun Zhou realtime kd-tree paper</a> and the <a href="http://graphics.stanford.edu/papers/i3dkdtree/">2007 Daniel Horn GPU kd-tree paper</a>. I’ve added one small addition though: in addition to keeping a short stack for traversing the kd-tree, I’ve also added an optional second short stack that tracks the last N intersection test objects. The reason for keeping this second short stack is that kd-trees allow for objects to be split across multiple nodes; by tracking which objects we have already encountered, we can safely detect and skip objects that have already been tested. The object tracking short stack is meant to be rather small (say, no more than 10 to 15 objects at a time), and simply loops back and overwrites the oldest values in the stack when it overflows.</p>
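<p>For the curious, here’s a heavily simplified sketch of the while-while, short-stack traversal shape (hypothetical node, ray, and hit types, not Takua’s actual data layout, and with the object-tracking second stack omitted for brevity): the inner while descends towards a leaf, pushing far children onto the short stack along the way, and the outer while pops the stack whenever a leaf fails to produce a usable hit:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct StackEntry { const KdNode* node; float tMin; float tMax; };

bool traverse(const KdNode* root, const Ray&amp; ray, float tMin, float tMax, Hit&amp; hit) {
    StackEntry stack[64];               // short stack; bounded by the maximum tree depth
    int stackPtr = 0;
    const KdNode* node = root;
    bool found = false;
    while (true) {                      // outer while: restart from the short stack
        while (!node->isLeaf) {         // inner while: descend until a leaf is reached
            int axis = node->splitAxis;
            float t = (node->splitPos - ray.origin[axis]) * ray.invDir[axis];
            bool belowFirst = (ray.origin[axis] &lt; node->splitPos) ||
                              (ray.origin[axis] == node->splitPos &amp;&amp; ray.dir[axis] &lt;= 0.0f);
            const KdNode* nearChild = belowFirst ? node->left : node->right;
            const KdNode* farChild  = belowFirst ? node->right : node->left;
            if (t > tMax || t &lt;= 0.0f) {
                node = nearChild;       // split is behind or past the segment: near side only
            } else if (t &lt; tMin) {
                node = farChild;        // split is before the segment: far side only
            } else {
                stack[stackPtr++] = { farChild, t, tMax };   // push far child, descend near
                node = nearChild;
                tMax = t;
            }
        }
        found |= intersectLeafPrimitives(node, ray, tMin, tMax, hit);
        if (found &amp;&amp; hit.t &lt;= tMax) return true;   // no stacked node can hold a closer hit
        if (stackPtr == 0) return found;            // nothing left to visit
        StackEntry e = stack[--stackPtr];           // pop the next subtree off the short stack
        node = e.node; tMin = e.tMin; tMax = e.tMax;
    }
}
</code></pre></div></div>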
<p>The new while-while traversal is significantly faster than the history flag approach, on the order of a 10x or better performance increase in some cases.</p>
<p>In order to validate that the entire kd traversal system works, I did a quick and dirty port of the old Rev 2 pathtracing integrator to run on top of the new Rev 3 framework. The following test images contain about 20000 faces and objects, and clocked in at about 6 samples per pixel per second with a tree depth of 15. Each image was rendered to 1024 samples per pixel:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/greencow.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/greencow.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/glasscow.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/glasscow.png" alt="" /></a></p>
<p>I also attempted to render these images without any kd-tree acceleration as a control. Without kd-tree acceleration, each sample per pixel took upwards of 5 seconds, and I wound up terminating the renders before they got even close to completion.</p>
<p>The use of my old Rev 2 pathtracing core is purely temporary, however. The next task I’ll be tackling is a total rewrite of the entire pathtracing system and associated lighting and brdf evaluation systems. Previously, these systems have basically been monolithic blocks of code, but with this rewrite, I want to create a more modular, robust system that can recycle as much code as possible between GPU and CPU implementations, the GL debugger, and eventually other integration methods, such as photon mapping.</p>
https://blog.yiningkarlli.com/2013/02/stackless-kd-tree-traversal.html
Stackless KD-Tree Traversal
2013-02-22T00:00:00+00:00
2013-02-22T00:00:00+00:00
Yining Karl Li
<p>I have a working, reasonably optimized, speedy GPU stackless kd-tree traversal implementation! Over the past few days, I implemented the history flag-esque approach I outlined in <a href="http://blog.yiningkarlli.com/2012/09/thoughts-on-stackless-kd-tree-traversal.html">this post</a>, and it works quite well!</p>
<p>The following image is a heatmap of a kd-tree built for the Stanford Dragon, showing the cost of tracing a ray through each pixel in the image. Brighter values mean more node traversals and intersection tests had to be done for that particular ray. The image was rendered entirely using Takua Render’s CUDA pathtracing engine, and took roughly 100 milliseconds to complete:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragon.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragon.png" alt="" /></a></p>
<p>…and a similar heatmap, this time generated for a scene containing two mesh cows, two mesh helixes, and some cubes and spheres in a box:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/cow.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/cow.png" alt="" /></a></p>
<p>Although room for even further optimization still exists, as it always does, I am quite happy with the results so far. My new kd-tree construction system and stackless traversal system are both several orders of magnitude faster and more efficient than my older attempts.</p>
<p>Here’s a bit of a cool image: in my OpenGL debugging view, I can now follow the kd-tree traversal for a single ray at a time and visualize the exact path and nodes encountered. This tool has been extremely useful for optimizing… without a visual debugging tool, no wonder my previous implementations had so many problems! The scene here is the same cow/helix scene, but rotated 90 degrees. The bluish green line coming in from the left is the ray, and the green boxes outline the nodes of the kd-tree that traversal had to check to get the correct intersection.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/kdboxes_notree.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/kdboxes_notree.png" alt="" /></a></p>
<p>…and here’s the same image as above, but with all nodes that were skipped drawn in red. As you can see, the system is now efficient enough to cull the vast vast majority of the scene for each ray:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/kdboxes_yestree.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/kdboxes_yestree.png" alt="" /></a></p>
<p>The size of the nodes relative to the density of the geometry in their vicinity also speaks towards the efficiency of the new kd-tree construction system: empty spaces are quickly skipped through with enormous bounding boxes, whereas high density areas have much smaller bounding boxes to allow for efficient culling.</p>
<p>Over the next day or so, I fully expect I’ll be able to reintegrate the actual pathtracing core, and have some nice images! Since the part of Takua that needed rewriting the most was the underlying scene and kd-tree system, I will be able to reuse a lot of the BRDF/emittance/etc. stuff from Takua Rev 2.</p>
https://blog.yiningkarlli.com/2013/02/revision-3-kd-treeobjcore.html
Revision 3 KD-Tree/ObjCore
2013-02-15T00:00:00+00:00
2013-02-15T00:00:00+00:00
Yining Karl Li
<p>The one piece of Takua Render that I’ve been proudest of so far has been the KD-Tree and obj mesh processing systems that I built. So of course, over the past week I completely threw away the old versions of KdCore and ObjCore and totally rewrote new versions entirely from scratch. The motive behind this rewrite came mostly from the fact that over the past year, I’ve learned a lot more about KD-Trees and programming in general; as a result, I’m pleased to report that the new versions of KdCore and ObjCore are significantly faster and more memory efficient than previous versions. KdCore3 is now able to process a million objects into an efficient, optimized KD-Tree with a depth of 20 and a minimum of 5 objects per leaf node in roughly one second.</p>
<p>Here’s my kitchen scene, exported to Takua Render’s AvohkiiScene format, and processed through KdCore3. White lines are the wireframe lines for the geometry itself, red lines represent KD-Tree node boundaries:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/kitchen_kd_wireframe.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/kitchen_kd_wireframe.png" alt="" /></a></p>
<p>…and the same image as above, but with only the KD-Tree. You can use the lightbox to switch between the two images for comparisons:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/kitchen_kd.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/kitchen_kd.png" alt="" /></a></p>
<p>One of the most noticeable improvements in KdCore3 over KdCore2, aside from the speed increases, is in how KdCore3 manages empty space. In the older versions of KdCore, empty space was often repeatedly split into multiple nodes, meaning that ray traversal through empty space was very inefficient, since repeated intersection tests would be required only for a ray to pass through the KD-Tree without actually hitting anything. The images in <a href="http://blog.yiningkarlli.com/2012/06/more-kd-tree-fun.html">this old post</a> demonstrate what I mean. The main source of this problem came from how splits were being chosen in KdCore2; in KdCore2, the chosen split was the lowest cost split regardless of axis. As a result, splits were often chosen that resulted in long, narrow nodes going through empty space. In KdCore3, the best split is chosen as the lowest cost split on the longest axis of the node. As a result, empty space is culled much more efficiently.</p>
<p>Another major change to KdCore3 is that the KD-Tree is no longer built recursively. Instead, KdCore3 builds the KD-Tree layer by layer through an iterative approach that is well suited for adaptation to the GPU. Instead of attempting to guess how deep to build the KD-Tree, KdCore3 now just takes a maximum depth from the user and builds the tree no deeper than the given depth. The entire tree is also no longer stored as a series of nodes with pointers to each other, but instead all nodes are stored in a flat array with a clever indexing scheme to allow nodes to implicitly know where their parent and child nodes are within the array. Furthermore, instead of building as a series of nodes with pointers, the tree builds directly into the array format. This array storage format again makes KdCore3 more suitable to a GPU adaptation, and also makes serializing the Kd-Tree out to disk significantly easier for memory caching purposes.</p>
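<p>One common way to realize this sort of implicit indexing, shown here purely as an illustration of the idea (KdCore3’s actual scheme may differ), is the standard binary-heap style layout, where a node’s parent and children are found through pure arithmetic, so no pointers ever need to be stored or serialized:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Implicit binary-heap style layout over a flat array of nodes:
inline int parentIndex(int i)     { return (i - 1) / 2; }
inline int leftChildIndex(int i)  { return 2 * i + 1; }
inline int rightChildIndex(int i) { return 2 * i + 2; }
// A complete tree of maximum depth d fits in a flat array of (2^(d+1) - 1) nodes,
// which also makes serializing the whole tree out to disk a single contiguous write.
</code></pre></div></div>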
<p>Another major change is how split candidates are chosen; in KdCore2, the candidates along each axis were the median of all contained object center-points, the middle of the axis, and some randomly chosen candidates. In KdCore3, the user can specify a number of split candidates to try along each axis, and then KdCore3 will simply divide each axis into that number of equally spaced points and use those points as candidates. As a result, KdCore3 is far more efficient than KdCore2 at calculating split candidates, can often find a better candidate with more deterministic results due to the removal of random choices, and offers the user more control over the quality of the final split.</p>
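<p>The equally spaced candidate generation itself is about as simple as it sounds; a small sketch (with a hypothetical Aabb type and GLM assumed) might look like the following, keeping the candidates strictly inside the node so that no degenerate zero-volume children are produced:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;vector&gt;

struct Aabb { glm::vec3 min; glm::vec3 max; };   // hypothetical node bounds type

// Generate numCandidates equally spaced split positions along the longest axis of a node.
std::vector&lt;float&gt; splitCandidates(const Aabb&amp; bounds, int numCandidates) {
    glm::vec3 extent = bounds.max - bounds.min;
    int axis = 0;
    if (extent.y > extent.x) axis = 1;
    if (extent.z > extent[axis]) axis = 2;
    std::vector&lt;float&gt; candidates;
    for (int i = 1; i &lt;= numCandidates; ++i) {
        float t = float(i) / float(numCandidates + 1);   // strictly interior fractions
        candidates.push_back(bounds.min[axis] + t * extent[axis]);
    }
    return candidates;
}
</code></pre></div></div>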
<p>The following series of images demonstrate KD-Trees built by KdCore3 for the Stanford Dragon with various settings. Again, feel free to use the lightbox for comparisons.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level02.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level02.png" alt="Max depth 2, min objects per node 20, min volume .0001% of whole tree" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level05.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level05.png" alt="Max depth 5, min objects per node 20, min volume .0001% of whole tree" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level10.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level10.png" alt="Max depth 10, min objects per node 20, min volume .0001% of whole tree" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level15.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level15.png" alt="Max depth 15, min objects per node 20, min volume .0001% of whole tree" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level20.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level20.png" alt="Max depth 20, min objects per node 20, min volume .0001% of whole tree" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level20_2.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level20_2.png" alt="Max depth 20, min objects per node 5, min volume .0001% of whole tree" /></a></p>
<p>KdCore3 is also capable of figuring out when the number of nodes in the tree makes traversing the tree more expensive than brute force intersection testing all of the objects in the tree, and will stop tree construction beyond that point. I’ve also given KdCore3 an experimental method for finding the best splits based on a semi-Monte-Carlo approach. In this mode, instead of using evenly spaced split candidates, KdCore3 will make three random guesses, and then based on the relative costs of the guesses, begin making additional guesses with a probability distribution weighted towards wherever the lower relative cost is. With this approach, KdCore3 will eventually arrive at the absolute optimal cost split, although getting to this point may take some time. The number of guesses KdCore3 will attempt can be limited by the user, of course.</p>
<p>Finally, another one of the major improvements I made in KdCore3 was simply better use of C++. Over the past two years, my knowledge of how to write fast, effective C++ has evolved immensely, and I now write code very differently than how I did when I wrote KdCore2 and KdCore1. For example, KdCore3 avoids relying on class inheritance and virtual method table lookup (KdCore2 relied on inheritance quite heavily). Normally, virtual method lookup doesn’t add a noticeable amount of execution time to a single virtual method call, but when repeated for a few million objects, the slowdown becomes extremely apparent. In talking with my friend Robert Mead, I realized that virtual method table lookup in almost all, if not all, implementations today necessarily means a minimum of three pointer lookups in memory to find a function, whereas a direct function call is a single pointer lookup.</p>
<p>If I have time later, I’ll post some benchmarks of KdCore3 versus KdCore2. However, for now, here’s a final pair of images showcasing a scene with highly variable density processed through KdCore3. Note the heavy concentration of nodes where large amounts of geometry exist, and the near total emptiness of the KD-Tree in areas where the scene is sparse:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/scene_kd_wireframe.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/scene_kd_wireframe.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/scene_kd.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/scene_kd.png" alt="" /></a></p>
<p>Next up: implementing some form of highly efficient stackless KD-Tree traversal, possibly even using that <a href="http://blog.yiningkarlli.com/2012/09/thoughts-on-stackless-kd-tree-traversal.html">history based approach I wrote about before</a>!</p>
https://blog.yiningkarlli.com/2013/02/bounding-boxes-for-ellipsoids.html
Bounding Boxes for Ellipsoids
2013-02-08T00:00:00+00:00
2013-02-08T00:00:00+00:00
Yining Karl Li
<p>Update (2014): Tavian Barnes has written a <a href="https://tavianator.com/2014/ellipsoid_bounding_boxes.html">far better / more detailed post</a> on this topic; instead of reading my post, I suggest you go read Tavian’s post instead.</p>
<p>Warning: this post is going to be pretty math-heavy.</p>
<p>Let’s talk about spheres, or more generally, ellipsoids. Specifically, let’s talk about calculating axis aligned bounding boxes for arbitrarily transformed ellipsoids, which is a bit of an interesting problem I recently stumbled upon while working on Takua Rev 3. I’m making this post because finding a solution took a lot of searching and I didn’t find any single collected source of information on this problem, so I figured I’d post it for both my own reference and for anyone else who may find this useful.</p>
<p>So what’s so hard about calculating tight axis aligned bounding boxes for arbitrary ellipsoids?</p>
<p>Well, consider a basic, boring sphere. The easiest way to calculate a tight axis aligned bounding box (or AABB) for a mesh is to simply min/max all of the vertices in the mesh to get two points representing the min and max points of the AABB. Similarly, getting a tight AABB for a box is easy: just use the eight vertices of the box for the min/max process. A naive approach to getting a tight AABB for a sphere seems simple then: along the three axes of the sphere, have one point on each end of the axis on the surface of the sphere, and then min/max. Figure 1. shows a 2D example of this naive approach, to extend the example to 3D, simply add two more points for the Z axis (I drew the illustrations quickly in Photoshop, so apologies for only showing 2D examples):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/figure1.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/figure1.png" alt="Figure 1." /></a></p>
<p>This naive approach, however, quickly fails if we rotate the sphere such that its axes are no longer lined up nicely with the world axes. In Figure 2, our sphere is rotated, resulting in a way too small AABB if we min/max points on the sphere axes:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/figure2.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/figure2.png" alt="Figure 2." /></a></p>
<p>If we scale the sphere such that it becomes an ellipsoid, the same problem persists, as the sphere is just a subtype of ellipsoid. In Figures 3 and 4, the same problem found in Figures 1/2 is illustrated with an ellipsoid:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/figure3.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/figure3.png" alt="Figure 3." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/figure4.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/figure4.png" alt="Figure 4." /></a></p>
<p>One possible solution is to continue using the naive min/max axes approach, but simply expand the resultant AABB by some percentage such that it encompasses the whole sphere. However, we have no way of knowing what percentage will give an exact bound, so the only feasible way to use this fix is by making the AABB always larger than a tight fit would require. As a result, this solution is almost as undesirable as the naive solution, since the whole point of this exercise is to create as tight of an AABB as possible for as efficient intersection culling as possible!</p>
<p>Instead of min/maxing the axes, we need to use some more advanced math to get a tight AABB for ellipsoids.</p>
<p>We begin by noting our transformation matrix, which we’ll call M. We’ll also need the transpose of M, which we’ll call MT. Next, we define a sphere S using a 4x4 matrix:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ r 0 0 0 ]
[ 0 r 0 0 ]
[ 0 0 r 0 ]
[ 0 0 0 -1]
</code></pre></div></div>
<p>where r is the radius of the sphere. So for a unit diameter sphere, r = .5. Once we have built S, we’ll take its inverse, which we’ll call SI.</p>
<p>We now calculate a new 4x4 matrix R = M*SI*MT. R should be symmetric when we’re done, such that R = transpose(R). We’ll assign R’s indices the following names:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R = [ r11 r12 r13 r14 ]
[ r12 r22 r23 r24 ]
[ r13 r23 r23 r24 ]
[ r14 r24 r24 r24 ]
</code></pre></div></div>
<p>Using R, we can now get our bounds:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>zmax = (r23 + sqrt(pow(r23,2) - (r33*r22)) ) / r33;
zmin = (r23 - sqrt(pow(r23,2) - (r33*r22)) ) / r33;
ymax = (r13 + sqrt(pow(r13,2) - (r33*r11)) ) / r33;
ymin = (r13 - sqrt(pow(r13,2) - (r33*r11)) ) / r33;
xmax = (r03 + sqrt(pow(r03,2) - (r33*r00)) ) / r33;
xmin = (r03 - sqrt(pow(r03,2) - (r33*r00)) ) / r33;
</code></pre></div></div>
<p>…and we’re done!</p>
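<p>Putting the whole recipe together in code only takes a few lines; here’s a minimal sketch using GLM for the matrix algebra (any 4x4 matrix library works, and the Aabb struct is purely illustrative). Since R is symmetric, it doesn’t matter whether the library is row-major or column-major for the entries we read, and taking an explicit min/max of the two roots per axis sidesteps having to worry about the sign of r33:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;algorithm&gt;
#include &lt;cmath&gt;
#include &lt;glm/glm.hpp&gt;

struct Aabb { glm::vec3 min; glm::vec3 max; };   // illustrative AABB type

// Solve for the two tangent plane offsets along one axis; returns (min, max).
static glm::vec2 axisBounds(float r_i3, float r_ii, float r33) {
    float d = std::sqrt(r_i3 * r_i3 - r33 * r_ii);
    float a = (r_i3 - d) / r33;
    float b = (r_i3 + d) / r33;
    return glm::vec2(std::min(a, b), std::max(a, b));
}

Aabb ellipsoidBounds(const glm::mat4&amp; M, float r) {
    glm::mat4 S(0.0f);
    S[0][0] = 1.0f; S[1][1] = 1.0f; S[2][2] = 1.0f; S[3][3] = -r * r;
    glm::mat4 R = M * glm::inverse(S) * glm::transpose(M);
    glm::vec2 x = axisBounds(R[3][0], R[0][0], R[3][3]);   // R is symmetric, so
    glm::vec2 y = axisBounds(R[3][1], R[1][1], R[3][3]);   // R[3][i] == R[i][3]
    glm::vec2 z = axisBounds(R[3][2], R[2][2], R[3][3]);
    return { glm::vec3(x[0], y[0], z[0]), glm::vec3(x[1], y[1], z[1]) };
}
</code></pre></div></div>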
<p>Just to prove that it works, a screenshot of a transformed ellipse inside of a tight AABB in 3D from Takua Rev 3’s GL Debug view:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/takua_ellipse.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/takua_ellipse.png" alt="" /></a></p>
<p>I’ve totally glossed over the mathematical rationale behind this method in this post and focused just on how to quickly get a working implementation, but if you want to read more about the actual math behind how it works, these are the two sources I pulled this from:</p>
<p><a href="http://stackoverflow.com/a/4369956">Stack Overflow post by user fd</a></p>
<p><a href="http://www.iquilezles.org/www/articles/ellipses/ellipses.htm">Article by Inigo Quilez</a></p>
<p>In other news, Takua Rev 3’s new scene system is now complete and I am working on a brand new, better, faster, stackless KD-tree implementation. More on that later!</p>
https://blog.yiningkarlli.com/2013/01/revision-3-old-renders.html
Revision 3, Old Renders
2013-01-18T00:00:00+00:00
2013-01-18T00:00:00+00:00
Yining Karl Li
<p>At the beginning of the semester, I decided to re-architect Takua again, hence the lack of updates for a couple of weeks now. I’ll talk more in-depth about the details of how this new architecture works in a later post, so for now I’ll just quickly describe the motivation behind this second round of re-architecting. As I <a href="http://blog.yiningkarlli.com/2012/09/takuaavohkii-render.html">wrote about before</a>, I’ve been keeping parallel CPU and GPU branches of my renderer so far, but the two branches have increasingly diverged. On top of that, the GPU branch of my renderer, although significantly better organized than the experimental CUDA renderer from spring 2012, still is rather suboptimal; after TAing <a href="http://cis565-fall-2012.github.com/">CIS565</a> for a semester, I’ve developed what I think are some better ways of architecting CUDA code. Over winter break, I began to wonder if merging the CPU and GPU branches might be possible, and if such a task could be done, how I might go about doing it.</p>
<p>This newest re-structuring of Takua accomplishes that goal. I’m calling this new version of Takua “Revision 3”, as it is the third major rewrite.</p>
<p>My new architecture centers around a couple of observations. First, we can observe that the lowest common denominator (so to speak) for structured data in CUDA and C++ is… a struct. Similarly, the easiest way to recycle code between CUDA and C++ is to implement code as inlineable, C style functions that can either be embedded in a CUDA kernel at compile time, or wrapped within a C++ class for use in C++. Therefore, one possible way to unify CPU C++ and GPU CUDA codebases could be to implement core components of the renderer using structs and C-style, inlineable functions, allowing easy integration into CUDA kernels, and then write thin wrapper classes around said structs and functions to allow for nice, object oriented C++ code. This exact system is how I am building Takua Revision 3; the end result should be a unified codebase that can compile to both CPU and GPU versions, and allow for both versions to develop in near lockstep.</p>
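<p>As a concrete illustration of this idea, here’s a hypothetical sketch (not Takua’s actual code) of what one of these shared core types might look like. The struct and the C-style, inlineable function compile as plain C++ on the CPU and can be inlined directly into a CUDA kernel when compiled with nvcc, while a thin wrapper class provides the object oriented interface on the C++ side; the SHARED_INLINE macro name is just illustrative.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// When compiling with nvcc, mark shared functions as usable on both host and
// device; when compiling as plain C++, they are just ordinary inline functions.
#ifdef __CUDACC__
#define SHARED_INLINE __host__ __device__ inline
#else
#define SHARED_INLINE inline
#endif

// Lowest-common-denominator data layout: a plain struct, usable from C++ and CUDA.
struct RayData {
    float originX, originY, originZ;
    float dirX, dirY, dirZ;
};

// C-style, inlineable core function shared between the CPU and GPU codepaths.
SHARED_INLINE void rayPointAlong(const RayData& ray, float t,
                                 float& x, float& y, float& z) {
    x = ray.originX + t * ray.dirX;
    y = ray.originY + t * ray.dirY;
    z = ray.originZ + t * ray.dirZ;
}

// Thin wrapper class so the CPU-side C++ codebase still reads as nice OO code.
class Ray {
public:
    explicit Ray(const RayData& data) : m_data(data) {}
    void pointAlong(float t, float& x, float& y, float& z) const {
        rayPointAlong(m_data, t, x, y, z);
    }
private:
    RayData m_data;
};
</code></pre></div></div>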
<p>Again, I’ll go into a more detailed explanation once this process is complete.</p>
<p>I’ll leave this post with a slightly orthogonal note; whilst in the process of merging code, I found some images from Takua Revision 1 that I never posted for some reason. Here’s a particularly cool pair of images from when I was implementing depth of field. The first image depicts a glass Stanford dragon without any depth of field, and the second image depicts the same exact scene with some crazy shallow aperture (I don’t remember the exact settings). You can tell these are from the days of Takua Revision 1 by the ceiling; I often made the entire ceiling a light source to speed up renders back then, until Revision 2’s huge performance increases rendered cheats like that unnecessary.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Jan/glassdragon.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Jan/glassdragon.png" alt="Glass Stanford dragon without depth of field" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Jan/glassdragon_dof.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Jan/glassdragon_dof.png" alt="Glass Stanford dragon with depth of field" /></a></p>
https://blog.yiningkarlli.com/2012/12/texture-mapping.html
Texture Mapping
2012-12-18T00:00:00+00:00
2012-12-18T00:00:00+00:00
Yining Karl Li
<p>A few weeks back I started work on another piece of super low-hanging fruit: texture mapping! Before I delve into the details, here’s a test render showing three texture mapped spheres with varying degrees of glossiness in a glossy-walled Cornell box. I was also playing with logos for Takua render and put a test logo idea on the back wall for fun:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/texture3.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/texture3.png" alt="" /></a></p>
<p>…and the same scene with the camera tilted down just to show off the glossy floor (because I really like the blurry glossy reflections):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/texture2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/texture2.png" alt="" /></a></p>
<p>My texturing system can, of course, support textures of arbitrary resolution. The black and white grid and colored UV tile textures in the above render are square 1024x1024, while the Earth texture is rectangular 1024x512. Huge textures are handled just fine, as demonstrated by the following render using a giant 2048x2048, color tweaked version of <a href="http://simoncpage.co.uk/blog/2012/03/ipad-hd-retina-wallpaper/">Simon Page’s Space Janus wallpaper</a>:</p>
<p><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/texture4.png" alt="" /></p>
<p>Of course UV transformations are supported. Here’s the same texture with a 35 degree UV rotation applied and tiling switched on:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/sampleScene_uvrotated.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/sampleScene_uvrotated.png" alt="" /></a></p>
<p>Since memory is always at a premium, especially on the GPU, I’ve implemented textures in a fashion inspired by geometry instancing and node based material systems, such as the system for Maya. Inside of my renderer, I represent texture files as a file node containing the raw image data, streamed from disk via <a href="http://nothings.org/stb_image.c">stb_image</a>. I then apply transformations, UV operations, etc through a texture properties node, which maintains a pointer to the relevant texture file node, and then materials point to whatever texture properties nodes they need. This way, texture data can be read and stored once in memory and recycled as many times as needed, meaning that a well formatted scene file can altogether eliminate the need for redundant texture read/storage in memory. This system allows me to create amusing scenes like the following one, where a single striped texture is reused in a number of materials with varied properties:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/stripes.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/stripes.png" alt="" /></a></p>
<p>Admittedly I made that stripe texture really quickly in Photoshop without too much care for straightness of lines, so it doesn’t actually tile very well. That’s why the sphere in the lower front shows a discontinuity in its texture… it’s not glitchy UVing, just a crappy texture!</p>
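<p>In code, the node-based texture instancing idea described above might look roughly like the following sketch; the type names are hypothetical stand-ins rather than my actual classes, but the ownership structure (many properties nodes sharing one file node, and many materials sharing properties nodes) is the point.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <memory>
#include <string>
#include <vector>

// A file node: raw image data, loaded from disk once (e.g. via stb_image).
struct TextureFile {
    std::string filepath;
    int width = 0;
    int height = 0;
    std::vector<unsigned char> pixels;
};

// A texture properties node: UV transforms and options layered on top of a
// file node. Many properties nodes can point at the same file node.
struct TextureProperties {
    std::shared_ptr<TextureFile> file;
    float uvRotationDegrees = 0.0f;
    bool tile = false;
};

// Materials point at whatever properties nodes they need; any property can be
// texture driven, so texture data is read and stored once and reused freely.
struct Material {
    std::shared_ptr<TextureProperties> diffuseColor;
    std::shared_ptr<TextureProperties> glossiness;
};
</code></pre></div></div>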
<p>I’ve also gone ahead and extended my materials system to allow any material property to be driven with a texture. In fact, the stripe room render above is using the same stripe texture to drive reflectiveness on the side walls, resulting in reflective surfaces where the texture is black and diffuse surfaces where the texture is white. Here’s another example of texture driven material properties showing emission being driven using the same color-adjusted Space Janus texture from before:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/texture_light_big.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/texture_light_big.png" alt="" /></a></p>
<p>Even refractive and reflective index of refraction can be driven with textures, which can yield some weird/interesting results. Here are a pair of renders showing a refractive red cube with uniform IOR, and with IOR driven with a Perlin noise map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass.1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass.1.png" alt="Uniform refractive IOR" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass_uv.0.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass_uv.0.png" alt="Refractive IOR driven with a Perlin noise texture map" /></a></p>
<p>The nice thing about a node-style material representation is that I should be able to easily plug in procedural functions in place of textures whenever I get around to implementing some (that way I can use procedural Perlin noise instead of using a noise texture).</p>
<p>Here’s an admittedly kind of ugly render using the color UV grid texture to drive refractive color:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass_color.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass_color.png" alt="" /></a></p>
<p>For some properties, I’ve had to add a requirement that the user specify a range of valid values when using a texture map, since RGB values don’t map well to said properties. An example would be glossiness, where a gloss value range of 0% to 100% leaves little room for detailed adjustment. Of course this issue can be fixed by adding support for floating point image formats such as OpenEXR, which is coming very soon! In the following render, the back wall’s glossiness is being driven using the stripe texture (texture driven IOR is also still in effect on the red refractive cube):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_gloss.0.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_gloss.0.png" alt="" /></a></p>
<p>Of course, even with nice instancing schemes, textures can potentially take up a gargantuan amount of memory, which poses a huge problem in the GPU world where onboard memory is at a premium. I still need to think more about how I’m going to deal with memory footprints larger than on-device memory, but at the moment my plan is to let the renderer allocate and overflow into pinned host memory whenever it detects that the needed footprint is within some margin of total available device memory. This concern is also a major reason why I’ve decided to stick with CUDA for now… until OpenCL gets support for a unified address space for pinned memory, I’m not wholly sure how I’m supposed to deal with memory overflow issues in OpenCL. I haven’t reexamined OpenCL in a little while now though, so perhaps it is <a href="http://blog.vsampath.com/2012/05/ed-opencl-vs-cuda-mid-2012-edition.html">time to take another look</a>.</p>
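<p>As a sketch of what that overflow strategy could look like with the CUDA runtime (hypothetical code, and assuming the device supports mapped pinned memory), the allocation decision might be as simple as checking free device memory and falling back to cudaHostAlloc:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cuda_runtime.h>
#include <cstddef>

// Allocate texture storage on-device if it fits with some safety margin to
// spare; otherwise fall back to pinned, mapped host memory that the GPU can
// still address directly (at a bandwidth penalty).
void* allocateTextureStorage(size_t bytes, size_t safetyMargin) {
    size_t freeBytes = 0;
    size_t totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    void* devicePointer = nullptr;
    if (bytes + safetyMargin < freeBytes) {
        cudaMalloc(&devicePointer, bytes);
        return devicePointer;
    }
    void* hostPointer = nullptr;
    cudaHostAlloc(&hostPointer, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devicePointer, hostPointer, 0);
    return devicePointer;
}
</code></pre></div></div>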
<p>Unfortunately, something I discovered while in the process of extending my material system to support texture driven properties is that my renderer could probably use a bit of refactoring for the sake of organization and readability. Since I now have some time over winter break and am planning on making my Github repo for Takua-RT public soon, I’ll probably undertake a bit of code refactoring over the next few weeks.</p>
https://blog.yiningkarlli.com/2012/12/blurred-glossy-reflections.html
Blurred Glossy Reflections
2012-12-07T00:00:00+00:00
2012-12-07T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossy_glossy_test.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossy_glossy_test.png" alt="" /></a></p>
<p>Over the past few months I haven’t been making as much progress on my renderer as I would have liked, mainly because another major project has been occupying most of my attention: TAing/restructuring the <a href="http://cis565-fall-2012.github.com/">GPU Programming</a> course here at Penn. I’ll probably write a post at the end of the semester with detailed thoughts and comments about that experience.</p>
<p>I recently had a bit of extra time, which I used to tackle a piece of super low hanging fruit: blurred glossy reflections. The simplest brute force approach to blurred glossy reflections is to take the reflected ray direction from specular reflection, construct a lobe around that ray, and sample across the lobe instead of only along the reflected direction. The wider the lobe, the blurrier the glossy reflection gets. The following diagram, borrowed from Wikipedia, illustrates this property:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossylobe.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossylobe.png" alt="" /></a></p>
<p>In a normal raytracer or rasterization based renderer, blurred glossy reflections require something of a compromise between speed and visual quality (much like many other effects!), since using a large number of samples within the glossy specular lobe to achieve a high quality reflection can be prohibitively expensive. This cost-quality tradeoff is therefore similar to the tradeoffs that must be made in any distributed raytracing effect. However, in a pathtracer, we’re already using a massive number of samples, so we can fold the blurred glossy reflection work into our existing high sample count. In a GPU renderer, we have massive amounts of compute as well, making blurred glossy reflections far more tractable than in a traditional system.</p>
<p>The image at the top of this post shows three spheres of varying gloss amounts in a modified Cornell box with a glossy floor and reflective colored walls, rendered entirely inside of Takua-RT. Glossy to glossy light transport is an especially inefficient scenario to resolve in pathtracing, but throwing brute force GPU compute at it allows for arriving at a good solution reasonably quickly: the above image took around a minute to render at 800x800 resolution. Here is another test of blurred glossy reflections, this time in a standard diffuse Cornell box:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_1.png" alt="" /></a></p>
<p>…and some tests showing varying degrees of gloss, within a modified Cornell box with glossy left and right walls. Needless to say, all of these images were also rendered entirely inside of Takua-RT.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_4.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_4.png" alt="Full specular reflection" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_3.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_3.png" alt="Approximately 10% blurred glossy reflection" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_2.png" alt="Approximately 30% blurred glossy reflection" /></a></p>
<p>Finally, here’s another version of the first image in this post, but with the camera in the wrong place. You can see a bit of the stand-in sky I have right now. I’m working on a sun & sky system, but since it’s not ready yet, I have a simple gradient serving as a stand-in. I’ll post more about sun & sky when I’m closer to finishing with it… I’m not doing anything <a href="http://skyrenderer.blogspot.com/">fancy like Peter Kutz is doing</a> (his sky renderer blog is definitely worth checking out, by the way), just standard Preetham et al. style.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossy_glossy_sky.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossy_glossy_sky.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/09/thoughts-on-stackless-kd-tree-traversal.html
Thoughts on Stackless KD-Tree Traversal
2012-09-15T00:00:00+00:00
2012-09-15T00:00:00+00:00
Yining Karl Li
<p>Edit: <a href="http://yiningkarlli.blogspot.com/2012/09/thoughts-on-stackless-kd-tree-traversal.html?showComment=1353951085399#c9086262641390319736">Erwin Coumans</a> in the comments section has pointed me to a <a href="http://twvideo01.ubm-us.net/o1/vault/gdc09/slides/takahiroGDC09s_1.pdf">GDC 2009 talk by Takahiro Harada</a> proposing something called Tree Traversal using History Flags, which is essentially the same as the idea proposed in this post, with the small exception that Harada’s technique uses a bit field to track previously visited nodes on the up traverse. I think that Harada’s technique is actually better than the pointer check I wrote about in this post, since keeping a bit field would allow for tracking the previously visited node without having to go back to global memory to do a node check. In other words, the bit field method allows for less thrashing of global memory, which I should think allows for a nice performance edge. So, much as I suspected, the idea in this post is one that folks smarter than me have arrived at previously, and my knowledge of the literature on this topic is indeed incomplete. Much thanks to Erwin for pointing me to the Harada talk! The original post is preserved below, in case anyone still has an interest in reading it.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_05.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_05.png" alt="" /></a></p>
<p>Of course, one of the biggest challenges to implementing a CUDA pathtracer is the lack of recursion on pre-Fermi GPUs. Since I intend for Takua-RT to be able to run on any CUDA enabled GPU, I necessarily have to work with the assumption that I won’t have recursion support. Getting around this problem in the core pathtracer is not actually a significant issue, as building raytracing systems that operate in an iterative fashion as opposed to in a recursive fashion is a well-covered topic.</p>
<p>Traversing a kd-tree without recursion, however, is a more challenging proposition. So far as I can tell from a very cursory glance at existing literature on the topic, there are presently two major approaches: fully stack-less methods that require some amount of pre-processing of the kd-tree, such as the <a href="http://graphics.cs.uni-sb.de/fileadmin/cguds/papers/2007/popov_07_GPURT/Popov_et_al._-_Stackless_KD-Tree_Traversal_for_High_Performance_GPU_Ray_Tracing.pdf">rope-based method presented in Popov et al. [2007]</a>, and methods utilizing a short stack or something similar, such as the <a href="http://www.kunzhou.net/2008/kdtree.pdf">method presented in Zhou et al. [2008]</a>. I’m in the process of reading both of these papers more carefully, and will probably explore at least one of these approaches soon. In the meantime, however, I thought it might be a fun exercise to try to come up with some solution of my own, which I’ll summarize in this post. I have to admit that I have no idea if this is actually a novel approach, or if it’s something that somebody also came up with and rejected a long time ago and I just haven’t found yet. My coverage of the literature in this area is highly incomplete, so if you, the reader, are aware of a pre-existing version of this idea, please let me know so that I can attribute it properly!</p>
<p>The basic idea I’m starting with is that when traversing a KD-tree (or any other type of tree, for that matter), at a given node, there’s only a finite number of directions one can go in, and a finite number of previous nodes one could have arrived at the current node from. In other words, one could conceivably define a finite-state machine type system for traversing a KD-tree, given an input ray. I say finite-state machine type, because what I shall define here isn’t actually strictly a FSM, as this method requires storing information about the previous state in addition to the current state. So here we go:</p>
<p>We begin by tracking two pieces of information: what the current node we are at is, and what direction we had to take from the previous node to get to the current node. There are three possible directions we could have come from:</p>
<ol>
<li>Down from the current node’s parent node</li>
<li>Up from the current node’s left child</li>
<li>Up from the current node’s right child</li>
</ol>
<p>Similarly, there are only three directions we can possibly travel in from the current node:</p>
<ol>
<li>Up to the current node’s parent node</li>
<li>Down to the current node’s left child</li>
<li>Down to the current node’s right child</li>
</ol>
<p>When we travel up from the current node to its parent, we can easily figure out if we are traveling up from the right or the left by looking at whether the current node is the parent node’s left or right child.</p>
<p>Now we need a few rules on which direction to travel in given the direction we came from and some information on where our ray currently is in space:</p>
<ol>
<li>If we came down from the parent node and if the current node is not a leaf node, intersection test our ray with both children of the current node. If the ray only intersects one of the children, traverse down to that child. If the ray intersects both of the children, traverse down to the left child.</li>
<li>If we came down from the parent node and if the current node is a leaf node, carry out intersection tests between the ray and the contents of the node and store the nearest intersection.</li>
<li>If we came up from the left child, intersection test our ray with the right child of the current node. If we have an intersection, traverse down the right child. If we don’t have an intersection, traverse upwards to the parent.</li>
<li>If we came up from the right child, traverse upwards to the parent.</li>
</ol>
<p>That’s it. With those four rules, we can traverse an entire KD-Tree in a DFS fashion, while skipping branches that our ray does not intersect for a more efficient traverse, and avoiding any use of recursion or the use of a stack in memory.</p>
<p>There is, of course, the edge case that our ray is coming in to the tree from the “back”, so that the right child of each node is “in front” of the left child instead of “behind”, but we can easily deal with this case by simply testing which side of the KD-tree we’re entering from and swapping left and right in our ruleset accordingly.</p>
<p>I haven’t actually gotten around to implementing this idea yet (as of September 15th, when I started writing this post, although this post may get published much later), so I’m not sure what the performance looks like. There are some inefficiencies in how many nodes our traverse will end up visiting, but on the flipside, we won’t need to keep much of anything in memory except for two pieces of state information and the actual KD-tree itself. On the GPU, I might run into implementation level problems that could impact performance, such as too many branching statements or memory thrashing if the KD-tree is kept in global memory and a gazillion threads try to traverse it at once, so these issues will need to be addressed later.</p>
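<p>For concreteness, here’s a rough, untested sketch of how the four rules above might translate into code; KDNode, the bounding box test, and the leaf intersection test here are just illustrative stand-ins, not an actual implementation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Traversal state: which direction we arrived at the current node from.
enum class CameFrom { Parent, LeftChild, RightChild };

struct KDNode {
    KDNode* parent = nullptr;
    KDNode* left = nullptr;
    KDNode* right = nullptr;
    bool isLeaf = false;
};

bool rayHitsNode(const KDNode*) { return false; }   // stand-in: ray vs. node bounds
void intersectLeafContents(const KDNode*) {}        // stand-in: ray vs. leaf primitives

void traverse(KDNode* root) {
    KDNode* current = root;
    CameFrom state = CameFrom::Parent;
    // Move to the parent, recording whether we came up from its left or right child.
    auto goUp = [&]() {
        KDNode* parent = current->parent;
        state = (parent != nullptr && current == parent->left) ? CameFrom::LeftChild
                                                               : CameFrom::RightChild;
        current = parent;
    };
    while (current != nullptr) {
        if (state == CameFrom::Parent) {
            if (current->isLeaf) {                      // rule 2
                intersectLeafContents(current);
                goUp();
            } else if (rayHitsNode(current->left)) {    // rule 1: prefer the left child
                current = current->left;
            } else if (rayHitsNode(current->right)) {
                current = current->right;
            } else {
                goUp();                                 // ray misses both children
            }
        } else if (state == CameFrom::LeftChild) {      // rule 3
            if (rayHitsNode(current->right)) {
                state = CameFrom::Parent;
                current = current->right;
            } else {
                goUp();
            }
        } else {                                        // rule 4: came up from the right
            goUp();
        }
    }
}
</code></pre></div></div>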
<p>Again, if you, the reader, knows of this idea from a pre-existing place, please let me know! Also, if you see a gaping hole in my logic, please let me know too!</p>
<p>Since this has been a very text heavy post, I’ll close with some pictures of a KD-tree representing the scene from the <a href="http://blog.yiningkarlli.com/2012/09/takuaavohkii-render.html">Takua-RT post</a>. They don’t really have much to do with the traverse method presented in this post, but they are KD-tree related!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53735317" frameborder="0">"Orbital" KD-Tree</iframe></div>
<p>Vimeo’s compression really does not like thin narrow lines, so here are some stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_02.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_02.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_03.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_03.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_04.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_04.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/09/takuaavohkii-render.html
TAKUA/Avohkii Render
2012-09-10T00:00:00+00:00
2012-09-10T00:00:00+00:00
Yining Karl Li
<p><div class="embed-container"><iframe src="https://player.vimeo.com/video/53735318" frameborder="0">Takua-RT "Orbital" Demo</iframe></div></p>
<p>One question I’ve been asking myself ever since my friend <a href="http://peterkutz.com/">Peter Kutz</a> and I wrapped our little <a href="http://gpupathtracer.blogspot.com/">GPU Pathtracer experiment</a> is “why am I writing Takua Render as a CPU-only renderer?” One of the biggest lessons learned from the GPU Pathtracer experiment was that GPUs can indeed provide vast quantities of compute suitable for use in pathtracing rendering. After thinking for a bit at the beginning of the summer, I’ve decided that since I’m starting my renderer from scratch and don’t have to worry about the tons of legacy that real-world big boy renderers like RenderMan have to deal with, there is no reason why I shouldn’t architect my renderer to use whatever compute power is available.</p>
<p>With that being said, from this point forward, I will be concurrently developing CPU and GPU based implementations of Takua Render. I call this new overall project TAKUA/Avohkii, mainly because Avohkii is a cool name. Within this project, I will continue developing the C++ based x86 version of Takua, which will retain the name of just Takua, and I will also work on a CUDA based GPU version, called Takua-RT, with full feature parity. I’m also planning on investigating the idea of an ARM port, but that’s an idea for later. I’m going to stick with CUDA for the GPU version now since I know CUDA better than OpenCL and since almost all of the hardware I have access to develop and test on right now is NVIDIA based (the SIG Lab runs on NVIDIA cards…), but that could change down the line. The eventual goal is to have a set of renderers that together cover as many hardware bases as possible, and can all interoperate and intercommunicate for farming purposes.</p>
<p>I’ve already gone ahead and finished the initial work of porting Takua Render to CUDA. One major lesson learned from the GPU Pathtracer experiment was that enormous CUDA kernels tend to run into a number of problems, much like massive monolithic GL shaders do. One problem in particular is that enormous kernels tend to take a long time to run and can result in the GPU driver terminating the kernel, since NVIDIA’s drivers by default assume that device threads taking longer than 2 seconds to run are hanging and cull said threads. In the GPU Pathtracer experiment, we used a giant monolithic kernel for a single ray bounce, which ran into problems as geometry count went up and subsequently intersection testing and therefore kernel execution time also increased. For Takua-RT, I’ve decided to split a single ray bounce into a sequence of micro-kernels that launch in succession. Basically, each operation is now a kernel; each intersection test is a kernel, BRDF evaluation is a kernel, etc. While I suppose I lose a bit of time in having numerous kernel launches, I am getting around the kernel time-out problem.</p>
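<p>Structurally, the micro-kernel approach looks something like the following hypothetical sketch (kernel names and the per-ray state layout are illustrative, not Takua-RT’s actual code): each stage of a bounce is its own short-lived kernel launch, so no single launch runs long enough to trip the driver’s watchdog.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Per-ray state kept in device memory between micro-kernel launches.
struct RayState {
    float origin[3];
    float direction[3];
    float throughput[3];
    int   pixelIndex;
    bool  alive;
};

__global__ void intersectSceneKernel(RayState* rays, int numRays) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRays) return;
    // ...intersection testing for ray i goes here...
}

__global__ void evaluateBRDFKernel(RayState* rays, int numRays) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRays) return;
    // ...BRDF evaluation and the next bounce direction for ray i goes here...
}

// One ray bounce becomes a sequence of small kernel launches instead of one
// monolithic kernel.
void traceOneBounce(RayState* rays, int numRays) {
    const int blockSize = 256;
    const int numBlocks = (numRays + blockSize - 1) / blockSize;
    intersectSceneKernel<<<numBlocks, blockSize>>>(rays, numRays);
    evaluateBRDFKernel<<<numBlocks, blockSize>>>(rays, numRays);
}
</code></pre></div></div>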
<p>Another important lesson learned was that culling useless kernel launches is extremely important. I’m checking for empty threads at the end of each ray bounce and culling via stream compaction for now, but this can of course be further extended to the micro-kernels for intersection testing later.</p>
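<p>One common way to do that kind of compaction is with thrust; here’s a hypothetical sketch (with a minimal illustrative per-ray struct) that removes dead rays after a bounce and returns how many rays are still alive:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <thrust/device_ptr.h>
#include <thrust/remove.h>

// Minimal per-ray state for illustration; only the alive flag matters here.
struct CompactRayState {
    bool alive;
    int  pixelIndex;
};

// Predicate: true for rays that should be removed from the working set.
struct IsDead {
    __host__ __device__ bool operator()(const CompactRayState& ray) const {
        return !ray.alive;
    }
};

// Compact the ray array in device memory and return how many rays remain.
int compactRays(CompactRayState* deviceRays, int numRays) {
    thrust::device_ptr<CompactRayState> begin(deviceRays);
    thrust::device_ptr<CompactRayState> newEnd =
        thrust::remove_if(begin, begin + numRays, IsDead());
    return static_cast<int>(newEnd - begin);
}
</code></pre></div></div>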
<p>Anyhow, enough text. Takua-RT, even in its super-naive unoptimized CUDA-port state right now, is already so much faster than the CPU version that I can render frames with fairly high convergence in seconds to minutes. That means the renderer is now fast enough for use on rendering animations, such as the one at the top of this post. No post-processing whatsoever was applied, aside from my name watermark in the lower right hand corner. The following images are raw output frames from Takua-RT, this time straight from the renderer, without even watermarking:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.60.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.60.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.100.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.100.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.169.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.169.png" alt="" /></a></p>
<p>Each of these frames represents 5000 iterations of convergence, and took about a minute to render on an NVIDIA Geforce GTX480. The flickering in the glass ball in the animated version comes from having a low trace depth of 3 bounces, including for glass surfaces.</p>
https://blog.yiningkarlli.com/2012/09/jello-kd-tree.html
Jello KD-Tree
2012-09-09T00:00:00+00:00
2012-09-09T00:00:00+00:00
Yining Karl Li
<p>I’ve started an effort to clean up, rewrite, and enhance my ObjCore library, and part of that effort includes taking my <a href="http://blog.yiningkarlli.com/2012/06/more-kd-tree-fun.html">KD-Tree viewer from Takua Render</a> and making it just a standard component of ObjCore. As a result, I can now plug the latest version of ObjCore into any of my projects that use it and quickly wire up support for viewing the KD-Tree view for that project. Here’s the <a href="http://blog.yiningkarlli.com/2012/05/more-fun-with-jello.html">jello sim from a few months back</a> visualized as a KD-Tree:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53735319" frameborder="0">Jello Sim KD-Tree</iframe></div>
<p>I’ve adopted a new standard grey background for OpenGL tests, since I’ve found that the higher amount of contrast this darker grey provides plays nicer with Vimeo’s compression for a clearer result. But of course I’ll still post stills too.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello0.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello0.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello2.png" alt="" /></a></p>
<p>Hopefully at the end of this clean up process, I’ll have ObjCore in a solid enough of a state to post to Github.</p>
https://blog.yiningkarlli.com/2012/09/volumetric-renderer-revisited.html
Volumetric Renderer Revisited
2012-09-05T00:00:00+00:00
2012-09-05T00:00:00+00:00
Yining Karl Li
<p>I’ve been meaning to add animation support to my <a href="http://blog.yiningkarlli.com/2011/10/a-volumetric-renderer-for-rendering-volumes.html">volume renderer</a> for demoreel purposes for a while now, so I did that this week! Here’s a rotating cloud animation:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53634239" frameborder="0">Animated Cloud Render Test</iframe></div>
<p>…and of course, a still or two:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/cloud1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/cloud1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/cloud2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/cloud2.png" alt="" /></a></p>
<p>Instead of just rotating the camera around the cloud, I wanted the cloud itself to rotate but have the noise field it samples stay stationary, resulting in a cool kind of morphing effect with the cloud’s actual shape. In order to author animations easily, I implemented a fairly rough, crude version of Maya integration. I wrote a script that will take spheres and point lights in Maya and build a scene file for my volume renderer using the Maya spheres to define cloud clusters and the point lights to define… well… lights. With an easy bit of scripting, I can do this for each frame in a keyframed animation in Maya and then simply call the volume renderer once for each frame. Here’s what the above animation’s Maya scene file looks like:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/maya.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/maya.png" alt="" /></a></p>
<p>Also, since my pseudo-blackbody trick was originally intended to simulate the appearance of a fireball, I tried creating an animation of a fireball by just scaling a sphere:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53714601" frameborder="0">Animated Pseudo-Blackbody Test</iframe></div>
<p>…and as usual again, stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/blackbody1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/blackbody1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/blackbody2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/blackbody2.png" alt="" /></a></p>
<p>So that’s that for the volume renderer for now! I think this might be the end of the line for this particular incarnation of the volume renderer (it remains the only piece of tech I’m keeping around that is more or less unmodified from its original CIS460/560 state). I think the next time I revisit the volume renderer, I’m either going to port it entirely to CUDA, as my good friend <a href="https://vimeo.com/user6054073">Adam Mally</a> did with his, or I’m going to integrate it into my renderer project, <a href="http://peterkutz.com/">Peter Kutz</a> style.</p>
https://blog.yiningkarlli.com/2012/08/more-experiments-with-trees.html
More Experiments with Trees
2012-08-16T00:00:00+00:00
2012-08-16T00:00:00+00:00
Yining Karl Li
<p>Every once in a while, I <a href="http://blog.yiningkarlli.com/2011/03/autumn-tree.html">return</a> to <a href="http://blog.yiningkarlli.com/2011/03/vray-tree.html">trying</a> to make a good looking tree. Here’s a frame from my latest attempt:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/leaves.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/leaves.png" alt="" /></a></p>
<p>Have I finally managed to create a tree that I’m happy with? Well….. no. But I do think this batch comes closer than previous attempts! I’ve had a workflow for creating base tree geometry for a while now that I’m fairly pleased with, which is centered around using OnyxTREE as a starting point and then custom sculpting in Maya and Mudbox. However, I haven’t tried actually animating trees before, and shading trees properly has remained a challenge. So, my goal this time around was to see if I could make any progress in animating and shading trees.</p>
<p>As a starting point, I played with just using the built in wind simulation tools in OnyxTREE, which was admittedly difficult to control. I found that having medium to high windspeeds usually led to random branches glitching out and jumping all over the place. I also wanted to make a weeping willow style tree, and even medium-low windspeeds often resulted in hilarious results like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/crazywind.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/crazywind.png" alt="It turns out OnyxTREE runs fine in Wine on OSX. Huh" /></a></p>
<p>A bigger problem though was the sheer amount of storage space exporting animated tree sequences from Onyx to Maya requires. The only way to bring Onyx simulations into programs that aren’t 3ds Max is to export the simulation as an obj sequence from Onyx and then import the sequence into whatever program. Maya doesn’t have a native method to import obj sequences, so I wrote a custom Python script to take care of it for me. Here’s a short compilation of some results:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53572074" frameborder="0">Windy Tree Maya Tests</iframe></div>
<p>One important thing I discovered was that the vertex numbering in each obj frame exported from Onyx remains consistent; this fact allowed for an important improvement. Instead of storing a gazillion individual frames of obj meshes, I experimented with dropping a large number of intermediate frames and leaving a relatively smaller number of keyframes which I then used as blendshape frames with more scripting hackery. This method works rather well; in the above video, the weeping willow at the end uses this approach. There is, however, a significant flaw with this entire Onyx based animation workflow: geometry clipping. Onyx’s system does not resolve cases where leaves and entire branches clip through each other… while from a distance the trees look fine, up close the clipping can become quite apparent. For this reason, I’m thinking about abandoning the Onyx approach altogether down the line and perhaps experimenting with building my own tree rigs and procedurally animating them. That’s a project for another day, however.</p>
<p>On the shading front, my basic approach is still the same: use a Vray double sided material with a waxier, more specular shader for the “front” of the leaves and a more diffuse shader for the “back”. In real life, leaves of course display an enormous amount of subsurface scattering, but leaves are a special case for subsurface scatter: they’re really really thin! Normally subsurface scattering is a rather expensive effect to render, but for thin material cases, the Vray double sided material can quite efficiently approximate the subsurface effect for a fraction of the rendertime.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/doublesidedmat.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/doublesidedmat.png" alt="" /></a></p>
<p>Bark is fairly straightforward too; it all comes down to the displacement and bump mapping. Unfortunately, the limbs in the tree models I made this time around were straight because I forgot to go in and vary them up/sculpt them. Because of the straightness, my tree twigs don’t look very good this time, even with a decent shader. Must remember for next time! Creating displacement bark maps from photographs or images sourced from Google Image Search or whatever is really simple; take your color texture into Photoshop, slam it to black and white, and adjust contrast as necessary:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/barkmaps.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/barkmaps.png" alt="" /></a></p>
<p>Here’s a few seconds of rendered output with the camera inside of the tree’s leaf canopy, pointed skyward. It’s not exactly totally realistic looking, meaning it needs more work of course, but I do like the green-ness of the whole thing. More importantly, you can see the subsurface effect on the leaves from the double sided material!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53569412" frameborder="0">Windy Tree Render Test</iframe></div>
<p>Something that continues to prove challenging is how my shaders hold up at various distances. The same exact shader (with a different leaf texture) looks great from a distance, but loses realism when the camera is closer. I did a test render of the weeping willow from further away using the same shader, and it looks a lot better. Still not perfect, but closer than previous attempts:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53569411" frameborder="0">Willow Wind Test</iframe></div>
<p>…and of course, a pretty still or two:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/willow1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/willow1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/willow2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/willow2.png" alt="" /></a></p>
<p>A fun experiment I tried was building a shader that can imitate the color change that occurs as fall comes around. This shader is in no way physically based, it’s using just a pure mix function controlled through keyframes. Here’s a quick test showing the result:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/46474571" frameborder="0">Tree Color Test</iframe></div>
<p>Eventually building a physically based leaf BSSRDF system might be a fun project for my own renderer. Speaking of which, I couldn’t resist throwing the weeping willow model through my KD-tree library to get a tree KD-tree:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53546737" frameborder="0">Tree KD-Tree</iframe></div>
<p>Since the Vimeo compression kind of borks thin lines, here’s a few stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/kd1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/kd1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/kd2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/kd2.png" alt="" /></a></p>
<p>Alright, that’s all for this time! I will most likely return to trees yet again perhaps a few weeks or months from now, but for now, much has been learned!</p>
https://blog.yiningkarlli.com/2012/07/random-point-sampling-on-surfaces.html
Random Point Sampling On Surfaces
2012-07-14T00:00:00+00:00
2012-07-14T00:00:00+00:00
Yining Karl Li
<p>Just a heads up, this post is admittedly more of a brain dump for myself than it is anything else.</p>
<p>A while back I implemented a couple of fast methods to generate random points on geometry surfaces, which will be useful for a number of applications, such as direct lighting calculations involving area lights.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint0.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint0.png" alt="" /></a></p>
<p>The way I’m sampling random points varies by geometry type, but all methods are pretty simple. Right now the system is implemented such that I can give the renderer a global point density to follow, and points will be generated according to that density value. This means the number of points generated on each piece of geometry is directly linked to the geometry’s surface area.</p>
<p>For spheres, the method I use is super simple: get the surface area of the sphere, generate random UV coordinates, and map those coordinates back to the surface of the sphere. This method is directly pulled from <a href="http://mathworld.wolfram.com/SpherePointPicking.html">this Wolfram Mathworld page</a>, which also describes why the most naive approach to point picking on a sphere is actually wrong.</p>
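<p>For reference, here’s a minimal sketch of that sphere point picking method (hypothetical code, with the sphere given as a center and radius); the acos term is what keeps points from bunching up at the poles:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>
#include <random>

// Pick a uniformly distributed random point on the surface of a sphere.
void randomPointOnSphere(const float center[3], float radius, std::mt19937& rng,
                         float outPoint[3]) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    const float u = uniform(rng);
    const float v = uniform(rng);
    const float theta = 2.0f * 3.14159265f * u;     // longitude
    const float phi = std::acos(2.0f * v - 1.0f);   // latitude, area preserving
    outPoint[0] = center[0] + radius * std::sin(phi) * std::cos(theta);
    outPoint[1] = center[1] + radius * std::sin(phi) * std::sin(theta);
    outPoint[2] = center[2] + radius * std::cos(phi);
}
</code></pre></div></div>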
<p>My approach for ellipsoids unfortunately is a bit brute force. Since getting the exact surface area of an ellipsoid is fairly mathematically tricky, I just <a href="http://en.wikipedia.org/wiki/Ellipsoid#Approximate_formula">approximate it</a> and then use plain old rejection sampling to get a point.</p>
<p>Boxes are the easiest of the bunch; find the surface area of each face, randomly select a face weighted by the proportion of the total surface area that face comprises, and then pick a random x and y coordinate on that face. The method I use for meshes is similar, just on potentially a larger scale: find the surface area of all of the faces in the mesh and select a face randomly weighted by the face’s proportion of the total surface area. Then instead of generating random cartesian coordinates, I generate a random barycentric coordinate, and I’m done.</p>
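<p>A minimal sketch of that mesh case follows (hypothetical code; the sqrt term is one standard way to keep the barycentric sample uniformly distributed over the triangle’s area):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>
#include <random>
#include <vector>

struct BarycentricSample {
    int faceIndex;
    float b0, b1, b2;
};

// Pick a face with probability proportional to its area, then pick a uniformly
// distributed barycentric coordinate on that face.
BarycentricSample samplePointOnMesh(const std::vector<float>& faceAreas,
                                    std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    float totalArea = 0.0f;
    for (float area : faceAreas) totalArea += area;
    float pick = uniform(rng) * totalArea;
    int faceIndex = 0;
    for (; faceIndex < static_cast<int>(faceAreas.size()) - 1; ++faceIndex) {
        pick -= faceAreas[faceIndex];
        if (pick <= 0.0f) break;
    }
    const float r1 = std::sqrt(uniform(rng));
    const float r2 = uniform(rng);
    return {faceIndex, 1.0f - r1, r1 * (1.0f - r2), r1 * r2};
}
</code></pre></div></div>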
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint3.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint3.png" alt="" /></a></p>
<p>The method that I’m using right now is purely random, so there’s no guarantee of equal spacing between points initially. Of course, as one picks more and more points, the spacing between any given set of points will converge on something like equally spaced, but that would take a lot of random points. I’ve been looking at this <a href="http://peterwonka.net/Publications/2009.EGF.Cline.PoissonOnSurfaces.pdf">Dart Throwing On Surfaces Paper</a> for ideas, but at least for now, this solution should work well enough for what I want it for (direct lighting). But we shall see!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint4.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint4.png" alt="" /></a></p>
<p>Also, as I am sure you can guess from the window chrome on that last screenshot, I’ve successfully tested Takua Render on Linux! Specifically, on Fedora!</p>
https://blog.yiningkarlli.com/2012/07/thoughts-on-ray-bounce-depth.html
Thoughts on Ray Bounce Depth
2012-07-05T00:00:00+00:00
2012-07-05T00:00:00+00:00
Yining Karl Li
<p>I finally got around to doing a long overdue piece of analysis on Takua Render: looking at the impact of ray bounce depth on performance and on the final image.</p>
<p>Of course, in real life, light can bounce around (almost) indefinitely before it is either totally absorbed or enters our eyeballs. Unfortunately, simulating this behavior completely is extremely difficult in any type of raytracing solution, because letting a ray bounce around indefinitely until it does something interesting can lead to extremely long render times. Thus, one of the first shortcuts that most raytracing (and therefore pathtracing) systems take is cutting off rays after they bounce a certain number of times. This strategy should not have much of an impact on the final visual quality of a rendered image, since the more a light ray bounces around, the less each successive bounce contributes to the final image anyway.</p>
<p>With that in mind, I did some tests with Takua Render in hopes of finding a good balance between ray bounce depth and quality/speed. The following images of a glossy white sphere in a Cornell Box were rendered on a quad-core 2.5 GHz Core i5 machine.</p>
<p>For a reference, I started with a render with a maximum ray bounce depth of 50 and 200 samples per pixel:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_50_1325.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_50_1325.png" alt="Max Bounce Depth of 50, 200 iterations, took 1325 seconds to render." /></a></p>
<p>Then I ran a test render with a maximum of just 2 bounces; essentially, this represents the direct lighting part of the solution only, albeit generated in a Monte Carlo fashion. Since I made the entire global limit 2 bounces, no reflections of the walls show up on the sphere, just the light overhead. Note the total lack of color bleeding and the dark shadow under the ball.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_02_0480.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_02_0480.png" alt="Max Bounce Depth of 2, 200 iterations, took 480 seconds to render." /></a></p>
<p>The next test was with a maximum of 5 bounces. In this test, nice effects like color bleeding and indirect illumination are back! However, compared to the reference render, the area under the sphere still has a bit of dark shadowing, much like what one would expect if an ambient occlusion pass had been added to the image. While not totally accurate to the reference render, this image under certain artistic guidelines might actually be acceptable, and renders considerably faster.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_05_0811.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_05_0811.png" alt="Max Bounce Depth of 5, 200 iterations, took 811 seconds to render." /></a></p>
<p>Differencing the 5 bounce render from the reference 50 bounce render shows that the 5 bounce one is ever so slightly dimmer and that most of the difference between the two images is in the shadow area under the sphere. Ignore the random fireflying pixels, which are just a result of standard pathtracing variance in the renders:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/05-50_diff.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/05-50_diff.png" alt="5 bounce test differenced with the 50 bounce reference." /></a></p>
<p>The next test was 10 bounces. At 10 bounces, the resultant image is essentially visually indistinguishable from the 50 bounce reference, as shown by the differenced image included. This result implies that beyond 10 bounces, the contributions of successive bounces to the final image are more or less negligible.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_10_0995.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_10_0995.png" alt="Max Bounce Depth of 10, 200 iterations, took 995 seconds to render." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/10-50_diff.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/10-50_diff.png" alt="10 bounce test differenced with the 50 bounce reference. Note that there is essentially no difference." /></a></p>
<p>Finally, a test with a maximum of 20 bounces is still essentially indistinguishable from both the 10 bounce test and the 50 bounce reference:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_20_1277.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_20_1277.png" alt="Max Bounce Depth of 20, 200 iterations, took 1277 seconds to render." /></a></p>
<p>Interestingly, render times do not scale linearly with maximum bounce depth! The reason for this relationship (or lack thereof) can be found in the fact that the longer a ray bounces around, the more likely it is to find a light source and terminate. At 20 bounces, the odds of a ray finding a light source is very very close to the odds of a ray finding a light source at 50 bounces, explaining the smallness of the gap in render time between 20 and 50 bounces (especially compared to the difference in render time between, say, 2 and 5 bounces).</p>
https://blog.yiningkarlli.com/2012/06/more-kd-tree-fun.html
More KD-Tree Fun
2012-06-16T00:00:00+00:00
2012-06-16T00:00:00+00:00
Yining Karl Li
<p>Lately progress on my <a href="http://www.yiningkarlli.com/projects/takuarender">Takua Render</a> project has slowed down a bit, since over this summer I am interning at <a href="http://www.dreamworksanimation.com/">Dreamworks Animation</a> during weekdays. However, in the evenings and on weekends I am still working on stuff!</p>
<p>Something that I never got around to doing for no particularly good reason was visualizing my KD-tree implementation. As a result, while I’ve known for a long time that my KD-tree is suboptimal, I haven’t actually been able to quickly determine to what degree it is inefficient. However, since I now have a number of OpenGL based diagnostic views for Takua Render, I figured I no longer had a good excuse to not visualize my KD-tree. So last night I did just that! Here is what I got for the Stanford Dragon:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jun/kd1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jun/kd1.png" alt="" /></a></p>
<p>Just as I suspected, my KD-tree implementation was far from perfect. Some rough statistics I had my renderer output told me that even with the KD-tree, the renderer was still performing hundreds to even thousands of intersection tests against meshes. The above image explains why: each of those KD-tree leaf nodes is enormous, and therefore contains an enormous number of objects!</p>
<p>Fortunately, after a bit of tinkering, I discovered that there’s nothing actually wrong with the KD-tree implementation itself. Instead, the sparseness of the tree is coming from how I tuned the tree building operation. With a bit of tinkering, I managed to get a fairly improved tree:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jun/kd2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jun/kd2.png" alt="" /></a></p>
<p>…and with a bit more tuning and playing with maximum recursion depths:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jun/kd3.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jun/kd3.png" alt="" /></a></p>
<p>Previously, my KD-tree construction routine terminated based only on a maximum recursion depth; after the tree reached a certain height, the construction would stop. I’ve now modified the construction routine to use three separate criteria: a maximum recursion depth, minimum node bounding box volume, and a minimum number of objects per node. If any node meets any of the above three conditions, it is turned into a leaf node. As a result, I can now get extremely dense KD-trees that only have on average a low-single-digit number of objects per leaf node, as opposed to the average hundreds of objects per leaf node before:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jun/kd4.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jun/kd4.png" alt="" /></a></p>
<p>In theory, this improvement should allow for a fairly significant speedup, since the number of intersections per mesh should now be dramatically lower, leading to much higher ray throughput! I’m currently running some benchmarks to determine just how much of a performance boost better KD-trees will give me, and I’ll post about those results soon!</p>
https://blog.yiningkarlli.com/2012/05/subsurface-scattering-and-new-name.html
Subsurface Scattering and New Name
2012-05-20T00:00:00+00:00
2012-05-20T00:00:00+00:00
Yining Karl Li
<p>I implemented subsurface scattering in my renderer!</p>
<p>Here’s a Stanford Dragon in a totally empty environment with just one light source providing illumination. The dragon is made up of a translucent purple jelly-like material, showing off the subsurface scattering effect:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/dragonsss_bright.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/dragonsss_bright.png" alt="" /></a></p>
<p><a href="http://en.wikipedia.org/wiki/Subsurface_scattering">Subsurface scattering</a> is an important behavior that light exhibits upon hitting some translucent materials; normal transmissive materials will simply transport light through the material and out the other side, but subsurface scattering materials will attenuate and scatter light before releasing the light somewhere not necessarily along a line from the entry point. This is what gives skin and translucent fruit and marble and a whole host of other materials their distinctive look.</p>
<p>There are currently a whole host of methods to rapidly approximate subsurface scattering, including some screen-space techniques that are actually fast enough for use in realtime renderers. However, my implementation at the moment is purely brute-force Monte Carlo; while extremely physically accurate, it is also very very slow. In my implementation, when a ray enters a subsurface scattering material, I generate a random scatter direction via isotropic scattering, and then attenuate the accumulated light based on an absorption coefficient defined for the material. This approach is very similar to the one taken by <a href="http://peterkutz.com/">Peter</a> and me in our <a href="http://www.blogger.com/gpupathtracer.blogspot.com">GPU pathtracer</a>.</p>
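<p>For concreteness, here’s a minimal sketch of that brute-force step (hypothetical code; the exponential attenuation model and the names are illustrative): once a path is inside the material, each scattering event picks a uniformly random new direction and attenuates the path throughput based on the distance travelled and the absorption coefficient.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>
#include <random>

// One isotropic scattering event inside a subsurface scattering material.
void scatterInsideMedium(float direction[3], float throughput[3],
                         const float absorptionCoefficient[3],
                         float distanceTravelled, std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    // Isotropic scattering: pick a uniformly random direction on the unit sphere.
    const float z = 2.0f * uniform(rng) - 1.0f;
    const float phi = 2.0f * 3.14159265f * uniform(rng);
    const float r = std::sqrt(1.0f - z * z);
    direction[0] = r * std::cos(phi);
    direction[1] = r * std::sin(phi);
    direction[2] = z;
    // Attenuate each color channel based on how far the ray travelled inside.
    for (int i = 0; i < 3; ++i) {
        throughput[i] *= std::exp(-absorptionCoefficient[i] * distanceTravelled);
    }
}
</code></pre></div></div>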
<p>At some point in the future I might try out a faster approximation method, but for the time being, I’m pretty happy with the visual result that brute-force Monte Carlo scattering produces.</p>
<p>Here’s the same subsurface scattering dragon from above, but now in the Cornell Box. Note the cool colored soft shadows beneath the dragon:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/subsurfacetest.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/subsurfacetest.png" alt="" /></a></p>
<p>Also, I’ve finally settled on a name for my renderer project: Takua Render! So, that is what I shall be calling my renderer from now on!</p>
https://blog.yiningkarlli.com/2012/05/more-fun-with-jello.html
More Fun with Jello
2012-05-05T02:00:00+00:00
2012-05-05T02:00:00+00:00
Yining Karl Li
<p>At <a href="http://www.graphics.cornell.edu/~kiderj/">Joe</a>’s request, I made another jello video! Joe suggested I make a video that shows the simulation both in the actual simulator’s GL view, and rendered out from Maya, so this video does just that. The starting portion of the video shows what the simulation looks like in the simulator GL view, and then shifts to the final render (done with Vray, my pathtracer still is not ready yet!). The GL and final render views don’t quite line up with each other perfectly, but its close enough that you get the idea.</p>
<p>There is a slight change in the tech involved too: I’ve upgraded my jello simulator’s spring array so that simulations should be more stable now. The change isn’t terribly dramatic; all I did was add in more bend and shear springs, so jello cubes now “try” harder to return to a perfect cube shape.</p>
<p>This video is making use of my Vray white-backdrop studio setup! The pitcher was just a quick 5 minute model, nothing terribly interesting there.</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/41545296" frameborder="0">Fun with Jello</iframe></div>
<p>…and of course, some stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_01.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_01.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_02.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_02.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_03.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_03.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_04.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_04.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_05.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_05.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_06.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_06.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/05/smoke-sim-volumetric-renderer.html
Smoke Sim + Volumetric Renderer
2012-05-05T01:00:00+00:00
2012-05-05T01:00:00+00:00
Yining Karl Li
<p>Something I’ve had on my list of things to do for a few weeks now is mashing up my <a href="http://blog.yiningkarlli.com/2011/10/a-volumetric-renderer-for-rendering-volumes.html">volumetric renderer</a> from CIS460 with my <a href="http://blog.yiningkarlli.com/2012/03/smoke-sim-preconditioning-and-huge.html">smoke simulator</a> from CIS563.</p>
<p>Now I can cross that off of my list! Here is a 100x100x100 grid smoke simulation rendered out with pseudo Monte-Carlo black body lighting (described in my volumetric renderer post):</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/41543438" frameborder="0">Smoke Simulator Pseudo-Blackbody Test</iframe></div>
<p>The actual approach I took to integrating the two was to simply pipeline them instead of actually merging the codebases. I added a small extension to the smoke simulator that lets it output the smoke grid to the same voxel file format that the volumetric renderer reads in, and then wrote a small Python script that just iterates over all voxel files in a folder and calls the volumetric renderer over and over again.</p>
<p>I’m actually not entirely happy with the render… I don’t think I picked very good settings for the pseudo-black body, so a lot of the render is overexposed and too bright. I’ll probably tinker with that some later and re-render the whole thing, but before I do that I want to move the volumetric renderer onto the GPU with CUDA. Even with multithreading via OpenMP, the rendertimes per frame are still too high for my liking… Anyway, here are some stills!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/smoke_vr1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/smoke_vr1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/smoke_vr2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/smoke_vr2.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/04/april-23rd-cis565-progress-summary.html
April 23rd CIS565 Progress Summary- Speed and Refraction
2012-04-23T00:00:00+00:00
2012-04-23T00:00:00+00:00
Yining Karl Li
<p>This post is the third update for the GPU Pathtracer project Peter and I are working on!</p>
<p>Over the past few weeks, the GPU Pathtracer has gained two huge improvements: refraction and major speed gains! In just 15 seconds on Peter’s NVIDIA GTX530 (on a more powerful card in the lab, we get even better speeds), we can now get something like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Apr/JustFifteenSeconds.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Apr/JustFifteenSeconds.png" alt="" /></a></p>
<p>Admittedly Peter has been contributing more interesting code than I have, which makes sense since in this project Peter is clearly the veteran rendering expert and I am the newcomer. But, I am learning a lot, and Peter is getting more cool stuff done since I can get other stuff done and out of the way!</p>
<p>The posts for this update are:</p>
<ol>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/thirty-seconds.html">Performance Optimization</a>: Speed boosts through zero-weight ray elimination</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/cool-error-render.html">Cool Error Render</a>: Fun debug images from getting refraction to work</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/transmission.html">Transmission</a>: Glass spheres!</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/convergence.html">Fast Convergence</a>: Tricks for getting more raw speed</li>
</ol>
<p>As always, check the posts for details and images!</p>
https://blog.yiningkarlli.com/2012/04/april-14th-cis563-progress-summary.html
April 14th CIS563 Progress Summary- Meshes and Meshes and Meshes
2012-04-14T00:00:00+00:00
2012-04-14T00:00:00+00:00
Yining Karl Li
<p>This post is the second update for the <a href="http://chocolatefudgesyrup.blogspot.com/">MultiFluids project</a>!</p>
<p>The past week for Dan and me has been all about meshes: mesh loading, mesh interactions, and mesh reconstruction! We integrated an OBJ to Signed Distance Field converter, which allowed us to then implement liquid-against-mesh interactions and use meshes to define starting liquid positions. We also figured out how to run marching cubes on signed distance fields, allowing us to export OBJ mesh sequences of our fluid simulation and bring our sims into Maya for rendering!</p>
<p>Here is a really cool render from this week:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Apr/reddragon.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Apr/reddragon.png" alt="" /></a></p>
<p>The posts for this week are:</p>
<ol>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/04/surface-reconstruction-via-marching.html">Surface Reconstruction via Marching Cubes</a>: Level set goes in, OBJ comes out</li>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/04/mesh-interactions.html">Mesh Interactions</a>: Using meshes as interactable objects</li>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/04/meshes-as-starting-liquid-volumes-and.html">Meshes as Starting Liquid Volumes and Maya Integration</a>: Cool tests with a liquid Stanford Dragon</li>
</ol>
<p>Check out the posts for details, images, and videos!</p>
https://blog.yiningkarlli.com/2012/04/april-5th-cis565-progress-summary.html
April 5th CIS565 Progress Summary- Interactivity, Alpha Review, Fresnel Reflections, Antialiasing
2012-04-05T00:00:00+00:00
2012-04-05T00:00:00+00:00
Yining Karl Li
<p>This post is the second update for the <a href="http://gpupathtracer.blogspot.com/">GPU Pathtracer project</a>!</p>
<p>Since the last update, Peter and I added an interactive camera to the renderer to allow realtime movement around the scene! We also had our Alpha Review, which went quite well, and Peter implemented a reflection model. Initially the reflection model used was <a href="http://en.wikipedia.org/wiki/Schlick's_approximation">Schlick’s Approximation</a>, but later Peter replaced that with the full <a href="http://en.wikipedia.org/wiki/Fresnel_equations">Fresnel equations</a>. I also added super-sampled anti-aliasing for a smoother image.</p>
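<p>For reference, Schlick’s Approximation itself is tiny; it estimates the Fresnel reflectance from the two indices of refraction and the cosine of the incident angle. A generic sketch (not our project’s actual code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Schlick's approximation of the Fresnel reflectance.
// n1 and n2 are the indices of refraction on either side of the interface;
// cosTheta is the cosine of the angle between the incident ray and the normal.
float schlickReflectance(float n1, float n2, float cosTheta) {
    float r0 = (n1 - n2) / (n1 + n2);
    r0 = r0 * r0;                          // reflectance at normal incidence
    float m = 1.0f - cosTheta;
    return r0 + (1.0f - r0) * m * m * m * m * m;   // r0 + (1 - r0) * (1 - cosTheta)^5
}
</code></pre></div></div>
<p>The full Fresnel equations compute the exact reflectance for each polarization instead of this single approximate curve.</p>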
<p>The posts for this update:</p>
<ol>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/interactivity-and-moveable-camera.html">Interactivity and Moveable Camera</a>: We can move around the scene!</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/alpha-review-presentation.html">Alpha Review Presentation</a>: Slides and other stuff from our Alpha Review</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/specular-reflection-test.html">Specular Reflection Test</a>: The first test with Shlick’s Approximation</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/fresnel-reflections.html">Fresnel Reflections</a>: Some details on our reflection model</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/abstract-art.html">Abstract Art</a>: Some fun buggy renders Peter produced while debugging</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/anti-aliasing.html">Anti-Aliasing</a>: Super-sampled anti-aliasing!</li>
</ol>
<p>A nice image from the last post:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Apr/GreenBallAntiAliasing.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Apr/GreenBallAntiAliasing.png" alt="" /></a></p>
<p>Check the posts for tons of details, images, and even some video!</p>
https://blog.yiningkarlli.com/2012/04/april-1st-cis563-progress-summary.html
April 1st CIS563 Progress Summary- Framework Improvements and Bounding Volumes
2012-04-01T02:00:00+00:00
2012-04-01T02:00:00+00:00
Yining Karl Li
<p>Here’s the first progress update/blog digest for the <a href="http://chocolatefudgesyrup.blogspot.com/">MultiFluids project</a>!</p>
<p>Dan and I started by taking our starting framework and tearing it down to its core. We then rebuilt the base code up with our own custom additions, leaving just the core solver intact. From there, we started building some of the basic features our project will require!</p>
<p>Here are the posts for this update:</p>
<ol>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/03/framework-improvements-and-particles.html">Framework Improvements and Particles with Properties</a>: Tearing the base code down to the ground and rebuilding it better, faster, and with more features</li>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/03/bounding-volumes-lesson-1-dont-just.html">Bounding Volumes & Lesson 1: Don’t just assume base code is perfect</a>: Dan discovers some flaws in the base code!</li>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/04/multiple-arbitrary-bounding-volumes.html">Multiple Arbitrary Bounding Volumes</a>: All-important object interaction</li>
</ol>
<p>A frame from one of our test videos:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Apr/sphereinbox.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Apr/sphereinbox.png" alt="" /></a></p>
<p>Check the posts for details and videos!</p>
https://blog.yiningkarlli.com/2012/04/april-1st-cis565-progress-summary.html
April 1st CIS565 Progress Summary- Camera and Pathtracing
2012-04-01T01:00:00+00:00
2012-04-01T01:00:00+00:00
Yining Karl Li
<p>Here’s the first progress summary/blog digest for the <a href="http://gpupathtracer.blogspot.com/">GPU Pathtracer project</a>!</p>
<p>Over the past few days, Peter and I established our framework, got random number generation working on the GPU, built an accumulator, figured out parallelized camera ray projection, got spherical intersection tests working, and got a basic path-traced image!</p>
<p>Here are the posts for this update:</p>
<ol>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/random-number-generator.html">Random Number Generation</a>: Fun with parallelized random number generators and seeding</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/03/first-rays-on-gpu.html">First Rays on the GPU</a>: Parallel raycasting!</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/03/accumulating-iterations.html">Accumulating Iterations</a>: The heart of any monte-carlo based renderer</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/03/we-have-path-tracing.html">We Have Path Tracing</a>: First working renders!</li>
</ol>
<p>Here’s an image from our very first working render! More soon!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Apr/First.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Apr/First.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/03/cis563cis565-final-project-github-repos.html
CIS563/CIS565 Final Project Github Repos!
2012-03-27T00:00:00+00:00
2012-03-27T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/octocat.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/octocat.png" alt="" /></a></p>
<p>For both <a href="http://chocolatefudgesyrup.blogspot.com/">MultiFluids</a> and the <a href="http://gpupathtracer.blogspot.com/">GPU Pathtracer</a>, we will be making our source code accessible to all on Github!</p>
<p>Of course, commercial coding projects and whatnot have very good reasons for keeping their source code locked down and proprietary, but open source is something I very strongly believe in. Open code allows other people to see what one does and give feedback and suggestions for improvement, and also allows other people interested in similar projects to potentially learn from and build off of it. Everybody wins!</p>
<p>The MultiFluids repository can be found here: <a href="https://github.com/betajippity/MultiFluids">https://github.com/betajippity/MultiFluids</a></p>
<p>The GPU Pathtracer repository can be found here: <a href="https://github.com/peterkutz/GPUPathTracer/">https://github.com/peterkutz/GPUPathTracer/</a></p>
<p>…and of course, the relevant blog posts:</p>
<p>GPU Pathtracer: <a href="http://gpupathtracer.blogspot.com/2012/03/github-repository.html">http://gpupathtracer.blogspot.com/2012/03/github-repository.html</a></p>
<p>MultiFluids: <a href="http://chocolatefudgesyrup.blogspot.com/2012/03/github-and-windowsosx.html">http://chocolatefudgesyrup.blogspot.com/2012/03/github-and-windowsosx.html</a></p>
https://blog.yiningkarlli.com/2012/03/cis563cis565-final-projects-multiple.html
CIS563/CIS565 Final Projects- Multiple Interacting Fluids and GPU Pathtracing
2012-03-19T00:00:00+00:00
2012-03-19T00:00:00+00:00
Yining Karl Li
<p>Over the next month and a half, I will be working on a pair of final projects for two of my classes, CIS565 (<a href="http://cis565-spring-2012.github.com/">GPU Programming</a>, taught by <a href="http://www.seas.upenn.edu/~pcozzi/">Patrick Cozzi</a>), and CIS563 (<a href="http://www.seas.upenn.edu/~cis563/">Physically Based Animation</a>, taught by <a href="http://www.graphics.cornell.edu/~kiderj/">Joe Kider</a>).</p>
<p>For CIS563, I will be teaming up with my fellow classmate and good friend <a href="http://www.danknowlton.com/">Dan Knowlton</a> to develop a liquid fluid simulator capable of simulating multiple fluids interacting against each other. Dan is without a doubt one of the best in our class and easily my equal or superior in all things graphics, so working with him should be a lot of fun. Our project is going to be based primarily on the paper <a href="http://dl.acm.org/citation.cfm?id=1141960">Multiple Interacting Fluids</a> by Losasso et al., and as a starting point we will be using <a href="http://www.cs.columbia.edu/~batty/">Chris Batty</a>’s <a href="https://github.com/christopherbatty/Fluid3D">Fluid 3D</a> framework.</p>
<p>For CIS565, I will be working with my fellow Pixarian and friend <a href="http://peterkutz.com/">Peter Kutz</a>, who is somewhat of a physically based rendering titan at Penn. Working with Peter should be a very interesting and exciting learning experience. Peter and I will be developing a CUDA based GPU Pathtracer with the goal of generating convincing photorealistic images extremely rapidly. We will be developing our GPU pathtracer from scratch, although we will obviously draw inspiration from both Peter’s <a href="http://photorealizer.blogspot.com/">Photorealizer</a> project and my own CPU pathtracer project.</p>
<p>For both projects, we will be keeping blogs where we will post development updates, so I won’t post too much about development details to this here personal blog. Instead, I’m thinking about posting a weekly digest of progress on both projects with links to interesting highlights on the project blogs.</p>
<p>Dan and I will be blogging at <a href="http://chocolatefudgesyrup.blogspot.com/">http://chocolatefudgesyrup.blogspot.com/</a>. We’ve titled our project “Chocolate Syrup” for two reasons: firstly, Dan likes to codename his project with types of confectionaries, and secondly, chocolate syrup is one type of highly viscous fluid we aim for our simulator to be able to handle!</p>
<p>Peter and I will be blogging at <a href="http://gpupathtracer.blogspot.com/">http://gpupathtracer.blogspot.com/</a>. For now we have decided to call our project “Peter and Karl’s GPU Pathtracer”, for obvious reasons.</p>
<p>Details for each project can be found in the first post of each blog, which are the project proposals.</p>
<p>Multiple Interacting Fluids Proposal: <a href="http://chocolatefudgesyrup.blogspot.com/2012/03/project-proposal.html">http://chocolatefudgesyrup.blogspot.com/2012/03/project-proposal.html</a></p>
<p>GPU Pathtracer Proposal: <a href="http://gpupathtracer.blogspot.com/2012/03/project-proposal.html">http://gpupathtracer.blogspot.com/2012/03/project-proposal.html</a></p>
<p>Both of these projects should be very very cool, and I’ll be posting often to both development blogs!</p>
https://blog.yiningkarlli.com/2012/03/pathtracer-with-kd-tree.html
Pathtracer with KD-Tree
2012-03-12T00:00:00+00:00
2012-03-12T00:00:00+00:00
Yining Karl Li
<p>I have finished my KD-Tree rewrite! My new KD-Tree implements the Surface-Area Heuristic for finding optimal splitting planes, and stops splitting once a node has either reached a certain sufficiently small surface area, or has a sufficiently small number of elements contained within itself. Basically, very standard KD-Tree stuff, but this time, properly implemented. As a result, I can now render meshes much quicker than before.</p>
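<p>The Surface-Area Heuristic itself reduces to a simple expected-cost estimate for each candidate split: the cost of traversing the node, plus the cost of intersecting the primitives in each child, weighted by the probability that a ray passing through the parent also passes through that child (which is proportional to the child’s surface area). Here is a generic sketch of that cost function, with made-up constants rather than my renderer’s actual values:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Expected cost of a candidate split under the Surface-Area Heuristic.
// The cost constants and function name are illustrative placeholders.
const float TRAVERSAL_COST    = 1.0f;
const float INTERSECTION_COST = 4.0f;

float sahSplitCost(float parentSurfaceArea,
                   float leftSurfaceArea,  int leftObjectCount,
                   float rightSurfaceArea, int rightObjectCount) {
    // Probability of a ray hitting each child, given that it already hits the
    // parent, is proportional to the ratio of the surface areas.
    float pLeft  = leftSurfaceArea  / parentSurfaceArea;
    float pRight = rightSurfaceArea / parentSurfaceArea;
    return TRAVERSAL_COST +
           INTERSECTION_COST * (pLeft * leftObjectCount + pRight * rightObjectCount);
}

// The split with the lowest cost wins; if no split beats the cost of simply
// intersecting everything in the node (INTERSECTION_COST * totalObjectCount),
// the node is made a leaf.
</code></pre></div></div>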
<p>Here’s a cow in a Cornell Box. Each iteration of the cow render took about 3 minutes, which is a huge improvement over my old raytracer, but still leaves a lot of room for improvement:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/bovinetest.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/bovinetest.png" alt="" /></a></p>
<p>…and of course, the obligatory Stanford Dragon test. Each iteration took about 4 minutes for both of these images (the second one I let converge for a bit longer than the first one), and I made these renders a bit larger than the cow one:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/dragon2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/dragon2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/dragon1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/dragon1.png" alt="" /></a></p>
<p>So! Of course the KD-Tree could still use even more work, but for now it works well enough that I think I’m going to start focusing on other things, such as more interesting BSDFs and other performance enhancements.</p>
https://blog.yiningkarlli.com/2012/03/first-pathtraced-image.html
First Pathtraced Image!
2012-03-11T00:00:00+00:00
2012-03-11T00:00:00+00:00
Yining Karl Li
<p>Behold, the very first image produced using my pathtracer!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/frame_3.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/frame_3.png" alt="" /></a></p>
<p>Granted, the actual image is not terribly interesting- just a cube inside of a standard <a href="http://en.wikipedia.org/wiki/Cornell_Box">Cornell box</a> type setup, but it was rendered entirely using my own pathtracer! Aside from being converted from a BMP file to a PNG, this render has not been modified in any way whatsoever outside of my renderer (I have yet to name it). This render is the result of a thousand iterations. Here are some comparisons of the variance in the render at various iteration levels (click through to the full size versions to get an actual sense of the variance levels):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/pass0-15.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/pass0-15.png" alt="Upper Left: 1 iteration. Upper Right: 5 iterations. Lower Left: 10 iterations. Lower Right: 15 iterations." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/pass0-750.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/pass0-750.png" alt="Upper Left: 1 iteration. Upper Right: 250 iterations. Lower Left: 500 iterations. Lower Right: 750 iterations." /></a></p>
<p>Each iteration took about 15 seconds to finish.</p>
<p>Unfortunately, I have not been able to move as quickly with this project as I would like, due to other schoolwork and TAing for CIS277. Nonetheless, here’s where I am right now:</p>
<p>Currently the renderer is in a very very basic primitive state. Instead of extending my raytracer, I’ve opted for a completely from scratch start. The only piece of code brought over from the raytracer was the OBJ mesh system I wrote, since that was written to be fairly modular anyway. Right now my pathtracer works entirely through indirect lighting and only supports diffuse surfaces… like I said, very basic! Adding direct lighting should speed up render convergence, especially for scenes with small light sources. Also, right now the pathtracer only uses single direction pathtracing from the camera into the scene… adding bidirectional pathtracing should lead to another performance boost.</p>
<p>I’m still working on rewriting my KD-tree system, that should be finished within the next few days.</p>
<p>Something that is fairly high on my list of things to do right now is redesigning the architecture of my renderer. Right now, for each iteration, the renderer traces a path through a pixel all the way to its recursion depth before moving on to the next pixel. As soon as possible I want to move the renderer to use an iterative (as opposed to recursive) accumulated approach for each iteration (slightly confusing terminology: here I mean iteration as in each render pass), which, oddly enough, is something that my old raytracer already does. I’ve already started moving towards the accumulated approach; right now, I store the first set of raycasts from the camera and reuse those rays in each iteration.</p>
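<p>The accumulation part of that plan is just a running average over render passes; something along these lines (a sketch, not my renderer’s actual code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Progressive accumulation: after each render pass, fold the pass's samples
// into a running per-pixel average, so the image refines as passes pile up.
// Both buffers are flat arrays of pixelCount * 3 floats (RGB).
void accumulatePass(float* accumulated, const float* passSamples,
                    int pixelCount, int passNumber) {
    // passNumber is 1 for the first pass, 2 for the second, and so on.
    float blend = 1.0f / (float)passNumber;
    for (int i = 0; i &lt; pixelCount * 3; ++i) {
        accumulated[i] = accumulated[i] * (1.0f - blend) + passSamples[i] * blend;
    }
}
</code></pre></div></div>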
<p>One cool thing that storing the initial ray cast allows me to do is to generate a z-depth version of the render for “free”:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/frame_3z.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/frame_3z.png" alt="" /></a></p>
<p>Okay, hopefully by my next post I’ll have the KD-tree rewrite done!</p>
https://blog.yiningkarlli.com/2012/03/smoke-sim-preconditioning-and-huge.html
Smoke Sim- Preconditioning and Huge Grids
2012-03-07T00:00:00+00:00
2012-03-07T00:00:00+00:00
Yining Karl Li
<p>I have added preconditioning to my <a href="http://blog.yiningkarlli.com/2012/03/smoke-simulation-basics.html">smoke simulator</a>! For the preconditioner, I am using <a href="http://en.wikipedia.org/wiki/Incomplete_Cholesky_factorization">Incomplete Cholesky</a>, which is the preconditioner recommended in chapter 4 of the <a href="http://www.cs.ubc.ca/~rbridson/fluidsimulation/fluids_notes.pdf">Bridson Fluid Course Notes</a>. I’ve also debugged my vorticity implementation, so the simulation should produce more interesting/stable vortices now.</p>
<p>The key reason for implementing the preconditioner is simple: speed. Faster convergence brings an added bonus too: since each solve takes less time, larger grids become practical. Because of that speed increase, I can now run my simulations on 3D grids.</p>
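<p>For context, the preconditioned conjugate gradient loop itself is quite compact; the preconditioner only changes the step that approximately solves Mz = r, which is where Incomplete Cholesky comes in. Here is a generic sketch of the solver loop, with the matrix apply and preconditioner apply passed in as placeholders rather than my simulator’s actual code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;
#include &lt;vector&gt;

typedef std::vector&lt;double&gt; Vec;

double dot(const Vec&amp; a, const Vec&amp; b) {
    double sum = 0.0;
    for (size_t i = 0; i &lt; a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Solve A x = b with preconditioned conjugate gradient. The caller supplies
// applyA (multiply the pressure matrix by a vector) and applyPreconditioner
// (approximately solve M z = r, e.g. with an Incomplete Cholesky factor).
Vec pcgSolve(const Vec&amp; b,
             Vec (*applyA)(const Vec&amp;),
             Vec (*applyPreconditioner)(const Vec&amp;),
             int maxIterations, double tolerance) {
    Vec x(b.size(), 0.0);                 // initial guess: all zeros
    Vec r = b;                            // residual b - A x
    Vec z = applyPreconditioner(r);
    Vec p = z;                            // search direction
    double rz = dot(r, z);
    for (int iter = 0; iter &lt; maxIterations; ++iter) {
        Vec Ap = applyA(p);
        double alpha = rz / dot(p, Ap);
        for (size_t i = 0; i &lt; x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        if (std::sqrt(dot(r, r)) &lt; tolerance) break;   // converged
        z = applyPreconditioner(r);
        double rzNew = dot(r, z);
        double beta = rzNew / rz;
        for (size_t i = 0; i &lt; x.size(); ++i) p[i] = z[i] + beta * p[i];
        rz = rzNew;
    }
    return x;
}
</code></pre></div></div>
<p>A better preconditioner means fewer of those iterations per pressure solve, which is exactly where the speedup comes from.</p>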
<p>In previous years, the CIS563 smoke simulator framework usually hit a performance cliff at grids beyond around 50x50x50, but last year <a href="http://peterkutz.com/">Peter Kutz</a> managed to push his smoke simulator to 90x90x36 by implementing a sparse A-Matrix structure, as opposed to storing every single data point, including empty ones, for the grid. This year’s smoke simulation framework was updated to include some of Peter’s improvements, and so Joe reckons that we should be able to push our smoke simulation grids pretty far. I’ve been scaling up starting from 10x10x10, and now I’m at 100x100x50:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/38057955" frameborder="0">Smoke Simulator 100x100x50 Test</iframe></div>
<p>This simulation took about 24 hours to run on a 2008 MacBook Pro with a 2.8 GHz Core 2 Duo, but that is actually pretty good for fluid simulation! According to my rather unscientific estimates, the simulation would take about 4 or 5 days without the preconditioner, and even longer without the sparse A-Matrix. I bet I can still push this further, and I’m starting to think about multithreading the simulation with <a href="http://openmp.org/wp/">OpenMP</a> to get even more performance and even larger grids. We shall see.</p>
<p>One more thing: rendering this thing. So far I have not been doing any fancy rendering, just using the default OpenGL render that our framework came with. However, I want to get this into my <a href="http://blog.yiningkarlli.com/2011/10/a-volumetric-renderer-for-rendering-volumes.html">volumetric renderer</a> at some point and maybe even try out the pseudo-black body stuff with it. Eventually I want to try rendering this out with my pathtracer too!</p>
https://blog.yiningkarlli.com/2012/03/smoke-simulation-basics.html
Smoke Simulation Basics!
2012-03-03T00:00:00+00:00
2012-03-03T00:00:00+00:00
Yining Karl Li
<p>For <a href="http://www.seas.upenn.edu/~cis563/">CIS563</a> (Physically Based Animation), our current assignment is to write a fluid simulator capable of simulating smoke inside of a box. For this assignment, we’re using a semi-lagrangian approach based on <a href="http://www.cs.ubc.ca/~rbridson/">Robert Bridson</a>’s 2007 SIGGRAPH <a href="http://www.cs.ubc.ca/~rbridson/fluidsimulation/fluids_notes.pdf">Course Notes on Fluid Simulation</a>.</p>
<p>I won’t go into the nitty-gritty details of the math behind the simulation (for that, consult the Bridson notes), but I’ll give a quick summary. Basically, we start with a specialized grid structure called the MAC (marker and cell) grid, where each grid cell stores information relevant to the point in space the cell represents, such as density, velocity, temperature, etc. We update values across the grid by pretending that a particle carried the cell’s values into the cell: we use the velocity field to trace backwards in time to the particle’s previous position, and look up the values from the grid cell the particle was previously in. We then use that information to perform advection and projection, and solve the resulting system with a <a href="http://en.wikipedia.org/wiki/Preconditioned_conjugate_gradient_method#The_preconditioned_conjugate_gradient_method">preconditioned conjugate gradient solver</a>.</p>
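<p>For a single cell, the semi-Lagrangian lookup described above is only a few lines. Here is a rough 2D sketch of the idea; the grid class and its interpolation method are assumed placeholders, not the actual assignment framework:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Semi-Lagrangian advection of one quantity (e.g. density) for a single cell:
// trace an imaginary particle backwards through the velocity field, then look
// up the old value wherever that particle came from.
struct Grid2D {
    int width, height;
    // Bilinearly interpolates the stored values at a fractional grid position;
    // assumed to exist on the grid class for this sketch.
    float sampleInterpolated(float x, float y) const;
};

float advectCell(const Grid2D&amp; oldField,
                 const Grid2D&amp; velocityU, const Grid2D&amp; velocityV,
                 int i, int j, float dt, float cellSize) {
    // Velocity at this cell, converted to grid cells per unit time.
    float u = velocityU.sampleInterpolated((float)i, (float)j) / cellSize;
    float v = velocityV.sampleInterpolated((float)i, (float)j) / cellSize;
    // Backtrace: where was the particle that ends up in this cell one dt ago?
    float prevX = (float)i - dt * u;
    float prevY = (float)j - dt * v;
    // The advected value is the old field sampled at that previous position.
    return oldField.sampleInterpolated(prevX, prevY);
}
</code></pre></div></div>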
<p>So far I have implemented density advection, projection, buoyancy (via temperature advection), and vorticity. For the integration scheme I’m just using basic forward Euler, which was the default for the framework we started with. Forward Euler seems stable enough for the smoke sim, but I might try to go ahead and implement RK4 later anyway, since I suspect RK4 won’t smooth out details as much as basic forward Euler does.</p>
<p>I’m still missing the actual preconditioner, so for now I’m only testing the simulation on a 2D grid, since otherwise the simulation times will be really really long.</p>
<p>Here is a test on a 100x100 2D grid!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/37842004" frameborder="0">Smoke Simulator 100x100x1 Test</iframe></div>
https://blog.yiningkarlli.com/2012/02/jello-sim-maya-integration.html
Jello Sim Maya Integration
2012-02-25T00:00:00+00:00
2012-02-25T00:00:00+00:00
Yining Karl Li
<p>I ported my <a href="http://blog.yiningkarlli.com/2012/02/multijello-simulation.html">jello simulation</a> to Maya!</p>
<p>Well, sort of.</p>
<p>Instead of building a full Maya plugin like my good friend <a href="http://www.danknowlton.com/blog.php?id=295">Dan Knowlton did</a>, I opted for a simpler approach: I write out the vertex positions for each jello cube for each time step to a giant text file, and then use a custom Python script in Maya to read the vertex positions from the text file and animate a cube inside of Maya. It is a bit hacky and not nearly as elegant as the full-Maya-plugin approach, but it works in a pinch.</p>
<p>I think being able to integrate my coding projects into artistic projects is very important, since at the end of the day, the main point of computer graphics is to be able to produce a good looking image. As such, I thought putting some jello into my kitchen scene would be fun, so here is the result, rendered out with Vray (some day I want to replace Vray with my own renderer though!):</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/37534077" frameborder="0">Jello Test</iframe></div>
<p>The rendering process I’m using isn’t perfect yet… the fact that the jello cubes are being simulated with relatively few vertices is extremely apparent in the above video, as can be seen in how angular the edges of the jello become when it wiggles. At the moment, I can think of two possible fixes: one, simply run the simulation with a higher vertex count, or two, render the jello as a subdivision surface with creased edges. Since the second option should in theory allow for better looking renders without impacting simulation time, I think I will try the subdivision method first.</p>
<p>But for now, here are some pretty still frames:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_01.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_01.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_021.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_021.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_03.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_03.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_04.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_04.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/02/multijello-simulation.html
Multijello Simulation
2012-02-18T00:00:00+00:00
2012-02-18T00:00:00+00:00
Yining Karl Li
<p>The first assignment of the semester for <a href="http://www.seas.upenn.edu/~cis563/">CIS563</a> is to write a jello simulator using a particle-mass-spring system. The basic jello system involves building a particle grid where all of the particles are connected using a variety of springs, such as bend and shear springs, and then applying forces across the spring grid. In order to step the entire simulation forward in time, we also have to implement a stable integration scheme, such as RK4. For each step forward in time, we have to do intersection tests for each particle against solid objects in the simulation, such as the ground plane or boxes or spheres.</p>
<p>The particle-mass-spring system we used is based directly on the <a href="http://www.pixar.com/companyinfo/research/pbm2001/">Baraff/Witkin 2001 SIGGRAPH Physically Based Animation Course Notes</a>.</p>
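<p>Structural, bend, and shear springs all share the same damped Hooke’s-law force from those course notes; they differ only in which pair of particles they connect, and in their rest lengths and constants. A self-contained sketch of that force (the vector helpers and names here are just for illustration, not my simulator’s actual code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;

struct Vec3 { float x, y, z; };

Vec3  operator-(const Vec3&amp; a, const Vec3&amp; b) { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3  operator*(const Vec3&amp; a, float s)       { return Vec3{a.x * s, a.y * s, a.z * s}; }
float dot(const Vec3&amp; a, const Vec3&amp; b)       { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Damped spring force acting on particle A from the spring connecting A and B.
// The spring pulls A back toward its rest length, while the damping term
// resists the particles' relative motion along the spring direction.
Vec3 springForceOnA(const Vec3&amp; posA, const Vec3&amp; posB,
                    const Vec3&amp; velA, const Vec3&amp; velB,
                    float restLength, float stiffness, float damping) {
    Vec3 delta   = posA - posB;
    float length = std::sqrt(dot(delta, delta));
    Vec3 dir     = delta * (1.0f / length);              // unit vector from B to A
    float stretch     = length - restLength;             // displacement from rest
    float relVelocity = dot(velA - velB, dir);            // relative speed along the spring
    return dir * (-stiffness * stretch - damping * relVelocity);
}
</code></pre></div></div>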
<p>For the actual assignment, we were only required to support a single jello interacting against boxes, spheres, cylinders, and the ground. However, I think basic primitives are a tad boring… so I went ahead and integrated mesh collisions as well. The mesh collision stuff is actually using the same OBJ mesh system and KD-Tree system that I am using for my pathtracer! I am planning on cleaning up my OBJ/KD-Tree system and releasing it on Github or something soon, as I think I will still find even more uses for it in graphics projects.</p>
<p>Of course, a natural extension of mesh support is jello-on-jello interaction, which is why I call my simulator “multijello” instead of just singular jello. For jello-on-jello, my approach is to update one jello at a time, and for each jello, treat all other jellos in the simulation as just more OBJ meshes. This solution yields pretty good results, although some interpenetration happens if the time step is too large or if jello meshes are too sparse.</p>
<p>Here’s a video showcasing some things my jello simulator can do:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/37098929" frameborder="0">Experiments in Jello Simulation</iframe></div>
https://blog.yiningkarlli.com/2012/01/pathtracer-time.html
Pathtracer Time
2012-01-04T00:00:00+00:00
2012-01-04T00:00:00+00:00
Yining Karl Li
<p>This semester I am setting out on an independent study under the direction of <a href="http://www.graphics.cornell.edu/~kiderj/">Joe Kider</a> to build a pathtracer (obviously inspired by my friend and fellow DMD student <a href="http://peterkutz.com/computergraphics/">Peter Kutz</a>). Global illumination rendering techniques are becoming more and more relevant in industry today, as hardware performance in the past few years has begun to reach a point where GI in commercial productions is suddenly no longer prohibitively expensive. Some houses like Sony Imageworks have already moved to full GI renderers like Arnold, while other studios like Pixar are in the process of adopting GI based renderers or extending their existing renderers to support GI lighting. This industry move, coupled with the fact that GI quite simply produces <a href="http://vimeo.com/15630517">very</a> <a href="http://vimeo.com/7809605">pretty</a> <a href="http://vimeo.com/5407991">results</a>, sparked my initial interest in GI techniques like pathtracing. Having built a basic raytracer last semester, I decided in typical over-confident style: “how hard could it be?”</p>
<p>Here’s my project abstract:</p>
<p><em>Both path tracing and bidirectional scatter distribution functions (BSDFs) are ideas that have existed within the field of computer graphics for many years and have seen numerous implementations in a variety of rendering packages. Similarly, creating images of convincing plant life is a technical challenge that a host of solutions now exist for. However, achieving dynamic plant effects such as the change of a plant’s coloring during the transition from summer to fall is a task that to date has mostly been accomplished using procedural techniques and various compositing tricks.</em></p>
<p><em>The goal of this project is to build a path tracing based renderer that is designed specifically with the intent to facilitate achieving dynamic plant effects with a more physically based approach by introducing a new time component to the existing bidirectional scatter distribution model. By allowing BSDFs to vary over not only space but also over time, plant effects such as leaf decay could be achieved through shaders with appearances that are driven through physically based mathematical models instead of procedural techniques. In other words, this project has two main prongs: develop a robust path tracer with at least basic functionality, and then develop and implement a time-dependent BSDF model within the path tracer.</em></p>
<p>…and here’s some background that I wrote up for my proposal…</p>
<p><em><strong>1. INTRODUCTION</strong></em></p>
<p><em>Efficiently rendering convincing images with direct and indirect lighting has been a major problem in the field of computer graphics since the field’s very inception, as convincingly realistic graphics in games and movies depend upon lighting that can accurately mimic that of reality. Known generally as global illumination, the indirect lighting problem has in the past decade seen a number of solutions such as path tracing and photon mapping that can generate convincingly realistic images with reasonable computational resource consumption and efficiency.</em></p>
<p><em>One of the key discoveries that enabled the development of modern global illumination techniques is the concept of Bidirectional Scattering Distribution Functions, or BSDFs. Developed as a superset and generalization of two other concepts known as bidirectional reflectance distribution functions (BRDFs) and bidirectional transmittance distribution functions (BTDFs), a BSDF is a general mathematical function that describes how light is scattered by a certain surface, given the material properties of the surface. BSDFs are useful today for representing the material properties of an object at a single point in time; however, in reality material properties can change and morph over time, as exemplified by the natural phenomenon of leaf color changes from summer to fall.</em></p>
<p><em>This project will attempt to build a prototype of a path tracing renderer with a BSDF model modified to include an additional time component to allow for material properties to change over time in a way representative of how material properties change over time in reality. The hope is that such a renderer will prove to be useful in future attempts to recreate natural phenomena using physically based models, such as leaf decay.</em></p>
<p>…and the actual goal of the project…</p>
<p><em><strong>1.1 Design Goals</strong></em></p>
<p><em>The project’s goal is to develop a reasonably robust and efficient path tracing renderer with a BSDF model modified to include an additional time component. In order to prove the feasibility of such a modified BSDF model, the end goal is to be able to use the renderer to produce images of plant life with changing surface material properties, in addition to standard test images such as Cornell Box tests that validate the functionality of the underlying basic path tracer.</em></p>
<p>…and finally, what I’m hoping I’ll actually be able to produce at the end of this independent study:</p>
<p><em>1.2 <strong>Project’s Proposed Features and Functionality</strong></em></p>
<p><em>The proposed renderer should allow a user to load a scene with an arbitrary number of lights, materials, and objects and render out a realistic, global illumination based render. The renderer should be able to render implicitly defined objects such as spheres and cubes in addition to meshes defined in the .obj format. The renderer should also allow users to specify changes in object/light/camera transformations over time in addition to changes in materials and BSDFs over time, and render out a series of frames showing the scene at various points in time. A graphical interface would be a nice additional feature, but is not a priority of this project.</em></p>
<p>I’ll be posting at least weekly updates to this blog showing my progress. In my next post, I’ll go over some of the papers and sources Joe gave me to look over and explain some of the basic mechanics of how a pathtracer works. Apologies to the casual reader for this particular post being extremely text heavy; I shall have images to show soon!</p>
https://blog.yiningkarlli.com/2011/12/basic-raytracer-and-fun-with-kd-trees.html
Basic Raytracer and Fun with KD-Trees
2011-12-22T00:00:00+00:00
2011-12-22T00:00:00+00:00
Yining Karl Li
<p>The last assignment of the year for CIS460/560 (I’m still not sure what I’m supposed to call that class) is the dreaded RAYTRACER ASSIGNMENT.</p>
<p>The assignment is actually pretty straightforward: implement a recursive, direct lighting only raytracer with support for <a href="http://en.wikipedia.org/wiki/Blinn%E2%80%93Phong_shading_model">Blinn-Phong shading</a> and support for basic primitive shapes (spheres, boxes, and polygon extrusions). In other words, pretty much a barebones implementation of the <a href="http://dl.acm.org/citation.cfm?id=358882">original Turner Whitted raytracing paper</a>.</p>
<p>I’ve been planning on writing a global-illumination renderer (perhaps based on pathtracing or photon mapping?) for a while now, so my own personal goal with the raytracer project was to use it as a testbed for some things that I know I will need for my GI renderer project. With that in mind, I decided from the start that my raytracer should support rendering OBJ meshes and include some sort of acceleration system for OBJ meshes.</p>
<p>The idea behind the acceleration system goes like this: in the raytracer, one obviously needs to cast rays into the scene and track how they bounce around to get a final image. That means that every ray needs to have intersection tests against objects in the scene in order to determine what ray is hitting what object. Intersection testing against mathematically defined primitives is simple, but OBJ meshes present more of a problem; since an OBJ mesh is composed of a bunch of triangles or polygons, the naive way to intersection test against an OBJ mesh is to check for ray intersections with every single polygon inside of the mesh. This naive approach can get extremely expensive extremely quickly, so a better approach would be to use some sort of spatial data structure to quickly figure out what polygons are within the vicinity of the ray and therefore need intersection testing.</p>
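<p>To make the cost concrete, the naive approach is literally “test this ray against every triangle and keep the closest hit,” so the work per ray grows linearly with the polygon count. A small sketch (the per-triangle intersection routine is assumed to exist elsewhere):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;vector&gt;

struct Ray      { float origin[3]; float direction[3]; };
struct Triangle { float v0[3]; float v1[3]; float v2[3]; };

// Assumed to exist: returns true on a hit and writes the hit distance.
bool intersectTriangle(const Ray&amp; ray, const Triangle&amp; tri, float* outDistance);

// Naive mesh intersection: test the ray against every single triangle and keep
// the nearest hit. With meshes of hundreds of thousands of polygons and
// millions of rays, this linear scan per ray is exactly what a spatial data
// structure like a KD-tree is meant to avoid.
bool intersectMeshNaive(const Ray&amp; ray, const std::vector&lt;Triangle&gt;&amp; triangles,
                        float* outNearestDistance) {
    bool  hitAnything = false;
    float nearest     = 1e30f;
    for (size_t i = 0; i &lt; triangles.size(); ++i) {
        float distance;
        if (intersectTriangle(ray, triangles[i], &amp;distance) &amp;&amp; distance &lt; nearest) {
            nearest     = distance;
            hitAnything = true;
        }
    }
    *outNearestDistance = nearest;
    return hitAnything;
}
</code></pre></div></div>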
<p>After talking with Joe and trawling around on Wikipedia for a while, I picked a <a href="http://en.wikipedia.org/wiki/K-d_tree">KD-Tree</a> as my spatial data structure for accelerated mesh intersection testing. I won’t go into the details of how KD-Trees work, as the Wikipedia article does a better job of it than I ever could. I will note, however, that the main resources I ended up pulling information from while looking up KD-Tree stuff were Wikipedia, Jon McCaffrey’s old CIS565 slides on spatial data structures, and the fantastic <a href="http://www.pbrt.org/">PBRT book</a> that Joe pointed me towards.</p>
<p>Implementing the KD-Tree for the first time took me the better part of two weeks, mainly because I was misunderstanding how the surface area splitting heuristic works. Unfortunately, I probably can’t post actual code for my raytracer, since this is a class assignment that will be repeated in future incarnations of the class. However, I can show images!</p>
<p>The KD-Tree meant I could render meshes in a reasonable amount of time, so I rendered an airplane:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/2.png" alt="" /></a></p>
<p>The airplane took about a minute or so to render, which got me wondering how well my raytracer would work if I threw the full 500000+ poly <a href="http://en.wikipedia.org/wiki/Stanford_Dragon">Stanford Dragon</a> at it. This render took about five or six minutes to finish (without the KD-Tree in place, this same image takes about 30 minutes to render):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/5.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/5.png" alt="" /></a></p>
<p>Of course, the natural place to go after one dragon is three dragons. Three dragons took about 15 minutes to render, which is pretty much exactly a three-fold increase over one dragon. That means my renderer’s performance scales more or less linearly, which is good.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/4.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/4.png" alt="" /></a></p>
<p>For fun, and because I like space shuttles, here is a space shuttle. Because the space shuttle has a really low poly count, this image took under a minute to render:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/6.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/6.png" alt="" /></a></p>
<p>For reflections, I took a slightly different approach from the typical recursive method. The normal recursive approach to a raytracer is to begin with one ray, and trace that ray completely through recursion to its recursion depth limit before moving onto the next pixel and ray. However, such an approach might not actually be ideal in a GI renderer. For example, from what I understand, in pathtracing a better approach is to trace everything iteratively; that is, trace the first bounce for all rays and store where the rays are, then trace the second bounce for all rays, then the third, and so on and so forth. Basically, such an approach allows one to set an unlimited trace depth and just let the renderer trace and trace and trace until one stops the renderer, but the corresponding cost of such a system is slightly higher memory usage, since ray positions need to be stored from the previous bounce.</p>
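<p>In rough form, that iterative structure looks like the sketch below; this is only the shape of the idea, with the per-ray bounce step left as a placeholder rather than real renderer code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;vector&gt;

struct PathState {
    float origin[3];
    float direction[3];
    float throughput[3];   // color weight still carried by this path
    int   pixelIndex;      // which pixel the path contributes to
    bool  alive;           // false once the path has terminated
};

// Assumed to exist: advances one path by a single bounce, depositing any light
// found into the image and updating the path's origin/direction/throughput.
void traceOneBounce(PathState&amp; path, float* image);

// Iterative tracing: instead of recursing each ray to its full depth before
// moving on, process every ray's first bounce, then every ray's second bounce,
// and so on. The trade-off is that all ray states must be kept in memory.
void traceAllPaths(std::vector&lt;PathState&gt;&amp; paths, float* image, int maxBounces) {
    for (int bounce = 0; bounce &lt; maxBounces; ++bounce) {
        for (size_t i = 0; i &lt; paths.size(); ++i) {
            if (paths[i].alive) traceOneBounce(paths[i], image);
        }
    }
}
</code></pre></div></div>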
<p>Adding reflections did impact my render times pretty dramatically. I have a suspicion that both my intersection code and my KD-Tree are actually far from ideal, but I’ll have to look at that later. Here’s a test with reflections with the airplane:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/0.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/0.png" alt="" /></a></p>
<p>…and here is a test with three reflective dragons. This image took foooorrreeevvveeeerrrr to render…. I actually do not know how long, as I let it run overnight:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/render_test.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/render_test.png" alt="" /></a></p>
<p>I also added support for multiple lights with varying color support:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/1.png" alt="" /></a></p>
<p>Here are some more images rendered with my raytracer:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/7.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/7.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/render_test1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/render_test1.png" alt="" /></a></p>
<p>In conclusion, the raytracer was a fun final project. I don’t think my raytracer is even remotely suitable for actual production use, and I don’t plan on using it for any future projects (unlike my <a href="http://blog.yiningkarlli.com/2011/10/a-volumetric-renderer-for-rendering-volumes.html">volumetric renderer</a>, which I think I will definitely be using in the future). However, I will definitely be using stuff I learned from the raytracer in my future GI renderer project, such as the KD-tree stuff and the iterative raytracing method. I will probably have to give my KD-tree a total rewrite, since it is really really far from optimal here, so that is something I’ll be starting over winter break! Next stop, GI renderer, CIS563, and CIS565!</p>
<p>As an amusing parting note, here is the first proper image I ever got out of my raytracer. Awww yeeeaaahhhhhh:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/supersweet_raytraced_image.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/supersweet_raytraced_image.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2011/10/a-volumetric-renderer-for-rendering-volumes.html
A Volumetric Renderer for Rendering Volumes
2011-10-14T00:00:00+00:00
2011-10-14T00:00:00+00:00
Yining Karl Li
<p>The first assignment of the semester for CIS460 was to write, from scratch in C++, a volumetric renderer. Quite simply, a volumetric renderer is a program that can create a 2D image from a 3D discretized data set. Such a data set is more often referred to as a voxel grid. In other words, a volumetric renderer makes pictures from voxels. Such renderers are useful in visualizing medical imaging data and some forms of 3D scans and blah blah blah…</p>
<p>…or you can make pretty clouds.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/cloud07.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/cloud07.png" alt="" /></a></p>
<p>One of the first things I ever tried to make when I first was introduced to Maya was a cloud. I quickly learned that there simply is no way to get a nice fluffy cloud using polygonal modeling techniques. Ever since then I’ve kept the idea of making clouds parked in the back of my head, so when we were assigned the task of writing a volumetric renderer that could produce clouds, obviously I was pretty excited.</p>
<p>The coolest part of studying computer graphics from the computer science side of things has got to be the whole idea of “well, I want to make X, but I can’t seem to find any tool that can do X, so I guess…. I’LL JUST WRITE MY OWN PROGRAM TO MAKE X.”</p>
<p>I won’t go into detailed specifics about implementing the volumetric renderer, as that is a topic well covered by many papers written by authors much smarter than me. Also, future CIS460 students may stumble across this blog, and half the fun of the assignment is figuring out the detailed implementation for oneself. I don’t want to ruin that for them ;) Instead, I’ll give a general run-through of how this works.</p>
<p>The way the volumetric renderer works is pretty simple. You start with a big ol’ grid of voxels, called… the voxel grid or voxel buffer. From the camera, you shoot an imaginary ray through each pixel of what will be the final picture and trace that ray to see if it enters the voxel buffer. If the ray does indeed hit the voxel buffer, then you slowly sample along the ray a teeny step at a time and accumulate the color of the pixel based on the densities of the voxels traveled through. Lighting information is easy too: for each voxel reached, figure out how much stuff there is between that voxel and any light sources, and use a fancy equation to weight the amount of shadow a voxel receives. “But where does that voxel grid come from?”, you may wonder. In the case of my renderer, the voxel grid can either be loaded in from text files containing voxel data in a custom format, or the grid can be generated by sampling a Perlin noise function for each voxel in the grid.</p>
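<p>Stripped of all the details, the accumulation loop for a single pixel’s ray looks something like the sketch below. The density and lighting lookups are placeholders, and the attenuation math is deliberately generic; this is the shape of the raymarch, not the assignment’s solution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;

// Assumed to exist: the interpolated density of the voxel buffer at a world
// position (zero outside the grid), and how much light from the light sources
// reaches that position after attenuation by the stuff in between.
float densityAt(float x, float y, float z);
float lightReaching(float x, float y, float z);

// March along a ray through the voxel buffer in small steps, accumulating
// radiance front to back while tracking how much transmittance remains.
float raymarch(float ox, float oy, float oz,        // ray origin
               float dx, float dy, float dz,        // normalized ray direction
               float stepSize, int maxSteps) {
    float transmittance = 1.0f;    // how much of the background still shows through
    float accumulated   = 0.0f;    // accumulated (grayscale) radiance for the pixel
    for (int step = 0; step &lt; maxSteps &amp;&amp; transmittance &gt; 0.001f; ++step) {
        float t  = step * stepSize;
        float px = ox + dx * t, py = oy + dy * t, pz = oz + dz * t;
        float density = densityAt(px, py, pz);
        if (density &lt;= 0.0f) continue;
        float stepOpacity = 1.0f - std::exp(-density * stepSize);
        accumulated   += transmittance * stepOpacity * lightReaching(px, py, pz);
        transmittance *= 1.0f - stepOpacity;
    }
    return accumulated;
}
</code></pre></div></div>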
<p>So obviously volumetric renderers are pretty good for rendering clouds, as one can simply represent a cloud as a bunch of discrete points where each point has some density value. However, discretizing the world has a distinct disadvantage: artifacting. In the above render, some pixel-y artifacting is visible because the voxel grid I used wasn’t high enough resolution to make individual voxels indistinguishable. The problem is even more obvious in this render, where I stuck the camera right up into a cloud:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/newtest2_smooth.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/newtest2_smooth.png" alt="" /></a></p>
<p>(sidenote for those reading out of interest in CIS460: I implemented multiple arbitrary light sources in my renderer, which is where those colors are coming from)</p>
<p>There are four ways to deal with the artifacting issue. The first is to simply move the camera further away. Once the camera is sufficiently far away, even a relatively low resolution grid will look pretty smooth:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/newtest_withnoise.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/newtest_withnoise.png" alt="" /></a></p>
<p>A second way is to simply dramatically increase the resolution of the voxel grid. This technique can be very, very memory expensive though. Imagine a 100x100x100 voxel grid where each voxel requires 4 bytes of memory… the total memory required is about 3.8 MB, which isn’t bad at all. But let’s say we want a grid 5 times higher in resolution… a 500^3 grid needs 476 MB! Furthermore, a 1000x1000x1000 grid requires 3.72 GB! Of course, we could try to save memory by only storing non-empty voxels through the use of a hashmap or something, but that is more computationally expensive and gives no benefit in the worst case scenario of every voxel having some density.</p>
<p>A third alternative is to use trilinear interpolation or some other interpolation scheme to smooth out the voxel grid as it’s being sampled. This technique can lead to some fairly nice results:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/cloud10.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/cloud10.png" alt="" /></a></p>
<p>At least in the case of my renderer, there is a fourth way to deal with the artifacting: instead of preloading the voxel buffer with values from Perlin noise, why not just get rid of the notion of a discretized voxel buffer altogether and directly sample the Perlin noise function when raymarching? The result would indeed be a perfectly smooth, artifact free render, but the computational cost is extraordinarily high compared to using a voxel buffer.</p>
<p>Of course, one could just box blur the render afterwards as well. But doing so is sort of cheating.</p>
<p>I also played with trying to get my clouds to self illuminate, with the hope of possibly eventually making explosion type things. Ideally I would have done this by properly implementing a physically accurate black body system, but I did not have much time before the finished assignment was due to implement such a system. So instead, my friend Stewart Hills and I came up with a fake black body system where the emittance of each voxel was simply determined by how far the voxel is from the outside of the cloud. For each voxel, simply raycast in several random directions until each raymarch hits zero density, pick the shortest distance, and plug that distance into some exponential falloff curve to get the voxel’s emittance. Here’s a self-glowing cloud:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/blackbody_tril.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/blackbody_tril.png" alt="" /></a></p>
<p>…not even close to physically accurate, but pretty good looking for a hack that was cooked up in a few hours! A closeup shot:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/blackbody18.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/blackbody18.png" alt="" /></a></p>
<p>So! The volumetric renderer was definitely a fun assignment, and now I’ve got a cool way to make clouds! Hopefully I’ll be able to integrate this renderer into some future projects!</p>
https://blog.yiningkarlli.com/2011/10/buildinginstalling-alembic-for-osx.html
Building/Installing Alembic for OSX
2011-10-06T00:00:00+00:00
2011-10-06T00:00:00+00:00
Yining Karl Li
<p><a href="http://www.alembic.io/">Alembic</a> is a new open-source computer graphics interchange framework being developed by <a href="http://opensource.imageworks.com/">Sony Imageworks</a> and <a href="http://www.ilm.com/">ILM</a>. The basic idea is that moving animation rigs and data and whatnot between packages can be a very tricky procedure since every package has its own way to handle animation, so why not bake out all of that animation data into a common interchange format? So, for example, instead of having to import a Maya rig into Houdini, you could rig/animate in Maya, bake out the animation to Alembic, bring that into Houdini to conduct simulations with, and then bake out the animation and bring it back into Maya. This is a trend that a number of studios including Sony, ILM, Pixar, etc. have been moving toward for some time.</p>
<p>I’ve been working on a project lately (more on that later) that makes use of Alembic, but I found that the only way to actually get Alembic is to build it from source. That’s not terribly difficult, but there aren’t really any guides out there for folks who might not be as comfortable with building things from source. So, I wrote up a little guide!</p>
<p>Here’s how to build Alembic for OSX (10.6 and 10.7):</p>
<ol>
<li>Alembic has a lot of dependencies that can be annoying to build/install by hand, so we’re going to cheat and use Homebrew. To install Homebrew:</li>
</ol>
<code class="language-plaintext highlighter-rouge">/usr/bin/ruby -e "$(curl -fsSL https://raw.github.com/gist/323731)"</code>
<ol>
<li>Get/build/install cmake with Homebrew:</li>
</ol>
<code class="language-plaintext highlighter-rouge">brew install cmake</code>
<ol>
<li>Get/build/install Boost with Homebrew:</li>
</ol>
<code class="language-plaintext highlighter-rouge">brew install Boost</code>
<ol>
<li>Get/build/install HDF5 with Homebrew:</li>
</ol>
<code class="language-plaintext highlighter-rouge">brew install HDF5</code>
Homebrew builds HDF5 from source and runs make install itself, so this step may take some time. Be patient.
<ol>
<li>Unfortunately, ilmbase is not a standard UNIX package, so we can’t use Homebrew. We’ll have to build ilmbase manually. Get it from:</li>
</ol>
http://download.savannah.nongnu.org/releases/openexr/ilmbase-1.0.2.tar.gz
Untar/unzip to a readily accessible directory and cd into the ilmbase directory. Run:
<code class="language-plaintext highlighter-rouge">./configure</code>
After that finishes, we get to the annoying part: ilmbase by default makes use of a deprecated GCC 3.x compiler flag called -Wno-long-double, which no longer exists in GCC 4.x. We’ll have to remove this flag from ilmbase’s makefiles manually in order to build correctly. In each of the following files:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>`/Half/Makefile
/HalfTest/Makefile
/Iex/Makefile
/IexTest/Makefile
/IlmThread/Makefile
/Imath/Makefile
/ImathTest/Makefile`
</code></pre></div></div>
Find the following line:
<code class="language-plaintext highlighter-rouge">CXXFLAGS = -g -O2 -D_THREAD_SAFE -Wno-long-double</code>
and delete it from the makefile.
Once all of that is done, you can make and then make install like normal.
Now move the ilmbase folder to somewhere safe. Something like /Developer/Dependencies might work, or alternatively /usr/include/.
<ol>
<li>
<p>Time to actually build Alembic. Get the source tarball from:</p>
<p><code class="language-plaintext highlighter-rouge">http://code.google.com/p/alembic/wiki/GettingAlembic</code></p>
<p>Untar/unzip into a readily accessible directory and then create a build root directory parallel to the source root you just created:</p>
<p><code class="language-plaintext highlighter-rouge">mkdir ALEMBIC_BUILD</code></p>
<p>The build root doesn’t necessarily have to be parallel, but here we’ll assume it is for the sake of consistency.</p>
</li>
<li>
<p>Now cd into ALEMBIC_BUILD and bootstrap the Alembic build process. The bootstrap script is a python script:</p>
<p><code class="language-plaintext highlighter-rouge">python ../[Your Alembic Source Root]/build/bootstrap/alembic_bootstrap.py</code></p>
<p>The script will ask you for a whole bunch of paths:</p>
<p>For “Please enter the location where you would like to build the Alembic”, enter the full path to your <code class="language-plaintext highlighter-rouge">ALEMBIC_BUILD</code> directory.</p>
<p>For “Enter the path to lexical_cast.hpp:”, enter the full path to your lexical_cast.hpp, which should be something like <code class="language-plaintext highlighter-rouge">/usr/local/include/boost/lexical_cast.hpp</code></p>
<p>For “Enter the path to libboost_thread:”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/local/lib/libboost_thread-mt.a</code></p>
<p>For “Enter the path to zlib.h”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/include/zlib.h</code></p>
<p>For “Enter the path to libz.a”, we’re actually not going to link against libz.a. We’ll be using libz.dylib instead, which should be at something like <code class="language-plaintext highlighter-rouge">/usr/lib/libz.dylib</code></p>
<p>For “Enter the path to hdf5.h”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/local/include/hdf5.h</code></p>
<p>For “Enter the path to libhdf5.a”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/local/Cellar/hdf5/1.x.x/lib/libhdf5.a </code>(unless you did not use Homebrew for installing hdf5, in which case libhdf5.a will be in whatever lib directory you installed it to)</p>
<p>For “Enter the path to ImathMath.h”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/local/include/OpenEXR/ImathMath.h</code></p>
<p>For “Enter the path to libImath.a”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/local/lib/libImath.a</code></p>
<p>Now hit enter, and let the script finish running!</p>
</li>
<li>
<p>If everything is bootstrapped correctly, you can now run make. This will take a while, so be patient.</p>
</li>
<li>
<p>Once the make finishes successfully, run make test to check for any problems.</p>
</li>
<li>
<p>Finally, run make install, and we’re done! Alembic should install to something like <code class="language-plaintext highlighter-rouge">/usr/bin/alembic-1.x.x/.</code></p>
</li>
</ol>
https://blog.yiningkarlli.com/2011/09/installing-numpy-for-maya-2012-64-bit-on-osx-10-7.html
Installing Numpy for Maya 2012 64-bit on OSX 10.7
2011-09-05T00:00:00+00:00
2011-09-05T00:00:00+00:00
Yining Karl Li
<p>On OSX 10.6, installing <a href="http://numpy.scipy.org/">Numpy</a> for Maya 2012 was <a href="http://animateshmanimate.com/2011/03/30/python-numpy-and-maya-osx-and-windows/">simple enough</a>. You could do it either by directly copying the Numpy install folder into the site-packages folder of Maya’s bundled Python, or by adding a sys.path.append line to Maya’s userSetup.py. The process was quite simple since OSX 10.6’s default preinstalled version of Python was 2.6.x and Maya 2012 uses Python 2.6.x as well.</p>
<p>However, OSX 10.7 comes with Python 2.7.x, so a few extra steps are needed:</p>
<p>For Maya 2012 64-bit:</p>
<ol>
<li>
<p>OSX 10.7 comes with Python 2.7.x, but we need 2.6.x, so install 2.6.x using the official installer from here: http://www.python.org/ftp/python/2.6.6/python-2.6.6-macosx10.3.dmg</p>
</li>
<li>
<p>Since we’re using 64-bit Maya with 64-bit Python, we’ll need a 64-bit build of Numpy. The official version distributed on numpy.scipy.org is 32-bit, though. Thankfully, there is an unofficial 64-bit build in the form of the <a href="http://stronginference.com/scipy-superpack/">Scipy Superpack for Mac OSX</a>. Even though we’re on OSX 10.7, we’ll want the OSX 10.6 variety of the script, since the OSX 10.7 version depends on Python 2.7.x: <a href="http://idisk.mac.com/fonnesbeck-Public/superpack_10.6_2011.07.10.sh">http://idisk.mac.com/fonnesbeck-Public/superpack_10.6_2011.07.10.sh</a></p>
<p>EDIT (01/12/2012): I’ve been informed by Michael Frederickson that the link originally posted to the unofficial 64 bit Scipy Superpack build for 10.6 no longer works. Fortunately, I’ve backed up both the script and the required dependencies. The install script can be found here: <a href="http://yiningkarlli.com/files/osx10.7numpy2.6/superpack_10.6_2011.07.10.sh">http://yiningkarlli.com/files/osx10.7numpy2.6/superpack_10.6_2011.07.10.sh</a></p>
</li>
<li>
<p>Go to where the script downloaded to and in Terminal:</p>
<p><code class="language-plaintext highlighter-rouge">chmod +x superpack_10.6_2011.07.10.sh
./superpack_10.6_2011.07.10.sh </code></p>
<p>If you don’t already have GNU Fortran, make sure to answer ‘yes’ when the script asks.</p>
</li>
<li>
<p>Once the script is done installing, in Terminal:</p>
<p><code class="language-plaintext highlighter-rouge">ls /Library/Python/2.7/site-packages/ | grep numpy </code></p>
<p>You should get something like: <code class="language-plaintext highlighter-rouge">numpy-2.0.0.dev_b5cdaee_20110710-py2.6-macosx-10.6-universal.egg</code></p>
<p>Even though we installed Numpy for Python 2.6.x, on Lion it installs to the 2.7 folder for some reason. No matter, you can either leave it there or move it to 2.6.</p>
</li>
<li>
<p>Go to <code class="language-plaintext highlighter-rouge">/Users/[your username]/Library/Preferences/Autodesk/maya/2012-x64/scripts </code></p>
</li>
<li>
<p>If you don’t have a file named <code class="language-plaintext highlighter-rouge">userSetup.py</code>, make one and open it in a text editor. If you already have one, just open it.</p>
</li>
<li>
<p>Add these lines to the file:</p>
<p><code class="language-plaintext highlighter-rouge">import os
import sys
sys.path.append('/Library/Python/2.7/site-packages/[thing you got from step 4]') </code></p>
</li>
<li>
<p>Sidenote: installing Python 2.6.x sets your default OSX Python to 2.6.x, but if you want to go back to 2.7.x, just edit your <code class="language-plaintext highlighter-rouge">~/.bash_profile</code> and remove these lines:</p>
<p><code class="language-plaintext highlighter-rouge">PATH="/Library/Frameworks/Python.framework/Versions/2.6/bin:${PATH}"
export PATH </code></p>
</li>
</ol>
<p>…and you should be done! In Maya, you should be able to just use <code class="language-plaintext highlighter-rouge">import numpy</code> and you’ll be good to go!</p>
https://blog.yiningkarlli.com/2011/09/why-backups-are-important.html
GH House Project, a.k.a. Why Backups are Important
2011-09-01T00:00:00+00:00
2011-09-01T00:00:00+00:00
Yining Karl Li
<p>Here is a cautionary tale about why backing up one’s harddrive is EXTREMELY IMPORTANT.</p>
<p>Over the summer, I started making a little scene based off of the <a href="http://www.ronenbekerman.com/challenges/architectural-visualization-challenge-i-the-gh-house/">GH House Challenge from RonenBekerman.com</a>, partially as a way to learn Vray and partially just for fun. I was working off of my laptop for the entire project, since I was in California at the time and didn’t have access to more powerful machines at home. Being out in California for the summer, I brought as little stuff with me as possible.</p>
<p>One of the things I decided to leave home was my backup Time Machine drive. “Oh, I won’t need this over the summer, what are the odds of file corruption or harddrive issues anyhow? I’ll be fine”, I thought to myself.</p>
<p>Which means, of course, that halfway through the summer a bunch of my files got corrupted and were therefore lost forever, and of course that block of lost data included my in-progress GH House project. NEVER ASSUME THAT YOU DO NOT NEED BACKUP.</p>
<p>What follows are some random in-progress renders that survived through being in posts I made to Facebook and Tumblr.</p>
<p>Here are a series of small in-progress renders showing shading and lighting tests:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse01.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse01.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse02.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse02.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse03.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse03.png" alt="" /></a></p>
<p>I also started playing with some ideas for the interior:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/5.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/5.jpg" alt="" /></a></p>
<p>…and finally, some larger in-progress renders. These renders represent where the project was when I lost all of the data:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/1.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/2.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/3.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/3.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/4.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/4.jpg" alt="" /></a></p>
<p>In the end, the fact that I lost the project isn’t as important as the fact that I learned quite a lot from tinkering with this project. However, losing all of the data for this project was definitely a major bummer. But, lesson learned: BACK UP ALL THE TIME.</p>
https://blog.yiningkarlli.com/2011/05/animation-final-project-stills.html
Animation Final Project Stills
2011-05-08T00:00:00+00:00
2011-05-08T00:00:00+00:00
Yining Karl Li
<p>For my Computer Animation class’s final, I decided to go for a change of pace and work in 2D instead of in Maya. I want to tweak a few things before I post the finished animation, but I have two more finals to get through first. So for now, here are some stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s4.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s4.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s5.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s5.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s6.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s6.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2011/05/why-cd-when-you-can-go.html
Why cd when you can go?
2011-05-05T00:00:00+00:00
2011-05-05T00:00:00+00:00
Yining Karl Li
<p>I learned a sweet trick from fellow Penn CIS student <a href="http://alexeymk.com/">Alexey Komissarouk</a>’s blog today: the ‘go’ command!</p>
<p>So in a standard *nix bash CLI, you have your typical cd command. We all know how to use cd.</p>
<p>But have you ever accidentally cd’d a file? “cd /stuff/blah.txt” makes no sense and just gets you a “Not a directory” error. So then you have to backtrack and use vim or emacs or nano or whatever… blarg. If you’re using emacs or vim, you like efficiency and you’ve already lost efficiency by wasting a perfectly good moment trying to cd into a file.</p>
<p>Enter the ‘go’ command!</p>
<p>Add this bit of code to your .bashrc file and replace $EDITOR with the CLI text editor of your choice:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>go()
{
if [ -f $1 ]
then
$EDITOR $1
else
cd $1 && ls
fi
}
</code></pre></div></div>
<p>and you’re all done! Now when you go to a directory, bash will cd and when you go to a file, bash will fire up vim or emacs or whatever.</p>
<p>As a side note, it might be fun to modify the ‘go’ command even further to automatically launch actions for other filetypes as well, like running javac whenever a .java file is encountered, launching .jar files, or running gcc or make whenever C++ makefiles are encountered. That’s left as an exercise for the reader though!</p>
https://blog.yiningkarlli.com/2011/04/chairs-now-with-balloons.html
Chairs…. now with Balloons!
2011-04-29T00:00:00+00:00
2011-04-29T00:00:00+00:00
Yining Karl Li
<p>Oops, I haven’t posted in a while…</p>
<p>A few weeks back I decided to try out overhauling one of my previous projects with VRay. I figured the chairs project would be fun, so…</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/shot3.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/shot3.jpg" alt="" /></a></p>
<p>Wwwwaaaayyyy prettier than before. I really like VRay, although I feel that setting it up is a bit more involved than MentalRay is. I still haven’t made many inroads with Photorealistic Renderman, so I can’t comment on that quite yet.</p>
<p>Oh, also, as you can see, I added balloons too. I like balloons.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/shot0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/shot0.jpg" alt="" /></a></p>
<p>I decided to add balloons after seeing an article on <a href="http://www.ronenbekerman.com/">RonenBekerman.com</a> a while back about shading balloons using VRay in 3DSMax. I’m using VRay in Maya, however, so I had to figure out how to recreate the shader in Maya’s Hypershade. The shader network wound up looking like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/balloonshadernetwork.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/balloonshadernetwork.png" alt="" /></a></p>
<p>It’s *almost* fully procedural, minus that one black and white ramp image that I wound up using for a lot of things. Replacing that image with a procedural ramp shader to make the entire shader fully procedural probably wouldn’t be very hard at all, but I got lazy :p</p>
<p>I was originally going to post breakdowns of all of the settings for each node in the shading network as well, but again, I’m lazy. So instead, <a href="http://www.yiningkarlli.com/files/BalloonShader.zip">here’s the shader in a Maya .ma file</a>!</p>
<p>A few more renders:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/shot1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/shot1.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/shot2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/shot2.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/shot4.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/shot4.jpg" alt="" /></a></p>
<p>As soon as my last finals are over in about a week, I’ll catch up with my backlog of things that need to be posted. I’m planning on posting a series of posts introducing some concepts in graphics programming that I learned in CIS277 this semester. I’m not going to go super duper in depth (for that, take CIS277! Dr. Norm Badler is an awesome professor.), but at the very least I’ll highlight some of the cooler things I learned. That class was really neat, we wound up writing our own 2D animation software from scratch and our final team project assignment was to build our own 3D modeling software. Basically, we made mini-Maya. My team (Adam Mally, Stewart Hills, and me) got some really neat stuff to work.</p>
<p>Speaking of Stewart, Stewart and I both will be interning at Pixar this summer! We got into their Pixar Undergraduate Program… uh… program. PUP is essentially a 10-week crash course on Pixar’s production pipeline, so we’ll be learning about everything from modeling to simulation to using Photorealistic Renderman. I’m really looking forward to that.</p>
https://blog.yiningkarlli.com/2011/03/vray-tree.html
VRay Tree
2011-03-28T00:00:00+00:00
2011-03-28T00:00:00+00:00
Yining Karl Li
<p>After being frustrated with Mentalray for a few weeks, I’ve decided to start experimenting with VRay. VRay is… pretty amazing.</p>
<p>I’ve been continuing my tree experiments using VRay. VRay’s Sun&Sky system is much nicer than Mentalray’s system and VRay has this crazy useful two-sided material for flat two dimensional planes… such as leaves. Here’s what I managed to cook up over the weekend:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/tree.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/tree.jpg" alt="" /></a></p>
<p>I’m still working out some kinks in my new tree workflow. I’ll post a full breakdown in a few days.</p>
https://blog.yiningkarlli.com/2011/03/mo-tree-and-grass-experimenting.html
Mo’ Tree (and Grass) Experimenting
2011-03-19T00:00:00+00:00
2011-03-19T00:00:00+00:00
Yining Karl Li
<p>I experimented with subsurface scatter based shaders for plant leaves today! I’m still working on it, so I won’t be writing up what I’ve found until a bit later. But for now, here’s what I’ve managed to get!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest0.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest0.png" alt="" /></a></p>
<p>The grass is just Maya fur with a custom shader (woot subsurface scatter!) and the tree is modeled after a Japanese Maple and was made the same way as the one I posted a few days back.</p>
<p>Back to exploring!</p>
https://blog.yiningkarlli.com/2011/03/autumn-tree.html
Autumn Tree!
2011-03-17T00:00:00+00:00
2011-03-17T00:00:00+00:00
Yining Karl Li
<p>Every couple of months I find myself trying to make trees again in Maya. Today I found myself tackling the tree problem yet again…</p>
<p>I’ve found that using XFrog’s plant modeler program is my favorite way to create base meshes for plants. It sure as heck beats hand modeling all of those leaves… Speaking of models and leaves, the method I’ve settled on for tree leaves is to just use planes where the leaves should go and then make the planes look like leaves through alpha mapping.</p>
<p>Anyhoo, here’s where I managed to get tonight:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest4.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest4.png" alt="" /></a></p>
<p>The displacement on the bark is really really weird right now and the color of the leaves is weird too. I think I’m going to try subsurface scattering on the leaves… see if that helps. More updates later…</p>
https://blog.yiningkarlli.com/2011/03/demoreel-update.html
Demoreel Update!
2011-03-11T00:00:00+00:00
2011-03-11T00:00:00+00:00
Yining Karl Li
<p>So… after interviewing with Paul Kanyuk from Pixar, I’ve decided to update my reel a bit…</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/20909195" frameborder="0">Demoreel Spring 2011 v2.1</iframe></div>
<p><a href="http://yiningkarlli.com/demoreel/spring2011/spring2011breakdown.pdf">Breakdown is here (PDF)</a>.</p>
<p>So why the updated reel? Interviewing with Paul and chatting with the other two people from Pixar that visited Penn was really interesting. Paul had a lot of suggestions for my work during our interview, so I’ve decided to go ahead and incorporate a lot of the changes that Paul suggested.</p>
<p>So changelog time!</p>
<p>Overall Changes:</p>
<ul>
<li>New song! The new song is an instrumental version of Baby Universe from the We Love Katamari OST. As usual, the version of the reel I’m actually sending out to studios for internships has no music, though.</li>
<li>I’ve replaced “Postcards from Prague” with a new project, “Chairs”</li>
<li>Shuffled around the order of some pieces.</li>
</ul>
<p>Apples:</p>
<ul>
<li>Recomposited with slightly better z-depth using a new depth of field plugin for After Effects I found called Frischluft Lenscare. Apparently Alex Roman uses it, and if Alex Roman uses it, then gosh golly I’d better give it a try. Hahaha.</li>
<li>Slightly tweaked color grading</li>
<li>There’s a little more footage of the second shot of the apples bouncing than there was in the previous reel</li>
</ul>
<p>Hermit Crab:</p>
<ul>
<li>Recomposited with tweaked ambient occlusion in the turntable. Paul pointed out that there were some odd light leaking issues on the underside of the shell’s opening, so I’ve increased the intensity of the AO there to try to make it a bit darker.</li>
<li>Fixed a small problem with the transition between the Untextured Lambert and the Fully Textured parts of the turntable</li>
</ul>
<p>White Room:</p>
<ul>
<li>Every shot’s depth of field has been redone using Frischluft Lenscare</li>
<li>The first shot of the underside of the stairs was lengthened, rerendered, recomposited, and re color graded.</li>
<li>The second shot has new color grading and altered AO.</li>
<li>The third shot was rerendered with new contrast settings and re color graded.</li>
</ul>
<p>Clock:</p>
<ul>
<li>Paul pointed out that a major flaw with the clock was that the highlight on the glass washed out everything under the glass, so I fixed that by changing the reflective properties of the glass slightly and giving the glass more of a curve to break up the highlight</li>
<li>Turntables were sped up to help the reel’s overall pacing</li>
<li>Textures on the clock face are sharper than before</li>
</ul>
<p>Raincoat Girls:</p>
<ul>
<li>Turntable was rerendered to get rid of the light blue band that appeared partway through the turntables in the previous reel</li>
<li>Environment shots were re color graded</li>
</ul>
https://blog.yiningkarlli.com/2011/02/demoreel-and-new-site.html
Demoreel and New Site!
2011-02-21T00:00:00+00:00
2011-02-21T00:00:00+00:00
Yining Karl Li
<p>Look! I finally cut together a demoreel!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/20205051" frameborder="0">Demoreel Spring 2011 v1.2</iframe></div>
<p>I’m getting interviewed by PIXAR! Actually, Pixar gets a slightly different version of my reel. The Pixar version simply has no music… they’re not very big fans of music in demoreels, apparently.</p>
<p>Which is why I got off of my lazy bum and finally cut a reel together.</p>
<p>Oh, I have a reel breakdown too! <a href="http://yiningkarlli.com/demoreel/spring2011/spring2011_v1.pdf">Check it out here (PDF)</a>.</p>
<p>The song in my reel is “I Like Van Halen Because My Sister Says They Are Cool” by El Ten Eleven. I’ve recut it slightly to fit the length of the reel.</p>
<p>I also finally put together a personal site thing at <a href="http://www.yiningkarlli.com">www.yiningkarlli.com</a>.</p>
<p>Speaking of new sites… I should probably get around to redesigning Omjii.com soon. Hm.</p>
https://blog.yiningkarlli.com/2011/02/how-bout-them-apples.html
How 'Bout Them Apples?
2011-02-19T00:00:00+00:00
2011-02-19T00:00:00+00:00
Yining Karl Li
<p>Earlier this week my mom gave my roommates and me a ginormous sack of apples, so I’ve been eating apples all week. Which is good, because I love apples.</p>
<p>So I had an apple sitting on my desk, and I had Maya open, and I was a little bit bored, so… I made some apples in Maya!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/21169670" frameborder="0">Falling Apples</iframe></div>
<p>Over the past few months I’ve developed a bit of an… odd… workflow for texturing/shading irregularly shaped objects (apples… muddy boots… hermit crabs…). I start with modeling and whatnot in Maya, as usual:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_wireframe.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_wireframe.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_flat.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_flat.png" alt="" /></a></p>
<p>Then I go into Photoshop and use various reference (images found online, photos taken with my Nikon D60, etc) to paint a tile of the texture I want. For example, for the apples I took some photos of the apples and then extracted textures from the photos to create this texture tile:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apple_stencil.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apple_stencil.jpg" alt="" /></a></p>
<p>Next, I bring the object mesh and the texture tile into Mudbox and use Mudbox’s projection stencil tool to paint the mesh using the texture tile as the stencil. The nice thing about bringing things into Mudbox for texturing is that I don’t really have to worry too much about UV mapping. Mudbox will automagically take care of all of the UV stuff as long as the imported mesh doesn’t have any overlapping UV coordinates. So instead of messing with the UV editor in Maya before texturing, I can just use Maya’s Automatic UV mapping tool to make sure that no UVs overlap and bring that into Mudbox. After painting in Mudbox, I got a texture image like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apple_texture.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apple_texture.jpg" alt="" /></a></p>
<p>After texture painting in Mudbox, deriving spec and bump maps in Photoshop is a relatively straightforward affair. Once texturing and shading is done, I render out the beauty pass and z-depth pass and other passes…</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_z.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_z.png" alt="" /></a></p>
<p>…and bring all those passes into After Effects for compositing and color grading, and I’m done!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples3.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2011/02/watermelon-smash.html
Watermelon Smash
2011-02-04T00:00:00+00:00
2011-02-04T00:00:00+00:00
Yining Karl Li
<p>For animation class, we were given an assignment where we each had to pick a random mixed drink name and use that name as the basis of a 10 second animation in After Effects. I picked something called a Watermelon Smash (it contains…. watermelon cubes… and I don’t remember what else). So… here’s a watermelon smashing something!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/21053019" frameborder="0">Watermelon Smash</iframe></div>
<p>We were actually allowed to use anything we wanted for the actual animation, the only rule was that we had to composite the final result together in After Effects. I wound up doing the ocean, splashes, and sparks in Flash (with tons of help from <a href="http://elementalmagic.blogspot.com/">Joseph Gilland</a>’s book <a href="http://www.amazon.com/Elemental-Magic-Special-Effects-Animation/dp/0240811631">Elemental Magic</a>) and painting the boat and the watermelon in Photoshop. The watermelon itself is animated entirely using the puppet warp tool in After Effects.</p>
<p>I’m not terribly happy with the sound. That needs some reworking probably…</p>
<p>Here’s a few stills breaking down how the entire thing was composited:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown4.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown4.png" alt="" /></a></p>
<p>…and here’s some random stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/shot1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/shot1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/shot2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/shot2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/shot3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/shot3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/shot4.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/shot4.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2011/01/recent-stuff.html
Recent Stuff
2011-01-27T00:00:00+00:00
2011-01-27T00:00:00+00:00
Yining Karl Li
<p>This is just a quick dump of recent things I’ve been working on, detailed posts to come later.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/test10.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/test10.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/render3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/render3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/render1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/render1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/render2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/render2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/trainstation_testrender011.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/trainstation_testrender011.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/trainstation_testrender02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/trainstation_testrender02.jpg" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/12/raincoat-girl-turntable.html
Raincoat Girl Turntable
2010-12-15T00:00:00+00:00
2010-12-15T00:00:00+00:00
Yining Karl Li
<p>I’m done with my character model! Until I can think of a better name, I’m just calling her “Raincoat Girl”.</p>
<p>Here’s a still:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/character.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/character.png" alt="" /></a></p>
<p>Getting the hair and cloth sims to work right took aaaggggeeesssss. Thank goodness <a href="http://www.marissakrupen.blogspot.com/">Marissa Krupen</a> knows so much and helped me out a lot.</p>
<p>I wound up cheating on the subsurface scatter for the skin. I couldn’t get it to look right on its own, so I wound up using a layered shader with the subsurface on one layer and a normal texture map on the other layer. I think the end result looks okay.</p>
<p>Turntable!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/17840556" frameborder="0">Raincoat Girl</iframe></div>
<p>I’ll post more stills later, but now I really need to study for that Finance final that I’ve been avoiding… I also have to make an environment for my character to go in to, and I still have that final project for 3D modeling to finish (I haven’t even started…).</p>
https://blog.yiningkarlli.com/2010/12/character-model-face.html
Character Model Face
2010-12-12T00:00:00+00:00
2010-12-12T00:00:00+00:00
Yining Karl Li
<p>Okay, I’m close to finished with the face…</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/test_render.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/test_render.png" alt="" /></a></p>
<p>I’ve decided not to give her eyelashes. They didn’t look very good. I also noticed that in some Pixar characters, Pixar chose to keep the lips the same color as the rest of the skin… I kind of like that style choice, so I’m going to steal it (read: Karl is too lazy to paint lips).</p>
https://blog.yiningkarlli.com/2010/12/cloth-simulation-progress.html
Cloth Simulation Progress
2010-12-04T00:00:00+00:00
2010-12-04T00:00:00+00:00
Yining Karl Li
<p>I’m working on a character model right now loosely based on <a href="http://blog.yiningkarlli.com/2010/10/puddle-redux.html">this sketch</a>. Here’s how the cloth simulation stuff is looking right now:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_4.jpeg"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_4.jpeg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_3.jpeg"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_3.jpeg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_2.jpeg"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_2.jpeg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_1.jpeg"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_1.jpeg" alt="" /></a></p>
<p>There is still much work to be done.</p>
https://blog.yiningkarlli.com/2010/11/city-street-playing-with-z-depth-and-ambient-occlusion.html
City Street- Playing with Z-Depth and Ambient Occlusion
2010-11-19T00:00:00+00:00
2010-11-19T00:00:00+00:00
Yining Karl Li
<p>I haven’t managed to make any progress on actually finishing this project since my last post, but I have had a bit of time to play with ambient occlusion and z-depth mapping. So… same render as before, but now with depth of field and some ambient occlusion:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender7_composite_ao_zv2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender7_composite_ao_zv2.jpg" alt="" /></a></p>
<p>…and the z-depth map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/z.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/z.jpg" alt="" /></a></p>
<p>…and the ambient occlusion map. I did the leaves on the trees by transparency mapping the planes where the leaves went on the model, but because of that I wasn’t sure how I was supposed to ambient occlude the trees. So I removed them for the ambient occlusion map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/a_o.jpeg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/a_o.jpeg" alt="" /></a></p>
<p>I actually found an alternate way to render out the z-depth map, but I’m not entirely sure this is as physically accurate as the standard way Maya does z-depth:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/z_alt.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/z_alt.jpg" alt="" /></a></p>
<p>Hopefully more soon!</p>
https://blog.yiningkarlli.com/2010/11/city-street-progress.html
City Street Progress
2010-11-17T00:00:00+00:00
2010-11-17T00:00:00+00:00
Yining Karl Li
<p>I’ve been working on a little city street for a few days now. I want to capture the kind of old European feel that one can find in places like Edinburgh.</p>
<p>Right now this is about 65% done. I think I’m going to try to make it look like an old postcard.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender7_compositev2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender7_compositev2.jpg" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/11/clock-miniproject.html
Clock Miniproject
2010-11-08T00:00:00+00:00
2010-11-08T00:00:00+00:00
Yining Karl Li
<p>Over the weekend I decided to do a little mini-project to try out some new tricks I’ve learned with rendering. I decided to try to make as photorealistic of an image as possible of a clock. Here’s what I came up with:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender_composite.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender_composite.jpg" alt="" /></a></p>
<p>The clock face is noticeably pixelated; I’m not entirely sure why that is. For some reason Mental Ray is not sampling the texture file at a very high frequency; I’ll work on that next, I suppose.</p>
<p>A little breakdown video of the compositing that went into the clock:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/16631563" frameborder="0">Clock Rendering/Compositing Breakdown</iframe></div>
https://blog.yiningkarlli.com/2010/11/hermit-crab.html
Hermit Crab
2010-11-06T00:00:00+00:00
2010-11-06T00:00:00+00:00
Yining Karl Li
<p>The hermit crab is complete!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/16569708" frameborder="0">Hermit Crab Redux</iframe></div>
<p>Modeled in Maya, textured in Mudbox, rendered with MentalRay.</p>
<p>I’m perfectly aware that no hermit crab would ever actually live in a conch shell that large, but I thought the image of a small crab in a huge shell was amusing.</p>
<p>Some stills from some different angles…</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/1.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/2.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/3.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/3.png" alt="" /></a></p>
<p>As the title “redux” suggests, the crab above is actually the second version of the hermit crab I’ve made. I originally finished about a week earlier with a different version, but then after getting some suggestions from Professor Scott White, my 3D modeling professor, I decided to redesign the conch shell. Here’s what Hermit Crab Mark I looked like:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/16490011" frameborder="0">Hermit Crab</iframe></div>
<p>I actually still want to change some things. If I have time, I’m going to go back and change the displacement mapping on the conch to get the grooves to all go in a more uniform direction. Also, 9 seconds into the turntable, you might notice there’s a slight shiny spot on the conch. That’s a mistake I made in the specular map that I definitely want to fix. I also want to try placing some small low intensity lights really close to the crab’s eyes to bring out the gloss that’s visible in the Mark I crab. In the Mark II crab, the shadow from the flaring part of the conch makes the crab’s eyes look matte. The crab’s claws need some color tweaking as well; the color doesn’t quite perfectly match the rest of the crab.</p>
<p>The DMD director, Amy Calhoun, told me that no modeler is ever satisfied with a model. So true.</p>
https://blog.yiningkarlli.com/2010/10/hermit-crab-ready-for-texturing.html
Hermit Crab Ready For Texturing!
2010-10-31T00:00:00+00:00
2010-10-31T00:00:00+00:00
Yining Karl Li
<p>My hermit crab is ready for texturing and lighting and rendering! I’m going with Mudbox for texture painting for sure. I’m still not entirely sure how I’m going to get all the prickly parts of the legs done… I’ll probably just do a displacement map or something.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/front.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/front.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/right.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/right.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/left.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/left.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/back.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/back.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/top.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/top.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/10/hermit-crab-progress.html
Hermit Crab Progress
2010-10-28T00:00:00+00:00
2010-10-28T00:00:00+00:00
Yining Karl Li
<p>I’m working on a hermit crab in 3D Modeling class! The shell was really hard to make… I wound up making a small segment, duplicating it with Duplicate Special, and then stitching all the segments together by hand. So… the crab itself is only some legs right now. I have a lot of work to do on this still…</p>
<p>I’m thinking about trying Mudbox for texturing this thing. The UVs on the shell aren’t pretty, and I don’t want to spend a gazillion hours unwrapping those UVs….</p>
<p>More later.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress1.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress2.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress3.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress3.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/10/puddle-redux.html
Puddle! Redux
2010-10-23T02:00:00+00:00
2010-10-23T02:00:00+00:00
Yining Karl Li
<p><a href="http://floatingdoor.blogspot.com/">Ana</a> suggested a few changes. Much credit to her, the painting looks much much better now:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_take3.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_take3.png" alt="" /></a></p>
<p>I also played with giving the girl goggles of some sort and a snorkel, but I’m not sure this idea works so well.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_take2.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_take2.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/10/puddle.html
Puddle!
2010-10-23T01:00:00+00:00
2010-10-23T01:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_final.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_final.png" alt="" /></a></p>
<p>I felt like painting (in Photoshop) today. I’m pretty happy with how this one turned out.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/20.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/20.jpg" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/10/little-kiddo.html
Little Kiddo
2010-10-18T00:00:00+00:00
2010-10-18T00:00:00+00:00
Yining Karl Li
<p>In Penn’s SIGGRAPH chapter, we’re spending the next few months designing our own little characters in Maya!</p>
<p>I drew a little girl complete with little kid size coat and rubber rain boots:</p>
<p><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/characterdesign.jpg" alt="" /></p>
<p>I don’t really have any idea yet of what kind of adventure she’ll go on. I’ll figure that out as I go along, I suppose. I picked a little kiddo mainly because I love how crazy exaggerated little kids often make their expressions. Just check out the absolutely beautifully animated short <a href="http://vimeo.com/15731659">Playing with light - Mon ami le robot</a>.</p>
<p>More later!</p>
https://blog.yiningkarlli.com/2010/09/give-gifi.html
Give Gifi!
2010-09-29T00:00:00+00:00
2010-09-29T00:00:00+00:00
Yining Karl Li
<p>A few weeks ago I joined a startup founded by a few Penn alums called <a href="http://www.venmo.com/">Venmo</a>! My project at Venmo for the past few weeks has been helping my friend and co-worker, <a href="http://twitter.com/ayanonagon">Ayaka Nonaka</a>, with a new app from Venmo called <a href="http://www.givegifi.com/">Gifi</a>, which is a Foursquare/Venmo mashup that lets people leave Venmo money at geographic locations. I’ve been working on Gifi’s <a href="http://www.givegifi.com/">website</a> and overall look. Here’s some of the artwork I did for Gifi:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Sep/quizzical.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Sep/quizzical.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Sep/panels_small.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Sep/panels_small.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/07/playing-with-maya.html
Playing with Maya
2010-07-10T00:00:00+00:00
2010-07-10T00:00:00+00:00
Yining Karl Li
<p>I’ve been playing with Maya for the past few months. 3D animation is a direction I’d like to start moving in.</p>
<p>I’m starting to get the hang of lighting things and whatnot, although I still do not know much. I’m taking a 3D modeling course in the fall, hopefully I’ll get much better by then.</p>
<p>Some glasses and strawberries and grapes on a table:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/12256221" frameborder="0">A Table With Some Stuff</iframe></div>
https://blog.yiningkarlli.com/2010/04/sneak-peak.html
A Sneak Peak...
2010-04-08T00:00:00+00:00
2010-04-08T00:00:00+00:00
Yining Karl Li
<p>A lot of you guys already know what I’ve been up to for the past few weeks, but for anybody who I haven’t told, here’s a peek:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Apr/maya_progress.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Apr/maya_progress.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/04/the-foyer.html
The Foyer
2010-04-04T00:00:00+00:00
2010-04-04T00:00:00+00:00
Yining Karl Li
<p>For most of March, our Digital Design Foundations assignment was to create a room from an odd perspective using Illustrator and Photoshop. Here’s what I came up with:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04.jpg" alt="" /></a></p>
<p>Almost all of this is Illustrator, including the wood, which took ages to do. I loosely based this off of the foyer back at home. My intent with this piece was to practice doing lighting work, and I must say I’m rather happy with it.</p>
<p>I made a quick little video of all the stages this piece went through:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/10676852#" frameborder="0">The Foyer</iframe></div>
<p>Just for fun, here are some color variations that I made by putting it through Lightroom:</p>
<p>I call this variant the “Coraline” version:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_coraline.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_coraline.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_greenglow.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_greenglow.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_purple.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_purple.jpg" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/02/george-harrison-portrait.html
George Harrison Portrait
2010-02-23T00:00:00+00:00
2010-02-23T00:00:00+00:00
Yining Karl Li
<p>For Digital Design Foundations, our latest project was to do a portrait of a famous person with interesting looking hair. We were supposed to do these in Illustrator with black, white, and two colors of our choosing (the two colors allowed us to use as many opacity settings as we wanted for each color).</p>
<p>I decided to do George Harrison from the Beatles:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Feb/HarrisonFinalv2_full.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Feb/HarrisonFinalv2.png" alt="" /></a></p>
<p>The background pattern is influenced by the psychedelic pattern on the inner sleeve of Sgt. Pepper’s Lonely Hearts Club Band.</p>
<p>Here are some studies that show the progression of the portrait and some of the variations his mustache went through:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Feb/HarrisonStudiesv2_full.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Feb/HarrisonStudiesv2.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/01/elemental-magic-workshop-with-joseph-gilland.html
Elemental Magic Workshop with Joseph Gilland
2010-01-27T00:00:00+00:00
2010-01-27T00:00:00+00:00
Yining Karl Li
<p>Last week over the Martin Luther King Jr. Day weekend, I attended a workshop on effects animation at Penn’s School of Design. The workshop was run by <a href="http://elementalmagic.blogspot.com/">Joseph Gilland</a>, who ran effects animation at Walt Disney Feature Animation for a while and worked on films such as Lilo and Stitch and Mulan.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_6.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_6.jpg" alt="" /></a></p>
<p>The workshop ran for three days and focused on Mr. Gilland’s “organic approach” to visual effects animation: basically, his idea is that effects animation should focus on more traditional, hand-animated techniques rather than the complex CGI simulation stuff that’s all the rage today. After his workshop, I think I agree with him; the stuff he showed us was simply breathtaking.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_68.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_68.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_7.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_7.jpg" alt="" /></a></p>
<p>During the workshop, we did some studies and prototyping for various visual effects ideas we had. I chose to do an exploding aquarium (I think Mr. Gilland started referring to me as “the crazy guy” after I chose that). I had another idea as well: a hand reaching through smoke into a bank vault or something. Here’s a sketch of the two initial ideas:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/concepts.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/concepts.jpg" alt="" /></a></p>
<p>Mr. Gilland was kind enough to talk through the concept with me though. He sketched this initial concept for me (mad cool!):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/jgsketch.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/jgsketch.jpg" alt="" /></a></p>
<p>After talking with Mr. Gilland and looking at some of his suggestions, I went about doing three separate studies of what the glass, water, and fireball might look like. Some pencil sketches:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/pencilsketch1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/pencilsketch1.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/pencilsketch2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/pencilsketch2.jpg" alt="" /></a></p>
<p>Then I scanned the three studies and composited/colorized them together in Photoshop. The blue is water, the yellow/orange is the fireball, and the red represents the shattering glass:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/finalcolor.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/finalcolor.jpg" alt="" /></a></p>
<p>I’m going to try to actually do some animation tests for this, although obviously the end result will have to be much simpler visually than the above sketches. I have a feeling I’m going to be consulting the effects library Mr. Gilland gave us a LOT.</p>
<p>Mr. Gilland has a book on visual effects animation titled <a href="http://www.amazon.com/Elemental-Magic-Special-Effects-Animation/dp/0240811631">Elemental Magic</a>. I really recommend checking it out.</p>
https://blog.yiningkarlli.com/2009/12/experimenting-with-time-lapse.html
Experimenting with Time Lapse
2009-12-21T00:00:00+00:00
2009-12-21T00:00:00+00:00
Yining Karl Li
<p>Only two days left until my last final for the semester, so what do I do? Not study! I SHOULD be studying, but then the entire Northeast coast got slammed with a snowstorm. The snow looked really cool outside my window, which meant… photography experiment time!</p>
<p>Recently I’ve been experimenting with Adobe After Effects and Adobe Premiere Pro. More on that later. I also recently got Nikon Camera Control Pro 2, which is Nikon’s tool for remote controlling their DSLRs from computers, which means I can now remote trigger my Nikon D60 over USB from my MacBook Pro. Awesomeness. Time for some snowstorm time lapse experimenting!</p>
<p>So on Friday night/Saturday morning, I pointed my D60 out the window and set Camera Control to take a picture every 40 seconds for 5 hours, starting at 5 AM. Unfortunately, I forgot to charge the battery, so the camera died 80 minutes into the experiment. Also, apparently the movement of the camera’s internal mirror is enough to shift the camera a bit if it isn’t stabilized. As a result, the video is really short and not very stable. It’s not particularly good, but it’s a start:</p>
<div class="embed-container"><iframe src="http://www.youtube.com/embed/IdzW27ydfqo" frameborder="0">Snowstorm Sunrise Time Lapse Test- 12/19/2009</iframe></div>
<p>This time lapse experiment also served a secondary purpose: to test out the planned workflow that we’re going to try using with the upcoming Omjii Show. I composited all of the video in Adobe After Effects and Adobe Premiere Pro and then used Apple Color to color grade the video. If you’re wondering why I’m using Apple Color alongside Premiere Pro instead of Final Cut Pro: I tend to favor tools that plug into Adobe’s Creative Suite workflow, but Adobe doesn’t have a color grader, whereas Apple has a really nice one.</p>
<p>Later in the afternoon, I decided to give the time lapse another shot. This time I remembered to charge the battery and stabilize the camera. The result:</p>
<div class="embed-container"><iframe src="http://www.youtube.com/embed/r5ICBpAj_uI" frameborder="0">Snowstorm Sunset Time Lapse- 12/19/2009</iframe></div>
<p>The problem with attempting time-lapses with a DSLR is that the length of time you can cover is limited by your battery, unless you have an extended battery or something. Another attempt, this time from Sunday:</p>
<div class="embed-container"><iframe src="http://www.youtube.com/embed/7XRyl3fAiqg" frameborder="0">Sunset Over UPenn Time Lapse- 12/20/2009</iframe></div>
<p>I’m still working on getting the technique down, but I’ll post improved attempts and a detailed run-through of the process once I figure out how to stabilize better, among other things.</p>
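<p>One approach to better stabilization that I might try is to estimate each frame’s drift relative to the first frame and then shift it back before compositing. Here’s a minimal, untested sketch of that idea in Python with OpenCV; the <code>frames/</code> and <code>stabilized/</code> folder names are just placeholders, and it only handles translation, not rotation:</p>
<pre><code># Minimal translation-only stabilization sketch for a time lapse sequence.
# Folder names and frame naming are placeholders, not an actual pipeline.
import glob
import os

import cv2
import numpy as np

os.makedirs("stabilized", exist_ok=True)
paths = sorted(glob.glob("frames/*.jpg"))

# Use the first frame as the alignment reference.
reference = cv2.imread(paths[0])
ref_gray = np.float32(cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY))

for i, path in enumerate(paths):
    frame = cv2.imread(path)
    gray = np.float32(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))

    # Estimate the (dx, dy) translation between this frame and the reference
    # via phase correlation, then warp the frame back by that offset.
    # (The sign of the shift may need flipping depending on how the drift
    # comes out in practice; I haven't verified the convention.)
    (dx, dy), _response = cv2.phaseCorrelate(ref_gray, gray)
    warp = np.float32([[1, 0, -dx], [0, 1, -dy]])
    stabilized = cv2.warpAffine(frame, warp, (frame.shape[1], frame.shape[0]))

    cv2.imwrite(os.path.join("stabilized", f"frame_{i:04d}.jpg"), stabilized)
</code></pre>
<p>Phase correlation only picks up translation, so rotation from actually bumping the camera would need a fancier alignment method, but for small drift like the mirror-slap shift I mentioned above, something like this should be a reasonable first pass.</p>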
https://blog.yiningkarlli.com/2009/12/an-introduction.html
An Introduction
2009-12-01T00:00:00+00:00
2009-12-01T00:00:00+00:00
Yining Karl Li
<p>Welcome to Code and Visuals, my blog for tracking my exploration of the world of computer graphics!</p>
<p>This post says December 2009 on it, but it’s actually backdated. I’m adding this post backdated in order to serve as a bit of an introduction. This blog began elsewhere but eventually became my computer graphics blog. Upon moving the hosting of this blog to Github Pages, I’ve decided to clear out some older off-topic posts, although those posts will remain available on the <a href="http://yiningkarlli.blogspot.com">old Blogger version of this blog</a>.</p>
<p>I started this blog around the time I joined Penn’s Digital Media Design program in 2009. Most of the older posts on this blog are pretty silly, but hopefully they show that I’ve made progress since then!</p>