Code & Visuals
2024-02-06T18:13:00+00:00
https://blog.yiningkarlli.com
Yining Karl Li
https://blog.yiningkarlli.com/2021/11/encanto.html
Encanto
2021-11-29T00:00:00+00:00
Yining Karl Li
<p>For the first time since 2016, Walt Disney Animation Studios is releasing not just one animated feature in a year, but two!
The second Disney Animation release of 2021 is <a href="https://movies.disney.com/encanto">Encanto</a>,
which marks a major milestone as Disney Animation’s 60th animated feature film.
Encanto is a musical set in Colombia about a girl named Mirabel and her family: the amazing, fantastical, magical Madrigals.
I’m proud of every Disney Animation project that I’ve had the privilege to work on, but I have to admit that this year was something different and something very special to me, because this year we completed both Raya and the Last Dragon and Encanto, which are together two of my favorite Disney Animation projects so far.
Earlier this year, I wrote about the <a href="https://blog.yiningkarlli.com/2021/03/raya-and-the-last-dragon.html">amazing work that went into Raya and the Last Dragon</a> and why I loved working on that project; with Encanto now in theaters, I now get to share why I’ve loved working on Encanto so much as well!</p>
<p>Disney Animation feature films take many years and hundreds of people to make, and often the film’s story can remain in a state of flux for much of the film’s production.
All of the above isn’t unusual; large-scale creative endeavors like filmmaking often entail an extremely complex and challenging process.
More often than not, a film requires time and many iterations to really find its voice and gain that spark that makes it a great film.
Encanto, however, is a film that a lot of my coworkers and I realized was going to be really special very early on in production.
Now obviously, that hunch didn’t mean that making Encanto was easy by any means; every film requires tons of hard work from the most amazing, inspiring, talented artists and engineers that I know.
But, I think in the end, that initial hunch about Encanto was proven correct: the finished Encanto has a story that is bursting with warmth and meaning, has one of Disney Animation’s best main characters to date with a huge cast of charming supporting characters, has the most beautiful, magical animation and visuals we’ve ever done, and sets all of the above to a wonderful soundtrack with a bunch of catchy, really cleverly written new songs.
Both the production process and final film for Encanto were a strong reminder for me of why I love working on Disney Animation films in the first place.</p>
<p>From a technical perspective, Encanto also represents something very special in the history of Disney Animation’s continual advancements in animation technology.
To understand why this is, a very brief review of the history of Disney Animation’s modern production pipeline and toolset is helpful.
In retrospect, Disney Animation’s 50th animated feature film, Tangled, was probably one of the most important films the studio has ever made from a technical perspective, because the production of Tangled required a near-total ground-up rebuild of the studio’s production pipeline and tools that wound up laying the technical foundations for Disney Animation’s modern era.
While every film we’ve made since Tangled has seen us make enormous technical strides in a variety of areas, the starting point of the production pipeline we’ve used and evolved for every CG film up until Encanto was put into place during Tangled.
The fact that Encanto is Disney Animation’s 60th animated feature film is therefore fitting; Encanto is the first film made using the successor to the production pipeline that was first built for Tangled, and just like how Tangled laid the technical foundations for the subsequent ten films that followed, Encanto lays the technical foundations for many more future films to come!
As presented in the USD Birds of a Feather session at SIGGRAPH 2021, this new production pipeline is built on the open-source Universal Scene Description project and brings massive upgrades to almost every piece of software and every custom tool that our artists use.
An absolutely monumental amount of work was put into building a new USD-based world at Disney Animation, but I think the effort was extremely worthwhile: thanks to the work done on Encanto, Disney Animation is now well set up for another decade of technical innovation and another decade of pushing animation as a medium forward.
I’m hoping that we’ll be able to present much more on this topic at SIGGRAPH 2022!</p>
<p>Moving to a new production pipeline meant also moving Disney’s Hyperion Renderer to work in the new production pipeline.
To me, one of the biggest advantages of an in-house production renderer is the ability for the renderer development team to work extremely closely with other teams in the studio in an integrated fashion, and moving Hyperion to work well in the new USD-based world exemplifies just how important this collaboration is.
We couldn’t have pulled off this effort without the huge amount of amazing work that engineers and TDs and artists from many other departments pitched in.
However, having to move an existing renderer to a new pipeline isn’t the only impact on rendering that the new USD-based world has had.
One of the most exciting things about the new pipeline is all of the new possibilities and capabilities that USD and Hydra unlock; one of the biggest projects our rendering team worked on during Encanto’s production was a new, very exciting next-generation rendering project.
I can’t talk too much about this project yet; all I can say is that we see it as a major step towards the future of rendering at Disney Animation, and that even in its initial deployment on Encanto, we’ve already seen huge fundamental improvements to how our lighters work every day.
Hopefully we’ll be able to reveal more soon!</p>
<p>Of course, just because Encanto saw huge foundational changes to how we make movies doesn’t mean that there weren’t the usual fun and interesting show-specific challenges as well.
Encanto presented many new, weird, fun problems for the rendering team to think about.
Geometry fracturing was a major effect used extensively throughout Encanto, and in order to author and render fractured geometry as efficiently as possible, the rendering team had to devise some really clever new geometry-processing features in Hyperion.
Encanto’s cinematography direction called for a beautiful, really colorful look that required pushing artistic controllability in our lighting capabilities even further, and to that end our team developed a bunch of cool new artistic control enhancements in Hyperion’s volume rendering and light shaping systems.
One of my favorite show-specific challenges that I got to work on for Encanto was for the holographic effect in Bruno’s emerald crystal prophecies.
For a variety of reasons, the artists wanted this effect done completely in-render; coming up with an in-render solution required many iterations and prototypes and experiments carried out over several months through a close collaboration between a number of artists and TDs and the rendering team.
Encanto also saw continued advancements to Hyperion’s state-of-the-art deep-learning denoiser and stereo rendering solutions and saw continued advancements in Hyperion’s shading models and traversal system.
These advancements helped us tackle many of the interesting complexity and scaling challenges that Encanto presented; effects like Isabella’s flowers and the glowing magical particles associated with the Madrigal family’s miracle pushed instancing counts to incredible new record levels, and for the first time ever on a Disney Animation film, we actually rendered some of the gorgeous costumes in the movie not as displaced triangle meshes with fuzz on top, but as <em>actual woven curves at the thread-level</em>.
The latter proved crucial to creating the chiffon and tulle in Isabella’s outfit and played a huge part in creating the look of Mirabel’s characteristic custom-embroidered skirt.
My mind was thoroughly blown when I saw those renders for the first time; on every film, I’m constantly amazed and impressed by what our artists can do with the tools we provide them with.
Again, I’m hoping that we’ll be able to share much more about all of these things later; keep an eye on SIGGRAPH 2022!</p>
<p>Encanto also saw rendering features that we first developed for previous films pushed even further and used in interesting new ways.
We first deployed a path guiding implementation in Hyperion back on Frozen 2, but path guiding wound up not seeing too much use on Raya and the Last Dragon since Raya’s setting was mostly outdoors, and path guiding doesn’t help as much in direct-lighting dominant scenarios such as outdoor scenes.
However, since a huge part of Encanto takes place inside of the magical Madrigal casita, indoor indirect illumination was a huge component of Encanto’s lighting.
We found that path guiding provided enormous benefits to render times in many indoor scenes, and especially in settings like the Madrigal family’s kitchen at night, where lighting was almost entirely provided by outdoor light sources coming in through windows and from candles and stuff.
I think this case was a great example of how we benefit from how closely our lighting artists and our rendering engineers work together on many shows over time; because we had all worked together on similar problems before, we all had shared experiences with past solutions that we were able to draw on together to quickly arrive at a common understanding of the new challenges on Encanto.
Another good example of how this collaboration continues to pay dividends over time is in the choices of lens and bokeh effects that were used on Encanto.
For Raya and the Last Dragon, we learned a lot about creating non-uniform bokeh and interesting lensing effects, and what we learned on Raya in turn helped further inform early cinematography and lensing experiments on Encanto.</p>
<p>In addition to all of the cool renderer development work that I usually do, I also got to take part in something a little bit different on Encanto.
Every year, the lighting department brings on a handful of trainees, who are put through several months of in-studio “lighting school” to learn our tools and pipeline and approach to lighting before lighting real shots on the film itself.
This year, I got to join in with the lighting trainees while they were going through lighting training; this experience wound up being one of my favorites from the past year.
I think that having to sit down and actually learn and use software the same way that the users have to is an extraordinarily valuable experience for any software engineer that is building tools for users.
Even though I’ve been working at Disney Animation for six years now, and even though I know the internals of how our renderer works extensively, I still learned a ton from having to actually use Hyperion to light shots and address notes from lighting supervisors and stuff!
Encanto’s lighting style required really leaning on the tools that we have for art-directing and pushing and modifying fully physical lighting, which really changed my perspective on some of these tools.
For most rendering engineers and researchers, features that allow for breaking purely physical light transport are often seen as annoying and difficult to implement but necessary concessions to the artists.
Having now used these features in order to hit artistic notes on short time frames though, I now have a better understanding of just how critical a component these features can be in an artist’s toolbox.
I owe a huge amount of thanks to Disney Animation’s technology department leadership and to the lighting department for having made this experience possible and for having strongly supported this entire “exchange program”; I’d strongly recommend that every rendering engineer should go try lighting some shots sometime!</p>
<p>Finally, here are a handful of stills from the movie, 100% created using Disney’s Hyperion Renderer by our amazing artists.
I’ve ordered the frames randomly, to try to prevent spoiling anything important.
These frames showcase just how gorgeous Encanto looks, but since they’re pulled from only the marketing materials that have been released so far, they only represent a small fraction of how breathtakingly beautiful and colorful the total film is.
Hopefully I’ll be able to share a bunch more cool and beautiful stills closer to SIGGRAPH 2022.
I highly recommend seeing Encanto on the biggest screen you can; if you are a computer graphics enthusiast, go see it twice: the first time for the wonderful, magical story and the second time for the incredible artistry that went into every single shot and every single frame!
I love working on Disney Animation films because Disney Animation is a place where some of the most amazing artists and engineers in the world work together to simultaneously advance animation as a storytelling medium, as a visual medium, and as a technology.
Art being inspired by technology and technology being challenged by art is a legacy that is deeply baked into the very DNA of Disney Animation, and that approach is exemplified by every single frame in Encanto:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_001.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_001.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_002.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_002.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_003.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_003.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_004.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_004.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_005.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_005.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_006.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_006.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_007.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_007.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_008.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_008.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_009.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_009.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_010.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_010.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_011.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Nov/encanto/CASA_011.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p>Also, be sure to catch our new short, Far From the Tree, which is accompanying Encanto in theaters.
Far From the Tree deserves its own discussion later; all I’ll write here is that I’m sure it’s going to be fascinating for rendering and computer graphics enthusiasts to see!
Far From the Tree tells the story of a parent and child raccoon as they explore a beach; the short has a beautiful hand-drawn watercolor look that is actually CG rendered out of Disney’s Hyperion Renderer and extensively augmented with hand-crafted elements.
Be sure to see Far From the Tree in theaters with Encanto!</p>
https://blog.yiningkarlli.com/2021/10/takua-on-m1-max.html
Rendering on the Apple M1 Max Chip
2021-10-25T00:00:00+00:00
Yining Karl Li
<p>Over the past year, I ported my hobby renderer, Takua Renderer, to 64-bit ARM.
I wrote up the entire process and everything I learned as a three-part blog post series covering topics ranging from assembly-level comparison between x86-64 and arm64, to deep dives into various aspects of Apple Silicon, to a comparison of x86-64’s SSE and arm64’s Neon vector instructions.
In the intro to part 1 of my arm64 series, I wrote about my <a href="https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html#motivation">motivation for exploring arm64</a>, and in the <a href="https://blog.yiningkarlli.com/2021/07/porting-takua-to-arm-pt2.html#conclusion">conclusion to part 2</a> of my arm64 series, I wrote the following about the Apple M1 chip:</p>
<blockquote>
<p>There’s really no way to understate what a colossal achievement Apple’s M1 processor is; compared with almost every modern x86-64 processor in its class, it achieves significantly more performance for much less cost and much less energy. The even more amazing thing to think about is that the M1 is Apple’s low end Mac processor and likely will be the slowest arm64 chip to ever power a shipping Mac; future Apple Silicon chips will only be even faster.</p>
</blockquote>
<p>Well, those future Apple Silicon chips are now here!
Last week (relative to the time of posting), Apple announced new 14 and 16-inch MacBook Pro models, powered by the new Apple M1 Pro and Apple M1 Max chips.
Apple reached out to me last week immediately after the announcement of the new MacBook Pros, and as a result, for the past week I’ve had the opportunity to use a prerelease M1 Max-equipped 2021 14-inch MacBook Pro as my daily computer.
So, to my extraordinary surprise, this post is the unexpected Part 4 to what was originally supposed to be a two-part series about Takua Renderer on arm64.
This post will serve as something of a coda to my Takua Renderer on arm64 series, but will also be fairly different in structure and content to the previous three parts.
While the previous three parts dove deep into extremely technical details about arm64 assembly and Apple Silicon and such, this post will focus on a single question: now that professional-grade Apple Silicon chips exist in the wild, <em>how well do high-end rendering workloads run on workstation-class arm64</em>?</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/macbookpro14.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/macbookpro14.jpg" alt="Figure 1: The new 2021 14-inch MacBook Pro with an Apple M1 Max chip, running Takua Renderer." /></a></p>
<p>Before we dive in, I want to get a few important details out of the way.
First, this post is not really a product review or anything like that, and I will not be making any sort of endorsement or recommendation on what you should or should not buy; I’ll just be writing about my experiences so far.
Many amazing tech reviewers exist out there, and if what you are looking for is a general overview and review of the new M1 Pro and M1 Max based MacBook Pros, I would suggest you go check out reviews by The Verge, Anandtech, MKBHD, Dave2D, LinusTechTips, and so on.
Second, as with everything in this blog, the contents of this post represent only my personal opinion and do not in any way represent any kind of official or unofficial position, endorsement, or opinion on any matter from my employer, Walt Disney Animation Studios.
When Apple reached out to me, I received permission from Disney Animation to go ahead on a purely personal basis, and beyond that nothing with this entire process involves Disney Animation.
Finally, Apple is not paying me or giving me anything for this post; the 14-inch MacBook Pro I’ve been using for the past week is strictly a loaner unit that has to be returned to Apple at a later point.
Similarly, Apple has no say over the contents of this post; Apple has not even seen any version of this post before publishing.
What is here is only what I think!</p>
<p>Now that a year has passed since the first Apple Silicon arm64 Macs were released, I do have my hobby renderer up and running on arm64 with everything working, but I’ve only rendered relatively small scenes so far on arm64 processors.
The reason I’ve stuck to smaller scenes is because high-end workstation-class arm64 processors so far just have not existed; while large server-class arm64 processors with large core counts and tons of memory do exist, these server-class processors are mostly found in huge server farms and supercomputers and are not readily available for general use.
For general use, the only arm64 options so far have been low-power single-board computers like the Raspberry Pi 4 that are nowhere near capable of running large rendering workloads, or phones and tablets that don’t have software or operating systems or interfaces suitable for professional 3D applications, or M1-based Macs.
I have been using an M1 Mac Mini for the past year, but while the M1 performance-wise punches way above what a 15 watt TDP typically would suggest, the M1 only supports up to 16 GB of RAM and only represents Apple’s <em>entry</em> into Apple Silicon based Macs.
The M1 Pro and M1 Max, however, are Apple’s first high-powered arm64-based chips targeted at professional workloads, meant for things like high-end rendering and many other creative workloads; by extension, the M1 Pro and M1 Max are also the first arm64 chips of their class in the world with wide general availability.
So, in this post, answering the question “how well do high-end rendering workloads run on workstation-class arm64” really means examining how well the M1 Pro and M1 Max can do rendering.</p>
<p>Spoiler: the answer is <em>extremely well</em>; all of the renders in the post were rendered on the 14-inch MacBook Pro with an M1 Max chip.
Here is a screenshot of Takua Renderer running on the 14-inch MacBook Pro with an M1 Max chip:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/takua-on-m1max.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/takua-on-m1max.jpg" alt="Figure 2: Takua Renderer running on arm64 macOS 12 Monterey, on a 14-inch MacBook Pro with an M1 Max chip." /></a></p>
<p>The 14-inch MacBook Pro I’ve been using for the past week is equipped with the maximum configuration in every category: a full M1 Max chip with a 10-core CPU, 32-core GPU, 64 GB of unified memory, and 8 TB of SSD storage.
However, for this post, I’ll only focus on the 10-core CPU and 64 GB of RAM, since Takua Renderer is currently CPU-only (more on that later); for a deep dive into the M1 Pro and M1 Max’s entire system-on-a-chip, I’d suggest taking a look at <a href="https://www.anandtech.com/show/17019/apple-announced-m1-pro-m1-max-giant-new-socs-with-allout-performance">Anandtech’s great initial impressions</a> and later <a href="https://www.anandtech.com/show/17024/apple-m1-max-performance-review">in-depth review</a>.</p>
<p>The first M1 Max spec that jumped out at me is the 64 GB of unified memory; having this amount of memory meant I could finally render some of the largest scenes I have for my hobby renderer.
To test out the M1 Max with 64 GB of RAM, I chose the forest scene from my <a href="https://blog.yiningkarlli.com/2018/10/bidirectional-mipmap.html">Mipmapping with Bidirectional Techniques</a> post.
This scene has enormous amounts of complex geometry; almost every bit of vegetation in this scene has highly detailed displacement mapping that has to be stored in memory, and the large amount of textures in this scene is what drove me to implement a texture caching system in my hobby renderer in the first place.
In total, this scene requires just slightly under 30 GB of memory just to store all of the subdivided, tessellated, and displaced scene geometry, and requires an additional few more GB for the texture caching system (the scene can render with just a 1 GB texture cache, but having a larger texture cache helps significantly with performance).</p>
<p>I have only ever published two images from this scene: the <a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest.cam0.0.jpg">main forest path view</a> in the mipmapping blog post, and a closeup of a tree stump as the title image on my personal website.
I originally had several more camera angles set up that I wanted to render images from, and I actually did render out 1080p images.
However, to showcase the detail of the scene better, I wanted to wait until I had 4K renders to share, but unfortunately I never got around to doing the 4K renders.
The reason I never did the 4K renders is because I only have one large personal workstation that has both enough memory and enough processing power to actually render images from this scene in a reasonable amount of time, but I needed this workstation for other projects.
I also have a few much older spare desktops that do have just barely enough memory to render this scene, but unfortunately, those machines are so loud and so slow and produce so much heat that I prefer not to run them at all if possible, and I especially prefer not running them on long render jobs when I have to work from home in the same room!
However, over the past week, I have been able to render a bunch of 4K images from my forest scene on the M1 Max 14-inch MacBook Pro; quite frankly, being able to do this on a laptop is incredible to me.
Here is the title image from my personal website, but now rendered at 4K resolution on the M1 Max 14-inch MacBook Pro:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/forest.cam2.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/forest.cam2.0.jpg" alt="Figure 3: Forest scene title image from my personal website. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version." /></a></p>
<p>The M1 Max-based MacBook Pro is certainly not the first laptop to ever ship with 64 GB of RAM; the previous 2019 16-inch MacBook Pro was also configurable up to 64 GB of RAM, and there are crazy PC laptops out there that can be configured up even higher.
However, this is where the M1 Max and M1 Pro’s CPU performance comes into play: while previous laptops could support 64 GB of RAM and more, actually utilizing large amounts of RAM was difficult since previous laptop CPUs often couldn’t keep up!
Being able to fit a large data set into memory is one thing, but being able to run processing fast enough to actually make use of large data sets in a reasonable amount of time is the other half of the puzzle.
My wife has a 2019 16-inch MacBook Pro with 32 GB of memory, which is <em>just</em> enough to render my forest scene.
However, as seen in <a href="#results">the benchmark results later in this post</a>, the 2019 16-inch MacBook Pro’s Intel Core i7-9750H CPU with 6 cores and 12 threads is over twice as slow as the M1 Max at rendering this scene <em>at best</em>, and can be even slower depending on thermals, power, and more.
Rendering each of the images in this post took a few hours on the M1 Max, but on the Core i7-9750H, the renders have to become overnight jobs with the 16-inch MacBook Pro’s fans running at full speed.
With only a week to write this post, a few hours per image versus an overnight job per image made the difference between having images ready for this post versus not having any interesting renders to show at all!</p>
<p>Actually, the M1 Max isn’t just fast for a chip in a laptop; the M1 Max is stunningly competitive even with <em>desktop</em> workstation CPUs.
For the past few years, the large personal workstation that I offload large projects onto has been a machine with dual Intel Xeon E5-2680 workstation processors with 8 cores / 16 threads each for a total of 16 cores and 32 threads.
Even though the Xeon E5-2680s are ancient at this point, this workstation’s performance is still on-par with that of the current Intel-based 2020 27-inch iMac.
The M1 Max is faster than the dual-Xeon E5-2680 workstation at rendering my forest scene, and considerably so.
But of course, a comparison with aging Sandy Bridge era Xeons isn’t exactly a fair sporting competition; the M1 Max has almost a decade of improved processor design and die shrinks to give it an advantage.
So, I also tested the M1 Max against… the current generation 2019 Mac Pro, which uses an Intel Xeon W-3245 CPU with 16 cores and 32 threads.
As expected, the M1 Max loses to the 2019 Mac Pro… <em>but not by a lot</em>, and for a fraction of the power used.
The Intel Xeon W-3245 has a 205 watt TDP just for the CPU alone and has to be utilized in a huge desktop tower with an extremely elaborate custom-engineered cooling solution, whereas the M1 Max 14-inch MacBook Pro has a reported whole-system TDP of just 60 watts!</p>
<p>How does Apple pack so much performance with such little energy consumption into their arm64 CPU designs?
A number of factors come into play here, ranging from partnering with TSMC to manufacture on cutting-edge 5 nm process nodes to better microarchitecture design to better software and hardware integration; outside of Apple’s processor engineering labs, all anyone can really do is just hypothesize and guess.
However, there are some good guesses out there!
Several plausible theories have to do with the choice to use the arm64 instruction set; the argument goes that having been originally designed for low-power use cases, arm64 is better suited for efficient energy consumption than x86-64, and scaling up a more efficient design to huge proportions can mean more capable chips that use less power than their traditional counterparts.
Another theory revolving around the arm64 instruction set has to do with microarchitecture design considerations.
The M1, M1 Pro, and M1 Max’s high-performance “Firestorm” cores <a href="https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2">have been observed</a> to have an absolutely humongous reorder buffer, which enables extremely deep out-of-order execution capabilities; modern processors attain a lot of their speed by reordering incoming instructions to do things like hide memory latency and bypass stalled instruction sequences.
The M1 family’s high-performance cores possess an out-of-order window that is around twice as large as that in Intel’s current Willow Cove microarchitecture and around three times as large as that in AMD’s current Zen 3 microarchitecture.
That huge reorder buffer also supports the high degree of instruction-level parallelism in the M1 family’s high-performance cores, which is enabled by extremely wide instruction execution and extremely wide instruction decoding.
While wide instruction decoding is certainly possible on x86-64 and other architectures, scaling wide instruction-issue designs in a low power budget is generally accepted to be a very challenging chip design problem.
The theory goes that arm64’s fixed instruction length and relatively simple instructions make implementing extremely wide decoding and execution far more practical for Apple, compared with what Intel and AMD have to do in order to decode x86-64’s variable length, often complex compound instructions.</p>
<p>So what does any of the above have to do with ray tracing?
One concrete application has to do with opacity mapping in a ray tracing renderer.
Opacity maps are used to produce finer geometric detail on surfaces by using a texture map to specify whether a part of a given surface should actually exist or not.
Implementing opacity mapping in a ray tracer creates a surprisingly large number of design considerations that need to be solved for.
For example, texture lookups are usually done as part of a renderer’s shading system, which in a ray tracer only runs after ray intersection has been carried out.
However, evaluating whether a given hit point against a surface should be ignored <em>after</em> exiting the entire ray traversal system leads to massive inefficiencies due to the need to potentially re-enter the entire ray traversal system from scratch again.
As an example: imagine a tree where all of the leaves are modeled as rectangular cards, and the shape of each leaf is produced using an opacity map on each card.
If the renderer wants to test if a ray hits any part of the tree, and the renderer is architected such that opacity map lookups only happen in the shading system, then the renderer may need to cycle back and forth between the traversal and shading systems for every leaf encountered in a straight line path through the tree (and trees have a lot of leaves!).
An alternative way to handle opacity hits is to allow for direct texture map lookups or to evaluate opacity procedurally from within the traversal system itself, such that the renderer can immediately decide whether to accept a hit or not without having to exit out and run the shading system; this approach is what most renderers use and is what ray tracing libraries like Embree and OptiX largely expect.
However, this method produces a different problem: tight inner loop ray traversal code is now potentially dependent on slow texture fetches from memory!
Both of these approaches to implementing opacity mapping have downsides and potential performance impacts, which is why oftentimes just modeling detail into geometry instead of using opacity mapping can actually result in <em>faster</em> ray tracing performance, despite the heavier geometry memory footprint.
However, opacity mapping is often a lot easier to set up compared with modeling detail into geometry, and this is where a deep out-of-order buffer coupled with good branch prediction can make a big difference in ray tracing performance; these two tools combined can allow the processor to proceed with a certain amount of ray traversal work without having to wait for opacity map decisions.
Problems similar to this, coupled with the lack of out-of-order and speculative execution on GPUs, play a large role in why GPU ray tracing renderers often have to be architected fairly differently from CPU ray tracing renderers, but that’s a topic for another day.</p>
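<p>To make the tradeoff described above a bit more concrete, below is a minimal, self-contained C++ sketch contrasting the two strategies. This is not Takua Renderer’s actual code; every name in it (Ray, Hit, traceNextHit(), opacityExistsAt()) is a hypothetical placeholder, and the traversal and texture lookups are stubbed out. The first function shows the “re-enter traversal from the shading system” approach, while the second function’s signature shows the “filter hits inside traversal” approach that libraries in the style of Embree and OptiX expose:</p>

```cpp
// Minimal sketch (not Takua's actual implementation) of the two opacity-
// mapping strategies discussed above; all types and functions are placeholders.
#include <optional>

struct Ray { float tMin = 0.0f; float tMax = 1e30f; /* origin, direction, ... */ };
struct Hit { float t = 0.0f; int primId = -1; bool hasOpacityMap = false; };

// Stub for the low-level BVH traversal: returns the next candidate hit in
// [ray.tMin, ray.tMax], or nothing if the ray escapes the scene.
std::optional<Hit> traceNextHit(const Ray& ray) { return std::nullopt; }

// Stub for the texture lookup that decides whether the surface actually
// exists at the hit point; in a real renderer this reads an opacity map.
bool opacityExistsAt(const Hit& hit) { return true; }

// Approach 1: opacity is evaluated only after traversal returns, i.e. in the
// shading system. Every rejected hit forces traversal to be re-entered from
// scratch just past the previous hit point, which gets expensive for
// something like a tree full of opacity-mapped leaf cards.
std::optional<Hit> traceWithShadingSideOpacity(Ray ray) {
    while (std::optional<Hit> hit = traceNextHit(ray)) {
        if (!hit->hasOpacityMap || opacityExistsAt(*hit)) {
            return hit;                // accepted hit: hand off to shading
        }
        ray.tMin = hit->t + 1e-4f;     // rejected hit: restart traversal past it
    }
    return std::nullopt;
}

// Approach 2: the opacity test runs inside traversal itself (the pattern that
// Embree/OptiX-style intersection filters enable), so rejected hits never
// leave the inner traversal loop; the tradeoff is that the innermost loop is
// now dependent on potentially slow texture fetches from memory.
template <typename AcceptHitFn>
std::optional<Hit> traceWithInTraversalOpacity(const Ray& ray, AcceptHitFn acceptHit);

int main() {
    Ray ray;
    std::optional<Hit> hit = traceWithShadingSideOpacity(ray);
    return hit ? 0 : 1;  // returns 1 with the stub traversal (no geometry)
}
```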
<p>I give the specific example above because it turns out that the M1 Max’s deep reordering capabilities seem to make a fairly noticeable difference in my Takua Renderer’s performance when opacity maps are used extensively!
In the following rendered image, the ferns have an extremely detailed, complex appearance that depends heavily on opacity maps to cut out leaf shapes from simple underlying geometry.
In this case, I found that the slowdown introduced by using opacity maps in a render on the M1 Max is proportionally much lower than the slowdown introduced when using opacity maps in a render on the x86-64 machines that I tested.
Of course, I have no way of knowing whether the above theory for why the M1 Max seems to handle renders that use opacity maps better is correct, but either way, the end results look very nice and render faster than on any other computer that I have!
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/forest.cam3.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/forest.cam3.0.jpg" alt="Figure 4: Detailed close-up of a fern in the forest scene. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version." /></a></p>
<p>In terms of whether the M1 Pro or the M1 Max is better for CPU rendering, I only have the M1 Max to test, but my guess is that there shouldn’t actually be too large of a difference as long as the scene fits in memory.
However, the above guess comes with a major caveat revolving around memory bandwidth.
Where the M1 Pro and M1 Max differ is in the maximum number of GPU cores and maximum amount of unified memory configurable; the M1 Pro can go up to 16 GPU cores and 32 GB of RAM, while the M1 Max can go up to 32 GPU cores and 64 GB of RAM.
Outside of the GPU and maximum amount of memory, the M1 Pro and M1 Max chips actually share identical CPU configurations: both of them have a 10-core arm64 CPU with 8 high-performance cores and 2 energy-efficient cores, implementing a custom in-house Apple-designed microarchitecture.
However, for some workloads, I would not be surprised if the M1 Max is actually slightly faster, since the M1 Max also has twice the memory bandwidth of the M1 Pro (400 GB/s on the M1 Max versus 200 GB/s on the M1 Pro); this difference comes from the M1 Max having twice the number of memory controllers.
While consumer systems such as game consoles and desktop GPUs often do ship with memory bandwidth numbers comparable to or even better than the M1 Max’s 400 GB/s, seeing these levels of memory bandwidth in even workstation CPUs is relatively unheard of.
For example, AMD’s monster flagship Ryzen Threadripper 3990X is currently the most powerful high-end desktop CPU on the planet (outside of server processors), but the 3990X’s maximum memory bandwidth tops out at 95.37 GiB/s, or about 102.4 GB/s; seeing the M1 Max MacBook Pro ship with almost four times the memory bandwidth of the Threadripper 3990X is pretty wild.
The M1 Max also has twice the amount of system-level cache as the M1 Pro; on the M1 family of chips, the system-level cache is loosely analogous to L3 cache on other processors, but serves the entire system instead of just the CPU cores.</p>
<p>Production-grade CPU ray tracing is a process that depends heavily on being able to pin fast CPU cores at close to 100% utilization for long periods of time, while accessing extremely large datasets from system memory.
In an ideal world, intensive computational tasks should be structured in such a way that data can be pulled from memory in a relatively coherent, predictable manner, allowing the CPU cores to rely on data in cache over fetching from main memory as much as possible.
Unfortunately, making ray tracing coherent enough to utilize cache well is an extremely challenging problem.
Operations such as BVH traversal, which finds the closest point in a scene that a ray intersects, essentially represent an arbitrarily random walk through potentially vast amounts of geometry stored in memory, and any kind of incoherent walk through memory makes overall CPU performance dependent on memory performance.
As a result, operations like BVH traversal tend to be heavily bottlenecked by memory latency and memory bandwidth.
I expect that the M1 Max’s strong memory bandwidth numbers should provide some performance boost for rendering compared to the M1 Pro.
A complicating factor, however, is <a href="https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2">how the additional memory bandwidth on the M1 Max is utilized</a>; not all of it is available to just the CPU, since the M1 Max’s unified memory needs to also serve the system’s GPU, neural processing systems, and other custom onboard logic blocks.
The actual real-world impact should be easily testable by rendering the same scene on a M1 Pro and a M1 Max chip both with 32 GB of RAM, but in the week that I’ve had to test the M1 Max so far, I haven’t had the time or ability to be able to carry out this test on my own.
Stay tuned; I’ll update this post if I am able to try this test soon!</p>
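<p>As a quick aside for readers who haven’t written a renderer before, below is a minimal sketch of the kind of BVH walk described a few sentences ago. This is not Takua Renderer’s actual traversal code; the node layout, stack size, and intersection tests are all hypothetical stand-ins. The thing to notice is that each loop iteration fetches a node whose location depends on the contents of the previously fetched node, which is exactly the kind of data-dependent, incoherent access pattern that makes memory latency and memory bandwidth matter so much:</p>

```cpp
// Minimal sketch (not Takua's actual code) of iterative closest-hit BVH
// traversal with an explicit stack; every name here is a placeholder.
#include <cstdint>
#include <vector>

// Each node stores an axis-aligned bounding box plus either child indices
// (inner node) or a primitive range (leaf).
struct BVHNode {
    float boundsMin[3] = {0.0f, 0.0f, 0.0f};
    float boundsMax[3] = {0.0f, 0.0f, 0.0f};
    int32_t leftChild = -1;   // -1 marks a leaf
    int32_t rightChild = -1;
    int32_t primStart = 0;    // leaf-only: primitives to intersect
    int32_t primCount = 0;
};

struct Ray { float org[3]; float dir[3]; float tMax; };

// Placeholder tests; a real renderer uses vectorized slab tests and actual
// primitive intersections here.
bool intersectAABB(const BVHNode& node, const Ray& ray) { return true; }
void intersectLeafPrimitives(const BVHNode& node, Ray& ray) {}

void traverse(const std::vector<BVHNode>& nodes, Ray& ray) {
    if (nodes.empty()) { return; }
    int32_t stack[128];                                  // assumes bounded tree depth
    int stackTop = 0;
    stack[stackTop++] = 0;                               // start at the root
    while (stackTop > 0) {
        const BVHNode& node = nodes[stack[--stackTop]];  // data-dependent fetch: on a
        if (!intersectAABB(node, ray)) { continue; }     // huge, deep BVH this is often
        if (node.leftChild < 0) {                        // a cache miss out to memory
            intersectLeafPrimitives(node, ray);          // leaf: test primitives
        } else {
            stack[stackTop++] = node.leftChild;          // inner node: push children
            stack[stackTop++] = node.rightChild;
        }
    }
}

int main() {
    std::vector<BVHNode> nodes(1);  // trivial one-leaf "scene"
    Ray ray = {{0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 1.0f}, 1e30f};
    traverse(nodes, ray);
    return 0;
}
```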
<p>I’m very curious to see if the increased memory bandwidth on the M1 Max will make a difference over the M1 Pro on this forest scene in particular, due to how dense some of the geometry is and therefore how deep some of the BVHs have to go.
For example, every single pine needle in this next image is individually modeled geometry, and every tree trunk has sub-pixel-level tessellation and displacement; being able to render this image on a MacBook Pro instead of a giant workstation is incredible:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/forest.cam1.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/forest.cam1.0.jpg" alt="Figure 5: Forest canopy made up of pine trees, with every pine needle modeled as geometry. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version." /></a></p>
<p>In the previous posts about running Takua Renderer on arm64 processors, I included performance testing results across a wide variety of machines ranging from the Raspberry Pi 4B to the M1 Mac Mini all the way up to my dual Intel Xeon E5-2680 workstation.
However, all of those tests weren’t necessarily indicative of what real world rendering performance on huge scenes would be like, since all of those tests had to use scenes that were small enough to fit in to a M1 Mac Mini’s 16 GB memory footprint.
Now that I have access to a M1 Max MacBook Pro with 64 GB of memory, I can present some initial performance comparisons with larger machines rendering my forest scene.
I think these results are likely more indicative of what real-world production rendering performance looks like, since the forest scene is the closest thing I have to true production complexity (I haven’t ported Disney’s Moana Island data set to work in my renderer yet).
<p>The machines I tested this time are a 2021 14-inch MacBook Pro with an Apple M1 Max chip with 10 cores (8 performance, 2 efficiency) and 10 threads, a 2019 16-inch MacBook Pro with an Intel Core i7-9750H CPU with 6 cores and 12 threads, a 2019 Mac Pro with an Intel Xeon W-3245 CPU with 16 cores and 32 threads, and a Linux workstation with dual Intel Xeon E5-2680 CPUs with 8 cores and 16 threads per CPU for a total of 16 cores and 32 threads.
The Xeon E5-2680 workstation is, quite frankly, ancient, and makes for something of a strange comparison point, but it’s the main workstation that I use for personal rendering projects at the moment, so I included it.
I don’t exactly have piles of the latest server and workstation chips just laying around my house, so I had to work with what I got!
However, I was also able to borrow access to a Windows workstation with an AMD Threadripper 3990X CPU, which weighs in with 64 cores and 128 threads.
I figured that the Threadripper 3990X system is not at all a fair comparison point for the exact opposite reason why the Xeon E5-2680 is not a fair comparison point, but I thought I’d throw it in anyway out of sheer curiosity.
Notably, the regular Apple M1 chip does not make an appearance in these tests, since the forest scene doesn’t fit in memory on the M1.
I also borrowed a friend’s Razer Blade 15 to test, but wound up not using it since I discovered that it has the same Intel Core i7-9750H CPU as the 2019 16-inch MacBook Pro, but only has half the memory and therefore can’t fit the scene.</p>
<p>In the case of the two MacBook Pros, I did all tests twice: once with the laptops plugged in, and once with the laptops running entirely on battery power.
I wanted to compare plugged-in versus battery performance because of Apple’s claim that the new M1 Pro/Max based MacBook Pros perform the same whether plugged-in or on battery.
This claim is actually a huge deal; laptops traditionally have had to throttle down CPU performance when unplugged to conserve battery life, but the energy efficiency of Apple Silicon allows Apple to no longer have to do this on M1-family laptops.
I wanted to verify this claim for myself!</p>
<div id="results"></div>
<p>In the results below, I present three tests using the forest scene.
The first test measures how long Takua Renderer takes to run subdivision, tessellation, and displacement, which has to happen before any pixels can actually be rendered.
The subdivision/tessellation/displacement process has an interesting performance profile that looks very different from the performance profile of the main path tracing process.
Subdivision within a single mesh is not easily parallelizable, and even with a parallel implementation, scales very poorly beyond just a few threads.
Takua Renderer attempts to scale subdivision widely by running subdivision on multiple meshes in parallel, with each mesh’s subdivision task only receiving an allocation of at most four threads.
As a result, the subdivision step actually benefits slightly more from single-threaded performance over a larger number of cores and greater multi-threaded performance.
The second test is rendering the main view of the forest scene from my mipmapping blog post, at 1920x1080 resolution.
I chose to use 1920x1080 resolution since this is a more common maximum resolution to be working at during artistic iteration.
The third test is rendering the fern view of the forest scene from Figure 4 of this post, at final 4K 3840x2160 resolution.
For both of the main rendering tests, I only ran the renderer for 8 samples per pixel, since I didn’t want to sit around for days to collect all of the data.
For each test, I did five runs, discarded the highest and lowest results, and averaged the remaining three results to get the numbers below.
Wall time (as in a clock on a wall) measures the actual amount of real-world time that each test took, while core-seconds is an approximation of how long each test would have taken running on a single core.
So, wall time can be thought of as a measure of total computation <em>power</em>, whereas core-seconds is more a measure of computational <em>efficiency</em>; in both cases, lower numbers are better:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">Forest Subdivision/Displacement</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Apple M1 Max (Plugged in):</td>
<td style="text-align: center">128 s</td>
<td style="text-align: left">approx 1280 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Max (Battery):</td>
<td style="text-align: center">128 s</td>
<td style="text-align: left">approx 1280 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Plugged in):</td>
<td style="text-align: center">289 s</td>
<td style="text-align: left">approx 3468 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Battery):</td>
<td style="text-align: center">307 s</td>
<td style="text-align: left">approx 3684 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon W-3245:</td>
<td style="text-align: center">179 s</td>
<td style="text-align: left">approx 5728 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">222 s</td>
<td style="text-align: left">approx 7104 s</td>
</tr>
<tr>
<td style="text-align: right">AMD Threadripper 3990X:</td>
<td style="text-align: center">146 s</td>
<td style="text-align: left">approx 18688 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">Forest Rendering (Main Camera)</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, 8 spp, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Apple M1 Max (Plugged in):</td>
<td style="text-align: center">127.143 s</td>
<td style="text-align: left">approx 1271.4 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Max (Battery):</td>
<td style="text-align: center">126.421 s</td>
<td style="text-align: left">approx 1264.2 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Plugged in):</td>
<td style="text-align: center">288.089 s</td>
<td style="text-align: left">approx 3457.1 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Battery):</td>
<td style="text-align: center">347.898 s</td>
<td style="text-align: left">approx 4174.8 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon W-3245:</td>
<td style="text-align: center">106.332 s</td>
<td style="text-align: left">approx 3402.6 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">158.255 s</td>
<td style="text-align: left">approx 5064.2 s</td>
</tr>
<tr>
<td style="text-align: right">AMD Threadripper 3990X:</td>
<td style="text-align: center">38.887 s</td>
<td style="text-align: left">approx 4977.5 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">Forest Rendering (Fern Camera)</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">3840x2160, 8 spp, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Apple M1 Max (Plugged in):</td>
<td style="text-align: center">478.247 s</td>
<td style="text-align: left">approx 4782.5 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Max (Battery):</td>
<td style="text-align: center">496.384 s</td>
<td style="text-align: left">approx 4963.8 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Plugged in):</td>
<td style="text-align: center">1084.504 s</td>
<td style="text-align: left">approx 13014.0 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H (Battery):</td>
<td style="text-align: center">1219.59 s</td>
<td style="text-align: left">approx 14635.1 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon W-3245:</td>
<td style="text-align: center">345.292 s</td>
<td style="text-align: left">approx 11049.3 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">576.279 s</td>
<td style="text-align: left">approx 18440.9 s</td>
</tr>
<tr>
<td style="text-align: right">AMD Threadripper 3990X:</td>
<td style="text-align: center">108.2596 s</td>
<td style="text-align: left">approx 13857.2 s</td>
</tr>
</tbody>
</table>
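<p>For anyone curious about the exact bookkeeping behind the numbers above, here is a small sketch of the kind of post-processing involved; this is not the actual test harness, and the sample timings in it are made up. It simply sorts the five wall-time samples, discards the fastest and slowest, averages the remaining three, and approximates core-seconds as wall time multiplied by the number of hardware threads used, which is consistent with the values in the tables above:</p>

```cpp
// Small sketch (not the actual test harness) of the benchmark bookkeeping
// described earlier; the run times below are made-up example numbers.
#include <algorithm>
#include <cstdio>
#include <vector>

// Drop the fastest and slowest runs and average the rest.
double trimmedMean(std::vector<double> runsInSeconds) {
    std::sort(runsInSeconds.begin(), runsInSeconds.end());
    double sum = 0.0;
    for (size_t i = 1; i + 1 < runsInSeconds.size(); ++i) {
        sum += runsInSeconds[i];
    }
    return sum / static_cast<double>(runsInSeconds.size() - 2);
}

int main() {
    // Hypothetical wall-time samples from five runs on a 10-thread processor.
    std::vector<double> runs = {128.9, 127.0, 126.8, 127.6, 129.4};
    const int hardwareThreads = 10;
    double wallTime = trimmedMean(runs);
    double coreSeconds = wallTime * hardwareThreads;
    std::printf("wall time: %.3f s, core-seconds: approx %.1f s\n", wallTime, coreSeconds);
    return 0;
}
```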
<p>When rendering the main camera view, the 2021 14-inch MacBook Pro used on average about 7% of its battery charge, while the 2019 16-inch MacBook Pro used on average about 39% of its battery charge.
When rendering the fern view, the 2021 14-inch MacBook Pro used on average about 19% of its battery charge, while the 2019 16-inch MacBook Pro used on average about 48% of its battery charge.
Overall by every metric, the 2021 14-inch MacBook Pro achieves an astounding victory over the 2019 16-inch MacBook Pro: a little over twice the performance for a fraction of the total power consumption.
The 2021 14-inch MacBook Pro also lives up to Apple’s claim of identical performance plugged in and on battery power, whereas in the results above, the 2019 16-inch MacBook Pro suffers anywhere from roughly a 5% to a 20% performance hit just from switching to battery power.
The 2021 14-inch MacBook Pro’s performance win is even more astonishing when considering that the 2019 16-inch MacBook Pro is the previous flagship that the new M1 Pro/Max MacBook Pros are the direct successors to.
Seeing this kind of jump in a single hardware generation is unheard of in modern tech and represents a massive win for both Apple and for the arm64 ISA.
The M1 Max also handily beats the old dual Intel Xeon E5-2680 that I am currently using by a comfortable margin; for my personal workflow, this means that I can now do everything that I previously needed a large loud power-hungry workstation for on the 2021 14-inch MacBook Pro, and I can do everything <em>faster</em> on the 2021 14-inch MacBook Pro too.</p>
<p>The real surprises to me came with the 2019 Mac Pro and the Threadripper 3990X workstation.
In both of those cases, I expected the M1 Max to lose, but the 2021 14-inch MacBook Pro came surprisingly close to the 2019 Mac Pro’s performance in terms of wall time.
Even more importantly as a predictor of future scalability, the M1 Max’s efficiency as measured by core-seconds is far, far superior to that of both the Intel Xeon W-3245 and the AMD Threadripper 3990X.
Imagining what a hypothetical future Apple Silicon iMac or Mac Pro could do with an even more scaled-up M1 variant, or perhaps some kind of multi-M1 Max chiplet or multi-socket solution, is extremely exciting!
I think that with the upcoming Apple Silicon based large iMac and Mac Pro, Apple has a real shot at beating both Intel and AMD’s highest end CPUs to win the absolute workstation performance crown.</p>
<p>Of course, what makes the M1 Max’s performance numbers possible is the M1 Max’s energy efficiency; this kind of performance-per-watt is simply unparalleled in the desktop (meaning non-mobile, not desktop form factor) processor world.
The M1 architecture’s energy efficiency is what allows Apple to scale the design out into the M1 Pro and M1 Max and hopefully beyond.
Below is a breakdown of energy utilization for each of the rendering tests above; the total energy used for each render is the wall clock render time multiplied by the maximum TDP of each processor to get watt-seconds, which is then translated to watt-hours.
I assume maximum TDP for each processor since I ran Takua Renderer with processor utilization set to 100%.
For the two MacBook Pros, I’m just reporting the plugged-in results.</p>
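<p>As a concrete example of this calculation: the M1 Max rendered the main camera view in 127.143 seconds at an assumed whole-system maximum TDP of 60 watts, which works out to 127.143 s × 60 W ≈ 7629 watt-seconds, or about 2.12 watt-hours, which is the first entry in the table below.</p>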
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">Forest Rendering (Main Camera)</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, 8 spp, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Apple M1 Max:</td>
<td style="text-align: center">60 W</td>
<td style="text-align: left">2.1191 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">3.6011 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon W-3245:</td>
<td style="text-align: center">205 W</td>
<td style="text-align: left">6.0550 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">11.4295 Wh</td>
</tr>
<tr>
<td style="text-align: right">AMD Threadripper 3990X:</td>
<td style="text-align: center">280 W</td>
<td style="text-align: left">3.0246 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">Forest Rendering (Fern Camera)</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">3840x2160, 8 spp, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Apple M1 Max:</td>
<td style="text-align: center">60 W</td>
<td style="text-align: left">7.9708 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">13.5563 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon W-3245:</td>
<td style="text-align: center">205 W</td>
<td style="text-align: left">19.6625 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">41.6202 Wh</td>
</tr>
<tr>
<td style="text-align: right">AMD Threadripper 3990X:</td>
<td style="text-align: center">280 W</td>
<td style="text-align: left">8.4202 Wh</td>
</tr>
</tbody>
</table>
<p>At least for my rendering use case, the Apple M1 Max is easily the most energy efficient processor, even without taking into account that the 60 W TDP of the M1 Max is for the entire system-on-a-chip including CPU, GPU, and more, while the TDPs for all of the other processors are <em>just</em> for a CPU and don’t take into account the rest of the system.
The M1 Max manages to beat the 2019 16-inch MacBook Pro’s Intel Core i7-9750H in absolute performance by a factor of two while using between half and two-thirds of the energy, and the M1 Max comes close to matching the 2019 Mac Pro’s absolute performance while using about a third of the energy.
Of course the comparison with the Intel Xeon E5-2680 workstation isn’t exactly fair since the M1 Max is manufactured using a 5 nm process while the ancient Intel Xeon E5-2680s were manufactured on a 32 nm process a decade ago, but I think the comparison still underscores just how far processors have advanced over the past decade leading up to the M1 Max.
The only processor that really comes near the M1 Max in terms of energy efficiency is the AMD Threadripper 3990X, which makes sense since the AMD Threadripper 3990X and the M1 Max are the closest cousins in this list in terms of manufacturing process; both are using leading-edge TSMC photolithography.
However, on the whole, the M1 Max is still more efficient than the AMD Threadripper 3990X, and again, the AMD Threadripper 3990X TDP is for just a CPU, not an entire SoC!
Assuming near-linear scaling, a hypothetical M1-derived variant that is scaled up 4.5 times to a 270 W TDP should be able to handily defeat the AMD Threadripper 3990X in absolute performance.</p>
<p>The wider takeaway here, though, is that in order to give the M1 Max some real competition, one has to skip laptop chips entirely and reach past high-end desktop chips to server-class workstation hardware.
For workloads that push the CPU to maximum utilization for sustained periods of time, such as production-quality path traced rendering, the M1 Max represents a fundamental shift in what is possible in a laptop form factor.
Something even more exciting to think about is how the M1 Max really is the <em>middle</em> tier Apple Silicon solution; presumably the large iMac and Mac Pro will push things into even more absurd territory.</p>
<p>So those are my initial thoughts on the Apple M1 Max chip and my initial experiences with getting my hobby renderer up and running on the 2021 14-inch MacBook Pro.
I’m extremely impressed, and not just with the chip!
This post mostly focused on the chip itself, but the rest of the 2021 MacBook Pro lineup is just as impressive.
For rendering professionals and enthusiasts alike, one aspect of the 2021 MacBook Pros that will likely be just as important as the processor is the incredible screen.
The 2021 MacBook Pros ship with what I believe is an industry first: a mini-LED backlit 120 Hz display with an extended dynamic range that can go up to 1600 nits peak brightness.
The screen is absolutely gorgeous, which is a must for anyone who spends their time generating pixels with a 3D renderer!
One thing on my to-do list was to add extended dynamic range support to <a href="https://tom94.net">Thomas Müller</a>’s excellent <a href="https://github.com/Tom94/tev">tev image viewer</a>, which is a popular tool in the rendering research community.
However, it turns out that Thomas already added extended dynamic range support, and it looks amazing on the 2021 MacBook Pro’s XDR display.</p>
<p>In this post I didn’t go into the M1 Max’s GPU at all, even though the GPU in many ways might actually be even more interesting than the CPU (which is saying a lot considering how interesting the CPU is).
On paper at least, the M1 Max’s GPU aims for roughly mobile NVIDIA GeForce RTX 3070 performance, but how the M1 Max and a mobile NVIDIA GeForce RTX 3070 will actually compare for ray traced rendering is difficult to say without conducting some tests.
On one hand, the M1 Max’s unified memory architecture gives its GPU access to far more memory than any NVIDIA mobile GPU, and the unified memory architecture opens up a wide variety of interesting optimizations that are otherwise difficult to do when managing separate pools of CPU and GPU memory.
On the other hand though, the M1 Max’s GPU lacks the dedicated hardware ray tracing acceleration that modern NVIDIA and AMD GPUs and the upcoming Intel discrete GPUs all have, and in my experience so far, dedicated hardware ray tracing acceleration makes a huge difference in GPU ray tracing performance.
Maybe Apple will add hardware ray tracing acceleration in the future; Metal already has software ray tracing APIs, and there already is a precedent for Apple Silicon including dedicated hardware for accelerating relatively niche, specific professional workflows.
As an example, the M1 Pro and M1 Max include hardware ProRes acceleration for high-end video editing.
Over the next year, I am undertaking a large-scale effort to port the entirety of Takua Renderer to work on GPUs through CUDA on NVIDIA GPUs, and through Metal on Apple Silicon devices.
Even though I’ve just gotten started on this project, I’ve already learned a lot of interesting things comparing CUDA and Metal compute; I’ll have much more to say on the topic hopefully soon!</p>
<p>Beyond the CPU and GPU and screen, there are still even more other nice features that the new MacBook Pros have for professional workflows like high-end rendering, but I’ll skip going through them in this post since I’m sure they’ll be thoroughly covered by all of the various actual tech reviewers out on the internet.</p>
<p>To conclude for now, here are two more bonus images that I rendered on the M1 Max 14-inch MacBook Pro.
I originally planned on just rendering the earlier three images in this post, but to my surprise, I found that I had enough time to do a few more!
I think that kind of encapsulates the M1 Pro and M1 Max MacBook Pros in a nutshell: I expected incredible performance, but was surprised to find even my high expectations met and surpassed.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/forest.cam4.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/forest.cam4.0.jpg" alt="Figure 6: A mossy log, ferns, and debris on the forest floor. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/forest.cam5.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Oct/takua-on-m1-max/preview/forest.cam5.0.jpg" alt="Figure 7: Sunlight transmitting through pine leaves in the forest canopy. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version." /></a></p>
<p>A huge thanks to everyone at Apple that made this post possible!
Also a big thanks to Rajesh Sharma and Mark Lee for catching typos and making some good suggestions.</p>
https://blog.yiningkarlli.com/2021/09/neon-vs-sse.html
Comparing SIMD on x86-64 and arm64
2021-09-07T00:00:00+00:00
2021-09-07T00:00:00+00:00
Yining Karl Li
<p>I recently wrote a big two-part series about a ton of things that I learned throughout the process of porting my hobby renderer, Takua Renderer, to 64-bit ARM.
In the <a href="https://blog.yiningkarlli.com/2021/07/porting-takua-to-arm-pt2.html">second part</a>, one of the topics I covered was how the Embree ray tracing kernels library gained arm64 support by utilizing the sse2neon project to emulate x86-64 SSE2 SIMD instructions using arm64’s Neon SIMD instructions.
In the second part of the series, I had originally planned on diving much deeper into comparing writing vectorized code using SSE intrinsics versus using Neon intrinsics versus other approaches, but the comparison write-up became so large that I wound up leaving it out of the original post with the intention of making the comparison into its own standalone post.
This post is that standalone comparison!</p>
<p>As discussed in my porting to arm64 series, a huge proportion of the raw compute power in modern CPUs is located in vector <a href="https://en.wikipedia.org/wiki/SIMD">SIMD instruction set extensions</a>, and lots of things in computer graphics happen to be workload types that fit vectorization very well.
Long-time readers of this blog probably already know what SIMD instructions do, but for the unfamiliar, here’s a very brief summary.
SIMD stands for <em>single instruction, multiple data</em>, and is a type of parallel programming that exploits <em>data level parallelism</em> instead of concurrency.
What the above means is that, unlike multithreading, in which multiple different streams of instructions simultaneously execute on different cores over different pieces of data, in a SIMD program, a single instruction stream executes simultaneously over different pieces of data.
For example, a 4-wide SIMD multiplication instruction would simultaneously execute a single multiply instruction over four pairs of numbers; each pair is multiplied together at the same time as the other pairs.
SIMD processing makes processors more powerful by allowing the processor to process more data within the same clock cycle; many modern CPUs implement SIMD extensions to their base scalar instruction sets, and modern GPUs are at a very high level broadly similar to huge ultra-wide SIMD processors.</p>
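<p>As a small concrete illustration of the 4-wide multiply example above, here’s a minimal sketch comparing a plain scalar loop with the equivalent single SSE multiply; the Neon version would look the same but using vld1q_f32, vmulq_f32, and vst1q_f32 instead (much more on intrinsics later in this post):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <xmmintrin.h> // SSE intrinsics (x86-64 only)

// Scalar: four separate multiplies, processing one pair of floats at a time.
void multiply4Scalar(const float* a, const float* b, float* out) {
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] * b[i];
    }
}

// SIMD: a single 4-wide multiply instruction processes all four pairs at once.
void multiply4SSE(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));
}
</code></pre></div></div>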
<p>Multiple approaches exist today for writing vectorized code.
The four main ways available today are: directly write code using SIMD assembly instructions, write code using compiler-provided vector intrinsics, write normal scalar code and rely on compiler auto-vectorization to emit vectorized assembly, or write code using ISPC: the Intel SPMD Program Compiler.
Choosing which approach to use for a given project requires considering many different tradeoffs and factors, such as ease of programming, performance, and portability.
Since this post is looking at comparing SSE2 and Neon, portability is especially interesting here.
Auto-vectorization and ISPC are the most easily portable approaches, while vector intrinsics can be made portable using sse2neon, but each of these approaches requires different trade-offs in other areas.</p>
<p>In this post, I’ll compare vectorizing the same snippet of code using several different approaches.
On x86-64, I’ll compare implementations using SSE intrinsics, using auto-vectorization, and using ISPC emitting SSE assembly.
On arm64, I’ll compare implementations using Neon intrinsics, using SSE intrinsics emulated on arm64 using sse2neon, using auto-vectorization, and using ISPC emitting Neon assembly.
I’ll also evaluate how each approach does in balancing portability, ease-of-use, and performance.</p>
<p><strong>4-wide Ray Bounding Box Intersection</strong></p>
<p>For my comparisons, I wanted to use a small but practical real-world example.
I wanted something small since I wanted to be able to look at the assembly output directly, and keeping things smaller makes the assembly output easier to read all at once.
However, I also wanted something real-world to make sure that whatever I learned wasn’t just the result of a contrived artificial example.
The comparison that I picked is a common operation inside of ray tracing: 4-wide ray bounding box intersection.
By 4-wide, I mean intersecting the same ray against four bounding boxes at the same time.
Ray bounding box intersection tests are a fundamental operation in BVH traversal, and typically account for a large proportion (often a majority) of the computational cost in ray intersection against the scene.
Before we dive into code, here’s some background on BVH traversal and the role that 4-wide ray bounding box intersection plays in modern ray tracing implementations.</p>
<p>Acceleration structures are a critical component of ray tracing; tree-based acceleration structures convert tracing a ray against a scene from being a O(<em>N</em>) problem into a O(<em>log(N)</em>) problem, where <em>N</em> is the number of objects that are in the scene.
For scenes with lots of objects and for objects made up of lots of primitives, lowering the worst-case complexity of ray intersection from linear to logarithmic is what makes the difference between ray tracing being impractical and practical.
From roughly the late 90s through to the early 2010s, a number of different groups across the graphics field put an enormous amount of research and effort into establishing the best possible acceleration structures.
Early on, the broad general consensus was that KD-trees were the most efficient acceleration structure for ray intersection performance, while BVHs were known to be faster to build than KD-trees but less performant at actual ray intersection.
However, advancements over time improved BVH ray intersection performance <a href="https://doi.org/10.1145/1572769.1572771">[Stich et al. 2009]</a> to the point where today, BVHs are now the dominant acceleration structure used in pretty much every production ray tracing solution.
For a history and detailed survey of BVH research over the past twenty-odd years, please refer to Meister et al. <a href="https://doi.org/10.1111/cgf.142662">[2021]</a>.
One interesting thing to note when looking through the past twenty years of ray tracing acceleration research are the author names; many of these authors are the same people that went on to create the modern underpinnings of Embree, Optix, and the ray acceleration hardware found in NVIDIA’s RTX GPUs.</p>
<p>A BVH is a tree structure where bounding boxes are placed over all of the objects that need to be intersected, and then these bounding boxes are grouped into (hopefully) spatially local groups.
Each group is then enclosed in another bounding box, and these boxes are grouped again, and so on and so forth until a top-level bounding box is reached that contains everything below.
In university courses, BVHs are traditionally taught as being binary trees, meaning that each node within the tree structure bounds two children nodes.
Binary BVHs are the simplest possible BVH to build and implement, hence why they’re usually the standard version taught in schools.
However, the actual branching factor at each BVH node doesn’t have to be binary; the branching factor can be any integer greater than 2.
BVHs with 4-wide and even 8-wide branching factors have largely come to dominate production usage today.</p>
<p>The reason production BVHs today tend to have wide branching factors originates in the need to vectorize BVH traversal in order to utilize the maximum possible performance of SIMD-enabled CPUs.
Early attempts at vectorizing BVH traversal centered around tracing groups, or packets, of multiple rays through a BVH together; packet tracing allows for simultaneously intersecting N rays against a single bounding box at each node in the hierarchy <a href="https://doi.org/10.1111/1467-8659.00508">[Wald et al. 2001]</a>, where N is the vector width.
However, packet tracing only really works well for groups of rays that are all going in largely the same direction from largely the same origin; for incoherent rays, divergence in the traversal path each incoherent ray needs to take through the BVH destroys the efficacy of vectorized packet traversal.
To solve this problem, several papers concurrently proposed a different solution to vectorizing BVH traversal <a href="https://doi.org/10.1109/RT.2008.4634620">[Wald et al. 2008</a>, <a href="https://doi.org/10.1109/RT.2008.4634618">Ernst and Greiner 2008</a>, <a href="https://doi.org/10.1111/j.1467-8659.2008.01261.x">Dammertz et al. 2008]</a>: instead of simultaneously intersecting N rays against a single bounding box, this new solution simultaneously intersects a single ray against N bounding boxes.
Since the most common SIMD implementations are at least 4 lanes wide, BVH implementations that want to take maximum advantage of SIMD hardware also need to be able to present four bounding boxes at a time for vectorized ray intersection, hence the move from a splitting factor of 2 to a splitting factor of 4 or even wider.
In addition to being more performant when vectorized, a 4-wide splitting factor also tends to reduce the depth and therefore memory footprint of BVHs, and 4-wide BVHs have also been demonstrated to be able to outperform 2-wide BVHs even without vectorization <a href="https://psychopath.io/post/2017_08_03_bvh4_without_simd">[Vegdahl 2017]</a>.
Vectorized 4-wide BVH traversal can also be combined with the previous packet approach to yield even better performance for coherent rays <a href="https://doi.org/10.1145/1572769.1572793">[Tsakok 2009]</a>.</p>
<p>All of the above factors combined are why BVHs with wider branching factors are more popularly used today on the CPU; for example, the widely used Embree library <a href="https://doi.org/10.1145/2601097.2601199">[Wald et al. 2014]</a> offers 4-wide as the <em>minimum</em> supported split factor, and supports even wider split factors when vectorizing using wider AVX instructions.
On the GPU, the story is similar, although a little bit more complex since the GPU’s SIMT (as opposed to SIMD) parallelism model changes the relative importance of being able to simultaneously intersect one ray against multiple boxes.
GPU ray tracing systems today use a variety of different split factors; AMD’s RDNA2-based GPUs implement hardware acceleration for a 4-wide BVH <a href="https://gpuopen.com/rdna2-isa-available/">[AMD 2020]</a>.
NVIDIA does not publicly disclose what split factor their RTX GPUs assume in hardware, since their various APIs for accessing the ray tracing hardware are designed to allow for changing out for different, better future techniques under the hood without modification to client applications.
However, we can guess that support for multiple different splitting factors seems likely given that Optix 7 uses different splitting factors depending on whether an application wants to prioritize BVH construction speed or BVH traversal speed <a href="https://raytracing-docs.nvidia.com/optix7/guide/index.html">[NVIDIA 2021]</a>.
While not explicitly disclosed, as of writing, we can reasonably guess based off of what Optix 6.x implemented that Optix 7’s fast construction mode implements a TRBVH <a href="https://doi.org/10.1145/2492045.2492055">[Karras and Aila 2013]</a>, which is a binary BVH, and that Optix 7’s performance-optimized mode implements an 8-wide BVH with compression <a href="https://doi.org/10.1145/3105762.3105773">[Ylitie et al. 2017]</a>.</p>
<p>Since the most common splitting factor in production CPU cases is a 4-wide split, and since SSE and Neon are both 4-wide vector instruction sets, I think the core simultaneous single-ray-4-box intersection test is a perfect example case to look at!
To start off, we need an efficient intersection test between a single ray and a single axis-aligned bounding box.
I’ll be using the commonly utilized solution by Williams et al. <a href="https://doi.org/10.1080/2151237X.2005.10129188">[2005]</a>; improved techniques with better precision <a href="http://jcgt.org/published/0002/02/02/">[Ize 2013]</a> and more generalized flexibility <a href="http://jcgt.org/published/0007/03/04/">[Majercik 2018]</a> do exist, but I’ll stick with the original Williams approach in this post to keep things simple.</p>
<p><strong>Test Program Setup</strong></p>
<p>Everything in this post is implemented in a small test program that I have <a href="https://github.com/betajippity/sseneoncompare">put in an open Github repository</a>, licensed under the Apache-2.0 License.
Feel free to clone the repository for yourself to follow along using or to play with!
To build and run the test program yourself, you will need a version of <a href="https://cmake.org">CMake</a> that has ISPC support (so, CMake 3.19 or newer), a modern C++ compiler with support for C++17, and a version of <a href="https://ispc.github.io">ISPC</a> that supports Neon output for arm64 (so, ISPC v1.16.1 or newer); further instructions for building and running the test program are included in the repository’s README.md file.
The test program compiles and runs on both x86-64 and arm64; on each processor architecture, the appropriate implementations are automatically chosen for compilation.</p>
<p>The test program runs each single-ray-4-box intersection test implementation N times, where N is an integer that can be set by the user as the first input argument to the program.
By default, and for all results in this post, N is set to 100000 runs.
The four bounding boxes that the intersection tests run against are hardcoded into the test program’s main function and are reused for all N runs.
Since the bounding boxes are hardcoded, I had to take some care to make sure that the compiler wasn’t going to pull any optimization shenanigans and not actually run all N runs.
To make sure of the above, the test program is compiled in two separate pieces: all of the actual ray-bounding-box intersection functions are compiled into a static library using <code class="language-plaintext highlighter-rouge">-O3</code> optimization, and then the test program’s main function is compiled separately with all optimizations disabled, and then the intersection functions static library is linked in.</p>
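<p>For reference, here’s a rough sketch of what the timing loop described above looks like conceptually; this is not the actual harness code from the repository, and intersectFourBoxes() is just a made-up placeholder standing in for any one of the implementations compiled into the optimized static library:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <chrono>
#include <cstdio>

// Made-up placeholder standing in for one of the ray-box intersection
// implementations from the -O3 static library; not code from the repository.
void intersectFourBoxes() { /* ...ray-box intersection work... */ }

int main() {
    const int numRuns = 100000; // default run count used for all results in this post
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < numRuns; i++) {
        intersectFourBoxes();
    }
    auto end = std::chrono::steady_clock::now();
    double totalNs = std::chrono::duration<double, std::nano>(end - start).count();
    // Report the average time per run, which is what the results tables show.
    std::printf("average: %f ns per run\n", totalNs / numRuns);
    return 0;
}
</code></pre></div></div>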
<p>Ideally I would have liked to set up the project to compile directly to a Universal Binary on macOS, but unfortunately CMake’s built-in infrastructure for compiling multi-architecture binaries doesn’t really work with ISPC at the moment, and I was too lazy to manually set up custom CMake scripts to invoke ISPC multiple times (once for each target architecture) and call the macOS <code class="language-plaintext highlighter-rouge">lipo</code> tool; I just compiled and ran the test program separately on an x86-64 Mac and on an arm64 Mac.
However, on both the x86-64 and arm64 systems, I used the same operating system and compilers.
For all of the results in this post, I’m running on macOS 11.5.2 and I’m compiling using Apple Clang v12.0.5 (which comes with Xcode 12.5.1) for C++ code and ISPC v1.16.1 for ISPC code.</p>
<p>For the rest of the post, I’ll include results for each implementation in the section discussing that implementation, and then include all results together in a <a href="#results">results section</a> at the end.
All results were generated by running on a 2019 16 inch MacBook Pro with an Intel Core i7-9750H CPU for x86-64, and on a 2020 M1 Mac Mini for arm64 and Rosetta 2.
All results were generated by running the test program with 100000 runs per implementation, and I averaged results across 5 runs of the test program after throwing out the highest and lowest result for each implementation to discard outliers.
The timings reported for each implementation are the average across 100000 runs.</p>
<p><strong>Defining structs usable with both SSE and Neon</strong></p>
<p>Before we dive into the ray-box intersection implementations, I need to introduce and describe the handful of simple structs that the test program uses.
The most widely used struct in the test program is <code class="language-plaintext highlighter-rouge">FVec4</code>, which defines a 4-dimensional float vector by simply wrapping around four floats.
<code class="language-plaintext highlighter-rouge">FVec4</code> has one important trick: <code class="language-plaintext highlighter-rouge">FVec4</code> uses a union to accomplish type punning, which allows us to access the four floats in <code class="language-plaintext highlighter-rouge">FVec4</code> either as separate individual floats, or as a single <code class="language-plaintext highlighter-rouge">__m128</code> when using SSE or a single <code class="language-plaintext highlighter-rouge">float32x4_t</code> when using Neon.
<code class="language-plaintext highlighter-rouge">__m128</code> on SSE and <code class="language-plaintext highlighter-rouge">float32x4_t</code> on Neon serve the same purpose; since SSE and Neon use 128-bit wide registers with four 32-bit “lanes” per register, intrinsics implementations for SSE and Neon need a 128-bit data type that maps directly to the vector register when compiled.
The SSE intrinsics implementation defined in <code class="language-plaintext highlighter-rouge"><xmmintrin.h></code> uses <code class="language-plaintext highlighter-rouge">__m128</code> as its single generic 128-bit data type, whereas the Neon intrinsics implementation defined in <code class="language-plaintext highlighter-rouge"><arm_neon.h></code> defines separate 128-bit types depending on what is being stored.
For example, the Neon intrinsics implementation uses <code class="language-plaintext highlighter-rouge">float32x4_t</code> as its 128-bit data type for four 32-bit floats and <code class="language-plaintext highlighter-rouge">uint32x4_t</code> as its 128-bit data type for four 32-bit unsigned integers, and so on.
Each 32-bit component in a 128-bit vector register is often known as a <em>lane</em>.
The process of populating each of the lanes in a 128-bit vector type is sometimes referred to as a <em>gather</em> operation, and the process of pulling 32-bit values out of the 128-bit vector type is sometimes referred to as a <em>scatter</em> operation; the <code class="language-plaintext highlighter-rouge">FVec4</code> struct’s type punning makes gather and scatter operations nice and easy to do.</p>
<p>One of the comparisons that the test program does on arm64 machines is between an implementation using native Neon intrinsics, and an implementation written using SSE intrinsics that are emulated with Neon intrinsics under the hood on arm64 via the sse2neon project.
Since for this test program, SSE intrinsics are available on both x86-64 (natively) and on arm64 (through sse2neon), we don’t need to wrap the <code class="language-plaintext highlighter-rouge">__m128</code> member of the union in any <code class="language-plaintext highlighter-rouge">#ifdefs</code>.
We do need to <code class="language-plaintext highlighter-rouge">#ifdef</code> out the Neon implementation on x86-64 though, hence the check for <code class="language-plaintext highlighter-rouge">#if defined(__aarch64__)</code>.
Putting everything above all together, we can get a nice, convenient 4-dimensional float vector in which we can access each component individually and access the entire contents of the vector as a single intrinsics-friendly 128-bit data type on both SSE and Neon:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct FVec4 {
union { // Use union for type punning __m128 and float32x4_t
__m128 m128;
#if defined(__aarch64__)
float32x4_t f32x4;
#endif
struct {
float x;
float y;
float z;
float w;
};
float data[4];
};
FVec4() : x(0.0f), y(0.0f), z(0.0f), w(0.0f) {}
#if defined(__x86_64__)
FVec4(__m128 f4) : m128(f4) {}
#elif defined(__aarch64__)
FVec4(float32x4_t f4) : f32x4(f4) {}
#endif
FVec4(float x_, float y_, float z_, float w_) : x(x_), y(y_), z(z_), w(w_) {}
FVec4(float x_, float y_, float z_) : x(x_), y(y_), z(z_), w(0.0f) {}
float operator[](int i) const { return data[i]; }
float& operator[](int i) { return data[i]; }
};
</code></pre></div></div>
<div class="codecaption">Listing 1: <code class="language-plaintext highligher-rouge">FVec4</code> definition, which defines a 4-dimensional float vector that can be accessed as either a single 128-bit vector value or as individual 32-bit floats.</div>
<p>The actual implementation in the test project has a few more functions defined as part of <code class="language-plaintext highlighter-rouge">FVec4</code> to provide basic arithmetic operators.
In the test project, I also define <code class="language-plaintext highlighter-rouge">IVec4</code>, which is a simple 4-dimensional integer vector type that is useful for storing multiple indices together.
Rays are represented as a simple struct containing just two <code class="language-plaintext highlighter-rouge">FVec4</code>s and two floats; the two <code class="language-plaintext highlighter-rouge">FVec4</code>s store the ray’s direction and origin, and the two floats store the ray’s tMin and tMax values.</p>
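<p>Based on the description above, the Ray struct looks roughly like the following minimal sketch; the actual struct in the repository may include additional constructors and helpers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct Ray {
    FVec4 direction; // ray direction, stored as an FVec4
    FVec4 origin;    // ray origin, stored as an FVec4
    float tMin;      // minimum distance along the ray at which to accept a hit
    float tMax;      // maximum distance along the ray at which to accept a hit
};
</code></pre></div></div>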
<p>For representing bounding boxes, the test project has two different structs.
The first is <code class="language-plaintext highlighter-rouge">BBox</code>, which defines a single axis-aligned bounding box for purely scalar use.
Since <code class="language-plaintext highlighter-rouge">BBox</code> is only used for scalar code, it just contains normal floats and doesn’t have any vector data types at all inside:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct BBox {
union {
float corners[6]; // indexed as [minX minY minZ maxX maxY maxZ]
float cornersAlt[2][3]; // indexed as corner[minOrMax][XYZ]
};
BBox(const FVec4& minCorner, const FVec4& maxCorner) {
cornersAlt[0][0] = fmin(minCorner.x, maxCorner.x);
cornersAlt[0][1] = fmin(minCorner.y, maxCorner.y);
cornersAlt[0][2] = fmin(minCorner.z, maxCorner.z);
cornersAlt[1][0] = fmax(minCorner.x, maxCorner.x);
cornersAlt[1][1] = fmax(minCorner.y, maxCorner.y);
cornersAlt[1][2] = fmax(minCorner.z, maxCorner.z);
}
FVec4 minCorner() const { return FVec4(corners[0], corners[1], corners[2]); }
FVec4 maxCorner() const { return FVec4(corners[3], corners[4], corners[5]); }
};
</code></pre></div></div>
<div class="codecaption">Listing 2: Struct holding a single bounding-box.</div>
<p>The second bounding box struct is <code class="language-plaintext highlighter-rouge">BBox4</code>, which stores four axis-aligned bounding boxes together.
<code class="language-plaintext highlighter-rouge">BBox4</code> internally uses <code class="language-plaintext highlighter-rouge">FVec4</code>s in a union with two different arrays of regular floats to allow for vectorized operation and individual access to each component of each corner of each box.
The internal layout of <code class="language-plaintext highlighter-rouge">BBox4</code> is not as simple as just storing four <code class="language-plaintext highlighter-rouge">BBox</code> structs; I’ll discuss how the internal layout of <code class="language-plaintext highlighter-rouge">BBox4</code> works a little bit later in this post.</p>
<p><strong>Williams et al. 2005 Ray-Box Intersection Test: Scalar Implementations</strong></p>
<p>Now that we have all of the data structures that we’ll need, we can dive into the actual implementations.
The first implementation is the reference scalar version of ray-box intersection.
The implementation below is pretty close to being copy-pasted straight out of the Williams et al. 2005 paper, albeit with some minor changes to use our previously defined data structures:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bool rayBBoxIntersectScalar(const Ray& ray, const BBox& bbox, float& tMin, float& tMax) {
FVec4 rdir = 1.0f / ray.direction;
int sign[3];
sign[0] = (rdir.x < 0);
sign[1] = (rdir.y < 0);
sign[2] = (rdir.z < 0);
float tyMin, tyMax, tzMin, tzMax;
tMin = (bbox.cornersAlt[sign[0]][0] - ray.origin.x) * rdir.x;
tMax = (bbox.cornersAlt[1 - sign[0]][0] - ray.origin.x) * rdir.x;
tyMin = (bbox.cornersAlt[sign[1]][1] - ray.origin.y) * rdir.y;
tyMax = (bbox.cornersAlt[1 - sign[1]][1] - ray.origin.y) * rdir.y;
if ((tMin > tyMax) || (tyMin > tMax)) {
return false;
}
if (tyMin > tMin) {
tMin = tyMin;
}
if (tyMax < tMax) {
tMax = tyMax;
}
tzMin = (bbox.cornersAlt[sign[2]][2] - ray.origin.z) * rdir.z;
tzMax = (bbox.cornersAlt[1 - sign[2]][2] - ray.origin.z) * rdir.z;
if ((tMin > tzMax) || (tzMin > tMax)) {
return false;
}
if (tzMin > tMin) {
tMin = tzMin;
}
if (tzMax < tMax) {
tMax = tzMax;
}
return ((tMin < ray.tMax) && (tMax > ray.tMin));
}
</code></pre></div></div>
<div class="codecaption">Listing 3: A direct implementation of <a href="https://doi.org/10.1080/2151237X.2005.10129188">"An Efficient and Robust Ray-Box Intersection Algorithm" by Amy Williams et al. 2005.</a></div>
<p>For our test, we want to intersect a ray against four boxes, so we just write a wrapper function that calls <code class="language-plaintext highlighter-rouge">rayBBoxIntersectScalar()</code> four times in sequence.
In the wrapper function, <code class="language-plaintext highlighter-rouge">hits</code> is a reference to a <code class="language-plaintext highlighter-rouge">IVec4</code> where each component of the <code class="language-plaintext highlighter-rouge">IVec4</code> is used to store either <code class="language-plaintext highlighter-rouge">0</code> to indicate no intersection, or <code class="language-plaintext highlighter-rouge">1</code> to indicate an intersection:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void rayBBoxIntersect4Scalar(const Ray& ray,
const BBox& bbox0,
const BBox& bbox1,
const BBox& bbox2,
const BBox& bbox3,
IVec4& hits,
FVec4& tMins,
FVec4& tMaxs) {
hits[0] = (int)rayBBoxIntersectScalar(ray, bbox0, tMins[0], tMaxs[0]);
hits[1] = (int)rayBBoxIntersectScalar(ray, bbox1, tMins[1], tMaxs[1]);
hits[2] = (int)rayBBoxIntersectScalar(ray, bbox2, tMins[2], tMaxs[2]);
hits[3] = (int)rayBBoxIntersectScalar(ray, bbox3, tMins[3], tMaxs[3]);
}
</code></pre></div></div>
<div class="codecaption">Listing 4: Wrap and call <code class="language-plaintext highligher-rouge">rayBBoxIntersectScalar()</code> four times sequentially to implement scalar 4-way ray-box intersection.</div>
<p>The implementation provided in the original paper is easy to understand, but unfortunately is not in a form that we can easily vectorize.
Note the six branching if statements; branching statements do not bode well for good vectorized code.
The reason branching doesn’t go well with SIMD code is because with SIMD code, the same instruction has to be executed in lockstep across all four SIMD lanes; the only way for different lanes to execute different branches is to run all branches across all lanes sequentially, and for each branch mask out the lanes that the branch shouldn’t apply to.
Contrast this with normal scalar sequential execution, where we process one ray-box intersection at a time; each ray-box test can independently choose which codepath to execute at each branch and completely bypass executing branches that never get taken.
Scalar code also benefits from hardware features like advanced branch prediction to further speed things up.</p>
<p>In order to get to a point where we can more easily write vectorized SSE and Neon implementations of the ray-box test, we first need to refactor the original implementation into an intermediate scalar form that is more amenable to vectorization.
In other words, we need to rewrite the code in Listing 3 to be as branchless as possible.
We can see that each of the if statements in Listing 3 is comparing two values and, depending on which value is greater, assigning one value to be the same as the other value.
Fortunately, this type of compare-and-assign with floats can easily be replicated in a branchless fashion by just using a <code class="language-plaintext highlighter-rouge">min</code> or <code class="language-plaintext highlighter-rouge">max</code> operation.
For example, the branching statement <code class="language-plaintext highlighter-rouge">if (tyMin > tMin) { tMin = tyMin; }</code> can be easily replaced with the branchless statement <code class="language-plaintext highlighter-rouge">tMin = fmax(tMin, tyMin);</code>.
I chose to use <code class="language-plaintext highlighter-rouge">fmax()</code> and <code class="language-plaintext highlighter-rouge">fmin()</code> instead of <code class="language-plaintext highlighter-rouge">std::max()</code> and <code class="language-plaintext highlighter-rouge">std::min()</code> because I found <code class="language-plaintext highlighter-rouge">fmax()</code> and <code class="language-plaintext highlighter-rouge">fmin()</code> to be slightly faster in this example.
The good thing about replacing our branches with <code class="language-plaintext highlighter-rouge">min</code>/<code class="language-plaintext highlighter-rouge">max</code> operations is that SSE and Neon both have intrinsics to do vectorized <code class="language-plaintext highlighter-rouge">min</code> and <code class="language-plaintext highlighter-rouge">max</code> operations in the form of <code class="language-plaintext highlighter-rouge">_mm_min_ps</code> and <code class="language-plaintext highlighter-rouge">_mm_max_ps</code> for SSE and <code class="language-plaintext highlighter-rouge">vminq_f32</code> and <code class="language-plaintext highlighter-rouge">vmaxq_f32</code> for Neon.</p>
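<p>Here’s a tiny sketch showing the same compare-and-assign written three ways: the branching form from Listing 3, the branchless scalar form used in Listing 5, and (as a preview of the vectorized implementations coming up) the equivalent 4-wide min/max intrinsics; the input values here are made up purely for illustration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>        // fmax()
#if defined(__x86_64__)
#include <xmmintrin.h>  // SSE intrinsics
#elif defined(__aarch64__)
#include <arm_neon.h>   // Neon intrinsics
#endif

void compareAndAssignExample() {
    float tMin = 0.0f;
    float tyMin = 1.0f;

    // Branching form, as in Listing 3:
    if (tyMin > tMin) { tMin = tyMin; }

    // Branchless scalar form, as used in Listing 5:
    tMin = fmax(tMin, tyMin);

    // 4-wide vectorized equivalents, operating on all four lanes at once:
#if defined(__x86_64__)
    __m128 vMax = _mm_max_ps(_mm_set1_ps(tMin), _mm_set1_ps(tyMin));
    (void)vMax;
#elif defined(__aarch64__)
    float32x4_t vMax = vmaxq_f32(vdupq_n_f32(tMin), vdupq_n_f32(tyMin));
    (void)vMax;
#endif
}
</code></pre></div></div>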
<p>Also note how in Listing 3, the index of each corner is calculated while looking up the corner; for example: <code class="language-plaintext highlighter-rouge">bbox.cornersAlt[1 - sign[0]]</code>.
To make the code easier to vectorize, we don’t want to be computing indices in the lookup; instead, we want to precompute all of the indices that we will want to look up.
In Listing 5, the <code class="language-plaintext highlighter-rouge">IVec4</code> values named <code class="language-plaintext highlighter-rouge">near</code> and <code class="language-plaintext highlighter-rouge">far</code> are used to store precomputed lookup indices.
Finally, one more shortcut we can make with an eye towards easier vectorization is that we don’t actually care what the values of <code class="language-plaintext highlighter-rouge">tMin</code> and <code class="language-plaintext highlighter-rouge">tMax</code> are in the event that the ray misses the box; if the values that come out of a missed hit in our vectorized implementation don’t exactly match the values that come out of a missed hit in the scalar implementation, that’s okay!
We just need to check for the missed hit case and instead return whether or not a hit has occurred as a bool.</p>
<p>Putting all of the above together, we can rewrite Listing 3 into the following much more compact, much more SIMD-friendly scalar implementation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bool rayBBoxIntersectScalarCompact(const Ray& ray, const BBox& bbox, float& tMin, float& tMax) {
FVec4 rdir = 1.0f / ray.direction;
IVec4 near(int(rdir.x >= 0.0f ? 0 : 3), int(rdir.y >= 0.0f ? 1 : 4),
int(rdir.z >= 0.0f ? 2 : 5));
IVec4 far(int(rdir.x >= 0.0f ? 3 : 0), int(rdir.y >= 0.0f ? 4 : 1),
int(rdir.z >= 0.0f ? 5 : 2));
tMin = fmax(fmax(ray.tMin, (bbox.corners[near.x] - ray.origin.x) * rdir.x),
fmax((bbox.corners[near.y] - ray.origin.y) * rdir.y,
(bbox.corners[near.z] - ray.origin.z) * rdir.z));
tMax = fmin(fmin(ray.tMax, (bbox.corners[far.x] - ray.origin.x) * rdir.x),
fmin((bbox.corners[far.y] - ray.origin.y) * rdir.y,
(bbox.corners[far.z] - ray.origin.z) * rdir.z));
return tMin <= tMax;
}
</code></pre></div></div>
<div class="codecaption">Listing 5: A much more compact implementation of Williams et al. 2005; this implementation does not calculate a negative tMin if the ray origin is inside of the box.</div>
<p>The wrapper around <code class="language-plaintext highlighter-rouge">rayBBoxIntersectScalarCompact()</code> to make a function that intersects one ray against four boxes is exactly the same as in Listing 4, just with a call to the new function, so I won’t bother going into it.</p>
<p>Here is how the scalar compact implementation (Listing 5) compares to the original scalar implementation (Listing 3).
The “speedup” columns use the scalar compact implementation as the baseline:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">x86-64:</th>
<th style="text-align: center">x86-64 Speedup:</th>
<th style="text-align: center">arm64:</th>
<th style="text-align: center">arm64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">44.5159 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">41.8187 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">1.0x.</td>
</tr>
<tr>
<td style="text-align: right">Scalar Original:</td>
<td style="text-align: center">44.1004 ns</td>
<td style="text-align: center">1.0117x</td>
<td style="text-align: center">78.4001 ns</td>
<td style="text-align: center">0.5334x</td>
<td style="text-align: center">90.7649 ns</td>
<td style="text-align: center">0.8935x</td>
</tr>
<tr>
<td style="text-align: right">Scalar No Early-Out:</td>
<td style="text-align: center">55.6770 ns</td>
<td style="text-align: center">0.8014x</td>
<td style="text-align: center">85.3562 ns</td>
<td style="text-align: center">0.4899x</td>
<td style="text-align: center">102.763 ns</td>
<td style="text-align: center">0.7891x</td>
</tr>
</tbody>
</table>
<p>The original scalar implementation is actually ever-so-slightly faster than our scalar compact implementation on x86-64!
This result actually doesn’t surprise me; note that the original scalar implementation has early-outs when checking the values of <code class="language-plaintext highlighter-rouge">tyMin</code> and <code class="language-plaintext highlighter-rouge">tzMin</code>, whereas the early-outs have to be removed in order to restructure the original scalar implementation into the vectorization-friendly compact scalar implementation.
To confirm that the original scalar implementation is faster because of the early-outs, in the test program I also include a version of the original scalar implementation that has the early-outs removed.
Instead of returning when the checks on <code class="language-plaintext highlighter-rouge">tyMin</code> or <code class="language-plaintext highlighter-rouge">tzMin</code> fail, I modified the original scalar implementation to store the result of the checks in a bool that is stored until the end of the function and then checked at the end of the function.
In the results, this modified version of the original scalar implementation is labeled as “Scalar No Early-Out”; this modified version is considerably slower than the compact scalar implementation on both x86-64 and arm64.</p>
<p>The more surprising result is that the original scalar implementation is <em>slower</em> than the compact scalar implementation on arm64, and by a considerable amount!
Even more interesting is that the original scalar implementation and the modified “no early-out” version perform relatively similarly on arm64; this result strongly hints to me that for whatever reason, the version of Clang I used just wasn’t able to optimize for arm64 as well as it was able to for x86-64.
Looking at the <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxBKS0gAOqAqETgwe3r56icmOAkEh4SxRMVLSdpgOqUIETMQE6T5%2BXLaY9rkM1bUE%2BWGR0bFlNXUNmc0KQ93BvUX9pQCUtqhexMjsHOYAzHhUANRYNCHoEAD6x6oAHABsx5eSp3MmGgCCm8HI3lg7JhtuqiwswQIxGCADoEN9sI8XmYNq1tntMAdMEdTkxasgELd7lDXgx3l5Pt83LUWMcQgIwRCcTDXNtqVs8R9MF8fiwmAQEJSNtgdl9nlDxsQvA4dgAxABqFUkXwA7FZnrzeV4GKlZRZeQB6DU7RTM5Wq/jEHYEACe8WZ8WVKqMO1OLC4ZnOO0M6B2VFoqHZGzMqjuBHp8P2UxRZyuNzuxweCsVMbtDqd9sd33l0NhtEDiODJ2OaJWmIjUaeMZj7s9BG9vuOBDdFckyfptKoUOLisFwurJjlzZbxdL7J2qnr0Z7ir71ZNQ6LI9HHv7AC9J9OZ2WdgB3Rc9zsAEQ3LbHe3ZTBMAFYLHXjzuNimY9vJ93eRKpRA5jsQAOIBoQRoqAsdiaP1%2BP6kDsc4Ad%2Bv6rmBP5qre/IwhmSIhhc1xYpG95ipKyCSNmiZOlQkgvm%2BuHUARMEyjucFpghWaoui%2BbYsOGFPmOtZVm6pFvlQtYkS%2BnZWORDYMPgTb8oxj5YdQs7VqoxzAfuJqyW6UkgYp%2B6rpGr7vjJv7/gpv6gXOkbAZB6m8V2AliZh2H7jJcnKQpdkroZhFaUZf4QHpwEGW5kGfuBZEUS8jH7qg5rEOyJAnlYF4QICOx4C%2BaAMOMao7MQmAEMsDAHjUUV4Cel7qrBU68mO5iXDsoXRBFxBRQVsWCPFZnqulmXENl6CHnlBXJrKgXoeJ0pVeFRC1ZYEBJSlg3lTsETNWlGVZUxEmqF8lizSCqjASaa3qhEIIml5u0bXOxnHftq4PFefUDVZlVhTVAC0E0CFNVkzXNqWtUtg0QKtj0bVtf47AD%2B2HSBIMnWdoMgpdvXFTGg33dVo0AFQvcl1ZlWYFWfXxC1tdlv2rajs3bTspMREdlNnZTV1FRZJXLUND2jRqGMpdjuPzd97XM39OzalTwNC0dotroLs30zd/ICVeOLPME6YhMzyMjSQ7OTVjUkzUwwFa8zH3NehvNE1ZEBMJL%2B1A5bQsHcBttQ86Vuw/TOKMwKQLtjsACSd18eh%2BoCDBjExm2IoB6HPZxYO8tRy2cUTnHTMjnFC7J0uvJxeuGebnL17R41nW5ae56Feht7J%2BhftPi5qgfjpDdeU3a4NwF1fmzHimJ93jWGcB2caW%2B9fadtHlud5EEQKZ7eMTXEldwPjUOfFfdD65jeeSBEDOcZbcBx7jFxcNNV1TFcUJTsBv46bOVHqe%2BUXvDjMxoCM0n6NZ9bg11aXzfi182LvfCwj9CoyxePnBWTxw7VgsBYVAq1I5MyDtlJBI59xoHatEBQUVLg9Wuj2LUq8sCqGRM6BQXxTwAgYAADR2NQgAmvQ4IAAtehTBVB0LZKoJh3CWE9XjkpFcmCQjEAUE8Wg/pTxmB6qeDY%2BD1SC21MEEhZCmAUJEdEKK1CADyxAACyHDZEWBoQw/hT9GKVxTOhOBCCObVmmjjZhaQSCiP1q9Bx70nHcI8Fg4gxtBGaLERIqRFgNDGPCReFkW43TUIgNQ3xojNrAR8a46Im03aBLScEyRUVIk7lPFwAq0TYnBHicERJ6TwapL8QdTJKdFRBPEbk08%2BSooyKid8GJVA4kJOySCU67DVCVOIAM%2BpI4mkhKikUp%2BrTildNiRw8pLjalAxqUk1Q4yeyTJaRYGZBS9nzI2N07hyyRn2yGeck0WyWw7NCfs9pRyTlLL6aslJHDzmbN3OA26UpnEjOfFfDxX0AFmyfE0vJBV3F%2BJwYUqFQKYWPIvNLYqvysKXOyYC6%2BcoCY/XNhCuR8KCVniJdk2FFhjwFRRR7SBoloFexFAAJSYDtNBiokb4HSpUAQ3ykYkDwMAYI3z9wEH0UK3OpVlKiqMVXRizL/wG0cRVTlFR2iKUVV4iq/LBUMFUlKsVurHL9mldpdCMY3wqu5QwCAlq1W/m1WUh1urfyirKa651wETUQBNRpA%2B/VaVBSeBEVAng0ostsaoH2ghsGqpEGIWo9idjypmuFcGBsI0fWDUDLmxoDVGv9E4k1ASmZI2IJykpXBAKS1TSCW1qRvlxWSMABgUV5FPwlTsJtLa5mdOOTsCAZa8CjMQT8HYGgbmdoFd2w5vaYkDs5QdFkbgx0Tq7UimJCz51DoGUuldd5gpSpNHm40R6OGernMeggF6ZUF15O6kpEAIhZpBHcqKa6e0FIOfkyGNanUZIpmlBdsdb25o4Q%2Bp9CCX1kqmYUyG76wmyK/cUgGv7gQ6v/aTQdw7vmmgNeB59r7TzwYebMmdMSUMspBH%2B65AGsMHRw6ekdc6IOqCgzCmDey4NTumYhnjUSKMmio2h0ENHMMLqTiB%2BEUB70QhPYYzZso3CdmXd6o9wQWQ8iLcWkct8qBiCUN8hGiopO4fU7J912mewyb7aZltHajNZ12Kp%2BTu6i0hwaXelzCzcM3ori/RUV68ObpY2x0RzTQnwY6Z%2B9dP7KN/rnC%2BMT2704gcC2B4LBHoO7K4Fx5t67jFRdi4J%2BLiXAPJYbU571QXuTGmvQp5Tyn%2B2BbMzVrT7mdOgrdPpzAhn/OOaaxelrmmDWWZbNZmJzW7MgYc/FJzaWR3Lrc2ysbXmbN1d64FGMt9pPVeXTWpbOMZpVfSzV/bI36moueAAN1QHgV0qaI1RoIDGhwkg40GGIIm5NTjU2kDNZnAHQLMY7AzU4ljGg/uCMBxM4FoPcZZuaP96H0503wMHGDrNZhIceeR9s2HaPM0II2Nj3HgP551icQgQgCgSek8zkq3NwQadI7p4jTVoHVAKFGzsKnBByXfs3YCOYD20dPZe8IUQH2B0suAuDz1Br%2BfwpNYr5F3zefkv2Q%2BoXIuEFi7EbGyXCbfuzQR/LpnfGtzy44RrqlavqcxcF4IYX4bRfRv1w4d7RuZcm4QVjxnyV11W85%2Buid6vW1PP7drl3uu3dKA94bz7xuWPE/9%2BSttluOdp9t1XRmHAFi0E4MeXgfgOBaFIKgTgSnLDWE7UsFYzJNg8FIAQTQeeFgAGsQDHghwXjgkhi%2Bt/L5wXgCgQAQ5b6XvPpA4CwCQGgFg8Q6DRHIJQefi/6AxHeIYYAXAuAaGaDQSR2DKAREHxEYItQTScCb/PtgggdEMFoFfyfpAsBsiMOIF/daruYFHy/zApCyAXgz21%2BvAgIr
Qg%2B6YEQ4UxAJoHgWAg%2BQIeALAoBCw7oTAwACg4oeAmAq4Oi5oJeTe/Agg727AUgMggguoagg%2BugzQBgRgKA1g1g%2BgeAEQo%2BkACwoU7Qf%2Bj0OiGwI%2BrQqqqQLgQkIwTQpAgQUwhQxQWQSQKQAgYhshOQqQPQ0hswLQbQVQEwihYwghVqnQdQqhfQJQtg2hngjQeg4wXQRhMwJQCwCgdeqwegQImAawPA%2BeheA%2BL%2BFeHAyEj0twV89BwAOwu%2BX4IIOWEAVeVglgwEuAhAJAa0GwzQOwHgC%2BS%2BRojecwvAE%2BWgcwHeXePenA/epAJeZePhI%2BY%2BzereeR%2BgnAZgXhZRw%2BVRk%2BNRP%2BYiwhkgQAA">compiled x86-64 assembly</a> and the <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxBKS0gAOqAqETgwe3r56icmOAkEh4SxRMVLSdpgOqUIETMQE6T5%2BXLaY9rkM1bUE%2BWGR0bFlNXUNmc0KQ93BvUX9pQCUtqhexMjsHOYAzHhUANRYNCHoEAD6x6oAHABsx5eSp3MmGgCCm8HI3lg7JhtuqiwswQIxGCADoEN9sI8XmYNq1tntMAdMEdTkxasgELd7lDXgx3l5Pt83LUWMcQgIwRCcTDXNtqVs8R9MF8fiwmAQEJSNtgdl9nlDxsQvA4dgAxABqFUkXwA7FZnrzeV4GKlZRZeQB6DU7RTM5Wq/jEHYEACe8WZ8WVKqMO1OLC4ZnOO0M6B2VFoqHZGzMqjuBHp8P2UxRZyuNzuxweCsVMbtDqd9sd33l0NhtEDiODJ2OaJWmIjUaeMZj7s9BG9vuOBDdFckyfptKoUOLisFwurJjlzZbxdL7J2qnr0Z7ir71ZNQ6LI9HHv7AC9J9OZ2WdgB3Rc9zsAEQ3LbHe3ZTBMAFYLHXjzuNimY9vJ93eRKpRA5jsQAOIBoQRoqAsdiaP1%2BP6kDsc4Ad%2Bv6rmBP5qre/IwhmSIhhc1xYpG95ipKyCSNmiZOlQkgvm%2BuHUARMEyjucFpghWaoui%2BbYsOGFPmOtZVm6pFvlQtYkS%2BnZWORDYMPgTb8oxj5YdQs7VqoxzAfuJqyW6UkgYp%2B6rpGr7vjJv7/gpv6gXOkbAZB6m8V2AliZh2H7jJcnKQpdkroZhFaUZf4QHpwEGW5kGfuBZEUS8jH7qg5rEOyJAnlYF4QICOx4C%2BaAMOMao7MQmAEMsDAHjUUV4Cel7qrBU68mO5iXDsoXRBFxBRQVsWCPFZnqulmXENl6CHnlBXJrKgXoeJ0pVeFRC1ZYEBJSlg3lTsETNWlGVZUxEmqF8lizSCqjASaa3qhEIIml5u0bXOxnHftq4PFefUDVZlVhTVAC0E0CFNVkzXNqWtUtg0QKtj0bVtf47AD%2B2HSBIMnWdoMgpdvXFTGg33dVo0AFQvcl1ZlWYFWfXxC1tdlv2rajs3bTspMREdlNnZTV1FRZJXLUND2jRqGMpdjuPzd97XM39OzalTwNC0dotroLs30zd/ICVeOLPME6YhMzyMjSQ7OTVjUkzUwwFa8zH3NehvNE1ZEBMJL%2B1A5bQsHcBttQ86Vuw/TOKMwKQLtjsACSd18eh%2BoCDBjExm2IoB6HPZxYO8tRy2cUTnHTMjnFC7J0uvJxeuGebnL17R41nW5ae56Feht7J%2BhftPi5qgfjpDdeU3a4NwF1fmzHimJ93jWGcB2caW%2B9fadtHlud5EEQKZ7eMTXEldwPjUOfFfdD65jeeSBEDOcZbcBx7jFxcNNV1TFcUJTsBv46bOVHqe%2BUXvDjMxoCM0n6NZ9bg11aXzfi182LvfCwj9CoyxePnBWTxw7VgsBYVAq1I5MyDtlJBI59xoHatEBQUVLg9Wuj2LUq8sCqGRM6BQXxTwAgYAADR2NQgAmvQ4IAAtehTBVB0LZKoJh3CWE9XjkpFcmCQjEAUE8Wg/pTxmB6qeDY%2BD1SC21MEEhZCmAUJEdEKK1CADyxAACyHDZEWBoQw/hT9GKVxTOhOBCCObVmmjjZhaQSCiP1q9Bx70nHcI8Fg4gxtBGaLERIqRFgNDGPCReFkW43TUIgNQ3xojNrAR8a46Im03aBLScEyRUVIk7lPFwAq0TYnBHicERJ6TwapL8QdTJKdFRBPEbk08%2BSooyKid8GJVA4kJOySCU67DVCVOIAM%2BpI4mkhKikUp%2BrTildNiRw8pLjalAxqUk1Q4yeyTJaRYGZBS9nzI2N07hyyRn2yGeck0WyWw7NCfs9pRyTlLL6aslJHDzmbN3OA26UpnEjOfFfDxX0AFmyfE0vJBV3F%2BJwYUqFQKYWPIvNLYqvysKXOyYC6%2BcoCY/XNhCuR8KCVniJdk2FFhjwFRRR7SBoloFexFAAJSYDtNBiokb4HSpUAQ3ykYkDwMAYI3z9wEH0UK3OpVlKiqMVXRizL/wG0cRVTlFR2iKUVV4iq/LBUMFUlKsVurHL9mldpdCMY3wqu5QwCAlq1W/m1WUh1urfyirKa651wETUQBNRpA%2B/VaVBSeBEVAng0ostsaoH2ghsGqpEGIWo9idjypmuFcGBsI0fWDUDLmxoDVGv9E4k1ASmZI2IJykpXBAKS1TSCW1qRvlxWSMABgUV5FPwlTsJtLa5mdOOTsCAZa8CjMQT8HYGgbmdoFd2w5vaYkDs5QdFkbgx0Tq7UimJCz51DoGUuldd5gpSpNHm40R6OGernMeggF6ZUF15O6kpEAIhZpBHcqKa6e0FIOfkyGNanUZIpmlBdsdb25o4Q%2Bp9CCX1kqmYUyG76wmyK/cUgGv7gQ6v/aTQdw7vmmgNeB59r7TzwYebMmdMSUMspBH%2B65AGsMHRw6ekdc6IOqCgzCmDey4NTumYhnjUSKMmio2h0ENHMMLqTiB%2BEUB70QhPYYzZso3CdmXd6o9wQWQ8iLcWkct8qBiCUN8hGiopO4fU7J912mewyb7aZltHajNZ12Kp%2BTu6i0hwaXelzCzcM3ori/RUV68ObpY2x0RzTQnwY6Z%2B9dP7KN/rnC%2BMT2704gcC2B4LBHoO7K4Fx5t67jFRdi4J%2BLiXAPJYbU571QXuTGmvQp5Tyn%2B2BbMzVrT7mdOgrdPpzAhn/OOaaxelrmmDWWZbNZmJzW7MgYc/FJzaWR3Lrc2ysbXmbN1d64FGMt9pPVeXTWpbOMZpVfSzV/bI36moueAAN1QHgV0qaI1RoIDGhwkg40GGIIm5NTjU2kDNZnAHQLMY7AzU4ljGg/uCMBxM4FoPcZZuaP96H0503wMHGDrNZhIceeR9s2HaPM0II2Nj3HgP551i
cQgQgCgSek8zkq3NwQadI7p4jTVoHVAKFGzsKnBByXfs3YCOYD20dPZe8IUQH2B0suAuDz1Br%2BfwpNYr5F3zefkv2Q%2BoXIuEFi7EbGyXCbfuzQR/LpnfGtzy44RrqlavqcxcF4IYX4bRfRv1w4d7RuZcm4QVjxnyV11W85%2Buid6vW1PP7drl3uu3dKA94bz7xuWPE/9%2BSttluOdp9t1XRmHAFi0E4MeXgfgOBaFIKgTgSnLDWE7UsFYzJNg8FIAQTQeeFgAGsQDHghwXjgkhi%2Bt/L5wXgCgQAQ5b6XvPpA4CwCQGgFg8Q6DRHIJQefi/6AxBJFd84j13iGGAFwLgGhmg0EkdgygERB8RGCLUE0nAm/z7YIIHRDBaB38n6QLAbIjDiA/3Wq7mAo%2BH%2BmApCyAXgz29%2BvAgIrQg%2B6YEQ4UxAJoHgWAg%2BQIeALAkBCw7oTAwACg4oeAmAq4Oi5oJeTe/Agg727AUgMggguoagg%2BugzQBgRgKA1g1g%2BgeAEQo%2BkACwoU7QQBj0OiGwI%2BrQqqqQLgQkIwTQpAgQUwhQxQWQSQKQAgUhihOQqQPQ8hswLQbQVQEwqhYwohVqnQdQmhfQJQtg%2BhngjQeg4wXQZhMwJQCwCgdeqwegQImAawPA%2BeheA%2BH%2BFeHAW%2BO%2Blse%2BNoh%2BX4IIOWEAVeVglgwEuAhAJAa0GwzQOwHgC%2BS%2BRojecwvAE%2BWgcwHeXePenA/epAJeZeARI%2BY%2BzereBR%2BgnAZgfhFRw%2BNRk%2BdRABYi4hkgQAA">compiled arm64 assembly</a> on Godbolt Compiler Explorer for the original scalar implementation shows that the structure of the output assembly is very similar across both architectures though, so the cause of the slower performance on arm64 is not completely clear to me.</p>
<p>For all of the results in the rest of the post, the compact scalar implementation’s timings are used as the baseline that everything else is compared against, since all of the following implementations are derived from the compact scalar implementation.</p>
<p><strong>SSE Implementation</strong></p>
<p>The first vectorized implementation we’ll look at is using SSE on x86-64 processors.
The full SSE through SSE4 instruction set today contains 281 instructions, introduced over the past two decades-ish in a series of supplementary extensions to the original SSE instruction set.
All modern Intel and AMD x86-64 processors from at least the past decade support SSE4, and all x86-64 processors ever made support at least SSE2 since SSE2 is written into the base x86-64 specification.
As mentioned earlier, SSE uses 128-bit registers that can be split into two, four, eight, or even sixteen lanes; the most common (and original) use case is four 32-bit floats.
AVX and AVX2 later expanded the register width from 128-bit to 256-bit, and the latest AVX-512 extensions introduced 512-bit registers.
For this post though, we’ll just stick with 128-bit SSE.</p>
<p>In order to program directly using SSE instructions, we can either write SSE assembly directly, or we can use SSE intrinsics.
Writing SSE assembly directly is not particularly ideal for all of the same reasons that writing programs in regular assembly is not particularly ideal for most cases, so we’ll want to use intrinsics instead.
Intrinsics are functions whose implementations are specially handled by the compiler; in the case of vector intrinsics, each function maps directly to a known single or small number of vector assembly instructions.
Intrinsics kind of bridge between writing directly in assembly and using full-blown standard library functions; intrinsics are <em>higher</em> level than assembly, but <em>lower</em> level than what you typically find in standard library functions.
The headers for vector intrinsics are defined by the compiler; almost every C++ compiler that supports SSE and AVX intrinsics follows a convention where SSE/AVX intrinsics headers are named using the pattern *mmintrin.h, where * is a letter of the alphabet corresponding to a specific subset or version of either SSE or AVX (for example, x for SSE, e for SSE2, n for SSE4.2, i for AVX, etc.).
For example, <code class="language-plaintext highlighter-rouge">xmmintrin.h</code> is where the <code class="language-plaintext highlighter-rouge">__m128</code> type we used earlier in defining all of our structs comes from.
Intel’s searchable <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">online Intrinsics Guide</a> is an invaluable resource for looking up what SSE intrinsics there are and what each of them does.</p>
<p>The first thing we need to do for our SSE implementation is to define a new <code class="language-plaintext highlighter-rouge">BBox4</code> struct that holds four bounding boxes together.
How we store these four bounding boxes together is extremely important.
The easiest way to store four bounding boxes in a single struct is to just have <code class="language-plaintext highlighter-rouge">BBox4</code> store four separate <code class="language-plaintext highlighter-rouge">BBox</code> structs internally, but this approach is actually really bad for vectorization.
To understand why, consider something like the following, where we perform a <code class="language-plaintext highlighter-rouge">max</code> operation between the ray tMin and a distance to a corner of a bounding box:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fmax(ray.tMin, (bbox.corners[near.x] - ray.origin.x) * rdir.x);
</code></pre></div></div>
<p>Now consider if we want to do this operation for four bounding boxes in serial:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fmax(ray.tMin, (bbox0.corners[near.x] - ray.origin.x) * rdir.x);
fmax(ray.tMin, (bbox1.corners[near.x] - ray.origin.x) * rdir.x);
fmax(ray.tMin, (bbox2.corners[near.x] - ray.origin.x) * rdir.x);
fmax(ray.tMin, (bbox3.corners[near.x] - ray.origin.x) * rdir.x);
</code></pre></div></div>
<p>The above serial sequence is a perfect example of what we want to fold into a single vectorized line of code.
The inputs to a vectorized version of the above should be a 128-bit four-lane value with <code class="language-plaintext highlighter-rouge">ray.tMin</code> in all four lanes, another 128-bit four-lane value with <code class="language-plaintext highlighter-rouge">ray.origin.x</code> in all four lanes, another 128-bit four-lane value with <code class="language-plaintext highlighter-rouge">rdir.x</code> in all four lanes, and finally a 128-bit four-lane value where the first lane is a single index of a single corner from the first bounding box, the second lane is a single index of a single corner from the second bounding box, and so on and so forth.
Instead of an array of structs, we need the bounding box values to be provided as a struct of corner value arrays where each 128-bit value stores one 32-bit value from each corner of each of the four boxes.
Alternatively, the <code class="language-plaintext highlighter-rouge">BBox4</code> memory layout that we want can be thought of as an array of 24 floats, which is indexed as a 3D array where the first dimension is indexed by min or max corner, the second dimension is indexed by x, y, and z within each corner, and the third dimension is indexed by which bounding box the value belongs to.
Putting the above together with some accessors and setter functions yields the following definition for <code class="language-plaintext highlighter-rouge">BBox4</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct BBox4 {
union {
FVec4 corners[6]; // order: minX, minY, minZ, maxX, maxY, maxZ
float cornersFloat[2][3][4]; // indexed as corner[minOrMax][XYZ][bboxNumber]
float cornersFloatAlt[6][4];
};
inline __m128* minCornerSSE() { return &corners[0].m128; }
inline __m128* maxCornerSSE() { return &corners[3].m128; }
#if defined(__aarch64__)
inline float32x4_t* minCornerNeon() { return &corners[0].f32x4; }
inline float32x4_t* maxCornerNeon() { return &corners[3].f32x4; }
#endif
inline void setBBox(int boxNum, const FVec4& minCorner, const FVec4& maxCorner) {
cornersFloat[0][0][boxNum] = fmin(minCorner.x, maxCorner.x);
cornersFloat[0][1][boxNum] = fmin(minCorner.y, maxCorner.y);
cornersFloat[0][2][boxNum] = fmin(minCorner.z, maxCorner.z);
cornersFloat[1][0][boxNum] = fmax(minCorner.x, maxCorner.x);
cornersFloat[1][1][boxNum] = fmax(minCorner.y, maxCorner.y);
cornersFloat[1][2][boxNum] = fmax(minCorner.z, maxCorner.z);
}
BBox4(const BBox& a, const BBox& b, const BBox& c, const BBox& d) {
setBBox(0, a.minCorner(), a.maxCorner());
setBBox(1, b.minCorner(), b.maxCorner());
setBBox(2, c.minCorner(), c.maxCorner());
setBBox(3, d.minCorner(), d.maxCorner());
}
};
</code></pre></div></div>
<div class="codecaption">Listing 6: Struct holding four bounding boxes together with values interleaved for optimal vectorized access.</div>
<p>Note how the <code class="language-plaintext highlighter-rouge">setBBox</code> function (which the constructor calls) has a memory access pattern where a single value is written into every 128-bit <code class="language-plaintext highlighter-rouge">FVec4</code>.
Generally scattered access like this is extremely expensive in vectorized code, and should be avoided as much as possible; setting an entire 128-bit value at once is much faster than setting four separate 32-bit segments across four different values.
However, something like the above is often inevitably necessary just to get data loaded into a layout optimal for vectorized code; in the test program, <code class="language-plaintext highlighter-rouge">BBox4</code> structs are initialized and set up once, and then reused across all tests.
The time required to set up <code class="language-plaintext highlighter-rouge">BBox</code> and <code class="language-plaintext highlighter-rouge">BBox4</code> is not counted as part of any of the test runs; in a full BVH traversal implementation, the BVH’s bounds at each node should be pre-arranged into a vector-friendly layout before any ray traversal takes place.
In general, figuring out how to restructure an algorithm to be easily expressed using vector intrinsics is really only half of the challenge in writing good vectorized programs; the other half of the challenge is just getting the input data into a form that is amenable to vectorization.
Actually, depending on the problem domain, the data marshaling can account for far more than half of the total effort spent!</p>
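<p>As a rough sketch of what “pre-arranged into a vector-friendly layout” could look like in practice (this is a hypothetical structure, not code from the test program), a 4-wide BVH node could bake its children’s bounds into a <code class="language-plaintext highlighter-rouge">BBox4</code> once at build time, so that per-ray traversal never has to pay the scattered-write setup cost:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct BVHNode4 {
    BBox4 childBounds;                     // bounds of up to four children, already in SoA layout
    int children[4] = { -1, -1, -1, -1 };  // hypothetical child indices; -1 marks an unused slot
    BVHNode4(const BBox& a, const BBox& b, const BBox& c, const BBox& d)
        : childBounds(a, b, c, d) {}
};
</code></pre></div></div>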
<p>Now that we have four bounding boxes structured in a way that is amenable to vectorized usage, we also need to structure our ray inputs for vectorized usage.
This step is relatively easy; we just need to expand each component of each element of the ray into a 128-bit value where the same value is replicated across every 32-bit lane.
SSE has a specific intrinsic to do exactly this: <code class="language-plaintext highlighter-rouge">_mm_set1_ps()</code> takes in a single 32-bit float and replicates it to all four lanes in a 128-bit <code class="language-plaintext highlighter-rouge">__m128</code>.
SSE also has a bunch of more specialized instructions, which can be used in specific scenarios to do complex operations in a single instruction.
Knowing when to use these more specialized instructions can be tricky and requires extensive knowledge of the SSE instruction set; I don’t know these very well yet!
One good trick I did figure out was that in the case of taking a <code class="language-plaintext highlighter-rouge">FVec4</code> and creating a new <code class="language-plaintext highlighter-rouge">__m128</code> from each of the <code class="language-plaintext highlighter-rouge">FVec4</code>’s components, I could use <code class="language-plaintext highlighter-rouge">_mm_shuffle_ps</code> instead of <code class="language-plaintext highlighter-rouge">_mm_set1_ps()</code>.
The problem with using <code class="language-plaintext highlighter-rouge">_mm_set1_ps()</code> in this case is that with a <code class="language-plaintext highlighter-rouge">FVec4</code>, which internally uses <code class="language-plaintext highlighter-rouge">__m128</code> on x86-64, taking an element out and broadcasting it using <code class="language-plaintext highlighter-rouge">_mm_set1_ps()</code> compiles down to a <code class="language-plaintext highlighter-rouge">MOVSS</code> instruction in addition to a shuffle.
<code class="language-plaintext highlighter-rouge">_mm_shuffle_ps()</code>, on the other hand, compiles down to a single <code class="language-plaintext highlighter-rouge">SHUFPS</code> instruction.
<code class="language-plaintext highlighter-rouge">_mm_shuffle_ps()</code> takes in two <code class="language-plaintext highlighter-rouge">__m128</code>s as input and takes two components from the first <code class="language-plaintext highlighter-rouge">__m128</code> for the first two components of the output, and takes two components from the second <code class="language-plaintext highlighter-rouge">__m128</code> for the second two components of the output.
Which components from the inputs are taken is assignable using an input mask, which can conveniently be generated using the <code class="language-plaintext highlighter-rouge">_MM_SHUFFLE()</code> macro that comes with the SSE intrinsics headers.
Since our ray struct’s origin and direction elements are already backed by <code class="language-plaintext highlighter-rouge">__m128</code> under the hood, we can just use <code class="language-plaintext highlighter-rouge">_mm_shuffle_ps()</code> with the same element from the ray as both the first and second inputs to generate a <code class="language-plaintext highlighter-rouge">__m128</code> containing only a single component of each element.
For example, to create a <code class="language-plaintext highlighter-rouge">__m128</code> containing only the x component of the ray direction, we can write: <code class="language-plaintext highlighter-rouge">_mm_shuffle_ps(rdir.m128, rdir.m128, _MM_SHUFFLE(0, 0, 0, 0))</code>.</p>
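<p>To make the difference concrete, here is a minimal side-by-side sketch, assuming the same <code class="language-plaintext highlighter-rouge">FVec4</code> union from earlier that exposes both named float components and the underlying <code class="language-plaintext highlighter-rouge">__m128</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Both lines broadcast rdir.x into all four lanes of a __m128, but they compile differently:
// the first extracts the scalar and re-broadcasts it (MOVSS plus a shuffle), while the second
// shuffles the existing register in place (a single SHUFPS).
__m128 rdirXviaSet1    = _mm_set1_ps(rdir.x);
__m128 rdirXviaShuffle = _mm_shuffle_ps(rdir.m128, rdir.m128, _MM_SHUFFLE(0, 0, 0, 0));
</code></pre></div></div>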
<p>Translating the <code class="language-plaintext highlighter-rouge">fmin()</code> and <code class="language-plaintext highlighter-rouge">fmax()</code> functions is very straightforward with SSE; we can use SSE’s <code class="language-plaintext highlighter-rouge">_mm_min_ps()</code> and <code class="language-plaintext highlighter-rouge">_mm_max_ps()</code> as direct analogues.
Putting all of the above together allows us to write a fully SSE-ized version of the compact scalar ray-box intersection test that intersects a single ray against four boxes simultaneously:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void rayBBoxIntersect4SSE(const Ray& ray,
const BBox4& bbox4,
IVec4& hits,
FVec4& tMins,
FVec4& tMaxs) {
FVec4 rdir(_mm_set1_ps(1.0f) / ray.direction.m128);
/* use _mm_shuffle_ps, which translates to a single instruction while _mm_set1_ps involves a
MOVSS + a shuffle */
FVec4 rdirX(_mm_shuffle_ps(rdir.m128, rdir.m128, _MM_SHUFFLE(0, 0, 0, 0)));
FVec4 rdirY(_mm_shuffle_ps(rdir.m128, rdir.m128, _MM_SHUFFLE(1, 1, 1, 1)));
FVec4 rdirZ(_mm_shuffle_ps(rdir.m128, rdir.m128, _MM_SHUFFLE(2, 2, 2, 2)));
FVec4 originX(_mm_shuffle_ps(ray.origin.m128, ray.origin.m128, _MM_SHUFFLE(0, 0, 0, 0)));
FVec4 originY(_mm_shuffle_ps(ray.origin.m128, ray.origin.m128, _MM_SHUFFLE(1, 1, 1, 1)));
FVec4 originZ(_mm_shuffle_ps(ray.origin.m128, ray.origin.m128, _MM_SHUFFLE(2, 2, 2, 2)));
IVec4 near(int(rdir.x >= 0.0f ? 0 : 3), int(rdir.y >= 0.0f ? 1 : 4),
int(rdir.z >= 0.0f ? 2 : 5));
IVec4 far(int(rdir.x >= 0.0f ? 3 : 0), int(rdir.y >= 0.0f ? 4 : 1),
int(rdir.z >= 0.0f ? 5 : 2));
tMins = FVec4(_mm_max_ps(
_mm_max_ps(_mm_set1_ps(ray.tMin),
(bbox4.corners[near.x].m128 - originX.m128) * rdirX.m128),
_mm_max_ps((bbox4.corners[near.y].m128 - originY.m128) * rdirY.m128,
(bbox4.corners[near.z].m128 - originZ.m128) * rdirZ.m128)));
tMaxs = FVec4(_mm_min_ps(
_mm_min_ps(_mm_set1_ps(ray.tMax),
(bbox4.corners[far.x].m128 - originX.m128) * rdirX.m128),
_mm_min_ps((bbox4.corners[far.y].m128 - originY.m128) * rdirY.m128,
(bbox4.corners[far.z].m128 - originZ.m128) * rdirZ.m128)));
int hit = ((1 << 4) - 1) & _mm_movemask_ps(_mm_cmple_ps(tMins.m128, tMaxs.m128));
hits[0] = bool(hit & (1 << (0)));
hits[1] = bool(hit & (1 << (1)));
hits[2] = bool(hit & (1 << (2)));
hits[3] = bool(hit & (1 << (3)));
}
</code></pre></div></div>
<div class="codecaption">Listing 7: SSE version of the compact Williams et al. 2005 implementation.</div>
<p>The last part of <code class="language-plaintext highlighter-rouge">rayBBoxIntersect4SSE()</code> where <code class="language-plaintext highlighter-rouge">hits</code> is populated might require a bit of explaining.
This last part implements the check for whether or not a ray actually hit the box based on the results stored in <code class="language-plaintext highlighter-rouge">tMin</code> and <code class="language-plaintext highlighter-rouge">tMax</code>.
This implementation takes advantage of the fact that misses in this implementation produce <code class="language-plaintext highlighter-rouge">inf</code> or <code class="language-plaintext highlighter-rouge">-inf</code> values; to figure out if a hit has occurred, we just have to check that in each lane, the <code class="language-plaintext highlighter-rouge">tMin</code> value is less than the <code class="language-plaintext highlighter-rouge">tMax</code> value, and <code class="language-plaintext highlighter-rouge">inf</code> values play nicely with this check.
So, to conduct the check across all lanes at the same time, we use <code class="language-plaintext highlighter-rouge">_mm_cmple_ps()</code>, which checks whether the 32-bit float in each lane of the first input is less than or equal to the corresponding 32-bit float in each lane of the second input.
If the comparison succeeds, <code class="language-plaintext highlighter-rouge">_mm_cmple_ps()</code> writes <code class="language-plaintext highlighter-rouge">0xFFFFFFFF</code> into the corresponding lane in the output <code class="language-plaintext highlighter-rouge">__m128</code>, and if the comparison fails, <code class="language-plaintext highlighter-rouge">0</code> is written instead.
The remaining <code class="language-plaintext highlighter-rouge">_mm_movemask_ps()</code> call packs the most significant bit of each lane (which holds the comparison result) into the low four bits of an ordinary int, and the bit shifts then just copy each of those bits out into the corresponding component of <code class="language-plaintext highlighter-rouge">hits</code>.</p>
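<p>For a sense of how <code class="language-plaintext highlighter-rouge">rayBBoxIntersect4SSE()</code> might be called, here is a small hypothetical usage sketch; in a real traversal implementation the <code class="language-plaintext highlighter-rouge">BBox4</code> would be built once when the BVH is constructed instead of per ray:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical usage: intersect one ray against four candidate bounding boxes at once.
bool anyBoxHit(const Ray& ray, const BBox& a, const BBox& b, const BBox& c, const BBox& d) {
    BBox4 boxes(a, b, c, d); // in practice, set up ahead of time and reused across many rays
    IVec4 hits;
    FVec4 tMins, tMaxs;
    rayBBoxIntersect4SSE(ray, boxes, hits, tMins, tMaxs);
    // tMins[i] holds the entry distance along the ray for box i whenever hits[i] is nonzero.
    return hits[0] || hits[1] || hits[2] || hits[3];
}
</code></pre></div></div>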
<p>I think variants of this 4-wide SSE ray-box intersection function are fairly common in production renderers; I’ve seen something similar developed independently at multiple studios and in multiple renderers, which shouldn’t be surprising since the translation from the original Williams et al. 2005 paper to an SSE-ized version is relatively straightforward.
Also, the performance results further hint at why variants of this implementation are popular!
Here is how the SSE implementation (Listing 7) performs compared to the scalar compact representation (Listing 5):</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">x86-64:</th>
<th style="text-align: center">x86-64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">44.5159 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">1.0x.</td>
</tr>
<tr>
<td style="text-align: right">SSE:</td>
<td style="text-align: center">10.9660 ns</td>
<td style="text-align: center">4.0686x</td>
<td style="text-align: center">13.6353 ns</td>
<td style="text-align: center">5.9474x</td>
</tr>
</tbody>
</table>
<p>The SSE implementation is almost exactly four times faster than the reference scalar compact implementation, which is exactly what we would expect as a best case for a properly written SSE implementation.
Actually, in the results listed above, the SSE implementation is listed as being slightly <em>more</em> than four times faster, but that’s just an artifact of averaging together results from multiple runs; the amount over 4x falls within the statistical margin of error.
A 4x speedup is the maximum speedup we can possibly expect given that SSE is 4-wide for 32-bit floats.
In our SSE implementation, the <code class="language-plaintext highlighter-rouge">BBox4</code> struct is already set up before the function is called, but the function still needs to translate each incoming ray into a form suitable for vector operations, which is additional work that the scalar implementation doesn’t need to do.
In order to make this additional setup work not drag down performance, the <code class="language-plaintext highlighter-rouge">_mm_shuffle_ps()</code> trick becomes very important.</p>
<p>Running the x86-64 version of the test program on arm64 using Rosetta 2 produces a more surprising result: close to a 6x speedup!
Running through Rosetta 2 means that the x86-64 and SSE instructions have to be translated to arm64 and Neon instructions, and the nearly 6x speedup here hints that for this test, Rosetta 2’s SSE to Neon translation ran much more efficiently than Rosetta 2’s x86-64 to arm64 translation.
Otherwise, a greater-than-4x speedup should not be possible if both implementations are being translated with equal levels of efficiency.
I did not expect that to be the case!
Unfortunately, while we can speculate, only Apple’s developers can say for sure what Rosetta 2 is doing internally that produces this result.</p>
<p><strong>Neon Implementation</strong></p>
<p>The second vectorized implementation we’ll look at is using Neon on arm64 processors.
Much like how all modern x86-64 processors support at least SSE2 because the 64-bit extension to x86 incorporated SSE2 into the base instruction set, all modern arm64 processors support Neon because the 64-bit extension to ARM incorporates Neon in the base instruction set.
Compared with SSE, Neon is a much more compact instruction set, which makes sense since SSE belongs to a CISC ISA while Neon belongs to a RISC ISA.
Neon includes a little over a hundred instructions, which is less than half the number of instructions that the full SSE to SSE4 instruction set contains.
Neon has all of the basics that one would expect, such as arithmetic operations and various comparison operations, but Neon doesn’t have more complex high-level instructions like the fancy shuffle instructions we used in our SSE implementation.</p>
<p>Much like how Intel has a searchable SSE intrinsics guide, ARM provides a helpful <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/">searchable intrinsics guide</a>.
Howard Oakley’s <a href="https://eclecticlight.co/2021/07/27/code-in-arm-assembly-rounding-and-arithmetic/">recent blog series</a> on writing arm64 assembly also includes a great <a href="https://eclecticlight.co/2021/08/23/code-in-arm-assembly-lanes-and-loads-in-neon/">introduction to using Neon</a>.
Note that even though there are fewer Neon instructions in total than there are SSE instructions, the ARM intrinsics guide lists several <em>thousand</em> functions; this is because of one of the chief differences between SSE and Neon.
SSE’s <code class="language-plaintext highlighter-rouge">__m128</code> is just a generic 128-bit container that doesn’t actually specify what type or how many lanes it contains; what type a <code class="language-plaintext highlighter-rouge">__m128</code> value holds and how many lanes it is interpreted as containing is entirely up to each SSE instruction.
Contrast this with Neon, which has explicit separate types for floats and integers, and also defines separate types based on lane width.
Since Neon has many different 128-bit types, each Neon instruction has multiple corresponding intrinsics that differ simply by the input types and widths accepted in the function signature.
As a result of all of the above differences from SSE, writing a Neon implementation is not quite as simple as just doing a one-to-one replacement of each SSE intrinsic with a Neon intrinsic.</p>
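<p>As a small standalone illustration of that explicit typing (the function name here is purely for demonstration), the “same” lane-wise add maps to a different intrinsic for each element type, and each of those variants gets its own entry in ARM’s intrinsics guide:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <arm_neon.h>

void neonTypedAddExample() {
    float32x4_t aF = vdupq_n_f32(1.0f), bF = vdupq_n_f32(2.0f);
    int32x4_t   aI = vdupq_n_s32(1),    bI = vdupq_n_s32(2);
    uint32x4_t  aU = vdupq_n_u32(1),    bU = vdupq_n_u32(2);
    float32x4_t sumF = vaddq_f32(aF, bF); // four 32-bit float lanes
    int32x4_t   sumI = vaddq_s32(aI, bI); // four signed 32-bit integer lanes
    uint32x4_t  sumU = vaddq_u32(aU, bU); // four unsigned 32-bit integer lanes
    (void)sumF; (void)sumI; (void)sumU;   // silence unused-variable warnings in this sketch
}
</code></pre></div></div>
<p>All of which suggests that a mechanical find-and-replace from SSE intrinsics to Neon intrinsics shouldn’t really be possible…</p>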
<p>…or is it?
Writing C/C++ code utilizing Neon instructions can be done by using the native Neon intrinsics found in <code class="language-plaintext highlighter-rouge"><arm_neon.h></code>, but another option exists through <a href="https://github.com/DLTcollab/sse2neon">the sse2neon project</a>.
When compiling for arm64, the x86-64 SSE <code class="language-plaintext highlighter-rouge"><xmmintrin.h></code> header is not available for use because every function in the <code class="language-plaintext highlighter-rouge"><xmmintrin.h></code> header maps to a specific SSE instruction or group of SSE instructions, and there’s no sense in the compiler trying to generate SSE instructions for a processor architecture that SSE instructions don’t even work on.
However, the function definitions for each intrinsic are just function definitions, and the sse2neon project reimplements every SSE intrinsic function with a Neon implementation under the hood.
So, using sse2neon, code originally written for x86-64 using SSE intrinsics can be compiled without modification on arm64, with Neon instructions generated from the SSE intrinsics.
A number of large projects originally written on x86-64 now have arm64 ports that utilize sse2neon to support vectorized code without having to completely rewrite using Neon intrinsics; as discussed in <a href="https://blog.yiningkarlli.com/2021/07/porting-takua-to-arm-pt2.html">my previous Takua on ARM post</a>, this approach is the exact approach that was taken to port <a href="https://www.embree.org">Embree</a>
to arm64.</p>
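<p>Conceptually, the trick behind sse2neon looks something like the following highly simplified sketch; this is not the actual sse2neon source, which additionally wraps everything in the type-reinterpretation helpers discussed a bit further below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <arm_neon.h>

// On arm64, the SSE types and intrinsic names are redefined in terms of Neon equivalents.
typedef float32x4_t __m128;

static inline __m128 _mm_set1_ps(float w)           { return vdupq_n_f32(w); }
static inline __m128 _mm_max_ps(__m128 a, __m128 b) { return vmaxq_f32(a, b); }
static inline __m128 _mm_min_ps(__m128 a, __m128 b) { return vminq_f32(a, b); }
</code></pre></div></div>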
<p>The sse2neon project was originally started by John W. Ratcliff and a few others at NVIDIA to port a handful of games from x86-64 to arm64; the original version of sse2neon only implemented the small subset of SSE that was needed for their project.
However, after the project was posted to GitHub with an MIT license, other projects found sse2neon useful and contributed additional extensions that eventually fleshed out full coverage for MMX and all versions of SSE from SSE1 all the way through SSE4.2.
For example, Syoyo Fujita’s <a href="https://github.com/lighttransport/embree-aarch64">embree-aarch64 project</a>, which was the basis of Intel’s official Embree arm64 port, resulted in a number of improvements to sse2neon’s precision and faithfulness to the original SSE behavior.
Over the years sse2neon has seen contributions and improvements from NVIDIA, Amazon, Google, the Embree-aarch64 project, the Blender project, and recently Apple as part of Apple’s larger slew of contributions to various projects to improve arm64 support for Apple Silicon.
Similar open-source projects also exist to further generalize SIMD intrinsics headers (<a href="https://github.com/simd-everywhere/simde">simde</a>), to reimplement the AVX intrinsics headers using Neon (<a href="https://github.com/kunpengcompute/AvxToNeon">AvxToNeon</a>), and Intel even has a project to do the reverse of sse2neon: reimplement Neon using SSE (<a href="https://github.com/intel/ARM_NEON_2_x86_SSE">ARM_NEON_2_x86_SSE</a>).</p>
<p>While learning about Neon and while looking at how Embree was ported to arm64 using sse2neon, I started to wonder how efficient using sse2neon versus writing code directly using Neon intrinsics would be.
The SSE and Neon instruction sets don’t have a one-to-one mapping to each other for many of the more complex higher-level instructions that exist in SSE, and as a result, some SSE intrinsics that compiled down to a single SSE instruction on x86-64 have to be implemented on arm64 using many Neon instructions.
As a result, at least in principle, my expectation was that on arm64, code written directly using Neon intrinsics should typically have at least a small performance edge over SSE code ported using sse2neon.
So, I decided to do a direct comparison in my test program, which required implementing the 4-wide ray-box intersection test using Neon:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inline uint32_t neonCompareAndMask(const float32x4_t& a, const float32x4_t& b) {
uint32x4_t compResUint = vcleq_f32(a, b);
static const int32x4_t shift = { 0, 1, 2, 3 };
uint32x4_t tmp = vshrq_n_u32(compResUint, 31);
return vaddvq_u32(vshlq_u32(tmp, shift));
}
void rayBBoxIntersect4Neon(const Ray& ray,
const BBox4& bbox4,
IVec4& hits,
FVec4& tMins,
FVec4& tMaxs) {
FVec4 rdir(vdupq_n_f32(1.0f) / ray.direction.f32x4);
/* since Neon doesn't have a single-instruction equivalent to _mm_shuffle_ps, we just take
the slow route here and load into each float32x4_t */
FVec4 rdirX(vdupq_n_f32(rdir.x));
FVec4 rdirY(vdupq_n_f32(rdir.y));
FVec4 rdirZ(vdupq_n_f32(rdir.z));
FVec4 originX(vdupq_n_f32(ray.origin.x));
FVec4 originY(vdupq_n_f32(ray.origin.y));
FVec4 originZ(vdupq_n_f32(ray.origin.z));
IVec4 near(int(rdir.x >= 0.0f ? 0 : 3), int(rdir.y >= 0.0f ? 1 : 4),
int(rdir.z >= 0.0f ? 2 : 5));
IVec4 far(int(rdir.x >= 0.0f ? 3 : 0), int(rdir.y >= 0.0f ? 4 : 1),
int(rdir.z >= 0.0f ? 5 : 2));
tMins =
FVec4(vmaxq_f32(vmaxq_f32(vdupq_n_f32(ray.tMin),
(bbox4.corners[near.x].f32x4 - originX.f32x4) * rdirX.f32x4),
vmaxq_f32((bbox4.corners[near.y].f32x4 - originY.f32x4) * rdirY.f32x4,
(bbox4.corners[near.z].f32x4 - originZ.f32x4) * rdirZ.f32x4)));
tMaxs = FVec4(vminq_f32(vminq_f32(vdupq_n_f32(ray.tMax),
(bbox4.corners[far.x].f32x4 - originX.f32x4) * rdirX.f32x4),
vminq_f32((bbox4.corners[far.y].f32x4 - originY.f32x4) * rdirY.f32x4,
(bbox4.corners[far.z].f32x4 - originZ.f32x4) * rdirZ.f32x4)));
uint32_t hit = neonCompareAndMask(tMins.f32x4, tMaxs.f32x4);
hits[0] = bool(hit & (1 << (0)));
hits[1] = bool(hit & (1 << (1)));
hits[2] = bool(hit & (1 << (2)));
hits[3] = bool(hit & (1 << (3)));
}
</code></pre></div></div>
<div class="codecaption">Listing 8: Neon version of the compact Williams et al. 2005 implementation.</div>
<p>Even if you only know SSE and have never worked with Neon, you should already be able to tell broadly how the Neon implementation in Listing 8 works!
Just from the name alone, <code class="language-plaintext highlighter-rouge">vmaxq_f32()</code> and <code class="language-plaintext highlighter-rouge">vminq_f32()</code> obviously correspond directly to <code class="language-plaintext highlighter-rouge">_mm_max_ps()</code> and <code class="language-plaintext highlighter-rouge">_mm_min_ps()</code> in the SSE implementation, and understanding how the ray data is being loaded into Neon’s 128-bit registers using <code class="language-plaintext highlighter-rouge">vdupq_n_f32()</code> instead of <code class="language-plaintext highlighter-rouge">_mm_set1_ps()</code> should be relatively easy too.
However, because there is no fancy single-instruction shuffle intrinsic available in Neon, the way the ray data is loaded is potentially slightly less efficient.</p>
<p>The largest area of difference between the Neon and SSE implementations is in the processing of the tMin and tMax results to produce the output <code class="language-plaintext highlighter-rouge">hits</code> vector.
The SSE version uses just two intrinsic functions because SSE includes the fancy high-level <code class="language-plaintext highlighter-rouge">_mm_cmple_ps()</code> intrinsic, which compiles down to a single <code class="language-plaintext highlighter-rouge">CMPPS</code> SSE instruction, but implementing this functionality using Neon takes some more work.
The <code class="language-plaintext highlighter-rouge">neonCompareAndMask()</code> helper function implements the <code class="language-plaintext highlighter-rouge">hits</code> vector processing using four Neon intrinsics; a better solution may exist, but for now this is the best I can do given my relatively basic level of Neon experience.
If you have a better solution, feel free to let me know!</p>
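<p>To make the helper a little easier to follow, here is a step-by-step trace through <code class="language-plaintext highlighter-rouge">neonCompareAndMask()</code> for a hypothetical input where the first and third boxes are hit and the other two are missed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// vcleq_f32(a, b)              -> { 0xFFFFFFFF, 0x0, 0xFFFFFFFF, 0x0 }  per-lane all-ones / all-zeros mask
// vshrq_n_u32(compResUint, 31) -> { 1, 0, 1, 0 }                        reduce each lane to a single bit
// vshlq_u32(tmp, {0, 1, 2, 3}) -> { 1, 0, 4, 0 }                        shift each lane's bit into its position
// vaddvq_u32(...)              -> 5, i.e. 0b0101                        horizontal add packs the final bitmask
</code></pre></div></div>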
<p>Here’s how the native Neon intrinsics implementation performs compared with using sse2neon to translate the SSE implementation.
For an additional point of comparison, I’ve also included the Rosetta 2 SSE result from the previous section.
Note that the speedup column for Rosetta 2 here isn’t comparing how much faster the SSE implementation running over Rosetta 2 is with the compact scalar implementation running over Rosetta 2; instead, the Rosetta 2 speedup columns here compare how much faster (or slower) the Rosetta 2 runs are compared with the <em>native</em> arm64 compact scalar implementation:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">arm64:</th>
<th style="text-align: center">arm64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup over Native:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">41.8187 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">0.5157x</td>
</tr>
<tr>
<td style="text-align: right">SSE:</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
<td style="text-align: center">13.6353 ns</td>
<td style="text-align: center">3.0669x</td>
</tr>
<tr>
<td style="text-align: right">SSE2NEON:</td>
<td style="text-align: center">12.3090 ns</td>
<td style="text-align: center">3.3974x</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
</tr>
<tr>
<td style="text-align: right">Neon:</td>
<td style="text-align: center">12.2161 ns</td>
<td style="text-align: center">3.4232x</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
</tr>
</tbody>
</table>
<p>I originally also wanted to include a test that would have been the reverse of sse2neon: use Intel’s <a href="https://github.com/intel/ARM_NEON_2_x86_SSE">ARM_NEON_2_x86_SSE</a> project to get the Neon implementation working on x86-64.
However, when I tried using ARM_NEON_2_x86_SSE, I discovered that the project isn’t quite complete enough yet (as of the time of writing) to actually compile the Neon implementation in Listing 8.</p>
<p>I was very pleased to see that both of the native arm64 implementations ran faster than the SSE implementation running over Rosetta 2, which means that my native Neon implementation is at least halfway decent and also that sse2neon works as advertised.
The native Neon implementation is also just a hair faster than the sse2neon implementation, which indicates that at least here, rewriting using native Neon intrinsics instead of mapping from SSE to Neon does indeed produce slightly more efficient code.
However, the sse2neon implementation is very, very close in terms of performance, to the point where the difference may well be within the margin of error.
Overall, both of the native arm64 implementations get a respectable speedup over the compact scalar reference, even though the speedup amounts are a bit less than the ideal 4x.
I think that the slight performance loss compared to the ideal 4x is probably attributable to the more complex solution required for filling the output <code class="language-plaintext highlighter-rouge">hits</code> vector.</p>
<p>To better understand why the sse2neon implementation performs so close to the native Neon implementation, I tried just copy-pasting every single function implementation out of sse2neon into the SSE 4-wide ray-box intersection test.
Interestingly, the result was extremely similar to my native Neon implementation; structurally they were more or less identical, but the sse2neon version had some additional extraneous calls.
For example, instead of replacing <code class="language-plaintext highlighter-rouge">_mm_max_ps(a, b)</code> one-to-one with <code class="language-plaintext highlighter-rouge">vmaxq_f32(a, b)</code>, sse2neon’s version of <code class="language-plaintext highlighter-rouge">_mm_max_ps(a, b)</code> is <code class="language-plaintext highlighter-rouge">vreinterpretq_m128_f32(vmaxq_f32(vreinterpretq_f32_m128(a), vreinterpretq_f32_m128(b)))</code>.
<code class="language-plaintext highlighter-rouge">vreinterpretq_m128_f32</code> is a helper function defined by sse2neon to translate an input <code class="language-plaintext highlighter-rouge">__m128</code> into a <code class="language-plaintext highlighter-rouge">float32x4_t</code>.
There’s a lot of reinterpreting of inputs to specific float or integer types in sse2neon; all of the reinterpreting in sse2neon is to convert from SSE’s generic <code class="language-plaintext highlighter-rouge">__m128</code> to Neon’s more specific types.
In the specific case of <code class="language-plaintext highlighter-rouge">vreinterpretq_m128_f32</code>, the reinterpretation should actually compile down to a no-op since sse2neon typedefs <code class="language-plaintext highlighter-rouge">__m128</code> directly to <code class="language-plaintext highlighter-rouge">float32x4_t</code>, but many of sse2neon’s other reinterpretation functions do require additional extra Neon instructions to implement.</p>
<p>Even though the Rosetta 2 result is definitively slower than the native arm64 results, the Rosetta 2 result is far closer to the native arm64 results than I normally would have expected.
Rosetta 2 usually can be expected to perform somewhere in the neighborhood of 50% to 80% of native performance for compute-heavy code, and the Rosetta 2 performance for the compact scalar implementation lines up with this expectation.
However, the Rosetta 2 performance for the vectorized version lends further credence to the theory from the previous section that Rosetta 2 somehow is better able to translate vectorized code than scalar code.</p>
<p><strong>Auto-vectorized Implementation</strong></p>
<p>The unfortunate thing about writing vectorized programs using vector intrinsics is that… vector intrinsics can be hard to use!
Vector intrinsics are intentionally fairly low-level, which means that when compared to writing normal C or C++ code, using vector intrinsics is only a half-step above writing code directly in assembly.
The vector intrinsics APIs provided for SSE and Neon have very large surface areas, since a large number of intrinsic functions exist to cover the large number of vector instructions that there are.
Furthermore, unless compatibility layers like sse2neon are used, vector intrinsics are not portable between different processor architectures in the same way that normal higher-level C and C++ code is.
Even though I have some experience working with vector intrinsics, I still don’t consider myself even remotely close to comfortable or proficient in using them; I have to rely heavily on looking up everything using various reference guides.</p>
<p>One potential solution to the difficulty of using vector intrinsics is compiler <a href="https://en.wikipedia.org/wiki/Automatic_vectorization">auto-vectorization</a>.
Auto-vectorization is a compiler technique that aims to allow programmers to better utilize vector instructions without requiring programmers to write everything using vector intrinsics.
Instead of writing vectorized programs, programmers write standard scalar programs which the compiler’s auto-vectorizer then converts into a vectorized program at compile-time.
One common auto-vectorization technique that many compilers implement is loop vectorization, which takes a serial innermost loop and restructures the loop such that each iteration of the loop maps to one vector lane.
Implementing loop vectorization can be extremely tricky, since like with any other type of compiler optimization, the cardinal rule is that the originally written program behavior must be unmodified and the original data dependencies and access orders must be preserved.
Add in the need to consider all of the various concerns that are specific to vector instructions, and the result is that loop vectorization is easy to get wrong if not implemented very carefully by the compiler.
However, when loop vectorization is available and working correctly, the performance increase to otherwise completely standard scalar code can be significant.</p>
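<p>As a minimal sketch (not taken from the test program) of the kind of loop that a loop vectorizer targets: every iteration below is independent of the others and the trip count is known, so a compiler’s loop vectorizer can, assuming the pointers don’t alias, map the four scalar iterations onto 4-wide vector operations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <math.h>

void scaleMax4(const float* a, const float* b, const float* scale, float* out) {
    for (int i = 0; i < 4; i++) {
        out[i] = fmax(a[i], b[i]) * scale[i];
    }
}
</code></pre></div></div>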
<p>The 4-wide ray-box intersection test should be a perfect candidate for auto-vectorization!
The scalar implementations are implemented as just a single for loop that calls the single ray-box test once per iteration of the loop, for four iterations.
Inside of the loop, the ray-box test is fundamentally just a bunch of simple min/max operations and a little bit of arithmetic, which as seen in the SSE and Neon implementations, is the easiest part of the whole problem to vectorize.
I originally expected that I would have to compile the entire test program with all optimizations disabled, because I thought that with optimizations enabled, the compiler would auto-vectorize the compact scalar implementation and make comparisons with the hand-vectorized implementations difficult.
However, after some initial testing, I realized that the scalar implementations weren’t really getting auto-vectorized at all even with optimization level <code class="language-plaintext highlighter-rouge">-O3</code> enabled.
Or, more precisely, the compiler was emitting long stretches of code using vector instructions and vector registers… but the compiler was just utilizing one lane in all of those long stretches of vector code, and was still looping over each bounding box separately.
As a point of reference, here is <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGISWakrgAyeAyYAHI%2BAEaYxBJcAMykAA6oCoRODB7evv6BaRmOAqHhUSyx8VxJtpj2xQxCBEzEBDk%2BfgG19VlNLQSlkTFxCckKza3teV3j/YPllaMAlLaoXsTI7BzmiXhUANRYNOHoEAD6Z6oAHABsZzeSF0smGgCCO2HI3lj7JoluqhYLDCBGIYQAdAg/tgXu8zIk6ntDphjphThcmC1kAgHk9YR8GF8vD8/m4WiwzuEBJDofj4a49nTdoTvphfv8WEwCAgaYlsPtfm9YeNiF4HPsAGIANUwyEkvwA7FY3gKBV4GFlFRYBQB6HX7RRs9Wa/jEfYEACeKTZKXVGqM%2BwuLC4Ziu%2B0M6H2VFoqC5iTMqkeBCZSKOYTR50ut3ujzOzxVqsTTpdbudrr%2ByrhCNooZR4fRZ0xGxxsfjr0Tie9voI/sDZwIXtrkgzTIZVFhFdVIrFDZMSo7nYrVa5%2B1ULYTg9Vw4bFvH5cnU59I4AXnOF4vq/sAO5rwd9gAiu8708OXKYJgArBZmxfD4lM4mD3OBwLpbLJBAlvsQKOIBpwRoVArPsFp/gBQGkPsy5gYBwFbjBQFak%2BQrwrmqIFtcdy4nGL6SjKcqRmmbpUJIX4/kR1CkUhCqHih2ZofmkZFti2Flomb4EdOTb1l6VE/lQTaUV%2BfZWDRrYMPg7ZChOeHvtQS4NqoZyQSeFrKV6ClQepJ5bnG36/kpwGgWpwHQcucaQfBunCf2YkyRxH4nkpKmaWpLmbuZZEGRZIEQCZkFmT58H/rB1G0e8Mknqg1rEFyJCXlYt4QCC%2Bx4F%2BaAMOMWr7MQmAEOsDCns0CV4Jed7ash84CtO5g3Ps0VxHFxAJWVyWCKlNnarl%2BXEIV6BniVZUZoq4W4Q59UxU11gQBlWUObV%2BzRJ1OV5QVskEaovyWIt4KqJBFpbdq0TghaAWHTty6Wedx1bs894jWN%2BHyg1sVEMQAC0M0CHNT0LUt2XdWtDkQJt707XtIH7GDx2nVBUMXVd0Pgrdw2VexT0TY1b0AFRfZlDY1WYdX/SJK09YVwObdji37fs1PRGd9NXfTd0VXZVXrc9k1vTqeNZYTxPLYDvWcyD%2Bz6gzkMS2d0vbuLi2sw9Qpife%2BJvGEObhJzmOvSQvOzQTCkLUwkEG5zf2dbhwsU09EBMPLx0Q/bEsnZBzsI%2B6DvI6z%2BLs8KoI9vsACSGMibhxoCEhMmJt24ph9Hg4pWOqsJ52KWzinHOTilq6Z%2BuAopTued7irD6J%2B1/XFVeN7lbhT6Z7hIdyV5qh/kZbcBR325t2Fje20n6np4P7XmZBhd6T%2BreGftfk%2BYFcEQNZvcyU3BED2P7VualI8T957f%2BVBECeZZPdh37MkpS9U1Xq1KVpfsZuk9bRXnlepW3qj7OJiCC1X29LVJTvkLVaItK6vwsO/cqSt3ilzVq8WODYLAWFQJteOHMI6FTQZOE8aBepxAUAlG4Q17qDj1NvLAqg0TugUL8K8wIGAAA19j0IAJrMLCAALWYUwVQTDOSqDYfwjhQ1U4aU3Lg8IxAFCvFoMGK8ZghpXkSMQ7U4t9RhAoVQpgNCJFxASvQgA8sQAAsjwxRFgGEsOER/GS9dMy4SQSgvmDZ5pE3YdkEgkjTbfRcb9Nx/CPB4OIJbURuipEyLkRYDQ5jom3nZPuL09CID0MCZI3akEAmeLiLtH2oSsnhNkQlWJh4rxcDKvExJYRklhFSdk2GmSgknVyVnVUYTpGFKvMUhKCi4l/ASVQJJKT8ngkutw1QtTiAjOaZONpESEplI/p08pfTEk8OqR4xpEMGlpNUNMwcsyOkWAWSUo5yzEj9P4esiZrsxnXItHszsBzInHO6Wci5ayhmbIyTw65uyjzQMeu%2BdxEzPwPx8QDEBNs5JtKKWVbxQSCGlLhWChFrzbyK0qoCuUtz8mgsfkqMmQNbYwqUcikl14yX5MRRYC8ZUMV%2B1gdJeBAdxQACUmAHSwaqca%2BBcoOCyP88aJA8DADCP8k8BBjFiuLtVTSkqzENxkuy0CZtXF1V5bKBo6lVV%2BLqsK0VDBtJyqlYa9yI55WGVwomH8Gr%2BUCAgLarVwF9VVJdYa4CkqqmevdZBC1EALV6TPqNRlEVXjRFQJ4HKHLHGqCDoIfBmqRBiBaB4FgKQmAOGcfsZVC1YqwzNjGv64aIYC3NCas1wY3EWpCRzcaxBeUVK4OBeWebwSOoFTK4OGNwgtDagQCA9a8CTNQXyFZIUDh/AlPsDQ%2BlEjARBAO3lJ12QwnOdO5tk79hcH0qRUgVry79sHZM5cK6x0bsSFOsw%2BkLxLAeV2oFVBe0LqPbtU9a7x3sinYkfSGh52CEXUO5dtJ33nqnfKH8XAVj7rTv%2Bl9J7gMJI/Zui8%2BkzC3ufDJb1FSBlrJw63Vt3rIIQGiMW8E5Ke3DvKWDVtbqcl0xyku3Ze7RHYMucR0j5HMAtBOlRqNFpwS0fufRl9p1oP5wFOxlBZGqUJQoyM3jNGwQGqmcJpdy50OdotdhpJAyqkEdMRDSTqhpOoqvI%2ByjcTqMcoE0piEuzVOAaY2J1jVSjMmckdS8zPHLN8ZsyKiEQnqYieYy08TEmSNSfJV51cPnFP%2BYYCpoLamNNlzCxWK2kKy1hHZG4FZFrnzn1eAAN1QHgT0eaY1xoIAmhwkgk0GGIKm9NmazY5rcXmkLaWuv7PBYWtxEXVAaE691kbAoC3ILHP14tXBhuje6%2BNlBRaUGBGc3NmZvWJtLdUMkVba3Oyr2bG4hAhAFCzb2%2BJtVWXMpnfO%2BuS7FqFA1sTMdgg1LikVL7UsCrE2qs1eEKIBrTWM2Ho5ZBAbQ2rtveRQ92F6L/kvepccj7IIvvRp%2B/GqRiaAcptQGm4HA7QeLWm76k1iPocGbJ3DztCO0UJJWZ977KDfuY4cPVnHePM0daJ8tknYRqU9P3CTnh/O6Xw5OwlZRvS10M7R0zjHShWfY8a7j5rIPYYDeSJDiX5Phfa6p/Y9mHAVi0E4BeXgfgOBaFIKgTguXLDWH2AoNYGw2Q7B4KQAgmgjcrAANYgAvENk3HBJDm699bzgvAFAgCG57y3RvSBwFgEgNAaa6BxHIJQFPKQ0/xC%2BIYYA1QNAzZoLI/BlBohh%2BiGEFoFpODu5T2wQQBiGC0Fr3H0gWBORGHEO39tRXMBR/b5gShyAvDVbr7wEEdQw85miLFYgFoPBYDD6CPALAJ8rG9EwYACgpR4EwFuAx1oLfu/4IIer7ApAyEEIaNQYfdAzYMEYFA1hrD6DwNEKPkAVj
RQaIP96BiiQkedQmqWQLgEkUwfgXACoQQEk8wwwVQCoNwqQ6QmQAgkBEgMBhQaBDA8BFQIw0ByBdgoBAgfQkwngHQmB3QJBjQEwAw4YCwBBSBtgdBGB0BLBcwDBCBmBNwKwTu6wmwegoImAWwPAxupuoe7eNuHAmE70DwD8T%2BwAW6Ta/426EAduVglgkEuAhAJAW0iQM2%2BwTWOe%2BhkGvAseWgt6pAfuAe%2BgnAIepAFuVu0hke0eHuXuVhQeZgkhzhEe7hceVh/eUiYBkgQAA">the x86-64 compiled output</a> and <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGISWakrgAyeAyYAHI%2BAEaYxBJcAMykAA6oCoRODB7evv6BaRmOAqHhUSyx8VxJtpj2xQxCBEzEBDk%2BfgG19VlNLQSlkTFxCckKza3teV3j/YPllaMAlLaoXsTI7BzmiXhUANRYNOHoEAD6Z6oAHABsZzeSF0smGgCCO2HI3lj7JoluqhYLDCBGIYQAdAg/tgXu8zIk6ntDphjphThcmC1kAgHk9YR8GF8vD8/m4WiwzuEBJDofj4a49nTdoTvphfv8WEwCAgaYlsPtfm9YeNiF4HPsAGIANUwyEkvwA7FY3gKBV4GFlFRYBQB6HX7RRs9Wa/jEfYEACeKTZKXVGqM%2BwuLC4Ziu%2B0M6H2VFoqC5iTMqkeBCZSKOYTR50ut3ujzOzxVqsTTpdbudrr%2ByrhCNooZR4fRZ0xGxxsfjr0Tie9voI/sDZwIXtrkgzTIZVFhFdVIrFDZMSo7nYrVa5%2B1ULYTg9Vw4bFvH5cnU59I4AXnOF4vq/sAO5rwd9gAiu8708OXKYJgArBZmxfD4lM4mD3OBwLpbLJBAlvsQKOIBpwRoVArPsFp/gBQGkPsy5gYBwFbjBQFak%2BQrwrmqIFtcdy4nGL6SjKcqRmmbpUJIX4/kR1CkUhCqHih2ZofmkZFti2Flomb4EdOTb1l6VE/lQTaUV%2BfZWDRrYMPg7ZChOeHvtQS4NqoZyQSeFrKV6ClQepJ5bnG36/kpwGgWpwHQcucaQfBunCf2YkyRxH4nkpKmaWpLmbuZZEGRZIEQCZkFmT58H/rB1G0e8Mknqg1rEFyJCXlYt4QCC%2Bx4F%2BaAMOMWr7MQmAEOsDCns0CV4Jed7ash84CtO5g3Ps0VxHFxAJWVyWCKlNnarl%2BXEIV6BniVZUZoq4W4Q59UxU11gQBlWUObV%2BzRJ1OV5QVskEaovyWIt4KqJBFpbdq0TghaAWHTty6Wedx1bs894jWN%2BHyg1sVEMQAC0M0CHNT0LUt2XdWtDkQJt707XtIH7GDx2nVBUMXVd0Pgrdw2VexT0TY1b0AFRfZlDY1WYdX/SJK09YVwObdji37fs1PRGd9NXfTd0VXZVXrc9k1vTqeNZYTxPLYDvWcyD%2Bz6gzkMS2d0vbuLi2sw9Qpife%2BJvGEObhJzmOvSQvOzQTCkLUwkEG5zf2dbhwsU09EBMPLx0Q/bEsnZBzsI%2B6DvI6z%2BLs8KoI9vsACSGMibhxoCEhMmJt24ph9Hg4pWOqsJ52KWzinHOTilq6Z%2BuAopTued7irD6J%2B1/XFVeN7lbhT6Z7hIdyV5qh/kZbcBR325t2Fje20n6np4P7XmZBhd6T%2BreGftfk%2BYFcEQNZvcyU3BED2P7VualI8T957f%2BVBECeZZPdh37MkpS9U1Xq1KVpfsZuk9bRXnlepW3qj7OJiCC1X29LVJTvkLVaItK6vwsO/cqSt3ilzVq8WODYLAWFQJteOHMI6FTQZOE8aBepxAUAlG4Q17qDj1NvLAqg0TugUL8K8wIGAAA19j0IAJrMLCAALWYUwVQTDOSqDYfwjhQ1U4aU3Lg8IxAFCvFoMGK8ZghpXkSMQ7U4t9RhAoVQpgNCJFxASvQgA8sQAAsjwxRFgGEsOER/GS9dMy4SQSgvmDZ5pE3YdkEgkjTbfRcb9Nx/CPB4OIJbURuipEyLkRYDQ5jom3nZPuL09CID0MCZI3akEAmeLiLtH2oSsnhNkQlWJh4rxcDKvExJYRklhFSdk2GmSgknVyVnVUYTpGFKvMUhKCi4l/ASVQJJKT8ngkutw1QtTiAjOaZONpESEplI/p08pfTEk8OqR4xpEMGlpNUNMwcsyOkWAWSUo5yzEj9P4esiZrsxnXItHszsBzInHO6Wci5ayhmbIyTw65uyjzQMeu%2BdxEzPwPx8QDEBNs5JtKKWVbxQSCGlLhWChFrzbyK0qoCuUtz8mgsfkqMmQNbYwqUcikl14yX5MRRYC8ZUMV%2B1gdJeBAdxQACUmAHSwaqca%2BBcoOCyP88aJA8DADCP8k8BBjFiuLtVTSkqzENxkuy0CZtXF1V5bKBo6lVV%2BLqsK0VDBtJyqlYa9yI55WGVwomH8Gr%2BUCAgLarVwF9VVJdYa4CkqqmevdZBC1EALV6TPqNRlEVXjRFQJ4HKHLHGqCDoIfBmqRBiBaB4FgKQmAOGcfsZVC1YqwzNjGv64aIYC3NCas1wY3EWpCRzcaxBeUVK4OBeWebwSOoFTK4OGNwgtDagQCA9a8CTNQXyFZIUDh/AlPsDQ%2BlEjARBAO3lJ12QwnOdO5tk79hcH0qRUgVry79sHZM5cK6x0bsSFOsw%2BkLxLAeV2oFVBe0LqPbtU9a7x3sinYkfSGh52CEXUO5dtJ33nqnfKH8XAVj7rTv%2Bl9J7gMJI/Zui8%2BkzC3ufDJb1FSBlrJw63Vt3rIIQGiMW8E5Ke3DvKWDVtbqcl0xyku3Ze7RHYMucR0j5HMAtBOlRqNFpwS0fufRl9p1oP5wFOxlBZGqUJQoyM3jNGwQGqmcJpdy50OdotdhpJAyqkEdMRDSTqhpOoqvI%2ByjcTqMcoE0piEuzVOAaY2J1jVSjMmckdS8zPHLN8ZsyKiEQnqYieYy08TEmSNSfJV51cPnFP%2BYYCpoLamNNlzCxWK2kKy1hHZG4FZFrnzn1eAAN1QHgT0eaY1xoIAmhwkgk0GGIKm9NmazY5rcXmkLaWuv7PBYWtxEXVAaE691kbAoC3ILHP14tXBhuje6%2BNlBRaUGBGc3NmZvWJtLdUMkVba3Oyr2bG4hAhAFCzb2%2BJtVWXMpnfO%2BuS7FqFA1sTMdgg1LikVL7UsCrE2qs1eEKIBrTWM2Ho5ZBAbQ2rtveRQ92F6L/kvepccj7IIvvRp%2B/GqRiaAcptQGm4HA7QeLWm76k1iPocGbJ3DztCO0UJJWZ977KDfuY4cPVnHePM0daJ8tknYRqU9P3CTnh/O6Xw5OwlZRvS10M7R0zjHShWfY8a7j5rIPYYDeSJDiX5Phfa6p/Y9m
HAVi0E4BeXgfgOBaFIKgTguXLDWH2AoNYGw2Q7B4KQAgmgjcrAANYgAvENk3HBJDm699bzgvAFAgCG57y3RvSBwFgEgNAaa6BxHIJQFPKQ0/xHJEVq470viGGANUDQM2aCyPwZQaIYfohhBaBaTg7uU9sEEAYhgtBG9x9IFgTkRhxDd/bUVzAUfu%2BYEocgLw1Wm%2B8BBHUMPOZoixWIBaDwWAw%2BgjwCwGfKxvRMGAAoKUeBMBbgMdaC37v%2BCCHq%2BwKQMhBCGjUGH3QM2DBGBQNYaw%2Bg8DRCj5AFY0UDQo%2B70BiiQkedQmqWQLgEkUwfgM2IQ4YCwIwM2hQmQAgsBegqBDQ8wwwVQ3QkBAgfQkwngHQegdgBBjQEwAwiBuBZBVBGBM2swrQOBFQyBKwTu6wmwegoImAWwPAxupuoe3eNuHAeeBe9sReDo1QAE4I26EAduVglgkEuAhAJAW0iQM2%2BwTWOe6hkGvAseWgt6pAfuAe%2BgnAIepAFuVuIhke0eHuXuRhQeZgQh1hEe9hceRhw%2BUiUBkgQAA">the arm64 compiled output</a> for the compact scalar implementation.</p>
<p>Finding that the auto-vectorizer wasn’t really working on the scalar implementations led me to try to write a new scalar implementation that would auto-vectorize well.
To try to give the auto-vectorizer as good of a chance as possible at working well, I started with the compact scalar implementation and embedded the single-ray-box intersection test into the 4-wide function as an inner loop.
I also pulled apart the implementation into a more expanded form where every line in the inner loop carries out a single arithmetic operation that can be mapped exactly to one SSE or Neon instruction.
I also restructured the data input to the inner loop to be in a readily vector-friendly layout; the restructuring is essentially a scalar implementation of the vectorized setup code found in the SSE and Neon hand-vectorized implementations.
Finally, I put a <code class="language-plaintext highlighter-rouge">#pragma clang loop vectorize(enable)</code> in front of the inner loop to make sure that the compiler knows that it can use the loop vectorizer here.
Putting all of the above together produces the following, which is as auto-vectorization-friendly as I could figure out how to rewrite things:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void rayBBoxIntersect4AutoVectorize(const Ray& ray,
const BBox4& bbox4,
IVec4& hits,
FVec4& tMins,
FVec4& tMaxs) {
float rdir[3] = { 1.0f / ray.direction.x, 1.0f / ray.direction.y, 1.0f / ray.direction.z };
float rdirX[4] = { rdir[0], rdir[0], rdir[0], rdir[0] };
float rdirY[4] = { rdir[1], rdir[1], rdir[1], rdir[1] };
float rdirZ[4] = { rdir[2], rdir[2], rdir[2], rdir[2] };
float originX[4] = { ray.origin.x, ray.origin.x, ray.origin.x, ray.origin.x };
float originY[4] = { ray.origin.y, ray.origin.y, ray.origin.y, ray.origin.y };
float originZ[4] = { ray.origin.z, ray.origin.z, ray.origin.z, ray.origin.z };
float rtMin[4] = { ray.tMin, ray.tMin, ray.tMin, ray.tMin };
float rtMax[4] = { ray.tMax, ray.tMax, ray.tMax, ray.tMax };
IVec4 near(int(rdir[0] >= 0.0f ? 0 : 3), int(rdir[1] >= 0.0f ? 1 : 4),
int(rdir[2] >= 0.0f ? 2 : 5));
IVec4 far(int(rdir[0] >= 0.0f ? 3 : 0), int(rdir[1] >= 0.0f ? 4 : 1),
int(rdir[2] >= 0.0f ? 5 : 2));
float product0[4];
#pragma clang loop vectorize(enable)
for (int i = 0; i < 4; i++) {
product0[i] = bbox4.corners[near.y][i] - originY[i];
tMins[i] = bbox4.corners[near.z][i] - originZ[i];
product0[i] = product0[i] * rdirY[i];
tMins[i] = tMins[i] * rdirZ[i];
product0[i] = fmax(product0[i], tMins[i]);
tMins[i] = bbox4.corners[near.x][i] - originX[i];
tMins[i] = tMins[i] * rdirX[i];
tMins[i] = fmax(rtMin[i], tMins[i]);
tMins[i] = fmax(product0[i], tMins[i]);
product0[i] = bbox4.corners[far.y][i] - originY[i];
tMaxs[i] = bbox4.corners[far.z][i] - originZ[i];
product0[i] = product0[i] * rdirY[i];
tMaxs[i] = tMaxs[i] * rdirZ[i];
product0[i] = fmin(product0[i], tMaxs[i]);
tMaxs[i] = bbox4.corners[far.x][i] - originX[i];
tMaxs[i] = tMaxs[i] * rdirX[i];
tMaxs[i] = fmin(rtMax[i], tMaxs[i]);
tMaxs[i] = fmin(product0[i], tMaxs[i]);
hits[i] = tMins[i] <= tMaxs[i];
}
}
</code></pre></div></div>
<div class="codecaption">Listing 9: Compact scalar version written to be easily auto-vectorized.</div>
<p>How well is Apple Clang v12.0.5 able to auto-vectorize the implementation in Listing 9?
Well, looking at the output assembly <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGIAMwapK4AMngMmAByPgBGmMQSAOxcpAAOqAqETgwe3r4BQemZjgJhEdEscQlcybaY9iUMQgRMxAS5Pn6BdQ3Zza0EZVGx8UkpCi1tHfndEwNDFVVjAJS2qF7EyOwc5v54VADUWDQR6BAA%2BueqABwAbOe3kpfLJhoAgrvhyN5YByb%2BblULBY4QIxHCADoEP9sK8PmZ/PV9kdMCdMGdLkxWsgEI9nnDPgxvl5fv83K0WOcIgIoTCCQjXPt6XsiT9MH8ASwmAQELT/NgDn93nCJsQvA4DgAxABqmGQkj%2BiSs70Fgq8DGyiosgoA9DqDop2erNfxiAcCABPVLs1LqjVGA6XFhcMzXA6GdAHKi0VDc/xmVRPAjM5HHcLoi5XO4PJ7nF4q1WJp0ut3O13/ZXwxG0UOo8MY85Yza42Pxt6JxPe30Ef2B84EL21yQZ5mMqhwiuq0XihsmJUdzsVqvcg6qFsJweq4cNi3j8uTqc%2BkcALznC8X1YOAHc14O%2BwARXed6dHblMEwAVgszYvh/8mcTB7nA8FMrlkggywOIFHEA0EI0KhVgOC0/wAoDSAOZcwMA4CtxgoCtSfYUEVzNECxue48TjF8pVleVIzTN0qEkL8fyI6hSKQxJDxQ7M0PzSMixxbCy0TN8COnJt6y9KifyoJtKK/PsrBo1sGHwdthQnPD32oJcG1Uc5IJPC1lK9BSoPUk8tzjb9fyU4DQLU4DoOXONIPg3ThP7MSZI4j8TyUlTNLUlzN3MsiDIskCIBMyCzJ8%2BD/1g6jaI%2BGST1Qa1iG5EhLysW8IFBA48C/NAGAmLUDmITACA2BhTxaBK8EvO9tWQ%2BdBWncxbgOaL4ji4gErK5LBFSmztVy/LiEK9AzxKsqM0VcLcIc%2BqYqa6wIAyrKHNqg4Yk6nK8oK2SCNUP5LEWiFVEgi0tu1GIIQtALDp25dLPO46txee8RrG/CFQa2KiGIABaGaBDmp6FqW7LurWhyIE296dr2kCDjB47TqgqGLqu6GIVu4bKvYp6Jsat6ACovsyhsarMOr/pElaesK4HNuxxb9oOamYjO%2Bmrvpu6Krsqr1ueya3p1PGssJ4nlsB3rOZBg59QZyGJbO6Xt3FxbWYe4UxPvAl3nCHMIk5zHXpIXnZoJhSFqYSCDc5v7Otw4WKaeiAmHl46IftiWTsg52EfdB3kdZgl2ZFMEewOABJDGRNw40BCQmTE27CUw%2BjwcUrHVWE87FLZxTjnJxS1dM/XQUUp3PO9xVh9E/a/riqvG9ytwp9M9wkO5K81Q/yMtuAo77c27CxvbaT9T08H9rzMgwu9J/VvDP2vyfMCuCIGs3uZKbgiB7H9q3NSkeJ%2B89v/KgiBPMsnuw79mSUpeqar1alK0oOM3SetorzyvUrb1R9nE1BBar7elqkp3yFqtEWldX4WHfuVJWHxS5qzeLHBsFgLCoE2vHDmEdCpoMnCeNAvV4gKASrcIa91Bx6m3lgVQ6J3QKD%2BFeEEDAAAaBx6EAE1mHhAAFrMKYKoJhXJVBsP4RwoaqcNKblwREYgCg3i0GDFeMwQ0rz%2BGIdqcW%2BpwgUKoUwGhEj4gJXoQAeWIAAWR4YoiwDCWHCI/jJeumZcJIJQXzBs80ibsJyCQSRptvouN%2Bm4/hHg8HEEtqI3RUiZFyIsBocx0Tbwcn3F6ehEB6GBMkbtSCATPHxF2j7UJWTwmyISrEw8V4uBlXiYk8IyTwipOybDTJQSTq5KzqqMJ0jClXmKQlBRcT/gJKoEklJ%2BSISXW4aoWpxARnNMnG0iJCUykf06eUvpiSeHVI8Y0iGDS0mqGmYOWZHSLALJKUc5Z/h%2Bn8PWRM12YzrkWj2Z2A5kTjndLORctZQzNkZJ4dc3ZR5oGPXfO4iZn4H4%2BIBiAm2ck2lFLKt4oJBDSlwrBQi15t5FaVUBfKW5%2BTQWPyVGTIGtsYVKORSS68ZL8mIosBeMqGK/awOkvAgOEpHGBijugjUkcsGDnGuSohH8SH5zISQLACR3EMIyeEFhUqGAcO%2BbwhVMqxnCNETgqlkojbyPMcoxZFLBWqLIRozAlDPTaJRZI/R4QjGmLHHqyx1iTkxBiCgxYeibEtOqppNpmrqxzKvAKk5Nd/l2Lgd/Bgmt2TJldNTT5kihBCGwKCp%2BkKtq3HJcUiERFP7hXDZGx05wiKxp%2BfkhNSbgHkzTeS3V%2B4s0phzXAz4hwwynCYtiEs%2BIL4RvDGIv0AYgyxpqfkyImABDJoJc/WqGayoQgEv2htXb81cX7fWYt4zh2joYOOrqqap1UoSjW2dTYF1ZjbGG1UGse0ADdUB4E9EoAgbK2oNhdaoCo3j8bm38UOoJH6frvgWts%2BIITPUWvwb67ksK9VdKvK%2BiobzKlbrjdkrZJbNkPIrD6rVUTzEvNg66nwCGBlVOQ5M%2BpaG0n3P%2BYmLD1YoMnJ6U6gjLAiODJ/Wk0ZQHJnLgwzRjV2G8M4b1XBwjvTzmrNbqR9JOL0PUdafxujSK9WCZEyxsT7zJPsbqQqu5vH5MIog888xjGEqqdYx8rTkzUPrtk8XQUmKZJso/GbNlxs/2IOQWONxksXOeYWsgdzBxXNuPQCBycD6n1BHdFmyzn43ZZoo/ET8enBQRc8xAFIO1SNxay4l4gyW5Opbyk%2BswpsYsbMkTl5ACWbOVeWClg4aWnH%2BEgugcrILgJta4wVuzALGURWZWKCUAAlJgB0eWvgxvgXKDhsj/PGiQPAwBwj/JPAQYxK3etrdtc%2BGSo3QJm1cXVabcpGjqUO34uqi3lsMG0ppdb4Q7ubnWzwnCoifwndm2Oz7Z3gLXaqf927wEHtbpBz5F7rcId6TPqNfrcIyEeBYKkJgEoFCiAMGaK9%2BDNRbnBAQAgjBzSoEWuyTA2i6AHSYF4IgWOHCLeXOiOEN6705TG2yoOgh8GnckG8anqB8JvTwAz5xBx9sLViqdXC%2BdpcLl8yg5s3nX3SClzL1XiZV4K7qggQgChSAq7V2ro75oNuZT16Ig30ujcQ4UGFjcI5iDTYPQh0mXBwLywlxCH72RpOu8Au7sbnu8AzcaDc33hx9Qe69zSZcD0tuaQd0HhhCUa4VKfo7pZt5IIJ%2Bahn/cWf09CbzzlAvxTY9ly9ZubPLDk/O4nQX45%2Beg/zORdn5vmfi9N6Uwk0NkV4/Tesfq7v4m0%2Bd4sIxxvOex8t4L%2BPjvk/GNl9wlFcEN2k/V1r11APgPpMe%2B3xDXfK/IT7634fhgu1F%2B983ID6v6/1MAxP0tyEsMD%2BP7P8/h
/N2bkv8/%2BN/rlZNKA4D4p4rJPwf6QijLf7gFZ5gFn4QEwEjIX4cwnhtAm41536gEWgQgg7QGYHYGs64Em44FYEm6IH/6V4Q5oFD7d4ToB4Q5EF0H4HEE8L0G2qIF9xAoRCtDPoQCt654ciwjiYhSHD/CSgHAaD6T%2BDASgg8H15nICEJJCEciiFcD6SkRm6gblwEAyGj4L50iCFu4iEHBmD6QXj1b/Ia5ehcHSG8GF78ErKKGGH%2BD6QaBSGCDaGT7HJ2H6F%2B6GEKg/hcCrD66aHuFopD7yFiEGH%2BCiEXj6RmBmENyX4jipDEAYA9ixKD67ZZjJFMDABcgPwGAOg%2BjRQHC06C7C6MBMAxD0BsRTgkAHDPqpQVLRL3R4AchuAHDNgtHWDWC26djJGpEODpGQIVLOry4QjkqcFkbmLDFgzX6DQGpBEg7UrDErKjGBjjH7pXiTEjLTHlKzGn4D6QKFYHD9HoBpHzFUEnEpFnGDEXF0xz434QIqKLEm7LEIZLF3HUzZ6HHPGiKnHnFvzmatz/G3GAnt4fFgn7gNYQlPF35rGSAbGopbFk5Wa7FxL7Gv5r6wm1yiIwkrHiZ4nlJfHTZYlHG9aJiEl34DJrIoEraQmQSUlQnHGMkVLUnAnXEAnYkMmvEXHNJBEgkEBDEIbwmImSLUpUCtAnRokJIYk3aPFknl4VjW53GrFK6in4IJQSncbSnwyAEXHHEClCl36GmfEPH6nkmqjKmQkVJWnYn3HfHmmKmJgmnWkrLEZboulcnG48JvHorMm2q%2BmXEinkpam7Q6mynhCkm/Gga2n4kJKxlElz5RkLG4kBkqnibuk8EUH0nemqCBnQlpmukZlJKemQLck%2Bm8mZGTja4ECBk2k8lFluArIJkpkcyYo0QcCrC0CcAXi8B%2BAcBaCkCoCcBNmWDWCNbrCbDsi7A8CkAECaCdmrAADWIAF4QQ3ZHAkgfZC5Q5nAvACgIAQQ85A5nZpAcAsASAaASOdA8Q5AlAV5qQN5CQ3whgwAXAXAGgKQNAsi%2BClAMQO5MQ4QrQFonAs5V5bAggBiEaIFJ5pAWAXIRg4gsFXuWOB5sFJqco1O2ws5oI9QO5OYMQsUxAFoHgWAO5YIeALAoFp53oORCg0oeAmAW4Bi1o/Zs5/AggIgYg7AUgMggghoagO5ugKQBRxg3Rlg%2BgeAMQB5kAqw0UjQaF70Bi/g%2B59Qp22QLgEk0wfgCIwQEkCwIwCQgQiQaQGQWQAg2lIAulRQ5lDABllQowxlPQ6lAg/QUwngnQVlLWdgLlTQkwgw4YbqRlGgJlcw7leQOl3l/l9lSwxlqwCgk5WwegYImA2Fp5G5vZpA/Zg5w5HAmE70jw%2BRr5Bw75AEEIKhEAo5VgElBwuAhAdRM5kEiOj59AZoM5ywvAx5Wg9WpAK5a5%2BgnAW5WVO5uV%2B5h5c5C5PVG5Zg25sFo1E1J5PVWOUiGlkgQAA%3D%3D">on x86-64</a> and <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxCAAnGakAA6oCoRODB7evnrJqY4CQSHhLFEx8baY9vkMQgRMxASZPn5cFVXptfUEhWGR0XEJCnUNTdmtQ109xaUDAJS2qF7EyOwc5gDMeFQA1Fg0IegQAPpHqgAcAGxHF5InsyYaAIIbwcjeWNsm626qLCzBBGIwQAdAgvtgHs8zOtKltdph9phDicmPVkAgbndIS8GG8vB8vm56iwjiEBKDwdjoa4tlTNrj3phPt8WEwCAgKetsNtPk9IUNiF4HNsAGIANUwyEknwA7FYnjyeV4GOlZRYeQB6DXbRRM5Wq/jEbYEACeiSZiWVKqM2xOLC4ZjO20M6G2VFoqDZ6zMqluBDpcL2wSRx1Ol2utyO9wVitjdodTvtjq%2B8qhMNogYRweRR1Rywxkejj1jsfdnoI3t9RwIbsrkhTdJpVEhJcVAqFNZMcpbrZLZbZ21UDZjvcV/ZrJuHxdHY49A4AXlOZ7Py9sAO5L3tdgAim9b492bKYJgArBZ6yfd%2BtU7Gd1OezzxZLJBBZtsQIOIBpgRoqPNtiaX4/n%2BpDbPOQG/v%2Ba4QX%2Bap3ny0KZoiObnFcmJRg%2BooSlKoZJk6VCSG%2BH54dQhFwTKu4IemSHZqGebouhRaxk%2BOHjnW1ZumRH5UHWpFvl2VgUY2DD4M2fIjlhz7UHONaqEcoEHia8lujJYHKQea5Ru%2Bn5yf%2BgFKf%2B4HzlGoHQZp/HdkJEksS%2BB5yQpqlKQ5q7GUROkmQBEAGaBRkedB36QeRlHPBJB6oOaxBsiQp5WJeEAAtseBvmgDBDGq2zEJgBBLAwh51DFeCnle6rwdOPLjuYFzbOF0RRcQMVFfFgiJRZ6qZdlxC5egR4FUVKaysFmE2dVEV1dYEApWlNmVdsEStRlWU5ZJOGqJ8lizcCqigSaa3qhEwImj5u0bfOpnHfta73NeA1Ddh0o1ZFRDEAAtBNAhTXdM1zel7VLTZECrc9G1bQB2xA/th1gWDJ1neDwKXf1pXMXdI21U9ABUb2pTWFVmFV30CQtHW5f9q3o7N23bOTERHdTZ3U1dJVWWVy33aNT0aljaW4/j82/Z1rMA9s2o06DItHeL67C7NjM3XyQnXtiTzBBmISs6jj0kJzk04zJM1MKBOus19rWYfzJN3RATDS/tIPWyLB2gfbMPOjb8OM9izP8oCHbbAAkijAmYfqAhwRJsbtsKQfh72CVDorMetglk4Jyzo4JYuqfLjyCUblnW4KzesfNd1%2BVnhexWYXeqeYQHUluaoX56U3Pkt%2BuTdBbXltx8pye981xmgbnWkfo3unbV5Hm%2BVBEDmZ3El1zhPdD81TmJQPI/uc33lgRArmmR3QdexJCUPWNZ6NQlSXbEbhPm3lx5noVl6I8zsYAjNZ9PQ1cVX3zi0C1Lo/Cwz9ipy2eIXJWjxI41gsBYVAq1o4sxDrlJBo4DxoE6tEBQMULh9Wur2LU68sCqCRM6BQnwzz/AYAADW2NQgAmvQ4IAAtehTBVB0NZKoJh3CWF9UTipVcmCQjEAUI8Wg/ozxmD6medY%2BD1TC21MEEhZCmAUJEdEGK1CADyxAACyHDZEWBoQw/hL8JLV1TJhOBCCuY1mmnjZhGQSCiMNu9Bxn0nHcI8Fg4gptBGaLERIqRFgNDGPCZeZk243TUIgNQ3xojNqgR8a46Im0PaBLScEyRMVIm7jPFwIq0TYnBHicERJ6TIapL8QdTJadFRBPEbks8%2BSYoyKiV8GJVA4kJOycCU67DVCVOIAM%2Bpo4mkhJikUl%2BrTildNiRw8pLjakgxqUk1Q4zeyTJaRYGZBS9nz
PWN07hyyRmOyGeck0WzWw7NCfs9pRyTlLL6aslJHDzmbL3OA26z5nEjNfDfDxP0AEWykk0vJRV3F%2BJwYUqFQKYWPMvLLUqvypSXOyYC2%2BcoiZ/UthCuR8KCXniJdk2FFgTxFRRV7SB4loE%2B2FLY30YdkEqlDmg3sw1iV4JfgQ7ORCSBYBiM4mhKTggMLFQwFh7zOEyolUM/hgiMFkpFHraRxj5GzJJbyxRRCVGYFIa6dRCLRHaOCHowxQ4tWmPMQciIEQEHTC0RYhp5VVJNNVeWKZZ4eUHIrt8qxUCXg7CDAcOiaICxYhPgwVWTI2I%2Bj9OTV5ojQiYAEICu%2BoK1oXGJfk4EPEE2v2Cu/GNwYhFegTdWJNHzsmpvTf/Ym2biWau3PmusRbhKiSgSW2N2wABuqA8CuiUAQJlTUawOtUMUdx2NjbeIqdkmdH1nwzXWdEAJrqTXYM9WySFWq2lnkncUJ5pSGBnP6Wsmtqybklg9WqsJxiHmHsdT4E9PSynJqqTKq5N7Yx3vLHug5HS7UvpYG%2B3pC7amDLXaM%2Bcv7GkqvvU%2Bh9Wqj2vs6ccxZjdP2jMvcMi98GeT/t3XCrVyG0NgYw887DkGknVKvXRwjW6xE7vucY4DMUKPgZebR9JeHPk3tRRJJlL4jZMv1ku2B8ChxONFmJ6TM1kCSe2OJpx6AN2jhHWOjQTtgQ4dfLpmDr4mNaekxAVoG19P/n2kZ2YJmspjoSDfPTvHiAGec7Z%2Bzo6zPrFAugFzKzRHuf855gNNLK50pgdsAASkwHaHLHwo3wJlBw6RvnDRIHgYAwRvkHgIPonL%2Bc3Wrny0YmuElYuASNo4qqyXJTVGUtVrxVVMvZYYOpVS%2BXggdZK5ajCgiPx1dS%2BmobDX/ytbKRN9r/4utntmx5UrjdFtaSPoNWlIVHhEI8CwRITBhQKFEAYI0fbsGqjXECAgBBGDGlQLNJkmB1F0B2kwLwRATsOEy/OJEkIB1DoynFplftBDYPq5IR4r3UDYSengL79iYtxZmpFQ6mFs6o5nPJhB9ZZOTukCjtH%2BPYyLyx1VBAhAFCkDxwTgnNXjQFdShTwRVPUc08WwoDTK4BzEGSzFFtJTCZcGAtLJHwJRvpGSdsAXv4hdxZF3gFL1QLmS52NqYXovyTzhukV8tNYudy5oTFCufOcW6/qnMy8oETeAYt9zs325rdy8A5rouxXOfJYYQbk9d8beHPNxlb3%2Bz7em593bv3DvSMxMDaFVSJvzHaoj5hr3YeLDAcD0ikPlv1W%2B4z8n%2BZ63SyqSm/r8unvjcy6m%2BL4X5eQaV6BG1ivZfa8gkQXnjnNYpvu%2BL1Rn6DessgkhjX3vDALkD7r/3nvo%2BneYTCo3qVHuu933HyCQZI%2Bl8W8X0P5f6%2BBmT6j6uBodO5/x4j6Xk0wJZtr9P%2Bf/7l%2B6cX7P3TnfLMDz76MZ3o/3eb8cLv4t7/lrf8cMfy7j%2BRCHqHHQgGz3yWZAhEwwCh2C%2BBFG2A0G0nWH/ABHAP9yOWgJiVgOZAQK4G0kIgZ03WLgIHQKT2AygIWRwPgO2DMG0hPDs2%2BSJzdFALQIgMwKoMFxoPWG0g0FQMEDIKD32UoJgK4PWAQOlA/C4HmEpxIMELTxEOwLEIQJPG0jMEYPKyf1UkSGIAwA7EiTj3vCoh0KYGAFZBvgMBtA9HCn7Xq0%2B0wAgEYCYAiHoCYjHBIG2HHUShKXCWujwGZDcG2HrD8OsGsHZ1bB0L0IcAMNARKXtUx2BGJRANGUnC1ViKBnb16h1VkNm3JViIWXiN9ESLJRimSIGWMXSOqhn1j1AW%2BVjEiPQH0KyPfwaKaKfmKXJhNw7xAQURyLpzyJPVyOaKplD2IBqN6MEVaOiOGIWR6SWSmIIBiPhSGPaORTqMVBWJ6K70KMkGKMRTPDKKtQOUqML2aPWJ5E2PyMw0uI6NGKLy2Ii03RuK7zmMbn3xy1WJD2eO3CY2%2BJKVeIgAWKWN92%2BPqVkKBJmMwx2L2NEXJSoHqAOgqOKQyJn26NqK11jFZ0hJiWhOJXhNgyRKiRRMH3GOyMmN0MaOmM%2BJKQhOpM6LdzOIxI2MtQGK7yxLpNGNJMeNHFpIeP%2BLiV5NAVAnZIeN%2BJZOxNmhxxhOwRinxM2kJJiWJLa3uPROdxLBFKuJiQ1NuJNxVImKePFOpNmLiRfytQeOFMNNFPONpw4VZPf3fTPUFOWMtNATBMEVJwIDtJKT%2BMJAWW1LJJZlRQog4HmFoE4BPF4D8A4C0FIFQE4DcFCPWgUEWGWCZA2B4FIAIE0BDPmAAGsQATwdMwyOBJBIzszYzOBeAFAQAdMszoyQzSA4BYAkA0Ads6BohyBKBWzEh2yYhiQ%2B0zhno3hDBgAuAuANBWgaBJFsFKAIhyyIhgh6gTROAMzWy2BBAdEY1lz6zSAsBWQjBxAdzRcTtqydyDVJRXtVgMyARKhyyMwIhIpiATQPAsByzAQ8AWAVyGz3RTCFAxQ8BMA1wdFzQoyMz%2BBBARAxB2ApAZBBBdQ1ByzdBWhLDjBEybB7zqzIB5hwpqhTznodF1gqzKh6t0gXARJRgWhSBAhgwnUYhWhcg0gBAKKcgUhGKGApg%2Bg6K2gSKBBOgRhPBmg9A7AeKahhhugaLOKhKxLmLxgxKOKSh%2BhpCFglgVg9BARMAryGziyIzSAoyYy4yOB%2BzBzrZhybQxyfxgR8CIAEzLBrBQJcBCAPD0zQJtsez6AjR0zZheA6ytA7NSB8zCz9BOBSzdLyyDKqyazMzsy/LiyzAyydzwqor6y/KTsxFSLJAgA%3D%3D%3D">on arm64</a>… the result is disappointing.
Much like with the compact scalar implementation, the compiler is in fact emitting nice long sequences of vector intrinsics and vector registers… but the loop is still getting unrolled into four repeated blocks of code where only one lane is utilized per unrolled block, as opposed to producing a single block of code where all four lanes are utilized together.
The difference is especially apparent when compared with the hand-vectorized <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxBJmAOykAA6oCoRODB7evnrJqY4CQSHhLFExXPG2mPb5DEIETMQEmT5%2BXJXV6XUNBIVhkdGxCQr1jc3ZbcPdvcWlgwCUtqhexMjsHOYAzHhUANRYNCHoEAD6x6oAHABsx5eSp3MmGgCCm8HI3lg7JhtuqiwswQIxGCADoEN9sI8XmYNlVtntMAdMEdTkwGsgELd7lDXgx3l5Pt83A0WMcQgIwRCcTDXNtqVs8R9MF8fiwmAQEJSNtgdl9nlDhsQvA4dgAxABqmGQki%2BcSsz15vK8DHSsosvIA9BqdopmcrVfxiDsCABPRLMxLKlVGHanFjlc47QzoHZUWiodkbMyqO4Eenw/bBZEnM5XG53Y4PBWKmN2h07e1mc7feXQ2G0AOIoMo45olaYiNRp4xmNuj0EL0%2B44EV2VyQp%2Bm0qhQkuKwXCmsmOUt1slsvsnaqBvR3uK/s1k3D4ujsfugcALynM9n5Z2AHcl72uwARTet8d7dlMEwAVgs9ZPu42qZjO6nPd5Eqlkggcx2IEHEA0II0VAWOxNL8fz/UgdnnIDf3/NcIL/NU735GFMyRHMLmuLFIwfMVJWlENE0dKhJDfD88OoQi4LiXcEPTJDsxDPMMXQosYyfHDxzratXTIj8qDrUi3y7KwKMbBh8GbfkRyw59qDnGtVGOUCDxNeTXRksDlIPNdI3fT85P/QClP/cD50jUDoM0/juyEiSWJfA85IU1SlIc1djKInSTIAiADNAoyPOg79IPIyiXgkg9UHNYh2RIU8rEvCBAR2PA3zQBhhjVHZiEwAhlgYQ96hivBTyvdV4OnXlx3MS4dnC6IouIGKiviwREos9VMuy4hcvQI8CqKlNZWCzCbOqiK6usCAUrSmzKp2CJWoyrKcsknDVC%2BSxZpBVRQJNNb1QiEETR83aNvnUzjv2tcHmvAahuwmUasiohiAAWgmgQpruma5vS9qlpsiBVuejatoAnYgf2w6wLBk6zvBkFLv60rmLukbaqegAqN7UprCqzCq76BIWjrcv%2B1b0dm7adnJiIjups7qaukqrLK5b7tGp6NSxtLcfx%2Bbfs61mAZ2bUadBkWjvF9dhdmxmbv5ITrxxZ5ggzEJWdRx6SE5yacZkmamFAnXWa%2B1rMP5km7ogJhpf2kHrZFg7QPtmGnRt%2BHGZxZmBSBDsdgASRRgTMP1AQ4IkmN2xFIPw97BKh0VmPWwSycE5Z0cEsXVPl15BKNyzrcFZvWPmu6/KzwvYrMLvVPMIDqS3NUL89KbnyW/XJugtry24%2BU5Pe%2Ba4zQNzrSP0b3Ttq8jzfKgiBzM7iS65wnuh%2BapzEoHkf3Ob7ywIgVzTI7oOvYkhKHrGs9GoSpKdiNwnzby48z0Ky9EeZmNARms%2BnoauKr75xaBal0fhYZ%2BxU5YvELkrJ4kcawWAsKgVa0cWYh1ykg0cB40CdWiAoGKlw%2BrXV7FqdeWBVDIidAoL4Z4AQMAABoJmCAATXoQwAAWgmJgqg6FslUEw7hLC%2BqJxUquTBIRiAKCeLQP0Z4zB9TPBsfB6phbamCCQshTAKEiOiDFahAB5YgABZDhsiLA0IYfwl%2BElq6pkwnAhBXMazTTxswjwWCyA33eg4z6TjuEuNEabQRmixESKkRYDQxiwmXhZNuV01CIDUN8dETaoEfEkFEZtD2ATUnYOCTFCJu4zxcCKlEmJwQ4nBAScQR27DVAVIOhktOipAniMkbk4xMjInfGiVQWJ8SsmVNOtU2p856mjiaTkgp4SimdJiRwspGQ%2BlJMGQs1QIzexjJaRMl%2BmzonTO6bM3priqkpMOSaVZrZ1khMKVsiw7SdkbC6dwuZtSQbHLSSsvc4DbrPmcX0187jsY/QARbKSTTWmXkNn0nB2yIWuKhTcoqstSpfOlEs1xfzb5yiJn9S2oK5FFRhaIuFFcCXYJiieBFr9BqQPEtAn2IpbE%2BjDsglUoc0G9mGriiweCX4EOzkQkgWAYjMJockxhorWHJI4SK6pDDJWqH4YIjBkLRR62kcY%2BR1yK79SUcQzApCXTqPca47RwQ9GGKHNc0x5j8kWAiBEBBMwtEWIaeVVSTSVXlnGVy4xWr868isVA9%2BDBVbMjjEmcmBzRFCCENgP5d8gVrUuJyvJII8KUswirIMtpjh4QjRwip0bY3/2JomzlGrtypodOmmlrxdiBkOHRdEBZsQn2DVmti3pfQRvKX00ImABBxsxffSqyaiogh4p26tLNM1qw7VWAgeaam9v7QwQdbUE0jshTFct466xTs2E2QNioZ3MgAG6oDwC6JQBAGVNRrPa1QxQIUAscVVSN0Rn0fWfDNV50R/EuqNYSj17IwU2ryTFB9xQpn3JKau99lSXn5uWWcks7rVWhOMVcm1kGfDQYeaU%2BDRykMnJQzGND5ZQMxVuRBh1uGOkwe6QRnthyBm/v6aRxpyr0NYco2eHDLA8MzMboRxDS7DnvL9ahrjFHtkxR43x2jAn6MPP2cxtJkM2N1I%2BWR6TIHZNquufxwTezhNqcSaJ55KGkUSQZS%2BI2DL9aftgfAocTjRb2ZczNZATmdgOacegf9o5r23o0E7VNZniCvjC2x18HHeTBZcxANoG14NRZS8R0RsXtNtiyreswhtwvzLRf%2BZAqaMvRCy5JiOuXEsbFAugQrFS0sNZi3MKzXtqUhVpUKEUAAlJgO02WPhRvgTKDh0gfOGiQPAwBggfIPAQfRc3JMLfNfeCS/XAJG1fXsPAY2ajKW214qq03ZsMHUqpRbwQLurkWxwjCgiPyjalDUCAz3xsCA8qd0p33zv/iu6ugHHk7uNxB1pI%2BVLK40qIYWnYp7sGqlQLsDkzI0AsESEwEUAB1OgGZWAUKyk6WgIIdhmA0BoE8iV0f0DYIIdkE3njnsvRlAbDK/aCGwS9yQhb7E7E2zNSKh1MLZ2zh5hB9Y3MPukMLkXM5F4S6qggQgChSAy9l%2By47xolupVV4I9XyNv1OJBwoQLiphrEFGycf4xxr1cGOIkBQSXgJvm1ILkE72aiVqTChjU5MvBKFtNbhQCAvBUDdJge3Kv1xK4xMaSKqUDAEEwBQogrtUhGHoOvGBqo1xK8zzm0ktvI/r3PbQeHFDjyJ30To8U0bjrW2D6H8PVMNTIplBbvbNCreF5D2H%2BgkeIAd8qXhUCQ%2BvfnFAscfR%2BjjhCAABJyFFKKAIsbQs7DXxvtrKHzejYYd3m3vfw8D7HyPjKo3x%2BT%2Bn
7PhfS%2BV9JdAslx/W/JsoyHyw/fje%2B8R4d4P8/p%2BT8OiX4z7z6L7L6xr5ak6gQQFmDP6SZTbAhnZd4F4H5N794/5u6/YX4s4mggiYGn5T7AE35gFfigQb6kGwFFzDbfK/Z77IGf5H7oEDa4EIGgj/5MF4GAG2hX4gG36xqP4P4P4UFt7VQsGsIf6H5oGO4YGiFYHSEzasGcEEHX6gF37QFQFQFCE0oxjy47AhANB3q/57abQsiQgwYBS7DfCijr7aQbD/iAiGGVKDbcjTLmEshWFcDaSES64AbFwEAOEgjzgmEuHARuGk7aQnhtYfI6FUD6H2Fj6ILOFmEhGWE7AbDaQaB2GCD%2BFOGmHRKuEpEygfhcALBq6%2BH%2BGBFUhJG/ihGU4fgwH1KYQA4ULTL/TIHcID6lFtH3Y/50FZR26ME4EA7/gQB2ri4gicp6EIZjp4TQy/Y0Lj5vjkxD7zF4QlGCJdFyQ/4jFS7jFbpniTEHTTHxhAw0ELFUxn57YMIX6lHZzbFjETGYANABFHFJizGiEsJnFLGjYfGrGRGSbG7FKtHW7UIdHrHAnXY9FB59HH5MEg7DGjE%2Bi7GwoxQxFTGXjj5vHyG0KfEXHEArEOhrEAZtEQmO53GImcqomHHokzEnGiFXGrHnFD70mAE3HLhkmSBImEoolPGLjUnHEiFYk/EEmMnfELF/HWKto1hK6dgwZQAeFEhEg7BkRAzFGJqB6kgsCoDw5sgKAADWA%2ByByA1O3%2BjuTRWBxuYpHy0pcKeSxS9qngEA0papSWLIbgipTc4pmE1pcmgm9ptAjphAzp8pPw7pxRnpEk3pBmdy0SfpAZnYTiLpCpPwOwEA9RKGkZFg5adpqADpTpM0iZIZyZEAth4ZECwUHACwtAnAJ4vAfgHAWgpAqAnAbplg1gOwCgSwKwzImwPApABAmgFZCwupIAJ4oWVZHAkgtZA5jZnAvACgIAoW/Z9ZFZpAcAsASAaOiQdAH6FAb06O25MQ7whgwAXAXAGgbQNAki2ClAEQ05EQwQDQJonAvZaOtOBAOiwaT5y5pAWAbIGeawDZHueA5e05eqUoXgSez5vAgIVQ05GYEQkUxAJoHgWA05QIeALAUFCwboTAwACg4oeAmAa4Oi5odZvZ/AggIgYg7AUgMggguoag05ugbQBgRgKA1g1g%2BgeAEQ85kACw4UNQ85HAz0OiGwc5VQL26QLgIkYwrQpAgQQYjqZQSQKQaQAgslOQqlNQ0w/QyldgklAgXQowngLQeg%2BlH2tQIwPQilulZlVlGlEwVlOlJQAwxRiwywqwegQImAawPAlZ1ZU535TZHAqEz0twN8rFwAOwp5P4IIHhEALZVglgoEuAhAJAa0GwyWHg%2B59ARoPZcwvAS5WgbWpAw5o5%2BgnAk5pAdZDZwVc5C5fZA5JV45ZggVtVs5jVy5JV8OYiUlkgQAA%3D">SSE compiled output</a> and the hand-vectorized <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxCAAnGakAA6oCoRODB7evnrJqY4CQSHhLFEx8baY9vkMQgRMxASZPn5cFVXptfUEhWGR0XEJCnUNTdmtQ109xaUDAJS2qF7EyOwc5gDMeFQA1Fg0IegQAPpHqgAcAGxHF5InsyYaAIIbwcjeWNsm626qLCzBBGIwQAdAgvtgHs8zOtKltdph9phDicmPVkAgbndIS8GG8vB8vm56iwjiEBKDwdjoa4tlTNrj3phPt8WEwCAgKetsNtPk9IUNiF4HNsAGIANUwyEknwA7FYnjyeV4GOlZRYeQB6DXbRRM5Wq/jEbYEACeiSZiWVKqM2xOLC4ZjO20M6G2VFoqDZ6zMqluBDpcL2wSRx1Ol2utyO9wVitjdodTvtjq%2B8qhMNogYRweRR1Rywxkejj1jsfdnoI3t9RwIbsrkhTdJpVEhJcVAqFNZMcpbrZLZbZ21UDZjvcV/ZrJuHxdHY49A4AXlOZ7Py9sAO5L3tdgAim9b492bKYJgArBZ6yfd%2BtU7Gd1OezzxZLJBBZtsQIOIBpgRoqPNtiaX4/n%2BpDbPOQG/v%2Ba4QX%2Bap3ny0KZoiObnFcmJRg%2BooSlKoZJk6VCSG%2BH54dQhFwTKu4IemSHZqGebouhRaxk%2BOHjnW1ZumRH5UHWpFvl2VgUY2DD4M2fIjlhz7UHONaqEcoEHia8lujJYHKQea5Ru%2Bn5yf%2BgFKf%2B4HzlGoHQZp/HdkJEksS%2BB5yQpqlKQ5q7GUROkmQBEAGaBRkedB36QeRlHPBJB6oOaxBsiQp5WJeEAAtseBvmgDBDGq2zEJgBBLAwh51DFeCnle6rwdOPLjuYFzbOF0RRcQMVFfFgiJRZ6qZdlxC5egR4FUVKaysFmE2dVEV1dYEApWlNmVdsEStRlWU5ZJOGqJ8lizcCqigSaa3qhEwImj5u0bfOpnHfta73NeA1Ddh0o1ZFRDEAAtBNAhTXdM1zel7VLTZECrc9G1bQB2xA/th1gWDJ1neDwKXf1pXMXdI21U9ABUb2pTWFVmFV30CQtHW5f9q3o7N23bOTERHdTZ3U1dJVWWVy33aNT0aljaW4/j82/Z1rMA9s2o06DItHeL67C7NjM3XyQnXtiTzBBmISs6jj0kJzk04zJM1MKBOus19rWYfzJN3RATDS/tIPWyLB2gfbMPOjb8OM9izP8oCHbbAAkijAmYfqAhwRJsbtsKQfh72CVDorMetglk4Jyzo4JYuqfLjyCUblnW4KzesfNd1%2BVnhexWYXeqeYQHUluaoX56U3Pkt%2BuTdBbXltx8pye981xmgbnWkfo3unbV5Hm%2BVBEDmZ3El1zhPdD81TmJQPI/uc33lgRArmmR3QdexJCUPWNZ6NQlSXbEbhPm3lx5noVl6I8zsYAjNZ9PQ1cVX3zi0C1Lo/Cwz9ipy2eIXJWjxI41gsBYVAq1o4sxDrlJBo4DxoE6tEBQMULh9Wur2LU68sCqCRM6BQnwzz/AYAADW2NQgAmvQ4IAAtehTBVB0NZKoJh3CWF9UTipVcmCQjEAUI8Wg/ozxmD6medY%2BD1TC21MEEhZCmAUJEdEGK1CADyxAACyHDZEWBoQw/hL8JLV1TJhOBCCuY1mmnjZhGQSCiMNu9Bxn0nHcI8Fg4gptBGaLERIqRFgNDGPCZeZk243TUIgNQ3xojNqgR8a46Im0PaBLScEyRMVIm7jPFwIq0TYnBHicERJ6TIapL8QdTJadFRBPEbks8%2BSYoyKiV8GJVA4kJOycCU67DVCVOIAM%2Bpo4mkhJikUl%2BrTildNiRw8pLjakgxqUk1Q4zeyTJaRYGZBS9
nzPWN07hyyRmOyGeck0WzWw7NCfs9pRyTlLL6aslJHDzmbL3OA26z5nEjNfDfDxP0AEWykk0vJRV3F%2BJwYUqFQKYWPMvLLUqvypSXOyYC2%2BcoiZ/UthCuR8KCXniJdk2FFgTxFRRV7SB4loE%2B2FLY30YdkEqlDmg3sw1iV4JfgQ7ORCSBYBiM4mhKTggMLFQwFh7zOEyolUM/hgiMFkpFHraRxj5GzJJbyxRRCVGYFIa6dRCLRHaOCHowxQ4tWmPMQciIEQEHTC0RYhp5VVJNNVeWKZZ4eUHIrt8qxUCXg7CDAcOiaICxYhPgwVWTI2I%2Bj9OTV5ojQiYAEICu%2BoK1oXGJfk4EPEE2v2Cu/GNwYhFegTdWJNHzsmpvTf/Ym2biWau3PmusRbhKiSgSW2N2wABuqA8CuiUAQJlTUawOtUMUdx2NjbeIqdkmdH1nwzXWdEAJrqTXYM9WySFWq2lnkncUJ5pSGBnP6Wsmtqybklg9WqsJxiHmHsdT4E9PSynJqqTKq5N7Yx3vLHug5HS7UvpYG%2B3pC7amDLXaM%2Bcv7GkqvvU%2Bh9Wqj2vs6ccxZjdP2jMvcMi98GeT/t3XCrVyG0NgYw887DkGknVKvXRwjW6xE7vucY4DMUKPgZebR9JeHPk3tRRJJlL4jZMv1ku2B8ChxONFmJ6TM1kCSe2OJpx6AN2jhHWOjQTtgQ4dfLpmDr4mNaekxAVoG19P/n2kZ2YJmspjoSDfPTvHiAGec7Z%2Bzo6zPrFAugFzKzRHuf855gNNLK50pgdsAASkwHaHLHwo3wJlBw6RvnDRIHgYAwRvkHgIPonL%2Bc3Wrny0YmuElYuASNo4qqyXJTVGUtVrxVVMvZYYOpVS%2BXggdZK5ajCgiPx1dS%2BmobDX/ytbKRN9r/4utntmx5UrjdFtaSPoNWlIVHgqzLV4AE3oOJkhcSwRI9RMCPBEoYhQABrex5aKyVv9E4g2QLZ3xqrA93mLLYw7cEOxGsaAjvRcwAoOQCUFl9reJgAAjkcAtVtQJzW%2BRMRwyBntpV2/d7YCgEBbE7JhwmOntgWac%2BsG6RXtjfbu2940R2Sl9qx8QaH7WvDejegDoHIPBCgXWFwG998%2B1MHQOgPt0PmdmAgHThAtARcs4IEd0CWOcd2fvMfR4RC625T7dg1UqAdjsiZP947woADqdAMysAoVlZ0tBgTbDMBoDQJ5EpHfoGwQQbI0tPAHUOjKcWmV%2B0ENg%2Brkh1c3cqzNSKh1MLZ2j/JhB9ZZOTukFH6Py5F7x6qtjggChSDJ5T6OGrxoCupRz4IvPJYC%2BLYUBpxUw1iDJfF%2BgLwiRGcw5Z1wYCb5tQR%2BBKN9IbaE03o1OTVIuImShGwDo0IuxUBA4YGADgNYEBME167EfwB6DPWCDA1UUOdv8/oM1Igto/hHCx14Kg7pMBHESNn9cTJtBeDSnUS7mBc968xx6NcGVFgECZAgaITILo2wc4roAIqA2wmATA6It2v2VMGoaK0odeeAxANCDeTeLesOSBuGSuZOteyWDCaBzepIreYuWBdSN6eByBLChBGBLOZBcGFBKMU2qBfajeRB7WmBcWwIU2GSjBfyU2BBrB6BxBnBJo3BQIbW5B6WTBEhrCNBIhdBXBPBDB94C8KMIQ9Q46EAZBiCXICyAUOwXwIo2wGg2k6w/4AI2hyWB0zIEImGBhzIxhXA2khEJem6xcBAVhyBAyth%2BhwEjhtu2kJ4OBRcPIaebomhlhOhvh9h/hRh2wJOH4GgFhggXhoy8WehsRv4AR0oH4PObheeUR1h84MRMSDh8RjuH4ZgIR3aios2FCXSuegsfa3C0OsOLRHCbRLOQh7BJB2hXBs28wTRZe2cEA9qcewIxKGhuGRU/ezKQMzBcxZE5MWBNCSxQxpeZeHRqgXRYuYxiekxZKMU0xB0sxBa8x1UshDADC6xVMGU%2BBSxBRIxzx%2BxExUxkBsGZxdY0MU2LCtxKxyWfx5xhENREkleJS/0LRwQux4u1CMJPRtBpBAxlqGx7hzx0erxvohxiKZ4VA9Qm0XxCaPxVxaxwJb4AJyBpJdYqJ6Jy4UJDAMJmJkg2Joi5KeJ6RhJFxAh/x9xyBNxwJTxtJLx4xWJxK7JAynJ0oCxVxQJ1JdxWBspA%2BoJG2X26OHEmeJSB2HgR2J2Z26AF2129RjxheHCCg6x3yme5K%2BSJSDqngEAGpM05mzIbghI2wTcypsYlp0yJ6tptA9phA2abpzhhIrp5mdmN6Xp6qVGs0qAdpDpTiTpIZ3wbp1RHpiokZFgLaNpsZfp8ZVUiZ3woZ5hypO4HA8wtAnAJ4vAfgHAWgpAqAnALplg1gmOiwywTIGwPApABAmgZZ8wl2IAJ4OmFZHAkg1ZvZ9ZnAvACgIAOmPZtZZZpAcAsASABudA0Q5AlAa59AMQxIfaZwz0bwhgwAXAXAGgrQNAki2ClAEQE5EQwQ9QJonAXZ/2ruBAOiMaT5C5pAWArIRg4g35vemuM535BqkoXgv%2Bz5vAAIlQE5GYEQkUxAJoHgWAE5gIeALAUF8w7oTAwACgYoeAmAa4Oi5oNZXZ/AggIgYg7AUgMggguoagE5ugrQBgRgKA1g1g%2BgeAEQM5kA8w4U1QIFz0Oi6w05lQ9W6QLgIkowLQpAgQwYTqMQrQuQaQAgMlOQKQqlDAUwfQSlbQElAgnQIwngzQegdgBlNQww3QClulZlVl6l4wVlOlJQ/Q%2BRCgbZKweggImAqwPA5ZlZ4535DZHAe5B51sR5Nop5P4wIzhEATZVglgoEuAhAJAa03OoE2piQ65RonZswvA85WgdmpAA5Q5%2BgnAY5pANZdZwV05s53ZvZRVI5ZggV1VU59VC5RVmuYiklkgQAA%3D">Neon compiled output</a>.</p>
<p>Here are the results of running the auto-vectorized implementation above, compared with the reference compact scalar implementation:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">x86-64:</th>
<th style="text-align: center">x86-64 Speedup:</th>
<th style="text-align: center">arm64:</th>
<th style="text-align: center">arm64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">44.5159 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">41.8187 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">1.0x.</td>
</tr>
<tr>
<td style="text-align: right">Autovectorize:</td>
<td style="text-align: center">34.1398 ns</td>
<td style="text-align: center">1.3069x</td>
<td style="text-align: center">38.1917 ns</td>
<td style="text-align: center">1.0950x</td>
<td style="text-align: center">59.9757 ns</td>
<td style="text-align: center">1.3521x</td>
</tr>
</tbody>
</table>
<p>While the auto-vectorized version certainly is faster than the reference compact scalar implementation, the speedup is far from the 3x to 4x that we’d expect from well vectorized code that was properly utilizing each processor’s vector hardware.
On arm64, the speed boost from auto-vectorization is almost nothing.</p>
<p>So what is going on here?
Why is the compiler failing so badly at auto-vectorizing code that has been explicitly written to be easily vectorizable?
The answer is that the compiler is in fact producing vectorized code, but since the compiler doesn’t have a more complete understanding of what the code is actually trying to do, the compiler can’t set up the data appropriately to really be able to take advantage of vectorization.
Therein lies what is, in my opinion, one of the biggest current drawbacks of relying on auto-vectorization: without a higher-level, more complete understanding of what the program is trying to do overall, the compiler can only do so much.
On top of that, working around the compiler’s limitations requires a deep understanding of how the auto-vectorizer is implemented internally.
Structuring code to auto-vectorize well also requires thinking ahead to what the vectorized output assembly should be, which is not too far from just writing the code using vector intrinsics to begin with.
At least to me, if achieving maximum possible performance is a goal, then all of the above actually amounts to <em>more</em> complexity than just directly writing using vector intrinsics.
However, that isn’t to say that auto-vectorization is completely useless; we still did get a bit of a performance boost!
I think that auto-vectorization is definitely better than nothing, and when it does work, it works well.
But, I also think that auto-vectorization is not a magic bullet perfect solution to writing vectorized code, and when hand-vectorizing is an option, a well-written hand-vectorized implementation has a strong chance of outperforming auto-vectorization.</p>
<p><strong>ISPC Implementation</strong></p>
<p>Another option exists for writing portable vectorized code without having to directly use vector intrinsics: <a href="https://ispc.github.io/">ISPC</a>, which stands for “Intel SPMD Program Compiler”.
The ISPC project was started and initially developed by Matt Pharr after he realized that the reason auto-vectorization tends to work so poorly in practice is because <em>auto-vectorization is not a programming model</em> <a href="https://pharr.org/matt/blog/2018/04/30/ispc-all">[Pharr 2018]</a>.
A programming model both allows programmers to better understand what guarantees the underlying hardware execution model can provide, and also provides better affordances for compilers to rely on for generating assembly code.
ISPC utilizes a programming model known as <a href="https://en.wikipedia.org/wiki/SPMD">SPMD</a>, or single-program-multiple-data.
The SPMD programming model is generally very similar to the <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads">SIMT</a> programming model used on GPUs (in many ways, SPMD can be viewed as a generalization of SIMT): programs are written as a serial program operating over a single data element, and then the serial program is run in a massively parallel fashion over many different data elements.
In other words, the parallelism in a SPMD program is implicit, but unlike in auto-vectorization, the implicit parallelism is also a <em>fundamental</em> component of the programming model.</p>
<p>Mapping to SIMD hardware, writing a program using a SPMD model means that the serial program is written for a single SIMD lane, and the compiler is responsible for multiplexing the serial program across multiple lanes <a href="https://doi.org/10.1109/InPar.2012.6339601">[Pharr and Mark 2012]</a>.
The difference between SPMD-on-SIMD and auto-vectorization is that with SPMD-on-SIMD, the compiler can know much more and rely on much harder guarantees about how the program wants to be run, as enforced by the programming model itself.
ISPC compiles a special variant of the C programming language that has been extended with some vectorization-specific native types and control flow capabilities.
Compared to writing code using vector intrinsics, ISPC programs look a lot more like normal scalar C code, and often can even be compiled as normal scalar C code with little to no modification.
Since the actual transformation to vector assembly is up to the compiler, and since ISPC utilizes LLVM under the hood, programs written for ISPC can be written just once and then compiled to many different LLVM-supported backend targets such as SSE, AVX, Neon, and even CUDA.</p>
<p>Actually writing an ISPC program is, in my opinion, very straightforward; since the language is just C with some additional builtin types and keywords, if you already know how to program in C, you already know most of ISPC.
ISPC provides vector versions of all of the basic types like <code class="language-plaintext highlighter-rouge">float</code> and <code class="language-plaintext highlighter-rouge">int</code>; for example, ISPC’s <code class="language-plaintext highlighter-rouge">float<4></code> in memory corresponds exactly to the <code class="language-plaintext highlighter-rouge">FVec4</code> struct we defined earlier for our test program.
ISPC also adds qualifier keywords like <code class="language-plaintext highlighter-rouge">uniform</code> and <code class="language-plaintext highlighter-rouge">varying</code> that act as optimization hints for the compiler by providing the compiler with guarantees about how memory is used; if you’ve programmed in GLSL or a similar GPU shading language before, you already know how these qualifiers work.
There are a variety of other small extensions and differences, all of which are well covered by the <a href="https://ispc.github.io/ispc.html">ISPC User’s Guide</a>.</p>
<p>The most important extension that ISPC adds to C is the <code class="language-plaintext highlighter-rouge">foreach</code> control flow construct.
Normal loops are still written using <code class="language-plaintext highlighter-rouge">for</code> and <code class="language-plaintext highlighter-rouge">while</code>, but the <code class="language-plaintext highlighter-rouge">foreach</code> loop is really how parallel computation is specified in ISPC.
The inside of a <code class="language-plaintext highlighter-rouge">foreach</code> loop describes what happens on one SIMD lane, and the iterations of the <code class="language-plaintext highlighter-rouge">foreach</code> loop are what get multiplexed onto different SIMD lanes by the compiler.
In other words, the contents of the <code class="language-plaintext highlighter-rouge">foreach</code> loop is roughly analogous to the contents of a GPU shader, and the <code class="language-plaintext highlighter-rouge">foreach</code> loop statement itself is roughly analogous to a kernel launch in the GPU world.</p>
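<p>To make the <code class="language-plaintext highlighter-rouge">uniform</code> and <code class="language-plaintext highlighter-rouge">varying</code> qualifiers and the <code class="language-plaintext highlighter-rouge">foreach</code> construct a bit more concrete before getting to the actual ray-box test, here is a tiny standalone sketch; this example is purely illustrative and is not part of my actual test program:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// 'uniform' values are shared by every SIMD lane; 'varying' values hold one
// value per lane. Inside a foreach loop, the loop index (and anything computed
// from it) is varying, so each iteration maps onto a different SIMD lane.
export void scaleAndBias(uniform float values[],
                         uniform int count,
                         uniform float scale,
                         uniform float bias) {
    foreach (i = 0...count) {
        varying float v = values[i];  // one array element per lane
        values[i] = v * scale + bias; // 'scale' and 'bias' are identical across lanes
    }
}
</code></pre></div></div>
<p>The body reads like plain scalar C, but each <code class="language-plaintext highlighter-rouge">foreach</code> iteration executes on a different SIMD lane, much like a tiny GPU kernel running over the array.</p>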
<p>Knowing all of the above, here’s how I implemented the 4-wide ray-box intersection test as an ISPC program.
Note how the actual intersection testing happens in the <code class="language-plaintext highlighter-rouge">foreach</code> loop; everything before that is setup:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef float<3> float3;
export void rayBBoxIntersect4ISPC(const uniform float rayDirection[3],
                                  const uniform float rayOrigin[3],
                                  const uniform float rayTMin,
                                  const uniform float rayTMax,
                                  const uniform float bbox4corners[6][4],
                                  uniform float tMins[4],
                                  uniform float tMaxs[4],
                                  uniform int hits[4]) {
    uniform float3 rdir = { 1.0f / rayDirection[0], 1.0f / rayDirection[1],
                            1.0f / rayDirection[2] };
    uniform int near[3] = { 3, 4, 5 };
    if (rdir.x >= 0.0f) {
        near[0] = 0;
    }
    if (rdir.y >= 0.0f) {
        near[1] = 1;
    }
    if (rdir.z >= 0.0f) {
        near[2] = 2;
    }
    uniform int far[3] = { 0, 1, 2 };
    if (rdir.x >= 0.0f) {
        far[0] = 3;
    }
    if (rdir.y >= 0.0f) {
        far[1] = 4;
    }
    if (rdir.z >= 0.0f) {
        far[2] = 5;
    }
    foreach (i = 0...4) {
        tMins[i] = max(max(rayTMin, (bbox4corners[near[0]][i] - rayOrigin[0]) * rdir.x),
                       max((bbox4corners[near[1]][i] - rayOrigin[1]) * rdir.y,
                           (bbox4corners[near[2]][i] - rayOrigin[2]) * rdir.z));
        tMaxs[i] = min(min(rayTMax, (bbox4corners[far[0]][i] - rayOrigin[0]) * rdir.x),
                       min((bbox4corners[far[1]][i] - rayOrigin[1]) * rdir.y,
                           (bbox4corners[far[2]][i] - rayOrigin[2]) * rdir.z));
        hits[i] = tMins[i] <= tMaxs[i];
    }
}
</code></pre></div></div>
<div class="codecaption">Listing 10: ISPC implementation of the compact Williams et al. 2005 implementation.</div>
<p>In order to call the ISPC function from our main C++ test program, we need to define a wrapper function on the C++ side of things.
When an ISPC program is compiled, ISPC automatically generates a corresponding header file, named after the ISPC program with “_ispc.h” appended.
This automatically generated header can be included by the C++ test program.
Using ISPC through CMake 3.19 or newer, ISPC programs can be added to any normal C/C++ project, and the automatically generated ISPC headers can be included like any other header; CMake takes care of placing the generated headers in the right location.</p>
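<p>As a rough sketch of what that CMake integration can look like (the project, target, and file names below are hypothetical and are not the actual build setup for my test program), with CMake 3.19 or newer the ISPC file can simply be listed as another source file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical CMakeLists.txt sketch; names are illustrative.
cmake_minimum_required(VERSION 3.19)
project(raybox LANGUAGES CXX ISPC)

add_executable(raybox
    main.cpp                # C++ test program; includes the generated *_ispc.h header
    rayBBoxIntersect4.ispc  # ISPC program; CMake runs the ispc compiler on this file
)
</code></pre></div></div>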
<p>Since ISPC is a separate language and since ISPC code has to be compiled as a separate object from our main C++ code, we can’t pass the various structs we’ve defined directly into the ISPC function.
Instead, we need a simple wrapper function that extracts pointers to the underlying basic data types from our custom structs, and passes those pointers to the ISPC function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void rayBBoxIntersect4ISPC(const Ray& ray,
                           const BBox4& bbox4,
                           IVec4& hits,
                           FVec4& tMins,
                           FVec4& tMaxs) {
    ispc::rayBBoxIntersect4ISPC(ray.direction.data, ray.origin.data, ray.tMin, ray.tMax,
                                bbox4.cornersFloatAlt, tMins.data, tMaxs.data, hits.data);
}
</code></pre></div></div>
<div class="codecaption">Listing 11: Wrapper function to call the ISPC implementation from C++.</div>
<p>Looking at the assembly output from ISPC <a href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1B4FAB2Skl9ZATwDKjdAGFUtAK4sGIAGylHAGTwGTAA5DwAjTGIQSQBWUlNUBUI7Bhd3Tx8EpJSBQOCwlkjouMtMa1sBIQImYgJ0jy9fK0wbVOragnzQiKiY%2BIUauobM5qGuoJ6ivriASktUN2Jkdg4AagIAT1NMLCo1qlpUJgIAUgBmJ3OL7AOjk%2BvzgCFTjQBBV7e177XMVUS6msAG6oPDoNbEJibJ5PVCqACSgiiShsknhQmUTggaAYgzWbgYeH4xBYd2OBAhUIAInhiK1KgxTrEntdYlTSJ8flzuTzeXyfji8QSiSRSYdyZTNgB5Yh4YBBJkspnszn8tXq7mCinC4li%2B4UyGbAAqAFkghz3hqreqtfjCbqySdJaamKoLV9rZ6ebadaLHRTwuE4ZI0MRgsQFIrvMrFZJle6vYmfr6Sf6NmbcbH46qk16U3qJQQTa7I8y42yE7m8/a/UEKQhCKWnuWqbM1qcAOwvS28/P%2B84Q/DEdvnKntrtrLgAOg0%2BwA9JKaXS2gJFRp45OZ/PF7T6alFVxsz2q2rp7O1gvDUu96vmQAmZXjqkXbsfY/cvt1tbBWqK1lji4AInc5SDWaQ1liJ8XxzbkiTWCBiCHKdVBHbBALWDQtzbTtXw1H9iDXR90PXZ4YK5Ttn3fLk4IQpDNlQ4isPHXD1Xwg8iNHSdoKon4KLIn4aMQ2kpwALwYzjMNnbCu34nk2PvDixwfUieO%2BPj3lku0RVTL8qF/Zl/xHICngw0CuFAu8oJUj0eUEpCUJuRipOYzSuT0gjmXXNkjLWR4WJ5dSbNg/ZaOE%2BjHIkpicNcn53PY7z0Ljay%2BUCvk7OEsSIrHSSqGk/z%2BTihSEs42JuKC8iO0ot9ytikhMCYZAEHgvAfMwqcp0kPKYu%2BIsgibPBFLWFhXQgYbVAQqFTXNeDA2DUNwybeSni8582UVAbvIAWklGU5QVTzlTbAAqQdhNUeZuqtMaoFm1QQxIBbFSWw81tep4NrHbbDV2%2BVGWZF7WzWE6hOIKdNkrE9uQgW77rDZEnvqjyngfV7Vvex8vqhH79uRw6gdO0GRNmWYyo1IsS3WwaWCCUaacNF03RmoM7vm%2BHmUK5aYzej61kx6VZV%2Bwi2WOgnkIu1TE2phgbuZ2HHvZ/SngBrn0a2naBZxgGRZBsGIchn5odl1mI0VDmUdWtGeb57G/tx4X8Z1omSeS9UGwIfrBt6zNmR5i4nHQ8nVA9taXYCyqYIojh5loThYl4LwOC0UhUE4ExzDWBRFmWTB2zvc4eFIAhNCj%2BYAGsQFiDR9E4SR4%2BL5POF4BQQCrovE6j0g4FgJA0BYUw6CichKF7/v6GiNPkC4LhvHMmhaAIZFKHCevwiCWpNk4Ave7YQQpQYWgN/b0gsGGoxxCPoc9yBTBm6Pv5WjcBfN94Otynr2g8HCSFiE2FwsHrggsoWDP3mIcJgwAFAADU8CYAAO5Sh2AnAu/BBAiDEOwKQMhBCKBUOoI%2BuhzIGCMCACe%2BhP7N0gPMVApgGS302lKAcm1NpDGAJgM4o4FBKFrmUCoqQHAMGcK4RoegAiTEKMUPQiRkgMhGF4cyUjcgMG6OIvo5kWgrgYB0YYQjMhqPKDeTR4xlG9GiGo8Ysi9CDE6MY6Ypj5iZyWCsPQgDMCrB4NHWOdcj4pw4BPTc08pxcGamYCwaxcCEBILnfOoEXB9wHsOU4ecuCzF4G3LQxNSDl0rtXDg3CE5Jx8U3Fuhdi4ZJjhwO8XiCmNxKe3DJ18Ix8MkEAA%3D">for x86-64 SSE4</a> and <a 
href="https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1B4FAB2Skl9ZATwDKjdAGFUtAK4sGIAGylHAGTwGTAA5DwAjTGIQSQBWUlNUBUI7Bhd3Tx8EpJSBQOCwlkjouMtMa1sBIQImYgJ0jy9fK0wbVOragnzQiKiY%2BIUauobM5qGuoJ6ivriASktUN2Jkdg4AagIAT1NMLCo1qlpUJgIAUgBmJ3OL7AOjk%2BvzgCFTjQBBV7e177XMVUS6msAG6oPDoNbEJibJ5PVCqACSgiiShsknhQmUTggaAYgzWbgYeH4xBYd2OBAhUIAInhiK1KgxTrEntdYlTSJ8flzuTzeXyfji8QSiSRSYdyZTNgB5Yh4YBBJkspnszn8tXq7mCinC4li%2B4UyGbAAqAFkghz3hqreqtfjCbqySdJaamKoLV9rZ6ebadaLHRTwuE4ZI0MRgsQFIrvMrFZJle6vYmfr6Sf6NmbcbH46qk16U3qJQQTa7I8y42yE7m8/a/UEKQhCKWnuWqbM1qcAOwvS28/P%2B84Q/DEdvnKntrtrLgAOg0%2BwA9JKaXS2gJFRp45OZ/PF7T6alFVxsz2q2rp7O1gvDUu96vmQAmZXjqkXbsfY/cvt1tbBWqK1lji4AInc5SDWaQ1liJ8XxzbkiTWCBiCHKdVBHbBALWDQtzbTtXw1H9iDXR90PXZ4YK5Ttn3fLk4IQpDNlQ4isPHXD1Xwg8iNHSdoKon4KLIn4aMQ2kpwALwYzjMNnbCu34nk2PvDixwfUieO%2BPj3lku0RVTL8qF/Zl/xHICngw0CuFAu8oJUj0eUEpCUJuRipOYzSuT0gjmXXNkjLWR4WJ5dSbNg/ZaOE%2BjHIkpicNcn53PY7z0Ljay%2BUCvk7OEsSIrHSSqGk/z%2BTihSEs42JuKC8iO0ot9ytikhMCYZAEHgvAfMwqcp0kPKYu%2BIsgibPBFLWFhXQgYbVAQqFTXNeDA2DUNwybeSni8582UVAbvIAWklGU5QVTzlTbAAqQdhNUeZuqtMaoFm1QQxIBbFSWw81tep4NrHbbDV2%2BVGWZF7WzWE6hOIKdNkrE9uQgW77rDZEnvqjyngfV7Vvex8vqhH79uRw6gdO0GRNmWYyo1IsS3WwaWCCUaacNF03RmoM7vm%2BHmUK5aYzej61kx6VZV%2Bwi2WOgnkIu1TE2phgbuZ2HHvZ/SngBrn0a2naBZxgGRZBsGIchn5odl1mI0VDmUdWtGeb57G/tx4X8Z1omSeS9UGwIfrBt6zNmR5i4nHQ8nVA9taXYCyqYIojh5loThYl4LwOC0UhUE4ExzDWBRFmWTB2zvc4eFIAhNCj%2BYAGsQFiDR9E4SR4%2BL5POF4BQQCrovE6j0g4FgJA0BYUw6CichKF7/v6GiNPkC4LhvHMmhaAIZFKHCevwiCWpNk4Ave7YQQpQYWgN/b0gsGGoxxCPoc9yBTBm6Pv5WjcBfN94Otynr2g8HCSFiE2FwsHrggsoWDP3mIcJgwAFAADU8CYAAO5Sh2AnAu/BBAiDEOwKQMhBCKBUOoI%2BuhzIGCMCACe%2BhP7N0gPMVApgGS302lKAcm1NpDGAJgM4o5ggCCbuUG8XgICOBGF4cyARJiFGKHoRIyQGSCIkTkBk3QxF9HMi0FcDAOjDFcI0PQKiGTqImAUXo0RlHjBkcYzoCjDESHmJnJYKw9CAMwKsHg0dY51yPinDgE9NzTynFwZqZgLBrFwIQEgud86gRcH3Aew5Th5y4LMXgbctDE1IOXSu1cOC11IAnJOHim4t0LsXFJMcOB3jcbkxuhT24pOvhGVIMQgA%3D%3D%3D">for arm64 Neon</a>, things look pretty good!
The contents of the <code class="language-plaintext highlighter-rouge">foreach</code> loop have been compiled down to a single straight run of vectorized instructions, with all four lanes filled beforehand.
Comparing ISPC’s output with the compiler output for the hand-vectorized implementations, the core of the ray-box test looks very similar between the two, while ISPC’s output for all of the precalculation logic actually seems slightly better than the output from the hand-vectorized implementation.</p>
<p>Here is how the ISPC implementation performs, compared to the baseline compact scalar implementation:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">x86-64:</th>
<th style="text-align: center">x86-64 Speedup:</th>
<th style="text-align: center">arm64:</th>
<th style="text-align: center">arm64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">44.5159 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">41.8187 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">1.0x.</td>
</tr>
<tr>
<td style="text-align: right">ISPC:</td>
<td style="text-align: center">8.2877 ns</td>
<td style="text-align: center">5.3835x</td>
<td style="text-align: center">11.2182 ns</td>
<td style="text-align: center">3.7278x</td>
<td style="text-align: center">11.3709 ns</td>
<td style="text-align: center">7.1317x</td>
</tr>
</tbody>
</table>
<p>The performance from the ISPC implementation looks really good!
Actually, on x86-64, the ISPC implementation’s performance looks <em>too good to be true</em>: at first glance, a 5.3835x speedup over the compact scalar baseline implementation shouldn’t be possible since the maximum expected possible speedup is only 4x.
I had to think about this result for a while; I think the explanation for this apparently better-than-possible speedup is that the setup and the actual intersection test parts of the 4-wide ray-box test need to be considered separately.
The actual intersection part is the part that is an apples-to-apples comparison across all of the different implementations, while the setup code can vary significantly both in how it is written and in how well it can be optimized across different implementations.
The reason for the above is that the setup code is inherently more scalar.
I think that the reason the ISPC implementation has an overall more-than-4x speedup over the baseline is that in the baseline implementation, the scalar setup code is not getting much out of the <code class="language-plaintext highlighter-rouge">-O3</code> optimization level, whereas the ISPC implementation’s setup code is both getting more out of ISPC’s <code class="language-plaintext highlighter-rouge">-O3</code> optimization level and is additionally just better vectorized on account of being ISPC code.
A data point that lends credence to this theory is that when Clang and ISPC are both forced to disable all optimizations using the <code class="language-plaintext highlighter-rouge">-O0</code> flag, the performance difference between the baseline and ISPC implementations falls back to a much more expected multiplier below 4x.</p>
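<p>As a purely illustrative sanity check (the symbols here are made up for the sake of the argument and were not measured separately): suppose the scalar baseline spends S nanoseconds on setup and T nanoseconds per individual box test, so one 4-box query costs roughly S + 4T, while the ISPC version costs roughly S′ + T for the same query. The intersection portion by itself can only improve by at most 4x, but if the ISPC setup cost S′ is meaningfully smaller than S, then the overall ratio (S + 4T) / (S′ + T) can exceed 4x, which is consistent with the measured results above.</p>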
<p>Generally, I really like ISPC!
ISPC delivers on the promise of write-once compiler-and-run-anywhere vectorized code, and unlike auto-vectorization, ISPC’s output compiler assembly performs as we expect for well-written vectorized code.
Of course, ISPC isn’t 100% fool-proof magic; care still needs to be taken to write ISPC programs that don’t contain excessive amounts of execution path divergence between SIMD lanes and that don’t perform too many expensive gather/scatter operations.
However, these types of considerations are just part of writing vectorized code in general and are not specific to ISPC, and furthermore, these types of considerations should be familiar territory for anyone with experience writing GPU code as well.
I think that’s a general strength of ISPC: writing vector CPU code using ISPC feels a lot like writing GPU code, and that’s by design!</p>
<div id="results"></div>
<p><strong>Final Results and Conclusions</strong></p>
<p>Now that we’ve walked through every implementation in the test program, below are the complete results for every implementation across x86-64, arm64, and Rosetta 2.
As mentioned earlier, all results were generated by running on a 2019 16-inch MacBook Pro with an Intel Core i7-9750H CPU for x86-64, and on a 2020 M1 Mac Mini for arm64 and Rosetta 2.
All results were generated by running the test program with 100000 runs per implementation; the timings reported are the average time for one run.
I ran the test program 5 times with 100000 runs each time; after throwing out the highest and lowest result for each implementation to discard outliers, I then averaged the remaining three results for each implementation for each architecture.
In the results, the “speedup” columns use the scalar compact implementation as the baseline for comparison:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">Results</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">x86-64:</th>
<th style="text-align: center">x86-64 Speedup:</th>
<th style="text-align: center">arm64:</th>
<th style="text-align: center">arm64 Speedup:</th>
<th style="text-align: center">Rosetta2:</th>
<th style="text-align: center">Rosetta2 Speedup:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Scalar Compact:</td>
<td style="text-align: center">44.5159 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">41.8187 ns</td>
<td style="text-align: center">1.0x.</td>
<td style="text-align: center">81.0942 ns</td>
<td style="text-align: center">1.0x.</td>
</tr>
<tr>
<td style="text-align: right">Scalar Original:</td>
<td style="text-align: center">44.1004 ns</td>
<td style="text-align: center">1.0117x</td>
<td style="text-align: center">78.4001 ns</td>
<td style="text-align: center">0.5334x</td>
<td style="text-align: center">90.7649 ns</td>
<td style="text-align: center">0.8935x</td>
</tr>
<tr>
<td style="text-align: right">Scalar No Early-Out:</td>
<td style="text-align: center">55.6770 ns</td>
<td style="text-align: center">0.8014x</td>
<td style="text-align: center">85.3562 ns</td>
<td style="text-align: center">0.4899x</td>
<td style="text-align: center">102.763 ns</td>
<td style="text-align: center">0.7891x</td>
</tr>
<tr>
<td style="text-align: right">SSE:</td>
<td style="text-align: center">10.9660 ns</td>
<td style="text-align: center">4.0686x</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
<td style="text-align: center">13.6353 ns</td>
<td style="text-align: center">5.9474x</td>
</tr>
<tr>
<td style="text-align: right">SSE2NEON:</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
<td style="text-align: center">12.3090 ns</td>
<td style="text-align: center">3.3974x</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
</tr>
<tr>
<td style="text-align: right">Neon:</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
<td style="text-align: center">12.2161 ns</td>
<td style="text-align: center">3.4232x</td>
<td style="text-align: center">-</td>
<td style="text-align: center">-</td>
</tr>
<tr>
<td style="text-align: right">Autovectorize:</td>
<td style="text-align: center">34.1398 ns</td>
<td style="text-align: center">1.3069x</td>
<td style="text-align: center">38.1917 ns</td>
<td style="text-align: center">1.0950x</td>
<td style="text-align: center">59.9757 ns</td>
<td style="text-align: center">1.3521x</td>
</tr>
<tr>
<td style="text-align: right">ISPC:</td>
<td style="text-align: center">8.2877 ns</td>
<td style="text-align: center">5.3835x</td>
<td style="text-align: center">11.2182 ns</td>
<td style="text-align: center">3.7278x</td>
<td style="text-align: center">11.3709 ns</td>
<td style="text-align: center">7.1317x</td>
</tr>
</tbody>
</table>
<p>In each of the sections above, we’ve already looked at how the performance of each individual implementation compares against the baseline compact scalar implementation.
Ranking all of the approaches (at least for the specific example used in this post), ISPC produces the best performance, hand-vectorization using each processor’s native vector intrinsics comes in second, hand-vectorization using a translation layer such as sse2neon follows very closely behind using native vector intrinsics, and finally auto-vectorization comes in a distant last place.
Broadly, I think a good rule of thumb is that auto-vectorization is better than nothing, and that for large complex programs where vectorization is important and where cross-platform is required, ISPC is the way to go.
For smaller-scale things where the additional development complexity of bringing in an additional compiler isn’t justified, writing directly using vector intrinsics is a good solution. Using translation layers like sse2neon to port code written using one architecture’s vector intrinsics to another architecture without a total rewrite can work just as well as rewriting from scratch (assuming the translation layer is as well-written as sse2neon is).
Finally, as mentioned earlier, I was very surprised to learn that Rosetta 2 seems to be much better at translating vector instructions than it is at translating normal scalar x86-64 instructions.</p>
<p>Looking back over the final test program, around a third of the total lines of code aren’t ray-box intersection code at all; instead, that third is made up of just defining data structures and doing data marshaling to make sure that the actual ray-box intersection code can be efficiently vectorized at all.
I think that in most applications of vectorization, figuring out the data marshaling to enable good vectorization is just as important a problem as actually writing the core vectorized code, and the data marshaling can often be even harder than the actual vectorization part.
Even the ISPC implementation in this post only works because the specific memory layout of the <code class="language-plaintext highlighter-rouge">BBox4</code> data structure is designed for optimal vectorized access.</p>
<p>For much larger vectorized applications, such as full production renderers, planning ahead for vectorization doesn’t just mean figuring out how to lay out data structures in memory, but can mean having to incorporate vectorization considerations into the fundamental architecture of the entire system.
A great example of the above is DreamWorks Animation’s Moonray renderer, which has an entire architecture designed around coalescing enough coherent work in an incoherent path tracer to facilitate ISPC-based vectorized shading <a href="https://dl.acm.org/citation.cfm?doid=3105762.3105768">[Lee et al. 2017]</a>.
Weta Digital’s Manuka renderer goes even further by fundamentally restructuring the typical order of operations in a standard path tracer into a <em>shade-before-hit</em> architecture, also in part to facilitate vectorized shading <a href="https://doi.org/10.1145/3182161">[Fascione et al. 2018]</a>.
Pixar and Intel have also worked together recently to extend OSL with better vectorization for use in RenderMan XPU, which has necessitated the addition of a new batched interface to OSL <a href="https://www.youtube.com/watch?v=-WqrP50nvN4">[Liani and Wells 2020]</a>.
Some other interesting large non-rendering applications where vectorization has been applied through the use of clever rearchitecting include JPEG encoding <a href="https://blog.cloudflare.com/neon-is-the-new-black/">[Krasnov 2018]</a> and even <a href="https://github.com/simdjson/simdjson">JSON parsing</a> <a href="https://doi.org/10.1007/s00778-019-00578-5">[Langdale and Lemire 2019]</a>.
More generally, the entire domain of data-oriented design <a href="https://www.youtube.com/watch?v=rX0ItVEVjHc">[Acton 2014]</a> revolves around understanding how to structure data layout according to how computation needs to access said data; although data-oriented design was originally motivated by a need to efficiently utilize the CPU cache hierarchy, data-oriented design is also highly applicable to structuring vectorized programs.</p>
<p>In this post, we only looked at 4-wide 128-bit SIMD extensions.
Vectorization is not limited to 128-bits or 4-wide instructions, of course; x86-64’s newer <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX instructions</a> use 256-bit registers and, when used with 32-bit floats, AVX is 8-wide.
The newest version of AVX, <a href="https://en.wikipedia.org/wiki/AVX-512">AVX-512</a>, extends things even wider to 512-bit registers and can support a whopping 16 32-bit lanes.
Similarly, ARM’s new <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/sve/sve-programmers-guide">SVE vector extensions</a> serve as a wider successor to Neon (ARM also recently introduced a new lower-energy, lighter-weight companion vector extension to Neon, named <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/helium/helium-programmers-guide">Helium</a>).
Comparing AVX and SVE is interesting, because their design philosophies are much further apart than the relatively similar design philosophies behind SSE and Neon.
AVX serves as a direct extension to SSE, to the point where even AVX’s YMM registers are really just an expanded version of SSE’s XMM registers (on processors supporting AVX, the XMM registers are physically just the lower 128 bits of the full YMM registers).
Similar to AVX, the lower bits of SVE’s registers also overlap Neon’s registers, but SVE uses a new set of vector instructions separate from Neon.
The big difference between AVX and SVE is that while AVX and AVX-512 specify fixed 256-bit and 512-bit widths respectively, SVE allows for different implementations to define different widths from a minimum of 128-bit all the way up to a maximum of 2048-bit, in 128-bit increments.
At some point in the future, I think a comparison of AVX and SVE could be fun and interesting, but I didn’t touch on them in this post because of a number of current problems.
In many Intel processors today, AVX (and especially AVX-512) is so power-hungry that using AVX means that the processor has to throttle its clock speeds down <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/">[Krasnov 2017]</a>, which can in some cases completely negate any kind of performance improvement.
The challenge with testing SVE code right now is… there just aren’t many arm64 processors out there that actually implement SVE yet!
As of the time of writing, the only publicly released arm64 processor in the world that I know of that implements SVE is Fujitsu’s A64FX supercomputer processor, which is not exactly an off-the-shelf consumer part.
NVIDIA’s upcoming Grace arm64 server CPU is also supposed to implement SVE, but as of 2021, the Grace CPU is still a few years away from release.</p>
<p>At the end of the day, for any application where vectorization is a good fit, not using vectorization means leaving a large amount of performance on the table.
Of course, the example used in this post is just a single data point, and is a relatively small example; your mileage may and likely will vary for different and larger examples!
As with any programming task, understanding your problem domain is crucial for understanding how useful any given technique will be, and as seen in this post, great care must be taken to structure code and data to even be able to take advantage of vectorization.
Hopefully this post has served as a useful examination of several different approaches to vectorization!
Again, I have put all of the code in this post in <a href="https://github.com/betajippity/sseneoncompare">an open GitHub repository</a>; feel free to play around with it yourself (or if you are feeling especially ambitious, feel free to use it as a starting point for a full vectorized BVH implementation)!</p>
<p><strong>Addendum</strong></p>
<p>After I published this post, <a href="https://twitter.com/romainguy">Romain Guy</a> wrote in with a suggestion to use <code class="language-plaintext highlighter-rouge">-ffast-math</code> to improve the auto-vectorization results.
I gave the suggestion a try, and the result was indeed markedly improved!
Across the board, using <code class="language-plaintext highlighter-rouge">-ffast-math</code> cut the auto-vectorization timings by about half, corresponding to around a doubling of performance.
Using <code class="language-plaintext highlighter-rouge">-ffast-math</code>, the auto-vectorized implementation still trails behind the hand-vectorized and ISPC implementations, but by a much narrower margin than before, and overall is much, much better than the compact scalar baseline.
Romain previously presented <a href="https://www.youtube.com/watch?v=Lcq_fzet9Iw">a talk in 2019</a> about Google’s Filament real-time rendering engine, which includes many additional tips for making auto-vectorization work better.</p>
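<p>For anyone who wants to try this in a CMake-based setup like the hypothetical sketch shown earlier, enabling the flag is a one-line change (the target name here is again just illustrative, and keep in mind that <code class="language-plaintext highlighter-rouge">-ffast-math</code> relaxes strict IEEE floating point behavior):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch: enable fast-math for the auto-vectorization comparison build.
target_compile_options(raybox PRIVATE -ffast-math)
</code></pre></div></div>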
<p><strong>References</strong></p>
<p>Mike Acton. 2014. <a href="https://www.youtube.com/watch?v=rX0ItVEVjHc">Data-Oriented Design and C++</a>. In <em>CppCon 2014</em>.</p>
<p>AMD. 2020. <a href="https://gpuopen.com/rdna2-isa-available/">“RDNA 2” Instruction Set Architecture Reference Guide</a>. Retrieved August 30, 2021.</p>
<p>ARM Holdings. 2021. <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/">ARM Intrinsics</a>. Retrieved August 30, 2021.</p>
<p>ARM Holdings. 2021. <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/helium/helium-programmers-guide">Helium Programmer’s Guide</a>. Retrieved September 5, 2021.</p>
<p>ARM Holdings. 2021. <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/sve/sve-programmers-guide">SVE and SVE2 Programmer’s Guide</a>. Retrieved September 5, 2021.</p>
<p>Holger Dammertz, Johannes Hanika, and Alexander Keller. 2008. <a href="https://doi.org/10.1111/j.1467-8659.2008.01261.x">Shallow Bounding Volume Hierarchies for Fast SIMD Ray Tracing of Incoherent Rays</a>. <em>Computer Graphics Forum</em>. 27, 4 (2008), 1225-1234.</p>
<p>Manfred Ernst and Günther Greiner. 2008. <a href="https://doi.org/10.1109/RT.2008.4634618">Multi Bounding Volume Hierarchies</a>. In <em>RT 2008: Proceedings of the 2008 IEEE Symposium on Interactive Ray Tracing</em>. 35-40.</p>
<p>Luca Fascione, Johannes Hanika, Mark Leone, Marc Droske, Jorge Schwarzhaupt, Tomáš Davidovič, Andrea Weidlich, and Johannes Meng. 2018. <a href="https://doi.org/10.1145/3182161">Manuka: A Batch-Shading Architecture for Spectral Path Tracing in Movie Production</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 31:1-31:18.</p>
<p>Romain Guy and Mathias Agopian. 2019. <a href="https://www.youtube.com/watch?v=Lcq_fzet9Iw">High Performance (Graphics) Programming</a>. In <em>Android Dev Summit ‘19</em>. Retrieved September 7, 2021.</p>
<p>Intel Corporation. 2021. <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel Intrinsics Guide</a>. Retrieved August 30, 2021.</p>
<p>Intel Corporation. 2021. <a href="https://ispc.github.io/ispc.html">Intel ISPC User’s Guide</a>. Retrieved August 30, 2021.</p>
<p>Thiago Ize. 2013. <a href="http://jcgt.org/published/0002/02/02/">Robust BVH Ray Traversal</a>. <em>Journal of Computer Graphics Techniques</em>. 2, 2 (2013), 12-27.</p>
<p>Tero Karras and Timo Aila. 2013. <a href="https://doi.org/10.1145/2492045.2492055">Fast Parallel Construction of High-Quality Bounding Volume Hierarchies</a>. In <em>HPG 2013: Proceedings of the 5th Conference on High-Performance Graphics</em>. 89-99.</p>
<p>Vlad Krasnov. 2017. <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/">On the dangers of Intel’s frequency scaling</a>. In <em>Cloudflare Blog</em>. Retrieved August 30, 2021.</p>
<p>Vlad Krasnov. 2018. <a href="https://blog.cloudflare.com/neon-is-the-new-black/">NEON is the new black: fast JPEG optimization on ARM server</a>. In <em>Cloudflare Blog</em>. Retrieved August 30, 2021.</p>
<p>Geoff Langdale and Daniel Lemire. 2019. <a href="https://doi.org/10.1007/s00778-019-00578-5">Parsing Gigabytes of JSON per Second</a>. <em>The VLDB Journal</em>. 28 (2019), 941-960.</p>
<p>Mark Lee, Brian Green, Feng Xie, and Eric Tabellion. 2017. <a href="https://dl.acm.org/citation.cfm?doid=3105762.3105768">Vectorized Production Path Tracing</a>. In <em>HPG 2017: Proceedings of the 9th Conference on High-Performance Graphics</em>. 10:1-10:11.</p>
<p>Max Liani and Alex M. Wells. 2020. <a href="https://www.youtube.com/watch?v=-WqrP50nvN4">Supercharging Pixar’s RenderMan XPU with Intel AVX-512</a>. In <em>ACM SIGGRAPH 2020: Exhibitor Sessions</em>.</p>
<p>Alexander Majercik, Cyril Crassin, Peter Shirley, and Morgan McGuire. 2018. <a href="http://jcgt.org/published/0007/03/04/">A Ray-Box Intersection Algorithm and Efficient Dynamic Voxel Rendering</a>. <em>Journal of Computer Graphics Techniques</em>. 7, 3 (2018).</p>
<p>Daniel Meister, Shinji Ogaki, Carsten Benthin, Michael J. Doyle, Michael Guthe, and Jiri Bittner. 2021. <a href="https://doi.org/10.1111/cgf.142662">A Survey on Bounding Volume Hierarchies for Ray Tracing</a>. <em>Computer Graphics Forum</em>. 40, 2 (2021), 683-712.</p>
<p>NVIDIA. 2021. <a href="https://raytracing-docs.nvidia.com/optix7/guide/index.html">NVIDIA OptiX 7.3 Programming Guide</a>. Retrieved August 30, 2021.</p>
<p>Howard Oakley. 2021. <a href="https://eclecticlight.co/2021/08/23/code-in-arm-assembly-lanes-and-loads-in-neon/">Code in ARM Assembly: Lanes and loads in NEON</a>. In <em>The Eclectic Light Company</em>. Retrieved September 7, 2021.</p>
<p>Matt Pharr. 2018. <a href="https://pharr.org/matt/blog/2018/04/30/ispc-all">The Story of ISPC</a>. In <em>Matt Pharr’s Blog</em>. Retrieved July 18, 2021.</p>
<p>Matt Pharr and William R. Mark. 2012. <a href="https://doi.org/10.1109/InPar.2012.6339601">ispc: A SPMD compiler for high-performance CPU programming</a>. In <em>2012 Innovative Parallel Computing (InPar)</em>.</p>
<p>Martin Stich, Heiko Friedrich, and Andreas Dietrich. 2009. <a href="https://doi.org/10.1145/1572769.1572771">Spatial Splits in Bounding Volume Hierarchies</a>. In <em>HPG 2009: Proceedings of the 1st Conference on High-Performance Graphics</em>. 7-13.</p>
<p>John A. Tsakok. 2009. <a href="https://doi.org/10.1145/1572769.1572793">Faster Incoherent Rays: Multi-BVH Ray Stream Tracing</a>. In <em>HPG 2009: Proceedings of the 1st Conference on High-Performance Graphics</em>. 151-158.</p>
<p>Nathan Vegdahl. 2017. <a href="https://psychopath.io/post/2017_08_03_bvh4_without_simd">BVH4 Without SIMD</a>. In <em>Psychopath Renderer</em>. Retrieved August 20, 2021.</p>
<p>Ingo Wald, Carsten Benthin, and Solomon Boulos. 2008. <a href="https://doi.org/10.1109/RT.2008.4634620">Getting Rid of Packets - Efficient SIMD Single-Ray Traversal using Multi-Branching BVHs</a>. In <em>RT 2008: Proceedings of the 2008 IEEE Symposium on Interactive Ray Tracing</em>. 49-57.</p>
<p>Ingo Wald, Philipp Slusallek, Carsten Benthin, and Markus Wagner. 2001. <a href="https://doi.org/10.1111/1467-8659.00508">Interactive Rendering with Coherent Ray Tracing</a>. <em>Computer Graphics Forum</em>. 20, 3 (2001), 153-165.</p>
<p>Ingo Wald, Sven Woop, Carsten Benthin, Gregory S. Johnson, and Manfred Ernst. 2014. <a href="https://doi.org/10.1145/2601097.2601199">Embree: A Kernel Framework for Efficient CPU Ray Tracing</a>. <em>ACM Transactions on Graphics</em>. 33, 4 (2014), 143:1-143:8.</p>
<p>Amy Williams, Steve Barrus, Keith Morley, and Peter Shirley. 2005. <a href="https://doi.org/10.1080/2151237X.2005.10129188">An Efficient and Robust Ray-Box Intersection Algorithm</a>. <em>Journal of Graphics Tools</em>. 10, 1 (2005), 49-54.</p>
<p>Henri Ylitie, Tero Karras, and Samuli Laine. 2017. <a href="https://doi.org/10.1145/3105762.3105773">Efficient Incoherent Ray Traversal on GPUs Through Compressed Wide BVHs</a>. In <em>HPG 2017: Proceedings of the 9th Conference on High-Performance Graphics</em>. 4:1-4:13.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">Advanced Vector Extensions</a>. Retrieved September 5, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Automatic_vectorization">Automatic Vectorization</a>. Retrieved September 4, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/AVX-512">AVX-512</a>. Retrieved September 5, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads">Single Instruction, Multiple Threads</a>. Retrieved July 18, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/SPMD">SPMD</a>. Retrieved July 18, 2021.</p>
https://blog.yiningkarlli.com/2021/08/unbiased-emission-and-scattering-volumes.html
SIGGRAPH 2021 Talk- Unbiased Emission and Scattering Importance Sampling for Heterogeneous Volumes
2021-08-09T00:00:00+00:00
2021-08-09T00:00:00+00:00
Yining Karl Li
<p>This year at SIGGRAPH 2021, Wei-Feng Wayne Huang, Peter Kutz, Matt Jen-Yuan Chiang, and I have a talk that presents a pair of new distance-sampling techniques for improving emission and scattering importance sampling for volume path tracing cases where low-order heterogeneous scattering dominates.
These techniques were developed as part of our ongoing development on <a href="https://www.disneyanimation.com/technology/hyperion/">Disney’s Hyperion Renderer</a> and first saw full-fledged production use on Raya and the Last Dragon, although limited testing of in-progress versions also happened on Frozen 2.
This work was led by Wayne, building upon important groundwork that was put in place by Peter before Peter left Disney Animation.
Matt and I played more of an advisory or consulting role on this project, mostly helping with brainstorming, puzzling through ideas, and figuring out how to formally describe and present these new techniques.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Aug/unbiased-emission-and-scattering-volumes/teaser.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Aug/unbiased-emission-and-scattering-volumes/preview/teaser.jpg" alt="A higher-res version of Figure 1 from the paper: a torch embedded in thin anisotropic heterogeneous mist. Equal-time comparison of a conventional null-collision approach (left), incorporating our emission sampling strategy (middle), and additionally combining with our scattering sampling strategy via MIS (right)." /></a></p>
<p>Here is the paper abstract:</p>
<p><em>We present two new distance-sampling methods for production volume path tracing. We extend the null-collision integral formulation to efficiently gather heterogeneous volumetric emission, achieving higher-quality results. Additionally, we propose a tabulation-based approach to importance sample volumetric in-scattering through a spatial guiding data structure. Our methods improve the sampling efficiency for scenarios where low-order heterogeneous scattering dominates, which tends to cause high variance renderings with existing null-collision methods.</em></p>
<p>The paper and related materials can be found at:</p>
<ul>
<li><a href="https://www.yiningkarlli.com/projects/emissionscattervolumes.html">Project Page (Author’s Version and Supplementary Material)</a></li>
<li><a href="https://dl.acm.org/doi/10.1145/3450623.3464644">Official Print Version (ACM Library)</a></li>
</ul>
<p>As covered in several previous publications, several years ago we replaced Hyperion’s old residual ratio tracking <a href="https://dl.acm.org/citation.cfm?id=2661292">[Novák et al. 2014</a>, <a href="http://graphics.pixar.com/library/ProductionVolumeRendering">Fong et al. 2017]</a> based volume rendering system with a new, state-of-the-art volume rendering system based on null-collision tracking theory (also called delta tracking or Woodcock tracking).
Null-collision volume rendering systems are extremely good at dense volumes where light transport is dominated by high-order scattering, such as clouds and snow and sea foam.
However, null-collision volume rendering systems historically have struggled with efficiently rendering optically thin volumes dominated by low-order scattering, such as mist and fog.
The reason null-collision systems struggle with optically thin volumes is that in a thin volume, the average sampled distance is usually very large, meaning that the ray often goes right through the volume with very few scattering events <a href="http://jcgt.org/published/0007/03/03/">[Villemin et al. 2018]</a>.
Since we can only evaluate illumination at each scattering event, not having a lot of scattering events means that the illumination estimate is necessarily often very low-quality, leading to tons of noise.</p>
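<p>For readers who want a bit more intuition for why thin volumes are hard for this family of techniques, the relevant math is just the standard free-flight distance sampling used by null-collision methods (nothing here is specific to Hyperion): tentative collision distances are drawn from an exponential distribution defined by the majorant extinction coefficient, t = −ln(1 − ξ) / σ_maj, where ξ is a uniform random number. The expected value of that distribution is 1 / σ_maj, so when the volume is optically thin and the majorant is correspondingly small, the expected distance between collisions is very large and most paths exit the volume after few (or zero) real scattering events.</p>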
<p>Frozen 2’s forest scenes tended to include large amounts of atmospheric fog to lend the movie a moody look; these atmospherics proved to be a major challenge for Hyperion’s modern volume rendering system.
Going into Raya and the Last Dragon, we knew that the challenge was only going to get harder: from fairly early on in Raya and the Last Dragon’s production, we already knew that the cinematography direction for the film was going to rely heavily on atmospherics and fog <a href="https://doi.org/10.1145/3450623.3464676">[Bryant et al. 2021]</a>, even more than Frozen 2’s cinematography did.
To make things even harder, we also knew that a lot of these atmospherics were going to be lit using emissive volume light sources like fire or torches; not only did we need a good way to improve how we sampled scattering events, but we also needed a better way to sample emission.</p>
<p>The solution to the second problem (emission sampling) actually came long before the solution to the first problem (scattering sampling).
When we first implemented our new volume rendering system, we evaluated the emission term only when an absorption event happened, which is an intuitive interpretation of a random walk since each interaction is associated with one particular event.
However, shortly after we wrote our Spectral and Decomposition Tracking paper <a href="https://doi.org/10.1145/3072959.3073665">[Kutz et al. 2017]</a>, Peter realized that absorption and emission can actually also be evaluated at scattering and null-collision events too, and provided that some care was taken, doing so could be kept unbiased and mathematically correct as well.
Peter implemented this technique in Hyperion before he moved on from Disney Animation; later, through experiences from using an early version of this technique on Frozen 2, Wayne realized that the relationship between voxel size and majorant value needed to be factored into this technique.
When Wayne made the necessary modifications from his realization, the end result sped up this technique dramatically and in some scenes sped up overall volume rendering by up to a factor of 2x.
A complete description of how all of the above is done and how it can be kept unbiased and mathematically correct makes up the first part of our talk.</p>
<p>The solution to the first problem (scattering sampling) came out of many brainstorming and discussion sessions between Wayne, Matt, and myself.
At each volume scattering point, there are three terms that need to be sampled: transmittance, radiance, and the phase function.
The latter two are directly analogous to incoming radiance and the BRDF lobe at a surface scattering event; transmittance is an additional term that volumes have to worry about beyond what surfaces need to consider.
The problem we were facing in optically thin volumes fundamentally boiled down to cases where these three terms have extremely different distributions for the same point in space.
In surface path tracing, the solution to this type of problem is well understood: sample these different distributions using separate techniques and combine using MIS <a href="http://jcgt.org/published/0002/02/10/">[Villemin & Hery 2013]</a>.
However, we had two obstacles preventing us from using MIS here: first, MIS requires knowing a sampling pdf, and at the time, computing the sampling pdf for distance sampling in a null-collision system was an unsolved problem.
Second, we needed a way to do distance sampling based not on transmittance, but instead on the product of incoming radiance and the phase function; this product needed to be learned on-the-fly and stored in an easy-to-sample spatial data structure.
Fortunately, almost exactly around the time we were discussing these problems, Miller et al. <a href="https://doi.org/10.1145/3306346.3323025">[2019]</a> was published, which solved the longstanding open research problem around computing a usable pdf for distance samples, allowing for MIS.
Our idea for on-the-fly learning of the product of incoming radiance and the phase function was to simply piggyback off of Hyperion’s existing cache points light-selection-guiding data structure <a href="https://doi.org/10.1145/3182159">[Burley et al. 2018]</a>.
Wayne worked through the details of all of the above and implemented both in Hyperion, and also figured out how to combine this technique with the previously existing transmittance-based distance sampling and with Peter’s emission sampling technique; the detailed description of this technique makes up the second part of our talk.
The end product is a system that combines different techniques for handling thin and thick volumes to produce good, efficient results in a single unified volume integrator!</p>
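<p>As a brief aside for readers less familiar with MIS: the pdf requirement mentioned above comes directly from how MIS weights are computed. With the standard balance heuristic, a sample x drawn from technique i out of n techniques is weighted by w_i(x) = p_i(x) / (p_1(x) + … + p_n(x)), so evaluating the weight requires being able to evaluate every participating technique’s pdf at x, including the distance-sampling pdf, which is exactly the piece that Miller et al. [2019] made practical for null-collision volume rendering.</p>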
<p>Because of the limited length of the SIGGRAPH Talks short paper format, we had to compress our text significantly to fit into the required short paper length.
We put much more detail into the slides that Wayne presented at SIGGRAPH 2021; for anyone that is interested and is attending SIGGRAPH 2021, I’d highly recommend giving the talk a watch (and then going to see all of the other cool Disney Animation talks this year)!
For anyone interested in the technique post-SIGGRAPH 2021, hopefully we’ll be able to get a version of the slides cleared for release by the studio at some point.</p>
<p>Wayne’s excellent implementations of the above techniques proved to be an enormous win for both rendering efficiency and artist workflows on Raya and the Last Dragon; I personally think we would have had enormous difficulties in hitting the lighting art direction on Raya and the Last Dragon if it weren’t for Wayne’s work.
I owe Wayne a huge debt of gratitude for letting me be a small part of this project; the discussions were very fun, seeing it all come together was very exciting, and helping put the techniques down on paper for the SIGGRAPH talk was an excellent exercise in figuring out how to communicate cutting edge research clearly.</p>
<div class="embed-container-cinema">
<iframe src="/content/images/2021/Aug/unbiased-emission-and-scattering-volumes/comparisons/beforeaftercomparison_crop_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">A frame from Raya and the Last Dragon without our techniques (left), and with both our scattering and emission sampling applied (right). Both images are rendered using 32 spp per volume pass; surface passes are denoised and composited with non-denoised volume passes to isolate noise from volumes. A video version of this comparison is included in our talk's supplementary materials. For a larger still comparison, <a href="/content/images/2021/Aug/unbiased-emission-and-scattering-volumes/comparisons/beforeaftercomparison_crop.html">click here.</a></div>
<p><strong>References</strong></p>
<p>Marc Bryant, Ryan DeYoung, Wei-Feng Wayne Huang, Joe Longson, and Noel Villegas. 2021. <a href="https://doi.org/10.1145/3450623.3464676">The Atmosphere of Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 51:1-51:2.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. <a href="https://doi.org/10.1145/3182159">The Design and Evolution of Disney’s Hyperion Renderer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 33:1-33:22.</p>
<p>Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. <a href="http://graphics.pixar.com/library/ProductionVolumeRendering">Production Volume Rendering</a>. In <em>ACM SIGGRAPH 2017 Courses</em>. 2:1-2:97.</p>
<p>Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. <a href="https://doi.org/10.1145/3072959.3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 111:1-111:16.</p>
<p>Bailey Miller, Iliyan Georgiev, and Wojciech Jarosz. 2019. <a href="https://dl.acm.org/doi/10.1145/3306346.3323025">A Null-Scattering Path Integral Formulation of Light Transport</a>. <em>ACM Transactions on Graphics</em>. 38, 4 (2019), 44:1-44:13.</p>
<p>Jan Novák, Andrew Selle, and Wojciech Jarosz. 2014. <a href="https://dl.acm.org/citation.cfm?id=2661292">Residual Ratio Tracking for Estimating Attenuation in Participating Media</a>. <em>ACM Transactions on Graphics</em>. 33, 6 (2014), 179:1-179:11.</p>
<p>Ryusuke Villemin and Christophe Hery. 2013. <a href="http://jcgt.org/published/0002/02/10/">Practical Illumination from Flames</a>. <em>Journal of Computer Graphics Techniques</em>. 2, 2 (2013), 142-155.</p>
<p>Ryusuke Villemin, Magnus Wrenninge, and Julian Fong. 2018. <a href="http://jcgt.org/published/0007/03/03/">Efficient Unbiased Rendering of Thin Participating Media</a>. <em>Journal of Computer Graphics Techniques</em>. 7, 3 (2018), 50-65.</p>
https://blog.yiningkarlli.com/2021/07/porting-takua-to-arm-pt2.html
Porting Takua Renderer to 64-bit ARM- Part 2
2021-07-31T00:00:00+00:00
2021-07-31T00:00:00+00:00
Yining Karl Li
<p>This post is the second half of my two-part series about how I ported my hobby renderer (Takua Renderer) to 64-bit ARM and what I learned from the process.
In the <a href="https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html">first part</a>, I wrote about my motivation for undertaking a port to arm64 in the first place and described the process I took to get Takua Renderer up and running on an arm64-based Raspberry Pi 4B.
I also did a deep dive into several topics that I ran into along the way, which included floating point reproducibility across different processor architectures, a comparison of arm64 and x86-64’s memory reordering models, and a comparison of how the same example atomic code compiles down to assembly in arm64 versus in x86-64.
In this second part, I’ll write about developments and lessons learned after I got my initial arm64 port working correctly on Linux.</p>
<p>We’ll start with how I got Takua Renderer up and running on arm64 macOS, and discuss various interesting aspects of arm64 macOS, such as Universal Binaries and Apple’s Rosetta 2 binary translation layer for running x86-64 binaries on arm64 macOS.
As noted in the first part of this series, my initial port of Takua Renderer to arm64 did not include Embree; after the initial port, I added Embree support using Syoyo Fujita’s embree-aarch64 project (which has since been superseded by official arm64 support in Embree v3.13.0).
In this post I’ll look into how Embree, a codebase containing tons of x86-64 assembly and SSE and AVX intrinsics, was ported to arm64.
I will also use this exploration of Embree as a lens through which to compare x86-64’s SSE vector extensions to arm64’s Neon vector extensions.
Finally, I’ll wrap up with some additional important details to keep in mind when writing portable code between x86-64 and arm64, and I’ll also provide some more performance comparisons featuring the Apple M1 processor.</p>
<p><strong>Porting to arm64 macOS</strong></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/takua_macos_arm64.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/takua_macos_arm64.jpg" alt="Figure 1: Takua Renderer running on arm64 macOS 11, on an Apple Silicon Developer Transition Kit." /></a></p>
<p>At WWDC 2020 last year, Apple announced that Macs would be transitioning from using x86-64 processors to using custom Apple Silicon chips over a span of two years.
Apple Silicon chips package together CPU cores, GPU cores, and various other coprocessors and controllers onto a single die; the CPU cores implement arm64.
Actually, Apple Silicon implements a <em>superset</em> of arm64; there are some interesting extra special instructions that Apple has added to their arm64 implementation, which I’ll get to a bit later.
Similar to how Apple provided developers with preview hardware during the previous Mac transition from PowerPC to x86, Apple also announced that for this transition they would be providing Developer Transition Kits (DTKs) to developers in the form of special Mac Minis based on the iPad Pro’s A12Z chip.
I had been anticipating a Mac transition to arm64 for some time, so I ordered a Developer Transition Kit as soon as they were made available.</p>
<p>Since I had already gotten Takua Renderer up and running on arm64 on Linux, getting Takua Renderer up and running on the Apple Silicon DTK was very fast!
By far the most time consuming part of this process was just getting developer tooling set up and getting Takua’s dependencies built; once all of that was done, building and running Takua basically Just Worked™.
The only reason that getting developer tooling set up and getting dependencies built took a bit of work at the time was because this was just a week and a half after the entire Mac arm64 transition had even been announced.</p>
<p>Interestingly, the main stumbling block I ran into for most things on Apple Silicon macOS wasn’t the change to arm64 under the hood at all; the main stumbling block was… the macOS version number!
For the past 20 years, modern macOS (or Mac OS X as it was originally named) has used 10.x version numbers, but the first version of macOS to support arm64, macOS Big Sur, bumps the version number to 11.x.
This version number bump threw off a surprising number of libraries and packages!
Takua’s build system uses <a href="https://cmake.org">CMake</a> and <a href="https://ninja-build.org">Ninja</a>, and on macOS I get CMake and Ninja through <a href="https://www.macports.org">MacPorts</a>.
At the time, a lot of stuff in MacPorts wasn’t expecting an 11.x version number, so a bunch of stuff wouldn’t build, but fixing all of this just required manually patching build scripts and portfiles to expect an 11.x version number.
All of this pretty much got fixed within weeks of DTKs shipping out (and Apple actually contributed a huge number of patches themselves to various projects and stuff), but I didn’t want to wait at the time, so I just charged ahead.</p>
<p>Only three of Takua’s dependencies needed some minor patching to get working on arm64 macOS: <a href="https://github.com/oneapi-src/oneTBB">TBB</a>, <a href="https://github.com/AcademySoftwareFoundation/openexr">OpenEXR</a>, and <a href="https://github.com/wdas/ptex">Ptex</a>.
TBB’s build script just had to be updated to detect arm64 as a valid architecture for macOS; I submitted a pull request for this fix to the TBB Github repo, but I guess Intel doesn’t really take pull requests for TBB.
It’s okay though; the fix has since shown up in newer releases of TBB.
OpenEXR’s build script had to be patched so that inlined AVX intrinsics wouldn’t be used when building for arm64 on macOS; I submitted a pull request for this fix to OpenEXR that got merged, although this fix was later rendered unnecessary by a fix in the final release of Xcode 12.
Finally, Ptex just needed an extra include to pick up the <code class="language-plaintext highlighter-rouge">unlink()</code> system call correctly from <code class="language-plaintext highlighter-rouge">unistd.h</code> on macOS 11.
This change in Ptex was needed going from macOS Catalina to macOS Big Sur, and it’s also merged into the mainline Ptex repository now.</p>
<p>Once I had all of the above out of the way, getting Takua Renderer itself building and running correctly on the Apple Silicon DTK took no time at all, thanks to my previous efforts to port Takua Renderer to arm64 on Linux.
At this point I just ran <code class="language-plaintext highlighter-rouge">cmake</code> and <code class="language-plaintext highlighter-rouge">ninja</code> and a minute later out popped a working build.
From the moment the DTK arrived on my doorstep, I only needed about five hours to get Takua Renderer’s arm64 version building and running on the DTK with all tests passing.
Considering that at that point, outside of Apple nobody had done any work to get anything ready yet, I was very pleasantly surprised that I had everything up and working in just five hours!
Figure 1 is a screenshot of Takua Renderer running on arm64 macOS Big Sur Beta 1 on the Apple Silicon DTK.</p>
<p><strong>Universal Binaries</strong></p>
<p>The Mac has now had three processor architecture migrations in its history; the Mac line began in 1984 based on Motorola 68000 series processors, transitioned from the 68000 series to PowerPC in 1994, transitioned again from PowerPC to x86 (and eventually x86-64) in 2006, and is now in the process of transitioning from x86-64 to arm64.
Apple has used a similar strategy in all three of these processor architecture migrations to smooth the process.
Apple’s general transition strategy consists of two major components: first, provide a “fat” binary format that packages code from both architectures into a single executable that can run on both architectures, and second, provide some way for binaries from the old architecture to run directly on the new architecture.
I’ll look into the second part of this strategy a bit later; in this section, we are interested in Apple’s fat binary format.
Apple calls their fat binary format Universal Binaries; specifically, Apple uses the name “Universal 2” for the transition to arm64, since the original Universal Binary format was for the transition to x86.</p>
<p>Now that I had separate x86-64 and arm64 builds working and running on macOS, the next step was to modify Takua’s build system to automatically produce a single Universal 2 binary that could run on both Intel and Apple Silicon Macs.
Fortunately, creating Universal 2 binaries is very easy!
To understand why creating Universal 2 binaries can be so easy, we need to first understand at a high level how a Universal 2 binary works.
There actually isn’t much special about Universal 2 binaries per se, in the sense that multi-architecture support is an inherent feature of the Mach-O executable file format that Apple’s operating systems all use.
A multi-architecture Mach-O binary begins with a header that declares the file as a multi-architecture file and declares how many architectures are present.
The header is immediately followed by a list of architecture “slices”; each slice is a struct describing some basic information, such as what processor architecture the slice is for, the offset in the file that instructions begin at for the slice, and so on <a href="https://eclecticlight.co/2020/07/28/universal-binaries-inside-fat-headers/">[Oakley 2020]</a>.
After the list of architecture slices, the rest of the Mach-O file is pretty much like normal, except each architecture’s segments are concatenated after the previous architecture’s segments.
Also, Mach-O’s multi-architecture support allows for sharing non-executable resources between architectures.</p>
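<p>As a concrete illustration of the layout described above, here is a bare-bones C++ sketch that reads the multi-architecture header and slice list at the front of a Universal Binary, using the <code class="language-plaintext highlighter-rouge">fat_header</code> and <code class="language-plaintext highlighter-rouge">fat_arch</code> structs that macOS provides in <code class="language-plaintext highlighter-rouge">mach-o/fat.h</code>. This sketch is only meant to make the structure tangible; it skips error handling, ignores the 64-bit variant of the fat header format, and the file path in it is just a placeholder:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include <cstdio>
#include <arpa/inet.h>     // ntohl(); the fat header is stored big-endian on disk
#include <mach-o/fat.h>    // fat_header, fat_arch, FAT_MAGIC
#include <mach/machine.h>  // CPU_TYPE_ARM64, CPU_TYPE_X86_64

int main() {
    // Placeholder path; point this at any Universal Binary to try it out.
    std::FILE* file = std::fopen("/path/to/some/universal/binary", "rb");
    if (file == nullptr) {
        return 1;
    }
    fat_header header;
    std::fread(&header, sizeof(header), 1, file);
    if (ntohl(header.magic) != FAT_MAGIC) {
        std::printf("not a (32-bit header) multi-architecture Mach-O file\n");
        std::fclose(file);
        return 1;
    }
    uint32_t numSlices = ntohl(header.nfat_arch);
    std::printf("found %u architecture slices\n", numSlices);
    for (uint32_t i = 0; i < numSlices; i++) {
        fat_arch slice;
        std::fread(&slice, sizeof(slice), 1, file);
        cpu_type_t cputype = (cpu_type_t)ntohl((uint32_t)slice.cputype);
        const char* name = (cputype == CPU_TYPE_ARM64)  ? "arm64"
                         : (cputype == CPU_TYPE_X86_64) ? "x86_64"
                                                        : "other";
        // Each slice records where in the file that architecture's segments
        // begin and how many bytes they occupy.
        std::printf("slice %u: %s, offset %u, size %u\n", i, name,
                    ntohl(slice.offset), ntohl(slice.size));
    }
    std::fclose(file);
    return 0;
}
</code></pre></div></div>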
<p>So, because Universal 2 binaries are really just Mach-O multi-architecture binaries, and because Mach-O multi-architecture binaries don’t do any kind of crazy fancy interleaving and instead just concatenate each architecture after the previous one, all one needs to do to make a Universal 2 binary out of separate arm64 and x86-64 binaries is to concatenate the separate binaries into a single Mach-O file and set up the multi-architecture header and slices correctly.
Fortunately, <a href="https://developer.apple.com/documentation/apple-silicon/building-a-universal-macos-binary">a lot of tooling exists</a> to do exactly the above!
The version of clang that Apple ships with Xcode natively supports building Universal Binaries by just passing in multiple <code class="language-plaintext highlighter-rouge">-arch</code> flags; one for each architecture.
The Xcode UI of course also supports building Universal 2 binaries by just adding x86-64 and arm64 to an Xcode project’s architectures list in the project’s settings.
For projects using CMake, CMake has a <code class="language-plaintext highlighter-rouge">CMAKE_OSX_ARCHITECTURES</code> flag; this flag defaults to whatever the native architecture of the current system is, but can be set to <code class="language-plaintext highlighter-rouge">x86_64;arm64</code> to enable Universal Binary builds.
Finally, since the PowerPC to Intel transition, macOS has included a tool called lipo, which is used to query and create Universal Binaries; the larger LLVM compiler project also provides a compatible <a href="https://llvm.org/docs/CommandGuide/llvm-lipo.html">llvm-lipo tool</a>.
The lipo tool can combine any x86_64 Mach-O file with any arm64 Mach-O file to create a multi-architecture Universal Binary.
The lipo tool can also be used to “slim” a Universal Binary down into a single architecture by deleting architecture slices and segments from the Universal Binary.</p>
<p>Of course, when building a Universal Binary, any external libraries that have to be linked in also need to be Universal Binaries.
Takua has a relatively small number of direct dependencies, but unfortunately some of Takua’s dependencies pull in many more indirect (relative to Takua) dependencies; for example, Takua depends on <a href="https://www.openvdb.org">OpenVDB</a>, which in turn pulls in <a href="https://github.com/Blosc/c-blosc">Blosc</a>, <a href="https://www.zlib.net">zlib</a>, <a href="https://www.boost.org">Boost</a>, and several other dependencies.
While some of these dependencies are built using CMake and are therefore very easy to build as Universal Binaries themselves, some other dependencies use older or bespoke build systems that can be difficult to retrofit multi-architecture builds into.
Fortunately, this problem is where the lipo tool comes in handy.
For dependencies that can’t be easily built as Universal Binaries, I just built arm64 and x86-64 versions separately and then combined the separate builds into a single Universal Binary using the lipo tool.</p>
<p>Once all of Takua’s dependencies were successfully built as Universal Binaries, all I had to do to get Takua itself to build as a Universal Binary was to add a check in my CMakeLists file to not use a couple of x86-64-specific compiler flags in the event of an arm64 target architecture.
Then I just set the <code class="language-plaintext highlighter-rouge">CMAKE_OSX_ARCHITECTURES</code> flag to <code class="language-plaintext highlighter-rouge">x86_64;arm64</code>, ran <code class="language-plaintext highlighter-rouge">ninja</code>, and out came a working Universal Binary!
Figure 2 shows building Takua Renderer, checking that the current system architecture is an Apple Silicon Mac, using the lipo tool to see and confirm that the output Universal Binary contains both arm64 and x86-64 slices, and finally running the Universal Binary Takua Renderer build:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/universalbinary.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/universalbinary.png" alt="Figure 2: Building Takua Renderer as a Universal Binary, checking the current system architecture, checking the output Universal Binary's slices to confirm the presence of arm64 and x86-64 support, and finally running Takua Renderer from the Universal Binary build." /></a></p>
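<p>As a related aside, the same kind of architecture check that went into my CMakeLists file also shows up at the source level when building Universal Binaries: since each architecture slice is compiled separately from the same set of sources, any architecture-specific intrinsics have to sit behind compile-time guards so that each slice only ever sees instructions that actually exist for its architecture. The snippet below is a generic illustration of this pattern rather than Takua’s actual code, with an SSE path for x86-64, a Neon path for arm64, and a plain scalar fallback:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdio>

#if defined(__x86_64__)
#include <immintrin.h>   // SSE intrinsics, only available when targeting x86-64
#elif defined(__aarch64__)
#include <arm_neon.h>    // Neon intrinsics, only available when targeting arm64
#endif

// Add two 4-wide float vectors using whatever SIMD instruction set is native
// to the architecture currently being compiled for.
inline void add4(const float* a, const float* b, float* out) {
#if defined(__x86_64__)
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#elif defined(__aarch64__)
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
#else
    for (int i = 0; i < 4; i++) { out[i] = a[i] + b[i]; }
#endif
}

int main() {
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
    float out[4];
    add4(a, b, out);
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
</code></pre></div></div>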
<p>Out of curiosity, I also tried creating separate x86-64-only and arm64-only builds of Takua and assembling them into a Universal Binary using the lipo tool and comparing the result with the build of Takua that was natively built as a Universal Binary.
In theory natively building as a Universal Binary should be able to produce a slightly more compact output binary compared with using the lipo tool, since a natively built Universal Binary should be able to share non-code resources between different architectures, whereas the lipo tool just blindly encapsulates two separate Mach-O files into a single multi-architecture Mach-O file.
In fact, you can actually use the lipo tool to combine completely different programs into a single Universal Binary; after all, lipo has absolutely no way of knowing whether or not the arm64 and x86-64 code you want to combine is actually even from the same source code or implements the same functionality.
Indeed, the native Universal Binary Takua is slightly smaller than the lipo-generated Universal Binary Takua.
The size difference is tiny (basically negligible) though, likely because Takua’s binary contains very few non-code resources.
Figure 3 shows creating a Universal Binary by combining separate x86-64 and arm64 builds of Takua together using the lipo tool versus a Universal Binary built natively as a Universal Binary; the lipo version is just a bit over a kilobyte larger than the native version, which is negligible relative to the overall size of the files.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/lipocomparison.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/lipocomparison.png" alt="Figure 3: Examining the size of a Universal Binary created using the lipo tool versus the size of a Universal Binary built directly as a multi-architecture Mach-O." /></a></p>
<p><strong>Rosetta 2: Running x86-64 on Apple Silicon</strong></p>
<p>While getting Takua Renderer building and running as a native arm64 binary on Apple Silicon only took me about five hours, actually running Takua for the first time in <em>any</em> form on Apple Silicon happened much faster!
Before I did anything to get Takua’s arm64 build up and running on my Apple Silicon DTK, the first thing I did was just copy over the x86-64 macOS build of Takua to see if it would run on Apple Silicon macOS through Apple’s dynamic binary translation layer, Rosetta 2.
I was very impressed to find that the x86-64 version of Takua just worked out-of-the-box through Rosetta 2, and even passed my entire test suite!
I have now had Takua’s native arm64 build up and running as part of a Universal 2 binary for around a year, but I recently circled back to examine how Takua’s x86-64 build works through Rosetta 2.
I wanted to get a rough idea of how Rosetta 2 works, and much like many of the detours that I took on the entire Takua arm64 journey, I stumbled into a good opportunity to compare x86-64 and arm64 and learn more about how the two are similar and how they differ.</p>
<p>For every processor architecture transition that the Mac had undertaken, Apple has provided some sort of mechanism to run binaries for the outgoing processor architecture on Macs based on the new architecture.
During the 68000 to PowerPC transition, Apple’s approach was to emulate an entire 68000 system at the lowest levels of the operating system on PowerPC; in fact, during this transition, PowerPC Macs even allowed 68000 and PowerPC code to call back and forth to each other and be interspersed within the same binary.
During the PowerPC to x86 transition, Apple introduced Rosetta, which worked by JIT-compiling blocks of PowerPC code into x86 on-the-fly at program runtime.
For the x86-64 to arm64 transition, Rosetta 2 follows in the same tradition as in the previous two architecture transitions.
Rosetta 2 has two modes: the first is an ahead-of-time recompiler that converts an entire x86-64 binary to arm64 upon the binary’s first run and caches the translated binary for later reuse.
The second mode is a JIT translator, which is used for cases where the target program itself is also JIT-generating x86-64 code; obviously in these cases the target program’s JIT output cannot be recompiled to arm64 through an ahead-of-time process.</p>
<p>Apple does not publicly provide much information at all about how Rosetta 2 works under the hood.
Rosetta 2 is one of those pieces of Apple technology that basically “Just Works” well enough that the typical user never really has any need to know much about how it works internally, which is great for users but unfortunate for anyone that is more curious.
Fortunately though, Koh Nakagawa recently published <a href="https://ffri.github.io/ProjectChampollion/">a detailed analysis of Rosetta 2</a> produced through some careful reverse engineering work.
What I was interested in examining was how Rosetta 2’s output arm64 assembly looks compared with natively compiled arm64 assembly, so I’ll briefly summarize the relevant parts of how Rosetta 2 generates arm64 code.
There’s a lot more cool stuff about Rosetta 2, such as how the runtime and JIT mode works, that I won’t touch on here; if you’re interested, I’d highly recommend checking out Koh Nakagawa’s writeups.</p>
<p>When a user tries to run an x86-64 binary on an Apple Silicon Mac, Rosetta 2 first checks if this particular binary has already been translated by Rosetta 2 before; Rosetta 2 does this through a system daemon called <code class="language-plaintext highlighter-rouge">oahd</code>.
If Rosetta 2 has never encountered this particular binary before, <code class="language-plaintext highlighter-rouge">oahd</code> kicks off a new process called <code class="language-plaintext highlighter-rouge">oahd-helper</code> that carries out the ahead-of-time (AOT) binary translation process and caches the result in a folder located at <code class="language-plaintext highlighter-rouge">/var/db/oah</code>; cached AOT arm64 binaries are stored in subfolders named using a SHA-256 hash calculated from the contents and path of the original x86-64 binary.
If Rosetta 2 has encountered a binary before, as determined by finding a matching SHA-256 hash in <code class="language-plaintext highlighter-rouge">/var/db/oah</code>, then <code class="language-plaintext highlighter-rouge">oahd</code> just loads the previously cached AOT binary.</p>
<p>So what do these cached AOT binaries look like?
Unfortunately, <code class="language-plaintext highlighter-rouge">/var/db/oah</code> is by default not accessible to users at all, not even admin and root users.
Fortunately, as with all protected components of macOS, access can be granted by disabling System Integrity Protection (SIP).
I don’t recommend disabling SIP unless you have a very good reason to, since SIP is designed to protect core macOS files from getting damaged or modified, but for this exploration I temporarily disabled SIP just long enough to take a look in <code class="language-plaintext highlighter-rouge">/var/db/oah</code>.
Well, it turns out that the cached AOT binaries are just regular-ish arm64 Mach-O files named with an <code class="language-plaintext highlighter-rouge">.aot</code> extension; I say “regular-ish” because while the <code class="language-plaintext highlighter-rouge">.aot</code> files are completely normal Mach-O binaries, they cannot actually be executed on their own.
Attempting to directly run a <code class="language-plaintext highlighter-rouge">.aot</code> binary results in an immediate <code class="language-plaintext highlighter-rouge">SIGKILL</code>.
Instead, <code class="language-plaintext highlighter-rouge">.aot</code> binaries must be loaded by the Rosetta 2 runtime and require some special memory mapping to run correctly.
But that’s fine; I wasn’t interested in running the <code class="language-plaintext highlighter-rouge">.aot</code> file, I was interested in learning what it looks like inside, and since the <code class="language-plaintext highlighter-rouge">.aot</code> file is a Mach-O file, we can disassemble <code class="language-plaintext highlighter-rouge">.aot</code> files just like any other Mach-O file.</p>
<p>Let’s go through a simple example to compare how the same piece of C++ code compiles to arm64 natively, versus what Rosetta 2 generates from a x86-64 binary.
The simple example C++ code I’ll use here is the same basic atomic float addition implementation that I wrote about in my previous post; since that post already contains an exhaustive analysis of how this example compiles to both x86-64 and arm64 assembly, I figure that means I don’t need to go over all of that again and can instead dive straight into the Rosetta 2 comparison.
To make an actually executable binary though, I had to wrap the example <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function in a simple <code class="language-plaintext highlighter-rouge">main()</code> function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>

float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    do {
        float oldval = f0.load();
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}

int main() {
    std::atomic<float> t(0);
    addAtomicFloat(t, 1.0f);
    return 0;
}
</code></pre></div></div>
<div class="codecaption">Listing 1: Example <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> implementation and a very simple <code class="language-plaintext highlighter-rouge">main()</code> function to make an executable program. The <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> implementation is the same one from <a href="https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html#listing2">Listing 2 in my previous “Porting Takua Renderer to 64-bit ARM- Part 1” post</a>.</div>
<p>Modern versions of macOS’s Xcode Command Line Tools helpfully come with both otool and <a href="https://llvm.org/docs/CommandGuide/llvm-objdump.html">LLVM’s version of objdump</a>, both of which can be used to disassemble Mach-O binaries.
For this exploration, I used otool to disassemble arm64 binaries and objdump to disassemble x86-64 binaries.
I used different tools for disassembling x86-64 versus arm64 because of slightly different feature sets that I needed on each platform.
By default, Apple’s version of Clang emits newer ARMv8.1-A instructions like <code class="language-plaintext highlighter-rouge">casal</code>.
However, the version of objdump that Apple ships with the Xcode Command Line Tools only seems to support base ARMv8-A and doesn’t understand these newer ARMv8.1-A instructions, whereas otool does, hence using otool for arm64 binaries.
For x86-64 binaries, however, otool outputs x86-64 assembly using AT&T syntax, whereas I prefer reading x86-64 assembly in Intel syntax, which matches what <a href="https://godbolt.org">Godbolt Compiler Explorer</a> defaults to.
So, for x86-64 binaries, I used objdump, which can be set to output x86-64 assembly using Intel syntax with the <code class="language-plaintext highlighter-rouge">-x86-asm-syntax=intel</code> flag.</p>
<p>On both x86-64 and on arm64, I compiled the example in Listing 1 using the default Clang that comes with Xcode 12.5.1, which reports its version string as “Apple clang version 12.0.5 (clang-1205.0.22.11)”.
Note that Apple’s Clang version numbers have nothing to do with mainline upstream Clang version numbers; according to <a href="https://en.wikipedia.org/wiki/Xcode#12.x_series">this table on Wikipedia</a>, “Apple clang version 12.0.5” corresponds roughly with mainline LLVM/Clang 11.1.0.
Also, I compiled using the <code class="language-plaintext highlighter-rouge">-O3</code> optimization flag.</p>
<p>Disassembling the x86-64 binary using <code class="language-plaintext highlighter-rouge">objdump -disassemble -x86-asm-syntax=intel</code> produces the following x86-64 assembly.
I’ve only included the assembly for the <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function and not the assembly for the dummy <code class="language-plaintext highlighter-rouge">main()</code> function.
For readability, I have also replaced the offset for the <code class="language-plaintext highlighter-rouge">jne</code> instruction with a more readable label and added the label into the correct place in the assembly code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><__Z14addAtomicFloatRNSt3__16atomicIfEEf>: # f0 is dword ptr [rdi], f1 is xmm0
push rbp # save address of previous stack frame
mov rbp, rsp # move to address of current stack frame
nop word ptr cs:[rax + rax] # multi-byte no-op, probably to align
# subsequent instructions better for
# instruction fetch performance
nop # no-op
.LBB0_1:
mov eax, dword ptr [rdi] # eax = *arg0 = f0.load()
movd xmm1, eax # xmm1 = eax = f0.load()
movdqa xmm2, xmm1 # xmm2 = xmm1 = eax = f0.load()
addss xmm2, xmm0 # xmm2 = (xmm2 + xmm0) = (f0 + f1)
movd ecx, xmm2 # ecx = xmm2 = (f0 + f1)
lock cmpxchg dword ptr [rdi], ecx # if eax == *arg0 { ZF = 1; *arg0 = arg1 }
# else { ZF = 0; eax = *arg0 };
# "lock" means all done exclusively
jne .LBB0_1 # if ZF == 0 goto .LBB0_1
movdqa xmm0, xmm1 # return f0 value from before cmpxchg
pop rbp # restore address of previous stack frame
ret # return control to previous stack frame address
nop
</code></pre></div></div>
<div class="codecaption">Listing 2: The <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function from Listing 1 compiled to x86-64 using <code class="language-plaintext highligher-rouge">clang++ -O3</code> and disassembled using <code class="language-plaintext highligher-rouge">objdump -disassemble -x86-asm-syntax=intel</code>, with some minor tweaks for formatting and readability. My annotations are also included as comments.</div>
<p>If we compare the above code with <a href="https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html#listing5">Listing 5 in my previous post</a>, we can see that the above code matches what we got from Clang in Godbolt Compiler Explorer.
The only difference is the stack pointer pushing and popping code that happens in the beginning and end to make this function usable in a larger program; the core functionality in lines 8 through 18 of the above code matches the output from Clang in Godbolt Compiler Explorer exactly.</p>
<p>Next, here’s the assembly produced by disassembling the arm64 generated using Clang.
I disassembled the arm64 binary using <code class="language-plaintext highlighter-rouge">otool -Vt</code>; here’s the relevant <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function with the same minor changes as in Listing 2 for more readable section labels:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__Z14addAtomicFloatRNSt3__16atomicIfEEf:
.LBB0_1:
ldar w8, [x0] // w8 = *arg0 = f0, atomically loaded with acquire semantics
fmov s1, w8 // s1 = w8 = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
fmov w9, s2 // w9 = s2 = (f0 + f1)
mov x10, x8 // x10 (same as w10) = x8 (same as w8)
casal w10, w9, [x0] // atomically read the contents of the address stored
// in x0 (*arg0 = f0) and compare with w10;
// if [x0] == w10:
// atomically set the contents of the
// [x0] to the value in w9
// else:
// w10 = value loaded from [x0]
cmp w10, w8 // compare w10 and w8 and store result in N
cset w8, eq // if previous instruction's compare was true,
// set w8 = 1
cmp w8, #0x1 // compare if w8 == 1 and store result in N
b.ne .LBB0_1 // if N==0 { goto .LBB0_1 }
mov.16b v0, v1 // return f0 value from ldar
ret
</code></pre></div></div>
<div class="codecaption">Listing 3: The <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function from Listing 1 compiled to arm64 using <code class="language-plaintext highligher-rouge">clang++ -O3</code> and disassembled using <code class="language-plaintext highligher-rouge">otool -Vt</code>, with some minor tweaks for formatting and readability. <br />My annotations are also included as comments.</div>
<p>Note the use of the ARMv8.1-A <code class="language-plaintext highlighter-rouge">casal</code> instruction.
Apple’s version of Clang defaults to using ARMv8.1-A instructions when compiling for macOS because the M1 chip implements ARMv8.4-A, and since the M1 chip is the first arm64 processor that macOS supports, that means macOS can safely assume a more advanced minimum target instruction set.
Also, the arm64 assembly output in Listing 3 looks almost exactly identical structurally to the Godbolt Compiler Explorer Clang output in <a href="https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html#listing9">Listing 9 from my previous post</a>.
The only differences are small syntactic ones: how the <code class="language-plaintext highlighter-rouge">mov</code> instruction in line 20 specifies a 16 byte (128 bit) SIMD register, some different register choices, and a different ordering of <code class="language-plaintext highlighter-rouge">fmov</code> and <code class="language-plaintext highlighter-rouge">mov</code> instructions in lines 6 and 7.</p>
<p>Finally, let’s take a look at the arm64 assembly that Rosetta 2 generates through the AOT process described earlier.
Disassembling the Rosetta 2 AOT file using <code class="language-plaintext highlighter-rouge">otool -Vt</code> produces the following arm64 assembly; like before, I’m only including the relevant <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> function.
Since the code below switches between <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">w</code> registers a lot, remember that in arm64 assembly, <code class="language-plaintext highlighter-rouge">x0</code>-<code class="language-plaintext highlighter-rouge">x30</code> and <code class="language-plaintext highlighter-rouge">w0</code>-<code class="language-plaintext highlighter-rouge">w30</code> are really the same registers; <code class="language-plaintext highlighter-rouge">x</code> just means use the full 64-bit register, whereas <code class="language-plaintext highlighter-rouge">w</code> just means use the lower 32 bits of the <code class="language-plaintext highlighter-rouge">x</code> register with the same register number.
Also, the <code class="language-plaintext highlighter-rouge">v</code> registers are 128-bit vector registers that are separate from the <code class="language-plaintext highlighter-rouge">x</code>/<code class="language-plaintext highlighter-rouge">w</code> set of registers; <code class="language-plaintext highlighter-rouge">s</code> registers are the bottom 32 bits of <code class="language-plaintext highlighter-rouge">v</code> registers.
In my annotations, I’ll use <code class="language-plaintext highlighter-rouge">x</code> for both <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">w</code> registers, and I’ll use <code class="language-plaintext highlighter-rouge">v</code> for both <code class="language-plaintext highlighter-rouge">v</code> and <code class="language-plaintext highlighter-rouge">s</code> registers.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__Z14addAtomicFloatRNSt3__16atomicIfEEf:
str x5, [x4, #-0x8]! // store value at x5 to ((address in x4) - 8) and
// write calculated address back into x4
mov x5, x4 // x5 = address in x4
.LBB0_1
ldr w0, [x7] // x0 = *arg0 = f0, non-atomically loaded
fmov s1, w0 // v1 = x0 = f0
mov.16b v2, v1 // v2 = v1 = f0
fadd s2, s2, s0 // v2 = v2 + v0 = (f0 + f1)
mov.s w1, v2[0] // x1 = v2 = (f0 + f1)
mov w22, w0 // x22 = x0 = f0
casal w22, w1, [x7] // atomically read the contents of the address stored
// in x7 (*arg0 = f0) and compare with x22;
// if [x7] == x22:
// atomically set the contents of the
// [x7] to the value in x1
// else:
// x22 = value loaded from [x7]
cmp w22, w0 // compare x22 and x0 and store result in N
csel w0, w0, w22, eq // if N==1 { x0 = x0 } else { x0 = x22 }
b.ne .LBB0_1 // if N==0 { goto .LBB0_1 }
mov.16b v0, v1 // v0 = v1 = f0
ldur x5, [x4] // x5 = value at address in x4, using unscaled load
add x4, x4, #0x8 // add 8 to address stored in x4
ldr x22, [x4], #0x8 // x22 = value at address in x4, then x4 += 8
ldp x23, x24, [x21], #0x10 // x23 = value at address in x21,
// x24 = value at ((address in x21) + 8),
// and then x21 += 16
sub x25, x22, x23 // x25 = x22 - x23
cbnz x25, .LBB0_2 // if x22 != x23 { goto .LBB0_2 }
ret x24
.LBB0_2
bl 0x4310 // branch (with link) to address 0x4310
</code></pre></div></div>
<div class="codecaption">Listing 4: The x86-64 assembly from Listing 2 translated to arm64 by Rosetta 2's ahead-of-time translator. Disassembled using <code class="language-plaintext highligher-rouge">otool -Vt</code>, with some minor tweaks for formatting and readability. My annotations are also included as comments.</div>
<p>In some ways, we can see similarities between the Rosetta 2 arm64 assembly in Listing 4 and the natively compiled arm64 assembly in Listing 3, but there are also a lot of things in the Rosetta 2 arm64 assembly that look very different from the natively compiled arm64 assembly.
The core functionality in lines 9 through 21 of Listing 4 bears a strong resemblance to the core functionality in lines 5 through 19 of Listing 3; both versions use a <code class="language-plaintext highlighter-rouge">fadd</code>, followed by a <code class="language-plaintext highlighter-rouge">casal</code> instruction to implement the atomic comparison, then follow with a <code class="language-plaintext highlighter-rouge">cmp</code> to compare the expected and actual outcomes, and then have some logic about whether or not to jump back to the top of the loop.
However, if we look more closely at the core functionality in the Rosetta 2 version, we can see some oddities.
In preparing for the <code class="language-plaintext highlighter-rouge">fadd</code> instruction on line 9, the Rosetta 2 version does a <code class="language-plaintext highlighter-rouge">fmov</code> followed by a 16-byte <code class="language-plaintext highlighter-rouge">mov</code> into register <code class="language-plaintext highlighter-rouge">v2</code>, and then the <code class="language-plaintext highlighter-rouge">fadd</code> takes a value from <code class="language-plaintext highlighter-rouge">v2</code>, adds the value to what is in <code class="language-plaintext highlighter-rouge">v0</code>, and stores the result back into <code class="language-plaintext highlighter-rouge">v2</code>.
The 16-byte move is pointless!
Instead of two <code class="language-plaintext highlighter-rouge">mov</code> instructions and an <code class="language-plaintext highlighter-rouge">fadd</code> where the first source register and the destination register are the same, a better version would be to omit the second <code class="language-plaintext highlighter-rouge">mov</code> instruction and instead just do <code class="language-plaintext highlighter-rouge">fadd s2, s1, s0</code>.
Indeed, in Listing 3 we can see that the natively compiled version just uses a single <code class="language-plaintext highlighter-rouge">mov</code> and does <code class="language-plaintext highlighter-rouge">fadd s2, s1, s0</code>.
So, what’s going on here?</p>
<p>Things begin to make more sense once we look at the x86-64 assembly that the Rosetta 2 version is translated from.
In Listing 2’s x86-64 version, the <code class="language-plaintext highlighter-rouge">addss</code> instruction only has two inputs because the first source register is always also the destination register.
So, the x86-64 version has no choice but to use a few extra <code class="language-plaintext highlighter-rouge">mov</code> instructions to make sure values that are needed later aren’t overwritten by the <code class="language-plaintext highlighter-rouge">addss</code> instruction; whatever value needs to be in <code class="language-plaintext highlighter-rouge">xmm2</code> during the <code class="language-plaintext highlighter-rouge">addss</code> instruction must also be squirreled away in a second location if that value is still needed after <code class="language-plaintext highlighter-rouge">addss</code> is executed.
Since the Rosetta 2 arm64 assembly is a direct translation from the x86-64 assembly, the extra <code class="language-plaintext highlighter-rouge">mov</code> needed in the x86-64 version gets translated into the extraneous <code class="language-plaintext highlighter-rouge">mov.16b</code> in Listing 4, and the two-operand x86-64 <code class="language-plaintext highlighter-rouge">addss</code> gets translated into a strange looking <code class="language-plaintext highlighter-rouge">fadd</code> where the same register is duplicated for the first source and destination operands; this duplication is a direct one-to-one mapping to what <code class="language-plaintext highlighter-rouge">addss</code> does.</p>
<p>I think from the above we can see two very interesting things about Rosetta 2’s translation.
On one hand, the fact that the overall structure of the core functionality in the Rosetta 2 and natively compiled versions is so similar is very impressive, especially when considering that Rosetta 2 had absolutely no access to the original high-level C++ source code!
I guess my example function here is a very simple test case, but nonetheless I was impressed that Rosetta 2’s output overall isn’t too bad.
On the other hand though, the Rosetta 2 version does have small oddities and inefficiencies that arise from doing a direct mechanical translation from x86-64.
Since Rosetta 2 has no access to the original source code, no context for what the code does, and no ability to build any kind of higher-level syntactic understanding, the best Rosetta 2 can really do is a direct mechanical translation with a relatively high level of conservatism with respect to preserving what the original x86-64 code is doing on an instruction-by-instruction basis.
I don’t think that this is actually a fault in Rosetta 2; I think it’s actually pretty much the only reasonable solution.
I don’t know how Rosetta 2’s translator is actually implemented internally, but my guess is that the translator is parsing the x86-64 machine code, generating some kind of IR, and then lowering that IR back to arm64 (who knows, maybe it’s even LLVM IR).
But, even if Rosetta 2 is generating some kind of IR, that IR at best can only correspond well to the IR that was generated by the last optimization pass in the original compilation to x86-64, and in any last optimization pass, a huge amount of higher level context is likely already lost from the original source program.
Short of doing heroic amounts of program analysis, there’s nothing Rosetta 2 can do about this lost higher level context, and even if implementing all of that program analysis were worthwhile (which it almost certainly is not), there’s only so much that static analysis can do anyway.
I guess all of the above is a long way of saying: looking at the above example, I think Rosetta 2’s output is really impressive and noticeably better than I would have guessed beforehand, but at the same time the inherent advantage that natively compiling to arm64 has is obvious.</p>
<p>However, all of the above is just looking at the core functionality of the original function.
If we look at the arm64 assembly surrounding this core functionality in Listing 4 though, we can see some truly strange stuff.
The Rosetta 2 version is doing a ton of pointer arithmetic and moving around addresses and stuff, and operands seem to be passed into the function using the wrong registers (<code class="language-plaintext highlighter-rouge">x7</code> instead of <code class="language-plaintext highlighter-rouge">x0</code>).
What is this stuff all about?
The answer lies in how the Rosetta 2 runtime works, and in what makes a Rosetta 2 AOT Mach-O file different from a standard macOS Mach-O binary.</p>
<p>One key fundamental difference between Rosetta 2 AOT binaries and regular arm64 macOS binaries is that Rosetta 2 AOT binaries use <em>a completely different ABI</em> from standard arm64 macOS.
On Apple platforms, the ABI used for normal arm64 Mach-O binaries is largely based on the standard ARM-developed arm64 ABI <a href="https://developer.arm.com/documentation/den0024/a/The-ABI-for-ARM-64-bit-Architecture/Register-use-in-the-AArch64-Procedure-Call-Standard/Parameters-in-general-purpose-registers">[ARM Holdings 2015]</a>, with some small differences <a href="https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms">[Apple 2020]</a> in function calling conventions and how some data types are implemented and aligned.
However, Rosetta 2 AOT binaries use an arm64-ized version of the System V AMD64 ABI, with a direct mapping between x86_64 and arm64 registers <a href="https://ffri.github.io/ProjectChampollion/part1/">[Nakagawa 2021]</a>.
This different ABI means that intermixing native arm64 code and Rosetta 2 arm64 code is not possible (or at least not at all practical), and this difference is also the explanation for why the Rosetta 2 assembly uses unusual registers for passing parameters into the function.
In the standard arm64 ABI calling convention, registers <code class="language-plaintext highlighter-rouge">x0</code> through <code class="language-plaintext highlighter-rouge">x7</code> are used to pass function arguments 0 through 7, with the rest going on the stack.
In the System V AMD64 ABI calling convention, function arguments are passed using registers <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">rcx</code>, <code class="language-plaintext highlighter-rouge">r8</code>, and <code class="language-plaintext highlighter-rouge">r9</code> for arguments 0 through 5 respectively, with everything else on the stack in reverse order.
In the arm64-ized version of the System V AMD64 ABI that Rosetta 2 AOT uses, the x86-64 <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">rcx</code>, <code class="language-plaintext highlighter-rouge">r8</code>, and <code class="language-plaintext highlighter-rouge">r9</code> registers map to the arm64 <code class="language-plaintext highlighter-rouge">x7</code>, <code class="language-plaintext highlighter-rouge">x6</code>, <code class="language-plaintext highlighter-rouge">x2</code>, <code class="language-plaintext highlighter-rouge">x1</code>, <code class="language-plaintext highlighter-rouge">x8</code>, and <code class="language-plaintext highlighter-rouge">x9</code>, respectively <a href="https://ffri.github.io/ProjectChampollion/part1/">[Nakagawa 2021]</a>.
So, that’s why in line 6 of Listing 4 we see a load from an address stored in <code class="language-plaintext highlighter-rouge">x7</code> instead of <code class="language-plaintext highlighter-rouge">x0</code>, because <code class="language-plaintext highlighter-rouge">x7</code> maps to x86-64’s <code class="language-plaintext highlighter-rouge">rdi</code> register, which is the first register used for passing arguments in the System V AMD64 ABI <a href="https://wiki.osdev.org/System_V_ABI">[OSDev 2018]</a>.
If we look at the corresponding instruction on line 9 of Listing 2, we can see that the x86-64 code does indeed use a <code class="language-plaintext highlighter-rouge">mov</code> instruction from the address stored in <code class="language-plaintext highlighter-rouge">rdi</code> to get the first function argument.</p>
<p>As for all of the pointer arithmetic and address trickery in lines 23 through 28 of Listing 4, I’m not 100% sure what it is for, but I have a guess.
Earlier I mentioned that <code class="language-plaintext highlighter-rouge">.aot</code> binaries cannot run like a normal binary and instead require some special memory mapping to work; I think all of this pointer arithmetic may have to do with that.
The way that the Rosetta 2 runtime interacts with the AOT arm64 code is that both the runtime and the AOT arm64 code are mapped into the same memory space at startup and the program counter is set to the entry point of the Rosetta 2 runtime; while running, the AOT arm64 code frequently can jump back into the Rosetta 2 runtime because the Rosetta 2 runtime is what handles things like translating x86_64 addresses into addresses in the AOT arm64 code <a href="https://ffri.github.io/ProjectChampollion/part1/">[Nakagawa 2021]</a>.
The Rosetta 2 runtime also directs system calls to native frameworks, which helps improve performance; this property of the Rosetta 2 runtime means that if an x86-64 binary does most of its work by calling macOS frameworks, the translated Rosetta 2 AOT binary can still run very close to native speed (as an interesting aside: Microsoft is adding a much more generalized version of this concept to Windows 11’s counterpart to Rosetta 2: Windows 11 on ARM will allow arbitrary mixing of native arm64 code and translated x86-64 code <a href="https://blogs.windows.com/windowsdeveloper/2021/06/28/announcing-arm64ec-building-native-and-interoperable-apps-for-windows-11-on-arm/">[Sweetgall 2021]</a>).
Finally, when a Rosetta 2 AOT binary is run, not only are the AOT arm64 code and the Rosetta 2 runtime mapped into the running program’s memory; the original x86-64 binary is mapped in as well.
The AOT binary that Rosetta 2 generates does not actually contain any constant data from the original x86-64 binary; instead, the AOT file references the constant data from the x86-64 binary, which is why the x86-64 binary also needs to be loaded in.
My guess is that the pointer arithmetic stuff happening in the end of Listing 4 is possibly either to calculate offsets to stuff in the x86-64 binary, or to calculate offsets into the Rosetta 2 runtime itself.</p>
<p>Now that we have a better understanding of what Rosetta 2 is actually doing under the hood and how good the translated arm64 code is compared with natively compiled arm64 code, how does Rosetta 2 actually perform in the real world?
I compared Takua Renderer running as native arm64 code versus as x86-64 code running through Rosetta 2 on four different scenes, and generally running through Rosetta 2 yielded about 65% to 70% of the performance of running as native arm64 code.
The results section at the end of this post contains the detailed numbers and data.
Generally, I’m very impressed with this amount of performance for emulating x86-64 code on an arm64 processor, especially when considering that with high-performance code like Takua Renderer, Rosetta 2 has close to zero opportunities to provide additional performance by calling into native system frameworks.
As can be seen in the <a href="#perftesting">data in the results section</a>, even more impressive is the fact that even running at 70% of native speed, x86-64 Takua Renderer running on the M1 chip through Rosetta 2 is often on-par with or <em>even faster</em> than x86-64 Takua Renderer running natively on a contemporaneous current-generation 2019 16-inch MacBook Pro with a 6-core Intel Core i7-9750H processor!</p>
<p><strong>TSO Memory Ordering on the M1 Processor</strong></p>
<p>As I covered extensively in my previous post, one major crucial architectural difference between arm64 and x86-64 is in memory ordering: arm64 is a weakly ordered architecture, whereas x86-64 is a strongly ordered architecture <a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">[Preshing 2012]</a>.
Any system emulating x86-64 binaries on an arm64 processor needs to overcome this memory ordering difference, which means emulating strong memory ordering on a weak memory architecture.
Unfortunately, doing this memory ordering emulation in software is extremely difficult and extremely inefficient, since emulating strong memory ordering on a weak memory architecture means providing stronger memory ordering guarantees than the hardware actually provides.
This memory ordering emulation is widely understood to be one of the main reasons why Microsoft’s x86 emulation mode for Windows on ARM incurs a much higher performance penalty compared with Rosetta 2, even though the two systems have broadly similar architectures <a href="https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on-arm-x86-emulation">[Hickey et al. 2021]</a> at a high level.</p>
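<p>To make the cost of doing this in software a bit more concrete: under x86-64’s TSO model, ordinary loads already behave roughly like acquire loads and ordinary stores already behave roughly like release stores. An emulator running x86-64 code on a weakly ordered arm64 CPU therefore cannot translate a plain x86-64 <code class="language-plaintext highlighter-rouge">mov</code> into a plain arm64 <code class="language-plaintext highlighter-rouge">ldr</code> or <code class="language-plaintext highlighter-rouge">str</code>; to be safe, it has to conservatively emit acquire/release operations (or insert explicit barriers) on essentially every memory access. The sketch below expresses that idea in C++ terms; it is purely illustrative and is not how any real emulator is actually implemented:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <cstdint>

// Purely illustrative: a guest memory word touched by emulated x86-64 code.
std::atomic<uint64_t> guestMemoryWord{0};

void emulatedX86Store(uint64_t value) {
    // A plain x86-64 store already carries release-like ordering under TSO, so
    // a conservative software translation on arm64 has to use a release store
    // (which compiles to stlr) instead of a plain store (str).
    guestMemoryWord.store(value, std::memory_order_release);
}

uint64_t emulatedX86Load() {
    // Likewise, a plain x86-64 load already carries acquire-like ordering, so
    // the conservative translation is an acquire load (ldar) instead of a
    // plain load (ldr).
    return guestMemoryWord.load(std::memory_order_acquire);
}

int main() {
    emulatedX86Store(42);
    return emulatedX86Load() == 42 ? 0 : 1;
}
</code></pre></div></div>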
<p>Apple’s solution to the difficult problem of emulating strong memory ordering in software was to… just completely bypass the problem altogether.
Rosetta 2 does nothing whatsoever to emulate strong memory ordering in software; instead, Rosetta 2 provides strong memory ordering through <em>hardware</em>.
Apple’s M1 processor has an unusual feature for an ARM processor: the M1 processor has optional total store memory ordering (TSO) support!
By default, the M1 processor only provides the weak memory ordering guarantees that the arm64 architecture specifies, but for x86-64 binaries running under Rosetta 2, the M1 processor is capable of switching to strong memory ordering in hardware on a core-by-core basis.
This capability is a great example of the type of hardware-software integration that Apple is able to accomplish by owning and building the entire tech stack from the software all the way down to the silicon.</p>
<p>Actually, the M1 is not the first Apple Silicon chip to have TSO support.
The A12Z chip that was in the Apple Silicon DTK also has TSO support, and the A12Z is known to be a re-binned but otherwise identical variant of the A12X chip from 2018, so we can likely safely assume that the TSO hardware support has been present (albeit unused) as far back as the 2018 iPad Pro!
However, the M1 processor’s TSO implementation does have a significant leg up on the implementation in the A12Z.
Both the M1 and the A12Z implement a version of ARM’s big.LITTLE technology, where the processor contains two different types of CPU cores: lower-power energy-efficient cores, and high-power performance cores.
On the A12Z, hardware TSO support is only implemented in the high-power performance cores, whereas in the M1, hardware TSO support is implemented on both the efficiency and performance cores.
As a result, on the A12Z-based Apple Silicon DTK, Rosetta 2 can only use four out of eight total CPU cores on the chip, whereas on M1-based Macs, Rosetta 2 can use all eight CPU cores.</p>
<p>I should mention here that, interestingly, the A12Z and M1 are actually not the first ARM CPUs to implement TSO as the memory model <a href="https://threedots.ovh/blog/2021/02/cpus-with-sequential-consistency/">[Threedots 2021]</a>.
Remember, when ARM specifies weak ordering in the architecture, what this actually means is that any arm64 implementation can actually choose to have any kind of stronger memory model since code written for a weaker memory model should also work correctly on a stronger memory model; only going the other way doesn’t work.
NVIDIA’s Denver and Carmel CPU microarchitectures (found in various NVIDIA Tegra and Xaviar system-on-a-chips) are also arm64 designs that implement a sequentially consistency memory model.
If I had to guess, I would guess that Denver and Carmel’s sequential consistency memory model is a legacy of the Denver Projects’s origins as a project to build an x86-64 CPU; the project was shifted to arm64 before release.
Fujitsu’s A64FX processor is another arm64 design that implements TSO as its memory model, which makes sense since the A64FX processor is meant for use in supercomputers as a successor to Fujitsu’s previous SPARC-based supercomputer processors, which also implemented TSO.
However, to the best of my knowledge, Apple’s A12Z and M1 are unique in their ability to execute in <em>both</em> the usual weak ordering mode and TSO mode.</p>
<p>To me, probably the most interesting thing about hardware TSO support in Apple Silicon is that switching ability.
Even more interesting is that the switching ability doesn’t require a reboot or anything like that; each core can be <em>independently</em> switched between strong and weak memory ordering on-the-fly at runtime through software.
On Apple Silicon processors, hardware TSO support is enabled by modifying a special register named <code class="language-plaintext highlighter-rouge">actlr_el1</code>; this register is actually <a href="https://developer.arm.com/documentation/100442/0100/register-descriptions/aarch64-system-registers/actlr-el1--auxiliary-control-register--el1">defined by the arm64 specification</a> as an implementation-defined auxiliary control register.
Since <code class="language-plaintext highlighter-rouge">actlr_el1</code> is implementation-defined, Apple has chosen to use it for toggling TSO and possibly for toggling other, so far publicly unknown special capabilities.
However, the <code class="language-plaintext highlighter-rouge">actlr_el1</code> register, being a special register, cannot be modified by normal code; modifications to <code class="language-plaintext highlighter-rouge">actlr_el1</code> can only be done by the kernel, and the only thing in macOS that the kernel enables TSO for is Rosetta 2…</p>
<p>…at least by default!
Shortly after Apple started shipping out Apple Silicon DTKs last year, <a href="https://saagarjha.com">Saagar Jha</a> figured out how to allow any program to toggle TSO mode through <a href="https://github.com/saagarjha/TSOEnabler">a custom kernel extension</a>.
The way the TSOEnabler kext works is extremely clever; the kext searches through the kernel to find where the kernel is modifying <code class="language-plaintext highlighter-rouge">actlr_el1</code> and then traces backwards to figure out what pointer the kernel is reading a flag from for whether or not to enable TSO mode.
Instead of setting TSO mode itself, the kext then intercepts the pointer to the flag and writes to it, allowing the kernel to handle all of the TSO mode setup work since there’s some other stuff that needs to happen in addition to modifying <code class="language-plaintext highlighter-rouge">actlr_el1</code>.
Out of sheer curiosity, I compiled the TSOEnabler kext and installed it on my M1 Mac Mini to give it a try!
I don’t suggest installing and using TSOEnabler casually, and definitely not for normal everyday use; installing a custom self-compiled, unsigned kext on modern macOS requires disabling SIP.
However, I already had SIP disabled due to my earlier Rosetta 2 AOT exploration, and so I figured why not give this a shot before I reset everything and reenable SIP.</p>
<p>The first thing I wanted to try was a simple test to confirm that the TSOEnabler kext was working correctly.
In my last post, I wrote about a case where weak memory ordering was exposing a bug in some code written around incrementing an atomic integer; the “canonical” example of this specific type of situation is <a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">Jeff Preshing’s multithreaded atomic integer incrementer example</a> using <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code>.
I adapted Jeff Preshing’s example for my test; in this test, two threads both increment a shared integer counter 1000000 times, with exclusive access to the integer guarded using an atomic integer flag.
Operations on the atomic integer flag use <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code>.
On strongly-ordered CPUs, using <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> works fine and at the end of the program, the value of the shared integer counter is always 2000000 as expected.
However, on weakly-ordered CPUs, weak memory ordering means that two threads can end up in a race condition to increment the shared integer counter; as a result, on weakly-ordered CPUs, at the end of the program the value of the shared integer counter is very often something slightly less than 2000000.
The key modification I made to this test program was to enable the M1 processor’s hardware TSO mode for each thread; if hardware TSO mode is correctly enabled, then the value of the shared integer counter should always end up being 2000000.
If you want to try for yourself, Listing 5 below includes the test program in its entirety; compile using <code class="language-plaintext highlighter-rouge">c++ tsotest.cpp -std=c++11 -o tsotest</code>.
The test program takes a single input parameter: <code class="language-plaintext highlighter-rouge">1</code> to enable hardware TSO mode, and anything else to leave TSO mode disabled.
Remember, to use this program, you must have compiled and installed the TSOEnabler kernel extension mentioned above.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <iostream>
#include <thread>
#include <sys/sysctl.h>
static void enable_tso(bool enable_) {
int enable = int(enable_);
size_t size = sizeof(enable);
int err = sysctlbyname("kern.tso_enable", NULL, &size, &enable, size);
assert(err == 0);
}
int main(int argc, char** argv) {
bool useTSO = false;
if (argc > 1) {
useTSO = std::stoi(std::string(argv[1])) == 1 ? true : false;
}
std::cout << "TSO is " << (useTSO ? "enabled" : "disabled") << std::endl;
std::atomic<int> flag(0);
int sharedValue = 0;
auto counter = [&](bool enable) {
enable_tso(enable);
int count = 0;
while (count < 1000000) {
int expected = 0;
if (flag.compare_exchange_strong(expected, 1, std::memory_order_relaxed)) {
// Lock was successful
sharedValue++;
flag.store(0, std::memory_order_relaxed);
count++;
}
}
};
std::thread thread1([&]() { counter(useTSO); });
std::thread thread2([&]() { counter(useTSO); });
thread2.join();
thread1.join();
std::cout << sharedValue << std::endl;
}
</code></pre></div></div>
<div class="codecaption">Listing 5: Jeff Preshing's weakly ordered atomic integer test program, modified to support using the M1 processor's hardware TSO mode.</div>
<p>Running my test program indicated that the kernel extension was working properly!
In the screenshot below, I check that the Mac I’m running on has an arm64 processor, then I compile the test program and check that the output is a native arm64 binary, and then I run the test program four times each with and without hardware TSO mode enabled.
As expected, with hardware TSO mode disabled, the program counts slightly less than 2000000 increments on the shared atomic counter, whereas with hardware TSO mode enabled, the program counts exactly 2000000 increments every time:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/tsotest.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/tsotest.png" alt="Figure 4: Building, examining, and running the test program to demonstrate hardware TSO mode disabled and then enabled." /></a></p>
<p>Being able to enable hardware TSO mode in a native arm64 binary outside of Rosetta 2 actually does have some practical uses.
After I confirmed that the kernel extension was working correctly, I temporarily hacked hardware TSO mode into Takua Renderer’s native arm64 version, which allowed me to further verify that everything was working correctly with all of the various weakly ordered atomic fixes that I described in my previous post.
As mentioned in my previous post, comparing renders across different processor architectures is difficult for a variety of reasons, and previously comparing Takua Renderer running on a weakly ordered CPU versus on a strongly ordered CPU required comparing renders made on arm64 versus renders made on x86-64.
Using the M1’s hardware TSO mode though, I was able to compare renders made on exactly the same processor, which confirmed that everything works correctly!
After doing this test, I then removed the hardware TSO mode from Takua Renderer’s native arm64 version.</p>
<p>One silly idea I tried was to <em>disable</em> hardware TSO mode from inside of Rosetta 2, just to see what would happen.
Rosetta 2 does not support running x86-64 kernel extensions on arm64; all macOS kernel extensions must be native to the architecture they are running on.
However, as mentioned earlier, the Rosetta 2 runtime bridges system framework calls from inside of x86-64 binaries to their native arm64 counterparts, and this includes <code class="language-plaintext highlighter-rouge">sysctl</code> calls!
So we can actually call <code class="language-plaintext highlighter-rouge">sysctlbyname("kern.tso_enable")</code> from inside of an x86-64 binary running through Rosetta 2, and Rosetta 2 will pass the call along correctly to the native TSOEnabler kernel extension, which will then properly set hardware TSO mode.
For a simple test, I added a bit of code to test if a binary is running under Rosetta 2 or not and compiled the test program from Listing 5 for x86-64.
For the sake of completeness, here is how to check if a process is running under Rosetta 2; this code sample was provided by Apple in <a href="https://developer.apple.com/videos/play/wwdc2020/10686/">a WWDC 2020 talk about Apple Silicon</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Use "sysctl.proc_translated" to check if running in Rosetta
// Returns 1 if running in Rosetta
int processIsTranslated() {
int ret = 0;
size_t size = sizeof(ret);
// Call the sysctl and if successful return the result
if (sysctlbyname("sysctl.proc_translated", &ret, &size, NULL, 0) != -1)
return ret;
// If "sysctl.proc_translated" is not present then must be native
if (errno == ENOENT)
return 0;
return -1;
}
</code></pre></div></div>
<div class="codecaption">Listing 6: Example code from Apple on how to check if the current process is running through Rosetta 2.</div>
<p>In Figure 5, I build the test program from Listing 5 as an x86-64 binary, with the Rosetta 2 detection function from Listing 6 added in.
I then check that the system architecture is arm64 and that the compiled program is x86-64, and run the test program with TSO disabled from inside of Rosetta 2.
The program reports that it is running through Rosetta 2 and reports that TSO is disabled, and then proceeds to report slightly less than 2000000 increments to the shared atomic counter:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/tsotest.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Jul/takua-on-arm-pt2/tsotest.png" alt="Figure 5: Building, examining, and running the test program to demonstrate hardware TSO mode disabled and then enabled." /></a></p>
<p>Of course, being able to disable hardware TSO mode from inside of Rosetta 2 is only a curiosity; I can’t really think of any practical reason why anyone would ever want to do this.
I guess one possible answer is to try to claw back some performance whilst running through Rosetta 2, since the hardware TSO mode does have a tangible performance impact, but this answer isn’t actually valid, since there is no guarantee that x86-64 binaries running through Rosetta 2 will work correctly with hardware TSO mode enabled.
The simple example here only works precisely because it is extremely simple; I also tried hacking disabling hardware TSO mode into the x86-64 version of Takua Renderer and running that through Rosetta 2.
The result was that this hacked version of Takua Renderer would run for only a fraction of a second before running into a hard crash from somewhere inside of TBB.
More complex x86-64 programs not working correctly or even crashing with hardware TSO mode disabled shouldn’t be surprising, since the x86-64 code itself can have assumptions about strong memory ordering baked into whatever optimizations the code was compiled with.
As mentioned earlier, running a program written and compiled with weak memory ordering assumptions on a stronger memory model should work correctly, but running a program written and compiled with strong memory ordering assumptions on a weaker memory model can cause problems.</p>
<p>Speaking of the performance of hardware TSO mode, the last thing I tried was measuring the performance impact of enabling hardware TSO mode.
I hacked enabling hardware TSO mode into the native arm64 version of Takua Renderer, with the idea being that by comparing the Rosetta 2, custom TSO-enabled native arm64, and default TSO-disabled native arm64 versions of Takua Renderer, I could get a better sense of exactly how much performance cost there is to running the M1 with TSO enabled, and how much of the performance cost of Rosetta 2 comes from less efficient translated arm64 code versus from TSO-enabled mode.
The <a href="#perftesting">results section at the end of this post</a> contains the exact numbers and data for the four scenes that I tested; the general trend I found was that native arm64 code with hardware TSO enabled ran about 10% to 15% slower than native arm64 code with hardware TSO disabled.
When comparing with Rosetta 2’s overall performance, I think we can reasonably estimate that on the M1 chip, hardware TSO is responsible for somewhere between a third to a half of the performance discrepancy between Rosetta 2 and native weakly ordered arm64 code.</p>
<p>Apple Silicon’s hardware TSO mode is a fascinating example of Apple extending the base arm64 architecture and instruction set to accelerate application-specific needs.
Hardware TSO mode to support and accelerate Rosetta 2 is just the start; Apple Silicon is well known to already contain some other interesting custom extensions as well.
For example, Apple Silicon contains an entirely new, so far undocumented arm64 ISA extension centered around doing fast matrix operations for Apple’s “Accelerate” framework, which supports various deep learning and image processing applications <a href="https://gist.githubusercontent.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f/raw/60d491aeb70863363af1d4bdf4b8ade9be486af3/aarch64_amx.py">[Johnson 2020]</a>.
This extension, called AMX (for Apple Matrix coprocessor), is separate but likely related to the “Neural Engine” hardware <a href="https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1">[Engheim 2021]</a> that ships on the M1 chip alongside the M1’s arm64 processor and custom Apple-designed GPU.
Recent open-source code releases from Apple <a href="https://mobile.twitter.com/_saagarjha/status/1398959235954745346">also hint at</a> future Apple Silicon chips having dedicated built-in hardware for doing branch prediction around Objective-C’s objc_msgSend, which would considerably accelerate message passing in Cocoa apps.</p>
<p><strong>Embree on arm64 using sse2neon</strong></p>
<p>As mentioned earlier, porting Takua and Takua’s dependencies was relatively easy and straightforward and in large part worked basically out-of-the-box, because Takua and most of Takua’s dependencies are written in vanilla C++.
Gotchas like memory-ordering correctness in atomic and multithreaded code aside, porting vanilla C++ code between x86-64 and arm64 largely just involves recompiling, and popular modern compilers such as Clang, GCC, and MSVC all have mature, robust arm64 backends today.
However, for code written using inline assembly or architecture-specific vector SIMD intrinsics, recompilation is not enough to get things working on a different processor architecture.</p>
<p>A huge proportion of the raw compute power in modern processors is actually located in vector <a href="https://en.wikipedia.org/wiki/SIMD">SIMD instruction set extensions</a>, such as the various SSE and AVX extensions found in modern x86-64 processors and the NEON and upcoming SVE extensions found in arm64.
For workloads that can benefit from vectorization, using SIMD extensions means up to a 4x speed boost over scalar code when using SSE or NEON, and potentially even more using AVX or SVE.
One way to utilize SIMD extensions is just to write scalar C++ code like normal and let the compiler auto-vectorize the code at compile-time.
However, relying on auto-vectorization to leverage SIMD extensions in practice can be surprisingly tricky.
In order for compilers to be able to efficiently auto-vectorize code that was written to be scalar, compilers need to be able to deduce and infer an enormous amount of context and knowledge about what the code being compiled actually does, and doing this kind of work is extremely difficult and extremely prone to defeat by edge cases, complex scenarios, or even just straight up implementation bugs.
The end result is that getting scalar C++ code to go through auto-vectorization well in practice ends up requiring a lot of deep knowledge about how the compiler’s auto-vectorization implementation actually works under the hood, and small innocuous changes can often suddenly lead to the compiler falling back to generating completely scalar assembly.
Without a robust performance test suite, these fallbacks can happen unbeknownst to the programmer; I like the term that my friend <a href="https://twitter.com/superfunc">Josh Filstrup</a> uses for these scenarios: “real rugpull moments”.
Most high-performance applications that require good vectorization usually rely on at least one of several other options: write code directly in assembly utilizing SIMD instructions, write code using SIMD intrinsics, or write code for use with <a href="https://ispc.github.io">ISPC: the Intel SPMD Program Compiler</a>.</p>
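<p>As a hedged illustration of how fragile auto-vectorization can be (the function names below are made up for this example, and the exact behavior depends heavily on the specific compiler, version, and flags), consider the following two nearly identical loops; the first is the kind of straight-line loop that mainstream compilers will usually vectorize at <code class="language-plaintext highlighter-rouge">-O3</code>, while the second only adds a data-dependent early exit, which is often enough to make the compiler silently fall back to purely scalar code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstddef>

// Straight-line loop with no aliasing ambiguity (thanks to __restrict) and no
// data-dependent control flow; mainstream compilers will usually auto-vectorize
// this at -O3, although that is never guaranteed.
void scale(float* __restrict out, const float* __restrict in, float k, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * k;
    }
}

// The same loop with a small, innocuous-looking change: an early exit that
// depends on the data. Depending on the compiler and flags, this is often
// enough to defeat auto-vectorization entirely.
size_t scaleUntilNegative(float* __restrict out, const float* __restrict in, float k, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0.0f) {
            return i;
        }
        out[i] = in[i] * k;
    }
    return n;
}
</code></pre></div></div>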
<p>Writing SIMD code directly in assembly is more or less just like writing regular assembly, just with different instructions and wider registers; SSE uses <code class="language-plaintext highlighter-rouge">XMM</code> registers and many SSE instructions end in either <code class="language-plaintext highlighter-rouge">SS</code> or <code class="language-plaintext highlighter-rouge">PS</code>, AVX uses <code class="language-plaintext highlighter-rouge">YMM</code> registers (with AVX-512 extending these to <code class="language-plaintext highlighter-rouge">ZMM</code> registers), and NEON uses <code class="language-plaintext highlighter-rouge">D</code> and <code class="language-plaintext highlighter-rouge">Q</code> registers.
Since writing directly in assembly is often not desirable for a variety of readability and ease-of-use reasons, writing vector code directly in assembly is not nearly as common as writing vector code in normal C or C++ using vector intrinsics.
Vector intrinsics are functions that look like regular functions from the outside, but within the compiler have a direct one-to-one or near one-to-one mapping to specific assembly instructions.
For SSE and AVX, vector intrinsics are typically found in headers named using the pattern <code class="language-plaintext highlighter-rouge">*mmintrin.h</code>, where <code class="language-plaintext highlighter-rouge">*</code> is a letter of the alphabet corresponding to a specific subset or version of either SSE or AVX (for example, <code class="language-plaintext highlighter-rouge">x</code> for SSE, <code class="language-plaintext highlighter-rouge">e</code> for SSE2, <code class="language-plaintext highlighter-rouge">n</code> for SSE4.2, <code class="language-plaintext highlighter-rouge">i</code> for AVX, etc.).
For NEON, vector intrinsics are typically found in <code class="language-plaintext highlighter-rouge">arm_neon.h</code>.
Vector intrinsics are commonly found in many high-performance codebases, but another powerful and increasingly popular way to vectorize code is by using ISPC.
ISPC compiles a special variant of the C programming language written using a <a href="https://en.wikipedia.org/wiki/SPMD">SPMD, or single-program-multiple-data</a>, programming model to run on SIMD execution units; the idea is that an ISPC program describes what a single lane in a vector unit does, and ISPC itself takes care of making that program run across all of the lanes of the vector unit <a href="https://doi.org/10.1109/InPar.2012.6339601">[Pharr and Mark 2012]</a>.
While this may sound superficially like a form of auto-vectorization, there’s a crucial difference that makes ISPC far more reliable in outputting good vectorized assembly: ISPC bakes a vectorization-friendly programming model directly into the language itself, whereas normal C++ has no such affordances that C++ compilers can rely on.
This SPMD model is broadly very similar to how writing a GPU kernel works, although there are some key differences between SPMD as a programming model and the <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads">SIMT model</a> that GPUs run on (namely, a SPMD program can be at a different point on each lane, whereas a SIMT program keeps the progress across all lanes in lockstep).
A big advantage of using ISPC over vector intrinsics or vector assembly is that ISPC code is basically just normal C code; in fact, ISPC programs can often compile as normal scalar C code with little to no modification.
Since the actual transformation to vector assembly is up to the compiler, writing code for ISPC is far more processor architecture independent than vector intrinsics are; ISPC today includes backends to generate SSE, AVX, and NEON binaries.
<a href="https://pharr.org/matt/blog/2018/04/30/ispc-all">Matt Pharr has a great blog post series</a> that goes into much more detail about the history and motivations behind ISPC and the benefits of using ISPC.</p>
<p>In general, graphics workloads tend to fit the bill well for vectorization, and as a result, graphics libraries often make extensive use of SIMD instructions (actually, a surprisingly large number of problem types can be vectorized, including even <a href="https://github.com/simdjson/simdjson">JSON parsing</a>).
Since SIMD intrinsics are architecture-specific, I didn’t fully expect all of Takua’s dependencies to compile right out of the box on arm64; I expected that a lot of them would contain chunks of code written using x86-64 SSE and/or AVX intrinsics!
However, almost all of Takua’s dependencies compiled without a problem either because they provided arm64 NEON or scalar C++ fallback codepaths for every SSE/AVX codepath, or because they rely on auto-vectorization by the compiler instead of using intrinsics directly.
OpenEXR is an example of the former, while OpenVDB and OpenSubdiv are examples of the latter.
Embree was the notable exception: Embree is heavily vectorized, with code implemented directly using SSE and/or AVX intrinsics and no alternative scalar C++ or arm64 NEON fallback, and Embree also provides an ISPC interface.
Starting with Embree v3.13.0, Embree now provides an arm64 NEON codepath as well, but at the time I first ported Takua to arm64, Embree didn’t come with anything other than SSE and AVX implementations.</p>
<p>Fortunately, Embree is actually written in such a way that porting Embree to different processor architectures with different vector intrinsics is, at least in theory, relatively straightforward.
The Embree codebase internally is written as several different “layers”, where the bottommost layer is located in <code class="language-plaintext highlighter-rouge">embree/common/simd/</code> in the Embree source tree.
As one might be able to guess from the name, this bottommost layer is where all of the core SIMD functionality in Embree is implemented; this part of the codebase implements SIMD wrappers for things like 4/8/16 wide floats, SIMD math operations, and so on.
The rest of the Embree codebase doesn’t really contain many direct vector intrinsics at all; the parts of Embree that actually implement BVH construction and traversal and ray intersection all call into this base SIMD library.
As suggested by <a href="https://ingowald.blog/2018/07/15/cfi-embree-on-arm-power/">Ingo Wald in a 2018 blog post</a>, porting Embree to use something other than SSE/AVX mostly requires just reimplementing this base SIMD wrapper layer, and the rest of Embree should more or less “just work”.</p>
<p>In his blog post, Ingo mentioned experimenting with replacing all of Embree’s base SIMD layer with scalar implementations of all of the vectorized code.
Back in early 2020, as part of my effort to get Takua up and running on arm64 Linux, I actually tried doing a scalar rewrite of the base SIMD layer of Embree as well as a first attempt at porting to arm64.
Overall the process to rewrite to scalar was actually very straightforward; most things were basically just replacing a function that did something with float4 inputs using SSE instructions with a simple loop that iterates over the four floats in a float4.
I did find that in addition to rewriting all of the SIMD wrapper functions to replace SSE intrinsics with scalar implementations, I also had to replace some straight-up inlined x86-64 assembly with equivalent compiler intrinsics; basically all of this code lives in <code class="language-plaintext highlighter-rouge">common/sys/intrinsics.h</code>.
None of the inlined assembly replacement was very complicated either though; most of it was things like replacing an inlined assembly call to x86-64’s <code class="language-plaintext highlighter-rouge">bsf</code> bit-scan-forward instruction with a call to the more portable <code class="language-plaintext highlighter-rouge">__builtin_ctz()</code> trailing-zero-count builtin compiler function.
Embree’s build system also required modifications; since I was just doing this as an initial test, I did a terrible hack job on the CMake scripts and, with some troubleshooting, got things building and running on arm64 Linux.
Unfortunately, the performance of my quick-and-rough scalar Embree port was… very disappointing.
I had hoped that the compiler would be able to do a decent job of autovectorizing the scalar reimplementations of all of the SIMD code, but overall my scalar Embree port on x86-64 was between three and four times slower than standard SSE Embree, which indicated that the compiler basically hadn’t effectively autovectorized anything at all.
This level of performance regression basically meant that my scalar Embree port wasn’t actually significantly faster than Takua’s own internal scalar BVH implementation; the disappointing performance combined with how hacky and rough my scalar Embree port was led me to abandon using Embree on arm64 Linux for the time being.</p>
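<p>To give a sense of what that scalar rewrite looked like, here is a heavily simplified sketch (this is not Embree’s actual code; Embree’s real 4-wide float type and operators are more involved) of replacing one SSE-backed operation in a 4-wide float wrapper with a plain scalar loop over the four lanes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Heavily simplified sketch of a 4-wide float wrapper; not Embree's actual code.
struct vfloat4 {
    float v[4];
};

// The SSE-backed version of an operation in Embree's base SIMD layer looks
// conceptually like this:
//   inline vfloat4 operator+(const vfloat4& a, const vfloat4& b) {
//       return vfloat4(_mm_add_ps(a.m128, b.m128));
//   }

// The scalar replacement keeps the exact same interface, but just loops over
// the four lanes; the hope was that the compiler would auto-vectorize this,
// which in practice it largely did not.
inline vfloat4 operator+(const vfloat4& a, const vfloat4& b) {
    vfloat4 r;
    for (int i = 0; i < 4; i++) {
        r.v[i] = a.v[i] + b.v[i];
    }
    return r;
}
</code></pre></div></div>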
<p>A short while later in the spring of 2020 though, I remembered that Syoyo Fujita had already successfully ported Embree to arm64 with vectorization support!
Actually, Syoyo had started his <a href="https://github.com/lighttransport/embree-aarch64">Embree-aarch64</a> fork three years earlier in 2017 and had kept the project up-to-date with each new upstream official Embree release; I had just forgotten about the project until it popped up in my Twitter feed one day.
The approach that Syoyo took to getting vectorization working in the Embree-aarch64 fork was by using the <a href="https://github.com/DLTcollab/sse2neon">sse2neon</a> project, which implements SSE intrinsics on arm64 using NEON instructions and serves as a drop-in replacement for the various x86-64 <code class="language-plaintext highlighter-rouge">*mmintrin.h</code> headers.
Using sse2neon is actually the same strategy that had previously been used by <a href="https://mightynotes.wordpress.com/2017/01/24/porting-intel-embree-to-arm/">Martin Chang in 2017</a> to port Embree 2.x to work on arm64; Martin’s earlier effort provided the proof-of-concept that paved the way for Syoyo to fork Embree 3.x into Embree-aarch64.
Building the Embree-aarch64 fork on arm64 worked out-of-the-box, and on my Raspberry Pi 4, using Embree-aarch64 with Takua’s Embree backend produced a performance increase over Takua’s internal BVH implementation that was in the general range of what I expected.</p>
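<p>The core idea behind sse2neon is simple but powerful: provide drop-in replacements for the x86-64 <code class="language-plaintext highlighter-rouge">*mmintrin.h</code> headers in which each SSE type and intrinsic is reimplemented in terms of NEON types and intrinsics, so that existing SSE code can compile on arm64 without modification. Below is a bare-bones sketch of the idea (sse2neon’s actual implementation is more careful, going through reinterpret helpers so that the same 128-bit value can be viewed as either floats or integers):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Bare-bones sketch of the sse2neon idea; not sse2neon's actual implementation.
#include <arm_neon.h>

// Back SSE's 128-bit float vector type with NEON's 4-wide float type.
typedef float32x4_t __m128;

// Reimplement SSE intrinsics one-to-one on top of NEON intrinsics.
static inline __m128 _mm_add_ps(__m128 a, __m128 b) {
    return vaddq_f32(a, b);
}

static inline __m128 _mm_mul_ps(__m128 a, __m128 b) {
    return vmulq_f32(a, b);
}

// With enough of these wrappers in place, code written against SSE intrinsics
// compiles unmodified on arm64.
</code></pre></div></div>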
<p>Taking a look at the process that was taken to get Embree-aarch64 to a production-ready state with results that matched x86-64 Embree exactly provides a lot of interesting insights into how NEON works versus how SSE works.
In my previous post I wrote about how getting identical floating point behavior between different processor architectures can be challenging for a variety of reasons; getting floating point behavior to match between NEON and SSE is even harder!
Various NEON instructions such as <code class="language-plaintext highlighter-rouge">rcp</code> and <code class="language-plaintext highlighter-rouge">rsqrt</code> have different levels of accuracy from their corresponding SSE counterparts, which required the Embree-aarch64 project to <a href="https://github.com/lighttransport/embree-aarch64/issues/24">implement more accurate versions</a> of some SSE intrinsics than what sse2neon provided at the time; a lot of these improvements were later contributed back to sse2neon.
I originally was planning to include a deep dive into comparing SSE, NEON, ISPC, sse2neon, and SSE instructions running on Rosetta 2 as part of this post, but the writeup for that comparison has now gotten so large that it’s going to have to be its own post as a later follow-up to this post; stay tuned!</p>
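<p>As a hedged sketch of what those accuracy fixes look like conceptually (the actual Embree-aarch64 and sse2neon code differs in the details), NEON’s hardware reciprocal estimate on its own is noticeably less precise than SSE’s reciprocal instruction, but running the estimate through one or two Newton-Raphson refinement steps using NEON’s reciprocal step instruction brings the result much closer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <arm_neon.h>

// Conceptual sketch of a higher-accuracy 4-wide reciprocal on NEON; the actual
// Embree-aarch64 / sse2neon implementations differ in the details.
static inline float32x4_t recipRefined(float32x4_t x) {
    // Start with the hardware's low-precision reciprocal estimate.
    float32x4_t e = vrecpeq_f32(x);
    // Each Newton-Raphson step computes e = e * (2 - x * e); vrecpsq_f32
    // provides the (2 - x * e) part.
    e = vmulq_f32(e, vrecpsq_f32(x, e));
    e = vmulq_f32(e, vrecpsq_f32(x, e));
    return e;
}
</code></pre></div></div>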
<p>As a bit of an aside: the history of the sse2neon project is a great example of a community forming to build an open-source project around a new need.
The sse2neon project was originally started by John W. Ratcliff at NVIDIA along with a few other NVIDIA folks and <a href="https://github.com/jratcliff63367/sse2neon">implemented only a small subset of SSE</a> that was just enough for their own needs.
However, after posting the project to Github with the MIT license, a community gradually formed around sse2neon and fleshed it out into a full project with full coverage of MMX and all versions of SSE from SSE1 all the way through SSE4.2.
Over the years sse2neon has seen <a href="https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h#L9">contributions and improvements</a> from NVIDIA, Amazon, Google, the Embree-aarch64 project, the Blender project, and recently Apple as part of Apple’s larger slew of contributions to various projects to improve arm64 support for Apple Silicon.</p>
<p>Starting with Embree v3.13.0, released in May 2021, the official main Embree project now has also gained full support for arm64 NEON; I have since switched Takua Renderer’s arm64 builds from using the Embree-aarch64 fork to using the new official arm64 support in Embree v3.13.0.
The approach the official Embree project takes is directly based off of the work that Syoyo Fujita and others did in the Embree-aarch64 fork; sse2neon is used to emulate SSE, and the same math precision improvements that were made in Embree-aarch64 were also adopted upstream by the official Embree project.
Much like Embree-aarch64, the arm64 NEON backend for Embree v3.13.0 does not include ISPC support, even though ISPC has an arm64 NEON backend as well; maybe this will come in the future.
Brecht Van Lommel from the Blender project seems to have done <a href="https://github.com/embree/embree/pull/316">most of the work</a> to upstream Embree-aarch64’s changes, with additional work and additional optimizations from Sven Woop on the Intel Embree team.
Interestingly and excitingly, <a href="https://github.com/embree/embree/pull/330">Apple also recently submitted a patch</a> to the official Embree project that adds AVX2 support on arm64 by treating each 8-wide AVX value as a pair of 4-wide NEON values.</p>
<p><strong>(More) Differences in arm64 versus x86-64</strong></p>
<p>In my previous post and in this post, I’ve covered a bunch of interesting differences and quirks that I ran into and had to take into account while porting from x86-64 to arm64.
There are, of course, far more differences that I didn’t touch on.
However, in this small section, I thought I’d list a couple more small but interesting differences that I ran into and had to think about.</p>
<ul>
<li>arm64 and x86-64 handle float-to-int conversions slightly differently for some edge cases. Specifically, for edge values such as converting a float set to <code class="language-plaintext highlighter-rouge">INF</code> into a uint32_t, arm64 will make a best attempt to find the nearest possible integer to convert to, which would be 4294967295. x86-64, on the other hand, treats the <code class="language-plaintext highlighter-rouge">INF</code> case as basically undefined behavior and defaults to just zero. In path tracing code where occasional infinite values need to be handled for things like edge cases in sampling Dirac distributions, some care needs to be taken to make sure that the renderer is understanding and processing <code class="language-plaintext highlighter-rouge">INF</code> values correctly on both arm64 and x86-64 (see the short example after this list).</li>
<li>Similarly, implicit conversion from signed integers to unsigned integers can have some different behavior between the two platforms. On arm64, negative signed integers get trimmed to zero when implicitly converted to an unsigned integer; for code that must cast between signed and unsigned integers, care must be taken to make sure that all conversions are explicitly cast and that the edge case behavior on arm64 and x86-64 are accounted for.</li>
<li>The signedness of <code class="language-plaintext highlighter-rouge">char</code> is platform specific and defaults to being signed on x86-64 but defaults to being unsigned on ARM architectures <a href="https://www.drdobbs.com/architecture-and-design/portability-the-arm-processor/184405435#">[Harmon 2003]</a>, including arm64. For custom string processing functions, this may have to be taken into account.</li>
<li>x86-64 is always little-endian, but arm64 is a <a href="https://en.wikipedia.org/wiki/Endianness#Bi-endianness">bi-endian</a> architecture that can be either little-endian or big-endian, as set by the operating system at startup time. Most Linux flavors, including Fedora, default to little-endian on arm64, and Apple’s various operating systems all exclusively use little-endian mode on arm64 as well, so this shouldn’t be too much of a problem for most use cases. However, for software that does expect to have to run on both little and big endian systems, endianness has to be taken into account for reading/writing/handling binary data. For example, Takua has a checkpointing system that basically dumps state information from the renderer’s memory straight to disk; these checkpoint files would need to have their endianness checked and handled appropriately if I were to make Takua bi-endian. However, since I don’t expect to ever run my own hobby stuff on a big-endian system, I just have Takua check the endianness at startup right now and refuse to run if the system is big-endian.</li>
</ul>
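<p>Here is a tiny example program illustrating the float-to-integer and char signedness items above; note that converting a float <code class="language-plaintext highlighter-rouge">INF</code> to an unsigned integer is undefined behavior as far as the C++ standard is concerned, so neither architecture’s result is guaranteed by the language itself:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
    // Converting a float INF to uint32_t: arm64 saturates to 4294967295,
    // while x86-64 typically produces 0. Formally this is undefined behavior
    // in C++, so neither result is guaranteed by the language.
    float inf = std::numeric_limits<float>::infinity();
    uint32_t fromInf = static_cast<uint32_t>(inf);
    printf("uint32_t from INF: %u\n", fromInf);

    // char signedness is platform-defined: signed by default on x86-64,
    // unsigned by default on arm64, so this prints -1 on x86-64 and 255 on arm64.
    char c = char(-1);
    printf("char(-1) as int: %d\n", int(c));

    return 0;
}
</code></pre></div></div>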
<p>For more details to look out for when porting x86-64 code to arm64 code on macOS specifically, Apple’s developer documentation <a href="https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code">has a whole article</a> covering various things to consider.
Another fantastic resource for diving into arm64 assembly is Howard Oakley’s <a href="https://eclecticlight.co/2021/07/27/code-in-arm-assembly-rounding-and-arithmetic/">“Code in ARM Assembly” series</a>, which covers arm64 assembly programming on Apple Silicon in extensive detail (the bottom of each article in Howard Oakley’s series contains a table of contents linking out to all of the previous articles in the series).</p>
<div id="perftesting"></div>
<p><strong>(More) Performance Testing</strong></p>
<p>In my previous post, I included performance testing results from my initial port to arm64 Linux, running on a Raspberry Pi 4B.
Now that I have Takua Renderer up and running on a much more powerful M1 Mac Mini with 16 GB of memory, how does performance look on “big” arm64 hardware?
Last time around the machines / processors I compared were a Raspberry Pi 4B, which uses a Broadcom BCM2711 CPU with 4 Cortex-A72 cores dating back to 2015, a 2015 MacBook Air with a 2 core / 4 thread Intel Core i5-5250U CPU, and as an extremely unfair comparison point, my personal workstation with dual Intel Xeon E5-2680 CPUs from 2012 with 8 cores / 16 threads each (16 cores / 32 threads total).
The conclusion last time was that even though the Raspberry Pi 4B’s arm64 processor basically lost in terms of render time on almost every test, the Raspberry Pi 4B was actually the absolute <em>winner</em> by a wide margin when it came to <em>total energy usage</em> per render job.</p>
<p>This time around, since my expectation is that Apple’s M1 chip should be able to perform extremely well, I think my dual-Xeon personal workstation should absolutely be a fair competitor.
In fact, I think the comparison might actually be kind of <em>unfair</em> towards the dual-Xeon workstation, since the processors are from 2012 and were manufactured on the now-ancient 32 nm process, whereas the M1 is made on TSMC’s currently bleeding edge 5 nm process.
So, to give x86-64 more of a fighting chance, I’m also including a 2019 16 inch MacBook Pro with a 6 core / 8 thread Intel Core i7-9750H processor and 32 GB of memory, a.k.a. one of the fastest Intel-based laptops that Apple currently sells.</p>
<p>The first three test scenes are the same as last time: a standard Cornell Box, the glass teacup with ice seen in my <a href="https://blog.yiningkarlli.com/2019/05/nested-dielectrics.html">Nested Dielectrics post</a>, and the bedroom scene from my <a href="https://blog.yiningkarlli.com/2020/02/shadow-terminator-in-takua.html">Shadow Terminator in Takua post</a>.
Last time these three scenes were chosen since they fit in the 4 GB memory constraint that the Raspberry Pi 4B and the 2015 MacBook Air both have.
This time though, since the M1 Mac Mini has a much more modern 16 GB of memory, I’m including one more scene: <a href="https://blog.yiningkarlli.com/2018/02/scandinavian-room-scene.html">my Scandinavian Room scene</a>, as seen in Figure 1 of this post.
The Scandinavian Room scene is a much more realistic example of the type of complexity found in a real production render, and has much more interesting and difficult light transport.
Like before, the Cornell Box is rendered to 16 SPP using unidirectional path tracing and at 1024x1024 resolution, the Tea Cup is rendered to 16 SPP using VCM and at 1920x1080 resolution, and the Bedroom is rendered to 16 SPP using unidirectional path tracing and at 1920x1080 resolution.
Because the Scandinavian Room scene takes much longer to render due to being a much more complex scene, I rendered the Scandinavian Room scene to 4 SPP using unidirectional path tracing and at 1920x1080 resolution.
I left Takua Renderer’s texture caching system enabled for the Scandinavian Room scene, in order to test that the texture caching system was working correctly on arm64.
Using the texture cache could alter the performance results slightly due to disk latency when fetching texture tiles to populate the texture cache, but the texture cache hit rate after the first SPP on this scene is so close to 100% that it basically doesn’t make a difference beyond that point. To account for this, I actually rendered the Scandinavian Room scene to 5 SPP, threw out the timings for the first SPP, and counted the times for the last 4 SPP.</p>
<p>Each test’s recorded time below is the average of the three best runs, chosen out of five runs in total for each processor.
For the M1 processor, I actually did three different types of runs, which are presented separately below.
I did one test with the native arm64 build of Takua Renderer, a second test with a version of the native arm64 build hacked to run with the M1’s hardware TSO mode enabled, and a third test running the x86-64 build on the M1 through Rosetta 2.
Also, for the Cornell Box, Tea Cup, and Bedroom scenes, I used Takua Renderer’s internal BVH implementation instead of Embree in order to match the tests from the last post, which were done before I had Embree working on arm64.
The Scandinavian Room tests use Embree as the traverser instead.</p>
<p>Here are the results:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">CORNELL BOX</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1024x1024, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">440.627 s</td>
<td style="text-align: left">approx 1762.51 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">272.053 s</td>
<td style="text-align: left">approx 1088.21 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">36.6183 s</td>
<td style="text-align: left">approx 1139.79 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">41.7408 s</td>
<td style="text-align: left">approx 500.890 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">28.0611 s</td>
<td style="text-align: left">approx 224.489 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">32.5621 s</td>
<td style="text-align: left">approx 260.497 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">42.5824 s</td>
<td style="text-align: left">approx 340.658 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">TEA CUP</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, VCM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">2205.072 s</td>
<td style="text-align: left">approx 8820.32 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">2237.136 s</td>
<td style="text-align: left">approx 8948.56 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">174.872 s</td>
<td style="text-align: left">approx 5593.60 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">158.729 s</td>
<td style="text-align: left">approx 1904.75 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">115.253 s</td>
<td style="text-align: left">approx 922.021 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">128.299 s</td>
<td style="text-align: left">approx 1026.39 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">164.289 s</td>
<td style="text-align: left">approx 1314.31 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">BEDROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">5653.66 s</td>
<td style="text-align: left">approx 22614.64 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">4900.54 s</td>
<td style="text-align: left">approx 19602.18 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">310.35 s</td>
<td style="text-align: left">approx 9931.52 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">362.29 s</td>
<td style="text-align: left">approx 4347.44 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">256.68 s</td>
<td style="text-align: left">approx 2053.46 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">291.69 s</td>
<td style="text-align: left">approx 2333.50 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">366.01 s</td>
<td style="text-align: left">approx 2928.08 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">SCANDINAVIAN ROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">119.16 s</td>
<td style="text-align: left">approx 3813.18 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">151.81 s</td>
<td style="text-align: left">approx 1821.80 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">109.94 s</td>
<td style="text-align: left">approx 879.55 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">124.95 s</td>
<td style="text-align: left">approx 999.57 s</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">153.66 s</td>
<td style="text-align: left">approx 1229.32 s</td>
</tr>
</tbody>
</table>
<p>The first takeaway from these new results is that Intel CPUs have advanced enormously over the past decade!
My wife’s 2019 16 inch MacBook Pro comes extremely close to matching my 2012 dual Xeon workstation’s performance on most tests and even wins on the Tea Cup scene, which is extremely impressive considering that the Intel Core i7-9750H’s MSRP is around a tenth of what the dual Intel Xeon E5-2680s would have cost new in 2012, and the Intel Core i7-9750H also uses 5 times less energy at peak than the dual Intel Xeon E5-2680s do.</p>
<p>The real story though, is in the Apple M1 processor.
Quite simply, the Apple M1 processor completely smokes everything else on the list, often by margins that are downright stunning.
Depending on the test, the M1 processor beats the dual Xeons by anywhere between 10% and 30% in wall time and beats the 2019 MacBook Pro’s Core i7 by even more.
In terms of core-seconds, which is a measure of the overall performance of each processor core that approximates how long the render would have taken completely single-threaded, the M1’s lead is even larger; each of the M1’s processor cores is somewhere between 4 and 6 times faster than the dual Xeons’ individual cores and between 2 and 3 times faster than the more contemporaneous Intel Core i7-9750H’s individual cores.
The even more impressive result from the M1 though, is that even running the x86-64 version of Takua Renderer using Rosetta 2’s dynamic translation system, the M1 still matches <em>or beats</em> the Intel Core i7-9750H.</p>
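<p>As a concrete example of how the core-seconds numbers are derived for the M1 results: the native arm64 Cornell Box render took 28.0611 seconds of wall time while fully occupying all 8 of the M1’s cores, which works out to 28.0611 × 8 ≈ 224.489 core-seconds; since the M1 has no simultaneous multithreading, the number of hardware threads and the number of cores in use are the same.</p>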
<p>Below is the breakdown of energy utilization for each test; the total energy used for each render is the wall clock render time multiplied by the maximum TDP of each processor to get watt-seconds, which is then divided by 3600 seconds per hour to get watt-hours.
Maximum TDP is used since Takua Renderer pushes processor utilization to 100% during each render.
As a point of comparison, I’ve also included all of the results from my previous post:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">CORNELL BOX</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1024x1024, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">0.4895 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">1.1336 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">2.6450 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">0.5218 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.1169 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.1357 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.1774 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">TEA CUP</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, VCM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">2.4500 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">9.3214 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">12.6297 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">1.9841 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.4802 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.5346 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.6845 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">BEDROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">6.2819 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">20.4189 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">22.4142 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">4.5286 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">1.0695 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">1.2154 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">1.5250 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">SCANDINAVIAN ROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">8.606 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i7-9750H:</td>
<td style="text-align: center">45 W</td>
<td style="text-align: left">1.8976 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Native:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.4581 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 TSO-Enabled:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.5206 Wh</td>
</tr>
<tr>
<td style="text-align: right">Apple M1 Rosetta 2:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">0.6403 Wh</td>
</tr>
</tbody>
</table>
<p>Again the first takeaway from these results is just how much processor technology has improved overall in the past decade; the total energy usage by the modern Intel Core i7-9750H and Apple M1 is leaps and bounds better than the dual Xeons from 2012.
Compared to what was essentially the most powerful workstation hardware that Intel sold a little under a decade ago, a modern Intel laptop chip can now do the same work in about the same amount of time for roughly 5x <em>less</em> energy consumption.</p>
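<p>To make that arithmetic concrete using the Bedroom scene as an example: the dual Intel Xeon E5-2680 system took 310.35 seconds at a combined 260 W maximum TDP, or (310.35 × 260) / 3600 ≈ 22.41 Wh, while the Intel Core i7-9750H took 362.29 seconds at a 45 W maximum TDP, or (362.29 × 45) / 3600 ≈ 4.53 Wh, which is where the roughly 5x difference comes from.</p>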
<p>The M1 though, once again entirely lives in a class of its own.
Running the native arm64 build, the M1 processor is <em>4 times more energy efficient</em> than the Intel Core i7-9750H to complete the same task.
The M1’s maximum TDP is only a third of the Intel Core i7-9750H’s maximum TDP, but the actual final energy utilization is a quarter because the M1’s faster performance means that the M1 runs for much less time than the Intel Core i7-9750H.
In other words, running native code, the M1 is both faster <em>and</em> more energy efficient than the Intel Core i7-9750H.
This result wouldn’t be impressive if the comparison was between the M1 and some low-end, power-optimized ultra-portable Intel chip, but that’s not what the comparison is with.
The comparison is with the Intel Core i7-9750H, which is a high-end, 45 W maximum TDP part that MSRPs for $395.
In comparison, the M1 is estimated to cost about $50, and the entire M1 Mac Mini only has a 39 W TDP total at maximum load; the M1 itself is reported to have a 15 W maximum TDP.
Where the comparison between the M1 and the Intel Core i7-9750H gets even more impressive is when looking at the M1’s energy utilization running x86-64 code under Rosetta 2: the M1 is <em>still</em> about 3 times more energy efficient than the Intel Core i7-9750H to do the same work.
Put another way, the M1 is an arm64 processor that can run emulated x86-64 code <em>faster than a modern native x86-64 processor that costs 5x more and uses 3x more energy can</em>.</p>
<p>Another interesting observation is that for the same work, the M1 is actually more energy efficient than the Raspberry Pi 4B as well!
In the case of the Raspberry Pi 4B comparison, while the M1’s maximum TDP is 3.75x higher than the Broadcom BCM2711’s maximum TDP, the M1 is also around 20x faster to complete each render; the M1’s massive performance uplift more than offsets the higher maximum TDP.</p>
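<p>As an aside, the energy totals in the tables above line up with a simple back-of-the-envelope model: take each processor’s max TDP, multiply by the render’s wall-clock time, and convert from watt-seconds to watt-hours. Below is a minimal sketch of that arithmetic; the two wall times plugged in are the 8-thread M1 native and Rosetta 2 numbers from the scaling tables later in this post. Since this model treats max TDP as a constant draw for the entire render, the results should be read as rough upper-bound estimates rather than measurements of actual instantaneous power consumption.</p>
<pre><code class="language-cpp">#include &lt;cstdio&gt;

// Rough energy estimate: max TDP (watts) multiplied by wall-clock render time
// (seconds), converted from watt-seconds (joules) to watt-hours. This treats max
// TDP as a constant draw, so it is an upper-bound sketch, not a measurement.
static double estimateWattHours(double maxTdpWatts, double wallTimeSeconds) {
    return maxTdpWatts * wallTimeSeconds / 3600.0;
}

int main() {
    // 15 W max TDP and the 8-thread Scandinavian Room wall times from the M1
    // scaling tables below: ~109.94 s native, ~153.66 s under Rosetta 2.
    std::printf("M1 native:    %.4f Wh\n", estimateWattHours(15.0, 109.9437));
    std::printf("M1 Rosetta 2: %.4f Wh\n", estimateWattHours(15.0, 153.6646));
    return 0;
}
</code></pre>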
<p>Another aspect of the M1 processor that I was curious enough about to test further is the M1’s big.LITTLE implementation.
The M1 has four “Firestorm” cores and four “Icestorm” cores, where Firestorm cores are high-performance but also use a ton of energy, and Icestorm cores are extremely energy-efficient but are also commensurately less performant.
I wanted to know just how much of the overall performance of the M1 was coming from the big Firestorm cores, and just how much slower the Icestorm cores are.
So, I did a simple thread scaling test where I did successive renders using 1 all the way through 8 threads.
I don’t know of a good way on the M1 to explicitly pin a given thread to a specific kind of core; on the A12Z, the easy way to pin to the high-performance cores is to just enable hardware TSO mode, since the A12Z only implements hardware TSO on its high-performance cores, but this is no longer the case on the M1.
But, I figured that the underlying operating system’s thread scheduler should be smart enough to notice that Takua Renderer is a performance-hungry job and schedule threads onto any available high-performance cores before falling back to the energy-efficiency cores.</p>
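<p>Takua’s actual job system is more complex than this, but just to make the test setup concrete, below is a minimal sketch of the kind of harness used for this thread scaling experiment: run a fixed amount of total work split across 1 through 8 threads, record the wall time for each run, and derive the speedup and core-seconds numbers that appear in the tables below. The busyWork() function here is just a hypothetical stand-in for rendering a fixed number of samples, and as described above, the sketch leaves core placement entirely up to the operating system’s scheduler.</p>
<pre><code class="language-cpp">#include &lt;chrono&gt;
#include &lt;cstdio&gt;
#include &lt;thread&gt;
#include &lt;vector&gt;

// Hypothetical stand-in for one thread's share of a render; in the real test this
// is Takua rendering its share of a fixed total sample count.
static void busyWork(long iterations) {
    volatile double x = 0.0;
    for (long i = 0; i &lt; iterations; i++) { x += 1.0 / double(i + 1); }
}

int main() {
    const long totalWork = 2000000000L;  // fixed total work, divided among threads
    double baselineWallTime = 0.0;
    for (int numThreads = 1; numThreads &lt;= 8; numThreads++) {
        auto start = std::chrono::steady_clock::now();
        std::vector&lt;std::thread&gt; threads;
        for (int i = 0; i &lt; numThreads; i++) {
            threads.emplace_back(busyWork, totalWork / numThreads);
        }
        for (auto&amp; t : threads) { t.join(); }
        std::chrono::duration&lt;double&gt; wall = std::chrono::steady_clock::now() - start;
        if (numThreads == 1) { baselineWallTime = wall.count(); }
        double wtSpeedup = baselineWallTime / wall.count();   // ideally == numThreads
        double coreSeconds = wall.count() * numThreads;
        double csMultiplier = baselineWallTime / coreSeconds; // ideally == 1.0
        std::printf("%d threads: %.2f s wall, %.4fx speedup, %.2f core-s, %.4fx CS\n",
                    numThreads, wall.count(), wtSpeedup, coreSeconds, csMultiplier);
    }
    return 0;
}
</code></pre>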
<p>Here are the results on the Scandinavian Room scene for native arm64, native arm64 with TSO-enabled, and x86-64 running using Rosetta 2:</p>
<table>
<thead>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">M1 Native</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center">Threads:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: center">WT Speedup:</th>
<th style="text-align: center">Core-Seconds:</th>
<th style="text-align: center">CS Multiplier:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1 (1 big, 0 LITTLE)</td>
<td style="text-align: center">575.6787 s</td>
<td style="text-align: center">1.0x</td>
<td style="text-align: center">575.6786 s</td>
<td style="text-align: center">1.0x</td>
</tr>
<tr>
<td style="text-align: center">2 (2 big, 0 LITTLE)</td>
<td style="text-align: center">292.521 s</td>
<td style="text-align: center">1.9679x</td>
<td style="text-align: center">585.042 s</td>
<td style="text-align: center">0.9839x</td>
</tr>
<tr>
<td style="text-align: center">3 (3 big, 0 LITTLE)</td>
<td style="text-align: center">197.04 s</td>
<td style="text-align: center">2.9216x</td>
<td style="text-align: center">591.1206 s</td>
<td style="text-align: center">0.9738x</td>
</tr>
<tr>
<td style="text-align: center">4 (4 big, 0 LITTLE)</td>
<td style="text-align: center">148.9617 s</td>
<td style="text-align: center">3.8646x</td>
<td style="text-align: center">595.8466 s</td>
<td style="text-align: center">0.9661x</td>
</tr>
<tr>
<td style="text-align: center">5 (4 big, 1 LITTLE)</td>
<td style="text-align: center">137.6307 s</td>
<td style="text-align: center">4.1827x</td>
<td style="text-align: center">688.1536 s</td>
<td style="text-align: center">0.8365x</td>
</tr>
<tr>
<td style="text-align: center">6 (4 big, 2 LITTLE)</td>
<td style="text-align: center">128.9223 s</td>
<td style="text-align: center">4.4653x</td>
<td style="text-align: center">773.535 s</td>
<td style="text-align: center">0.7442x</td>
</tr>
<tr>
<td style="text-align: center">7 (4 big, 3 LITTLE)</td>
<td style="text-align: center">120.496 s</td>
<td style="text-align: center">4.7775x</td>
<td style="text-align: center">843.4713 s</td>
<td style="text-align: center">0.6825x</td>
</tr>
<tr>
<td style="text-align: center">8 (4 big, 4 LITTLE)</td>
<td style="text-align: center">109.9437 s</td>
<td style="text-align: center">5.2361x</td>
<td style="text-align: center">879.5476 s</td>
<td style="text-align: center">0.6545x</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">M1 TSO-Enabled</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center">Threads:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: center">WT Speedup:</th>
<th style="text-align: center">Core-Seconds:</th>
<th style="text-align: center">CS Multiplier:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1 (1 big, 0 LITTLE)</td>
<td style="text-align: center">643.9846 s</td>
<td style="text-align: center">1.0x</td>
<td style="text-align: center">643.9846 s</td>
<td style="text-align: center">1.0x</td>
</tr>
<tr>
<td style="text-align: center">2 (2 big, 0 LITTLE)</td>
<td style="text-align: center">323.8036 s</td>
<td style="text-align: center">1.9888x</td>
<td style="text-align: center">647.6073 s</td>
<td style="text-align: center">0.9944x</td>
</tr>
<tr>
<td style="text-align: center">3 (3 big, 0 LITTLE)</td>
<td style="text-align: center">220.4093 s</td>
<td style="text-align: center">2.9217x</td>
<td style="text-align: center">661.2283 s</td>
<td style="text-align: center">0.9739x</td>
</tr>
<tr>
<td style="text-align: center">4 (4 big, 0 LITTLE)</td>
<td style="text-align: center">168.9733 s</td>
<td style="text-align: center">3.8111x</td>
<td style="text-align: center">675.8943 s</td>
<td style="text-align: center">0.9527x</td>
</tr>
<tr>
<td style="text-align: center">5 (4 big, 1 LITTLE)</td>
<td style="text-align: center">153.849 s</td>
<td style="text-align: center">4.1858x</td>
<td style="text-align: center">769.2453 s</td>
<td style="text-align: center">0.8371x</td>
</tr>
<tr>
<td style="text-align: center">6 (4 big, 2 LITTLE)</td>
<td style="text-align: center">143.7426 s</td>
<td style="text-align: center">4.4801x</td>
<td style="text-align: center">862.4576 s</td>
<td style="text-align: center">0.7466x</td>
</tr>
<tr>
<td style="text-align: center">7 (4 big, 3 LITTLE)</td>
<td style="text-align: center">132.7233 s</td>
<td style="text-align: center">4.8520x</td>
<td style="text-align: center">929.0633 s</td>
<td style="text-align: center">0.6931x</td>
</tr>
<tr>
<td style="text-align: center">8 (4 big, 4 LITTLE)</td>
<td style="text-align: center">124.9456 s</td>
<td style="text-align: center">5.1541x</td>
<td style="text-align: center">999.5683 s</td>
<td style="text-align: center">0.6442x</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">M1 Rosetta 2</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: center"> </th>
<th style="text-align: center"> </th>
</tr>
<tr>
<th style="text-align: center">Threads:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: center">WT Speedup:</th>
<th style="text-align: center">Core-Seconds:</th>
<th style="text-align: center">CS Multiplier:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1 (1 big, 0 LITTLE)</td>
<td style="text-align: center">806.6843 s</td>
<td style="text-align: center">1.0x</td>
<td style="text-align: center">806.68433 s</td>
<td style="text-align: center">1.0x</td>
</tr>
<tr>
<td style="text-align: center">2 (2 big, 0 LITTLE)</td>
<td style="text-align: center">412.186 s</td>
<td style="text-align: center">1.9570x</td>
<td style="text-align: center">824.372 s</td>
<td style="text-align: center">0.9785x</td>
</tr>
<tr>
<td style="text-align: center">3 (3 big, 0 LITTLE)</td>
<td style="text-align: center">280.875 s</td>
<td style="text-align: center">2.8720x</td>
<td style="text-align: center">842.625 s</td>
<td style="text-align: center">0.9573x</td>
</tr>
<tr>
<td style="text-align: center">4 (4 big, 0 LITTLE)</td>
<td style="text-align: center">207.0996 s</td>
<td style="text-align: center">3.8951x</td>
<td style="text-align: center">828.39966 s</td>
<td style="text-align: center">0.9737x</td>
</tr>
<tr>
<td style="text-align: center">5 (4 big, 1 LITTLE)</td>
<td style="text-align: center">189.322 s</td>
<td style="text-align: center">4.2609x</td>
<td style="text-align: center">946.608 s</td>
<td style="text-align: center">0.8521x</td>
</tr>
<tr>
<td style="text-align: center">6 (4 big, 2 LITTLE)</td>
<td style="text-align: center">175.0353 s</td>
<td style="text-align: center">4.6086x</td>
<td style="text-align: center">1050.2133 s</td>
<td style="text-align: center">0.7681x</td>
</tr>
<tr>
<td style="text-align: center">7 (4 big, 3 LITTLE)</td>
<td style="text-align: center">166.1286 s</td>
<td style="text-align: center">4.8557x</td>
<td style="text-align: center">1162.9033 s</td>
<td style="text-align: center">0.6936x</td>
</tr>
<tr>
<td style="text-align: center">8 (4 big, 4 LITTLE)</td>
<td style="text-align: center">153.6646 s</td>
<td style="text-align: center">5.2496x</td>
<td style="text-align: center">1229.3166 s</td>
<td style="text-align: center">0.6562x</td>
</tr>
</tbody>
</table>
<p>In the above tables, WT speedup is how many times faster that given test was than the baseline single-threaded render; WT speedup is a measure of multithreading scaling efficiency.
The closer WT speedup is to the number of threads, the better the multithreading scaling efficiency; with perfect multithreading scaling efficiency, we’d expect the WT speedup number to be exactly the same as the number of threads.
The CS Multiplier value is another way to measure multithreading scaling efficiency; the closer the CS Multiplier number is to exactly 1.0, the closer each test is to achieving perfect multithreading scaling efficiency.</p>
<p>Since this test ran Takua Renderer in unidirectional path tracing mode, and depth-first unidirectional path tracing is largely trivially parallelizable using a simple parallel_for (okay, it’s not quite so simple once things like texture caching and learned path guiding data structures come into play, but close enough for now), my expectation for Takua Renderer is that on a system with homogeneous cores, multithreading scaling should be very close to perfect (assuming a fair scheduler in the underlying operating system).
Looking at the first four threads, which are all using the M1’s high-performance “big” Firestorm cores, close-to-perfect multithreading scaling efficiency is exactly what we see.
Adding the next four threads though, which use the M1’s low-performance energy-efficient “LITTLE” Icestorm cores, the multithreading scaling efficiency drops dramatically.
This drop in multithreading scaling efficiency is expected, since the Icestorm cores are far less performant than the Firestorm cores, but the <em>amount</em> that multithreading scaling efficiency drops by is what is interesting here, since that drop gives us a good estimate of just how much less performant the Icestorm cores are.
The answer is that the Icestorm cores are roughly a quarter as performant as the high-performance Firestorm cores.
However, according to Apple, the Icestorm cores only use a tenth of the energy that the Firestorm cores do; a 4x performance drop for a 10x drop in energy usage is very impressive.</p>
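<p>To make the “trivially parallelizable using a simple parallel_for” point a bit more concrete, here is a stripped-down sketch of the overall shape that a depth-first unidirectional path tracer’s main loop takes when parallelized using TBB. To be clear, none of this is Takua’s actual code; the Scene type and traceRadiance() function below are just hypothetical placeholders, and a real renderer has to deal with per-thread sampler state, texture caches, path guiding data structures, and so on. The key property is simply that every pixel’s samples are independent of every other pixel’s samples, which is what makes the pixel loop safe to hand off to a parallel_for:</p>
<pre><code class="language-cpp">#include &lt;tbb/blocked_range.h&gt;
#include &lt;tbb/parallel_for.h&gt;
#include &lt;vector&gt;

struct Vec3 { float x = 0.0f, y = 0.0f, z = 0.0f; };

// Hypothetical placeholders for the renderer's actual scene and integrator; a real
// traceRadiance() traces a full light path through the scene for one sample.
struct Scene {};
static Vec3 traceRadiance(const Scene&amp;, int x, int y, int s) {
    return Vec3{ float(x % 256) / 255.0f, float(y % 256) / 255.0f, float(s % 16) / 15.0f };
}

void renderFrame(const Scene&amp; scene, int width, int height, int spp,
                 std::vector&lt;Vec3&gt;&amp; framebuffer) {
    framebuffer.assign(size_t(width) * size_t(height), Vec3());
    // Every pixel's samples are independent, so the loop over pixels is trivially
    // parallel; TBB splits the pixel range into chunks and schedules the chunks
    // across however many worker threads are available.
    tbb::parallel_for(tbb::blocked_range&lt;size_t&gt;(0, size_t(width) * size_t(height)),
                      [&amp;](const tbb::blocked_range&lt;size_t&gt;&amp; range) {
        for (size_t p = range.begin(); p != range.end(); p++) {
            int x = int(p % size_t(width));
            int y = int(p / size_t(width));
            Vec3 accumulated;
            for (int s = 0; s &lt; spp; s++) {
                Vec3 radiance = traceRadiance(scene, x, y, s);
                accumulated.x += radiance.x;
                accumulated.y += radiance.y;
                accumulated.z += radiance.z;
            }
            framebuffer[p] = Vec3{ accumulated.x / float(spp),
                                   accumulated.y / float(spp),
                                   accumulated.z / float(spp) };
        }
    });
}
</code></pre>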
<div id="conclusion"></div>
<p><strong>Conclusion to Part 2</strong></p>
<p>There’s really no way to overstate what a colossal achievement Apple’s M1 processor is; compared with almost every modern x86-64 processor in its class, it achieves significantly more performance for much less cost and much less energy.
The even more amazing thing to think about is that the M1 is Apple’s <em>low end</em> Mac processor and likely will be the slowest arm64 chip to ever power a shipping Mac (the A12Z powering the DTK is slower, but the DTK is not a shipping consumer device); future Apple Silicon chips will only be even faster.
Combined with other extremely impressive high-performance arm64 chips such as Fujitsu’s A64FX supercomputer CPU, NVIDIA’s upcoming Grace CPU, Ampere’s monster 80-core Altra CPU, and Amazon’s Graviton2 CPU used in AWS, I think the future for high-end arm64 looks very bright.</p>
<p>That being said though, x86-64 chips aren’t exactly sitting still either.
In the comparisons above I don’t have any modern AMD Ryzen chips, entirely because I personally don’t have access to any Ryzen-based systems at the moment.
However, AMD has been making enormous advancements in both performance and energy efficiency with their Zen series of x86-64 microarchitectures, and the current Zen 3 microarchitecture thoroughly bests Intel in both performance and energy efficiency.
Intel is not standing still either, with ambitious plans to fight AMD for the x86-64 performance crown, and I’m sure both companies have no intention of taking the rising threat from arm64 lying down.</p>
<p>We are currently in a very exciting period of enormous advances in modern processor technology, with multiple large, well funded, very serious players competing to outdo each other.
For the end user, no matter who comes out on top and what happens, the end result is ultimately a win: faster chips using less energy for lower prices.
Now that I have Takua Renderer fully working with parity on both x86-64 and arm64, I’m ready to take advantage of each new advancement!</p>
<p><strong>Acknowledgements</strong></p>
<p>For both the last post and this post, I owe <a href="https://twitter.com/superfunc">Josh Filstrup</a> an enormous debt of gratitude for proofreading, giving plenty of constructive and useful feedback and suggestions, and for being a great discussion partner over the past year on many of the topics covered in this miniseries.
Also an enormous thanks to my wife, <a href="http://harmonymli.com/">Harmony Li</a>, who was patient with me while I took ages with the porting work and then was patient again with me as I took even longer to get these posts written.
Harmony also helped me brainstorm through various topics and provided many useful suggestions along the way.
Finally, thanks to you, the reader, for sticking with me through these two giant blog posts!</p>
<p><strong>References</strong></p>
<p>Apple. 2020. <a href="https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code">Addressing Architectural Differences in Your macOS Code</a>. Retrieved July 19, 2021.</p>
<p>Apple. 2020. <a href="https://developer.apple.com/documentation/apple-silicon/building-a-universal-macos-binary">Building a Universal macOS Binary</a>. Retrieved June 22, 2021.</p>
<p>Apple. 2020. <a href="https://developer.apple.com/videos/play/wwdc2020/10686/">Explore the New System Architecture of Apple Silicon Macs</a>. Retrieved June 15, 2021.</p>
<p>Apple. 2020. <a href="https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms">Writing ARM64 Code for Apple Platforms</a>. Retrieved June 26, 2021.</p>
<p>ARM Holdings. 2015. <a href="https://developer.arm.com/documentation/den0024/a/The-ABI-for-ARM-64-bit-Architecture/Register-use-in-the-AArch64-Procedure-Call-Standard/Parameters-in-general-purpose-registers">Parameters in General-Purpose Registers</a>. In <em>ARM Cortex-A Series Programmer’s Guide for ARMv8-A</em>. Retrieved June 26, 2021.</p>
<p>ARM Holdings. 2017. <a href="https://developer.arm.com/documentation/100442/0100/register-descriptions/aarch64-system-registers/actlr-el1--auxiliary-control-register--el1">ACTLR_EL1, Auxiliary Control Register, EL1</a>. In <em>ARM Cortex-A55 Core Technical Reference Manual</em>. Retrieved June 26, 2021.</p>
<p>Martin Chang. 2017. <a href="https://mightynotes.wordpress.com/2017/01/24/porting-intel-embree-to-arm/">Porting Intel Embree to ARM</a>. In <em>MightyNotes: A Developer’s Blog</em>. Retrieved July 18, 2021.</p>
<p>Erik Engheim. 2021. <a href="https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1">The Secret Apple M1 Coprocessor</a>. Retrieved July 23, 2021.</p>
<p>Trevor Harmon. 2003. <a href="https://www.drdobbs.com/architecture-and-design/portability-the-arm-processor/184405435#">Portability & the ARM Processor</a>. In <em>Dr. Dobb’s</em>. Retrieved July 19, 2021.</p>
<p>Shawn Hickey, Matt Wojiakowski, Shipa Sharma, David Coulter, Theano Petersen, Mike Jacobs, and Michael Satran. 2021. <a href="https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on-arm-x86-emulation">How x86 Emulation works on ARM</a>. In <em>Windows on ARM</em>. Retrieved June 26, 2021.</p>
<p>Saagar Jha. 2020. <a href="https://github.com/saagarjha/TSOEnabler">TSOEnabler</a>. Retrieved June 15, 2021.</p>
<p>Dougall Johnson. 2020. <a href="https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f">AMX: Apple Matrix Coprocessor</a>. Retrieved July 23, 2021.</p>
<p>LLVM Project. 2021. <a href="https://llvm.org/docs/CommandGuide/llvm-lipo.html">llvm-lipo - LLVM Tool for Manipulating Universal Binaries</a>. Retrieved June 22, 2021.</p>
<p>LLVM Project. 2021. <a href="https://llvm.org/docs/CommandGuide/llvm-objdump.html">llvm-objdump - LLVM’s object file dumper</a>. Retrieved June 22, 2021.</p>
<p>Koh M. Nakagawa. 2021. <a href="https://ffri.github.io/ProjectChampollion/part1/">Reverse-Engineering Rosetta 2 Part 1: Analyzing AOT Files and the Rosetta 2 Runtime</a>. In <em>Project Champollion</em>. Retrieved June 23, 2021.</p>
<p>Koh M. Nakagawa. 2021. <a href="https://ffri.github.io/ProjectChampollion/part2/">Reverse-Engineering Rosetta 2 Part 2: Analyzing Other aspects of Rosetta 2 Runtime and AOT Shared Cache Files</a>. In <em>Project Champollion</em>. Retrieved June 23, 2021.</p>
<p>Howard Oakley. 2020. <a href="https://eclecticlight.co/2020/07/28/universal-binaries-inside-fat-headers/">Universal Binaries: Inside Fat Headers</a>. In <em>The Eclectic Light Company</em>. Retrieved June 22, 2021.</p>
<p>Howard Oakley. 2021. <a href="https://eclecticlight.co/2021/07/27/code-in-arm-assembly-rounding-and-arithmetic/">Code in ARM Assembly Series</a>. In <em>The Eclectic Light Company</em>. Retrieved July 19, 2021.</p>
<p>OSDev. 2018. <a href="https://wiki.osdev.org/System_V_ABI">System V ABI</a>. Retrieved June 26, 2021.</p>
<p>Matt Pharr. 2018. <a href="https://pharr.org/matt/blog/2018/04/30/ispc-all">The Story of ISPC</a>. In <em>Matt Pharr’s Blog</em>. Retrieved July 18, 2021.</p>
<p>Matt Pharr and William R. Mark. 2012. <a href="https://doi.org/10.1109/InPar.2012.6339601">ispc: A SPMD compiler for high-performance CPU programming</a>. In <em>2012 Innovative Parallel Computing (InPar)</em>.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">This Is Why They Call It a Weakly-Ordered CPU</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Marc Sweetgall. 2021. <a href="https://blogs.windows.com/windowsdeveloper/2021/06/28/announcing-arm64ec-building-native-and-interoperable-apps-for-windows-11-on-arm/">Announcing ARM64EC: Building Native and Interoperable Apps for Windows 11 on ARM</a>. In <em>Windows Developers Blog</em>. Retrieved June 26, 2021.</p>
<p>Threedots. 2021. <a href="https://threedots.ovh/blog/2021/02/cpus-with-sequential-consistency/">Arm CPUs with Sequential Consistency</a>. In <em>Random Blog</em>. Retrieved June 26, 2021.</p>
<p>Ingo Wald. 2018. <a href="https://ingowald.blog/2018/07/15/cfi-embree-on-arm-power/">CfI: Embree on ARM/Power/…?</a>. In <em>Ingo Wald’s Blog</em>. Retrieved July 18, 2021.</p>
<p>Amy Williams, Steve Barrus, R. Keith Morley, and Peter Shirley. 2005. <a href="https://doi.org/10.1080/2151237X.2005.10129188">An Efficient and Robust Ray-Box Intersection Algorithm</a>. <em>Journal of Graphics Tools</em>. 10, 1 (2005), 49-54.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Endianness">Endianness</a>. Retrieved July 19, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/SIMD">SIMD</a>. Retrieved July 18, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads">Single Instruction, Multiple Threads</a>. Retrieved July 18, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/SPMD">SPMD</a>. Retrieved July 18, 2021.</p>
https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html
Porting Takua Renderer to 64-bit ARM- Part 1
2021-05-29T00:00:00+00:00
2021-05-29T00:00:00+00:00
Yining Karl Li
<p>For almost its entire existence my hobby renderer, Takua Renderer, has built and run on Mac, Windows, and Linux on x86-64.
I maintain Takua on all three major desktop operating systems because I routinely run and use all three, and because I’ve found that building with different compilers on different platforms is a good way to make sure that I don’t have code that is actually wrong but just happens to work because of the implementation quirks of a particular compiler and / or platform.
As of last year, Takua Renderer now also runs on 64-bit ARM, for both Linux and Mac!
64-bit ARM is often called either aarch64 or arm64; these two terms are interchangeable and mean the same thing (aarch64 is the official name for 64-bit ARM and is what Linux tends to use, while arm64 is the name that Apple and Microsoft’s tools tend to use).
For the sake of consistency, I’ll use the term arm64.</p>
<p>This post is the first of a two-part writeup of the process I undertook to port Takua Renderer to run on arm64, along with interesting stuff that I learned along the way.
In this first part, I’ll write about motivation and the initial port I undertook in the spring to arm64 Linux (specifically Fedora).
I’ll also write about how arm64 and x86-64’s memory ordering guarantees differ and what that means for lock-free code, and I’ll also do some deeper dives into topics such as floating point differences between different processors and a case study examining how code compiles to x86-64 versus to arm64.
In the second part, I’ll write about porting to arm64-based Apple Silicon Macs and I’ll also write about getting Embree up and running on ARM, creating Universal Binaries, and some other miscellaneous topics.</p>
<div id="motivation"></div>
<p><strong>Motivation</strong></p>
<p>So first, a bit of a preamble: why port to arm64 at all?
Today, basically most, if not all, of the animation/VFX industry renders on x86-64 machines (and a vast majority of those machines are likely running Linux), so pretty much all contemporary production rendering development happens on x86-64.
However, this has not always been true!
A long long time ago, much of the computer graphics world was based on MIPS hardware running SGI’s IRIX Unix variant; in the early 2000s, as SGI’s custom hardware began to fall behind the performance-per-dollar, performance-per-watt, and even absolute performance that commodity x86-based machines could offer, the graphics world undertook a massive migration to the current x86 world that we live in today.
Apple undertook a massive migration from PowerPC to x86 in the mid/late 2000s for similar reasons.</p>
<p>At this point, an ocean of text has been written about why it is that x86 (and by (literal) extension x86-64) became the dominant ISA in desktop computing and in the server space.
One common theory that I like is that x86’s dominance was a classic example of <a href="https://en.wikipedia.org/wiki/Disruptive_innovation#Disruptive_technology">disruptive innovation</a> from the low end.
A super short summary of disruptive innovation from the low end is that sometimes, a new player enters an existing market with a product that is much less capable but also much cheaper than existing competing products.
By being so much cheaper, the new product can generate a new, larger market that existing competing products can’t access due to their higher cost or different set of requirements or whatever.
As a result, the new product gets massive investment since the new product is the only thing that can capture this new larger market, and in turn this massive influx of investment allows the new player to iterate faster and rapidly grow its product in capabilities until the new player becomes capable of overtaking the old market as well.
This theory maps well to x86; x86-based desktop PCs started off being much cheaper but also much less capable than specialized hardware such as SGI machines, but the investment that poured into the desktop PC space allowed x86 chips to rapidly grow in absolute performance capability until they were able to overtake specialized hardware in basically every comparable metric.
At that point, moving to x86 became a no-brainer for many industries, including the computer graphics realm.</p>
<p>I think that ARM is following the same disruptive innovation path that x86 did, only this time the starting “low end” point is smartphones and tablets, which is an even lower starting point than desktop PCs were.
More importantly, I think we’re now at a tipping point for ARM.
For many years now, ARM chips have offered better performance-per-dollar and performance-per-watt than any x86-64 chip from Intel or AMD, and the point where arm64 chips can overtake x86-64 chips in absolute performance seems plausibly within sight over the next few years.
Notably, Amazon’s in-house Graviton2 arm64 CPU and Apple’s M1 arm64-based Apple Silicon chip are both already highly competitive in absolute performance terms with high end consumer x86-64 CPUs, while consuming less power and costing less.
Actually, I think that this trend should have been obvious to anyone paying attention to Apple’s A-series chips since the A9 chip was released in 2015.</p>
<p>In cases of disruptive innovation from the low end, the outer edge of the absolute high end is often the last place where the disruption reaches.
One of the interesting things about the high-end rendering field is that high-end rendering is one of a relatively small handful of applications that sits at the absolute outer edge of high end compute performance.
All of the major animation and VFX studios have render farms (either on-premises or in the cloud) with core counts somewhere in the tens of thousands of cores; these render farms have more similarities with supercomputers than they do with a regular consumer desktop or laptop.
I don’t know that anyone has actually tried this, but my guess is that if someone benchmarked any major animation or VFX studio’s render farm using the <a href="https://en.wikipedia.org/wiki/LINPACK_benchmarks">LINPACK supercomputer benchmark</a>, the score would sit very respectably somewhere in the upper half of the <a href="https://www.top500.org">TOP500 supercomputer list</a>.
With the above in mind, the fact that the fastest supercomputer in the world is now an arm64-based system should be an interesting indicator of where ARM is now in the process of catching up to x86-64 and how seriously all of us in high-end computer graphics should be when contemplating the possibility of an ARM-based future.</p>
<p>So all of the above brings me to why I undertook porting Takua to arm64.
The reason is because I think we can now plausibly see a potential near future in which the fastest, most efficient, and most cost effective chips in the world are based on arm64 instead of x86-64, and the moment this potential future becomes reality, high-performance software that hasn’t already made the jump will face growing pressure to port to arm64.
With Apple’s in-progress shift to arm64-based Apple Silicon Macs, we may already be at this point.
I can’t speak for any animation or VFX studio in particular; everything I have written here is purely personal opinion and personal conjecture, but I’d like to be ready in the event that a move to arm64 becomes something we have to face as an industry, and what better way is there to prepare than to try with my own hobby renderer first!
Also, for several years now I’ve thought that Apple eventually moving Macs to arm64 was obvious given the progress the A-series Apple chips were making, and since macOS is my primary personal daily use platform, I figured I’d have to port Takua to arm64 eventually anyway.</p>
<p><strong>Porting to arm64 Linux</strong></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/takua-on-arm-pt1/takua_fedora_arm64.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/May/takua-on-arm-pt1/takua_fedora_arm64.jpg" alt="Figure 1: Takua Renderer running on arm64 Fedora 32, on a Raspberry Pi 4B." /></a></p>
<p>I actually first attempted an ARM port of Takua several years ago, when Fedora 27 became the first version of Fedora to support arm64 single-board computers (SBCs) such as the Raspberry Pi 3B or the Pine A64.
I’ve been a big fan of the Raspberry Pi basically since the original first came out, and the thought of porting Takua to run on a Raspberry Pi as an experiment has been with me basically since 2012.
However, Takua is written very much with 64-bit in mind, and the first two generations of Raspberry Pis only had 32-bit ARM processors.
I actually backed the original Pine A64 on Kickstarter in 2015 precisely because it was one of the very first 64-bit ARMv8 boards on the market, and if I remember correctly, I also ordered the Raspberry Pi 3B the week it was announced in 2016 because it was the first 64-bit ARMv8 Raspberry Pi.
However, my Pine A64 and Raspberry Pi 3B mostly just sat around not doing much because I was working on a bunch of other stuff, but that actually wound up working out because by the time I got back around to tinkering with SBCs in late 2017, Fedora 27 had just been released.
Thanks to a ton of work from <a href="https://nullr0ute.com/">Peter Robinson</a> at Red Hat, Fedora 27 added native arm64 support that basically worked out-of-the-box on both the Raspberry Pi 3B and the Pine A64, which was ideal for me since my Linux distribution of choice for personal hobby projects is Fedora.
Since I already had Takua building and running on Fedora on x86-64, being able to use Fedora as the target distribution for arm64 as well meant that I could eliminate different compiler and system library versions as a variable factor; I “just” had to move everything in my Fedora x86-64 build over to Fedora arm64.
However, back in 2017, I found that a lot of the foundational libraries that Takua depends on just weren’t quite ready on arm64 yet.
The problem usually wasn’t with the actual source code itself, since anything written in pure C++ without any intrinsics or inline assembly should just compile directly on any platform with a supported compiler; instead, the problem was usually just in build scripts not knowing how to handle small differences in where system libraries were located or stuff like that.
At the time I was focused on other stuff, so I didn’t try particularly hard to diagnose and work around the problems I ran into; I kind of just shrugged and put it all aside to revisit some other day.</p>
<p>Fast forward to early 2020, when rumors started circulating of a potential macOS transition to 64-bit ARM.
As the rumors grew, I figured that this was a good time to return to porting Takua to arm64 Fedora in preparation for if a macOS transition actually happened.
I had also recently bought a Raspberry Pi 4B with 4 GB of RAM; the 4 GB of RAM made actually building and running complex code on-device a lot easier than with the Raspberry Pi 3B/3B+’s 1 GB of RAM.
By this point, the arm64 build support level for Takua’s dependencies had improved dramatically.
I think that as arm64 devices like the iPhone and iPad Pro have gotten more and more powerful processors over the last few years and enabled more and more advanced and complex iOS / iPadOS apps (and similarly with Android devices and Android apps), more and more open source libraries have seen adoption on ARM-based platforms and have seen ARM support improve as a result.
Almost everything just built and worked out-of-the-box on arm64, including (to my enormous surprise) Intel’s TBB library!
I had assumed that TBB would be x86-64-only since TBB is an Intel project, but it turns out that over the years, the community has contributed support for ARMv7 and arm64 and even PowerPC to TBB.
The only library that didn’t work out-of-the-box or with minor changes was Embree, which relies heavily on SSE and AVX intrinsics and has small amounts of inline x86-64 assembly.
To get things up and running initially, I just disabled Takua’s Embree-based traversal backend and fell back to my own custom BVH traversal backend.
My own custom BVH traversal backend isn’t nearly as fast as Embree and is instead meant to serve as a reference implementation and fallback for when Embree isn’t available, but for the time being since the goal was just to get Takua working at all, losing performance due to not having Embree was fine.
As you can see by the “Traverser: Embree” label in Takua Renderer’s UI in Figure 1, I later got Embree up and running on arm64 using Syoyo Fujita’s embree-aarch64 port, but I’ll write more about that in the next post.
To be honest, the biggest challenge with getting everything compiled and running was just the amount of patience that was required.
I never seem to be able to get cross-compilation for a different architecture right because I always forget something, so instead of cross-compiling for arm64 from my nice big powerful x86-64 Fedora workstation, I just compiled for arm64 directly on the Raspberry Pi 4B.
While the Raspberry Pi 4B is much faster than the Raspberry Pi 3B, it’s still nowhere near as fast as a big fancy dual-Xeon workstation, so some libraries took forever to compile locally (especially Boost, which I wish I didn’t have to have a dependency on, but I have to since OpenVDB depends on Boost).
Overall getting a working build of Takua up and running on arm64 was very fast; from deciding to undertake the port to getting a first image back took only about a day’s worth of work, and most of that time was just waiting for stuff to compile.</p>
<p>However, getting code to <em>build</em> is a completely different question from getting code to <em>run correctly</em> (unless you’re using one of those fancy proof-solver languages I guess).
The first test renders I did with Takua on arm64 Fedora looked fine to my eye, but when I diff’d them against reference images rendered on x86-64, I found some subtle differences; the source of these differences took me a good amount of digging to understand!
Chasing this problem down led down some interesting rabbit holes exploring important differences between x86-64 and arm64 that need to be considered when porting code between the two platforms; just because code is written in portable C++ does not necessarily mean that it is always actually as portable as one might think!</p>
<p><strong>Floating Point Consistency (or lack thereof) on Different Systems</strong></p>
<p>Takua has two different types of image comparison based regression tests: the first type of test renders out to high samples-per-pixel numbers and does comparisons with near-converged images, while the second type of test renders out and does comparisons using a single sample-per-pixel.
The reason for these two different types of tests is because of how difficult it is to get floating point calculations to match across different compilers / platforms / processors.
Takua’s single-sample-per-pixel tests are only meant to catch regressions on the same compiler / platform / processor, while Takua’s longer tests are meant to test overall correctness of converged renders.
Because of differences in how floating point operations come out on different compilers / platforms / processors, Takua’s convergence tests don’t require an exact match; instead, the tests use small, predefined difference thresholds that comparisons must stay within to pass.
The difference thresholds are basically completely ad-hoc; I picked them to be at a level where I can’t perceive any difference when flipping between the images, since I put together my testing system before image differencing systems that formally factor in perception <a href="https://doi.org/10.1145/3406183">[Andersson et al. 2020]</a> were published.
A large part of the differences between Takua’s test results on x86-64 versus arm64 come from these problems with floating point reproducibility across different systems.
Because of how commonplace this issue is and how often this issue is misunderstood by programmers who haven’t had to deal with it, I want to spend a few paragraphs talking about floating point numbers.</p>
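<p>Just to illustrate the basic shape of these thresholded comparisons, here is a minimal sketch; Takua’s actual test harness operates on full OpenEXR renders and does a bunch more bookkeeping, but at its core the comparison is just a per-channel absolute difference check against a small, ad-hoc threshold (with the same-platform single-sample-per-pixel tests effectively using a threshold of zero):</p>
<pre><code class="language-cpp">#include &lt;cmath&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

// A hypothetical image representation: packed RGB floats, row-major.
struct Image {
    int width = 0;
    int height = 0;
    std::vector&lt;float&gt; pixels;  // width * height * 3 floats
};

// Returns true if every channel of every pixel differs by at most the given
// ad-hoc threshold. Converged-render tests use a small nonzero threshold, while
// same-platform single-spp tests can demand an exact (zero-threshold) match.
bool imagesMatchWithinThreshold(const Image&amp; a, const Image&amp; b, float threshold) {
    if (a.width != b.width || a.height != b.height) { return false; }
    for (size_t i = 0; i &lt; a.pixels.size(); i++) {
        if (std::fabs(a.pixels[i] - b.pixels[i]) &gt; threshold) { return false; }
    }
    return true;
}
</code></pre>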
<p>A lot of programmers that don’t have to routinely deal with floating point calculations might not realize that even though floating point numbers are standardized through the <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE754 standard</a>, in practice reproducibility is not at all guaranteed when carrying out the same set of floating point calculations using different compilers / platforms / processors!
In fact, starting with the same C++ floating point code, determinism is only really guaranteed for successive runs using binaries generated using the same compiler, with the same optimizations enabled, on the same processor family; sometimes running on the same operating system is also a requirement for guaranteed determinism.
There are three main reasons <a href="http://yosefk.com/blog/consistency-how-to-defeat-the-purpose-of-ieee-floating-point.html">[Kreinin 2008]</a> why reproducing exactly the same results from the same set of floating point calculations across different systems is so inconsistent: compiler optimizations, processor implementation details, and different implementations of built-in “complex” functions like sine and cosine.</p>
<p>The first reason above is pretty easy to understand: operations like addition and multiplication are commutative and associative, meaning that mathematically they can be regrouped and carried out in any order, and a compiler’s optimization passes will often choose to reorder such math operations.
However, as anyone who has dealt extensively with floating point numbers knows, due to how floating point numbers are represented <a href="https://doi.org/10.1145/103162.103163">[Goldberg 1991]</a> the commutative and associative properties of addition and multiplication do not actually hold true for floating point numbers; not even for IEEE754 floating point numbers!
Sometimes reordering floating point math is expressly permitted by the language, and sometimes doing this is not actually allowed by the language but happens anyway in the compiler because the user has specified flags like <code class="language-plaintext highlighter-rouge">-ffast-math</code>, which tells the compiler that it is allowed to sacrifice strict IEEE754 and language math requirements in exchange for additional optimization opportunities.
Sometimes the compiler can just have implementation bugs too; <a href="https://lists.llvm.org/pipermail/llvm-dev/2020-June/142697.html">here is an example</a> that I found on the llvm-dev mailing lists describing a bug with loop vectorization that impacts floating point consistency!
The end result of all of the above is that the same floating point source code can produce subtly different results depending on which compiler is used and which compiler optimizations are enabled within that compiler.
Also, while some compiler optimization passes operate purely on the AST built from the parser or operate purely on the compiler’s intermediate representation, there can also be optimization passes that take into account the underlying target instruction set and choose to carry out different optimizations depending on what’s available in the target processor architecture.
These architecture-specific optimizations mean that even the same floating point source code compiled using the same compiler can still produce different results on different processor architectures!
Architecture-specific optimizations are one reason why floating point results on x86-64 versus arm64 can be subtly different.
Also, another fun fact: the C++ specification doesn’t actually specify a binary representation for floating point numbers, so in principle a C++ compiler could outright ignore IEEE754 and use something else entirely, although in practice this is basically never the case since all modern compilers like GCC, Clang, and MSVC use IEEE754 floats.</p>
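<p>Here is a tiny, self-contained example of the non-associativity problem described above; mathematically the two expressions below are identical, but evaluated as written in 32-bit floats they produce different answers, which is exactly why a compiler that reorders floating point math (or is told that it may, via flags like <code class="language-plaintext highlighter-rouge">-ffast-math</code>) can change a program’s results:</p>
<pre><code class="language-cpp">#include &lt;cstdio&gt;

int main() {
    float a = 1e20f;
    float b = -1e20f;
    float c = 1.0f;
    // (a + b) + c: a and b cancel exactly, leaving 1.0.
    float groupedLeft = (a + b) + c;
    // a + (b + c): c is far smaller than one ulp of b, so b + c rounds back to b,
    // and then a and b cancel, leaving 0.0.
    float groupedRight = a + (b + c);
    std::printf("(a + b) + c = %f\n", groupedLeft);  // prints 1.000000
    std::printf("a + (b + c) = %f\n", groupedRight); // prints 0.000000
    return 0;
}
</code></pre>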
<p>The second reason floating point math is so hard to reproduce exactly across different systems is in how floating point math is implemented in the processor itself.
Differences at this level is a huge source of floating point differences between x86-64 and arm64.
In both x86-64 and arm64, at the assembly level individual arithmetic instructions such as add, subtract, multiply, and divide all adhere strictly to the IEEE754 standard.
However, the IEEE754 standard is itself… surprisingly loosely specified in some areas!
For example, the IEEE754 standard specifies that intermediate results should be as precise as possible, but this means that two different implementations of a floating point addition instruction, both adhering to IEEE754, can actually produce different results for the same input <em>if they use different levels of increased precision internally</em>.
Here’s a bit of a deprecated example that is still useful to know for historical reasons: everyone knows that an IEEE754 floating point number is 32 bits, but older 32-bit x86 specifies that internal calculations be done using <em>80-bit precision</em>, which is a holdover from the <a href="https://en.wikipedia.org/wiki/Intel_8087">Intel 8087</a> math coprocessor.
Every x86 (and by extension x86-64) processor when using x87 FPU instructions actually does floating point math using 80 bit internal precision and then rounds back down to 32 bit floats in hardware; the 80 bit internal representation is known as the <a href="https://en.wikipedia.org/wiki/Extended_precision#x86_extended_precision_format">x86 extended precision format</a>.
But even within <em>the same</em> x86 processor, we can still get different floating point results depending on whether the compiler has output x87 FPU instructions or SSE instructions; SSE stays within 32 bits at all times, which means SSE and x87 on the same processor doing the same floating point math aren’t guaranteed to produce the exact same answer.
Of course, modern x86-64 generally uses SSE for floating point math instead of x87, but different amounts of precision truncation can still happen depending on what order values are loaded into SSE registers and back into other non-SSE registers.
Furthermore, SSE is sufficiently under-specified that the actual implementation details can differ, which is why the same SSE floating point instructions can produce different results on Intel versus AMD processors.
Similarly, the ARM architecture doesn’t actually specify a particular FPU implementation at all; the internals of the FPU are left up to each processor designer; for example, the VFP/NEON floating point units that ship on the Raspberry Pi 4B’s Cortex-A72-based CPU use up to 64 bits of internal precision <a href="https://embeddedartistry.com/blog/2017/10/11/demystifying-arm-floating-point-compiler-options/">[Johnston 2020]</a>.
So, while the x87, SSE on Intel, SSE on AMD, and VFP/NEON FPU implementations are IEEE754-compliant, because of their internal maximum precision differences they can still all produce different results from each other.
There are many more examples of areas where IEEE754 leaves in wiggle room for different implementations to do different things <a href="https://www.appinf.com/download/FPIssues.pdf">[Obiltschnig 2006]</a>, and in practice different CPUs do use this wiggle room to do things differently from each other.
For example, this wiggle room is why, for floating point operations at the extreme ends of the IEEE754 float range, Intel’s x86-64 versus AMD’s x86-64 versus arm64 can produce results with minor differences from each other in the last bits of the mantissa.</p>
<p>Finally, the third reason floating point math can vary across different systems is because of transcendental functions such as sine and cosine.
Transcendental functions like sine and cosine have exact, precise mathematical definitions, but unfortunately these precise mathematical definitions can’t be implemented exactly in hardware.
Think back to high school trigonometry; the exact answer for a given input to functions like sine and cosine has to be determined using something like a <a href="https://en.wikipedia.org/wiki/Taylor_series">Taylor series</a>, but actually implementing a Taylor series in hardware is not at all practical nor performant.
Instead, modern processors typically use some form of a <a href="https://en.wikipedia.org/wiki/CORDIC">CORDIC algorithm</a> to approximate functions like sine and cosine, often to reasonably high levels of accuracy.
However, the level of precision to which any given processor approximates sine and cosine is completely unspecified by either IEEE754 or any language standard; as a result, these approximations can and do vary widely between different hardware implementations on different processors!
However, how much this reason actually matters in practice is complicated and compiler/language dependent.
As an example using cosine, the standard library could choose to implement cosine in software using a variety of different methods, or it could just pass through to the hardware’s cosine implementation.
To illustrate how much the actual execution path depends on the compiler: I originally wanted to include a simple small example using cosine that you, the reader, could go and compile and run yourself on an x86-64 machine and then on an arm64 machine to see the difference, but I had so much difficulty convincing different compilers on different platforms to reliably compile the cosine function (even using intrinsics like <code class="language-plaintext highlighter-rouge">__builtin_cos</code>!) down to a hardware instruction that I wound up having to abandon the idea.</p>
<p>One of the things that makes all of the above even more difficult to reason about is that which specific factors are applicable at any given moment depends heavily on what the compiler is doing, what compiler flags are in use, and what the compiler’s defaults are.
Actually getting floating point determinism across different systems is a notoriously difficult problem <a href="https://gafferongames.com/post/floating_point_determinism/">[Fiedler 2010]</a> that volumes of stuff has been written about!
On top of that, while in principle getting floating point code to produce consistent results across many different systems is possible (hard, but possible) by disabling compiler optimizations and by relying entirely on software implementations of floating point operations to ensure strict, identical IEEE754 compliance on all systems, actually doing all of the above comes with major trade-offs.
The biggest trade-off is simply performance: all of the changes necessary to make floating point code consistent across different systems (and especially across different processor architectures like x86-64 versus arm64) will likely also make the floating point code considerably slower.</p>
<p>All of the above reasons mean that modern usage of floating point code basically falls into three categories.
The first category is: just don’t use floating point code at all.
Included in this first category are applications that require absolute precision and absolute consistency and determinism across all implementations; examples are banking and financial industry code, which tend to store monetary values entirely using only integers.
The second category is applications that absolutely must use floats but also must ensure absolute consistency; good examples of applications in this category are high-end scientific simulations that run on supercomputers.
For applications in this second category, the difficult work and the performance sacrifices that have to be made in favor of consistency are absolutely worthwhile.
Also, tools do exist that can help with ensuring floating point consistency; for example, <a href="https://herbie.uwplse.org">Herbie</a> is a tool that can detect potentially inaccurate floating point expressions and suggest more accurate replacements.
The last category is applications where the requirement for consistency is not necessarily absolute, and the requirement for performance may weigh heavier.
This is the space that things like game engines and renderers and stuff live in, and here the trade-offs become more nuanced and situation-dependent.
A single-player game may choose absolute performance over any kind of cross-platform guaranteed floating point consistency, whereas a multi-player multi-platform game may choose to sacrifice some performance in order to guarantee that physics and gameplay calculations produce the same result for all players regardless of platform.</p>
<p>Takua Renderer lives squarely in the third category, and historically the point in the trade-off space that I’ve chosen for Takua Renderer is to favor performance over cross-platform floating point consistency.
I have a couple of reasons for choosing this trade-off, some of which are good and some of which are… just laziness, I guess!
As a hobby renderer, I’ve never had shipping Takua as a public release in any form in mind, and so consistency across many platforms has never really mattered to me.
I know exactly which systems Takua will be run on, because I’m the only one running Takua on anything, and to me having Takua run slightly faster at the cost of minor noise differences on different platforms seems worthwhile.
As long as Takua is converging to the correct image, I’m happy, and for my purposes, I consider converged images that are perceptually indistinguishable when compared with a known correct reference to also be correct.
I do keep determinism within the same platform as a major priority though, since determinism within each platform is important for being able to reliably reproduce bugs and is important for being able to reason about what’s going on in the renderer.</p>
<p>Here is a concrete example of the noise differences I get on x86-64 versus on arm64.
This scene is the iced tea scene I originally created for my <a href="https://blog.yiningkarlli.com/2019/05/nested-dielectrics.html">Nested Dielectrics</a> post; I picked this scene for this comparison purely because it has a small memory footprint and therefore fits in the relatively constrained 4 GB memory footprint of my Raspberry Pi 4B, while also being slightly more interesting than a Cornell Box.
Here is a comparison of a single sample-per-pixel render using bidirectional path tracing on a dual-socket Xeon E5-2680 x86-64 system versus on a Raspberry Pi 4B with a Cortex-A72 based arm64 processor.
The scene actually appears somewhat noisier than it normally would be coming out of Takua renderer because for this demonstration, I disabled low-discrepancy sampling and had the renderer fall back to purely random <a href="https://www.pcg-random.org/index.html">PCG-based</a> sample sequences, with the goal of trying to produce more noticeable noise differences:</p>
<div class="embed-container">
<iframe src="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 2: A single-spp render demonstrating noise pattern differences between x86-64 (left) versus arm64 (right). Differences are most noticeable on rim of the cup, especially on the left near the handle. For a full screen comparison, <a href="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison.html">click here.</a></div>
<p>The noise differences are actually relatively minimal!
The most noticeable noise differences are on the rim of the cup; note the left of the rim near the handle.
Since the noise differences can be fairly difficult to see in the full render on a small screen, here is a 2x zoomed-in crop:</p>
<div class="embed-container">
<iframe src="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_crop_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 3: A zoomed-in crop of Figure 2 showing noise pattern differences between x86-64 (left) versus arm64 (right). For a full screen comparison, <a href="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_crop.html">click here.</a></div>
<p>The differences are still kind of hard to see even in the zoomed-in crop!
So, here’s the absolute difference between the x86-64 and arm64 renders, created by just subtracting the images from each other and taking the absolute value of the difference at each pixel.
Black pixels indicate pixels where the absolute difference is zero (or at least, so close to zero so as to be completely imperceptible).
Brighter pixels indicate greater differences between the x86-64 and arm64 renders; from where the bright pixels are, we can see that most of the differences occur on the rim of the cup, on ice cubes in the cup, and in random places mostly in the caustics cast by the cup.
There’s also a faint horizontal line of small differences across the background; that area lines up with where the seamless white cyclorama backdrop starts to curve upwards:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/takua-on-arm-pt1/noise_difference.png"><img src="https://blog.yiningkarlli.com/content/images/2021/May/takua-on-arm-pt1/noise_difference.png" alt="Figure 4: Absolute difference between the x86-64 and arm64 renders from Figure 2. Black indicates identical pixels, while brighter values indicate greater differences in pixel values between x86-64 and arm64." /></a></p>
<p>Understanding why the areas with the highest differences are where they are requires thinking about how light transport is functioning in this specific scene and how differences in floating point calculations impact that light transport.
This scene is lit fairly simply; the only light sources are two rect lights and a skydome.
Basically everything is illuminated through direct lighting, meaning that for most areas of the scene, a ray starting from the camera is directly hitting the diffuse background cyclorama and then sampling a light source, and a ray starting from the light is directly hitting the diffuse background cyclorama and then immediately sampling the camera lens.
So, even with bidirectional path tracing, the total path length for a lot of the scene is just two path segments, or one bounce.
That’s not a lot of path length over which differences in floating point calculations can accumulate.
On the flip side, most of the areas with the greatest differences are areas where a lot of paths pass through the glass tea cup.
For paths that go through the glass tea cup, the path lengths can be very long, especially if a path gets caught in total internal reflection within the glass walls of the cup.
As the path lengths get longer, the floating point calculation differences at each bounce accumulate until the entire path begins to diverge significantly between the x86-64 and arm64 versions of the render.
Fortunately, these differences basically eventually “integrate out” thanks to the magic of Monte Carlo integration; by the time the renders are near converged, the x86-64 and arm64 results are basically perceptually indistinguishable from each other:</p>
<div class="embed-container">
<iframe src="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_nearconverged_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 5: The same cup scene from Figure 1, but now much closer to convergence (2048 spp), rendered using x86-64 (left) and arm64 (right). Note how differences between the x86-64 and arm64 renders are now basically imperceptible to the eye; these are in fact two different images! For a full screen comparison, <a href="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_nearconverged.html">click here.</a></div>
<p>Below is the absolute difference between the two images above.
To the naked eye the absolute difference image looks completely black, because the differences between the two images are so small that they’re basically below the threshold of normal perception.
So, to confirm that there are in fact differences, I’ve also included below a version of the absolute difference exposed up 10 stops, or made 1024 times brighter.
Much like in the single spp renders in Figure 2, the areas of greatest difference are in the areas where the path lengths are the longest, which in this scene are areas where paths refract through the glass cup, the tea, and the ice cubes.
The differences between individual paths for the same sample across x86-64 and arm64 simply become tiny to the point of insignificance once averaged across 2048 samples-per-pixel:</p>
<div class="embed-container">
<iframe src="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_diff_nearconverged_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 6: Left: Absolute difference between the x86-64 and arm64 renders from Figure 2. Right: Since the absolute difference image basically looks completely black to the eye, I've also included a version of the absolute difference exposed up 10 stops (made 1024 times brighter) to make the differences more visible. For a full screen comparison, <a href="/content/images/2021/May/takua-on-arm-pt1/comparisons/noisecomparison_diff_nearconverged.html">click here.</a></div>
<p>For many extremely precise scientific applications, the level of differences above would still likely be unacceptable, but for our purposes in just making pretty pictures, I’ll call this good enough!
In fact, many rendering teams only target perceptually indistinguishable for the purposes of calling things deterministic enough, as opposed to aiming for absolute binary-level determinism; great examples include Pixar’s RenderMan XPU, Disney Animation’s Hyperion, and DreamWorks Animation’s MoonRay.</p>
<p>Eventually maybe I’ll get around to putting more work into trying to get Takua Renderer’s per-path results to be completely consistent even across different systems and processor architectures and compilers, but for the time being I’m fine with keeping that goal as a fairly low priority relative to everything else I want to work on, because as you can see, once the renders are converged, the difference doesn’t really matter!
Floating point calculations accounted for most of the differences I was finding when comparing renders on x86-64 versus renders on arm64, but only most.
The remaining source of differences turned out… to be an actual bug!</p>
<p><strong>Weak Memory Ordering in arm64 and Atomic Bugs in Takua</strong></p>
<p>Multithreaded programming with atomics and locks has a reputation for being one of the relatively more challenging skills for programmers to master, and for good reason.
Since different processor architectures often have different semantics and guarantees and rules around multithreading-related things like memory reordering, porting between different architectures is often a great way to expose subtle multithreading bugs.
The remaining source of major differences between the x86-64 and arm64 renders I was getting turned out to be caused by a memory reordering-related bug in some old multithreading code that I wrote a long time ago and forgot about.</p>
<p>In addition to outputting the main render, Takua Renderer is also able to generate some additional render outputs, including some useful diagnostic images.
One of the diagnostic render outputs is a sample heatmap, which shows how many pixel samples were used for each pixel in the image.
I originally added the sample heatmap render output to Takua when I was <a href="https://blog.yiningkarlli.com/2015/03/adaptive-sampling.html">implementing adaptive sampling</a>, and since then the sample heatmap render output has been a useful tool for understanding how much time Takua is spending on different parts of the image.
The sample heatmap render output has also served as a simple sanity check that Takua’s multithreaded work dispatching system is functioning correctly.
For a render where the adaptive sampler is disabled, the sample heatmap should contain exactly the same value for every single pixel in the entire image, since without adaptive sampling, every pixel is just being rendered to the target samples-per-pixel of the entire render.
So, in some of my tests, I have the renderer scripted to always output the sample heatmap, and the test system checks that the sample heatmap is completely uniform after the render as a sanity check to make sure that the renderer has rendered everything that it was supposed to.
To my surprise, sometimes on arm64, a test would fail because the sample heatmap for a render without adaptive sampling would come back as nonuniform!
Specifically, the sample heatmap would come back indicating that some pixels had received one fewer sample than the total target sample-per-pixel count across the whole render.
These pixels were always in square blocks corresponding to a specific tile, or multithreaded work dispatch unit.
The specific bug was in how Takua Renderer dispatches rendering work to each thread; to provide the relevant context and explain what I mean by a “tile”, I’ll first have to quickly describe how Takua Renderer is multithreaded.</p>
<p>In university computer graphics courses, path tracing is often taught as being trivially simple to parallelize: since a path tracer traces individual paths in a depth-first fashion, individual paths don’t have dependencies on other paths, so just assign each path that has to be traced to a separate thread.
The easiest way to implement this simple parallelization scheme is to just run a <code class="language-plaintext highlighter-rouge">parallel_for</code> loop over all of the paths that need to be traced for a given set of samples, and to just repeat this for each set of samples until the render is complete.
However, in reality, parallelizing a modern production-grade path tracing renderer is often not as simple as the classic “embarrassingly parallel” approach.
Modern advanced path tracers often are written to take into account factors such as cache coherency, memory access patterns and memory locality, NUMA awareness, optimal SIMD utilization, and more.
Also, advanced path tracers often make use of various complex data structures such as out-of-core texture caches, photon maps, path guiding trees, and more.
Making sure that these data structures can be built, updated, and accessed on-the-fly by multiple threads simultaneously and efficiently often introduces complex lock-free data structure design problems.
On top of that, path tracers that use a wavefront or breadth-first architecture instead of a depth-first approach are far from trivial to parallelize, since various sorting and batching operations and synchronization points need to be accounted for.</p>
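<p>For reference, here’s roughly what the classic “embarrassingly parallel” scheme described above looks like when written directly with <code class="language-plaintext highlighter-rouge">tbb::parallel_for</code>; this is just an illustrative sketch (the callable is a stand-in for whatever actually traces a path and accumulates it into the framebuffer), not how Takua actually dispatches work:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <tbb/parallel_for.h>

// A sketch of the classic "embarrassingly parallel" approach: for each
// progressive pass, trace one camera path per pixel, with paths assigned to
// threads by a flat parallel_for over the whole image. TracePixel is any
// callable that traces a single path for pixel (x, y) at the given sample
// number and accumulates the result.
template <typename TracePixel>
void renderNaive(const int width, const int height, const int totalSpp,
                 TracePixel&& tracePixel) {
    for (int spp = 0; spp < totalSpp; spp++) {
        tbb::parallel_for(0, width * height, [&](const int pixelIndex) {
            const int x = pixelIndex % width;
            const int y = pixelIndex / width;
            tracePixel(x, y, spp);
        });
    }
}
</code></pre></div></div>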
<p>Even for relatively straightforward depth-first architectures like the one Takua has used for the past six years, the direct <code class="language-plaintext highlighter-rouge">parallel_for</code> approach can be improved upon in some simple ways.
Before progressive rendering became the standard modern approach, many renderers used an approach called “bucket” rendering <a href="https://www.racoon-artworks.de/cgbasics/bucket_progressive.php">[Geupel 2018]</a>, where the image plane was divided up into a bunch of small tiles, or buckets.
Each thread would be assigned a single bucket, and each thread would be responsible for rendering that bucket to completion before being assigned another bucket.
For offline, non-interactive rendering, bucket rendering often ends up being faster than just a simple <code class="language-plaintext highlighter-rouge">parallel_for</code> because bucket rendering allows for a higher degree of memory access coherency and cache coherency within each thread since each thread is always working in roughly the same area of space (at least for the first few bounces).
Even with progressive rendering as the standard approach for renderers running in an interactive mode, many (if not most) renderers still use a bucketed approach when dispatched to a renderfarm today.
For CPU path tracers today, the number of pixels that need to be rendered for a typical image is much much larger than the number of hardware threads available on the CPU.
As a result, the basic locality idea that bucket rendering utilizes also ends up being applicable to progressive, interactive rendering in CPU path tracers (for GPU path tracing though, the GPU’s completely different, wavefront-based SIMT threading model means a bit of a different approach is necessary).
RenderMan, Arnold, and Vray in interactive progressive mode all still render pixels in a bucket-like order, although instead of having each thread render all samples-per-pixel to completion in each bucket all at once, each thread just renders a single sample-per-pixel for each bucket and then the renderer loops over the entire image plane for each sample-per-pixel number.
To differentiate using buckets in a progressive mode from using buckets in a batch mode, I will refer to buckets in progressive mode as “tiles” for the rest of this post.</p>
<p>Takua Renderer also supports using a tiled approach for assigning work to individual threads.
At renderer startup, Takua precalculates a work assignment order, which can be in a tiled fashion, or can use a more naive <code class="language-plaintext highlighter-rouge">parallel_for</code> approach; the tiled mode is the default.
When using a tiled work assignment order, the specific order of tiles supports several different options; the default is a spiral starting from the center of the image.
Here’s a short screen recording demonstrating what this tiling work assignment looks like:</p>
<video autoplay="" muted="" loop="" playsinline="">
<source src="https://blog.yiningkarlli.com/content/images/2021/May/takua-on-arm-pt1/buckets.mp4" type="video/mp4" />
Your browser does not support the video tag.
</video>
<div class="figcaption">Figure 7: A short video showing Takua Renderer's tile assignment system running in spiral mode; each red outlined square represents a single tile. This video was captured on an arm64 M1 Mac Mini running macOS Big Sur instead of on a Raspberry Pi 4B because trying to screen record on a Raspberry Pi 4B while also running the renderer was not a good time. To see this video in a full window, <a href="/content/images/2021/May/takua-on-arm-pt1/buckets.mp4">click here.</a></div>
<p>As threads free up, the work assignment system hands each free thread a tile to render; each thread then renders a single sample-per-pixel for every pixel in its assigned tile and then goes back to the work assignment system to request more work.
Once the number of remaining tiles for the current samples-per-pixel number drops below the number of available threads, the work assignment system starts allowing multiple threads to team up on a single tile.
In general, the additional cache coherency and more localized memory access patterns from using a tiled approach give Takua Renderer a minimum 3% speed improvement compared to using a naive <code class="language-plaintext highlighter-rouge">parallel_for</code> to assign work to each thread; sometimes the speed improvement can be even higher if the scene is heavily dependent on things like texture cache access or reading from a photon map.</p>
<p>The reason the work assignment system actually hands out tiles one by one upon request instead of just running a <code class="language-plaintext highlighter-rouge">parallel_for</code> loop over all of the tiles is because using something like <code class="language-plaintext highlighter-rouge">tbb::parallel_for</code> means that the tiles won’t actually be rendered in the correct specified order.
Actually, Takua does have an “I don’t care what order the tiles are in” mode, which does in fact just run a <code class="language-plaintext highlighter-rouge">tbb::parallel_for</code> over all of the tiles and lets <code class="language-plaintext highlighter-rouge">tbb</code>’s underlying scheduler decide what order the tiles are dispatched in; rendering tiles in a specific order doesn’t actually matter for correctness.
However, maintaining a specific tile ordering does make user feedback a bit nicer.</p>
<p>Implementing a work dispatcher that can still maintain a specific tile ordering requires some mechanism internally to track what the next tile that should be dispatched is; Takua does so using an atomic integer inside of the work dispatcher.
This atomic is where the memory-reordering bug comes in that led to Takua occasionally dropping a single spp for a single tile on arm64.
Here’s some pseudocode for how threads are launched and how they ask the work dispatcher for tiles to render; this is highly simplified and condensed from how the actual code in Takua is written (specifically, I’ve inlined together code from both individual threads and from the work dispatcher and removed a bunch of other unrelated stuff), but preserves all of the important details necessary to illustrate the bug:</p>
<div id="listing1"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int nextTileIndex = 0;
std::atomic<bool> nextTileSoftLock(false);
tbb::parallel_for(int(0), numberOfTilesToRender, [&](int /*i*/) {
    bool gotNewTile = false;
    int tile = -1;
    while (!gotNewTile) {
        bool expected = false;
        if (nextTileSoftLock.compare_exchange_strong(expected, true, std::memory_order_relaxed)) {
            tile = nextTileIndex++;
            nextTileSoftLock.store(false, std::memory_order_relaxed);
            gotNewTile = true;
        }
    }
    if (tileIsInRange(tile)) {
        renderTile(tile);
    }
});
</code></pre></div></div>
<div class="codecaption">Listing 1: Simplified pseudocode for the not-very-good work scheduling mechanism Takua used to assign tiles to threads. This version of the scheduler resulted in tiles occasionally being missed on arm64, but not on x64-64.</div>
<p>If you remember your memory ordering rules, you already know what’s wrong with the code above; this code is really really bad!
In my defense, this code is an ancient part of Takua’s codebase; I wrote it back in college and haven’t really revisited it since, and back when I wrote it, I didn’t have the strongest grasp of memory ordering rules and how they apply to concurrent programming yet.
First off, why does this code use an atomic bool as a makeshift mutex so that multiple threads can increment a non-atomic integer, as opposed to just using an atomic integer?
Looking through the commit history, the earliest version of this code that I first prototyped (some eight years ago!) actually relied on a full-blown <code class="language-plaintext highlighter-rouge">std::mutex</code> to protect from race conditions around incrementing <code class="language-plaintext highlighter-rouge">nextTileIndex</code>; I must have prototyped this code completely single-threaded originally and then done a quick-and-dirty multithreading adaptation by just wrapping a mutex around everything, and then replaced the mutex with a cheaper atomic bool as an incredibly lazy port to a lock-free implementation instead of properly rewriting things.
I haven’t had to modify it since then because it worked well enough, so over time I must have just completely forgotten about how awful this code is.</p>
<p>Anyhow, the fix for the code above is simple enough: just replace the first <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> in line 8 with <code class="language-plaintext highlighter-rouge">std::memory_order_acquire</code> and replace the second <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> in line 10 with <code class="language-plaintext highlighter-rouge">std::memory_order_release</code>.
An even better fix though is to just outright replace the combination of an atomic bool and a non-atomic integer with a single atomic integer that gets incremented atomically, which is what I actually did.
But, going back to the original code, why exactly does using <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> produce correctly functioning code on x86-64, but produces code that occasionally drops tiles on arm64?
Well, first, why did I use <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> in the first place?
My commit comments from eight years ago indicate that I chose <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> because I thought it would compile down to something cheaper than if I had chosen some other memory ordering flag; I really didn’t understand this stuff back then!
I wasn’t entirely wrong, although not for the reasons that I thought at the time.
On x86-64, acquire and release memory order flags don’t actually change the loads and stores that the compiler emits, since x86-64’s strong memory model already guarantees acquire/release ordering in hardware (the flags do still constrain compiler reordering, though).
On arm64, using <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> instead of <code class="language-plaintext highlighter-rouge">std::memory_order_acquire</code>/<code class="language-plaintext highlighter-rouge">std::memory_order_release</code> does indeed produce simpler and faster arm64 assembly, but the simpler and faster arm64 assembly is also <em>wrong</em> for what the code is supposed to do.
Understanding why the above happens on arm64 but not on x86-64 requires understanding what a <em>weakly ordered</em> CPU is versus what a <em>strongly ordered</em> CPU is; arm64 is a weakly ordered architecture, whereas x86-64 is a strongly ordered architecture.</p>
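<p>Before getting into the weak versus strong memory model details, here’s a quick sketch of the simpler fix mentioned above: the atomic bool and non-atomic integer from Listing 1 are replaced with a single atomic integer, and each thread grabs a unique tile index using an atomic fetch-and-add. As with Listing 1, this is simplified pseudocode rather than the actual code in Takua:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>std::atomic<int> nextTileIndex(0);
tbb::parallel_for(int(0), numberOfTilesToRender, [&](int /*i*/) {
    // fetch_add atomically returns the previous value and then increments, so
    // every thread receives a unique tile index and no tile is ever skipped;
    // the default memory order here is std::memory_order_seq_cst
    const int tile = nextTileIndex.fetch_add(1);
    if (tileIsInRange(tile)) {
        renderTile(tile);
    }
});
</code></pre></div></div>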
<p>One of the best resources on diving deep into weak versus strong memory orderings is the well-known series of articles <a href="https://preshing.com">by Jeff Preshing</a> on the topic (parts <a href="https://preshing.com/20120515/memory-reordering-caught-in-the-act/">1</a>, <a href="https://preshing.com/20120612/an-introduction-to-lock-free-programming/">2</a>, <a href="https://preshing.com/20120625/memory-ordering-at-compile-time/">3</a>, <a href="https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/">4</a>, <a href="https://preshing.com/20120913/acquire-and-release-semantics/">5</a>, <a href="https://preshing.com/20120930/weak-vs-strong-memory-models/">6</a>, and <a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">7</a>).
Actually, while I was going back through the Preshing on Programming series in preparation to write this post, I noticed that by hilarious coincidence the older code in Takua represented by Listing 1, once boiled down to what it is fundamentally doing, is extremely similar to the canonical example used in Preshing on Programming’s “<a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">This Is Why They Call It a Weakly-Ordered CPU</a>” article.
If only I had read the Preshing on Programming series a year before implementing Takua’s work assignment system instead of a few years after!
I’ll do my best to quickly recap what the Preshing on Programming series covers about weak versus strong memory orderings here, but if you have not read Jeff Preshing’s articles before, I’d recommend taking some time later to do so.</p>
<p>One of the single most important things that lock-free multithreaded code needs to take into account is the potential for memory reordering.
Memory reordering is when the compiler and/or the processor decides to optimize code by changing the ordering of instructions that access and modify memory.
Memory reordering is always carried out in such a way that the behavior of a single-threaded program never changes, and multithreaded code using locks such as mutexes forces the compiler and processor to not reorder instructions across the boundaries defined by locks.
However, lock-free multithreaded code basically gives the compiler and processor free rein to do whatever they want; even though memory reordering is carried out for each individual thread in a way that keeps that thread’s apparent behavior the same as before, this rule does not take into account the interactions <em>between</em> threads, so reorderings that preserve behavior within each thread in isolation can still produce very different overall multithreaded behavior.</p>
<p>The easiest way to disable any kind of memory reordering at compile time is to just… disable all compiler optimizations.
However, in practice we never actually want to do this, because disabling compiler optimizations means all of our code will run slower (sometimes a lot slower).
Even setting performance aside, disabling compiler optimizations isn’t a complete solution, because we still need to contend with potential memory reordering at runtime from the CPU.</p>
<p>Memory reordering in multithreaded code happens on the CPU because of how CPUs access memory: modern processors have a series of caches (L1, L2, sometimes L3, etc) sitting between the actual registers in each CPU core and main memory.
Some of these cache levels (usually L1, and often L2) are private to each CPU core, and some of these cache levels (usually the last level) are shared across some or all cores.
The lower the cache level number, the faster and also smaller that cache level typically is, and the higher the cache level number, the slower and larger that cache level is.
When a CPU wants to read a particular piece of data, it will check for it in cache first, and if the value is not in cache, then the CPU must make a fetch request to main memory for the value; fetching from main memory is obviously much slower than fetching from cache.
Where these caches get tricky is how data is propagated from a given CPU core’s registers and caches back to main memory and then eventually up again into the L1 caches for other CPU cores.
This propagation can happen… whenever!
A variety of different possible implementation strategies exist for <a href="https://en.wikipedia.org/wiki/CPU_cache#Policies">when caches update from and write back to main memory</a>, with the end result being that by default we as programmers have no reliable way of guessing when data transfers between cache and main memory will happen.</p>
<p>Imagine that we have some multithreaded code written such that one thread writes, or stores, to a value, and then a little while later, another thread reads, or loads, that same value.
We would expect the store on the first thread to always precede the load on the second thread, so the second thread should always pick up whatever value the first thread wrote.
However, if we implement this code just using a normal int or float or bool or whatever, what can actually happen at runtime is our first thread writes the value to L1 cache, and then eventually the value in L1 cache gets written back to main memory.
However, before the value manages to get propagated from L1 cache back to main memory, the second thread reads the value out of main memory.
In this case, from the perspective of main memory, the second thread’s load out of main memory takes place <em>before</em> the first thread’s store has rippled back down to main memory.
This case is an example of <em>StoreLoad</em> reordering, so named because a store has been reordered with a later load.
There are also <em>LoadStore</em>, <em>LoadLoad</em>, and <em>StoreStore</em> reorderings that are possible.
Jeff Preshing’s “<a href="https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/">Memory Barriers are Like Source Control</a>” article does a great job of describing these four possible reordering scenarios in detail.</p>
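<p>To make StoreLoad reordering a bit more concrete, here’s a tiny standalone example in the spirit of the examples from those articles. With the code below, it’s entirely legal (and on real hardware occasionally observable, since StoreLoad reordering is permitted even under x86-64’s strong memory model) for <em>both</em> <code class="language-plaintext highlighter-rouge">r1</code> and <code class="language-plaintext highlighter-rouge">r2</code> to end up as 0, which could never happen if each thread’s store actually completed before the other thread’s load:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <cstdio>
#include <thread>

// A tiny StoreLoad reordering demonstration: each thread stores to one variable
// and then loads the other. With relaxed ordering, the load may effectively be
// reordered ahead of the store, so a single run can print r1 = 0, r2 = 0.
// (Whether that actually happens on any given run is timing dependent.)
int main() {
    std::atomic<int> x(0);
    std::atomic<int> y(0);
    int r1 = -1;
    int r2 = -1;
    std::thread thread1([&]() {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread thread2([&]() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    thread1.join();
    thread2.join();
    std::printf("r1 = %d, r2 = %d\n", r1, r2);
    return 0;
}
</code></pre></div></div>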
<p>Different CPU architectures make different guarantees about which types of memory reordering can and can’t happen on that particular architecture at the hardware level.
A processor that guarantees absolutely no memory reordering of any kind is said to have a <em>sequentially consistent</em> memory model.
Few, if any, modern processor architectures provide a guaranteed sequentially consistent memory model.
Some processors don’t guarantee absolutely sequential consistency, but do guarantee that at least when a CPU core makes a series of writes, other CPU cores will see those writes in the same sequence that they were made; CPUs that make this guarantee have a <em>strong</em> memory model.
Strong memory models effectively guarantee that StoreLoad reordering is the only type of reordering allowed; x86-64 has a strong memory model.
Finally, CPUs that allow for any type of memory reordering at all are said to have a <em>weak</em> memory model.
The arm64 architecture uses a weak memory model, although arm64 at least guarantees that if we read a value through a pointer, the value read will be at least as new as the pointer itself.</p>
<p>So, how can we possibly hope to be able to reason about multithreaded code when both the compiler and the processor can happily reorder our memory access instructions between threads whenever they want for whatever reason they want?
The answer is in memory barriers and fence instructions; these tools allow us to specify boundaries that the compiler cannot reorder memory access instructions across and allow us to force the CPU to make sure that values are flushed to main memory before being read.
In C++, specifying barriers and fences can be done by using compiler intrinsics that map to specific underlying assembly instructions, but the easier and more common way of doing this is by using <a href="https://en.cppreference.com/w/cpp/atomic/memory_order"><code class="language-plaintext highlighter-rouge">std::memory_order</code></a> flags in combination with atomics.
Other languages have similar concepts; for example, <a href="https://doc.rust-lang.org/nomicon/atomics.html">Rust’s atomic access flags</a> are very similar to the C++ memory ordering flags.</p>
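<p>As a tiny illustration of the two styles, the following sketch publishes a value from one thread to another first using an explicit fence and then using a memory ordering flag attached directly to the atomic store; both forms prevent the write to <code class="language-plaintext highlighter-rouge">data</code> from being reordered after the flag store:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>

int data = 0;
std::atomic<bool> ready(false);

// Style 1: an explicit fence between the data write and the flag store.
void publishWithFence() {
    data = 42;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Style 2: a memory ordering flag attached directly to the atomic store.
void publishWithFlag() {
    data = 42;
    ready.store(true, std::memory_order_release);
}

// The consumer side pairs with either publisher: an acquire load of the flag
// guarantees that if ready is seen as true, the write to data is visible too.
bool tryConsume(int& out) {
    if (ready.load(std::memory_order_acquire)) {
        out = data;
        return true;
    }
    return false;
}
</code></pre></div></div>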
<p><code class="language-plaintext highlighter-rouge">std::memory_order</code> flags specify how memory accesses for all operations surrounding an atomic are to be ordered; the impacted surrounding operations include all non-atomics.
There are a whole bunch of <code class="language-plaintext highlighter-rouge">std::memory_order</code> flags; we’ll examine the few that are relevant to the specific example in Listing 1.
The heaviest hammer of all of the flags is <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code>, which enforces absolute sequential consistency at the cost of potentially being more expensive due to potentially needing more loads and/or stores.
For example, on x86-64, <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code> is often implemented using slower <code class="language-plaintext highlighter-rouge">xchg</code> or paired <code class="language-plaintext highlighter-rouge">mov</code>/<code class="language-plaintext highlighter-rouge">mfence</code> instructions instead of a single <code class="language-plaintext highlighter-rouge">mov</code> instruction, and on arm64, the overhead is even greater due to arm64’s weak memory model.
Using <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code> also potentially disallows the CPU from reordering unrelated, longer-running instructions to start (and therefore finish) earlier, potentially causing even more slowdowns.
In C++, atomic operations default to using <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code> if no memory ordering flag is explicitly specified.
Contrast with <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code>, which is the exact opposite of <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code>.
<code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> enforces no synchronization or ordering constraints whatsoever; on an architecture like x86-64, using <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> can be faster than using <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code> if your memory ordering requirements are already met in hardware by x86-64’s strong memory model.
However, being sloppy with <code class="language-plaintext highlighter-rouge">std::memory_order_relaxed</code> can result in some nasty nondeterministic bugs on arm64 if your code requires specific memory ordering guarantees, due to arm64’s weak memory model.
The above is the exact reason why the code in Listing 1 occasionally resulted in dropped tiles in Takua on arm64!</p>
<p>Without any kind of memory ordering constraints, with arm64’s weak memory ordering, the code in Listing 1 can sometimes execute in such a way that one thread sets <code class="language-plaintext highlighter-rouge">nextTileSoftLock</code> to true, but another thread attempts to check <code class="language-plaintext highlighter-rouge">nextTileSoftLock</code> before the first thread’s new value propagates back to main memory and to all of the other threads.
As a result, two threads can end up in a race condition, trying to both increment the non-atomic <code class="language-plaintext highlighter-rouge">nextTileIndex</code> at the same time.
When this happens, two threads can end up working on the same tile at the same time or a tile can get skipped!
We could fix this problem by just removing the memory ordering flags entirely from Listing 1, allowing everything to default back to <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code>.
However, as just mentioned above, we can do better than using <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code> if we know specifically what memory ordering requirements we need for the code to work correctly.</p>
<p>Enter <code class="language-plaintext highlighter-rouge">std::memory_order_acquire</code> and <code class="language-plaintext highlighter-rouge">std::memory_order_release</code>, which represent <em>acquire</em> semantics and <em>release</em> semantics respectively and, when used correctly, always come in a pair.
Acquire semantics apply to load (read) operations and prevent memory reordering of the load operation with any subsequent read or write operation.
Release semantics apply to store (write) operations and prevent memory reordering of the store operation with any preceding read or write operation.
In other words, <code class="language-plaintext highlighter-rouge">std::memory_order_acquire</code> tells the compiler to issue instructions that prevent LoadLoad and LoadStore reordering from happening, and <code class="language-plaintext highlighter-rouge">std::memory_order_release</code> tells the compiler to issue instructions that prevent LoadStore and StoreStore reordering from happening.
Using acquire and release semantics allows Listing 1 to work correctly on arm64, while being ever so slightly cheaper compared with enforcing absolute sequential consistency everywhere.</p>
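<p>Concretely, the acquire/release version of Listing 1 looks like the following sketch; only the two memory ordering flags change, but that’s enough to keep the increment of <code class="language-plaintext highlighter-rouge">nextTileIndex</code> properly contained within the makeshift lock on a weakly ordered CPU:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int nextTileIndex = 0;
std::atomic<bool> nextTileSoftLock(false);
tbb::parallel_for(int(0), numberOfTilesToRender, [&](int /*i*/) {
    bool gotNewTile = false;
    int tile = -1;
    while (!gotNewTile) {
        bool expected = false;
        // acquire: nothing after this exchange can be reordered before it
        if (nextTileSoftLock.compare_exchange_strong(expected, true, std::memory_order_acquire)) {
            tile = nextTileIndex++;
            // release: nothing before this store can be reordered after it
            nextTileSoftLock.store(false, std::memory_order_release);
            gotNewTile = true;
        }
    }
    if (tileIsInRange(tile)) {
        renderTile(tile);
    }
});
</code></pre></div></div>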
<p>What is the takeaway from this long tour through memory reordering and weak and strong memory models and memory ordering constraints?
The takeaway is that when writing multithreaded code that needs to be portable across architectures with different memory ordering guarantees, such as x86-64 versus arm64, we need to be very careful with thinking about how each architecture’s memory ordering guarantees (or lack thereof) impact any lock-free cross-thread communication we need to do!
Atomic code often can be written more sloppily on x86-64 than on arm64 and still have a good chance of working, whereas arm64’s weak memory model means there’s much less room for being sloppy.
If you want a good way to smoke out potential bugs in your lock-free atomic code, porting to arm64 is a good way to find out!</p>
<p><strong>A Deep Dive on x86-64 versus arm64 Through the Lens of Compiling <code class="language-plaintext highlighter-rouge">std::atomic::compare_exchange_weak()</code></strong></p>
<p>While I was looking for the source of the memory reordering bug, I found a separate interesting bug in Takua’s atomic framebuffer… or at least, I thought it was a bug.
The thing I found turned out to not be a bug at all in the end, but at the time I thought that there was a bug in the form of a race condition in an atomic compare-and-exchange loop.
I figured that the renderer must be just running correctly <em>most</em> of the time instead of <em>all</em> of the time, but as I’ll explain in a little bit, the renderer actually provably runs correctly 100% of the time.
Understanding what was going on here led me to dive into the compiler’s assembly output, and wound up being an interesting case study in comparing how the same exact C++ source code compiles to x86-64 versus arm64.
In order to provide the context for the not-a-bug and what I learned about arm64 from it, I need to first briefly describe what Takua’s atomic framebuffer is and how it is used.</p>
<p>Takua supports multiple threads writing to the same pixel in the framebuffer at the same time.
There are two major use cases for this capability: first, integration techniques that use light tracing will connect back to the camera completely arbitrarily, resulting in splats to the framebuffer that are completely unpredictable and possibly overlapping on the same pixels.
Second, adaptive sampling techniques that redistribute sample allocation within a single iteration (meaning launching a single set of pixel samples) can result in multiple samples for the same pixel in the same iteration, which means multiple threads can be calculating paths starting from the same pixel and therefore multiple threads need to write to the same framebuffer pixel.
In order to support multiple threads writing simultaneously to the same pixel in the framebuffer, there are three possible implementation options.
The first option is to just keep a separate framebuffer per thread and merge afterwards, but this approach obviously requires potentially a huge amount of memory.
The second option is to never write to the framebuffer directly, but instead keep queues of framebuffer write requests that occasionally get flushed to the framebuffer by a dedicated worker thread (or some variation thereof).
The third option is to just make each pixel in the framebuffer support exclusive operations through atomics (a mutex per pixel works too, but obviously this would involve much more overhead and might be slower); this option is the atomic framebuffer.
I actually implemented the second option in Takua a long time ago, but the added complexity and performance impact of needing to flush the queue led me to eventually replace the whole thing with an atomic framebuffer.</p>
<p>The tricky part of implementing an atomic framebuffer in C++ is the need for atomic floats.
Obviously each pixel in the framebuffer has to store at the very least accumulated radiance values for the base RGB primaries, along with potentially other AOV values, and accumulated radiance values and many common AOVs all have to be represented with floats.
Modern C++ has standard library support for atomic types through std::atomic, and std::atomic works with floats.
However, pre-C++20, std::atomic only provides atomic arithmetic operations for integer types.
C++20 adds <code class="language-plaintext highlighter-rouge">fetch_add()</code> and <code class="language-plaintext highlighter-rouge">fetch_sub()</code> implementations for <code class="language-plaintext highlighter-rouge">std::atomic<float></code>, but I wrote Takua’s atomic framebuffer way back when C++11 was still the latest standard.
So, pre-C++20, if you want atomic arithmetic operations for <code class="language-plaintext highlighter-rouge">std::atomic<float></code>, you have to implement it yourself.
Fortunately, pre-C++20 does provide <code class="language-plaintext highlighter-rouge">compare_and_exchange()</code> implementations for all atomic types, and that’s all we need to implement everything else we need ourselves.</p>
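<p>(As a quick aside: on a C++20 toolchain, none of the machinery below is needed for simple accumulation, since <code class="language-plaintext highlighter-rouge">std::atomic<float></code> gains a built-in <code class="language-plaintext highlighter-rouge">fetch_add()</code>:)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>

// On C++20 and later, std::atomic<float> provides fetch_add() directly, so
// atomically accumulating radiance into a framebuffer channel is a one-liner.
void accumulateRadiance(std::atomic<float>& pixelChannel, const float radiance) {
    pixelChannel.fetch_add(radiance);
}
</code></pre></div></div>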
<p>Implementing <code class="language-plaintext highlighter-rouge">fetch_add()</code> for atomic floats is fairly straightforward.
Let’s say we want to add a value <code class="language-plaintext highlighter-rouge">f1</code> to an atomic float <code class="language-plaintext highlighter-rouge">f0</code>.
The basic idea is to do an atomic load from <code class="language-plaintext highlighter-rouge">f0</code> into some temporary variable <code class="language-plaintext highlighter-rouge">oldval</code>.
A standard <code class="language-plaintext highlighter-rouge">compare_and_exchange()</code> implementation compares some input value with the current value of the atomic float, and if the two are equal, replaces the current value of the atomic float with a second input value; C++ provides implementations in the form of <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> and <code class="language-plaintext highlighter-rouge">compare_exchange_strong()</code>.
So, all we need to do is run <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> on <code class="language-plaintext highlighter-rouge">f0</code> where the value we use for the comparison test is <code class="language-plaintext highlighter-rouge">oldval</code> and the replacement value is <code class="language-plaintext highlighter-rouge">oldval + f1</code>; if <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> succeeds, we return <code class="language-plaintext highlighter-rouge">oldval</code>, otherwise, loop and repeat until <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> succeeds.
Here’s an example implementation:</p>
<div id="listing2"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    do {
        float oldval = f0.load();
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}
</code></pre></div></div>
<div class="codecaption">Listing 2: Example implementation of atomic float addition.</div>
<p>Seeing why the above implementation works should be very straightforward: imagine two threads are calling the above implementation at the same time.
We want each thread to reload the atomic float on each iteration because we never want a situation where a first thread loads from <code class="language-plaintext highlighter-rouge">f0</code>, a second thread succeeds in adding to <code class="language-plaintext highlighter-rouge">f0</code>, and then the first thread also succeeds in writing its value to <code class="language-plaintext highlighter-rouge">f0</code>, because upon the first thread writing, the value of <code class="language-plaintext highlighter-rouge">f0</code> that the first thread used for the addition operation is out of date!</p>
<p>Well, here’s the implementation that has actually been in Takua’s atomic framebuffer implementation for most of the past decade.
This implementation is very similar to Listing 2, except that Lines 2 and 3 are swapped from where they should be; I likely swapped these two lines through a simple copy/paste error or something when I originally wrote it.
This is the implementation that I suspected was a bug upon revisiting it during the arm64 porting process:</p>
<div id="listing3"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    float oldval = f0.load();
    do {
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}
</code></pre></div></div>
<div class="codecaption">Listing 3: What I thought was an incorrect implementation of atomic float addition.</div>
<p>In the Listing 3 implementation, note how the atomic load of <code class="language-plaintext highlighter-rouge">f0</code> only ever happens once outside of the loop.
The following is what I thought was going on and why, at the time, I thought this implementation was wrong:
Think about what happens if a first thread loads from <code class="language-plaintext highlighter-rouge">f0</code> and then a second thread’s call to <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> succeeds before the first thread gets to <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code>; in this race condition scenario, the first thread should get stuck in an infinite loop.
Since the value of <code class="language-plaintext highlighter-rouge">f0</code> has now been updated by the second thread, but the first thread never reloads the value of <code class="language-plaintext highlighter-rouge">f0</code> inside of the loop, the first thread <em>should have no way of ever succeeding at the</em> <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> <em>call</em>!
However, in reality, with the Listing 3 implementation, Takua never actually gets stuck in an infinite loop, even when multiple threads are writing to the same pixel in the atomic framebuffer.
I initially thought that I must have just been getting really lucky every time and multiple threads, while attempting to accumulate to the same pixel, just never happened to produce the specific <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> call ordering that would cause the race condition and infinite loop.
But then I repeatedly tried a simple test where I had 32 threads simultaneously call <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> for the same atomic float a million times per thread, and… still an infinite loop never occurred.
So, the situation appeared to be that what I thought was <em>incorrect code</em> was always behaving as if it had been written <em>correctly</em>, and furthermore, this held true on both x86-64 <em>and</em> on arm64, across both compiling with Clang on macOS and compiling with GCC on Linux.</p>
<p>If you are well-versed in the C++ specifications, you already know which crucial detail I had forgotten that explains why Listing 3 is actually completely correct and functionally equivalent to Listing 2.
Under the hood, <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak(T& expected, T desired)</code> requires doing an atomic load of the target value in order to compare the target value with <code class="language-plaintext highlighter-rouge">expected</code>.
What I had forgotten was that if the comparison fails, <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak()</code> doesn’t just return a false bool; the function <em>also replaces</em> <code class="language-plaintext highlighter-rouge">expected</code> with the result of the atomic load on the target value!
So really, there isn’t only a single atomic load of <code class="language-plaintext highlighter-rouge">f0</code> in Listing 3; there’s actually an atomic load of <code class="language-plaintext highlighter-rouge">f0</code> in every loop as part of <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code>, and in the event that the comparison fails, the equivalent of <code class="language-plaintext highlighter-rouge">oldval = f0.load()</code> happens.
Of course, I didn’t actually correctly remember what <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> does in the comparison failure case, and I stupidly didn’t double check <a href="https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange">cppreference</a>, so it took me much longer to figure out what was going on.</p>
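<p>Here’s a tiny standalone snippet that demonstrates the detail I had forgotten; I’m using <code class="language-plaintext highlighter-rouge">compare_exchange_strong()</code> here just so the failure is guaranteed to come from the value mismatch rather than from a spurious failure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <cassert>

// On comparison failure, compare_exchange_weak()/compare_exchange_strong()
// write the atomic's current value back into "expected" instead of just
// returning false.
int main() {
    std::atomic<float> f0(5.0f);
    float expected = 1.0f; // deliberately wrong; f0 actually holds 5.0f
    const bool succeeded = f0.compare_exchange_strong(expected, 2.0f);
    assert(!succeeded);        // the comparison failed...
    assert(expected == 5.0f);  // ...and expected was replaced with f0's value
    assert(f0.load() == 5.0f); // ...while f0 itself was left unchanged
    return 0;
}
</code></pre></div></div>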
<p>So, still missing the key piece of knowledge that I had forgotten and assuming that <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> didn’t modify any inputs upon comparison failure, my initial guess was that perhaps the compiler was inlining <code class="language-plaintext highlighter-rouge">f0.load()</code> wherever <code class="language-plaintext highlighter-rouge">oldval</code> was being used as an optimization, which would produce a result that should prevent the race condition from ever happening.
However, after a bit more thought, I concluded that this optimization was very unlikely, since it both changes the written semantics of what the code should be doing by effectively moving an operation from outside a loop to the inside of the loop, and also inlining <code class="language-plaintext highlighter-rouge">f0.load()</code> wherever <code class="language-plaintext highlighter-rouge">oldval</code> is used is not actually a safe code transformation and can produce a different result from the originally written code, since having two atomic loads from <code class="language-plaintext highlighter-rouge">f0</code> introduces the possibility that another thread can do an atomic write to <code class="language-plaintext highlighter-rouge">f0</code> in between the current thread’s two atomic loads.</p>
<p>Things got even more interesting when I tried adding in an additional bit of indirection around the atomic load of <code class="language-plaintext highlighter-rouge">f0</code> into <code class="language-plaintext highlighter-rouge">oldval</code>.
Here is an actually incorrect implementation that I thought should be functionally equivalent to the implementation in Listing 3:</p>
<div id="listing4"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float addAtomicFloat(std::atomic<float>& f0, const float f1) {
    const float oldvaltemp = f0.load();
    do {
        float oldval = oldvaltemp;
        float newval = oldval + f1;
        if (f0.compare_exchange_weak(oldval, newval)) {
            return oldval;
        }
    } while (true);
}
</code></pre></div></div>
<div class="codecaption">Listing 4: An actually incorrect implementation of atomic float addition that might appear to be semantically identical to the implementation in Listing 3 if you've forgotten a certain very important detail about std::compare_exchange_weak().</div>
<p>Creating the race condition and subsequent infinite loop is extremely easy and reliable with Listing 4.
So, to summarize where I was at this point: Listing 2 is a correctly written implementation that produces a correct result in reality, Listing 4 is an incorrectly written implementation that, as expected, produces an incorrect result in reality, and Listing 3 is what I thought was an incorrectly written implementation that I thought was <em>semantically identical</em> to Listing 4, but actually produces the same correct result in reality as Listing 2!</p>
<p>So, left with no better ideas, I decided to just go look directly at the compiler’s output assembly.
To make things a bit easier, we’ll look at and compare the x86-64 assembly for the Listing 2 and Listing 3 C++ implementations first, and explain what important detail I had missed that led me down this wild goose chase.
Then, we’ll look at and compare the arm64 assembly, and we’ll discuss some interesting things I learned along the way by comparing the x86-64 and arm64 assembly for the same C++ function.</p>
<p>Here is the corresponding x86-64 assembly for the correct C++ implementation in Listing 2, compiled with Clang 10.0.0 using -O3.
For readers who are not very used to reading assembly, I’ve included annotations as comments in the assembly code to describe what the assembly code is doing and how it corresponds back to the original C++ code:</p>
<div id="listing5"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float): # f0 is dword ptr [rdi], f1 is xmm0
.LBB0_1:
mov eax, dword ptr [rdi] # eax = *arg0 = f0.load()
movd xmm1, eax # xmm1 = eax = f0.load()
movdqa xmm2, xmm1 # xmm2 = xmm1 = eax = f0.load()
addss xmm2, xmm0 # xmm2 = (xmm2 + xmm0) = (f0 + f1)
movd ecx, xmm2 # ecx = xmm2 = (f0 + f1)
lock cmpxchg dword ptr [rdi], ecx # if eax == *arg0 { ZF = 1; *arg0 = arg1 }
# else { ZF = 0; eax = *arg0 };
# "lock" means all done exclusively
jne .LBB0_1 # if ZF == 0 goto .LBB0_1
movdqa xmm0, xmm1 # return f0 value from before cmpxchg
ret
</code></pre></div></div>
<div class="codecaption">Listing 5: x86-64 assembly corresponding to the implementation in Listing 2, with my annotations in the comments. Compiled using armv8-a Clang 10.0.0 using -O3. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:9,endLineNumber:5,positionColumn:9,positionLineNumber:5,selectionStartColumn:9,selectionStartLineNumber:5,startColumn:9,startLineNumber:5),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++do+%7B%0A++++++++float+oldval+%3D+f0.load()%3B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'x86-64+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>Here is the corresponding x86-64 assembly for the C++ implementation in Listing 3; again, this is the version that produces the same correct result as Listing 2.
Just like with Listing 5, this was compiled using Clang 10.0.0 using -O3, and descriptive annotations are in the comments:</p>
<div id="listing6"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float): # f0 is dword ptr [rdi], f1 is xmm0
mov eax, dword ptr [rdi] # eax = *arg0 = f0.load()
.LBB0_1:
movd xmm1, eax # xmm1 = eax = f0.load()
movdqa xmm2, xmm1 # xmm2 = xmm1 = eax = f0.load()
addss xmm2, xmm0 # xmm2 = (xmm2 + xmm0) = (f0 + f1)
movd ecx, xmm2 # ecx = xmm2 = (f0 + f1)
lock cmpxchg dword ptr [rdi], ecx # if eax == *arg0 { ZF = 1; *arg0 = arg1 }
# else { ZF = 0; eax = *arg0 };
# "lock" means all done exclusively
jne .LBB0_1 # if ZF == 0 goto .LBB0_1
movdqa xmm0, xmm1 # return f0 value from before cmpxchg
ret
</code></pre></div></div>
<div class="codecaption">Listing 6: x86-64 assembly corresponding to the implementation in Listing 3, with my annotations in the comments. Compiled using armv8-a Clang 10.0.0 using -O3. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:1,endLineNumber:3,positionColumn:1,positionLineNumber:3,selectionStartColumn:1,selectionStartLineNumber:3,startColumn:1,startLineNumber:3),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++float+oldval+%3D+f0.load()%3B%0A++++do+%7B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'x86-64+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>The compiled x86-64 assembly in Listing 5 and Listing 6 is almost identical; the only difference is that in Listing 5, copying data from the address stored in register <code class="language-plaintext highlighter-rouge">rdi</code> to register <code class="language-plaintext highlighter-rouge">eax</code> happens after label <code class="language-plaintext highlighter-rouge">.LBB0_1</code> and in Listing 6 the copy happens before label <code class="language-plaintext highlighter-rouge">.LBB0_1</code>.
Comparing the x86-64 assembly with the C++ code, we can see that this difference corresponds directly to where <code class="language-plaintext highlighter-rouge">f0</code>’s value is atomically loaded into <code class="language-plaintext highlighter-rouge">oldval</code>.
We can also see that <code class="language-plaintext highlighter-rouge">std::atomic<float>::compare_exchange_weak()</code> compiles down to a single <code class="language-plaintext highlighter-rouge">cmpxchg</code> instruction, which, as the instruction name suggests, is a compare-and-exchange operation.
The <code class="language-plaintext highlighter-rouge">lock</code> instruction prefix in front of <code class="language-plaintext highlighter-rouge">cmpxchg</code> ensures that the current CPU core has exclusive ownership of the corresponding cache line for the duration of the <code class="language-plaintext highlighter-rouge">cmpxchg</code> operation, which is how the operation is made atomic.</p>
<p>This is the point where I eventually realized what I had missed.
I didn’t notice immediately; in fact, figuring out what I had missed took me several more days!
The thing that finally made me realize what I had missed and made me understand why Listing 3 / Listing 6 don’t actually result in an infinite loop and instead match the behavior of Listing 2 / Listing 5 lies in <code class="language-plaintext highlighter-rouge">cmpxchg</code>.
Let’s take a look at the official <a href="https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4.html">Intel 64 and IA-32 Architectures Software Developer’s Manual</a>’s description <a href="https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4.html">[Intel 2021]</a> of what <code class="language-plaintext highlighter-rouge">cmpxchg</code> does:</p>
<blockquote>
<p>Compares the value in the AL, AX, EAX, or RAX register with the first operand (destination operand). If the two values are equal, the second operand (source operand) is loaded into the destination operand. Otherwise, the destination operand is loaded into the AL, AX, EAX or RAX register. RAX register is available only in 64-bit mode.</p>
<p>This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)</p>
</blockquote>
<p>If the compare part of <code class="language-plaintext highlighter-rouge">cmpxchg</code> fails, <em>the first operand is loaded into the EAX register</em>!
After thinking about this property of <code class="language-plaintext highlighter-rouge">cmpxchg</code> for a bit, I finally had my head-smack moment and remembered that <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak(T& expected, T desired)</code> replaces <code class="language-plaintext highlighter-rouge">expected</code> with the result of the atomic load in the event of comparison failure.
This property of <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak()</code> is why <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak()</code> can be compiled down to a single <code class="language-plaintext highlighter-rouge">cmpxchg</code> instruction on x86-64 in the first place.
We can actually see the compiler being clever here in Listing 6 and exploiting the fact that <code class="language-plaintext highlighter-rouge">cmpxchg</code> comparison failure mode writes into the <code class="language-plaintext highlighter-rouge">eax</code> register: the compiler chooses to use <code class="language-plaintext highlighter-rouge">eax</code> as the target for the <code class="language-plaintext highlighter-rouge">mov</code> instruction in Line 1 instead of using some other register so that a second move from <code class="language-plaintext highlighter-rouge">eax</code> into some other register isn’t necessary after <code class="language-plaintext highlighter-rouge">cmpxchg</code>.
If anything, the implementation in Listing 3 / Listing 6 is actually slightly <em>more</em> efficient than the implementation in Listing 2 / Listing 5, since there is one fewer <code class="language-plaintext highlighter-rouge">mov</code> instruction needed in the loop.</p>
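<p>To make this behavior concrete, here is a minimal standalone C++ snippet (illustrative only, not from Takua) showing the property in question: when the comparison fails, <code class="language-plaintext highlighter-rouge">compare_exchange_weak()</code> overwrites <code class="language-plaintext highlighter-rouge">expected</code> with the value it actually observed, which is exactly why the explicit reload inside the loop is unnecessary:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <iostream>

int main() {
    std::atomic<int> value(5);
    int expected = 3;                // deliberately stale; does not match value
    // The exchange fails because expected != value, but on failure
    // compare_exchange_weak() writes the observed value of `value`
    // back into `expected`.
    bool succeeded = value.compare_exchange_weak(expected, 10);
    std::cout << "succeeded: " << succeeded        // prints 0
              << ", expected is now: " << expected // prints 5
              << std::endl;
    // Because `expected` now holds the up-to-date value, a retry loop can
    // simply recompute the desired value and try again without reloading.
    return 0;
}
</code></pre></div></div>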
<p>So what does this have to do with learning about arm64?
Well, while I was in the process of looking at the x86-64 assembly to try to understand what was going on, I also tried the implementation in Listing 3 on my Raspberry Pi 4B just to sanity check if things worked the same on arm64.
At that point I hadn’t realized that the code in Listing 3 was actually correct yet, so I was beginning to consider possibilities like a compiler bug or weird platform-specific considerations that I hadn’t thought of, so to rule those more exotic explanations out, I decided to see if the code worked the same on x86-64 and arm64.
Of course the code worked exactly the same on both, so the next step was to also examine the arm64 assembly in addition to the x86-64 assembly.
Comparing the same code’s corresponding assembly for x86-64 and arm64 at the same time proved to be a very interesting exercise in getting to better understand some low-level and general differences between the two instruction sets.</p>
<p>Here is the corresponding arm64 assembly for the implementation in Listing 2; this is the arm64 assembly that is the direct counterpart to the x86-64 assembly in Listing 5.
This arm64 assembly was also compiled with Clang 10.0.0 using -O3.
I’ve included annotations here as well, although admittedly my arm64 assembly comprehension is not as good as my x86-64 assembly comprehension, since I’m relatively new to compiling for arm64.
If you’re well versed in arm64 assembly and see a mistake in my annotations, feel free to send me a correction!</p>
<div id="listing7"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float):
b .LBB0_2 // goto .LBB0_2
.LBB0_1:
clrex // clear this thread's record of exclusive lock
.LBB0_2:
ldar w8, [x0] // w8 = *arg0 = f0, non-atomically loaded
ldaxr w9, [x0] // w9 = *arg0 = f0.load(), atomically
// loaded (get exclusive lock on x0), with
// implicit synchronization
fmov s1, w8 // s1 = w8 = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
cmp w9, w8 // compare non-atomically loaded f0 with atomically
// loaded f0 and store result in N
b.ne .LBB0_1 // if N==0 { goto .LBB0_1 }
fmov w8, s2 // w8 = s2 = (f0 + f1)
stlxr w9, w8, [x0] // if this thread has the exclusive lock,
// { *arg0 = w8 = (f0 + f1), release lock },
// store whether or not succeeded in w9
cbnz w9, .LBB0_2 // if w9 says exclusive lock failed { goto .LBB0_2}
mov v0.16b, v1.16b // return f0 value from ldaxr
ret
</code></pre></div></div>
<div class="codecaption">Listing 7: arm64 assembly corresponding to Listing 2, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:9,endLineNumber:5,positionColumn:9,positionLineNumber:5,selectionStartColumn:9,selectionStartLineNumber:5,startColumn:9,startLineNumber:5),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++do+%7B%0A++++++++float+oldval+%3D+f0.load()%3B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:armv8-clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'armv8-a+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>I should note here that the specific version of arm64 that Listing 7 was compiled for is <a href="https://developer.arm.com/documentation/ddi0487/ga">ARMv8.0-A</a>, which is what Clang and GCC both default to when compiling for arm64; this detail will become important a little bit later in this post.
When we compare Listing 7 with Listing 5, we can immediately see some major differences between the arm64 and x86-64 instruction sets, aside from superficial stuff like how registers are named.
The arm64 version is just under twice as long as the x86-64 version, and examining the code, we can see that most of the additional length comes from how the atomic compare-and-exchange is implemented.
Actually, the rest of the code is very similar; it is just moving stuff around to support the addition operation and to deal with setting up and jumping to the top of the loop.
In the compare and exchange code, we can see that the arm64 version does not have a single instruction to implement the atomic compare-and-exchange!
While the x86-64 version can compile <code class="language-plaintext highlighter-rouge">std::atomic<float>::compare_exchange_weak()</code> down into a single <code class="language-plaintext highlighter-rouge">cmpxchg</code> instruction, ARMv8.0-A has no equivalent instruction, so the arm64 version instead must use three separate instructions to implement the complete functionality: <code class="language-plaintext highlighter-rouge">ldaxr</code> to do an exclusive load, <code class="language-plaintext highlighter-rouge">stlxr</code> to do an exclusive store, and <code class="language-plaintext highlighter-rouge">clrex</code> to reset the current thread’s record of exclusive access requests.</p>
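<p>To make the three-instruction scheme a little more concrete, here is a rough C++-flavored sketch of how a compare-and-exchange can be assembled from an exclusive load/store pair; the <code class="language-plaintext highlighter-rouge">load_exclusive()</code>, <code class="language-plaintext highlighter-rouge">store_exclusive()</code>, and <code class="language-plaintext highlighter-rouge">clear_exclusive()</code> helpers below are hypothetical stand-ins naming the hardware behavior of <code class="language-plaintext highlighter-rouge">ldaxr</code>, <code class="language-plaintext highlighter-rouge">stlxr</code>, and <code class="language-plaintext highlighter-rouge">clrex</code>, not real intrinsics:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>

// Hypothetical stand-ins for the exclusive-access instructions:
uint32_t load_exclusive(uint32_t* addr);           // ldaxr: load and acquire an
                                                   // exclusive monitor on addr
bool store_exclusive(uint32_t* addr, uint32_t v);  // stlxr: store succeeds only
                                                   // if the monitor is still held
void clear_exclusive();                            // clrex: give up the monitor

// Sketch of a compare-and-exchange built from the pair above. The store can
// fail even when the comparison succeeded (another core touched addr between
// the exclusive load and the exclusive store), which is why Listing 7 loops
// back on the result of stlxr via cbnz.
bool compareAndExchange(uint32_t* addr, uint32_t expected, uint32_t desired) {
    uint32_t observed = load_exclusive(addr);
    if (observed != expected) {
        clear_exclusive();   // comparison failed: release the exclusive monitor
        return false;
    }
    return store_exclusive(addr, desired);
}
</code></pre></div></div>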
<p>This difference speaks directly to x86-64 being a <a href="https://en.wikipedia.org/wiki/Complex_instruction_set_computer">CISC architecture</a> and arm64 being a <a href="https://en.wikipedia.org/wiki/Reduced_instruction_set_computer">RISC architecture</a>.
x86-64’s CISC nature calls for the ISA to have a large number of instructions carrying out complex, often multi-step operations, and this design philosophy is what allows x86-64 to encode complex multi-step operations like a compare-and-exchange as a single instruction.
Conversely, arm64’s RISC nature means a design consisting of fewer, simpler operations <a href="https://doi.org/10.1145/641914.641917">[Patterson and Ditzel 1980]</a>; for example, the RISC design philosophy mandates that memory access be done through specific single-cycle instructions instead of as part of a more complex instruction such as compare-and-exchange.
These differing design philosophies mean that in arm64 assembly, we will often see many instructions used to implement what would be a single instruction in x86-64; given this difference, compiling Listing 2 produces surprisingly similar structure in the output x86-64 (Listing 5) and arm64 (Listing 7) assembly.
However, if we take the implementation of <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> in Listing 3 and compile it for arm64’s ARMv8.0-A revision, structural differences between the x86-64 and arm64 output become far more apparent:</p>
<div id="listing8"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float):
ldar w9, [x0] // w9 = *arg0 = f0, non-atomically loaded
ldaxr w8, [x0] // w8 = *arg0 = f0.load(), atomically
// loaded (get exclusive lock on x0), with
// implicit synchronization
fmov s1, w9 // s1 = w9 = f0
cmp w8, w9 // compare non-atomically loaded f0 with atomically
// loaded f0 and store result in N
b.ne .LBB0_3 // if N==0 { goto .LBB0_3 }
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
fmov w9, s2 // w9 = s2 = (f0 + f1)
stlxr w10, w9, [x0] // if this thread has the exclusive lock,
// { *arg0 = w9 = (f0 + f1), release lock },
// store whether or not succeeded in w10
cbnz w10, .LBB0_4 // if w10 says exclusive lock failed { goto .LBB0_4 }
mov w9, #1 // w9 = 1 (???)
tbz w9, #0, .LBB0_8 // if bit 0 of w9 == 0 { goto .LBB0_8 }
b .LBB0_5 // goto .LBB0_5
.LBB0_3:
clrex // clear this thread's record of exclusive lock
.LBB0_4:
mov w9, wzr // w9 = 0
tbz w9, #0, .LBB0_8 // if bit 0 of w9 == 0 { goto .LBB0_8 }
.LBB0_5:
mov v0.16b, v1.16b // return f0 value from ldaxr
ret
.LBB0_6:
clrex // clear this thread's record of exclusive lock
.LBB0_7:
mov w10, wzr // w10 = 0
mov w8, w9 // w8 = w9
cbnz w10, .LBB0_5 // if w10 is not zero { goto .LBB0_5 }
.LBB0_8:
ldaxr w9, [x0] // w9 = *arg0 = f0.load(), atomically
// loaded (get exclusive lock on x0), with
// implicit synchronization
fmov s1, w8 // s1 = w8 = f0
cmp w9, w8 // compare non-atomically loaded f0 with atomically
// loaded f0 and store result in N
b.ne .LBB0_6 // if N==0 { goto .LBB0_6 }
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
fmov w8, s2 // w8 = s2 = (f0 + f1)
stlxr w10, w8, [x0] // if this thread has the exclusive lock,
// { *arg0 = w8 = (f0 + f1), release lock },
// store whether or not succeeded in w10
cbnz w10, .LBB0_7 // if w10 says exclusive lock failed { goto .LBB0_7 }
mov w10, #1 // w10 = 1
mov w8, w9 // w8 = w9 = f0.load()
cbz w10, .LBB0_8 // if w10==0 { goto .LBB0_8 }
b .LBB0_5 // goto .LBB0_5
</code></pre></div></div>
<div class="codecaption">Listing 8: arm64 assembly corresponding to Listing 3, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:1,endLineNumber:12,positionColumn:1,positionLineNumber:12,selectionStartColumn:1,selectionStartLineNumber:12,startColumn:1,startLineNumber:12),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++float+oldval+%3D+f0.load()%3B%0A++++do+%7B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:armv8-clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'armv8-a+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>Moving the atomic load out of the loop in Listing 3 resulted in a single line change between Listing 5 and Listing 6’s x86-64 assembly, but causes the arm64 version to explode in size and radically change in structure between Listing 7 and Listing 8!
The key difference between Listing 7 and Listing 8 is that in Listing 8, the entire first iteration of the while loop is lifted out into its own code segment, which can then either directly return out of the function or go into the main body of the loop afterwards.
I initially thought that Clang’s decision to lift out the first iteration of the loop was surprising, but it turns out that GCC 10.3 and MSVC v19.28’s respective arm64 backends also similarly decide to lift the first iteration of the loop out as well.
The need to lift the entire first iteration out of the loop likely comes from the need to use an <code class="language-plaintext highlighter-rouge">ldaxr</code> instruction to carry out the initial atomic load of <code class="language-plaintext highlighter-rouge">f0</code>.
Compared with GCC 10.3 and MSVC v19.28, though, Clang 10.0.0’s arm64 output does seem to do a bit more jumping around (see <code class="language-plaintext highlighter-rouge">.LBB0_4</code> through <code class="language-plaintext highlighter-rouge">.LBB0_7</code>).
Also, admittedly I’m not entirely sure why register <code class="language-plaintext highlighter-rouge">w9</code> gets set to 1 and then immediately compared with 0 in lines 16/17 and lines 47/49; maybe that’s just a convenient way to clear the <code class="language-plaintext highlighter-rouge">z</code> bit of the <code class="language-plaintext highlighter-rouge">CPSR</code> (Current Program Status Register; this is analogous to <code class="language-plaintext highlighter-rouge">EFLAGS</code> on x86-64)?
But anyhow, compared with Listing 7, the arm64 assembly in Listing 8 is much longer in terms of code length, but is actually only slightly less efficient in terms of total instructions executed.
The slight additional inefficiency comes from some of the additional setup work needed to manage all of the jumping and the split loop.
However, the fact that Listing 8 is less efficient compared with Listing 7 is interesting when we compare with what Listing 3 does to the x86-64 assembly; in the case of x86-64, pulling the initial atomic load out of the loop makes the output x86-64 assembly slightly <em>more</em> efficient, as opposed to slightly <em>less</em> efficient as we have here with arm64.</p>
<p>As a very loose general rule of thumb, arm64 assembly tends to be longer than the equivalent x86-64 assembly for the same high-level code because CISC architectures simply tend to encode a lot more <em>stuff</em> per instruction compared with RISC architectures <a href="https://doi.org/10.1109/ICCD.2009.5413117">[Weaver and McKee 2009]</a>.
However, compiled x86-64 binaries having fewer instructions doesn’t actually mean that x86-64 binaries necessarily run faster than equivalent, less “instruction-dense” compiled arm64 binaries.
x86-64 instructions are variable length, requiring more complex logic in the processor’s <a href="https://en.wikibooks.org/wiki/Microprocessor_Design/Instruction_Decoder">instruction decoder</a>, and also since x86-64 instructions are more complex, they can take many more cycles per instruction to execute.
Contrast with arm64, in which instructions are fixed length.
RISC architectures generally feature fixed-length instructions, although this generalization isn’t a hard rule; the <a href="https://en.wikipedia.org/wiki/SuperH">SuperH</a> architecture (famously used in the Sega Saturn and Sega Dreamcast) is notably a RISC architecture with variable-length instructions.
Fixed-length instructions allow arm64 chips to have simpler decoding logic, and arm64 instructions also tend to take many fewer cycles each to execute (often, but not always, as low as one or two cycles per instruction).
The end result is that even though compiled arm64 binaries have lower instruction-density than compiled x86-64 binaries, arm64 processors tend to be able to retire more instructions per cycle than comparable x86-64 processors, allowing arm64 as an architecture to make up for the difference in code density.</p>
<p>…except, of course, all of the above is only loosely true today!
While the x86-64 instruction set is still definitively a CISC instruction set today and the arm64 instruction set is still clearly a RISC instruction set today, a lot of the details have gotten fuzzier over time.
Processors today rarely directly implement the instruction set that they run; basically all modern x86-64 processors today feed x86-64 instructions into a huge hardware decoder block that breaks down individual x86-64 instructions into lower-level <a href="https://en.m.wikipedia.org/wiki/Micro-operation">micro-operations, or μops</a>.
Compared with older x86 processors from decades ago that directly implemented x86, these modern micro-operation-based x86-64 implementations are often much more RISC-like internally.
In fact, if you were to examine all of the parts of a modern Intel and AMD x86-64 processor that take place after the instruction decoding phase, without knowing what processor you were looking at beforehand, you likely would not be able to determine if the processor implemented a CISC or a RISC ISA <a href="https://www.researchgate.net/publication/235960679_The_Architecture_of_the_Nehalem_Processor_and_Nehalem-EP_SMP_Platforms">[Thomadakis 2011]</a>.</p>
<p>The same is true going the other way; while modern x86-64 is a CISC architecture that in practical implementation is often more RISC-like, modern arm64 is a RISC architecture that sometimes has surprisingly CISC-like elements if you look closely.
Modern arm64 processors often <em>also</em> decode individual instructions into smaller micro-operations <a href="https://developer.arm.com/documentation/uan0015/b/">[ARM 2016]</a>, although the extent to which modern arm64 processors do this is a lot less intensive than what modern x86-64 does <a href="https://superuser.com/a/934755">[Castellano 2015]</a>.
Modern arm64 instruction decoders usually rely on simple <a href="https://en.wikipedia.org/wiki/Control_unit#Hardwired_control_unit">hardwired control</a> to break instructions down into micro-operations, whereas modern x86-64 must use a <a href="https://en.wikipedia.org/wiki/Microcode">programmable ROM containing advanced microcode</a> to store mappings from x86-64 instructions to micro-instructions.</p>
<p>Another way that arm64 has slowly gained some CISC-like characteristics is that arm64 over time has gained some surprisingly specialized complex instructions!
Remember the important note I made earlier about Listing 7 and Listing 8 being generated specifically for the ARMv8.0-A revision of arm64?
Well, the specific <code class="language-plaintext highlighter-rouge">ldaxr</code>/<code class="language-plaintext highlighter-rouge">stlxr</code> combination in Listings 7 and 8 that is needed to implement an atomic compare-and-exchange (and generally any kind of atomic load-and-conditional-store operation) is a specific area where a more complex single-instruction implementation generally can perform better than an implementation using several instructions.
As discussed earlier, one complex instruction is not necessarily always faster than several simpler instructions due to how the instructions actually have to be decoded and executed, but in this case, one atomic instruction allows for a faster implementation than several instructions combined since a single atomic instruction can take advantage of more available information at once <a href="https://cpufun.substack.com/p/atomics-in-aarch64">[Cownie 2021]</a>.
Accordingly, the <a href="https://developer.arm.com/documentation/ddi0557/">ARMv8.1-A revision</a> of arm64 introduces a collection of new single-instruction atomic operations.
Of interest to our particular example here is the new <code class="language-plaintext highlighter-rouge">casal</code> instruction, which performs a compare-and-exchange to memory with acquire and release semantics; this new instruction is a direct analog to the x86-64 <code class="language-plaintext highlighter-rouge">cmpxchg</code> instruction with the <code class="language-plaintext highlighter-rouge">lock</code> prefix.
<p>We can actually use these new ARMv8.1-A single-instruction atomic operations today; while GCC and Clang both target ARMv8.0-A by default today, ARMv8.1-A support can be enabled using the <code class="language-plaintext highlighter-rouge">-march=armv8.1-a</code> flag starting in GCC 10.1 and starting in Clang 9.0.0.
Actually, Clang’s support might go back even earlier; Clang 9.0.0 was the furthest back I was able to test.
Here’s what Listing 2 compiles to using the <code class="language-plaintext highlighter-rouge">-march=armv8.1-a</code> flag to enable the <code class="language-plaintext highlighter-rouge">casal</code> instruction:</p>
<div id="listing9"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float):
.LBB0_1:
ldar w8, [x0] // w8 = *arg0 = f0, non-atomically loaded
fmov s1, w8 // s1 = w8 = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
mov w9, w8 // w9 = w8 = f0
fmov w10, s2 // w10 = s2 = (f0 + f1)
casal w9, w10, [x0] // atomically read the contents of the address stored
// in x0 (*arg0 = f0) and compare with w9;
// if [x0] == w9:
// atomically set the contents of the
// [x0] to the value in w10
// else:
// w9 = value loaded from [x0]
cmp w9, w8 // compare w9 and w8 and store result in N
cset w8, eq // if previous instruction's compare was true,
// set w8 = 1
cmp w8, #1 // compare if w8 == 1 and store result in N
b.ne .LBB0_1 // if N==0 { goto .LBB0_1 }
mov v0.16b, v1.16b // return f0 value from ldar
ret
</code></pre></div></div>
<div class="codecaption">Listing 9: arm64 revision ARMv8.1-A assembly corresponding to Listing 2, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3 and also -march=armv8.1-a. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:34,endLineNumber:5,positionColumn:34,positionLineNumber:5,selectionStartColumn:34,selectionStartLineNumber:5,startColumn:34,startLineNumber:5),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++do+%7B%0A++++++++float+oldval+%3D+f0.load()%3B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:armv8-clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3+-march%3Darmv8.1-a',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'armv8-a+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>If we compare Listing 9 with the ARMv8.0-A version in Listing 7, we can see that Listing 9 is only slightly shorter in terms of total instructions used, but the need for separate <code class="language-plaintext highlighter-rouge">ldaxr</code>, <code class="language-plaintext highlighter-rouge">stlxr</code>, and <code class="language-plaintext highlighter-rouge">clrex</code> instructions has been completely replaced with a single <code class="language-plaintext highlighter-rouge">casal</code> instruction.
Interestingly, Listing 9 is now structurally very similar to its x86-64 counterpart in Listing 5.
My guess is that if someone who was familiar with x86-64 assembly but had never seen arm64 assembly before was given Listing 5 and Listing 9 to compare side-by-side, they’d be able to figure out almost immediately what each line in Listing 9 does.</p>
<p>Now let’s see what Listing 3 compiles to using the <code class="language-plaintext highlighter-rouge">-march=armv8.1-a</code> flag:</p>
<div id="listing10"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float):
ldar w9, [x0] // w9 = *arg0 = f0, non-atomically loaded
fmov s1, w9 // s1 = w9 = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
mov w8, w9 // w8 = w9 = f0
fmov w10, s2 // w10 = s2 = (f0 + f1)
casal w8, w10, [x0] // atomically read the contents of the address stored
// in x0 (*arg0 = f0) and compare with w8;
// if [x0] == w8:
// atomically set the contents of the
// [x0] to the value in w10
// else:
// w8 = value loaded from [x0]
cmp w8, w9 // compare w8 and w9 and store result in N
b.eq .LBB0_3 // if N==1 { goto .LBB0_3 }
mov w9, w8 // w9 = w8 = value previously loaded from [x0]
.LBB0_2:
fmov s1, w8 // s1 = w8 = value previously loaded from [x0] = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
fmov w10, s2 // w10 = s2 = (f0 + f1)
casal w9, w10, [x0] // atomically read the contents of the address stored
// in x0 (*arg0 = f0) and compare with w9;
// if [x0] == w9:
// atomically set the contents of the
// [x0] to the value in w10
// else:
// w9 = value loaded from [x0]
cmp w9, w8 // compare w9 and w8 and store result in N
cset w8, eq // if previous instruction's compare was true,
// set w8 = 1
cmp w8, #1 // compare if w8 == 1 and store result in N
mov w8, w9 // w8 = w9 = value previously loaded from [x0] = f0
b.ne .LBB0_2 // if N==0 { goto .LBB0_2 }
.LBB0_3:
mov v0.16b, v1.16b // return f0 value from ldar
ret
</code></pre></div></div>
<div class="codecaption">Listing 10: arm64 revision ARMv8.1-A assembly corresponding to Listing 3, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3 and also -march=armv8.1-a. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:30,endLineNumber:4,positionColumn:30,positionLineNumber:4,selectionStartColumn:30,selectionStartLineNumber:4,startColumn:30,startLineNumber:4),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++float+oldval+%3D+f0.load()%3B%0A++++do+%7B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_weak(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:armv8-clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3+-march%3Darmv8.1-a',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'armv8-a+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>Here, the availability of the <code class="language-plaintext highlighter-rouge">casal</code> instruction makes a huge difference in the compactness of the output assembly!
Listing 10 is nearly half the length of Listing 8, and more importantly, Listing 10 is also structurally much simpler than Listing 8.
In Listing 10, the compiler still decided to unroll the first iteration of the loop, but the amount of setup and jumping around in between iterations of the loop is significantly reduced, which should make Listing 10 a bit more performant than Listing 8 even before we take into account the performance improvements from using <code class="language-plaintext highlighter-rouge">casal</code>.</p>
<p>By the way, remember our discussion of weak versus strong memory models in the previous section?
As you may have noticed, Takua’s implementation of <code class="language-plaintext highlighter-rouge">addAtomicFloat()</code> uses <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak()</code> instead of <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_strong()</code>.
The difference between the weak and strong versions of <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_*()</code> is that the weak version is allowed to sometimes report a failed comparison even if the values are actually equal (that is, the weak version is allowed to spuriously report a false negative), while the strong version guarantees always accurately reporting the outcome of the comparison.
On x86-64, there is no difference between using the weak and strong versions of <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_*()</code> because x86-64 always provides strong memory ordering (in other words, on x86-64 the weak version is allowed to report a false negative by the spec but never actually does).
However, on arm64, the weak version actually does report false negatives in practice.
The reason I chose to use the weak version is because when the compare-and-exchange is attempted repeatedly in a loop, if the underlying processor actually has weak memory ordering, using the weak version is usually faster than the strong version.
To see why, let’s take a look at the arm64 ARMv8.0-A assembly corresponding to Listing 2, but with <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_strong()</code> swapped in instead of <code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak()</code>:</p>
<div id="listing11"></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addAtomicFloat(std::atomic<float>&, float):
.LBB0_1:
ldar w8, [x0] // w8 = *arg0 = f0, non-atomically loaded
fmov s1, w8 // s1 = w8 = f0
fadd s2, s1, s0 // s2 = s1 + s0 = (f0 + f1)
fmov w9, s2 // w9 = s2 = (f0 + f1)
.LBB0_2:
ldaxr w10, [x0] // w10 = *arg0 = f0.load(), atomically
// loaded (get exclusive lock on x0), with
// implicit synchronization
cmp w10, w8 // compare non-atomically loaded f0 with atomically
// loaded f0 and store result in N
b.ne .LBB0_4 // if N==0 { goto .LBB0_4 }
stlxr w10, w9, [x0] // if this thread has the exclusive lock,
// { *arg0 = w9 = (f0 + f1), release lock },
// store whether or not succeeded in w10
cbnz w10, .LBB0_2 // if w10 says exclusive lock failed { goto .LBB0_2}
b .LBB0_5 // goto .LBB0_5
.LBB0_4:
clrex // clear this thread's record of exclusive lock
b .LBB0_1 // goto .LBB0_1
.LBB0_5:
mov v0.16b, v1.16b // return f0 value from ldaxr
ret
</code></pre></div></div>
<div class="codecaption">Listing 11: arm64 revision ARMv8.0-A assembly corresponding to Listing 2 but using <br /><code class="language-plaintext highlighter-rouge">std::atomic::compare_exchange_strong()</code> instead of <code class="language-plaintext highlighter-rouge">std::atomic::compare_exchange_weak()</code>, with my annotations in the comments. Compiled using arm64 Clang 10.0.0 using -O3 and also -march=armv8.1-a. <a href="https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:39,endLineNumber:7,positionColumn:39,positionLineNumber:7,selectionStartColumn:39,selectionStartLineNumber:7,startColumn:39,startLineNumber:7),source:'%23include+%3Catomic%3E%0A%0Afloat+addAtomicFloat(std::atomic%3Cfloat%3E%26+f0,+const+float+f1)+%7B%0A++++do+%7B%0A++++++++float+oldval+%3D+f0.load()%3B%0A++++++++float+newval+%3D+oldval+%2B+f1%3B%0A++++++++if+(f0.compare_exchange_strong(oldval,+newval))+%7B%0A++++++++++++return+oldval%3B%0A++++++++%7D%0A++++%7D+while+(true)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50.32967032967033,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:armv8-clang1000,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O3+',selection:(endColumn:12,endLineNumber:19,positionColumn:12,positionLineNumber:19,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'armv8-a+clang+10.0.0+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:49.67032967032967,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4">See on Godbolt Compiler Explorer</a></div>
<p>If we compare Listing 11 with Listing 7, we can see that just changing the compare and exchange to a strong version instead of a weak version causes a major restructuring of the arm64 assembly and the addition of a bunch more jumps.
In Listing 7, loads from <code class="language-plaintext highlighter-rouge">[x0]</code> (corresponding to reads of <code class="language-plaintext highlighter-rouge">f0</code> in the C++ code) happen together at the top of the loop and the loaded values are reused through the rest of the loop.
However, Listing 11 is restructured such that loads from <code class="language-plaintext highlighter-rouge">[x0]</code> happen immediately before the instruction that uses the loaded value from <code class="language-plaintext highlighter-rouge">[x0]</code> to do a comparison or other operation.
This change means that there is less time for another thread to change the value at <code class="language-plaintext highlighter-rouge">[x0]</code> while this thread is still doing stuff.
Interestingly, if we compile using ARMv8.1-A, the availability of single-instruction atomic operations means that just like on x86-64, the difference between the strong and weak versions of the compare and exchange goes away, and both end up compiling to the same arm64 assembly.</p>
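<p>The practical rule of thumb that falls out of all of this is illustrated by the following small sketch (again, standalone example code rather than Takua’s actual implementation): inside a retry loop a spurious failure only costs one extra iteration, so the weak version is the natural fit, whereas a one-shot attempt where a false negative would change program behavior is where the strong version earns its keep:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>

// Retry loop: a spurious failure just means one more trip around the loop,
// so compare_exchange_weak() is the usual choice (this mirrors the structure
// of addAtomicFloat()).
void fetchMultiply(std::atomic<int>& a, int factor) {
    int oldval = a.load();
    while (!a.compare_exchange_weak(oldval, oldval * factor)) {
        // on failure, oldval has been refreshed with the current value of a,
        // so the loop just recomputes the product and tries again
    }
}

// One-shot attempt: a false negative here would incorrectly report the slot
// as already claimed, so compare_exchange_strong() is the appropriate choice.
bool claimSlot(std::atomic<int>& slot, int ownerId) {
    int expected = 0;  // 0 means unclaimed
    return slot.compare_exchange_strong(expected, ownerId);
}
</code></pre></div></div>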
<p>At this point in the process of porting Takua to arm64, I only had a couple of Raspberry Pis, as Apple Silicon Macs hadn’t even been announced yet.
Unfortunately, the Raspberry Pi 3B’s Cortex-A53-based CPU and the Raspberry Pi 4B’s Cortex-A72-based CPU only implement ARMv8.0-A, which means I couldn’t actually test and compare the versions of the compiled assembly with and without <code class="language-plaintext highlighter-rouge">casal</code>.
Fortunately though, we can still compile the code such that if the processor the code is running on implements ARMv8.1-A, the code will use <code class="language-plaintext highlighter-rouge">casal</code> and other ARMv8.1-A single-instruction atomic operations, and otherwise if only ARMv8.0-A is implemented, then the code will fall back to using <code class="language-plaintext highlighter-rouge">ldaxr</code>, <code class="language-plaintext highlighter-rouge">stlxr</code>, and <code class="language-plaintext highlighter-rouge">clrex</code>.
We can get the compiler to automatically do the above by using the <code class="language-plaintext highlighter-rouge">-moutline-atomics</code> compiler flag, which Richard Henderson of Linaro contributed into GCC 10.1 <a href="https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/making-the-most-of-the-arm-architecture-in-gcc-10">[Tkachov 2020]</a> and which also recently was added to Clang 12.0.0 in April 2021.
The <code class="language-plaintext highlighter-rouge">-moutline-atomics</code> flag tells the compiler to generate a runtime helper function and stub the runtime helper function into the atomic operation call-site instead of directly generating atomic instructions; this helper function then does a runtime check for what atomic instructions are available on the current processor and dispatches to the best possible implementation given the available instructions.
This runtime check is cached to make subsequent calls to the helper function faster.
Using this flag means that if a future Raspberry Pi 5 or something comes out hopefully with support for something newer than ARMv8.0-A, Takua should be able to automatically take advantage of faster single-instruction atomics without me having to reconfigure Takua’s builds per processor.</p>
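<p>Conceptually, the outlined helper boils down to a cached runtime feature check followed by a dispatch; the snippet below is only a rough illustration of that idea (it is not the helper the compilers actually generate), and it assumes Linux’s <code class="language-plaintext highlighter-rouge">getauxval()</code> interface and the aarch64 <code class="language-plaintext highlighter-rouge">HWCAP_ATOMICS</code> capability bit:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <sys/auxv.h>   // getauxval(), AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ATOMICS on aarch64 Linux

// Ask the kernel once whether the CPU implements the ARMv8.1-A LSE atomic
// instructions (casal and friends), cache the answer, and let callers branch
// to either the single-instruction path or the ldaxr/stlxr/clrex fallback.
static bool cpuHasLSEAtomics() {
    static const bool hasLSE = (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
    return hasLSE;
}
</code></pre></div></div>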
<p><strong>Performance Testing</strong></p>
<p>So, now that I have Takua up and running on arm64 on Linux, how does it actually perform?
Here are some comparisons, although there are some important caveats.
First, at this stage in the porting process, the only arm64 hardware I had that could actually run reasonably sized scenes was a Raspberry Pi 4B with 4 GB of memory.
The Raspberry Pi 4B’s CPU is a Broadcom BCM2711, which has 4 Cortex-A72 cores; these cores aren’t exactly fast, and even though the Raspberry Pi 4B came out in 2019, the Cortex-A72 core actually dates back to 2015.
So, for the x86-64 comparison point, I’m using my early 2015 MacBook Air, which also has only 4 GB of memory and has an Intel Core i5-5250U CPU with 2 cores / 4 threads.
Also, as an extremely unfair comparison point, I also ran the comparisons on my workstation, which has 128 GB of memory and dual Intel Xeon E5-2680 CPUs with 8 cores / 16 threads each, for 16 cores / 32 threads in total.
The three scenes I used were the Cornell Box seen in Figure 1, the glass teacup seen in Figure 2, and the bedroom scene from my <a href="http://blog.yiningkarlli.com/2020/02/shadow-terminator-in-takua.html">shadow terminator blog post</a>; these scenes were chosen because they fit in under 4 GB of memory.
All scenes were rendered to 16 samples-per-pixel, because I didn’t want to wait forever.
The Cornell Box and Bedroom scenes are rendered using unidirectional path tracing, while the tea cup scene is rendered using VCM.
The Cornell Box scene is rendered at 1024x1024 resolution, while the Tea Cup and Bedroom scenes are rendered at 1920x1080 resolution.</p>
<p>Here are the results:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">CORNELL BOX</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1024x1024, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">440.627 s</td>
<td style="text-align: left">approx 1762.51 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">272.053 s</td>
<td style="text-align: left">approx 1088.21 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">36.6183 s</td>
<td style="text-align: left">approx 1139.79 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">TEA CUP</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, VCM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">2205.072 s</td>
<td style="text-align: left">approx 8820.32 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">2237.136 s</td>
<td style="text-align: left">approx 8948.56 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">174.872 s</td>
<td style="text-align: left">approx 5593.60 s</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">BEDROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Wall Time:</th>
<th style="text-align: left">Core-Seconds:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">5653.66 s</td>
<td style="text-align: left">approx 22614.64 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">4900.54 s</td>
<td style="text-align: left">approx 19602.18 s</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">310.35 s</td>
<td style="text-align: left">approx 9931.52 s</td>
</tr>
</tbody>
</table>
<p>In the results above, “wall time” refers to how long the render took to complete in real-world time as if measured by a clock on the wall, while “core-seconds” is a measure of how long the render would have taken completely single-threaded.
Both values are separately tracked by the renderer; “wall time” is just a timer that starts when the renderer begins working on its first sample and stops when the very last sample is finished, while “core-seconds” is tracked by using a separate timer per thread and adding up how much time each thread has spent rendering.</p>
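<p>For a sense of how simple this bookkeeping can be, here is a minimal sketch (illustrative only, not Takua’s actual code) of tracking wall time with a single timer while accumulating core-seconds from a per-thread timer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::atomic<double> coreSeconds(0.0);
    auto wallStart = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; i++) {
        workers.emplace_back([&coreSeconds]() {
            auto threadStart = std::chrono::steady_clock::now();
            // ... this thread's share of the rendering work goes here ...
            std::chrono::duration<double> spent =
                std::chrono::steady_clock::now() - threadStart;
            // accumulate into the shared total using a compare-and-exchange
            // loop, since fetch_add on atomic<double> requires C++20
            double expected = coreSeconds.load();
            while (!coreSeconds.compare_exchange_weak(expected,
                                                      expected + spent.count())) {}
        });
    }
    for (auto& worker : workers) { worker.join(); }

    std::chrono::duration<double> wall =
        std::chrono::steady_clock::now() - wallStart;
    std::cout << "wall time: " << wall.count() << " s, core-seconds: "
              << coreSeconds.load() << " s" << std::endl;
    return 0;
}
</code></pre></div></div>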
<p>The results are interesting!
The Raspberry Pi 4B and 2015 MacBook Air are both just completely outclassed by the dual-Xeon workstation in absolute wall time, but that should come as a surprise to absolutely nobody.
What’s more surprising is that the multiplier by which the dual-Xeon workstation is faster than the Raspberry Pi 4B in wall time is much higher than the multiplier in core-seconds.
For the Cornell Box scene, the dual-Xeon is 12.033x faster than the Raspberry Pi 4B in wall time, but is only 1.546x faster in core-seconds.
For the Tea Cup scene, the dual-Xeon is 12.61x faster than the Raspberry Pi 4B in wall time, but is only 1.577x faster in core-seconds.
For the Bedroom scene, the dual-Xeon is 18.217x faster than the Raspberry Pi 4B in wall time, but is only 2.277x faster in core-seconds.
This difference in wall time multiplier versus core-seconds multiplier indicates that the Raspberry Pi 4B and dual-Xeon workstation are shockingly close in <em>single-threaded</em> performance; the dual-Xeon workstation only has such a crushing lead in wall clock time because it just has way more cores and threads available than the Raspberry Pi 4B.</p>
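<p>(These multipliers fall straight out of the tables above; for example, for the Cornell Box scene, 440.627 s / 36.6183 s ≈ 12.03x in wall time, while 1762.51 s / 1139.79 s ≈ 1.55x in core-seconds.)</p>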
<p>When we compare the Raspberry Pi 4B to the 2015 MacBook Air, the results are even more interesting.
Between these two machines, the times are actually relatively close; for the Cornell Box and Bedroom scenes, the Raspberry Pi 4B is within striking distance of the 2015 MacBook Air, and for the Tea Cup scene, the Raspberry Pi 4B is <em>actually faster</em> than the 2015 MacBook Air.
The reason the Raspberry Pi 4B is faster than the 2015 MacBook Air at the Tea Cup scene is likely because the Tea Cup scene was rendered using VCM; VCM requires the construction of a photon map, and from previous profiling I know that Takua’s photon map builder works better with more actual physical cores.
The Raspberry Pi 4B has four physical cores, whereas the 2015 MacBook Air only has two physical cores and gets to four threads using hyperthreading; my photon map builder doesn’t scale well with hyperthreading.</p>
<p>So, overall, the Raspberry Pi 4B’s arm64 processor intended for phones got handily beat by a dual-Xeon workstation but came very close to a 2015 MacBook Air.
The thing here to remember though, is that the Raspberry Pi 4B’s arm64-based processor has a TDP of just 4 watts!
Contrast with the MacBook Air’s Intel Core i5-5250U, which has a 15 watt TDP, and with the dual Xeon E5-2680 in my workstation, which have a 130 watt TDP each for a combined <em>260 watt TDP</em>.
For this comparison, I think using the max TDP of each processor is a relatively fair thing to do, since Takua Renderer pushes each CPU to 100% utilization for sustained periods of time.
So, the real story here from an energy perspective is that the Raspberry Pi 4B was between 12 and 18 times slower than the dual-Xeon workstation, but the Raspberry Pi 4B also has a TDP that is <em>65x lower</em> than the dual-Xeon workstation.
Similarly, the Raspberry Pi 4B nearly matches the 2015 MacBook Air, but with a TDP that is 3.75x lower!</p>
<p>When factoring in energy utilization, the numbers get even more interesting once we look at total energy used across the whole render.
We can get the total energy used for each render by multiplying the wall clock render time with the TDP of each processor (again, we’re assuming 100% processor utilization during each render); this gives us total energy used in watt-seconds, which we divide by 3600 seconds per hour to get watt-hours:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">CORNELL BOX</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1024x1024, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">0.4895 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">1.1336 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">2.6450 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">TEA CUP</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, VCM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">2.4500 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">9.3214 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">12.6297 Wh</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">BEDROOM</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">1920x1080, PT</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: right">Processor:</th>
<th style="text-align: center">Max TDP:</th>
<th style="text-align: left">Total Energy Used:</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">Broadcom BCM2711:</td>
<td style="text-align: center">4 W</td>
<td style="text-align: left">6.2819 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Core i5-5250U:</td>
<td style="text-align: center">15 W</td>
<td style="text-align: left">20.4189 Wh</td>
</tr>
<tr>
<td style="text-align: right">Intel Xeon E5-2680 x2:</td>
<td style="text-align: center">260 W</td>
<td style="text-align: left">22.4142 Wh</td>
</tr>
</tbody>
</table>
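<p>As a quick worked example of the calculation above: the Raspberry Pi 4B’s Cornell Box render took 440.627 seconds at a 4 watt TDP, which works out to (440.627 s × 4 W) / 3600 ≈ 0.4895 Wh, matching the first entry in the Cornell Box table.</p>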
<p>From the numbers above, we can see that even though the Raspberry Pi 4B is a lot slower than the dual-Xeon workstation in wall clock time, the Raspberry Pi 4B absolutely crushes both the 2015 MacBook Air and the dual-Xeon workstation in terms of energy efficiency.
To render the same image, the Raspberry Pi 4B used between approximately 3.5x to 5.5x <em>less</em> energy overall than the dual-Xeon workstation, and used between approximately 2.3x to 3.8x less energy than the 2015 MacBook Air.
It’s also worth noting that the 2015 MacBook Air cost $899 when it first launched (and the processor had a recommended price from Intel of $315), and the dual-Xeon workstation cost… I don’t actually know.
I bought the dual-Xeon workstation used for a pittance when my employer retired it, so I don’t know how much it actually cost new.
But, I do know that the processors in the dual-Xeon had a recommended price from Intel of $1723… <em>each</em>, for a total of $3446 when they were new.
In comparison, the Raspberry Pi 4B with 4 GB of RAM costs about $55 for the entire computer, and the processor cost… well, the actual price for most ARM processors is not ever publicly disclosed, but since a baseline Raspberry Pi 4B costs only $35, the processor can’t have cost more than a few dollars at most, possibly even under a dollar.</p>
<p>I think the main takeaway from these performance comparisons is that even back with 2015 technology, even though most arm64 processors were slower in absolute terms compared to their x86-64 counterparts, the single-threaded performance was already shockingly close, and arm64 energy usage per compute unit and price already were leaving x86-64 in the dust.
Fast forward to the present day in 2021, where we have seen Apple’s arm64-based M1 chip take the absolute performance crown in its category from all x86-64 competitors, at both a lower energy utilization level and a lower price.
The even wilder thing is: the M1 is likely the slowest desktop arm64 chip that Apple will ever ship, and arm64 processors from NVIDIA and Samsung and Qualcomm and Broadcom won’t be far behind in the consumer space while Amazon and Ampere and other companies are also introducing enormous, extremely powerful arm64 chips in the high end server space.
Intel and (especially) AMD aren’t sitting still in the x86-64 space either though.
The next few years are going to be very interesting; no matter what happens, on x86-64 or on arm64, Takua Renderer is now ready to be there!</p>
<p><strong>Conclusion to Part 1</strong></p>
<p>Through the process of porting to arm64 on Linux, I learned a lot about the arm64 architecture and how it differs from x86-64, and I also found a couple of good reminders about topics like memory ordering and how floating point works.
Originally I thought that my post on porting Takua to arm64 would be nice, short, and fast to write, but instead here we are some 17,000 words later and I have not even gotten to porting Takua to arm64 on macOS and Apple Silicon yet!
So, I think we will stop here for now and save the rest for an upcoming Part 2.
In Part 2, I’ll write about the process of porting to arm64 on macOS and Apple Silicon, about how to create Universal Binaries, and I’ll examine Apple’s Rosetta 2 system for running x86-64 binaries on arm64.
Also, in Part 2 we’ll examine how Embree works on arm64 and compare arm64’s NEON vector extensions with x86-64’s SSE vector extensions, and we’ll finish with some additional miscellaneous differences between x86-64 and arm64 that need to be considered when writing C++ code for both architectures.</p>
<p><strong>Acknowledgements</strong></p>
<p>Thanks so much to <a href="http://rgba32.blogspot.com">Mark Lee</a> and <a href="http://rendering-memo.blogspot.com">Wei-Feng Wayne Huang</a> for puzzling through some of the <code class="language-plaintext highlighter-rouge">std::compare_exchange_weak()</code> stuff with me.
Thanks a ton to <a href="https://twitter.com/superfunc">Josh Filstrup</a> for proofreading and giving feedback and suggestions on this post pre-release!
Josh was the one who told me about the <a href="https://herbie.uwplse.org">Herbie</a> tool mentioned in the floating point section, and he made an interesting suggestion about using <a href="https://egraphs-good.github.io">e-graph analysis</a> to better understand floating point behavior.
Also Josh pointed out SuperH as an example of a variable width RISC architecture, which of course he would because he knows all there is to know about the Sega Dreamcast.
Finally, thanks to my wife, <a href="http://harmonymli.com">Harmony Li</a>, for being patient with me while I wrote up this monster of a blog post and for also puzzling through some of the technical details with me.</p>
<p><strong>References</strong></p>
<p>Pontus Andersson, Jim Nilsson, Tomas Akenine-Möller, Magnus Oskarsson, Kalle Åström, and Mark D. Fairchild. 2020. <a href="https://doi.org/10.1145/3406183">FLIP: A Difference Evaluator for Alternating Images</a>. <em>ACM Transactions on Graphics</em>. 3, 2 (2020), 15:1-15:23.</p>
<p>ARM Holdings. 2016. <a href="https://developer.arm.com/documentation/uan0015/b/">Cortex-A57 Software Optimization Guide</a>. Retrieved May 12, 2021.</p>
<p>ARM Holdings. 2021. <a href="https://developer.arm.com/documentation/ddi0487/ga">Arm Architecture Reference Manual Armv8, for Armv8-A Architecture Profile, Version G.a</a>. Retrieved May 14, 2021.</p>
<p>ARM Holdings. 2021. <a href="https://developer.arm.com/documentation/ddi0557/latest/">Arm Architecture Reference Manual Supplement ARMv8.1, for ARMv8-A Architecture Profile, Version: A.b</a>. Retrieved May 14, 2021.</p>
<p>Brandon Castellano. 2015. <a href="https://superuser.com/a/934755">SuperUser Answer to “Do ARM Processors like Cortex-A9 Use Microcode?”</a>. Retrieved May 12, 2021.</p>
<p>Jim Cownie. 2021. <a href="https://cpufun.substack.com/p/atomics-in-aarch64">Atomics in AArch64</a>. In <em>CPU Fun</em>. Retrieved May 14, 2021.</p>
<p>CppReference. 2021. <a href="https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange"><code class="language-plaintext highlighter-rouge">std::atomic<T>::compare_exchange_weak</code></a>. Retrieved April 02, 2021.</p>
<p>CppReference. 2021. <a href="https://en.cppreference.com/w/cpp/atomic/memory_order"><code class="language-plaintext highlighter-rouge">std::memory_order</code></a>. Retrieved March 20, 2021.</p>
<p>Intel Corporation. 2021. <a href="https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4.html">Intel 64 and IA-32 Architectures Software Developer’s Manual</a>. Retrieved April 02, 2021.</p>
<p>Bruce Dawson. 2020. <a href="https://randomascii.wordpress.com/2020/11/29/arm-and-lock-free-programming/">ARM and Lock-Free Programming</a>. In <em>Random ASCII</em>. Retrieved April 15, 2021.</p>
<p>Glenn Fiedler. 2008. <a href="https://gafferongames.com/post/floating_point_determinism/">Floating Point Determinism</a>. In <em>Gaffer on Games</em>. Retrieved April 20, 2021.</p>
<p>David Goldberg. 1991. <a href="https://doi.org/10.1145/103162.103163">What Every Computer Scientist Should Know About Floating-Point Arithmetic</a>. <em>ACM Computing Surveys</em>. 23, 1 (1991), 5-48.</p>
<p>Martin Geupel. 2018. <a href="https://www.racoon-artworks.de/cgbasics/bucket_progressive.php">Bucket and Progressive Rendering</a>. In <em>CG Basics</em>. Retrieved May 12, 2021.</p>
<p>Phillip Johnston. 2020. <a href="https://embeddedartistry.com/blog/2017/10/11/demystifying-arm-floating-point-compiler-options/">Demystifying ARM Floating Point Compiler Options</a>. In <em>Embedded Artistry</em>. Retrieved April 20, 2021.</p>
<p>Yossi Kreinin. 2008. <a href="http://yosefk.com/blog/consistency-how-to-defeat-the-purpose-of-ieee-floating-point.html">Consistency: How to Defeat the Purpose of IEEE Floating Point</a>. In <em>Proper Fixation</em>. Retrieved April 20, 2021.</p>
<p>Günter Obiltschnig. 2006. <a href="https://www.appinf.com/download/FPIssues.pdf">Cross-Platform Issues with Floating-Point Arithmetics in C++</a>. In <em>ACCU Conference 2006</em>.</p>
<p>David A. Patterson and David R. Ditzel. 1980. <a href="https://doi.org/10.1145/641914.641917">The Case for the Reduced Instruction Set Computer</a>. <em>ACM SIGARCH Computer Architecture News</em>. 8, 6 (1980), 25-33.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120515/memory-reordering-caught-in-the-act/">Memory Reordering Caught in the Act</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120612/an-introduction-to-lock-free-programming/">An Introduction to Lock-Free Programming</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120625/memory-ordering-at-compile-time/">Memory Ordering at Compile Time</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/">Memory Barriers Are Like Source Control Operations</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120913/acquire-and-release-semantics/">Acquire and Release Semantics</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20120930/weak-vs-strong-memory-models/">Weak vs. Strong Memory Models</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>Jeff Preshing. 2012. <a href="https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/">This Is Why They Call It a Weakly-Ordered CPU</a>. In <em>Preshing on Programming</em>. Retrieved March 20, 2021.</p>
<p>The Rust Team. 2021. <a href="https://doc.rust-lang.org/nomicon/atomics.html">Atomics</a>. In <em>The Rustonomicon</em>. Retrieved March 20, 2021.</p>
<p>Michael E. Thomadakis. 2011. <a href="https://www.researchgate.net/publication/235960679_The_Architecture_of_the_Nehalem_Processor_and_Nehalem-EP_SMP_Platforms">The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms</a>. JFE Technical Report. Texas A&M University.</p>
<p>Kyrylo Tkachov. 2020. <a href="https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/making-the-most-of-the-arm-architecture-in-GCC-10">Making the Most of the Arm Architecture with GCC 10</a>. In <em>ARM Tools, Software, and IDEs Blog</em>. Retrieved May 14, 2021.</p>
<p>Vincent M. Weaver and Sally A. McKee. 2009. <a href="https://doi.org/10.1109/ICCD.2009.5413117">Code Density Concerns for New Architectures</a>. In <em>2009 IEEE International Conference on Computer Design</em>. 459-464.</p>
<p>WikiBooks. 2021. <a href="https://en.wikibooks.org/wiki/Microprocessor_Design/Instruction_Decoder">Microprocessor Design: Instruction Decoder</a>. Retrieved May 12, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Complex_instruction_set_computer">Complex Instruction Set Computer</a>. Retrieved April 05, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/CPU_cache#Policies">CPU Cache</a>. Retrieved March 20, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Extended_precision#x86_extended_precision_format">Extended Precision</a>. Retrieved April 20, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Control_unit#Hardwired_control_unit">Hardwired Control Unit</a>. Retrieved May 12, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE 754</a>. Retrieved April 20, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Intel_8087">Intel 8087</a>. Retrieved April 20, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Microcode">Micro-Code</a>. Retrieved May 12, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.m.wikipedia.org/wiki/Micro-operation">Micro-Operation</a>. Retrieved May 10, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/Reduced_instruction_set_computer">Reduced Instruction Set Computer</a>. Retrieved April 05, 2021.</p>
<p>Wikipedia. 2021. <a href="https://en.wikipedia.org/wiki/SuperH">SuperH</a>. Retrieved June 02, 2021.</p>
https://blog.yiningkarlli.com/2021/05/responsive-layout.html
New Responsive Layout and Blog Plans
2021-05-18T00:00:00+00:00
2021-05-18T00:00:00+00:00
Yining Karl Li
<p>I recently noticed that my blog and personal website’s layout looked really bad on mobile devices and in smaller browser windows.
When I originally created the current layout for this blog and for my personal website back in 2013, I didn’t really design the layout with mobile in mind whatsoever.
Back in 2013, responsive web design had only just started to take off, and being focused entirely on renderer development and computer graphics, I wasn’t paying much attention to the web design world!
I then proceeded to not notice at all how bad the layout on mobile and in small windows was because… well, I don’t really visit my own website and blog very much, because why would I?
I know everything that’s on them already!</p>
<p>Well, I finally visited my site on my iPhone, and immediately noticed how terrible the layout looked.
On an iPhone, the layout was just the full desktop browser layout shrunk down to an unreadable size!
So, last week, I spent two evenings extending the current layout to incorporate responsive web design principles.
Responsive web design principles call for a site’s layout to adjust itself according to the device and window size such that the site renders in a way that is maximally readable in a variety of different viewing contexts.
Generally this means that content and images and stuff should resize so that it’s always at a readable size, and elements on the page should be laid out on a fluid grid that can reflow instead of sitting at fixed locations.</p>
<p>Here is how the layout used by my blog and personal site used to look on an iPhone 11 display, compared with how the layout looks now with modern responsive web design principles implemented:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/mobile_before_after.png"><img src="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/preview/mobile_before_after.png" alt="Figure 1: Old layout (left) vs. new responsive layout (right) in Safari on an iPhone 11." /></a></p>
<p>So why did I bother with implementing these improvements to my blog and personal site now, some eight years after I first deployed the current layout and current version of the blog?
To answer this (self-asked) question, I want to first write a bit about how the purpose of this blog has evolved over the years.
I originally started this blog back when I first started college, and it originally didn’t have any clear purpose.
If anything, starting a blog really was just an excuse to rewrite and expand a custom content management system that I had written in PHP 5 back in high school.
Sometime in late 2010, as I got more interested in computer graphics, this blog became something of a personal journal to document my progress in exploring computer graphics.
Around this time I also decided that I wanted to focus all of my attention on computer graphics, so I dropped most of the web-related projects I had at the time and moved this blog from my own custom CMS to Blogger.
In grad school, I started to experiment with writing longer-form posts; for the first time for this blog, these posts were written primarily with a reader other than my future self in mind.
In other words, this is the point where I actually started to write posts intended for an external audience.
At this point I also moved the blog from Blogger to running on Jekyll hosted through Github Pages, and that’s when the first iterations of the current layout were put into place.</p>
<p>Fast forward to today; I’ve now been working at Disney Animation for six years, and (to my constant surprise) this blog has picked up a small but steady readership in the computer graphics field!
The purpose I see for this blog now is to provide high quality, in-depth writeups of whatever projects I find interesting, with the hope that 1. my friends and colleagues and other folks in the field will find the posts similarly interesting and 2. that the posts I write can be informative and inspiring for aspiring students that might stumble upon this blog.
When I was a student, I drew a lot of inspiration from reading a lot of really cool computer graphics and programming blogs, and I want to be able to give back the same to future students!
Similarly, my personal site, which uses an extended version of the blog’s layout, now serves primarily as a place to collect and showcase projects that I’ve worked on with an eye towards hopefully inspiring other people, as opposed to serving as a tool to get recruited.</p>
<p>I post much less frequently now than when I was in school, but that’s because I put far more thought and effort into each post; while the rate at which new posts appear has slowed down, I like to think that I’ve vastly improved both the quality and quantity of content within each post.
I recently ran <code class="language-plaintext highlighter-rouge">wc -w</code> on the blog’s archives, which yielded some interesting numbers.
From 2014 to now, I’ve only written 38 posts, but these 38 posts total a bit over 96,000 words (which averages to roughly 2,500 words per post).
Contrast with 2010 through the end of 2013, when I wrote 78 posts that together total only about 28,000 words (which averages to roughly 360 words per post)!
Those early posts came frequently, but a lot of those early posts are basically garbage; I only leave them there so that new students can see that my stuff wasn’t very good when I started either.</p>
<p>When I put the current layout into place eight years ago, I wanted the layout to have as little clutter as possible and focus on presenting a clear, optimized reading experience.
I wanted computer graphics enthusiasts that come to read this blog to be able to focus on the content and imagery with as little distraction from the site’s layout as possible, and that meant keeping the layout as simple and minimal as possible while still looking good.
Since the main topic this blog focuses on is computer graphics, and obviously computer graphics is all about pictures and the code that generates those pictures (hence the name of the blog being “Code & Visuals”), I wanted the layout to allow for large, full-width images.
The focus on large full-width images is why the blog is single-column with no sidebars of any sort; in many ways, the layout is actually more about the images than the text, hence why text never wraps around an image either.
Over the years I have also added additional capabilities to the layout in support of computer graphics content, such as MathJax integration so that I can embed beautiful LaTeX math equations, and an embedded sliding image comparison tool so that I can show before/after images with a wiping interface.</p>
<p>So with all of the above in mind, the reason for finally making the layout responsive is simple: I want the blog to be as clear and as readable as I can reasonably make it, and that means clear and readable on <em>any</em> device, not just in a desktop browser with a large window!
I think a lot of modern “minimal” designs tend to use <em>too</em> much whitespace and sacrifice information and text density; a key driving principle behind my layout is to maintain a clean and simple look while still maintaining a reasonable level of information and text density.
However, the old non-responsive layout’s density in smaller viewports was just ridiculous; nothing could be read without zooming in a lot, which on phones then meant a lot of swiping both up/down and left/right just to read a single sentence.
For the new responsive improvements, I wanted to make everything readable in small viewports without any zooming or swiping left/right.
I think the new responsive version of the layout largely accomplishes this goal; here’s an animation of how the layout resizes as the content window shrinks, as applied to the landing page of my personal site:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/scaling.gif"><img src="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/preview/scaling.gif" alt="Figure 2: Animation of how the new layout changes as the window changes size." /></a></p>
<p>Adapting my layout to be responsive was surprisingly easy and straightforward!
My blog and personal site use the same layout design, but the actual implementations are a bit different.
The blog’s layout is a highly modified version of an old layout called <a href="https://github.com/kezzbracey/N-Coded">N-Coded</a>, which in turn is an homage to what <a href="https://ghost.org">Ghost</a>’s default <a href="https://github.com/TryGhost/Casper">Casper</a> layout looked like back in 2014 (Casper looks completely different today).
Since the blog’s layout inherited some bits of responsive functionality from the layout that I forked from, getting most things working just required updating, fixing, and activating some already existing but inactive parts of the CSS.
My personal site, on the other hand, reimplements the same layout using completely hand-written CSS instead of using the same CSS as the blog; the reason for this difference is because my personal site extends the design language of the layout for a number of more customized pages such as project pages, publication pages, and more.
Getting my personal site’s layout updated with responsive functionality required writing more new CSS from scratch.</p>
<p>I used to be fairly well versed in web stuff back in high school, but obviously the web world has moved on considerably since then.
I’ve forgotten most of what I knew back then anyway since it’s been well over a decade, so I kind of had to relearn a lot of things.
However, I guess a lot of things in programming are similar to riding a bicycle: once you learn, you never fully forget!
Relearning what I had forgotten was pretty easy, and I quickly figured out that the only really new thing I needed to understand for implementing responsive stuff was the CSS <code class="language-plaintext highlighter-rouge">@media</code> rule, which was introduced in 2009 but only gained full support across all major browsers in 2012.
For those totally unfamiliar with web stuff: the <code class="language-plaintext highlighter-rouge">@media</code> rule allows for checking things like the width and height and resolution of the current viewport and allows for specifying CSS rule overrides per media query.
Obviously this capability is super useful for responsive layouts; implementing responsive layouts really boils down to just making sure that positions are specified as percentages or relative positions instead of fixed positions and then using <code class="language-plaintext highlighter-rouge">@media</code> rules to make larger adjustments to the layout as the viewport size reaches different thresholds.
For example, I use <code class="language-plaintext highlighter-rouge">@media</code> rules to determine when to reorganize from a two-column layout into stacked single-column layout, and I also use <code class="language-plaintext highlighter-rouge">@media</code> rules to determine when to adjust font sizes and margins and stuff.
The other important part to implementing a responsive layout is to instruct the browser to set the width of the page to follow the screen-width of the viewing device on mobile.
The easiest way to implement this requirement by far is to just insert the following into every page’s HTML headers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><meta name="viewport" content="width=device-width">
</code></pre></div></div>
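<p>As a rough illustration of how these pieces fit together, here is a minimal sketch of a media query that collapses a wide, centered desktop column into a full-width single column on small viewports; the class name and breakpoint are made up for this example and are not the actual CSS this site uses:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Desktop and larger windows: content sits in a centered column. */
.content-column {
    width: 60%;
    margin: 0 auto;
}

/* Small viewports: collapse to a single full-width column and
   tighten margins so nothing requires horizontal scrolling. */
@media (max-width: 700px) {
    .content-column {
        width: 100%;
        margin: 0;
        padding: 0 1em;
    }
}
</code></pre></div></div>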
<p>For the most part, the new responsive layout actually doesn’t really noticeably change how my blog and personal site look on full desktop browsers and in large windows much, aside from some minor cleanups to spacing and stuff.
However, there is one big noticeable change: I got rid of the shrinking pinned functionality for the navbar.
Previously, as a user scrolled down, the header for my blog and personal site would shrink and gradually transform into a more compact version that would then stay pinned to the top of the browser window:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/old_header.gif"><img src="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/preview/old_header.gif" alt="Figure 3: Animation of how the old shrinking, pinned navbar worked." /></a></p>
<p>The shrinking pinned navbar functionality was implemented by using a small piece of JavaScript to read how far down the user had scrolled and dynamically adjusting the CSS for the navbar accordingly.
This feature was actually one of my favorite things that I implemented for my blog and site layout!
However, I decided to get rid of it because on mobile, space in the layout is already at a premium, and taking up space that otherwise could be used for content with a pinned navbar just to have my name always at the top of the browser window felt wasteful.
I thought about changing the navbar so that as the user scrolled down, the nav links would turn into a hidden menu accessible through a <a href="https://en.wikipedia.org/wiki/Hamburger_button">hamburger button</a>, but I personally don’t actually really like the additional level of indirection and complexity that hamburger buttons add.
So, the navbar is now just fixed and scrolls just like a normal element of each page:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/new_header.gif"><img src="https://blog.yiningkarlli.com/content/images/2021/May/responsive-layout/preview/new_header.gif" alt="Figure 4: Animation of how the new fixed navbar works." /></a></p>
<p>I think a fixed navbar is fine for now; I figure that if someone is already reading a post on my blog or something on my personal site, they’ll already know where they are and don’t need a big pinned banner with my name on it as a reminder.
However, if I start to find that scrolling up to reach nav links is getting annoying, I guess I’ll put some more thought into if I can come up with a design that I like for a smaller pinned navbar that doesn’t take up too much space in smaller viewports.</p>
<p>While I was in the code, I also made a few other small improvements to both the blog and my personal site.
On the blog, I made a small improvement for embedded code snippets: embedded code snippets now include line numbers on the side!
The line numbers are generated with a small bit of JavaScript but drawn entirely through CSS, so they don’t interfere with selecting and copying text out of the embedded code snippets; a rough sketch of the general technique follows after this paragraph.
On my personal site, removing the shrinking/pinning aspect of the navbar actually allowed me to completely remove almost all JavaScript includes on the site, aside from some analytics code.
On the blog, JavaScript is still present for some small things like the code line numbers, some caption features, MathJax, and analytics, but otherwise is at a bare minimum.</p>
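<p>The general idea behind the line numbers (the following is just a minimal sketch of the technique with made-up class names, not the blog’s exact CSS or markup) is to have the JavaScript wrap each code line in its own element, and then let CSS counters and ::before pseudo-elements draw the numbers; since pseudo-element content isn’t part of the document text, the numbers don’t get picked up when selecting and copying code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Assumes JavaScript wraps each line in <span class="code-line">
   inside <pre class="numbered">. */
pre.numbered { counter-reset: line; }
pre.numbered span.code-line { counter-increment: line; }
pre.numbered span.code-line::before {
    content: counter(line);  /* drawn by CSS, not part of the copyable text */
    display: inline-block;
    width: 2.5em;
    padding-right: 1em;
    text-align: right;
    color: #999;
    user-select: none;
}
</code></pre></div></div>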
<p>Over time I’d like to pare back the includes my layout uses even further to help improve load times even more.
One of the big motivators for moving my blog from Blogger to Jekyll was simply for page loading speed; under the hood Blogger is a big fancy dynamic CMS, whereas Jekyll just serves up static pages that are pre-generated once from Markdown files.
A few years ago, I similarly moved my personal site from using a simple dynamic templating engine I had written in PHP to instead be entirely 100% static; I now just write each page on my personal site directly as simple HTML and serve everything statically as well.
As a result, my personal site loads extremely fast!
My current layout definitely still has room for optimization though; currently, I use fonts from TypeKit because I like nice typography and having nice fonts like Futura and Proxima Nova is a big part of the overall “look” of the layout.
Fonts can add a lot of weight if not optimized carefully though, so maybe down the line I’ll need to streamline how fonts work in my layout.
Also, since the blog has a ton of images, I think updating the blog to use native browser lazy loading of images through the <code class="language-plaintext highlighter-rouge">loading="lazy"</code> attribute on <code class="language-plaintext highlighter-rouge">img</code> tags should help a lot with load speeds, but not all major browsers support this attribute yet.
Some day I’d like to get my site down to something as minimal and lightweight as <a href="https://macwright.com/2016/05/03/the-featherweight-website.html">Tom MacWright’s blog</a>, but still, for now I think things are in decent shape.</p>
<p>If for some reason you’re curious to see how all of the improvements mentioned in this post are implemented, the source code for both my blog and my personal site are available on my Github.
Please feel free to either steal any bits of the layout that you may find useful, or if you want, feel free to even fork the entire layout to use as a basis for your own site.
Although, if you do fork the entire layout, I would suggest and really prefer that you put some effort into personalizing the layout and really making it your own instead of just using it exactly as how I have it!</p>
<p>Hopefully this is the last time for a very long while that I’ll write a blog post about the blog itself; I’m an excruciatingly slow writer these days, but I currently have the largest number of posts simultaneously nearing completion that I’ve had in a long time, and I’ll be posting them soon.
As early as later this week I’ll be posting the first part of a two-part series about porting Takua Renderer to 64-bit ARM; that post will include a deep dive into some fun concurrency and atomics-related problems at the x86-64 and arm64 assembly level.
The second part of this series should come soon too, and over the summer I’m also hoping to finish posts about hex-tiling in Takua and on implementing/using different light visibility modes.
Staying at home during the pandemic has also given me time to slowly chip away at the long-delayed second and third parts of what was supposed to be a series on mipmapped tiled texture caching, so with some luck maybe those posts will finally appear this year too.
Beyond that, I’ve started some very initial steps on new next-generation from-the-ground-up reimplementations of Takua in CUDA/Optix and in Metal, and I’ve started to dip my toes into Rust as well, so who knows, maybe I’ll have stuff to write about that too in the future!</p>
https://blog.yiningkarlli.com/2021/04/magic-shop-renderman-challenge.html
Magic Shop RenderMan Art Challenge
2021-04-12T00:00:00+00:00
2021-04-12T00:00:00+00:00
Yining Karl Li
<div>
<p>Last fall, I participated in my third Pixar’s RenderMan Art Challenge, “Magic Shop”!
I wasn’t initially planning on participating this time around due to not having as much free time on my hands, but after taking a look at the provided assets for this challenge, I figured that it looked fun and that I could learn some new things, so why not?
Admittedly participating in this challenge is why some technical content I had planned for this blog in the fall wound up being delayed, but in exchange, here’s another writeup of some fun CG art things I learned along the way!
This RenderMan Art Challenge followed the same format as usual: Pixar <a href="https://renderman.pixar.com/magic-shop-asset">supplied some base models</a> without any uvs, texturing, shading, lighting, etc, and participants had to start with the supplied base models and come up with a single final image.
Unlike in previous challenges though, this time around Pixar also provided a rigged character in the form of the popular open-source <a href="https://www.facebook.com/mathildarig">Mathilda Rig</a>, to be incorporated into the final entry somehow.
Although my day job involves rendering characters all of the time, I have really limited experience with working with characters in my personal projects, so I got to try some new stuff!
Considering that my time spent on this project was far more limited than on previous RenderMan Art Challenges, and considering that I didn’t really know what I was doing with the character aspect, I’m pretty happy that my final entry <a href="https://renderman.pixar.com/news/renderman-magic-shop-art-challenge-final-results">won third place in the contest</a>!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/magicshop_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/magicshop_full.jpg" alt="Figure 1: My entry to Pixar's RenderMan Magic Shop Art Challenge, titled "Books are Magic". Click for 4K version. Mathilda model by Xiong Lin and rig by Leon Sooi. Pixar models by Eman Abdul-Razzaq, Grace Chang, Ethan Crossno, Siobhán Ensley, Derrick Forkel, Felege Gebru, Damian Kwiatkowski, Jeremy Paton, Leif Pedersen, Kylie Wijsmuller, and Miguel Zozaya © Disney / Pixar - RenderMan "Magic Shop" Art Challenge." /></a></p>
<p><strong>Character Explorations</strong></p>
<p>I originally wasn’t planning on entering this challenge, but I downloaded the base assets anyway because I was curious about playing with the rigged character a bit.
I discovered really quickly that the Mathilda rig is reasonably flexible, but the flexibility meant that the rig can go off model really fast, and also the face can get really creepy really fast.
I think part of the problem is just the overall character design; the rig is based on a young Natalie Portman’s character from the movie Léon: The Professional, and the character in that movie is… something of an unusual character, to say the least.
The model itself has a head that’s proportionally a bit on the large side, and the mouth is especially large, which is part of why the facial rig gets so creepy so fast.
One of the first things I discovered was that I had to scale down the rig’s mouth and teeth a bit just to bring things back into more normal proportions.</p>
<p>After playing with the rig for a few evenings, I started thinking about what I should make if I did enter the challenge after all.
I’ve gotten a lot busier recently with personal life stuff, so I knew I wasn’t going to have as much time to spend on this challenge, which meant I needed to come up with a relatively straightforward simple concept and carefully choose what aspects of the challenge I was going to focus on.
I figured that most of the other entries into the challenge were going to use the provided character in more or less its default configuration and look, so I decided that I’d try to take the rig further away from its default look and instead use the rig as a basis for a bit of a different character.
The major changes I wanted to make to take the rig away from its default look were to add glasses, completely redo the hair, simplify the outfit, and shade the outfit completely differently from its default appearance.</p>
<p>With this plan in mind, the first problem I tackled was creating a completely new hairstyle for the character.
The last time I did anything with making CG hair was about a decade ago, and I did a terrible job back then, so I wanted to figure out how to make passable CG hair first because I saw the hair as basically a make-or-break problem for this entire project.
To make the hair in this project, I chose to use Maya’s XGen plugin, which is a generator for arbitrary primitives, including but not limited to curves for things like hair and fur.
I chose to use XGen in part because it’s built into Maya, and also because I already have some familiarity with XGen thanks to my day job at Disney Animation.
XGen was originally developed at Disney Animation <a href="https://dl.acm.org/doi/10.1145/965400.965411">[Thompson et al. 2003]</a> and is used extensively on Disney Animation feature films; Autodesk licensed XGen from Disney Animation and incorporated XGen into Maya’s standard feature set in 2011.
XGen’s origins as a Disney Animation technology explain why XGen’s authoring workflow uses Ptex <a href="https://doi.org/10.1111/j.1467-8659.2008.01253.x">[Burley and Lacewell 2008]</a> for maps and SeExpr <a href="https://wdas.github.io/SeExpr/">[Walt Disney Animation Studios 2011]</a> for expressions.
Of course, since 2011, the internal Disney Animation version of XGen has developed along its own path and gained capabilities and features <a href="https://dl.acm.org/citation.cfm?id=2927466">[Palmer and Litaker 2016]</a> beyond Autodesk’s version of XGen, but the basics are still similar enough that I figured I wouldn’t have too difficult of a time adapting.</p>
<p>I found a great intro to XGen course from <a href="https://jesusfc.net">Jesus FC</a>, which got me up and running with guides/splines XGen workflow.
I eventually found that the workflow that worked best for me was to actually model sheets of hair using just regular polygonal modeling tools, and then use the modeled polygonal sheets as a base surface to help place guide curves on to drive the XGen splines.
After a ton of trial and error and several restarts from scratch, I finally got to something that… admittedly still was not very good, but at least was workable as a starting point.
One of the biggest challenges I kept running into was making sure that different “planes” of hair didn’t intersect each other; intersections produce grooms that look okay at first glance but look unnatural upon anything more than a moment’s inspection.
Here are some early drafts of the custom hair groom:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/hair_test.003.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/hair_test.003.jpg" alt="Figure 2: Early iteration of a custom hair groom for the character, with placeholder glasses." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/hair_test.004.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/hair_test.004.jpg" alt="Figure 3: Another early iteration of a custom hair groom for the character, with pose test and with placeholder glasses." /></a></p>
<p>To shade the hair, I used RenderMan’s PxrMarschnerHair shader, driven using RenderMan’s PxrHairColor node.
PxrHairColor implements d’Eon et al. <a href="https://doi.org/10.1111/j.1467-8659.2011.01976.x">[2011]</a>, which allows for realistic hair colors by modeling melanin concentrations in hair fibers, and PxrMarschnerHair <a href="http://graphics.pixar.com/library/PxrMaterialsCourse2017/index.html">[Hery and Ling 2017]</a> implements a version of the classic Marschner et al. <a href="https://doi.org/10.1145/882262.882345">[2003]</a> hair model improved using adaptive importance sampling <a href="https://graphics.pixar.com/library/DataDrivenHairScattering/">[Pekelis et al. 2015]</a>.
In order to really make hair look good, some amount of randomization and color variation between different strands is necessary; PxrHairColor supports randomization and separately coloring stray flyaway hairs based on primvars.
In order to use the randomization features, I had to remember to check off the “id” and “stray” boxes under the “Primitive Shader Parameters” section of XGen’s Preview/Output tab.
Overall I found the PxrHairColor/PxrMarschnerHair system a little bit difficult to use; figuring out how a selected melanin color maps to a final rendered look isn’t exactly 1-to-1 and requires some getting used to.
This difference in authored hair color and final rendered hair color happens because the authored hair color is the color of a single hair strand, whereas the final rendered hair color is the result of multiple scattering between many hair strands combined with azimuthal roughness.
Fortunately, hair shading should get easier in future versions of RenderMan, which are supposed to ship with an implementation of Disney Animation’s artist-friendly hair model <a href="https://doi.org/10.1111/cgf.12830">[Chiang et al. 2016]</a>.
The Chiang model uses a color re-parameterization that allows for the final rendered hair color to closely match the desired authored color by remapping the authored color to account for multiple scattering and azimuthal roughness; this hair model is what we use in Disney’s Hyperion Renderer of course, and is also implemented in Redshift and is the basis of VRay’s modern VRayHairNextMtl shader.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/hair_test.006.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/hair_test.006.jpg" alt="Figure 4: More progressed iteration of a custom hair groom for the character, with final glasses." /></a></p>
<p><strong>Skin Shading and Subsurface Scattering</strong></p>
<p>For shading the character’s skin, the approach I took was to use the rig’s default textures as a starting point, modify heavily to get the textures that I actually wanted, and then use the modified textures to author new materials using PxrSurface.
The largest changes I made to the supplied skin textures are in the maps for subsurface; I basically had to redo everything to provide better inputs to subsurface color and mean free path to get the look that I wanted, since I used PxrSurface’s subsurface scattering set to exponential path-traced mode.
I generally like the controllability and predictability that path-traced SSS brings, but RenderMan 23’s PxrSurface implementation includes a whole bunch of different subsurface scattering modes, and the reason for this is interesting and worth briefly discussing.</p>
<p>Subsurface scattering models how light penetrates the surface of a translucent object, bounces around and scatters inside of the object, and exits at a different surface point from where it entered; this effect is exhibited by almost all organic and non-conductive materials to some degree.
Subsurface scattering has been supported in renderers for a long time; strong subsurface scattering support was actually a standout feature for RenderMan as early as 2002/2003ish <a href="https://graphics.pixar.com/library/RMan2003/">[Hery 2003]</a>, when RenderMan was still a REYES rasterization renderer.
Instead of relying on brute-force path tracing, earlier subsurface scattering implementations relied on diffusion approximations, which approximate the effect of light scattering around inside of an object by modeling the aggregate behavior of scattered light over a simplified surface.
One popular way of implementing diffusion is through dipole diffusion <a href="https://dl.acm.org/doi/10.1145/383259.383319">[Jensen et al. 2001, </a> <a href="http://www.eugenedeon.com/project/a-better-dipole/">d’Eon 2012,</a> <a href="https://graphics.pixar.com/library/TexturingBetterDipole/">Hery 2012]</a> and another popular technique is through the normalized diffusion model <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015, </a> <a href="https://graphics.pixar.com/library/ApproxBSSRDF">Christensen and Burley 2015]</a> that was originally developed at Disney Animation for Hyperion.
These models are implemented in RenderMan 23’s PxrSurface as the “Jensen and d’Eon Dipoles” subsurface model and the “Burley Normalized” subsurface model, respectively.</p>
<p>Diffusion models were the state-of-the-art for a long time, but diffusion models require a number of simplifying assumptions to work; one of the fundamental key simplifications universal to all diffusion models is an assumption that subsurface scattering is taking place on a semi-infinite slab of material.
Thin geometry breaks this fundamental assumption, and as a result, diffusion-based subsurface scattering tends to lose more energy than it should in thin geometry.
This energy loss means that thin parts of geometry rendered with diffusion models tend to look darker than one would expect in reality.
Along with other drawbacks, this thin-geometry energy loss is one of the major reasons why most renderers have moved to brute-force path-traced subsurface scattering in the past half decade; avoiding diffusion’s artifacts is exactly the controllability and predictability that I mentioned earlier.
Subsurface scattering is most accurately simulated by brute-force path tracing within a translucent object, but brute-force path-traced subsurface scattering has only really become practical for production in the past 5 or 6 years; the two major obstacles were computational cost and the (until recently) lack of an intuitive, artist-friendly parameterization for apparent color and scattering distance.
Much like how the final color of a hair model is really the result of the color of individual hair fibers <em>and</em> the aggregate multiple scattering behavior between many hair strands, the final color result of subsurface scattering arises from a complex interaction between single-scattering albedo, mean free path, and numerous multiple scattering events.
So, much like how an artist-friendly, controllable hair model requires being able to invert an artist-specified final apparent color to produce internally-used scattering albedos (this process is called <em>albedo inversion</em>), subsurface scattering similarly requires an albedo inversion step to allow for artist-friendly controllable parameterizations.
The process of albedo inversion for diffusion models is relatively straightforward and can be computed using nice closed-form analytical solutions, but the same is not true for path-traced subsurface scattering.
A major key breakthrough to making path-traced subsurface scattering practical was the development of a usable data-fitted albedo inversion technique <a href="https://dl.acm.org/doi/10.1145/2897839.2927433">[Chiang et al. 2016]</a> that allows path-traced subsurface scattering and diffusion subsurface scattering to use the same parameterization and controls.
This technique was first developed at Disney Animation for Hyperion, and this technique was modified by Wrenninge et al. <a href="https://graphics.pixar.com/library/PathTracedSubsurface/">[2017]</a> and combined with additional support for anisotropic scattering and non-exponential free flight to produce the “Multiple Mean Free Paths” and “path-traced” subsurface models in RenderMan 23’s PxrSurface.</p>
<p>In my initial standalone lookdev test setup, something that took a while was dialing the subsurface back from looking too gummy while at the same time trying to preserve something of a glow-y look, since the final scene I had in mind would be very glow-y.
From both personal and production experience, I’ve found that one of the biggest challenges in moving from diffusion or point-based subsurface scattering solutions to brute-force path-traced subsurface scattering often is in having to readjust mean free paths to prevent characters from looking too gummy, especially in areas where the geometry gets relatively thin, because of the aforementioned thin geometry problem that diffusion models suffer from.
In order to compensate for energy loss and produce a more plausible result, parameters and texture maps for diffusion-based subsurface scattering are often tuned to overcompensate for energy loss in thin areas.
However, applying these same parameters to an accurate brute-force path tracing model that already models subsurface scattering in thin areas correctly results in overly bright thin areas, hence the gummier look.
Since I started with the supplied skin textures for the character model, and the original skin shader for the character model was authored for a different renderer that used diffusion-based subsurface scattering, the adjustments I had to make were specifically to fight this overly glow-y gummy look in path-traced mode when using parameters authored for diffusion.</p>
<p><strong>Clothes and Fuzz</strong></p>
<p>For the character’s clothes and shoes, I wanted to keep the outfit geometry to save time, but I also wanted to completely re-texture and re-shade the outfit to give it my own look.
I had a lot of trouble posing the character without getting lots of geometry interpenetration in the provided jacket, so I decided to just get rid of the jacket entirely.
For the shirt, I picked a sort of plaid flannel-y look for no other reason than I like plaid flannel.
The character’s shorts come with this sort of crazy striped pattern, which I opted to replace with a much more simplified denim shorts look.
I used Substance Painter for texturing the clothes; Substance Painter comes with a number of good base fabric materials that I heavily modified to get to the fabrics that I wanted.
I also wound up redoing the UVs for the clothing completely; my idea was to lay out the UVs similar to how the sewing patterns for each piece of clothing might work if they were made in reality; doing the UVs this way allowed for quickly getting the textures to meet up and align properly as if the clothes were actually sewn together from fabric panels.
A nice added bonus is that Substance Painter’s smart masks and smart materials often use UV seams as hints for effects like wear and darkening, and all of that basically just worked out of the box perfectly with sewing pattern styled UVs.</p>
<p>Bringing everything back into RenderMan though, I didn’t feel that the flannel shirt looked convincingly soft and fuzzy and warm.
I tried using PxrSurface’s fuzz parameter to get more of that fuzzy look, but the results still didn’t really hold up.
The reason the flannel wasn’t looking right ultimately has to do with what the fuzz lobe in PxrSurface is meant to do, and where the fuzzy look in real flannel fabric comes from.
PxrSurface’s fuzz lobe can only really approximate the look of fuzzy surfaces from a distance, where the fuzz is small enough relative to the viewing position that they can essentially be captured as an aggregate microfacet effect.
Even specialized cloth BSDFs really only hold up at a relatively far distance from the camera, since they all attempt to capture cloth’s appearance as an aggregated microfacet effect; an enormous body of research exists on this topic <a href="https://doi.org/10.1111/j.1467-8659.2011.01987.x">[Schröder et al. 2011</a>, <a href="https://doi.org/10.1145/2185520.2185571">Zhao et al. 2012</a>, <a href="https://doi.org/10.1145/2897824.2925932">Zhao et al. 2016</a>, <a href="https://doi.org/10.1111/cgf.13222">Allaga et al. 2017</a>, <a href="https://dl.acm.org/citation.cfm?id=3085024">Deshmukh et al. 2017</a>, <a href="https://doi.org/10.1145/3414685.3417777">Montazeri et al. 2020]</a>.
However, up close, the fuzzy look in real fabric isn’t really a microfacet effect at all- the fuzzy look really arises from multiple scattering happening between individual flyaway fuzz fibers on the surface of the fabric; while these fuzz fibers are very small to the naked eye, they are still a macro-scale effect when compared to microfacets.
The way feature animation studios such as Disney Animation and Pixar have made fuzzy fabric look really convincing over the past half decade is to… just actually cover fuzzy fabric geometry with actual fuzz fiber geometry <a href="https://dl.acm.org/citation.cfm?id=3214787">[Crow et al. 2018]</a>.
In the past few years, Disney Animation and Pixar and others have actually gone even further.
On Frozen 2, embroidery details and lace and such were built out of actual curves instead of displacement on surfaces <a href="https://dl.acm.org/doi/10.1145/3388767.3407360">[Liu et al. 2020]</a>.
On Brave, some of the clothing made from very coarse fibers was rendered entirely as ray-marched woven curves instead of as subdivision surfaces and shaded using a specialized volumetric scheme <a href="https://drive.google.com/file/d/1bNSwpPusRmRmGfPwe11tjtloCP96WN1P/view?usp=sharing">[Child 2012]</a>, and on Soul, many of the hero character outfits (including ones made of finer woven fabrics) are similarly rendered as brute-force path-traced curves instead of as subdivision surfaces <a href="http://graphics.pixar.com/library/CurveCloth/">[Hoffman et al. 2020]</a>.
Animal Logic similarly renders hero cloth as actual woven curves <a href="https://dl.acm.org/citation.cfm?id=3214781">[Smith 2018]</a>, and I wouldn’t be surprised if most VFX shops use a similar technique now.</p>
<p>Anyhow, in the end I decided to just bite the bullet in terms of memory and render speed and cover the flannel shirt in bazillions of tiny little actual fuzz fibers, instanced and groomed using XGen.
The fuzz fibers are shaded using PxrMarschnerHair and colored to match the fabric surface beneath.
I didn’t actually go as crazy as replacing the entire cloth surface mesh with woven curves; I didn’t have nearly enough time to write all of the custom software that would require, but fuzzy curves on top of the cloth surface mesh is a more-than-good-enough solution for the distance that I was going to have the camera at from the character.
The end result instantly looked vastly better, as seen in this comparison of before and after adding fuzz fibers:</p>
<div class="embed-container">
<iframe src="/content/images/2021/Apr/magicshop/comparisons/shirt_fuzznofuzzcompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 5: Shirt before (left) and after (right) XGen fuzz. For a full screen comparison, <a href="/content/images/2021/Apr/magicshop/comparisons/shirt_fuzznofuzzcompare.html">click here.</a></div>
<p>Putting fuzz geometry on the shirt actually worked well enough that I proceeded to do the same for the character’s shorts and socks as well.
For the socks especially having actual fuzz geometry really helped sell the overall look.
I also added fine peach fuzz geometry to the character’s skin as well, which may sound a bit extreme, but has actually been standard practice in the feature animation world for several years now; Disney Animation began adding fine peach fuzz on all characters on Moana <a href="https://www.yiningkarlli.com/projects/ptcourse2017.html">[Burley et al. 2017]</a>, and Pixar started doing so on Coco.
Adding peach fuzz to character skin ends up being really useful for capturing effects like rim lighting without the need for dedicated lights or weird shader hacks to get that distinct bright rim look; the rim lighting effect instead comes entirely from multiple scattering through the peach fuzz curves.
Since I wanted my character to be strongly backlit in my final scene, I knew that having good rim lighting was going to be super important, and using actual peach fuzz geometry meant that it all just worked!
Here is a comparison of my final character texturing/shading/look, backlit without and with all of the geometric fuzz.
The lighting setup is exactly the same between the two renders; the only difference is the presence of fuzz causing the rim effect.
This effect doesn’t happen when using only the fuzz lobe of PxrSurface!</p>
<div class="embed-container-square">
<iframe src="/content/images/2021/Apr/magicshop/comparisons/character_backlightcompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 6: Character backlit without and with fuzz. The rim lighting effect is created entirely by backlighting scattering through XGen fuzz on the character and the outfit. For a full screen comparison, <a href="/content/images/2021/Apr/magicshop/comparisons/character_backlightcompare.html">click here.</a> Click <a href="/content/images/2021/Apr/magicshop/character.003.jpg">here</a> and <a href="/content/images/2021/Apr/magicshop/character.004.jpg">here</a> to see the full 4K renders by themselves.</div>
<p>I used SeExpr expressions instead of using XGen’s guides/splines workflow to control all of the fuzz; the reason for using expressions was because I only needed some basic noise and overall orientation controls for the fuzz instead of detailed specific grooming.
Of course, adding geometric fuzz to all of a character’s skin and clothing does increase memory usage and render times, but not by as much as one might expect!
According to RenderMan’s stats collection system, adding geometric fuzz increased overall memory usage for the character by about 20%, and for the renders in Figure 8, adding geometric fuzz increased render time by about 11%.
Without the geometric fuzz, there are 40159 curves on the character, and with geometric fuzz the curve count increases to 1680364.
Even though there was a 41x increase in the number of curves, the total render time didn’t really increase by too much, thanks to logarithmic scaling of ray tracing with respect to input complexity.
In a rasterizer, adding 41x more geometry would slow the render down to a crawl due to the linear scaling of rasterization, but ray tracing makes crazy things like actual geometric fuzz not just possible, but downright practical.
Of course all of this can be made to work in a rasterizer with sufficiently clever culling and LOD and such, but in a ray tracer it all just works out of the box!</p>
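<p>As a rough back-of-the-envelope sanity check (assuming, as a simplification, that acceleration structure traversal cost grows roughly with the logarithm of the primitive count and ignoring everything else that contributes to render time), the expected growth in per-ray traversal cost from adding all of the fuzz is only about:</p>
<p>\[ \frac{\log_2(1680364)}{\log_2(40159)} \approx \frac{20.7}{15.3} \approx 1.35 \]</p>
<p>In other words, a 41x increase in curve count translates into only something like a third more traversal work per ray, which is in the same ballpark as the modest render time increase I actually measured and nowhere near a 41x slowdown.</p>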
<p>Here are a few closeup test renders of all of the fuzz:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/woolysocks.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/woolysocks.jpg" alt="Figure 7: Closeup test render of the fuzz on the woolly socks, along with the character's shoes." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/fuzzcloseup.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/fuzzcloseup.jpg" alt="Figure 8: Closeup test render of fuzz on the shirt and peach fuzz on the character's skin." /></a></p>
<p><strong>Layout, Framing, and Building the Shop</strong></p>
<p>After completing all of the grooming and re-shading work on the character, I finally reached a point where I felt confident enough in being able to make an okay looking character that I was willing to fully commit into entering this RenderMan Art Challenge.
I got to this decision really late in the process relative to previous challenges!
Getting to this point late meant that I had actually not spent a whole lot of time thinking about the overall set yet, aside from a vague notion that I wanted backlighting and an overall bright and happy sort of setting.
For whatever reason, “magic shop” and “gloomy dark place” are often associated with each other (and looking at many of the other competitors’ entries, that association definitely seemed to hold on this challenge too).
I wanted to steer away from “gloomy dark place”, so I decided I instead wanted more of a sunny magic bookstore with lots of interesting props and little details to tell an overall story.</p>
<p>To build my magic bookstore set, I wound up remixing the provided assets fairly extensively; I completely dismantled the entire provided magic shop set and used the pieces to build a new corner set that would emphasize sunlight pouring in through windows.
I initially was thinking of placing the camera up somewhere in the ceiling of the shop and showing a sort of overhead view of the entire shop, but I abandoned the overhead idea pretty quickly since I wanted to emphasize the character more (especially after putting so much work into the character).
Once I decided that I wanted a more focused shot of the character with lots of bright sunny backlighting, I arrived at an overall framing and even set dressing that largely stayed the same throughout the rest of the project, albeit with minor adjustments here and there.
Almost all of the props are taken from the original provided assets, with a handful of notable exceptions: in the final scene, the table and benches, telephone, and neon sign are my own models.
Figuring out where to put the character took some more experimentation; I originally had the character up front and center and sitting such that her side is facing the camera.
However, having the character up front and center made her feel not particularly integrated with the rest of the scene, so I eventually placed her behind the big table and changed her pose so that she’s sitting facing the camera.</p>
<p>Here are some major points along the progression of my layout and set dressing explorations:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress_frame.018.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress_frame.018.jpg" alt="Figure 9: First layout test with set dressing and posed character." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress_frame.026.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress_frame.026.jpg" alt="Figure 10: Rotating the character and moving her behind the table for better integration into the overall scene." /></a></p>
<p>One interesting change that I think had a huge impact on how the scene felt overall actually had nothing to do with the set dressing at all, but instead had to do with the camera itself.
At some point I tried pulling the camera back further from the character and using a much narrower lens, which had the overall effect of pulling the entire frame much closer and tighter on the character and giving everything an ever-so-slightly more orthographic feel.
I really liked how this lensing worked; to me it made the overall composition feel much more focused on the character.
Also around this point is when I started integrating the character with completed shading and texturing and fuzz into the scene, and I was really happy to see how well the peach fuzz and clothing fuzz worked out with the backlighting:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress_frame.032.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress_frame.032.jpg" alt="Figure 11: Focusing on the character by using a narrower lens on a camera placed further back. Also at this point I integrated the reshaded/retextured outfit and fuzz elements in." /></a></p>
<p>Once I had the overall blocking of the scene and rough set dressing done, the next step was to shade and texture everything!
Since my scene is set indoors, I knew that global illumination coming off of the walls and floor and ceiling of the room itself was going to play a large role in the overall lighting and look of the final image, so I started the lookdev process with the room’s structure itself.</p>
<p>The first decision to tackle was whether or not to have glass in the big window thing behind the character.
I didn’t really want to put glass in the window, since most of the light for the scene was coming through the window and having to sample the primary light source through glass was going to be really bad for render times.
Instead, I decided that the window was going to be an <em>interior</em> window opening up into some kind of sunroom on the other side, so that I could get away with not putting glass in.
The story I made up in my head was that the sunroom on the other side, being a sunroom, would be bright enough that I could just blow it out entirely to white in the final image.
To help sell the idea, I thought it would be fun to have some ivy or vines growing through the window’s diamond-shaped sections; maybe they’re coming from a giant potted plant or something in the sunroom on the other side.</p>
<p>I initially tried creating the ivy vines using SpeedTree, but I hadn’t really used SpeedTree extensively before and the vines toolset was completely unfamiliar to me.
Since I didn’t have a whole lot of time to work on this project overall, I wound up tabling SpeedTree on this project and instead opted to fall back on a (much) older but more familiar tool: <a href="http://ivy-generator.com">Thomas Luft’s standalone Ivy Generator program</a>.
After several iterations to get an ivy growth pattern that I liked, I textured and shaded the vines and ivy leaves using some atlases from Quixel Megascans.
The nice thing about adding in the ivy was that it helped break up how overwhelmingly bright the entire window was:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress_frame.037.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress_frame.037.jpg" alt="Figure 12: Scene with ivy vines integrated in to break up the giant background window. Also, at this point I had adjusted the camera lensing again to arrive at what was basically my final layout." /></a></p>
<p>For the overall look of the room, I opted for a sort-of Mediterranean look inspired by the architecture of the tower that came with the scene (despite the fact that the tower isn’t actually in my image).
Based on the Mediterranean idea, I wanted to make the windows out of a fired terracotta brick sort of material and, after initially experimenting with brick walls, I decided to go with stone walls.
To help sell the look of a window made out of stacked fired terracotta blocks, I added a bit more unevenness to the window geometry, and I used fired orange terracotta clay flower pots as a reference for what the fired terracotta material should look like.
To help break up how flat the window geometry is and to help give the blocks a more handmade look, I added unique color unevenness per block and also added a bunch of swirly and dimply patterns to the material’s displacement:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/window_terracotta.003.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/window_terracotta.003.jpg" alt="Figure 13: Lookdev test for the fired terracotta window blocks. All of the unevenness and swirly patterns are coming from roughness and displacement." /></a></p>
<p>To create the stone walls, I just (heavily) modified a preexisting stone material that I got off of Substance Source; the final look relies very heavily on displacement mapping since the base geometry is basically just a flat plane.
I made only the back wall a stone wall; I decided to make the side wall on the right out of plaster instead just so I wouldn’t have to figure out how to make two stone walls meet up at a corner.
I also wound up completely replacing the stone floor with a parquet wood floor, since I wanted some warm bounce coming up from the floor onto the character.
Each plank in the parquet wood floor is a piece of individual geometry.
Putting it all together, here’s what the shading for the room structure looks like:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/window_terracotta.004.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/window_terracotta.004.jpg" alt="Figure 14: Putting the room all together. The rock walls rely entirely on displacement, while the parquet floor uses individually modeled floorboards instead of displacement." /></a></p>
<p>The actual materials in my final image don’t look nearly as diffuse as they do in the above test render; my lookdev test setup’s lighting is relatively soft and diffuse, which turned out not to be a great predictor for my actual scene, since the actual scene’s lighting ended up being strongly backlit.
Also, note how all of the places where different walls meet each other and where the walls meet the floor are super janky; I didn’t bother putting much effort in there since I knew that those areas were either going to be outside of the final frame or were going to be hidden behind props and furniture.</p>
<p><strong>So Many Props!</strong></p>
<p>With the character and room completed, all that was left to do for texturing and shading was just lots and lots of props.
This part was both the easiest and most difficult part of the entire project- easy because all of the miscellaneous props were relatively straightforward to texture and shade, but difficult simply because there were a <em>lot</em> of props.
However, the props were also the funnest part of the whole project!
Thinking about how to make each prop detailed and interesting and unique was an interesting exercise, and I also had fun sneaking in a lot of little easter eggs and references to things I like here and there.</p>
<p>My process for texturing and shading props was a very straightforward workflow that is basically completely unchanged from the workflow I settled into on the previous Shipshape RenderMan Art Challenge: use Substance Painter for texturing, UDIM tiles for high resolution textures, and PxrSurface as the shader for everything.
The only difference from previous projects was that I used a far lazier UV mapping process: almost every prop was just auto-UV’d with some minor adjustments here and there.
I relied on auto-UVs this time simply because I didn’t have much time on this project and couldn’t afford to do precise, careful, high-quality UVs by hand for everything; since all of the props would be relatively small in image space in the final frame, I figured I could get away with hiding seams from the crappy UVs by just exporting really high-resolution textures from Substance Painter.
Yes, this approach is extremely inefficient, but it worked well enough considering how little time I had.</p>
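<p>As a rough illustration of just how lazy this pass was, an auto-UV sweep like this can be scripted in a few lines of Maya Python. This is only a sketch, assuming the props live under a hypothetical group named “props_grp”; it is not the exact set of steps I took:</p>
<pre><code>import maya.cmds as cmds

# Minimal sketch of a lazy auto-UV pass over every prop mesh.
# Assumes the props are parented under a group named "props_grp" (hypothetical).
for shape in cmds.listRelatives("props_grp", allDescendents=True, type="mesh") or []:
    prop = cmds.listRelatives(shape, parent=True)[0]
    # Automatic projection UVs; seams get hidden later by exporting
    # high-resolution textures from Substance Painter.
    cmds.polyAutoProjection(prop, layoutMethod=1, scaleMode=1, percentageSpace=0.2)
</code></pre>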
<p>Since a lot of bounce lighting on the character’s face was going to have to come from the table, the first props I textured and shaded were the table and accompanying benches.
I tried to make the table and bench match each other; they both use a darker wood for the support legs and have metal bits in the frame, and have a lighter wood for the top.
I think I got a good amount of interesting wear and stuff on the benches on my first attempt, but getting the right amount of wear on the table’s top took a couple of iterations to get right.
Again, due to how diffuse my lookdev test setup on this project was, the detail and wear in the table’s top showed up better in my final scene than in these test renders:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/bench.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/bench.jpg" alt="Figure 15: Bench with dark wood legs, metal diagonal braces, and lighter wood top." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/table.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/table.jpg" alt="Figure 16: Main table with chiseled dark wood legs, a metal underframe, a lighter wood top, and gold inlaid runes on the side." /></a></p>
<p>To have a bit of fun and add a slight tiny hint of mystery and magic into the scene, I put some inlaid gold runes into the side of the table.
The runes are a favorite scifi/fantasy quote of mine, which is an inversion of Clarke’s third law.
They read: “any sufficiently rigorously defined magic is indistinguishable from technology”; this quote became something of a driving theme for the props in the scene.
I wanted to give a sense that this shop is a bookshop specializing in books about magic, but the magic of this world is not arbitrary and random; instead, this world’s magic has been studied and systematized into almost another branch of science.</p>
<p>A lot of the props did require minor geometric modifications to make them more plausible.
For example, the cardboard box was originally made entirely out of single-sided surfaces with zero thickness; I had to extrude the surfaces of the box in order to have enough thickness to seem convincing.
There’s not a whole lot else interesting to write about with the cardboard box; it’s just corrugated cardboard.
Although, I do have to say that I am pretty happy with how convincingly cardboard the cardboard boxes came out!
Similarly, the scrolls just use a simple paper texture and, as one would expect with paper, use some diffuse transmission as well.
Each of the scrolls has a unique design, which provided an opportunity for some fun personal easter eggs.
Two of the scrolls have some SIGGRAPH paper abstracts translated into the same runes that the inlay on the table uses.
One of the scrolls has a wireframe schematic of the wand prop that sits on the table in the final scene; my idea was that this scroll is one of the technical schematics that the character used to construct her wand.
To fit with this technical schematic idea, the two sheets of paper in the background on the right wall use the same paper texture as the scrolls and similarly have technical blueprints on them for the record player and camera props.
The last scroll in the box is a city map made using <a href="https://github.com/watabou/TownGeneratorOS">Oleg Dolya’s wonderful Medieval Fantasy City Generator</a> tool, which is a fun little tool that does exactly what the name suggests and with which I’ve wasted more time than I’d like to admit generating and daydreaming about made up little fantasy towns.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/box.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/box.jpg" alt="Figure 17: Corrugated cardboard box containing technical magic scrolls and maps." /></a></p>
<p>The next prop I worked on was the mannequin, which was even more straightforward than the cardboard box and scrolls.
For the mannequin’s wooden components, I relied entirely on triplanar projections in Substance Painter oriented such that the grain of the wood would flow correctly along each part.
The wood material is just a modified version of a default Substance Painter smart material, with additional wear and dust and stuff layered on top to give everything a bit more personality:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/mannequin.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/mannequin.jpg" alt="Figure 18: Mannequin prop made from wood and metal." /></a></p>
<p>The record player was a fun prop to texture and shade, since there were a lot of components and a lot of room for adding little details and touches.
I found a bunch of reference online for briefcase record players and, based off of the reference, I chose to make the actual record player part of the briefcase out of metal, black leather, and black plastic.
The briefcase itself is made from a sort of canvas-like material stretched over a hard shell, with brass hardware for the clasps and corner reinforcements and stuff.
For the speaker openings, instead of going with a normal grid-like dot pattern, I put in an interesting swirly design.
The inside of the briefcase lid uses a red fabric, with a custom gold imprinted logo for an imaginary music company that I made up for this project: “SeneTone”.
I don’t know why, but my favorite details to make when texturing and shading props are things like logos and labels; I think it’s always the labels and markings you’d expect to see in real life that really help make something CG believable.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/recordplayer.001.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/recordplayer.001.jpg" alt="Figure 19: Record player briefcase prop, wide view." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/recordplayer.002.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/recordplayer.002.jpg" alt="Figure 20: Close-up of the actual record player part of the briefcase." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/recordplayer.003.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/recordplayer.003.jpg" alt="Figure 21: Close-up of the red fabric briefcase liner and gold "SeneTone" logo." /></a></p>
<p>Figuring out what to do with the camera prop took some time, mostly because I wasn’t actually sure at first whether it was a camera or a projector!
While this prop looks like an old hand-cranked movie camera, the size of the prop in the scene that Pixar provided threw me off; the prop is way larger than any reference I could find for hand-cranked movie cameras.
I eventually decided that the size could probably be handwaved away by explaining the camera as some sort of really large-format camera.
I decided to model the look of the camera prop after professional film equipment from roughly the 1960s, when high-end cameras and stuff were almost uniformly made out of steel or aluminum housings with black leather or plastic grips.
Modern high-end camera gear also tends to be made from metal, but in modern gear the metal is usually completely covered in plastic or a colored powder coating, whereas the equipment from the 1960s I saw had a lot of exposed silvery-grey metal finishes, with covering materials only in areas that a user would be expected to touch or hold.
So, I decided to give the camera prop an exposed gunmetal finish, with black leather and black plastic grips.
I also reworked the lens and what I think is a rangefinder to include actual optical elements, so that they would look right when viewed from a straight-on angle.
As an homage to old film cinema, I made a little “Super 35” logo for the camera (even though the Super 35 film format is a bit anachronistic for a 1960s era camera).
The “Senecam” typemark is inspired by how camera companies often put their own typemark right across the top of the camera over the lens mount.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/camera.001.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/camera.001.jpg" alt="Figure 22: Camera prop front view. Note all of the layers of refraction and reflection in the lens." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/camera.002.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/camera.002.jpg" alt="Figure 23: Top view of the camera." /></a></p>
<p>The crystal was really interesting to shade.
I wanted to give the internals of the crystal some structure, and I didn’t want the crystal to refract a uniform color throughout.
To get some interesting internal structure, I wound up just shoving a bunch of crumpled up quads inside of the crystal mesh.
The internal crumpled up geometry refracts a couple of different variants of blue and light blue, and the internal geometry has a small amount of emission as well to get a bit of a glowy effect.
The outer shell of the crystal refracts mostly pink and purple; this dual-color scheme gives the internals of the crystal a lot of interesting depth.
The back-story in my head was that this crystal came from a giant geode or something, so I made the bottom of the crystal have bits of a more stony surface to suggest where the crystal was once attached to the inside of a stone geode.
The displacement on the crystal is basically just a bunch of rocky displacement patterns piled on top of each other using triplanar projections in Substance Painter; I think the final look is suitably magical!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/crystal_inside.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/crystal_inside.png" alt="Figure 24: Wireframe of the crystal's internal geometry with crumpled up quads." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/crystal.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/crystal.jpg" alt="Figure 25: Final magical glowy look of the crystal." /></a></p>
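<p>If you want to play with the crumpled-quad trick yourself, something like the following Maya Python snippet produces a similar pile of randomized internal geometry. To be clear, this is a hypothetical sketch for experimenting with the idea, not the exact way I built the crystal’s innards:</p>
<pre><code>import random
import maya.cmds as cmds

# Hypothetical sketch: scatter a pile of "crumpled" quads inside a crystal-sized volume.
random.seed(7)
crumples = []
for i in range(40):
    plane = cmds.polyPlane(width=1, height=1, subdivisionsX=4, subdivisionsY=4)[0]
    # Jitter every vertex a little to crumple the quad.
    for vtx in cmds.ls(plane + ".vtx[*]", flatten=True):
        cmds.move(random.uniform(-0.15, 0.15),
                  random.uniform(-0.15, 0.15),
                  random.uniform(-0.15, 0.15), vtx, relative=True)
    # Random placement, orientation, and scale inside the rough bounds of the crystal.
    cmds.xform(plane,
               translation=(random.uniform(-0.5, 0.5),
                            random.uniform(0.0, 2.0),
                            random.uniform(-0.5, 0.5)),
               rotation=(random.uniform(0, 360),
                         random.uniform(0, 360),
                         random.uniform(0, 360)),
               scale=(random.uniform(0.3, 1.0),) * 3)
    crumples.append(plane)
cmds.group(crumples, name="crystal_internal_geo_grp")
</code></pre>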
<p>Originally the crystal was going to be on one of the back shelves, but I liked how the crystal turned out so much that I decided to promote it to a foreground prop and put it on the foreground table.
I then filled the crystal’s original location on the back shelf with a pile of books.</p>
<p>I liked the crystal look so much that I decided to make the star on the magic wand out of the same crystal material.
The story I came up with in my head is that in this world, magic requires these crystals as a sort of focusing or transmitting element.
The magic wand’s star is shaded using the same technique as the crystal: the inside has a bunch of crumpled up refractive geometry to produce all of the interesting color variation and appearance of internal fractures and cracks, and the outer surface’s displacement is just a bunch of rocky patterns randomly stacked on top of each other.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/wand.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/wand.jpg" alt="Figure 26: Magic wand star made from the same material as the crystal." /></a></p>
<p>The flower-shaped lamps hanging above the table are also made from the same crystal material, albeit a much more simplified version.
The lamps are polished completely smooth and don’t have all of the crumpled up internal geometry since I wanted the lamps to be crack-free.</p>
<p>The potted plant on top of the stack of record crates was probably one of the easiest props to texture and shade.
The pot itself uses the same orange fired terracotta material as the main windows, but with displacement removed and with a bit less roughness.
The leaves and bark on the branches are straight from Quixel Megascans.
The displacement for the branches is actually slightly broken in both the test render below and in the final render, but since it’s a background prop and relatively far from the camera, I actually didn’t really notice until I was writing this post.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/pottedherb.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/pottedherb.jpg" alt="Figure 27: Potted plant prop; the pot reuses the same fired terracotta material as the windows." /></a></p>
<p>The reason that the character in my scene is talking on an old-school rotary dial phone is… actually, there isn’t a strong reason.
I originally was tinkering with a completely different idea that did have a strong story reason for the phone, but I abandoned that idea very early on.
Somehow the phone always stayed in my scene though!
Since the setting of my final scene is a <em>magic</em> bookshop, I figured that maybe the character is working at the shop and maybe she’s casting a spell over the phone!</p>
<p>The phone itself is kit-bashed together from a stock model that I had in my stock model library.
I did have to create the cord from scratch, since the cord needed to stretch from the main phone set to the receiver in the character’s hand.
I modeled the cord in Maya by first creating a guide curve that described the path the cord was supposed to follow, and then making a helix and making it follow the guide curve using Animate -> Motion Paths -> Flow Path Object tool.
The Flow Path Object tool puts a lattice deformer around the helix and makes the lattice deformer follow the guide curve, which in turn deforms the helix to follow as well.</p>
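<p>For reference, here is a rough Maya Python sketch of that curve-plus-helix-plus-Flow-Path-Object setup; the curve points and coil counts below are placeholder values rather than the ones used for the actual prop:</p>
<pre><code>import maya.cmds as cmds

# Rough sketch of the curly cord setup; point positions and counts are placeholders.
# 1. Guide curve describing the path from the phone set to the receiver.
guide = cmds.curve(degree=3,
                   point=[(0, 0, 0), (2, 1, 0), (4, 0.5, 1), (6, 2, 1)])

# 2. A straight helix that will become the coiled cord.
helix = cmds.polyHelix(coils=40, height=6, width=0.4, radius=0.05)[0]

# 3. Attach the helix to the guide curve as a motion path...
cmds.pathAnimation(helix, curve=guide, fractionMode=True, follow=True)

# 4. ...then wrap it in a lattice that follows the curve (the Flow Path Object tool),
#    which in turn deforms the helix along the guide.
cmds.flow(helix, divisions=(2, 60, 2))
</code></pre>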
<p>As with everything else in the scene, all of the shading and texturing for the phone is my own.
The phone is made from a simple red Bakelite plastic with some scuffs and scratches and fingerprints to make it look well used, while the dial and hook switch are made of a simple metal material.
I noticed that in some of the references images of old rotary phones that I found, the phones sometimes had a nameplate on them somewhere with the name of the phone company that provided the phone, so I made up yet another fictional logo and stuck it on the front of the phone.
The fictional phone company is “Senecom”; all of the little references to a place called Seneca hint that maybe this image is set in the same world as my entry for the previous RenderMan Art Challenge.
You can’t actually see the Senecom logo in the final image, but again, at least I know it’s there!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/phone_set.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/phone_set.jpg" alt="Figure 28: "Senecom" phone set, with custom modeled curly cord." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/phone_receiver.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/phone_receiver.jpg" alt="Figure 29: Phone handset, made from red plastic." /></a></p>
<p><strong>Signs and Records and Books</strong></p>
<p>While looking up bookstore reference with shading the books in mind, I came across an image of a sign reading “Books are Magic” from a Brooklyn bookstore of the same name.
Seeing that sign provided a good boost of inspiration for how I proceeded with theming my bookstore set, and I liked the sign so much that I decided to make a bit of an homage to it in my scene.
I wasn’t entirely sure how to make a neon sign though, so I had to do some experimentation.
I started by laying out curves in Adobe Illustrator and bringing them into Maya.
I then made each glass tube by just extruding a cylinder along each curve, and then I extruded a narrower cylinder along the same curve for the glowy part inside of the glass tube.
Each glass tube has a glass shader with colored refraction and uses the thin glass option, since real neon glass tubes are hollow.
The glowy part inside is a mesh light.
To make the renders converge more quickly, I actually duplicated each mesh light; one mesh light is white, is visible to camera, and has thin shadows disabled to provide the look of the glowy neon core, while the second mesh light is red, invisible to camera, and has thin shadows enabled to cast colored glow outside of the glass tubes without introducing tons of noise.
Inside of Maya, this setup looks like the following:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/neonsign_maya.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/neonsign_maya.png" alt="Figure 30: Neon sign setup in Maya." /></a></p>
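<p>For anyone curious about scripting the tube-building step, here is a minimal Maya Python sketch of sweeping a glass tube plus a narrower glow core along each imported curve. It assumes the Illustrator curves were brought in under a hypothetical group named “neon_curves_grp”, and it leaves the glass shader and mesh light assignments as separate steps:</p>
<pre><code>import maya.cmds as cmds

# Sketch: sweep glass tubes and glow cores along imported neon curves.
# Assumes the curves live under a group named "neon_curves_grp" (hypothetical).
for curve in cmds.listRelatives("neon_curves_grp", children=True, type="transform") or []:
    # Outer glass tube: sweep a circle profile along the curve.
    glass_profile = cmds.circle(radius=0.15, sections=16)[0]
    glass_tube = cmds.extrude(glass_profile, curve, extrudeType=2,
                              fixedPath=True, useComponentPivot=1,
                              useProfileNormal=True)[0]
    # Narrower inner tube that later becomes the white mesh light for the glowy core.
    core_profile = cmds.circle(radius=0.05, sections=16)[0]
    core_tube = cmds.extrude(core_profile, curve, extrudeType=2,
                             fixedPath=True, useComponentPivot=1,
                             useProfileNormal=True)[0]
</code></pre>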
<p>After all of this setup work, I gave the neon tubes a test render, and to my enormous surprise and relief, it looked promising!
This was the first test render of the neon tubes; when I saw this, I knew that the neon sign was going to work out after all:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/neonsign_1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/neonsign_1.jpg" alt="Figure 31: First neon sign render test." /></a></p>
<p>After getting the actual neon tubes part of the neon sign working, I added in a supporting frame and wires and stuff.
In the final scene, the neon sign is held onto the back wall using screws (which I actually modeled as well, even though as usual for all of the tiny things that I put way too much effort into, you can’t really see them).
Here is the neon sign on its frame:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/neonsign_2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/neonsign_2.jpg" alt="Figure 32: Final neon sign prop with frame and wires." /></a></p>
<p>The single most time consuming prop in the entire project wound up being the stack of record crates behind the character to the right; I don’t know why I decided to make a stack of record crates, considering how many unique records I wound up having to make to give the whole thing a plausible feel.
In the end I made around twenty different custom album covers; the titles are borrowed from stuff I had recently listened to at the time, but all of the artwork is completely custom to avoid any possible copyright problems with using real album artwork.
The sharp-eyed long-time blog reader may notice that a lot of the album covers reuse renders that I’ve previously posted on this blog before!
For the record crates themselves, I chose a layered laminated wood, which I figured in real life is a sturdy but relatively inexpensive material.
Of course, instead of making all of the crates identical duplicates of each other, I gave each crate a unique wood grain pattern.
The vinyl records that are sticking out here and there have a simple black glossy plastic material with bump mapping for the grooves; I was pleasantly surprised at how well the grooves catch light given that they’re entirely done through bump mapping.</p>
<p>Coming up with all of the different album covers was pretty fun!
Different covers have different neat design elements; some have metallic gold leaf text, others have embossed designs, there are a bunch of different paper varieties, etc.
The common design element tying all of the album covers together is that they all have a “SeneTone” logo on them, to go with the “SeneTone” record player prop.
To make the album covers, I created the designs in Photoshop with separate masks for different elements like metallic text and whatnot, and then used the masks to drive different layers in Substance Painter.
In Substance Painter, I actually created different paper finishes for different albums; some have a matte paper finish, some have a high gloss magazine-like finish, some have rough cloth-like textured finishes, some have smooth finishes, and more.
I guess none of this really matters from a distance, but it was fun to make, and more importantly to myself, I know that all of those details are there!
After randomizing which records get which album covers, here’s what the record crates look like:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/recordcrates_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/recordcrates.jpg" alt="Figure 33: Record crates stack with randomized, custom album covers. Click through for a high-res 4K render if you want to see all of the little details." /></a></p>
<p>The various piles of books sitting around the scene also took a ton of time, for similar reasons to why the records took so much time: I wanted each book to be unique.
Much like the records, I don’t know why I chose to have so many books, because it sure took a long time to make around twenty different unique books!
My idea was to have a whole bunch of the books scattered around suggesting that the main character has been teaching herself how to build a magic wand and cast spells and such- quite literally “books are magic”, because the books are textbooks for various magical topics.
Here is one of the textbooks- this one about casting spells over the telephone, since the character is on the phone.
Maybe she’s trying to charm whoever is on the other end!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/spellbook.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/spellbook.jpg" alt="Figure 34: Hero "Casting Spells over Telephone" book prop. This book was also the prototype for all of the other books!" /></a></p>
<p>I wound up significantly modifying the provided book model; I created several different basic book variants and also a few open book variants, for which I had to also model some pages and stuff.
Because of how visible the books are in my framing, I didn’t want to have any obvious repeats in the books, so I textured every single one of them to be unique.
I also added in some little sticky-note bookmarks into the books, to make it look like they’re being actively read and referenced.</p>
<p>Creating all of the different books with completely different cover materials and bindings and page styles was a lot of fun!
Some of the most interesting covers to create were the ones with intricate gold or silver foil designs on the front; for many of these, I found pictures of really old books and did a bunch of Photoshop work to extract and clean up the cover design for use as a layer mask in Substance Painter.
Here are some of the books I made:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/books1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/books1.jpg" alt="Figure 35: Each one of these textbooks is a play on something I have on my home bookshelf." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/books3.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/books3.jpg" alt="Figure 36: Test render of various different types of pages, along with sticky notes." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/books4.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/books4.jpg" alt="Figure 37: Another test render of different types of pages and of pages sticking out." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/books6.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/books6.jpg" alt="Figure 38: A bunch more books, including a Seneca book!" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/books7.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/books7.jpg" alt="Figure 39: Even more books. Did you notice the copy of PBRTv3 in the background?" /></a></p>
<p>One fun part of making all of these books was that they were a great opportunity for sneaking in a bunch of personal easter eggs.
Many of the book titles are references to computer graphics and rendering concepts.
Some of the book authors are just completely made up or pulled from whatever book caught my eye off of my bookshelf at the moment, but also included among the authors are all of the names of the Hyperion team’s current members at the time that I did this project.
There is also, of course, a book about Seneca, and there’s a book referencing Minecraft.
The green book titled “The Compleat Atlas of the House and Immediate Environs” is a reference to Garth Nix’s “Keys to the Kingdom” series, which my brother and I loved when we were growing up and which had a significant influence on the type of kind-of-a-science magic I like in fantasy settings.
Also, of course, as is obligatory since I am a rendering engineer, there is a copy of <a href="http://www.pbr-book.org">Physically Based Rendering 3rd Edition</a> hidden somewhere in the final scene; see if you can spot it!</p>
<p><strong>Putting Everything Together</strong></p>
<p>At this point, with all extra modeling completed and everything textured and shaded, the time came for final touches and lighting!
Since one of the books I made is about levitation enchantments, I decided to use that to justify making one of the books float in mid-air in front of the character.
To help sell that floating-in-air enchantment, I made some magical glowy pixie dust particles coming from the wand; the pixie dust is just some basic nParticles following a curve.
The pixie dust is shaded using PxrSurface’s glow parameter.
I used the particleId primvar to drive a PxrVary node, which in turn is used to randomize the pixie dust colors and opacity.
Putting everything together at this point looked like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress.075.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress.075.jpg" alt="Figure 40: Putting everything together for the first time with everything textured and shaded." /></a></p>
<p>I originally wanted to add some cobwebs in the corners of the room and stuff, but at this point I had so little time remaining that I had to move on directly to final shot lighting.
I did however have time for two small last-minute tweaks: I adjusted the character’s pose a slight amount to tilt her head towards the phone more, which is closer to how people actually talk on the phone, and I also moved up the overhead lamps a bit to try not to crowd out her head.</p>
<p>The final shot lighting is not actually that far of a departure from the lighting I had already roughed in at this point; mostly the final lighting just consisted of tweaks and adjustments here and there.
I added a bunch of PxrRodFilters to take down hot spots and help shape the lighting overall a bit more.
The rods I added were used to bring down the overhead lamps and prevent them from blowing out, to slightly brighten up some background shelf books, to knock down a hot spot on a foreground book, and to knock down hot spots on the floor and on the bench.
I also brought down the brightness of the neon sign a bit, since the brightness of the sign should be lower relative to how incredibly bright the windows were.
Here is what my Maya viewport looked like with all of the rods; everything green in this screenshot is a rod:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/rods.png"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/rods.png" alt="Figure 41: Maya viewport with rods highlighted in green." /></a></p>
<p>One of the biggest/trickiest changes I made to the lighting setup was actually for technical reasons instead of artistic reasons: the back window was originally so bright that the brightness was starting to break pixel filtering for any pixel that partially overlapped the back window.
To solve this problem, I split the dome light outside of the window into two dome lights; the two new lights added up to the same intensity as the old one, but the two lights split the energy such that one light had 85% of the energy and was not visible to camera while the other light had 15% of the energy and was visible to camera.
This change had the effect of preserving the overall illumination in the room while knocking down the actual whites seen through the windows to a level low enough that pixel filtering no longer broke as badly.</p>
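<p>To make the energy bookkeeping concrete, here is the split written out as a tiny calculation; the fractions are the ones mentioned above, and the actual intensity value here is arbitrary:</p>
<pre><code># Splitting one dome light into two while preserving total illumination.
original_intensity = 1.0  # whatever the single original dome light was set to

visible_intensity   = 0.15 * original_intensity  # visible to camera
invisible_intensity = 0.85 * original_intensity  # hidden from camera

# Total room illumination is unchanged, but the whites seen directly through
# the window are only 15% as bright, which keeps pixel filtering well behaved.
assert abs((visible_intensity + invisible_intensity) - original_intensity) < 1e-9
</code></pre>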
<p>At this point I arrived at my final main beauty pass.
In previous RenderMan Art Challenges, I broke out lights into several different render passes so that I could adjust them separately in comp before recombining, but for this project, I just rendered out everything on a single pass:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress.083.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress.083.jpg" alt="Figure 42: Final render, beauty pass." /></a></p>
<p>Here is a comparison of the final beauty pass with the initial putting-everything-together render from Figure 40.
Note how the overall lighting is actually not too different, but there are many small adjustments and tweaks:</p>
<div class="embed-container">
<iframe src="/content/images/2021/Apr/magicshop/comparisons/beforeafterlighting_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 43: Before (left) and after (right) final lighting. For a full screen comparison, <a href="/content/images/2021/Apr/magicshop/comparisons/beforeafterlighting.html">click here.</a></div>
<p>To help shape the lighting a bit more, I added a basic atmospheric volume pass.
Unlike in previous RenderMan Art Challenges where I used fancy VDBs and whatnot to create complex atmospherics and volumes, for this scene I just used a simple homogeneous volume box.
My main goal with the atmospheric volume pass was to capture some subtle godray-like lighting effects coming from the back windows:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/progress.083.volumes.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/progress.083.volumes.jpg" alt="Figure 44: Final render, volumes pass." /></a></p>
<p>For the final composite, I used the same Photoshop and Lightroom workflow that I used for the previous two RenderMan Art Challenges.
For future personal art projects I’ll be moving to a DaVinci Resolve/Fusion compositing workflow, but this time around I reached for what I already knew since I was so short on time.
Just like last time, I used basically only exposure adjustments in Photoshop, flattened out, and brought the image into Lightroom for final color grading.
In Lightroom I further brightened things a bit, made the scene warmer, and added just a bit more glowy-ness to everything.
Figure 45 is a gif that visualizes the compositing steps I took for the final image.
Figure 46 shows what all of the lighting, comp, and color grading looks like applied to a 50% grey clay shaded version of the scene, and Figure 47 repeats what the final image looks like so that you don’t have to scroll all the way back to the top of this post.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/composite_breakdown.gif"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/composite_breakdown.gif" alt="Figure 45: Animated breakdown of compositing layers." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/greyshaded_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/greyshaded.jpg" alt="Figure 46: Final lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/magicshop_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/magicshop_full.jpg" alt="Figure 47: Final image. Click for 4K version." /></a></p>
<p><strong>Conclusion</strong></p>
<p>Despite having much less free time to work on this RenderMan Art Challenge, and despite not having really intended to even enter the contest initially, I think things turned out okay!
I certainly wasn’t expecting to actually win a placed position again!
I learned a ton about character shading, which I think is a good step towards filling a major hole in my areas of experience.
For all of the props and stuff, I was pretty happy to find that my Substance Painter workflow is now sufficiently practiced and refined that I was able to churn through everything relatively efficiently.
At the end of the day, stuff like art simply requires practice to get better at, and this project was a great excuse to practice!</p>
<p>Here is a progression video I put together from all of the test and in-progress renders that I made throughout this entire project:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/486289496" frameborder="0">Magic Shop Art Challenge Progression Reel</iframe></div>
<div class="figcaption">Figure 48: Progression reel made from test and in-progress renders leading up to my final image.</div>
<p>As usual with these art projects, I owe an enormous debt of gratitude to my wife, Harmony Li, both for giving invaluable feedback and suggestions (she has a much better eye than I do!), and also for putting up with me going off on another wild time-consuming art adventure.
Also, as always, Leif Pederson from Pixar’s RenderMan group provided lots of invaluable feedback, notes, and encouragement, as did everyone else in the RenderMan Art Challenge community.
Seeing everyone else’s entries is always super inspiring, and being able to work side by side with such amazing artists and such friendly people is a huge honor and very humbling.
If you would like to see more about my contest entry, check out the <a href="https://renderman.pixar.com/answers/challenge/19140/call-me-maybe.html?page=1&pageSize=10&sort=oldest">work-in-progress thread I kept on Pixar’s Art Challenge forum</a>, and I also have an <a href="https://www.artstation.com/artwork/ykRWVK">Artstation post</a> for this project.</p>
<p>Finally, here’s a bonus alternate angle render of my scene. I made this alternate angle render for fun after the project and out of curiosity to see how well things held up from a different angle, since I very much “worked to camera” for the duration of the entire project.
I was pleasantly surprised that everything held up well from a different angle!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/altangle_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Apr/magicshop/preview/altangle.jpg" alt="Figure 49: Bonus image: alternate camera angle. Click for 4K version." /></a></p>
<p><strong>References</strong></p>
<p>Carlos Aliaga, Carlos Castillo, Diego Gutierrez, Miguel A. Otaduy, Jorge López-Moreno, and Adrian Jarabo. 2017. <a href="https://doi.org/10.1111/cgf.13222">An Appearance Model for Textile Fibers</a>. <em>Computer Graphics Forum</em>. 36, 4 (2017), 35-45.</p>
<p>Brent Burley and Dylan Lacewell. 2008. <a href="https://doi.org/10.1111/j.1467-8659.2008.01253.x">Ptex: Per-face Texture Mapping for Production Rendering</a>. <em>Computer Graphics Forum</em>. 27, 4 (2008), 1155-1164.</p>
<p>Brent Burley. 2015. <a href="https://doi.org/10.1145/2776880.2787670">Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering</a>. In <a href="https://blog.selfshadow.com/publications/s2015-shading-course"><em>ACM SIGGRAPH 2015 Course Notes: Physically Based Shading in Theory and Practice</em></a>.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2017. <a href="https://www.yiningkarlli.com/projects/ptcourse2017.html">Recent Advances in Disney’s Hyperion Renderer</a>. In <a href="http://dx.doi.org/10.1145/3084873.3084904"><em>ACM SIGGRAPH 2017 Course Notes: Path Tracing in Production Part 1</em></a>, 26-34.</p>
<p>Matt Jen-Yuan Chiang, Benedikt Bitterli, Chuck Tappan, and Brent Burley. 2016. <a href="https://doi.org/10.1111/cgf.12830">A Practical and Controllable Hair and Fur Model for Production Path Tracing</a>. <em>Computer Graphics Forum</em>. 35, 2 (2016), 275-283.</p>
<p>Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. <a href="https://dl.acm.org/doi/10.1145/2897839.2927433">Practical and Controllable Subsurface Scattering for Production Path Tracing</a>. In <em>ACM SIGGRAPH 2016 Talks</em>, 49:1-49:2.</p>
<p>Philip Child. 2012. <a href="https://drive.google.com/file/d/1bNSwpPusRmRmGfPwe11tjtloCP96WN1P/view?usp=sharing">Ill-Loom-inating Brave’s Handmade Fabric</a>. In <em>ACM SIGGRAPH 2012, Talks</em>.</p>
<p>Per H. Christensen and Brent Burley. 2015. <a href="https://graphics.pixar.com/library/ApproxBSSRDF">Approximate Reflectance Profiles for Efficient Subsurface Scattering</a>. <em>Pixar Technical Memo #15-04</em>.</p>
<p>Trent Crow, Michael Kilgore, and Junyi Ling. 2018. <a href="https://dl.acm.org/citation.cfm?id=3214787">Dressed for Saving the Day: Finer Details for Garment Shading on Incredibles 2</a>. In <em>ACM SIGGRAPH 2018 Talks</em>, 6:1-6:2.</p>
<p>Priyamvad Deshmukh, Feng Xie, and Eric Tabellion. 2017. <a href="https://dl.acm.org/citation.cfm?id=3085024">DreamWorks Fabric Shading Model: From Artist Friendly to Physically Plausible</a>. In <em>ACM SIGGRAPH 2017 Talks</em>. 38:1-38:2.</p>
<p>Eugene d’Eon. 2012. <a href="http://www.eugenedeon.com/project/a-better-dipole/">A Better Dipole</a>. <a href="http://www.eugenedeon.com/project/a-better-dipole/"><em>http://www.eugenedeon.com/project/a-better-dipole/</em></a></p>
<p>Eugene d’Eon, Guillaume Francois, Martin Hill, Joe Letteri, and Jean-Marie Aubry. 2011. <a href="https://doi.org/10.1111/j.1467-8659.2011.01976.x">An Energy-Conserving Hair Reflectance Model</a>. <em>Computer Graphics Forum</em>. 30, 4 (2011), 1181-1187.</p>
<p>Christophe Hery. 2003. <a href="https://graphics.pixar.com/library/RMan2003/">Implementing a Skin BSSRDF</a>. In <em>ACM SIGGRAPH 2003 Course Notes: RenderMan, Theory and Practice</em>. 73-88.</p>
<p>Christophe Hery. 2012. <a href="https://graphics.pixar.com/library/TexturingBetterDipole/">Texture Mapping for the Better Dipole Model</a>. <em>Pixar Technical Memo #12-11</em>.</p>
<p>Christophe Hery and Junyi Ling. 2017. <a href="http://graphics.pixar.com/library/PxrMaterialsCourse2017/index.html">Pixar’s Foundation for Materials: PxrSurface and PxrMarschnerHair</a>. In <a href="https://blog.selfshadow.com/publications/s2017-shading-course/"><em>ACM SIGGRAPH 2017 Course Notes: Physically Based Shading in Theory and Practice</em></a>.</p>
<p>Jonathan Hoffman, Matt Kuruc, Junyi Ling, Alex Marino, George Nguyen, and Sasha Ouellet. 2020. <a href="http://graphics.pixar.com/library/CurveCloth/">Hypertextural Garments on Pixar’s <em>Soul</em></a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 75:1-75:2.</p>
<p>Henrik Wann Jensen, Steve Marschner, Marc Levoy, and Pat Hanrahan. 2001. <a href="https://dl.acm.org/doi/10.1145/383259.383319">A Practical Model for Subsurface Light Transport</a>. In <em>Proceedings of SIGGRAPH 2001</em>. 511-518.</p>
<p>Ying Liu, Jared Wright, and Alexander Alvarado. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3407360">Making Beautiful Embroidery for “Frozen 2”</a>. In <em>ACM SIGGRAPH 2020 Talks</em>, 73:1-73:2.</p>
<p>Steve Marschner, Henrik Wann Jensen, Mike Cammarano, Steve Worley, and Pat Hanrahan. 2003. <a href="https://doi.org/10.1145/882262.882345">Light Scattering from Human Hair Fibers</a>. <em>ACM Transactions on Graphics</em>. 22, 3 (2003), 780-791.</p>
<p>Zahra Montazeri, Søren B. Gammelmark, Shuang Zhao, and Henrik Wann Jensen. 2020. <a href="https://doi.org/10.1145/3414685.3417777">A Practical Ply-Based Appearance Model of Woven Fabrics</a>. <em>ACM Transactions on Graphics</em>. 39, 6 (2020), 251:1-251:13.</p>
<p>Sean Palmer and Kendall Litaker. 2016. <a href="https://dl.acm.org/citation.cfm?id=2927466">Artist Friendly Level-of-Detail in a Fur-Filled World</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 32:1-32:2.</p>
<p>Leonid Pekelis, Christophe Hery, Ryusuke Villemin, and Junyi Ling. 2015. <a href="https://graphics.pixar.com/library/DataDrivenHairScattering/">A Data-Driven Light Scattering Model for Hair</a>. <em>Pixar Technical Memo #15-02</em>.</p>
<p>Kai Schröder, Reinhard Klein, and Arno Zinke. 2011. <a href="https://doi.org/10.1111/j.1467-8659.2011.01987.x">A Volumetric Approach to Predictive Rendering of Fabrics</a>. <em>Computer Graphics Forum</em>. 30, 4 (2011), 1277-1286.</p>
<p>Brian Smith, Roman Fedetov, Sang N. Le, Matthias Frei, Alex Latyshev, Luke Emrose, and Jean Pascal leBlanc. 2018. <a href="https://dl.acm.org/citation.cfm?id=3214781">Simulating Woven Fabrics with Weave</a>. In <em>ACM SIGGRAPH 2018 Talks</em>. 12:1-12:2.</p>
<p>Thomas V. Thompson, Ernest J. Petti, and Chuck Tappan. 2003. <a href="https://dl.acm.org/doi/10.1145/965400.965411">XGen: Arbitrary Primitive Generator</a>. In <em>ACM SIGGRAPH 2003 Sketches and Applications</em>.</p>
<p>Walt Disney Animation Studios. 2011. <a href="https://wdas.github.io/SeExpr/">SeExpr</a>.</p>
<p>Magnus Wrenninge, Ryusuke Villemin, and Christophe Hery. 2017. <a href="https://graphics.pixar.com/library/PathTracedSubsurface/">Path Traced Subsurface Scattering using Anisotropic Phase Functions and Non-Exponential Free Flights</a>. <em>Pixar Technical Memo #17-07</em>.</p>
<p>Shuang Zhao, Wenzel Jakob, Steve Marschner, and Kavita Bala. 2012. <a href="https://doi.org/10.1145/2185520.2185571">Structure-Aware Synthesis for Predictive Woven Fabric Appearance</a>. <em>ACM Transactions on Graphics</em>. 31, 4 (2012), 75:1-75:10.</p>
<p>Shuang Zhao, Fujun Luan, and Kavita Bala. 2016. <a href="https://doi.org/10.1145/2897824.2925932">Fitting Procedural Yarn Models for Realistic Cloth Rendering</a>. <em>ACM Transactions on Graphics</em>. 35, 4 (2016), 51:1-51:11.</p>
</div>
https://blog.yiningkarlli.com/2021/03/raya-and-the-last-dragon.html
Raya and the Last Dragon
2021-03-05T00:00:00+00:00
2021-03-05T00:00:00+00:00
Yining Karl Li
<p>After a break in 2020, <a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a> has two films lined up for release in 2021!
The first of these is <a href="https://www.disneyanimation.com/films/">Raya and the Last Dragon</a>, which is simultaneously out in theaters and available on <a href="http://www.disneyplus.com/">Disney+ Premiere Access</a> on the day this post is being released.
I’ve been working on Raya and the Last Dragon in some form or another since early 2018, and Raya and the Last Dragon is the first original film I’ve worked on at Disney Animation that I was able to witness from the very earliest idea all the way through to release; every other project I’ve worked on up until now was either based on a previous idea or began before I started at the studio.
Raya and the Last Dragon was an incredibly difficult film to make, in every possible aspect.
The story took time to really get right, the technology side of things saw many challenges and changes, and the main production of the film ran headfirst into the Covid-19 pandemic.
Just as production was getting into the swing of things last year, the Covid-19 pandemic forced the physical studio building to temporarily shut down, and the studio’s systems/infrastructure teams had to scramble and go to heroic lengths to get production back up and running again from around 400 different homes.
As a result, Raya and the Last Dragon is the first Disney Animation film made entirely from our homes instead of from the famous “hat building”.</p>
<p>In the end though, all of the trials and tribulations this production saw were more than worthwhile; Raya and the Last Dragon is the most beautiful film we’ve ever made, and the movie has a message and story about trust that is deeply relevant for the present time.
The Druun as a concept and villain in Raya and the Last Dragon actually long predate the Covid-19 pandemic; they’ve been a part of every version of the movie going back years, but the Druun’s role in the movie’s plot meant that the onset of the pandemic suddenly lent extra weight to this movie’s core message.
Also, as someone of Asian descent, I’m so so proud that Raya and the Last Dragon’s basis is found in diverse Southeast Asian cultures.
Early in the movie’s conceptualization, before the movie even had a title or a main character, the movie’s producers and directors and story team reached out to all of the people in the studio of Asian descent and engaged us in discussing how the Asian cultures we came from shaped our lives and our families.
These discussions continued for years throughout the production process, and throughlines from those discussions can be seen everywhere in the movie, from major thematic elements like the importance of food and sharing meals in the world of Kumandra, all the way down to tiny details like young Raya taking off her shoes when entering the Dragon Gem chamber.
The way I get to contribute to our films is always in the technical realm, but thanks to Fawn Veerasunthorn, Scott Sakamoto, Adele Lim, Osnat Shurer, Paul Briggs, and Dean Wellins, this is the first time where I feel like I maybe made some small, tiny, but important contribution creatively too!
Raya and the Last Dragon has spectacular fight scenes with real combat, and the fighting styles aren’t just made up- they’re directly drawn from Thailand, Malaysia, Cambodia, Laos, and Vietnam.
Young Raya’s fighting sticks are Filipino Arnis sticks, the food in the film is recognizably dishes like fish amok, tom yam, chicken satay and more, Raya’s main mode of transport is her pet Tuk Tuk, who has the same name as those motorbike carriages that can be found all over Southeast Asia; the list goes on and on.</p>
<p>From a rendering technology perspective, Raya and the Last Dragon in a lot of ways represents the culmination of a huge number of many-year-long initiatives that began on previous films.
Water is a huge part of Raya and the Last Dragon, and the water in the film looks so incredible because we’ve been able to build even further upon the water authoring pipeline <a href="https://dl.acm.org/citation.cfm?id=3085067">[Palmer et al. 2017]</a> that we first built on <a href="https://blog.yiningkarlli.com/2016/11/moana.html">Moana</a> and improved on <a href="https://blog.yiningkarlli.com/2019/11/froz2.html">Frozen 2</a>.
One small bit of rendering tech I worked on for this movie was further improving the robustness and stability of the water levelset meshing system that we first developed on Moana.
Other elements of the film, such as being able to render convincing darker skin and black hair, along with the colorful fur of the dragons, are the result of multi-year efforts to productionize path traced subsurface scattering <a href="https://doi.org/10.1145/2897839.2927433">[Chiang et al. 2016b]</a> (first deployed on <a href="https://blog.yiningkarlli.com/2018/11/wir2.html">Ralph Breaks the Internet</a>) and a highly artistically controllable principled hair shading model <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12830">[Chiang et al. 2016a]</a> (first deployed on <a href="https://blog.yiningkarlli.com/2016/02/zootopia.html">Zootopia</a>).
The huge geometric complexity challenges that we’ve had to face on all of our previous projects prepared us for rendering Raya and the Last Dragon’s setting, the vast world of Kumandra.
Even more niche features, such as our adaptive photon mapping system <a href="https://dl.acm.org/citation.cfm?id=3182159">[Burley et al. 2018]</a>, proved to be really useful on this movie, and even saw new improvements- Joe Schutte added support for more geometry types to the photon mapping system to allow for caustics to be cast on Sisu whenever Sisu was underwater.
Raya and the Last Dragon also contains a couple of more stylized sequences that look almost 2D, but even these sequences were rendered using Hyperion!
These more stylized sequences build upon the 3D-2D hybrid stylization experience that Disney Animation has gained over the years from projects such as <a href="https://www.disneyanimation.com/shorts/paperman/">Paperman</a>, <a href="https://www.disneyanimation.com/shorts/feast/">Feast</a>, and many of the <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Short Circuit shorts</a> <a href="https://dl.acm.org/doi/10.1145/3388767.3409267">[Newfield and Staub 2020]</a>.
I think all of the above is really what makes a production renderer a <em>production</em> renderer- years and years of accumulated research, development, and experience over a variety of challenging projects forging a powerful, reliable tool custom tailored to our artists’ work and needs.
Difficult problems are still difficult, but they’re no longer scary, because now, we’ve seen them before!</p>
<p>For this movie though, the single biggest rendering effort by far was on volume rendering.
After encountering many volume rendering challenges on Moana, our team undertook an effort to replace Hyperion’s previous volume rendering system <a href="https://doi.org/10.1145/3084873.3084907">[Fong et al. 2017]</a> with a brand new, from scratch implementation based on new research we had conducted <a href="https://doi.org/10.1145/3072959.3073665">[Kutz et al. 2017]</a>.
The new system first saw wide deployment on Ralph Breaks the Internet, but all things considered, the volume use cases on Ralph Breaks the Internet didn’t actually push us into the types of difficult cases we ran into on Moana, such as ocean foam and spray.
Frozen 2 was really the show where we got a second chance at tackling the ocean foam, spray, and dense white cloud cases that we had first encountered on Moana, and new challenges on Frozen 2 with thin volumes gave my teammate Wayne Huang the opportunity to make the new volume rendering system even better.
Raya and the Last Dragon is the movie where I feel like all of the past few years of development on our modern volume rendering system came together- this movie threw every single imaginable type of volume rendering problem at us, often in complex combinations with each other.
On top of that, Raya and the Last Dragon has volumes in basically every single shot; the highly atmospheric, naturalistic cinematography on this film demanded more volumes than we’ve ever had on any past movie.
Wayne really was our MVP in the volume rendering arena; Wayne worked with our lighters to introduce a swath of powerful new tools to give artists unprecedented control and artistic flexibility in our modern volume rendering system <a href="https://doi.org/10.1145/3450623.3464676">[Bryant et al. 2021]</a>, and Wayne also made huge improvements in the volume rendering system’s overall performance and efficiency <a href="https://doi.org/10.1145/3450623.3464644">[Huang et al. 2021]</a>.
We now have a single unified volume integrator that can robustly handle basically every volume you can think of: fog, thin atmospherics, fire, smoke, thick white clouds, sea foam, and even highly stylized effects such as the dragon magic <a href="https://doi.org/10.1145/3450623.3464652">[Navarro & Rice 2021]</a> and the chaotic Druun characters <a href="https://doi.org/10.1145/3450623.3464647">[Rice 2021]</a> in Raya and the Last Dragon.</p>
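<p>For readers who want a more concrete picture of what a null-collision-based volume integrator is built around, below is a tiny Python sketch of classic delta (Woodcock) tracking, the textbook scheme that spectral and decomposition tracking <a href="https://doi.org/10.1145/3072959.3073665">[Kutz et al. 2017]</a> generalize. This is purely illustrative and is not Hyperion’s actual implementation; the function names and the toy density at the end are mine.</p>
<pre><code>import math
import random

def sample_free_flight(sigma_t, sigma_maj, t_max):
    """Delta (Woodcock) tracking: sample a collision distance through a
    heterogeneous medium whose extinction sigma_t(t) is bounded above by the
    majorant sigma_maj. Returns the distance of a real collision, or None if
    the ray exits the medium at t_max without colliding."""
    t = 0.0
    while True:
        # step as if the medium were homogeneous with extinction sigma_maj
        t -= math.log(1.0 - random.random()) / sigma_maj
        if t >= t_max:
            return None
        # accept as a real collision with probability sigma_t / sigma_maj;
        # otherwise it is a fictitious "null" collision and we keep marching
        if random.random() * sigma_maj < sigma_t(t):
            return t

# toy example: fog whose density ramps up along the ray, with majorant 1.0
distance = sample_free_flight(lambda t: 0.2 + 0.8 * min(t / 10.0, 1.0), 1.0, 10.0)
</code></pre>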
<p>A small fun new thing I got to do for this movie was to add support for arbitrarily custom texture-driven camera aperture shapes.
Raya and the Last Dragon’s cinematography makes extensive use of shallow depth-of-field, and one idea the film’s art directors had early on was to stylize bokeh shapes to resemble the Dragon Gem.
Hyperion has long had extensive support for fancy physically-based lensing features such as uniformly bladed apertures and cateye bokeh, but the request for a stylized bokeh required much more art-directability than we previously had in this area.
The texture-driven camera aperture feature I added to Hyperion is not necessarily anything innovative (similar features can be found on many commercial renderers), but iterating with artists to define and refine the feature’s controls and behavior was a lot of fun.
There were also a bunch of fun nifty little details to solve, such as making sure that importance sampling ray directions based on an arbitrary textured aperture didn’t mess up stratified sampling and Sobol distributions; repurposing hierarchical sample warping <a href="https://dl.acm.org/doi/10.1145/1073204.1073328">[Clarberg et al. 2005]</a> wound up being super useful here.</p>
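<p>To make the sampling idea a bit more concrete, here is a small Python sketch of hierarchical sample warping applied to a square, power-of-two greyscale aperture texture. This is just the basic scheme from the Clarberg et al. paper written in the most naive way possible; it is not Hyperion’s implementation, and the function and variable names are all mine. A single well-stratified 2D sample is pushed down a sum pyramid one mip level at a time, so the warped samples inherit the stratification of the input samples:</p>
<pre><code>import numpy as np

def build_sum_pyramid(aperture):
    """Build a sum pyramid (coarsest level first) from a square,
    power-of-two-resolution greyscale aperture texture."""
    levels = [aperture.astype(np.float64)]
    while levels[-1].shape[0] > 1:
        f = levels[-1]
        levels.append(f[0::2, 0::2] + f[0::2, 1::2] + f[1::2, 0::2] + f[1::2, 1::2])
    return levels[::-1]

def warp_to_aperture(u1, u2, pyramid):
    """Warp a uniform sample (u1, u2) in [0,1)^2 towards the aperture's
    brightness distribution, making one 2x2 decision per mip level."""
    x = y = 0
    for level in pyramid[1:]:  # skip the 1x1 root
        x, y = 2 * x, 2 * y
        a, b = level[y, x], level[y, x + 1]
        c, d = level[y + 1, x], level[y + 1, x + 1]
        total = a + b + c + d
        p_top = (a + b) / total if total > 0.0 else 0.5
        if u2 < p_top:
            u2 /= p_top
            p_left = a / (a + b) if (a + b) > 0.0 else 0.5
        else:
            u2 = (u2 - p_top) / (1.0 - p_top)
            y += 1
            p_left = c / (c + d) if (c + d) > 0.0 else 0.5
        if u1 < p_left:
            u1 /= p_left
        else:
            u1 = (u1 - p_left) / (1.0 - p_left)
            x += 1
    res = pyramid[-1].shape[0]
    # reuse the leftover sample to jitter within the chosen texel
    return (x + u1) / res, (y + u2) / res

# toy example: a 4x4 "aperture" that only passes light through its upper-left quadrant
aperture = np.zeros((4, 4))
aperture[0:2, 0:2] = 1.0
print(warp_to_aperture(0.3, 0.7, build_sum_pyramid(aperture)))
</code></pre>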
<p>There are a ton more really cool technical advancements that were made for Raya and the Last Dragon, and there were also several really ambitious, inspiring, and potentially revolutionary projects that just barely missed being deployed in time for this movie.
One extremely important point I want to highlight is that, as cool as all of the tech that we develop at Disney Animation is, at the end of the day our tech and tools are only as good as the artists that use them every day to handcraft our films.
Hyperion only renders amazing films because the artists using Hyperion are some of the best in the world; I count myself as super lucky to be able to work with my teammates and with our artists every day.
At SIGGRAPH 2021, most of the talks about Raya and the Last Dragon are actually from our artists, not our engineers!
Our artists had to come up with new crowd simulation techniques for handling the huge crowds seen in the movie <a href="https://doi.org/10.1145/3450623.3464650">[Nghiem 2021</a>, <a href="https://doi.org/10.1145/3450623.3464648">Luceño Ros et al. 2021]</a>, new cloth simulation techniques for all of the beautiful, super complex outfits worn by all of the characters <a href="https://doi.org/10.1145/3450623.3464660">[Kaur et al. 2021</a>, <a href="https://doi.org/10.1145/3450623.3464659">Kaur & Coetzee 2021]</a>, and even new effects techniques to simulate cooking delicious Southeast Asia-inspired food <a href="https://doi.org/10.1145/3450623.3464651">[Wang et al. 2021]</a>.</p>
<p>Finally, here are a bunch of stills from the movie, 100% rendered using Hyperion.
Normally I post somewhere between 40 to 70 stills per film, but I had so many favorite images from Raya and the Last Dragon that for this post, there are considerably more.
You may notice what looks like noise in the stills below- it’s not noise!
The actual renders are super clean thanks to Wayne’s volumes work and David Adler’s continued work on our Disney-Research-tech-based deep learning denoising system <a href="https://dl.acm.org/citation.cfm?id=3328150">[Dahlberg et al. 2019</a>, <a href="https://doi.org/10.1145/3197517.3201388">Vogels et al. 2018]</a>, but the film’s cinematography style called for adding film grain back in after rendering.</p>
<p>I’ve pulled these from marketing materials, trailers, and Disney+; as usual, I’ll try to update this post with higher quality stills once the film is out on Bluray.
Of course, the stills here are just a few of my favorites, and represent just a tiny fraction of the incredible imagery in this film.
If you like what you see here, I’d strongly encourage seeing the film on Disney+ or on Blu-Ray; whichever way, I suggest watching on the biggest screen you have available to you!</p>
<p>To try to help avoid spoilers, the stills below are presented in no particular order; however, if you want to avoid spoilers entirely, then please go watch the movie first and then come back here to be able to appreciate each still on its own!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_007.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_007.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_001.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_001.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_043.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_043.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_109.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_109.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_024.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_024.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_061.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_061.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_068.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_068.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_107.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_107.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_016.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_016.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_038.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_038.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_113.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_113.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_029.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_029.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_053.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_053.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_076.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_076.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_027.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_027.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_078.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_078.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_095.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_095.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_101.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_101.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_074.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_074.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_066.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_066.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_015.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_015.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_018.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_018.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_063.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_063.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_084.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_084.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_093.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_093.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_119.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_119.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_087.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_087.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_110.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_110.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_099.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_099.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_077.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_077.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_081.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_081.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_060.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_060.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_032.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_032.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_004.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_004.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_013.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_013.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_011.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_011.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_012.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_012.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_047.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_047.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_050.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_050.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_055.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_055.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_056.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_056.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_064.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_064.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_071.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_071.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_089.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_089.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_091.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_091.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_116.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_116.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_124.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_124.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_121.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_121.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_082.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_082.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_083.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_083.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_096.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_096.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_017.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_017.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_040.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_040.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_041.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_041.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_048.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_048.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_057.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_057.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_069.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_069.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_086.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_086.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_092.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_092.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_125.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_125.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_105.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_105.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_034.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_034.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_045.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_045.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_006.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_006.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_023.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_023.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_031.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_031.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_039.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_039.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_021.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_021.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_037.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_037.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_042.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_042.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_005.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_005.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_020.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_020.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_002.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_002.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_052.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_052.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_062.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_062.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_103.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_103.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_070.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_070.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_075.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_075.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_033.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_033.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_072.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_072.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_079.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_079.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_085.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_085.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_051.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_051.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_035.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_035.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_014.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_014.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_104.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_104.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_114.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_114.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_115.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_115.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_022.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_022.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_028.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_028.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_046.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_046.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_054.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_054.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_100.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_100.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_067.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_067.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_112.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_112.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_123.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_123.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_073.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_073.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_065.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_065.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_122.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_122.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_080.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_080.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_003.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_003.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_025.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_025.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_036.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_036.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_049.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_049.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_008.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_008.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_059.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_059.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_030.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_030.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_117.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_117.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_118.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_118.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_120.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_120.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_088.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_088.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_102.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_102.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_090.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_090.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_106.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_106.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_044.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_044.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_009.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_009.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_026.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_026.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_058.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_058.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_098.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_098.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_010.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_010.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_019.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_019.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_097.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_097.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_108.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_108.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_111.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_111.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_094.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_094.jpg" alt="" /></a></p>
<p>Here is the credits frame for Disney Animation’s rendering and visualization teams! The rendering and visualization teams are separate teams, but seeing them grouped together in the credits is very appropriate- we all are dedicated to making the best pixels possible for our films!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_credits.jpg"><img src="https://blog.yiningkarlli.com/content/images/2021/Mar/raya/RAYA_credits.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p>Also, one more thing: in theaters (and also on Disney+ starting in the summer), Raya and the Last Dragon is accompanied by our first new theatrical short in 5 years, called Us Again.
Us Again is one of my favorite shorts Disney Animation has ever made; it’s a joyous, visually stunning celebration of life and dance and music.
I’ll probably dedicate a separate post to Us Again once it’s out on Disney+.</p>
<p><strong>References</strong></p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182159">The Design and Evolution of Disney’s Hyperion Renderer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 33:1-33:22.</p>
<p>Marc Bryant, Ryan DeYoung, Wei-Feng Wayne Huang, Joe Longson, and Noel Villegas. 2021. <a href="https://doi.org/10.1145/3450623.3464676">The Atmosphere of Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 51:1-51:2.</p>
<p>Matt Jen-Yuan Chiang, Benedikt Bitterli, Chuck Tappan, and Brent Burley. 2016. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12830">A Practical and Controllable Hair and Fur Model for Production Path Tracing</a>. <em>Computer Graphics Forum</em>. 35, 2 (2016), 275-283.</p>
<p>Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. <a href="https://doi.org/10.1145/2897839.2927433">Practical and Controllable Subsurface Scattering for Production Path Tracing</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 49:1-49:2.</p>
<p>Petrik Clarberg, Wojciech Jarosz, Tomas Akenine-Möller, and Henrik Wann Jensen. 2005. <a href="https://dl.acm.org/doi/10.1145/1073204.1073328">Wavelet Importance Sampling: Efficiently Evaluating Products of Complex Functions</a>. <em>ACM Transactions on Graphics</em>. 24, 3 (2005), 1166-1175.</p>
<p>Henrik Dahlberg, David Adler, and Jeremy Newlin. 2019. <a href="https://dl.acm.org/citation.cfm?id=3328150">Machine-Learning Denoising in Feature Film Production</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 21:1-21:2.</p>
<p>Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. <a href="https://doi.org/10.1145/3084873.3084907">Production Volume Rendering</a>. In <em>ACM SIGGRAPH 2017 Courses</em>.</p>
<p>Wei-Feng Wayne Huang, Peter Kutz, Yining Karl Li, and Matt Jen-Yuan Chiang. 2021. <a href="https://doi.org/10.1145/3450623.3464644">Unbiased Emission and Scattering Importance Sampling for Heterogeneous Volumes</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 3:1-3:2.</p>
<p>Avneet Kaur and Johann Francois Coetzee. 2021. <a href="https://doi.org/10.1145/3450623.3464659">Wrapped Clothing on Disney’s Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 28:1-28:2.</p>
<p>Avneet Kaur, Erik Eulen, and Johann Francois Coetzee. 2021. <a href="https://doi.org/10.1145/3450623.3464660">Creating Diversity and Variety in the People of Kumandra for Disney’s Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 58:1-58:2.</p>
<p>Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. <a href="https://doi.org/10.1145/3072959.3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 111:1-111:16.</p>
<p>Alberto Luceño Ros, Kristin Chow, Jack Geckler, Norman Moses Joseph, and Nicolas Nghiem. 2021. <a href="https://doi.org/10.1145/3450623.3464648">Populating the World of Kumandra: Animation at Scale for Disney’s Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 39:1-39:2.</p>
<p>Mike Navarro and Jacob Rice. 2021. <a href="https://doi.org/10.1145/3450623.3464652">Stylizing Volumes with Neural Networks</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 54:1-54:2.</p>
<p>Jennifer Newfield and Josh Staub. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3409267">How Short Circuit Experiments: Experimental Filmmaking at Walt Disney Animation Studios</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 72:1-72:2.</p>
<p>Nicolas Nghiem. 2021. <a href="https://doi.org/10.1145/3450623.3464650">Mathematical Tricks for Scalable and Appealing Crowds in Walt Disney Animation Studios’ Raya and the Last Dragon</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 38:1-38:2.</p>
<p>Sean Palmer, Jonathan Garcia, Sara Drakeley, Patrick Kelly, and Ralf Habel. 2017. <a href="https://dl.acm.org/citation.cfm?id=3085067">The Ocean and Water Pipeline of Disney’s Moana</a>. In <em>ACM SIGGRAPH 2017 Talks</em>. 29:1-29:2.</p>
<p>Jacob Rice. 2021. <a href="https://doi.org/10.1145/3450623.3464647">Weaving the Druun’s Webbing</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 32:1-32:2.</p>
<p>Thijs Vogels, Fabrice Rousselle, Brian McWilliams, Gerhard Röthlin, Alex Harvill, David Adler, Mark Meyer, and Jan Novák. 2018. <a href="https://doi.org/10.1145/3197517.3201388">Denoising with Kernel Prediction and Asymmetric Loss Functions</a>. <em>ACM Transactions on Graphics</em>. 37, 4 (2018), 124:1-124:15.</p>
<p>Cong Wang, Dale Mayeda, Jacob Rice, Thom Whicks, and Benjamin Huang. 2021. <a href="https://doi.org/10.1145/3450623.3464651">Cooking Southeast Asia-Inspired Soup in Animated Film</a>. In <em>ACM SIGGRAPH 2021 Talks</em>. 35:1-35:2.</p>
https://blog.yiningkarlli.com/2020/07/shipshape-renderman-challenge.html
Shipshape RenderMan Art Challenge
2020-07-31T00:00:00+00:00
2020-07-31T00:00:00+00:00
Yining Karl Li
<div>
<p>Last year, I <a href="https://blog.yiningkarlli.com/2019/11/woodville-renderman-challenge.html">participated in one of Pixar’s RenderMan Art Challenges</a> as a way to learn more about modern RenderMan <a href="https://dl.acm.org/citation.cfm?id=3182162">[Christensen et al. 2018]</a> and as a way to get some exposure to tools outside of my normal day-to-day toolset (Disney’s Hyperion Renderer professionally, Takua Renderer as a hobby and learning exercise).
I had a lot of fun, and wound up doing better in the “Woodville” art challenge contest than I expected to!
Recently, I entered another one of <a href="https://renderman.pixar.com/news/renderman-shipshape-art-challenge">Pixar’s RenderMan Art Challenges, “Shipshape”</a>.
This time around I entered just for fun; since I had so much fun last time, I figured why not give it another shot!
That being said though, I want to repeat the main point I made in my post about the previous “Woodville” art challenge: I believe that for rendering engineers, there is enormous value in learning to use tools and renderers that aren’t the ones we work on ourselves.
Our field is filled with brilliant people on every major rendering team, and I find both a lot of useful information/ideas and a lot of joy in seeing the work that friends and peers across the field have put into commercial renderers such as RenderMan, Arnold, Vray, Corona, and others.</p>
<p>As usual for the RenderMan Art Challenges, Pixar <a href="https://renderman.pixar.com/shipshape-pup-asset">supplied some base models</a> without any UVs, texturing, shading, lighting, or anything else, and challenge participants had to start with the base models and come up with a single compelling image for a final entry.
I had a lot of fun spending evenings and weekends throughout the duration of the contest to create my final image, which is below.
I got to explore and learn a lot of new things that I haven’t tried before, which this post will go through.
To my enormous surprise, this time around my entry <a href="https://renderman.pixar.com/news/renderman-shipshape-art-challenge-final-results">won first place in the contest</a>!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/shipshape_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/shipshape_full.jpg" alt="Figure 1: My entry to Pixar's RenderMan Shipshape Art Challenge, titled "Oh Good, The Bus is Here". Click for 4K version. Base ship, robot, and sextant models are from Pixar; all shading, lighting, additional modeling, and environments are mine. Ship concept by Ian McQue. Robot concept by Ruslan Safarov. Models by Cheyenne Chapel, Aliya Chen, Damian Kwiatkowski, Alyssa Minko, Anthony Muscarella, and Miguel Zozaya © Disney / Pixar - RenderMan "Shipshape" Art Challenge." /></a></p>
<p><strong>Initial Explorations</strong></p>
<p>For this competition, Pixar provided five models: a futuristic scifi ship based on an Ian McQue concept, a robot based on a Ruslan Safarov concept, an old wooden boat, a butterfly, and a sextant.
The fact that one of the models was based on an Ian McQue concept was enough to draw me in; I’ve been a big fan of Ian McQue’s work for many years now!
I like to start these challenges by just rendering the provided assets as-is from a number of different angles, to try to get a sense of what I like about the assets and how I will want to showcase them in my final piece.
I settled pretty quickly on wanting to focus on the scifi ship and the robot, and leave the other three models aside.
I did find an opportunity to bring in the sextant in my final piece as well, but wound up dropping the old wooden boat and the butterfly altogether.
Here are some simple renders showing what was provided out of the box for the scifi ship and the robot:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/scifiship_base.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/scifiship_base.jpg" alt="Figure 2: Scifi ship base model provided by Pixar, rendered against a white cyclorama background using a basic skydome." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/robot_base.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/robot_base.jpg" alt="Figure 3: Robot base model provided by Pixar, rendered against a white cyclorama background using a basic skydome." /></a></p>
<p>I initially had a lot of trouble settling on a concept and idea for this project; I actually started blocking out an entirely different idea before pivoting to the idea that eventually became my final image.
My initial concept included the old wooden boat in addition to the scifi ship and the robot; this initial concept was called “River Explorer”.
My initial instinct was to try to show the scifi ship from a top-down view, in order to get a better view of the deck-boards and the big VG engine and the crane arm.
I liked the idea of putting the camera at roughly forest canopy height, since canopy height is a bit of an unusual perspective for most photographs: too high off the ground for people to shoot from, but too low for helicopters or drones to be practical.
My initial idea was about a robot-piloted flying patrol boat exploring an old forgotten river in a forest; the ship would be approaching the old sunken boat in the river water.
With this first concept, I got as far as initial compositional blocking and initial time-of-day lighting tests:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_012.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/progress_012.jpg" alt="Figure 4: Initial "River Explorer" concept, daylight lighting test." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_013.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/progress_013.jpg" alt="Figure 5: Initial "River Explorer" concept, dusk lighting test." /></a></p>
<p>If you’ve followed my blog for a while now, those pine trees might look familiar.
They’re actually the same trees from <a href="https://blog.yiningkarlli.com/2018/10/bidirectional-mipmap.html">the forest scene I used a while back</a>, ported from Takua’s shading system to RenderMan’s PxrSurface shader.</p>
<p>I wasn’t ever super happy with the “River Explorer” concept; I think the overall layout was okay, but it lacked a sense of dynamism and overall just felt very static to me, and the robot on the flying scifi ship felt kind of lost in the overall composition.
Several other contestants wound up also going for similar top-down-ish views, which made me worry about getting lost in a crowd of similar-looking images.
After a week of trying to get the “River Explorer” concept to work better, I started to play with some completely different ideas; I figured that this early in the process, a better idea was worth more than a week’s worth of sunk time.</p>
<p><strong>Layout and Framing</strong></p>
<p>I had started UV unwrapping the ship already, and whilst tumbling around the ship unwrapping all of the components one-by-one, I got to see a lot more of the ship and a lot more interesting angles, and I suddenly came up with a completely different idea for my entry.
The idea that popped into my head was to have a bunch of the little robots waiting to board one of the flying ships at a quay or something of the sort.
I wanted to convey a sense of scale between the robots and the flying scifi ship, so I tried putting the camera far away and zooming in using a really long lens.
Since long lenses have the effect of flattening perspective a bit, using a long lens helped make the ships feel huge compared to the robots.
At this point I was just doing very rough, quick, AO render “sketches”.
This is the AO sketch where my eventual final idea started:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_015.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_015.jpg" alt="Figure 6: Rough AO render "sketch" that eventually evolved into my final idea." /></a></p>
<p>I’ve always loved the idea of the mundane fantastical; the flying scifi ship model is fairly fantastical, which led me to want to do something more everyday with it.
I thought it would be fun to texture the scifi ship model as if it was just part of a regular metro system that the robots use to get around their world.
My wife, Harmony, suggested a fun idea: set the entire scene in drizzly weather and give two of the robots umbrellas, but give the third robot a briefcase instead and have the robot use the briefcase as a makeshift umbrella, as if it had forgotten its umbrella at home.
The umbrella-less robot’s reaction to seeing the ship arriving provided the title for my entry- “Oh Good, The Bus Is Here”.
Harmony also pointed out that the back of the ship has a lot more interesting geometric detail compared to the front of the ship, and suggested placing the focus of the composition more on the robots than on the ships.
To incorporate all of these ideas, I played more with the layout and framing until I arrived at the following image, which is broadly the final layout I used:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_019.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_019.jpg" alt="Figure 7: Rough AO render "sketch" of my final layout." /></a></p>
<p>I chose to put an additional ship in the background flying away from the dock for two main reasons.
First, I wanted to be able to showcase more of the ship, since the front ship is mostly obscured by the foreground dock.
Second, the background ship helps fill out and balance the right side of the frame more, which would otherwise have been kind of empty.</p>
<p>In both this project and the previous Art Challenge, my workflow for assembling the final scene relies heavily on Maya’s referencing capabilities.
Each separate asset is kept in its own .ma file, and all of the .ma files are referenced into the main scene file.
The only things the main scene file contains are references to assets, along with scene-level lighting, overrides, and global-scale effects such as volumes and, in the case of this challenge, the rain streaks.
So, even though the flying scifi ship appears in my scene twice, it is actually just the same .ma file referenced into the main scene twice instead of two separate ships.</p>
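<p>For reference, the same referencing setup can be written as a few lines of maya.cmds; this is just a hypothetical sketch with made-up file paths and namespaces, not my actual scene assembly setup:</p>
<pre><code>import maya.cmds as cmds

# reference the ship asset twice under different namespaces; both references point
# at the same source .ma file but can be placed and overridden independently
for namespace in ("scifiShipFore", "scifiShipBack"):
    cmds.file("assets/scifiShip/scifiShip.ma", reference=True, namespace=namespace)
</code></pre>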
<p>The idea of a rainy scene largely drove the later lighting direction of my entry; from this point I basically knew that the final scene was going to have to be overcast and drizzly, with a heavy reliance on volumes to add depth separation into the scene and to bring out practical lights on the ships.
I had a lot of fun modeling out the dock and gangway, and may have gotten slightly carried away.
I modeled every single bolt and rivet that you would expect to be there in real life, and I also added lampposts to use later as practical light sources for illuminating the dock and the robots.
Once I had finished modeling the dock and had made a few more layout tweaks, I arrived at a point where I was happy to start with shading and initial light blocking.
Zoom in if you want to see all of the rivets and bolts and stuff on the dock:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_032.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/progress_032.jpg" alt="Figure 8: AO render of my layout going into shading and lighting. Check out all of the crazy detail on the dock that I modeled!" /></a></p>
<p><strong>UV Unwrapping</strong></p>
<p>UV unwrapping the ship took a ton of time.
For the last challenge, I relied on a combination of manual UV unwrapping by hand in Maya and using <a href="https://www.sidefx.com/tutorials/houdini-game-dev-tools-auto-uvs/">Houdini’s Auto UV SOP</a>, but I found that the Auto UV SOP didn’t work as well on this challenge due to the ship and robot having a lot of strange geometry with really complex topology.
On the treehouse in the last challenge, everything was more or less some version of a cylinder or a rectangular prism, with some morphs and warps and extra bits and bobs applied.
Almost every piece of the ship aside from the floorboards is a complex shape that isn’t easy to find good seams for, so the Auto UV SOP wound up making a lot of choices for UV cuts that I didn’t like.
As a result, I basically manually UV unwrapped this entire challenge in Maya.</p>
<p>A lot of the complex undercarriage type stuff around the back thrusters on the ship was really insane to unwrap.
The muffler manifold and mechanical parts of the crane arm were difficult too.
Fortunately though, the models came with subdivision creases, and a lot of the subd crease tags wound up being useful hints for where to place UV edge cuts.
I also found that the new and improved UV tools in Maya 2020 performed way better than the UV tools in Maya 2019.
For some meshes, I manually placed UV cuts and then used the unfold tool in Maya 2020, which I found generally worked a lot better than Maya 2019’s version of the same tool.
For other meshes, Maya 2020’s auto unwrap actually often provided a useful starting place as long as I rotated the piece I was unwrapping into a more-or-less axis-aligned orientation and froze its transform.
After using the auto-unwrap tool, I would then transfer the UVs back onto the piece in its original orientation using Maya’s Mesh Transfer Attributes tool.
The auto unwrap tended to cut meshes into too many UV islands, so I would then re-stitch islands together and place new cuts where appropriate.</p>
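<p>Transfer steps like this can also be scripted; as a rough maya.cmds sketch of the same idea (the mesh names are made up, and the flag values reflect my understanding of the Transfer Attributes options rather than anything authoritative), copying UVs from an unwrapped mesh onto another mesh with identical topology looks something like:</p>
<pre><code>import maya.cmds as cmds

def copy_uvs_by_topology(source, targets):
    """Copy all UV sets from source onto meshes that share its topology."""
    for target in targets:
        cmds.transferAttributes(source, target,
                                transferUVs=2,   # all UV sets
                                sampleSpace=5)   # sample by topology, not position
        # bake the result so the transfer node doesn't linger in construction history
        cmds.delete(target, constructionHistory=True)

copy_uvs_by_topology('hullPlate_unwrapped', ['hullPlate_copy_01', 'hullPlate_copy_02'])
</code></pre>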
<p>When UV unwrapping, a good test of how well the resultant UVs turned out is to assign some sort of checkerboard grid texture to the model and look for distortion in the checkerboard pattern.
Overall I think I did an okay job here; not terrible, but could be better.
I think I managed to hide the vast majority of seams pretty well, and the total distortion isn’t too bad (if you look closely, you’ll be able to pick out some less than perfect areas, but it was mostly okay).
I wound up with a high degree of variability in the grid size between different areas, but I wasn’t too worried about that since my plan was to adjust texture resolutions to match.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/ship_uvs.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/ship_uvs.jpg" alt="Figure 9: Checkerboard test for my UV unwrapping of the scifi ship." /></a></p>
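<p>Any grid texture works for this test; if you don’t have one handy, a procedural checker is nearly a one-liner. A minimal sketch of the idea (the parameter names are mine):</p>
<pre><code>import math

def uv_checker(u, v, squares=16):
    """Alternating 0/1 checker over UV space. On the model, stretched or sheared
    squares reveal UV distortion, and breaks in the pattern reveal seams."""
    return (int(math.floor(u * squares)) + int(math.floor(v * squares))) % 2
</code></pre>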
<p>After UV unwrapping the ship, UV unwrapping the robot proved to be a lot easier in comparison.
Many parts of the robot turn out to be the same mesh just duplicated and then squashed, stretched, scaled, or rotated, which means that they share the same underlying topology.
For all parts that share the same topology, I was able to just UV unwrap one of them, and then copy the UVs to all of the others.
One great example is the robot’s fingers; most components across all fingers shared the same topology.
Here’s the checkerboard test applied to my final UVs for the robot:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/robot_uvs.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/robot_uvs.jpg" alt="Figure 10: Checkerboard test for my UV unwrapping of the robot." /></a></p>
<p><strong>Texturing the Ship</strong></p>
<p>After trying out Substance Painter for the previous RenderMan Art Challenge and getting fairly good results, I went with Substance Painter again on this project.
The overall texturing workflow I used on this project was actually a lot simpler compared with the workflow I used for the previous Art Challenge.
Last time I tried to leave a lot of final decisions about saturation and hue and whatnot as late as possible, which meant moving those decisions into the shader so that they could be changed at render-time.
This time around, I decided to make those decisions upfront in Substance Painter; doing so makes the workflow much simpler, since it means I can just paint colors directly like a normal person would, as opposed to painting greyscale or desaturated maps that are expected to be modulated in the shader later.
Also, because of the nature of the objects in this project, I actually used very little displacement mapping; most detail was brought in through normal mapping, which makes more sense for hard surface metallic objects.
Not having to worry about any kind of displacement mapping simplified the Substance Painter workflow a bit more too, since that was one fewer texture map type to manage.</p>
<p>On the last challenge I relied on a lot of Quixel Megascans surfaces as starting points for texturing, but this time around I (unintentionally) found myself relying more on Substance smart materials as starting points.
One thing I like about Substance Painter is how it comes with a number of good premade smart materials, and there are even more good smart materials on Substance Source.
Importantly though, I believe that smart materials should only serve as a starting point; smart materials can look decent out-of-the-box, but to really make texturing shine, a lot more work is required on top of that out-of-the-box result in order to create story, character, and a unique look.
I don’t like when I see renders online where a smart material was applied and left in its out-of-the-box state; something gets lost when I can tell which default smart material was used at a glance!
For every place that I used a smart material in this project, I used a smart material (or several smart materials layered and kitbashed together) as a starting point, but then heavily customized on top with custom paint layers, custom masking, decals, additional layers, and often even heavy custom modifications to the smart material itself.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/substance_screenshot.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/substance_screenshot.jpg" alt="Figure 11: Texturing the main piece of the ship's hull in Substance Painter." /></a></p>
<p>I was originally planning on using a UDIM workflow for bringing the ship into Substance Painter, but I wound up with so many UDIM tiles that things quickly became unmanageable and Substance Painter ground to a halt with a gigantic file containing 80 (!!!) 4K UDIM tiles.
To work around this, I broke up the ship into a number of smaller groups of meshes and brought each group into Substance Painter separately.
Within each group I was able to use a UDIM workflow with usually between 5 to 10 tiles.</p>
<p>I had a lot of fun creating custom decals to apply to various parts of the ships and to some of the robots; even though a lot of the details and decals aren’t very visible in the final image, I still put a good amount of time into making them simply to keep things interesting for myself.
All of the decals were made in Photoshop and Illustrator and then brought in to Substance Painter along with opacity masks and applied to surfaces using Substance Painter’s projection mode, either in world space or in UV space depending on situation.
In Substance Painter, I created a new layer with a custom paint material and painted the base color for the paint material by projecting the decal, and then masked the decal layer with the opacity mask I made using the same projection as the base color.
The “Seneca” logo seen throughout my scene has <a href="https://blog.yiningkarlli.com/2016/07/minecraft-in-renderman-ris.html">shown up on my blog before</a>!
A few years ago on a Minecraft server that I played a lot on, a bunch of other players and I had a city named Seneca; ever since then, I’ve tried to sneak in little references to Seneca in projects here and there as a small easter egg.</p>
<p>Many of the buses around where I live have an orange and silver color scheme, and while I was searching the internet for reference material, I also found pictures of the Glasgow Subway’s trains, which have an orange and black and white color scheme.
Inspired by the above, I picked an orange and black color scheme for the ship’s Seneca Metro livery.
I like orange as a color, and I figured that orange would bring a nice pop of color to what was going to be an overall relatively dark image.
I made the upper part of the hull orange but kept the lower part of the hull black since the black section was going to be the backdrop that the robots would be in front of in the final image; the idea was that keeping that part of the hull darker would allow the robots to pop a bit more visually.</p>
<p>One really useful trick I used for masking different materials was to just follow edgeloops that were already part of the model.
Since everything in this scene is very mechanical anyway, following straight edges in the UVs helps give surfaces a manufactured, mechanical look.
For example, Figure 12 shows how I used Substance Painter’s Polygon Fill tool to mask out the black paint from the back metal section of the ship’s thrusters.
In some other cases, I added new edgeloops to the existing models just so I could follow the edgeloops while masking different layers.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/substance_uvmask.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/substance_uvmask.jpg" alt="Figure 12: Masking in the metal section of the ship's thrusters by following existing edgeloops using Substance Painter's Polygon Fill tool." /></a></p>
<p><strong>Shading the Ship</strong></p>
<p>For the previous Art Challenge, I used a combination of PxrDisney and PxrSurface shaders; this time around, in order to get a better understanding of how PxrSurface works, I opted to go all-in on using PxrSurface for everything in the scene.
Also, for the rain streaks effect (discussed later in this post), I needed some features that are available in the extended Disney Bsdf model <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015]</a> and in PxrSurface <a href="http://graphics.pixar.com/library/PxrMaterialsCourse2017/index.html">[Hery and Ling 2017]</a>, but RenderMan 23 only implements the base Disney Brdf <a href="https://doi.org/10.1145/2343483.2343493">[Burley 2012]</a> without the extended Bsdf features; this basically meant I had to use PxrSurface.</p>
<p>One of the biggest differences I had to adjust to was how metallic color is controlled in PxrSurface.
The Disney Bsdf drives the diffuse color and metallic color using the same base color parameter and shifts energy between the diffuse/spec and metallic lobes using a “metallic” parameter, but PxrSurface separates the diffuse and metallic colors entirely.
PxrSurface uses a “Specular Face Color” parameter to directly drive the metallic lobe and has a separate “Specular Edge Color” control; this parameterization reminds me a lot of Framestore’s artist-friendly metallic fresnel parameterization <a href="http://jcgt.org/published/0003/04/03/">[Gulbrandsen 2014]</a>, but I don’t know if this is actually what PxrSurface is doing under the hood.
PxrSurface also has two different modes for its specular controls: an “artistic” mode and a “physical” mode; I only used the artistic mode.
To be honest, while PxrSurface’s extensive controls are extremely powerful and offer an enormous degree of artistic control, I found trying to understand what every control did and how they interacted with each other to be kind of overwhelming.
I wound up paring the set of controls I used back to a small subset that I could mentally map to what the Disney Bsdf or VRayMtl or Autodesk Standard Surface <a href="https://autodesk.github.io/standard-surface/">[Georgiev et al. 2019]</a> models do.</p>
<p>Fortunately, converting from the Disney Bsdf’s baseColor/metallic parameterization to PxrSurface’s diffuse/specFaceColor is very easy:</p>
<div>\[ diffuse = baseColor * (1 - metallic) \\ specFaceColor = baseColor * metallic \]</div>
<p>The only gotcha to look out for is that everything needs to be in linear space first.
Alternatively, Substance Painter already has an output template for PxrSurface as well.
Once I had the maps in the right parameterization, for the most part all I had to do was plug the right maps into the right parameters in PxrSurface and then make minor manual adjustments to dial in the look.
In addition to two different specular parameterization modes, PxrSurface also supports choosing from a few different microfacet models for the specular lobes; by default PxrSurface is set to use the Beckmann model <a href="https://us.artechhouse.com/The-Scattering-of-Electromagnetic-Waves-from-Rough-Surfaces-P257.aspx">[Beckmann and Spizzichino 1963]</a>, but I selected the GGX model <a href="http://dx.doi.org/10.2312/EGWR/EGSR07/195-206">[Walter et al. 2007]</a> for everything in this scene since GGX is what I’m more used to.</p>
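<p>To make that conversion concrete, here’s a rough standalone sketch in Python of remapping a single baseColor/metallic texel, including the linearization step mentioned above; this is just an illustrative snippet, not part of my actual Substance Painter to RenderMan workflow.</p>
<pre><code>
# Illustrative sketch only: convert a Disney-style baseColor/metallic texel
# into PxrSurface's diffuse/specular face color parameterization.

def srgb_to_linear(c):
    # Standard sRGB decoding; apply per channel before doing any of the math.
    if c > 0.04045:
        return ((c + 0.055) / 1.055) ** 2.4
    return c / 12.92

def disney_to_pxrsurface(base_color_srgb, metallic):
    # base_color_srgb: (r, g, b) in 0-1 sRGB; metallic: 0-1 scalar.
    base_linear = tuple(srgb_to_linear(c) for c in base_color_srgb)
    diffuse = tuple(c * (1.0 - metallic) for c in base_linear)
    spec_face_color = tuple(c * metallic for c in base_linear)
    return diffuse, spec_face_color

# Example: a half-metallic orange paint swatch.
diffuse, spec_face = disney_to_pxrsurface((0.9, 0.45, 0.1), 0.5)
</code></pre>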
<p>For the actual look of the ship, I didn’t want to go with the dilapidated look that a lot of the other contestants went with.
Instead, I wanted the ship to look like it was a well maintained working vehicle, but with all of the grime and scratches that build up over daily use.
So, there are scratches and dust and dirt streaks on the boat, but nothing is actually rusting.
I also modeled some glass for the windows at the top of the tower superstructure, and added some additional lamps to the top of the ship’s masts and on the tower superstructure for use in lighting later.
After getting everything dialed in, here is the “dry” look of the ship:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/ship_shading_progress_angle3_31.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/ship_shading_progress_angle3_31.jpg" alt="Figure 13: Fully shaded "dry" look for the ship." /></a></p>
<p>Here’s a close-up render of the back engine section of the ship, which has all kinds of interesting bits and bobs on it.
The engine exhaust kind of looks like it could be a volume, but it’s not.
I created the engine exhaust by making a bunch of cards, arranging them into a truncated cone, and texturing them with a blue gradient in the diffuse slot and a greyscale gradient in PxrSurface’s “presence” slot.
The glow effect is done using the glow parameter in PxrSurface.
The nice thing about using this more cheat-y approach instead of a real volume is that it’s way faster to render!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/ship_shading_progress_angle2_23.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/ship_shading_progress_angle2_23.jpg" alt="Figure 14: Fully shaded "dry" look for the back engine area of the ship." /></a></p>
<p>Most of the ship’s metal components are covered over using a black, semi-matte paint material, but in areas that I thought would be subjected to high temperatures, such as exhaust vents or the inside of the thrusters or the many floodlights on the ship, I chose to use a beaten copper material instead.
Basically wherever I wound up placing a practical light, the housing around the practical light is made of beaten copper.
Well, I guess it’s actually some kind of high-temperature copper alloy or copper-colored composite material, since real copper’s melting point is lower than real steel’s melting point.
The copper color had an added nice effect of making practical lights look more yellow-orange, which I think helps sell the look of engine thrusters and hot exhaust vents more.</p>
<p>Each exhaust vent and engine thruster actually contains two practical lights: one extremely bright light near the back of the vent or thruster pointing into the vent or thruster, and one dimmer but more saturated light pointing outwards.
This setup produces a nice effect where areas deeper into the vent or thruster look brighter and yellower, while areas closer to the outer edge of the vent or thruster look a bit dimmer and more orange.
The light pointing outwards also casts light outside of the vent or thruster, providing some neat illumination on nearby surfaces or volumes.
Later in this post, I’ll write more about how I made use of this in the final image.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/ship_shading_progress_angle4_07.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/ship_shading_progress_angle4_07.jpg" alt="Figure 15: Wide view of the back of the ship, showing the practical lights in the ship's various engine thrusters and exhaust vents." /></a></p>
<p>Here’s a turntable video of the ship, showcasing all of the texturing and shading that I did.
I had a lot of fun taking care of all of the tiny details that are part of the ship, even though many of them aren’t actually visible in my final image.
The dripping wet rain effect is discussed later in this post.</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/433151006?loop=1" frameborder="0">Shipshape Art Challenge Ship Turntable</iframe></div>
<div class="figcaption">Figure 16: Turntable of the ship showing both dry and wet variants.</div>
<p><strong>Shading and Texturing the Robots</strong></p>
<p>For the robots, I used the same Substance Painter based texturing workflow and the same PxrSurface based shading workflow that I used for the ship.
However, since the robot has far fewer components than the ship, I was able to bring all of the robot’s UDIM tiles into Substance Painter at once.
The main challenge with the robots wasn’t the sheer quantity of parts that had to be textured, but instead was in the variety of robot color schemes that had to be made.
In order to populate the scene and give my final image a sense of life, I wanted to have a lot of robots on the ships, and I wanted all of the robots to have different paint and color schemes.</p>
<p>I knew from an early point that I wanted the robot carrying the suitcase to be yellow, and I knew I wanted a robot in some kind of conductor’s uniform, but aside from that, I didn’t much pre-planned for the robot paint schemes.
As a result, coming up with different robot paint schemes was a lot of fun and involved a lot of just goofing around and improvisation in Substance Painted until I found ideas that I liked.
To help unify how all of the robots looked and to help with speeding up the texturing process, I came up with a base metallic look for the robot’s legs and arms and various functional mechanical parts.
I alternated between steel and copper parts to help bring some visual variety to all of the mechanical parts.
The metallic parts are the same across all of the robots; the parts that vary between robots are the body shell and various outer casing parts on the arms:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/yellow_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/yellow_bot.jpg" alt="Figure 17: Robot with steel and copper mechanical parts and yellow outer shell." /></a></p>
<p>I wanted very different looks for the other two robots that are on the dock with the yellow robot.
I gave one of them a more futuristic looking white glossy shell with a subtle hexagon imprint pattern and red accents.
The hexagon imprint pattern is created using a hexagon pattern in the normal map.
The red stripes use the same edgeloop-following technique that I used for masking some layers on the ship.
I made the other robot a matte green color, and I thought it would be fun to make him into a sports fan.
He’s wearing the logo and colors of the local in-world sports team, the Seneca Senators!
Since the robots don’t wear clothes per se, I guess maybe the sports team logo and numbers are some kind of temporary sticker?
Or maybe this robot is such a big fan that he had the logo permanently painted on… I don’t know!
Since I knew these two robots would be seen from the back in the final image, I made sure to put all of the interesting stuff on their sides and back.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/white_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/white_bot.jpg" alt="Figure 18: Futuristic robot with glossy white outer shell and red accents." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/green_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/green_bot.jpg" alt="Figure 19: Sports fan robot wearing the colors of the in-world team, the Seneca Senators." /></a></p>
<p>For the conductor robot, I chose a blue and gold color scheme based on real world conductor uniforms I’ve seen before.
I made the conductor robot overall a bit more cleaned up compared to the other robots, since I figured the conductor robot should look a bit more crisp and professional.
I also gave the conductor robot a gold mustache, for a bit of fun!
To complete the look, I modeled a simple conductor’s hat for the conductor robot to wear.
I also made a captain robot, which has a white/black/gold color scheme derived from the conductor robot.
The white/black/gold color scheme is based on old-school ship’s captain uniforms.
The captain robot required a bit of a different hat from the conductor hat; I made the captain hat a little bigger and a little bit more elaborate, complete with gold stitching on the front around the Seneca Metro emblem.
In the final scene you don’t really see the captain robots, since they wound up inside of the wheelhouse at the top of the ship’s tower superstructure, but hey, at least the captain robots were fun to make, and at least I know that they’re there!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/conductor_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/conductor_bot.jpg" alt="Figure 20: Conductor robot with a blue and gold color scheme and a hat!" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/captain_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/captain_bot.jpg" alt="Figure 21: Captain robot with a white and black and gold color scheme and an even fancier hat." /></a></p>
<p>As a bit of a joke, I tried making a poncho for one of the robots.
I thought it would look very silly, which for me was all the more reason to try!
To make the poncho, I made a big flat disc in Maya and turned it into nCloth, and just let it fall onto the robot with the robot’s geometry acting as a static collider.
This approach basically worked out-of-the-box, although I made some manual edits to the geometry afterwards just to get the poncho to billow a bit more on the bottom.
The poncho’s shader is a simple glass PxrSurface shader, with the bottom frosted section and smooth diamond-shaped window section both driven using just roughness.
The crinkly plastic sheet appearance is achieved entirely through a wrinkle normal map.
The poncho bot is also not really visible in the final image, but somewhere in the final image, this robot is in the background on the deck of the front ship behind some other robots!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/poncho_bot.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/poncho_bot.jpg" alt="Figure 22: Robot wearing a clear plastic poncho." /></a></p>
<p>Don’t worry, I didn’t forget about the fact that the robots have antennae!
For the poncho robot, I modeled a hole into the poncho for the antenna to pass through, and I modeled similar holes into the captain robot and conductor robot’s hats as well.
Again, this is a detail that isn’t visible in the final image at all, but is there mostly just so that I can know that it’s there:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/poncho_antenna_hole.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/poncho_antenna_hole.jpg" alt="Figure 23: Antenna pass-through hole modeled into the poncho." /></a></p>
<p>In total I created 12 different unique robot variants, which some variants duplicated in the final image.
All 12 variants are actually present in the scene!
Most of them are in the background (and a few variants are only on the background ship), so most of them aren’t very visible in the final image.
You, the reader, have probably noticed a theme in this post now where I put a lot of effort into things that aren’t actually visible in the final image… for me, a large part of this project wasn’t necessarily about the final image and was instead just about having fun and getting some practice with the tools and workflows.</p>
<p>Here is a turntable showcasing all 12 robot variants.
In the turntable, only the yellow robot has both a wet and dry variant, since all of the other robots in the scene remembered their umbrellas and were therefore able to stay dry.
The green sports fan robot does have a variant with a wet right arm though, since in the final image the green sports fan robot’s right arm is extended beyond the umbrella to wave at the incoming ship.</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/433151137?loop=1" frameborder="0">Shipshape Art Challenge Robots Turntable</iframe></div>
<div class="figcaption">Figure 24: Turntable of the robots, with all 12 robot variants.</div>
<p><strong>The Wet Shader</strong></p>
<p>Going into the shading process, the single problem that worried me the most was how I was going to make everything in the rain look wet.
Having a good wet look is extremely important for selling the overall look of a rainy scene.
I actually wasn’t too worried about the base dry shading, since hard metal/plastic surfaces are one of the things that CG is really good at by default.
By contrast, getting a good wet rainy look took an enormous amount of experimentation and effort, and wound up even involving some custom tools.</p>
<p>From a cursory search online, I found some techniques for creating a wet rainy look that basically work by modulating the primary specular lobe and applying a normal map to the base normal of the surface.
However, I didn’t really like how this looked; in some cases, this approach basically makes it look like the underlying surface itself has rivulets and dots in it, not like there’s water running on top of the surface.
My hunch was to use PxrSurface’s clearcoat lobe instead, since from a physically motivated perspective, water streaks and droplets behave more like an additional transparent refractive coating layer on top of a base surface.
A nice bonus of using the clearcoat lobe is that PxrSurface supports a different normal map for each specular lobe; this way, I could plug a dedicated water droplets and streaks normal map into the clearcoat lobe’s bump normal parameter without disturbing whatever normal map was already plugged into the bump normal parameter for the base diffuse and primary specular lobes.
My idea was to create a single shading graph for creating the wet rainy look, and then plug this graph into the clearcoat lobe parameters for any PxrSurface that I wanted a wet appearance for.
Here’s what the final graph looked like:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/wetshader_graph.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/wetshader_graph.png" alt="Figure 25: Shading graph for creating the wet rainy look. This graph plugs into the clearcoat parameters of any shader that I wanted to have a wet appearance." /></a></p>
<p>In the graph above, note how the input textures are fed into PxrRemap nodes for ior, edge color, thickness, and roughness; this is so I can rescale the 0-1 range inputs from the textures to whatever they need to be for each parameter.
The node labeled “mastercontrol” allows for disabling the entire wet effect by feeding 0.0 into the clearcoat edge color parameter, which effectively disables the clearcoat lobe.</p>
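<p>As a rough illustration of what those PxrRemap nodes and the “mastercontrol” switch are doing conceptually, here’s a small Python sketch; the specific output ranges below are made up for illustration and aren’t the exact values from my scene.</p>
<pre><code>
# Illustrative sketch only: rescale 0-1 wet-map values into per-parameter
# ranges for the clearcoat lobe, with a master control that can zero out the
# clearcoat edge color to disable the whole wet effect.

def remap(value, out_min, out_max):
    # Conceptually what a PxrRemap node does for an input assumed to be in [0, 1].
    return out_min + value * (out_max - out_min)

def wet_clearcoat_params(wet_mask, master_control=1.0):
    # wet_mask: 0-1 greyscale sample from the rivulet/droplet maps.
    # The parameter names and ranges below are illustrative placeholders.
    return {
        "clearcoatEdgeColor": remap(wet_mask, 0.0, 1.0) * master_control,
        "clearcoatRoughness": remap(wet_mask, 0.25, 0.02),  # wetter = glossier
        "clearcoatThickness": remap(wet_mask, 0.0, 0.6),    # wetter = darker base
        "clearcoatIor": 1.33,                                # water-ish IOR
    }

# Setting master_control to 0.0 feeds black into the edge color, which
# effectively switches the clearcoat lobe (and therefore the wet look) off.
dry = wet_clearcoat_params(0.7, master_control=0.0)
</code></pre>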
<p>Having to manually connect this graph into all of the clearcoat parameters in each PxrSurface shader I used was a bit of a pain.
Ideally I would have preferred if I could have just plugged all of the clearcoat parameters into a PxrLayer, disabled all non-clearcoat lobes in the PxrLayer, and then plugged the PxrLayer into a PxrLayerSurface on top of underlying base layers.
Basically, I wish PxrLayerSurface supported enabling/disabling layers on a per-lobe basis, but this ability currently doesn’t exist in RenderMan 23.
In Disney’s Hyperion Renderer, we support this functionality for sparsely layering Disney Bsdf parameters <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015]</a>, and it’s really really useful.</p>
<p>There are only four input maps required for the entire wet effect: a greyscale rain rivulets map, a corresponding rain rivulets normal map, a greyscale droplets map, and a corresponding droplets normal map.
The rivulets maps are used for the sides of a PxrRoundCube projection node, while the droplets maps are used for the top of the PxrRoundCube projection node; this makes the wet effect look more like rain drop streaks the more vertical a surface is, and more like droplets splashing on a surface the more horizontal a surface is.
Even though everything in my scene is UV mapped, I chose to use PxrRoundCube to project the wet effect on everything in order to make the wet effect as automatic as possible; to make sure that repetitions in the wet effect textures weren’t very visible, I used a wide transition width for the PxrRoundCube node and made sure that the PxrRoundCube’s projection was rotated around the Y-axis to not be aligned with any model in the scene.</p>
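<p>The effect of putting the rivulet maps on the sides of the projection and the droplet maps on the top can be thought of as a blend driven by how upward-facing the surface is. Here’s a toy Python sketch of that idea; it’s an approximation of the behavior, not how PxrRoundCube is actually implemented.</p>
<pre><code>
# Illustrative sketch only: blend between the droplet map and the rivulet map
# based on surface orientation, so horizontal surfaces get splashing droplets
# and vertical surfaces get rain streaks.

def wet_map_blend(normal, droplet_sample, rivulet_sample, transition=0.35):
    # normal: unit surface normal (x, y, z) with y pointing up.
    up_facing = max(0.0, normal[1])
    t = min(1.0, up_facing / transition)  # soft transition near horizontal
    return rivulet_sample * (1.0 - t) + droplet_sample * t

# A deck plank (normal straight up) gets pure droplets:
deck = wet_map_blend((0.0, 1.0, 0.0), droplet_sample=0.8, rivulet_sample=0.3)
# A hull plate (normal pointing sideways) gets pure streaks:
hull = wet_map_blend((1.0, 0.0, 0.0), droplet_sample=0.8, rivulet_sample=0.3)
</code></pre>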
<p>To actually create the maps, I used a combination of Photoshop and a custom tool that I originally wrote for Takua Renderer.
I started in Photoshop by kit-bashing together stuff I found online and hand-painting on top to produce a 1024 by 1024 pixel square example map with all of the characteristics I wanted.
While in Photoshop, I didn’t worry about making sure that the example map could tile; tiling comes in the next step.
After initial work in Photoshop, this is what I came up with:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/painted_wetmask.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/painted_wetmask.jpg" alt="Figure 26: Initial kit-bashed / hand-painted exemplars for streak and droplet wet maps." /></a></p>
<p>Next, to make the maps repeatable and much larger, I used a custom tool I previously wrote that implements a practical form of histogram-blending hex tiling <a href="http://jcgt.org/published/0008/04/02/">[Burley 2019]</a>.
Hex tiling with histogram preserving blending, originally introduced by <a href="https://doi.org/10.1145/3233304">Heitz and Neyret [2018]</a>, is one of the closest things to actual magic in recent computer graphics research; using hex tiling instead of normal rectilinear tiling basically completely hides obvious repetitions in the tiling from the human eye, and the histogram preserving blending makes sure that hex tile boundaries blend in a way that makes them completely invisible as well.
I’ll write more about hex tiling and make my implementation publicly available in a future post.
What matters for this project is that hex tiling allowed me to convert my exemplar map from Photoshop into a much larger 8K seamlessly repeatable texture map with no visible repetition patterns.
Below is a cropped section from each 8K map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/hextiled_wetmask.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/hextiled_wetmask.jpg" alt="Figure 27: Crops from the 8K wet maps generated from the exemplar maps using my custom implementation of histogram-blending hex tiling." /></a></p>
<p>For the previous Art Challenge, I also made some custom textures that had to be tileable.
Last time though, I used Substance Designer to make the textures tileable, which required setting up a big complicated node graph and produced results where obvious repetition was still visible.
Conversely, hex tiling basically works automatically and doesn’t require any kind of manual setup or complex graphs or anything.</p>
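<p>For the curious, the core trick that makes histogram-preserving blending work is surprisingly compact. Below is a minimal Python sketch of just the variance-preserving blend of the three hex-grid samples; the full technique also involves the hex lattice lookup and, for many textures, transforming into and out of a Gaussian domain first, neither of which is shown here.</p>
<pre><code>
# Illustrative sketch only: the variance-preserving blend used by
# histogram-preserving tiling. A plain weighted average of the three samples
# washes out contrast toward the texture's mean; dividing the deviation from
# the mean by the norm of the weights restores it.

import math

def variance_preserving_blend(samples, weights, texture_mean):
    # samples: three texel values fetched at randomly-offset hex lattice points.
    # weights: barycentric weights of the shading point within its hex triangle
    #          (assumed to sum to 1).
    linear_blend = sum(w * s for w, s in zip(weights, samples))
    weight_norm = math.sqrt(sum(w * w for w in weights))
    return (linear_blend - texture_mean) / weight_norm + texture_mean

# Near the center of a triangle the naive blend would flatten toward the mean;
# the corrected blend keeps roughly the original contrast.
blended = variance_preserving_blend([0.9, 0.2, 0.4], [0.34, 0.33, 0.33], 0.5)
</code></pre>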
<p>To generate the normal maps, I used Photoshop’s “Generate Normal Map” filter, which is found under “Filter > 3D”.
For generating normal maps from simple greyscale heightmaps, this Photoshop feature works reasonably well.
Because the hex tiling implementation is deterministic, though, I could have also generated normal maps from the greyscale exemplars first and then fed those normal map exemplars through the hex tiling tool with the same parameters that I used for the greyscale maps, and I would have gotten the same result as below.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/hextiled_normals.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/hextiled_normals.jpg" alt="Figure 28: Crops from the 8K wet map normals generated using Photoshop's "Generate Normal Map" filter tool." /></a></p>
<p>For the wet effect’s clearcoat lobe, I chose to use the physical mode instead of the artistic mode (unlike for the base dry shaders, where I only used the artistic mode).
The reason I used the physical mode for the wet effect is the layer thickness control, which darkens the underlying base shader according to how thick the clearcoat layer is supposed to be. I wanted this effect, since wet surfaces appear darker than their dry counterparts in real life.
Using the greyscale wet map, I modulated the layer thickness control according to how much water there was supposed to be at each part of the surface.</p>
<p>Finally, after wiring everything together in Maya’s HyperShade editor, everything just worked!
I think the wet look my approach produces looks reasonably convincing, especially from the distances that everything is from the camera in my final piece.
Up close the effect still holds up okay, but isn’t as convincing as using real geometry for the water droplets with real refraction and caustics driven by manifold next event estimation <a href="http://dx.doi.org/10.1111/cgf.12681">[Hanika et al. 2015]</a>.
In the future, if I need to do close up water droplets, I’ll likely try an MNEE based approach instead; fortunately, RenderMan 23’s PxrUnified integrator already comes with an MNEE implementation as an option, along with various other strategies for handling caustic cases <a href="http://graphics.pixar.com/library/BiDir/">[Hery et al. 2016]</a>.
However, the approach I used for this project is far cheaper from a render time perspective compared to using geometry and MNEE, and from a mid to far distance, I’m pretty happy with how it turned out!</p>
<p>Below are some comparisons of the ship and robot with and without the wet effect applied.
The ship renders are from the same camera angles as in Figures 13, 14, and 15; drag the slider left and right to compare:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jul/shipshape/comparisons/wideship_wetdrycompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 29: Wide view of the ship with (left) and without (right) the wet shader applied. For a full screen comparison, <a href="/content/images/2020/Jul/shipshape/comparisons/wideship_wetdrycompare.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jul/shipshape/comparisons/backship_wetdrycompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 30: Back view of the ship with (left) and without (right) the wet shader applied. For a full screen comparison, <a href="/content/images/2020/Jul/shipshape/comparisons/backship_wetdrycompare.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jul/shipshape/comparisons/sideship_wetdrycompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 31: Side view of the ship with (left) and without (right) the wet shader applied. For a full screen comparison, <a href="/content/images/2020/Jul/shipshape/comparisons/sideship_wetdrycompare.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jul/shipshape/comparisons/robot_wetdrycompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 32: Main yellow robot with (left) and without (right) the wet shader applied. For a full screen comparison, <a href="/content/images/2020/Jul/shipshape/comparisons/robot_wetdrycompare.html">click here.</a></div>
<div>
<p><strong>Additional Props and Set Elements</strong></p>
<p>In addition to texturing and shading the flying scifi ship and robot models, I had to create from scratch several other elements to help support the story in the scene.
By far the single largest new element that had to be created was the entire dock structure that the robots stand on top of.
As mentioned earlier, I wound up modeling the dock to a fairly high level of detail; the dock model contains every single bolt and rivet and plate that would be necessary for holding together a similar real steel frame structure.
Part of this level of detail is justifiable by the fact that the dock structure is in the foreground and therefore relatively close to camera, but part of having this level of detail is just because I could and I was having fun while modeling.
To model the dock relatively quickly, I used a modular approach where I first modeled a toolkit of basic reusable elements like girders, connection points, bolts, and deckboards.
Then, from these basic elements, I assembled larger pieces such as individual support legs and crossbeams and such, and then I assembled these larger pieces into the dock itself.</p>
<p>Shading the dock was relatively fast and straightforward; I created a basic galvanized metal material and applied it using a PxrRoundCube projection.
To get a bit more detail and break up the base material a bit, I added a dirt layer on top that is basically just low-frequency noise multiplied by ambient occlusion.
I did have to UV map the gangway section of the dock in order to add the yellow and black warning stripe at the end of the gangway; however, since the dock is made up almost entirely of essentially rectangular prisms oriented at 90 degree angles to each other, Maya’s automatic UV unwrapping provided something good enough to use as-is.
The yellow and black warning stripe uses the same thick worn paint material that the warning stripes on the ship uses.
On top of all of this, I then applied my wet shader clearcoat lobe.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/dock_wide.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/dock_wide.jpg" alt="Figure 33: Shading test for the dock, with wet effect applied. The lampposts are in a different orientation compared to where they are in the final scene." /></a></p>
<p>The metro sign on the dock is just a single rectangular prism with a dark glass material applied.
The glowing text is a color texture map plugged into PxrSurface’s glow parameter; wherever there is glowing text, I also made the material diffuse instead of glass, with the diffuse color matching the glow color.
To balance the intensity of the glow, I had to cheat a bit; turning the intensity of the glow down enough so that the text and colors read well means that the glow is no longer bright enough to show up in reflections or cast enough light to show up in a volume.
My solution was to turn down the glow in the PxrSurface shader, and then add a PxrRectLight immediately in front of the metro sign driven by the same texture map.
The PxrRectLight is set to be invisible to the camera.
I suppose I could have done this in post using light path expressions, but cheating it this way was simpler and allowed for everything to just look right straight out of the render.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/dock_closeup.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/dock_closeup.jpg" alt="Figure 34: Closeup test of the metro sign on the dock." /></a></p>
<p>The suitcase was a really simple prop to make.
Basically it’s just a rounded cube with some extra bits stuck on to it for the handles and latch; the little rivets are actually entirely in shading and aren’t part of the geometry at all.
I threw on a basic burlap material for the main suitcase, multiplied on some noise to make it look a bit dirtier and worn, and applied basic brass and leather materials to the latch and handle, and that was pretty much it.
Since the suitcase was going to serve as the yellow robot’s makeshift umbrella, making sure that the suitcase looked good with the wet effect applied turned out to be really important.
Here’s a lookdev test render of the suitcase, with and without the wet effect applied (slide left and right to compare):</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jul/shipshape/comparisons/suitcase_wetdrycompare_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 35: Suitcase with (left) and without (right) the wet shader applied. For a full screen comparison, <a href="/content/images/2020/Jul/shipshape/comparisons/suitcase_wetdrycompare.html">click here.</a></div>
<div>
<p>From early on, I was fairly worried about making the umbrellas look good; I knew that making sure the umbrellas looked convincingly wet was going to be really important for selling the overall rainy day setting.
I originally was going to make the umbrellas opaque, but realized that opaque umbrellas were going to cast a lot of shadows and block out a lot of parts of the frame.
Switching to transparent umbrellas made out of clear plastic helped a lot with brightening up parts of the frame and making sure that large parts of the ship weren’t completely blocked out in the final image.
As a bonus, I think the clear umbrellas also help the overall setting feel slightly more futuristic.
I modeled the umbrella canopy as a single-sided mesh, so the “thin” setting in PxrSurface’s glass parameters was really useful here.
Since the umbrella canopy is transparent with refraction roughness, having the wet effect work through the clearcoat lobe proved really important here since doing so allowed for the rain droplets and rivulets to have sharp specular highlights while simultaneously preserving the more blurred refraction in the underlying umbrella canopy material.
In the end, lighting turned out to be really important for selling the look of the wet umbrella as well; I found that having tons of little specular highlights coming from all of the rain drops helped a lot.</p>
<p>As a bit of an aside, settling on a final umbrella canopy shape took a surprising amount of time!
I started with a much flatter umbrella canopy, but eventually made it more bowed after looking at various umbrellas I have sitting around at home.
Most clear umbrella references I found online are of these Japanese bubble umbrellas which are actually far more bowed than a standard umbrella, but I wanted a shape that more closely matched a standard opaque umbrella.</p>
<p>One late addition I made to the umbrella was the small lip at the bottom edge of the umbrella canopies; for much of the development process, I didn’t have this small lip and kept feeling like something was off about the umbrellas.
I eventually realized that some real umbrellas have a bit of a lip to help catch and guide water runoff; adding this feature to the umbrellas helped them feel a bit more correct.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/umbrella.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/umbrella.jpg" alt="Figure 36: Lookdev test of the umbrella, with wet effect applied." /></a></p>
<p>Shortly before the due date for the final image, I made a last-minute addition to my scene: I took the sextant that came with Pixar’s base models and made the white/red robot on the dock hold it.
Since the green and yellow robots were both doing something a bit more dynamic than just standing around, I wanted the middle white/red robot to be doing something as well.
Maybe the white/red robot is going to navigation school!
I did a very quick-and-dirty shading job on the sextant using Maya’s automatic UVs; overall the sextant prop is not shaded to the same level of detail as most of the other elements in my scene, but considering how small the sextant is in the final image, I think it holds up okay.
I still tried to add a plausible amount of wear and age to the metal materials on the sextant, but I didn’t have time to put in carved numbers and decals and grippy textures and stuff.
There are also a few small areas where you can see visible texture stretching at UV seams, but again, in the final image, it didn’t matter too much.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/sextant.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/sextant.jpg" alt="Figure 37: Quick n' dirty lookdev test of the sextant. Model is by Aliyah Chen and was provided by Pixar as one of the contest's base models." /></a></p>
<p><strong>Rain FX</strong></p>
<p>Having a good wet surface look was one half of getting my scene to look convincingly rainy; the other major problem to solve was making the rain itself!
My initial, extremely naive plan was to simulate all of the rainfall as one enormous FLIP sim in Houdini.
However, I almost immediately realized what a bad idea that was, due to the scale of the scene.
Instead, I opted to simulate the rain as nParticles in Maya.</p>
<p>To start, I first duplicated all of the geometry that I wanted the rain to interact with, combined it all into one single huge mesh, and then decimated the mesh heavily and simplified as much as I could.
This single mesh acted as a proxy for the full scene for use as a passive collider in the nParticles network.
Using a decimated proxy for the collider instead of the full scene geometry was very important for making sure that the sim ran fast enough for me to be able to get in a good number of different iterations and attempts to find the look that I wanted.
I mostly picked geometry that was upward facing for use in the proxy collider:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/rain_proxygeo.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/rain_proxygeo.png" alt="Figure 38: The proxy mesh I used for the rain nParticles sim. This is an earlier version of the proxy mesh before I settled on final scene geometry; the final sim was run with an updated proxy mesh made from the final scene geometry." /></a></p>
<p>Next, I set up a huge volume nParticle emitter node above the scene, covering the region visible in the camera frustum.
The only forces I set up were gravity and a small amount of wind, and then I ran the nParticles system and let it run until rain had filled all parts of the scene visible to the camera.
To give the impression of fast moving motion-blurred rain droplets, I set the rendering mode of the nParticles to ‘multistreak’, which makes each particle look like a set of lines with lengths varying according to velocity.
I had to play with the collider proxy mesh’s properties a bit to get the right amount of raindrops bouncing off of surfaces and to dial in how high raindrops bounced.
I initially tried allowing particles to collide with each other as well, but this slowed the entire sim down to basically a halt, so for the final scene I have particle-to-particle collision disabled.</p>
<p>After a couple of rounds of iteration, I started getting something that looked reasonably like rain!
Using the proxy collision geometry was really useful for creating “rain shadows”, which are areas where rain isn’t present because it has been blocked by something else.
I also tuned the wind speed a lot in order to get rain particles bouncing off of the umbrellas to look like they were being blown aside in the wind.
After getting a sim that I liked, I baked out the frame of the sim that I wanted for my final render using Maya’s nCache system, which caches the nParticle simulation to disk so that it can be rapidly loaded up later without having to re-run the entire simulation.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/rain_viewport.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/rain_viewport.jpg" alt="Figure 39: Closeup of a work-in-progress version of the rain sim. Note how the umbrellas properly block rain from falling on the robots under the umbrellas." /></a></p>
<p>To add just an extra bit of detail and storytelling, near the end of the competition period I revisited my original idea for making the rain in Houdini using a FLIP solver.
I wanted to add in some “hero” rain drops around the foreground robots, running off of their umbrellas and suitcases and stuff.
To create these “hero” droplets, I brought the umbrella canopies and suitcase into Houdini and built a basic FLIP simulation, meshed the result, and brought it back into Maya to integrate back into the scene.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/houdini_rainsim.png"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/houdini_rainsim.png" alt="Figure 40: Using a FLIP simulation in Houdini to create some "hero" rain droplets running off of the umbrella canopies and suitcase." /></a></p>
<p>Dialing in the look of the rain required a lot of playing with both the width of the rain drop streaks and with the rain streak material.
I was initially very wary of making the rain in my scene heavy, since I was concerned about how much a heavy rain look would prevent me from being able to pull good detail and contrast from the ships.
However, after some successful initial tests, I felt a bit more confident about a heavier rain look.
Starting from one of those initial heavier-rain tests, I tried increasing the amount of rain by around 10x.
I originally started working on the sim with only around a million particles, but by the end I had bumped up the particle count to around 10 million.
In order to prevent the increased amount of rain from completely washing out the scene, I made each rain drop streak on the thinner and shorter side, and also tweaked the material to be slightly more forward scattering.
My rain material is basically a mix of rough glass and grey diffuse: the rain needs a glass component since rain is water, but because the rain droplet streaks are meant to look motion blurred, mixing in some diffuse helps them show up better in camera. Making the rain material more forward scattering in this case just means shifting the glass/diffuse ratio towards more glass.
I eventually arrived at a ratio of 60% diffuse light grey to 40% glass, which I found helped the rain show up in the camera and catch light a bit better.
I also used the “presence” parameter (which is really just opacity) in PxrSurface to make final adjustments to balance how visible the rain was against how much it was washing out other details.
For the “hero” droplets, I used a completely bog-standard glass material.</p>
<p>Figuring out how to simulate the rain and make it look good was by far the single largest source of worries for me in this whole project, so I was incredibly relieved at the end when it all came together and started looking good.
Here’s a 2K crop from my final image showing the “hero” droplets and all of the surrounding rain streaks around the foreground robots.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/raindrops_crop.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/raindrops_crop.jpg" alt="Figure 41: 2K crop showing "hero" droplets and rain streaks." /></a></p>
<p><strong>Lighting and Compositing</strong></p>
<p>Lighting this scene proved to be very interesting and very different from what I did for the previous challenge!
Looking back, I think I actually may have “overlit” the scene in the previous challenge; I tend to prefer a slightly more naturalistic look, but while in the thick of lighting, it’s easy to get carried away and push things far beyond the point of looking naturalistic.
Another aspect of this scene that made it very different from anything I’ve tried before is both the sheer number of practical lights in the scene and the fact that practical lights are the primary source of all lighting in this scene!</p>
<p>The key lighting in this scene is provided by the overhead lampposts on the dock, which illuminate the foreground robots.
I initially had a bunch of additional invisible PxrRectLights providing additional illumination and shaping on the robots, but I got rid of all of them and in the final image I relied only on the actual lights on the lampposts.
To prevent the visible light surfaces themselves from blowing out and aliasing, I used two lights for every lamppost: one visible-to-camera PxrRectLight set to a low intensity that wouldn’t alias in the render, and one invisible-to-camera PxrRectLight set to a relatively higher intensity for providing the actual lighting.
The visible-to-camera PxrRectLight is rendered out as the only element on a separate render layer, which can then be added back in to the main key lighting render layer.</p>
<p>To better light the ships, I added a number of additional floodlights to the ship that weren’t part of the original model; you can see these additional floodlights mounted on top of the various masts of the ships and also on the sides of the tower superstructure.
These additional floodlights illuminate the decks of the ships and help provide specular highlights to all of the umbrellas on the deck of the foreground ship, which enhances the rainy water droplet covered look.
For the foreground robots on the dock, the ship floodlights also act as something of a rim light.
Each of the ship floodlights is modeled as a visible-to-camera PxrDiscLight behind a glass lens with a second invisible-to-camera PxrDiscLight in front of the glass lens. The light behind the glass lens is usually lower in intensity and is there to provide the in-camera look of the physical light, while the invisible light in front of the lens is usually higher in intensity and provides the actual illumination in the scene.</p>
<p>In general, one of the major lessons I learned on this project was that when lighting using practical lights that have to be visible in camera, a good approach is to use two different lights: one visible-to-camera and one invisible-to-camera.
This approach allows for separating how the light itself looks versus what kind of lighting it provides.</p>
<p>The overall fill lighting and time of day is provided by the skydome, which is of an overcast sky at dusk.
I waffled back and forth for a while between a more mid-day setting and a dusk setting, but eventually settled on the dusk skydome since the overall darker time of day allows the practical lights to stand out more.
I think allowing the background trees to fade almost completely to black actually helps a lot in keeping the focus of the image on the main story elements in the foreground.
One feature of RenderMan 23 that really helped in quickly testing different lighting setups and iterating on ideas was RenderMan’s IPR mode, which has come a long way since RenderMan first moved to path tracing.
In fact, throughout this whole project, I used the IPR mode extensively for both shading tests and for the lighting process.
I have a lot of thoughts about the huge, compelling improvements to artist workflows that will be brought by even better interactivity (RenderMan XPU is very exciting!), but writing all of those thoughts down is probably better material for a different blog post in the future.</p>
<p>In total I had five lighting render layers: the key from the lampposts, the foreground rim and background fill from the floodlights, overall fill from the skydome, and two practicals layers for the visible-to-camera parts of all of the practical lights.
Below are my lighting render layers, although with the two practicals layers merged:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_key.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_key.jpg" alt="Figure 42: Final render, lampposts key lighting pass." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_floods.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_floods.jpg" alt="Figure 43: Final render, floodlights lighting pass." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_sky.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_sky.jpg" alt="Figure 44: Final render, sky fill lighting pass." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_practicals.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/lights_practicals.jpg" alt="Figure 45: Final render, practical lights lighting pass." /></a></p>
<p>I used a number of PxrRodLightFilters to knock down some distractingly bright highlights in the scene (especially on the foreground robots’ umbrellas in the center of the frame).
As a rendering engineer, rod light filters are a constant source of annoyance due to the sampling problems they introduce; rods allow for arbitrarily increasing or decreasing the amount of light going through an area, which throws off energy conservation, which can mess up importance sampling strategies that depend on a degree of energy conservation.
However, as a user, rod light filters have become one of my favorite go-to tools for shaping and adjusting lighting on a local basis, since they offer an enormous amount of localized artistic control.</p>
<p>To convey the humidity of a rainstorm and to provide volumetric glow around all of the practical lights in the scene, I made extensive use of volume rendering on this project as well.
Every part of the scene visible in-camera has some sort of volume in it!
There are generally two types of volumes in this scene: a group of thinner, less dense volumes to provide atmospherics, and then a group of thicker, denser “hero” volumes that provide some of the more visible mist below the foreground ship and swirling around the background ship.
All of these volumes are heterogeneous volumes brought in as VDB files.</p>
<p>One odd thing I found with volumes was some major differences in sampling behavior between RenderMan 23’s PxrPathtracer and PxrUnified integrators.
I found that by default, whenever I had a light that was embedded in a volume, areas in the volume near the light were extremely noisy when rendered using PxrUnified but rendered normally when using PxrPathtracer.
I don’t know enough about the details of how PxrUnified and PxrPathtracer’s volume integration <a href="https://doi.org/10.1145/3084873.3084907">[Fong et al. 2017]</a> approaches differ, but it almost looks to me like PxrPathtracer is correctly using RenderMan’s equiangular sampling implementation <a href="http://dx.doi.org/10.1111/j.1467-8659.2012.03148.x">[Kulla and Fajardo 2012]</a> in these areas and PxrUnified for some reason is not.
As a result, for rendering all volume passes I relied on PxrPathtracer, which did a great job with quickly converging on all passes.</p>
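<p>For readers who haven’t run into equiangular sampling before, the basic idea is to choose distances along a ray through a volume so that the inverse-square falloff from a nearby light is importance sampled, instead of picking distances uniformly or purely by transmittance. Here’s a minimal textbook-style Python sketch of the sampling step; this is just to illustrate the idea and is not RenderMan’s actual implementation.</p>
<pre><code>
# Illustrative sketch only: equiangular sampling of a distance t along a ray
# segment [t_near, t_far] with respect to a point light, following
# Kulla and Fajardo [2012].

import math
import random

def sample_equiangular(ray_origin, ray_dir, light_pos, t_near, t_far, u):
    # Distance along the ray to the point closest to the light, and the
    # perpendicular distance from the light to the ray. (A real implementation
    # also needs to handle the degenerate case where the light sits on the ray.)
    to_light = [lp - ro for lp, ro in zip(light_pos, ray_origin)]
    delta = sum(a * b for a, b in zip(to_light, ray_dir))
    closest = [ro + delta * d for ro, d in zip(ray_origin, ray_dir)]
    dist = math.sqrt(sum((lp - c) ** 2 for lp, c in zip(light_pos, closest)))

    theta_a = math.atan2(t_near - delta, dist)
    theta_b = math.atan2(t_far - delta, dist)

    # Warp the uniform random number u so that samples cluster near the light.
    t = delta + dist * math.tan(theta_a + u * (theta_b - theta_a))
    pdf = dist / ((theta_b - theta_a) * (dist * dist + (t - delta) ** 2))
    return t, pdf

t, pdf = sample_equiangular((0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
                            (0.5, 0.0, 5.0), 0.0, 50.0, random.random())
</code></pre>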
<p>An interesting unintended side effect of filling the scene with volumes was in how the volumes interacted with the orange thruster and exhaust vent lights.
I had originally calibrated the lights in the thrusters and exhaust vents to provide an indication of heat coming from those areas of the ship without being so bright as to distract from the rest of the image, but the orange glows these lights produced in the volumes made the entire bottom of the image orange, which was distracting anyway.
As a result, I had to re-adjust the orange thruster and exhaust vent lights to be considerably dimmer than I had originally had them, so that when interacting with the volumes, everything would be brought up to the apparent image-wide intensity that I had originally wanted.</p>
<p>In total I had eight separate render passes for volumes; each of the consolidated lighting passes from above had two corresponding volume passes.
Within the two volume passes for each consolidated lighting pass, one volume pass was for the atmospherics and one was for the heavier mist and fog.
Below are the volume passes consolidated into four images, with each image showing both the atmospherics and mist/fog in one image:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_key.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_key.jpg" alt="Figure 46: Final render, lampposts key volumes combined passes." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_floods.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_floods.jpg" alt="Figure 47: Final render, floodlights volumes combined passes." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_sky.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_sky.jpg" alt="Figure 48: Final render, sky fill volumes combined passes." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_practicals.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/volumes_practicals.jpg" alt="Figure 49: Final render, practical lights volumes combined passes." /></a></p>
<p>One final detail I added in before final rendering was to adjust the bokeh shape to something more interesting than a uniform circle.
RenderMan 23 offers a variety of controls for customizing the camera’s aperture shape, which in turn controls the bokeh shape when using depth of field.
All of the depth of field in my final image is in-render, and because of all of the tiny specular hits from all of the raindrops and from the wet shader, there is a lot of visible bokeh going on.
I wanted to make sure that all of this bokeh was interesting to look at!
I picked a rounded 5-bladed aperture with a significant amount of non-uniform density (that is, the outer edges of the bokeh are much brighter than the center core).</p>
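<p>As an aside, one simple way to picture a bladed aperture with a brighter rim is as a rejection sampler: keep only candidate points that fall inside the blade polygon, and accept them with a probability that grows toward the edge. The toy Python sketch below illustrates that idea; it isn’t how RenderMan’s aperture controls are implemented, just a way to visualize what the parameters are doing.</p>
<pre><code>
# Illustrative sketch only: rejection-sample a rounded, bladed aperture whose
# sample density increases toward the rim, producing bokeh with bright edges.

import math
import random

def sample_bladed_aperture(num_blades=5, edge_bias=2.0, rounding=0.2):
    while True:
        # Uniform candidate point in the unit disk.
        r = math.sqrt(random.random())
        phi = 2.0 * math.pi * random.random()
        x, y = r * math.cos(phi), r * math.sin(phi)

        # The candidate must lie on the inner side of every blade edge; adding
        # "rounding" pushes the edges outward, blending the polygon toward the
        # bounding circle.
        inside = True
        for i in range(num_blades):
            angle = 2.0 * math.pi * (i + 0.5) / num_blades
            if x * math.cos(angle) + y * math.sin(angle) > math.cos(math.pi / num_blades) + rounding:
                inside = False
                break
        if not inside:
            continue

        # Non-uniform density: accept with probability r^edge_bias, which makes
        # the outer ring of the bokeh brighter than the center core.
        if random.random() > r ** edge_bias:
            continue
        return x, y
</code></pre>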
<p>For final compositing, I used a basic Photoshop and Lightroom workflow like I did in the previous challenge, mostly because Photoshop is a tool I already know extremely well and I don’t have Nuke at home.
I took a relatively light-handed approach to compositing this time around; adjustments to layers were limited to just exposure adjustments.
All of the layers shown above already have the exposure adjustments I made baked in.
After making adjustments in Photoshop and flattening out to a single layer, I then brought the image into Lightroom for final color grading.
For the final color grade, I tried to push the overall look to be a bit moodier and a bit more contrast-y, with the goal of having the contrast further draw the viewer’s eye to the foreground robots where the main story is.
Figure 50 is a gif that visualizes the compositing process for my final image by showing how all of the successive layers are added on top of each other.
Figure 51 shows what all of the lighting, comp, and color grading looks like applied to a 50% grey clay shaded version of the scene, and if you don’t want to scroll all the way back to the top of this post to see the final image, I’ve included it again as Figure 52.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/final_layers_lossy.gif"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/final_layers_lossy.gif" alt="Figure 50: Animated breakdown of compositing layers." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/clayrender_graded_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/clayrender_graded.jpg" alt="Figure 51: Final lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/shipshape_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/shipshape_full.jpg" alt="Figure 52: Final image. Click for 4K version." /></a></p>
<p><strong>Conclusion</strong></p>
<p>On a whole, I’m happy with how this project turned out!
I think a lot of what I did on this project represents a decent evolution over what I did for the previous RenderMan Art Challenge, and applies a lot of the lessons I learned from that project.
I started this project mostly as an excuse to just have fun, but along the way I still learned a lot more, and going forward I’m definitely hoping to be able to do more pure art projects alongside my main programming and technical projects.</p>
<p>Here is a progression video I put together from all of the test and in-progress renders that I made throughout this entire project:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/433150588" frameborder="0">Shipshape Art Challenge Progression Reel</iframe></div>
<div class="figcaption">Figure 53: Progression reel made from test and in-progress renders leading up to my final image.</div>
<p>My wife, Harmony Li, deserves an enormous amount of thanks on this project.
First off, the final concept I went with is just as much her idea as it is mine, and throughout the entire project she provided valuable critiques and suggestions and direction.
As usual with the RenderMan Art Challenges, Leif Pederson from Pixar’s RenderMan group provided a lot of useful tips, advice, feedback, and encouragement as well.
Many other entrants in the Art Challenge also provided a ton of support and encouragement; the community that has built up around the Art Challenges is really great and a fantastic place to be inspired and encouraged.
Finally, I owe an enormous thanks to all of the judges for this RenderMan Art Challenge, because they picked my image for first place!
Winning first place in a contest like this is incredibly humbling, especially since I’ve never really considered myself as much of an artist.
Various friends have since pointed out that with this project, I no longer have the right to deny being an artist!
If you would like to see more about my contest entry, check out the <a href="https://renderman.pixar.com/answers/challenge/15577/river-patrol.html">work-in-progress thread I kept on Pixar’s Art Challenge forum</a>, and I also made <a href="https://www.artstation.com/artwork/WK2OJv">an Artstation post</a> for this project.</p>
<p>As a final bonus image, here’s a daylight version of the scene.
My backup plan in case I wasn’t able to pull off the rainy look was to just go for a plain daylight setup; I figured that the lighting would be a lot more boring, but the additional visible detail would be an okay consolation prize for myself.
Thankfully, the rainy look worked out and I didn’t have to go to my backup plan!
After the contest wrapped up, I went back and made a daylight version out of curiosity:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/daylight_comp_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jul/shipshape/preview/daylight_comp.jpg" alt="Figure 54: Bonus image: daylight version. Click for 4K version." /></a></p>
</div>
https://blog.yiningkarlli.com/2020/02/shadow-terminator-in-takua.html
Shadow Terminator in Takua
2020-02-09T00:00:00+00:00
2020-02-09T00:00:00+00:00
Yining Karl Li
<div>
<p>I recently implemented two techniques in Takua for solving the harsh shadow terminator problem; I implemented both the Disney Animation solution <a href="https://www.yiningkarlli.com/projects/shadowterminator.html">[Chiang et al. 2019]</a> that we published at SIGGRAPH 2019, and the Sony Imageworks technique <a href="https://link.springer.com/chapter/10.1007/978-1-4842-4427-2_12">[Estevez et al. 2019]</a> published in Ray Tracing Gems.
We didn’t show too many comparisons between the two techniques (which I’ll refer to as the Chiang and Estevez approaches, respectively) in our SIGGRAPH 2019 presentation, and we didn’t show comparisons on any actual “real-world” scenes, so I thought I’d do a couple of my own renders using Takua as a bit of a mini-followup and share a handful of practical implementation tips.
For a recap of the harsh shadow terminator problem, please see either the Estevez paper or the slides from the Chiang talk, which both do excellent jobs of describing the problem and why it happens in detail.
Here’s a small scene that I made for this post, thrown together using some Evermotion assets that I had sitting around:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2020/Jan/shadowterminator/bedroom.chiang.pt.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2020/Jan/shadowterminator/preview/bedroom.chiang.pt.0.jpg" alt="Figure 1: A simple bedroom scene, rendered in Takua Renderer. This image was rendered using the Chiang 2019 shadow terminator solution." /></a></p>
<p>In this scene, all of the blankets and sheets and pillows on the bed use a fabric material that uses extremely high-frequency, high-resolution normal maps to achieve the fabric-y fiber-y look.
Because of these high-frequency normal maps, the bedding is susceptible to the harsh shadow terminator problem.
All of the bedding also has diffuse transmission and a very slight amount of high roughness specularity to emulate the look of a sheen lobe, making the material (and therefore this comparison) overall more interesting than just a single diffuse lobe.</p>
<p>Since the overall scene is pretty brightly lit and the bed is lit from all directions either by direct illumination from the window or bounce lighting from inside of the room, the shadow terminator problem is not as apparent in this scene; it’s still there, but it’s much more subtle than in the examples we showed in our talk.
Below are some interactive comparisons between renders using Chiang 2019, Estevez 2019, and no shadow terminator fix; drag the slider left and right to compare:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_chiang_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 2: The bedroom scene rendered in Takua Renderer using Chiang 2019 (left) and no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_chiang_nofix.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_chiang_estevez_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 3: The bedroom scene rendered in Takua Renderer using Chiang 2019 (left) and Estevez 2019 (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_chiang_estevez.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_diffuse_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 4: The bedroom scene rendered in Takua Renderer using no normal mapping (left) and normal mapping with no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bedroom_diffuse_nofix.html">click here.</a></div>
<div>
<p>If you would like to compare the 4K renders directly, they are located here: <a href="/content/images/2020/Jan/shadowterminator/bedroom.chiang.pt.0.jpg">Chiang 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bedroom.estevez.pt.0.jpg">Estevez 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bedroom.none.pt.0.jpg">No Fix</a>, <a href="/content/images/2020/Jan/shadowterminator/bedroom.diffuse.pt.0.jpg">No Normal Mapping</a>.
As mentioned above, due to this scene being brightly lit, differences between the two techniques and not having any harsh shadow terminator fix at all will be a bit more subtle.
However, differences are still visible, especially in brighter areas of the blanket and white pillows.
Note that in this scenario, the difference between Chiang 2019 and Estevez 2019 is fairly small, while the difference between using either shadow terminator fix and not having a fix is more apparent.
Also note how both Chiang 2019 and Estevez 2019 produce results that come pretty close to matching the reference image with no normal mapping; this is good, since we would expect fix techniques to match the reference image more closely than not having a fix!</p>
<p>If we remove the bedroom set and put the bed onto more of a studio lighting setup with two area lights and a seamless grey backdrop, we can start seeing more prominent differences between the two techniques and between either technique and no fix.
Seeing how everything plays out in this type of a lighting setup is useful, since this is the type of render that one often sees as part of a standard lookdev department’s workflow:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_chiang_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 5: The bed in a studio lighting setup, rendered in Takua Renderer using Chiang 2019 (left) and no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_chiang_nofix.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_chiang_estevez_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 6: The bed in a studio lighting setup, rendered in Takua Renderer using Chiang 2019 (left) and Estevez 2019 (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_chiang_estevez.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_diffuse_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 7: The bed in a studio lighting setup, rendered in Takua Renderer using no normal mapping (left) and normal mapping with no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_diffuse_nofix.html">click here.</a></div>
<div>
<p>If you would like to compare the 4K renders directly for the studio lighting setup, they are located here: <a href="/content/images/2020/Jan/shadowterminator/bed.chiang.pt.0.jpg">Chiang 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bed.estevez.pt.0.jpg">Estevez 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bed.none.pt.0.jpg">No Fix</a>, <a href="/content/images/2020/Jan/shadowterminator/bed.diffuse.pt.0.jpg">No Normal Mapping</a>.
In this setup, we can now see differences between the four images much more clearly.
Compared to the no normal mapping reference, the render with no fix produces considerably more darkening on silhouettes, and the harsh sudden transition from bright to shadowed areas is much more apparent.
In the render with no fix, the bedding suddenly looks a lot less soft and starts to look a little more like a hard solid surface instead of like fabric.</p>
<p>Chiang 2019 and Estevez 2019 both restore more of the soft fabric look by softening out the harsh shadow terminator areas, but the differences between Chiang 2019 and Estevez 2019 become more apparent and interesting in this setting.
Chiang 2019 produces an overall softer look that has shadow terminators that more closely match the reference with no normal mapping, but Chiang 2019 produces a slightly darker look overall compared to Estevez 2019.
Estevez 2019 doesn’t match the reference’s shadow terminators quite as closely as Chiang 2019, but manages to preserve more of the overall energy.
In Figure 5 in the Chiang 2019 paper, we explain where this difference comes from: for small shading normal deviations, Estevez 2019 produces less shadowing than our method, whereas for larger shading normal deviations, Estevez 2019 produces more shadowing than our method.
As a result, Estevez 2019 generally produces a higher contrast look compared to Chiang 2019.</p>
<p>All of these differences are more apparent in a close-up crop of the full 4K render.
Here are comparisons of the same studio lighting setup from above, but cropped in; pay close attention to slightly right of center of the image, where the white blanket overhangs the edge of the bed:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_chiang_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 8: Crop of the studio lighting setup render from earlier, using Chiang 2019 (left) and no harsh shadow terminator fix (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_chiang_nofix.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_chiang_estevez_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 9: Crop of the studio lighting setup render from earlier, using Chiang 2019 (left) and Estevez 2019 (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_chiang_estevez.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_diffuse_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 10: Crop of the studio lighting setup render from earlier, using no normal mapping (left) and normal mapping with no harsh shadow terminator fix (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/bed_crop_diffuse_nofix.html">click here.</a></div>
<div>
<p>Of course, the scenario that makes the harsh shadow terminator problem the most apparent is when there is a single strong light source and we are viewing the scene from an angle from which we can see areas where the light hits surfaces at a glancing angle.
These types of lighting setups are often used for checking silhouettes and backlighting and whatnot in modeling and lookdev turntable renders.
In the comparisons below, the differences are most noticeable in the folds and on the shadowed sides of all of the bedding:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_chiang_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 11: The bed lit with a single very bright light, rendered in Takua Renderer using Chiang 2019 (left) and no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_chiang_nofix.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_chiang_estevez_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 12: The bed lit with a single very bright light, rendered in Takua Renderer using Chiang 2019 (left) and Estevez 2019 (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_chiang_estevez.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_diffuse_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 13: The bed lit with a single very bright light, rendered in Takua Renderer using no normal mapping (left) and normal mapping with no harsh shadow terminator fix (right). For a full screen comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_diffuse_nofix.html">click here.</a></div>
<div>
<p>If you would like to compare the 4K renders directly for the single light source renders, they are located here: <a href="/content/images/2020/Jan/shadowterminator/bed_singlelight.pt.chiang.0.jpg">Chiang 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bed_singlelight.pt.estevez.0.jpg">Estevez 2019</a>, <a href="/content/images/2020/Jan/shadowterminator/bed_singlelight.pt.none.0.jpg">No Fix</a>, <a href="/content/images/2020/Jan/shadowterminator/bed_singlelight.pt.diffuse.0.jpg">No Normal Mapping</a>.
With a single light source, the differences between the four images are now very clear, since a single light setup produces strong contrast between the lit and shadowed parts of the image.
The harsh shadow terminator problem is especially visible in the folds of the blanket, where we can see one side of the fold fully lit and one side of the fold in shadow (although because the bedding all has diffuse transmission, the harsh shadow terminator is still not as prevalent as it would be for a purely diffuse reflecting surface).
Something else that is interesting is how the bedding with no shadow terminator fix overall appears slightly brighter than the bedding with no normal mapping; this is because the shading normals “bend” more light towards the light source.
Chiang 2019 restores the overall brightness of the bedding back to something closer to the reference with no normal mapping but softens out more of the fine detail from the normal mapping, while Estevez 2019 preserves more of the fine details but has a brightness level closer to the render with no fix.</p>
<p>Just like in the studio lighting renders, differences become more apparent in close-up crops of the full 4K render.
Here are some cropped in comparisons, this time centered more on the top of the bed than on the edge.
In these crops, the glancing light angles make the shadow terminators more apparent in the folds of the blankets and such:</p>
</div>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_chiang_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 14: Crop of the single light render from earlier, using Chiang 2019 (left) and no harsh shadow terminator fix (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_chiang_nofix.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_chiang_estevez_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 15: Crop of the single light render from earlier, using Chiang 2019 (left) and Estevez 2019 (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_chiang_estevez.html">click here.</a></div>
<p>
<div class="embed-container">
<iframe src="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_diffuse_nofix_embed.html" frameborder="0" border="0" scrolling="no"></iframe></div>
<div class="figcaption">Figure 16: Crop of the single light render from earlier, using no normal mapping (left) and normal mapping with no harsh shadow terminator fix (right). For a larger comparison, <a href="/content/images/2020/Jan/shadowterminator/comparisons/singlelight_crop_diffuse_nofix.html">click here.</a></div>
<div>
<p>In the end, I don’t think either approach is better than the other, and from a physical basis there really isn’t a “right” answer since nothing about shading normals is physical to begin with; I think it’s up to a matter of personal preference and the requirements of the art direction on a given project.
Our artists at Walt Disney Animation Studios generally prefer the look of Chiang 2019 because of the lighting setups they usually work with, but I know that other artists prefer the look of Estevez 2019 because they have different requirements to meet.</p>
<p>Fortunately, Chiang 2019 and Estevez 2019 are both really easy to implement!
Both techniques can be implemented in a handful of lines of code, and are easy to apply to any modern physically based shading model.
We didn’t actually include source code in our SIGGRAPH talk, mostly because we figured that translating the math from our short paper into code should be very straightforward and thus, including source code that is basically a direct transcription of the math into C++ would almost be insulting to the intelligence of the reader.
However, since then, I’ve gotten a surprising number of emails asking for source code, so here’s the math and the corresponding C++ code from my implementation in Takua Renderer.
Let G’ be the additional shadow terminator term that we will multiply the Bsdf result with:</p>
<div>\[ G = \min\bigg[1, \frac{\langle\omega_g,\omega_i\rangle}{\langle\omega_s,\omega_i\rangle\langle\omega_g,\omega_s\rangle}\bigg] \]</div>
<div>\[ G' = - G^3 + G^2 + G \]</div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float calculateChiang2019ShadowTerminatorTerm(const vec3& outputDirection,
const vec3& shadingNormal,
const vec3& geometricNormal) {
float NDotL = max(0.0f, dot(shadingNormal, outputDirection));
float NGeomDotL = max(0.0f, dot(geometricNormal, outputDirection));
float NGeomDotN = max(0.0f, dot(geometricNormal, shadingNormal));
if (NDotL == 0.0f || NGeomDotL == 0.0f || NGeomDotN == 0.0f) {
return 0.0f;
} else {
float G = NGeomDotL / (NDotL * NGeomDotN);
if (G <= 1.0f) {
float smoothTerm = -(G * G * G) + (G * G) + G; // smoothTerm is G' in the math
return smoothTerm;
}
}
return 1.0f;
}
</code></pre></div> </div>
<p>That’s all there is to it!
<a href="https://github.com/Apress/ray-tracing-gems/blob/master/Ch_12_A_Microfacet-Based_Shadowing_Function_to_Solve_the_Bump_Terminator_Problem/terminator.cpp">Source code for Estevez 2019</a> is provided as part of the Ray Tracing Gems Github repository, but for the sake of completeness, my implementation is included below.
My implementation is just the sample implementation streamlined into a single function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float calculateEstevez2019ShadowTerminatorTerm(const vec3& outputDirection,
const vec3& shadingNormal,
const vec3& geometricNormal) {
float cos_d = min(abs(dot(geometricNormal, shadingNormal)), 1.0f);
float tan2_d = (1.0f - cos_d * cos_d) / (cos_d * cos_d);
float alpha2 = clamp(0.125f * tan2_d, 0.0f, 1.0f);
float cos_i = max(abs(dot(geometricNormal, outputDirection)), 1e-6f);
float tan2_i = (1.0f - cos_i * cos_i) / (cos_i * cos_i);
float spi_shadow_term = 2.0f / (1.0f + sqrt(1.0f + alpha2 * tan2_i));
return spi_shadow_term;
}
</code></pre></div> </div>
<p>Finally, I have a handful of small implementation notes.
First, to apply either Chiang 2019 or Estevez 2019 to your existing physically based shading model, just multiply the additional shadow terminator term with the contribution for each lobe that needs adjusting.
Technically speaking G’ is an adjustment to the G shadowing term in a standard microfacet model, but multiplying there versus multiplying with the overall lobe contribution works out to be the same thing.
If your Bsdf supports multiple shading normals for different specular lobes, you’ll need to calculate a separate shadow terminator term for each shading normal.
Second, note that both Chiang 2019 and Estevez 2019 are described with respect to unidirectional path tracing from the camera.
This frame of reference is very important; both techniques work specifically based on the outgoing direction being the direction towards a potential light source, meaning that this technique actually isn’t reciprocal by default.
The Estevez 2019 paper found that the shadow terminator term can be made reciprocal by just applying the term to both incoming and outgoing directions, but they also found that this adjustment can make edges too dark.
Instead, in order to make both techniques compatible with bidirectional path tracing integrators, I add in a check for whether the incoming or outgoing direction is pointed at a light, and feed the appropriate direction into the shadow terminator function.
Doing this check is enough to make my bidirectional renders match my unidirectional ones; intuitively this approach is similar to the check one has to carry out when applying adjoint Bsdf adjustments <a href="https://graphics.stanford.edu/papers/non-symmetric/">[Veach 1996]</a> for shading normals and refraction.</p>
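<p>To make the first note above a bit more concrete, here is a small sketch of what applying the terminator term per lobe might look like. To be clear, this is not Takua’s actual Bsdf interface; the Lobe struct and its members are hypothetical stand-ins (and vec3 is assumed to be the same small vector type used in the snippets above). The sketch only shows where the terminator term gets multiplied in and how the direction-towards-the-light requirement factors in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <vector>

// Hypothetical stand-in for one lobe of a layered Bsdf; not Takua's real interface.
struct Lobe {
    vec3 shadingNormal;                                            // this lobe's (possibly bent) shading normal
    vec3 evaluate(const vec3& towardsLight, const vec3& towardsCamera) const;
};

vec3 evaluateBsdfWithTerminatorFix(const std::vector<Lobe>& lobes,
                                   const vec3& towardsLight,   // must point towards the light
                                   const vec3& towardsCamera,
                                   const vec3& geometricNormal) {
    vec3 result(0.0f);
    for (const Lobe& lobe : lobes) {
        // One terminator term per shading normal; in a bidirectional integrator, make sure
        // the direction fed in here is whichever of the two directions points at the light.
        float term = calculateChiang2019ShadowTerminatorTerm(towardsLight,
                                                             lobe.shadingNormal,
                                                             geometricNormal);
        result += lobe.evaluate(towardsLight, towardsCamera) * term;
    }
    return result;
}
</code></pre></div> </div>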
<p>That’s pretty much it!
If you want the details for how these two techniques are derived and why they work, I strongly encourage reading the Estevez 2019 chapter in Ray Tracing Gems and reading through both the short paper and the presentation slides / notes for the Chiang 2019 SIGGRAPH talk.</p>
<p><strong>References</strong></p>
<p>Matt Jen-Yuan Chiang, Yining Karl Li, and Brent Burley. 2019. <a href="https://dl.acm.org/citation.cfm?doid=3306307.3328172">Taming the Shadow Terminator</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 71:1–71:2.</p>
<p>Alejandro Conty Estevez, Pascal Lecocq, and Clifford Stein. 2019. <a href="https://link.springer.com/chapter/10.1007/978-1-4842-4427-2_12">A Microfacet-Based Shadowing Function to Solve the Bump Terminator Problem</a>. <em>Ray Tracing Gems</em> (2019), 149-158.</p>
<p>Eric Veach. 1996. <a href="https://graphics.stanford.edu/papers/non-symmetric/">Non-Symmetric Scattering in Light Transport Algorithms</a>. In <em>Rendering Techniques 1996 (Proceedings of the 7th Eurographics Workshop on Rendering)</em>. 82-91.</p>
<p><strong>Errata</strong></p>
<p>Thanks to Matt Pharr for noticing and pointing out a minor bug in the calculateChiang2019ShadowTerminatorTerm() implementation; the code has been updated with a fix.</p>
</div>
https://blog.yiningkarlli.com/2019/11/woodville-renderman-challenge.html
Woodville RenderMan Art Challenge
2019-11-30T00:00:00+00:00
2019-11-30T00:00:00+00:00
Yining Karl Li
<p>Every once in a while, I make a <a href="https://blog.yiningkarlli.com/2016/07/minecraft-in-renderman-ris.html">point of spending some significant personal time</a> working on a personal project that uses tools outside of the stuff I’m used to working on day-to-day (Disney’s Hyperion renderer professionally, Takua Renderer as a hobby).
A few times each year, Pixar’s RenderMan group holds an art challenge contest where Pixar provides an un-shaded, un-UV’d base model and contestants are responsible for layout, texturing, shading, lighting, additional modeling of supporting elements and surrounding environment, and producing a final image.
I thought the <a href="https://renderman.pixar.com/news/renderman-woodville-art-challenge">most recent RenderMan art challenge, “Woodville”</a>, would make a great excuse for playing with RenderMan 22 for Maya; here’s the final image I came up with:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_full.jpg" alt="Figure 1: My entry to Pixar's RenderMan Woodville Art Challenge, titled "Morning Retreat". Base treehouse model is from Pixar; all shading, lighting, additional modeling, and environments are mine. Concept by Vasylina Holod. Model by Alex Shilt © Disney / Pixar - RenderMan "Woodville" Art Challenge." /></a></p>
<p>One big lesson I have learned since entering the rendering world is that there is no such thing as the absolute best overall renderer; there are only renderers that are the best suited for particular workflows, tasks, environments, people, etc.
Every in-house renderer is the best renderer in the world for the particular studio that built that renderer, and every commercial renderer is the best renderer in the world for the set of artists that have chosen that renderer as their tool of choice.
Another big lesson that I have learned is that even though the Hyperion team at Disney Animation has some of the best rendering engineers in the world, so do all of the other major rendering teams, both commercial and in-house.
These lessons are humbling to learn, but also really cool and encouraging if you think about it: these lessons mean that for any given problem that arises in the rendering world, as an academic field and as an industry, we get multiple attempts to solve it from many really brilliant minds from a variety of backgrounds and a variety of different contexts and environments!</p>
<p>As a result, something I’ve come to strongly believe is that for rendering engineers, there is enormous value in learning to use outside renderers that are not the one we work on day-to-day ourselves.
At any given moment, I try to have at least a working familiarity with the latest versions of Pixar’s <a href="https://renderman.pixar.com">RenderMan</a>, Solid Angle (Autodesk)’s <a href="https://www.arnoldrenderer.com">Arnold</a>, and Chaos Group’s <a href="https://www.chaosgroup.com">Vray</a> and <a href="https://corona-renderer.com">Corona</a> renderers.
All of these renderers are excellent, cutting edge tools, and when new artists join our studio, these are the most common commercial renderers that new artists tend to know how to use.
Therefore, knowing how these four renderers work and what vocabulary is associated with them tends to be useful when teaching new artists how to use our in-house renderer, and for providing a common frame of reference when we discuss potential improvements and changes to our in-house renderer.
All of the above is the mindset I went into this project with, so this post is meant to be something of a breakdown of what I did, along with some thoughts and observations made along the way.
This was a really fun exercise, and I learned a lot!</p>
<p><strong>Layout and Framing</strong></p>
<p>For this art challenge, Pixar <a href="https://renderman.pixar.com/woodville-pup-asset">supplied a base model</a> without any sort of texturing or shading or lighting or anything else.
The model is by Alex Shilt, based on a concept by Vasylina Holod.
Here is a simple render showing what is provided out of the box:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville_base_wide.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_base_wide.jpg" alt="Figure 2: Base model provided by Pixar, rendered against a white cyclorama background using a basic skydome." /></a></p>
<p>I started with just scouting for some good camera angles.
Since I really wanted to focus on high-detail shading for this project, I decided from close to the beginning to pick a close-up camera angle that would allow for showcasing shading detail, at the trade-off of not depicting the entire treehouse.
A nice (lazy) bonus is that picking a close-up camera angle meant that I didn’t need to shade the entire treehouse; just the parts in-frame.
Instead of scouting using just the GL viewport in Maya, I tried using RenderMan for Maya 22’s IPR mode, which replaces the Maya viewport with a live RenderMan render.
This mode wound up being super useful for scouting; being able to interactively play with depth of field settings and see even basic skydome lighting helped a lot in getting a feel for each candidate camera angle.
Here are a couple of different white clay test renders I did while trying to find a good camera position and framing:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_02.jpg" alt="Figure 3: Candidate camera angle with a close-up focus on the entire top of the treehouse." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_04.jpg" alt="Figure 4: Candidate camera angle with a close-up focus on a specific triangular A-frame treehouse cabin." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_03.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_03.jpg" alt="Figure 5: Candidate camera angle looking down from the top of the treehouse." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_01.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/candidate_camera_01.jpg" alt="Figure 6: Candidate camera angle with a close-up focus on the lower set of treehouse cabins." /></a></p>
<p>I wound up deciding to go with the camera angle and framing in Figure 6 for several reasons.
First off, there are just a lot of bits that looked fun to shade, such as the round tower cabin on the left side of the treehouse.
Second, I felt that this angle would allow me to limit how expansive of an environment I would need to build around the treehouse.
I decided around this point to put the treehouse in a big mountainous mixed coniferous forest, with the reasoning being that tree trunks as large as the ones in the treehouse could only come from huge redwood trees, which only grow in mountainous coniferous forests.
With this camera angle, I could make the background environment a single mountainside covered in trees and not have to build a wider vista.</p>
<p><strong>UVs and Geometry</strong></p>
<p>The next step that I took was to try to shade the main tree trunks, since the scale of the tree trunks worried me the most about the entire project.
Before I could get to texturing and shading though, I first had to UV-map the tree trunks, and I quickly discovered that before I could even UV-map the tree trunks, I would have to retopologize the meshes themselves, since the tree trunk meshes came with some really messy topology that was basically un-UV-able.
I retopologized the mesh in ZBrush and exported it lower res than the original mesh, and then brought it back into Maya, where I used a shrink-wrap deformer to conform the lower res retopologized mesh back onto the original mesh.
The reasoning here was that a lower resolution mesh would be easier to UV unwrap and that displacement later would restore missing detail.
Figure 7 shows the wireframe of the original mesh on the left, and the wireframe of my retopologized mesh on the right:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/trunk_wireframe.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/trunk_wireframe.jpg" alt="Figure 7: Original mesh wireframe on the left, my retopologized version on the right." /></a></p>
<p>In previous projects, I’ve found a lot of success in using <a href="https://github.com/wjakob/instant-meshes">Wenzel Jakob’s Instant Meshes</a> application to retopologize messy geometry, but this time around I used <a href="http://docs.pixologic.com/user-guide/3d-modeling/topology/zremesher/">ZBrush’s ZRemesher tool</a> since I wanted as perfect a quad grid as possible (at the expense of losing some mesh fidelity) to make UV unwrapping easier.
I UV-unwrapped the remeshed tree trunks by hand; the general approach I took was to slice the tree trunks into a series of stacked cylinders and then unroll each cylinder into as rectangular of a UV shell as I could.
For texturing, I started with some photographs of redwood bark I found online, turned them greyscale in Photoshop and adjusted levels and contrast to produce height maps, and then took the height maps and source photographs into Substance Designer, where I made the maps tile seamlessly and also generated normal maps.
I then took the tileable textures into Substance Painter and painted the tree trunks using a combination of triplanar projections and manual painting.
At this point, I had also blocked in a temporary forest in the background made from just instancing two or three tree models all over the place, which I found useful for being able to help get a sense of how the shading on the treehouse was working in context:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress016.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress016.jpg" alt="Figure 8: In-progress test render with shaded tree trunks and temporary background forest blocked in." /></a></p>
<p>Next up, I worked on getting base shading done for the cabins and various bits and bobs on the treehouse.
The general approach I took for the entire treehouse was to do base texturing and shading in Substance Painter, and then add wear and tear, aging, and moss in RenderMan through procedural <a href="https://rmanwiki.pixar.com/display/REN22/PxrLayerSurface">PxrLayerSurface</a> layers driven by a combination of procedural <a href="https://rmanwiki.pixar.com/display/REN22/PxrRoundCube">PxrRoundCube</a> and <a href="https://rmanwiki.pixar.com/display/REN22/PxrDirt">PxrDirt</a> nodes and hand-painted dirt and wear masks.
First though, I had to UV-unwrap all of the cabins and stuff.
I tried using <a href="https://www.sidefx.com/tutorials/houdini-game-dev-tools-auto-uvs/">Houdini’s Auto UV SOP</a> that comes with Houdini’s Game Tools package… the result (for an example, see Figure 9) was really surprisingly good!
In most cases I still had to do a lot of manual cleanup work, such as re-stitching some UV shells together and re-laying-out all of the shells, but the output from Houdini’s Auto UV SOP provided a solid starting point.
For each cabin, I grouped surfaces that were going to have a similar material into a single UDIM tile, and sometimes I split similar materials across multiple UDIM tiles if I wanted more resolution.
This entire process was… not really fun… it took a lot of time and was basically just busy-work.
I vastly prefer being able to paint Ptex instead of having to UV-unwrap and lay out UDIM tiles, but since I was using Substance Painter, Ptex wasn’t an option on this project.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/houdini-auto-uv.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/houdini-auto-uv.jpg" alt="Figure 9: Example of one of the cabins run through Houdini's Auto UV SOP. The cabin is on the left; the output UVs are on the right." /></a></p>
<p><strong>Texturing in Substance Painter and Shading</strong></p>
<p>In Substance Painter, the general workflow I used was to start with multiple triplanar projections of (heavily edited) Quixel Megascans surfaces masked and oriented to different sections of a surface, and then paint on top.
Through this process, I was able to get bark to flow with the curves of each log and whatnot.
Then, in RenderMan for Maya, I took all of the textures from Substance Painter and used them to drive the base layer of a PxrLayerSurface shader.
All of the textures were painted to be basically greyscale or highly desaturated, and then in Maya I used PxrColorCorrect and PxrVary nodes to add in color.
This way, I was able to iteratively play with and dial in colors in RenderMan’s IPR mode without having to roundtrip back to Substance Painter too much.
Since the camera in my frame is relatively close to the treehouse, having lots of detail was really important.
I put high-res displacement and normal maps on almost everything, which I found helpful for getting that extra detail in.
I found that setting the dicing rate finer than the default of one micropolygon per pixel (that is, a micropolygon length of less than a pixel) was useful for getting extra detail in with displacement, at the cost of a bit more memory usage (which was perfectly tolerable in my case).</p>
<p>One of the unfortunate things about how I chose to UV-unwrap the tree trunks is that UV seams cut across parts of the tree trunks that are visible to the camera; as a result, if you zoom into the final 4K renders, you can see tiny line artifacts in the displacement where UV seams meet.
These artifacts arise from displacement values not interpolating smoothly across UV seams when texture filtering is in play; this problem can sometimes be avoided by very carefully hiding UV seams, but sometimes there is no way.
The problem in my case is somewhat reduced by expanding displacement values beyond the boundaries of each UV shell in the displacement textures (most applications like Substance Painter can do this natively), but again, this doesn’t completely solve the problem, since expanding values beyond boundaries can only go so far until you run into another nearby UV shell and since texture filtering widths can be variable.
This problem is one of the major reasons why we use Ptex so heavily at Disney Animation; Ptex’s robust cross-face filtering functionality sidesteps this problem entirely.
I really wish Substance Painter could output Ptex!</p>
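<p>As a side note, the “expand displacement values beyond the boundaries of each UV shell” trick mentioned above is usually called edge padding or dilation. Applications like Substance Painter handle this natively, so the following is purely an illustrative sketch of the idea rather than any particular tool’s implementation: empty texels bordering a UV shell repeatedly copy (here, average) the values of their covered neighbors, growing a border of plausible data around each shell so that texture filtering near seams pulls in reasonable values instead of garbage:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <vector>

// values: single-channel texture, mask: true where a texel is covered by a UV shell.
// Each pass grows the covered region outwards by one texel.
void dilateUvShells(std::vector<float>& values, std::vector<bool>& mask,
                    int width, int height, int numPasses) {
    for (int pass = 0; pass < numPasses; pass++) {
        std::vector<bool> newMask = mask;
        std::vector<float> newValues = values;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (mask[y * width + x]) continue; // already covered by a shell
                float sum = 0.0f;
                int count = 0;
                // Average the covered texels among the 4-connected neighbors.
                const int offsets[4][2] = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };
                for (const auto& o : offsets) {
                    int nx = x + o[0], ny = y + o[1];
                    if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
                    if (mask[ny * width + nx]) {
                        sum += values[ny * width + nx];
                        count++;
                    }
                }
                if (count > 0) {
                    newValues[y * width + x] = sum / float(count);
                    newMask[y * width + x] = true; // texel is now part of the padded border
                }
            }
        }
        mask.swap(newMask);
        values.swap(newValues);
    }
}
</code></pre></div> </div>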
<p>For dialing in the colors of the base wood shaders, I created versions of the wood shader base color textures that looked like newer wood and older sun-bleached wood, and then I used a PxrBlend node in each wood shader to blend between the newer and older looking wood, along with procedural wear to make sure that the blend wasn’t totally uniform.
Across all of the various wood shaders in the scene, I tied all of the blend values to a single PxrToFloat node, so that I could control how aged all wood across the entire scene looks with a single value.
For adding moss to everything, I used a PxrRoundCube triplanar to set up a base mask for where moss should go.
The triplanar mask was set up so that moss appears heavily on the underside of objects, less on the sides, and not at all on top.
The reasoning for making moss appear on undersides is because in the type of conifer forest I set my scene in, moss tends to grow where moisture and shade are available, which tends to be on the underside of things.
The moss itself was also driven by a triplanar projection and was combined into each wood shader as a layer in PxrLayerSurface.
I also did some additional manual mask painting in Substance Painter to get moss into some more crevices and corners and stuff on all of the wooden sidings and the wooden doors and whatnot.
Finally, the overall amount of moss across all of the cabins is modulated by another single PxrToFloat node, allowing me to control the overall amount of moss using another single value.
Figure 10 shows how I could vary the age of the wood on the cabins, along with the amount of moss.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/cabin_shading_progress.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/cabin_shading_progress.jpg" alt="Figure 10: Example of age and moss controllability on one of the cabins. The top row shows, going from left to right, 0% aged, 50% aged, and 100% aged. The bottom row shows, going from left to right, 0% moss, 50% moss, and 100% moss. The final values used were close to 60% for both age and moss." /></a></p>
<p>The spiral staircase initially made me really worried; I originally thought I was going to have to UV unwrap the whole thing, and stuff like the railings are really not easy to unwrap.
But then, after a bit of thinking, I realized that the spiral staircase is likely a fire escape staircase, and so it could be wrought iron or something.
Going with a wrought iron look allowed me to handle the staircase mostly procedurally, which saved a lot of time.
Going along with the idea of the spiral staircase being a fire escape, I figured that the actual main way to access all of the different cabins in the treehouse must be through staircases internal to the tree trunks.
This idea informed how I handled that long skinny window above the front door; I figured it must be a window into a stairwell.
So, I put a simple box inside the tree behind that window, with a light at the top.
That way, a hint of inner space would be visible through the window:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/lower_window_maya.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/lower_window_maya.jpg" alt="Figure 11: Simple box inside the tree behind the lower window, to give a hint of inner space." /></a></p>
<p>In addition to shading everything, I also had to make some modifications to the provided treehouse geometry.
I noticed that in the provided model, the satellite dish floats above its support pole without any actual connecting geometry, so I modeled a little connecting bit for the satellite dish.
Also, I thought it would be fun to put some furniture in the round cabin, so I decided to make the walls into plate glass.
Once I made the walls into plate glass, I realized that I needed to make a plausible interior for the round cabin.
Since the only way into the round cabin must be through a staircase in the main tree trunk, I modeled a new door in the back of the round cabin.
With everything shaded and the geometric modifications in place, here is how everything looked at this point:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress085_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/progress085.jpg" alt="Figure 12: In-progress test render with initial fully shaded treehouse, along with geoemtric modifications. Click for 4K version." /></a></p>
<p><strong>Set Dressing the Treehouse</strong></p>
<p>The next major step was adding some story elements.
I wanted the treehouse to feel lived in, like the treehouse is just somebody’s house (a very unusual house, but a house nonetheless).
To help convey that feeling, my plan was to rely heavily on set dressing to hint at the people living here.
So the goal was to add stuff like patio furniture, potted plants, laundry hanging on lines, furniture visible through windows, the various bits and bobs of life, etc.</p>
<p>I started by adding a nice armchair and a lamp to the round tower thing.
Of course the chair is an Eames Lounge Chair, and to match, the lamp is a modern style tripod floor lamp type thing.
I went with a chair and a lamp because I think that round tower would be a lovely place to sit and read and look out the window at the surrounding nature.
I thought it would be kind of fun to make all of the furniture kind of modern and stylish, but have all of the modern furniture be inside of a more whimsical exterior.
Next, I extended the front porch part of the main cabin, so that I could have some room to place furniture and props and stuff.
Of course any good front porch should have some nice patio furniture, so I added some chairs and a table.
I also put in a hanging round swing chair type thing with a big poofy blue cushion; this entire area should be a fun place to sit around and talk in.
Since the entire treehouse sits on the edge of a pond, I figured that maybe the people living here like to sit out on the front porch, relax, shoot the breeze, and fish from the pond.
Since my scene is set in the morning, I figured maybe it’s late in the morning and they’ve set up some fishing lines to catch some fish for dinner later.
To help sell the idea that it’s a lazy fishing morning, I added a fishing hat on one of the chairs and put a pitcher of ice tea and some glasses on the table.
I also added a clothesline with some hanging drying laundry, along with a bunch of potted and hanging plants, just to add a bit more of that lived-in feel.
For the plants and several of the furniture pieces that I knew I would want to tweak later, I built in controls to their shading graphs using PxrColorCorrect nodes to allow me to adjust hue and saturation later.
Many of the furniture, plant, and prop models are highly modified, kitbashed, re-textured versions of assets from Evermotion and CGAxis, although some of them (notably the Eames Lounge Chair) are entirely my own.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress096_crop1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/progress096_crop1.jpg" alt="Figure 13: In-progress test render closeup crop of the lower main cabin, with furniture and plants and props." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress096_crop2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/progress096_crop2.jpg" alt="Figure 14: In-progress test render closeup crop of the glass round cabin and the upper smaller cabin, with furniture and plants and props." /></a></p>
<p><strong>Building the Background Forest</strong></p>
<p>The last step before final lighting was to build a more proper background forest, as a replacement for the temporary forest I had used up until this point for blocking purposes.
For this step, I relied heavily on Maya’s MASH toolset, which I found to provide a great combination of power and ease-of-use; for use cases involving tons of instanced geometry, I certainly found it much easier than Maya’s older Xgen toolset.
MASH felt a lot more native to Maya, as opposed to Xgen, which requires a bunch of specific external file paths and file formats and whatnot.
I started with just getting some kind of reasonable base texturing down onto the groundplane.
In all of the in-progress renders up until this point, the ground plane was just white… you can actually tell if you look closely enough!
I eventually got to a place I was happy with using a bunch of different PxrRoundCubes with various rotations, all blended on top of each other using various noise projections.
I also threw in some rocks from Quixel Megascans, just to add a bit of variety.
I then laid down some low-level ground vegetation, which was meant to peek through the larger trees in various areas.
The base vegetation was made up of various ferns, shrubs, and small sapling-ish young conifers placed using Maya’s MASH Placer node:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/forest_progress029.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/forest_progress029.jpg" alt="Figure 15: In-progress test render of the forest floor and under-canopy vegetation." /></a></p>
<p>In the old temporary background forest, the entire forest is made up of only three different types of trees, and it really shows; there was a distinct lack of color variation or tree diversity.
So, for the new forest, I decided to use a lot more types of trees.
Here is a rough lineup (not necessarily to scale with each other) of how all of the new tree species looked:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/trees_lineup.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/trees_lineup.jpg" alt="Figure 16: Test render of a lineup of the trees used in the final forest." /></a></p>
<p>For the main forest, I hand-placed trees onto the mountain slope as instances.
One cool thing I built into the forest was PxrColorCorrect nodes in all of the tree shading graphs, with all controls wired up to a single set of master hue/saturation/value controls so that I could shift the entire forest’s colors easily if necessary.
This tool proved to be very useful for tuning the overall vegetation colors later while still maintaining a good amount of variation.
I also intentionally left gaps in the forest around the rock formations to give some additional visual variety.
Building up the entire under-layer of shrubs and saplings and stuff also paid off, since a lot of that stuff wound up peeking through various gaps between the larger trees:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/forest_progress050.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/forest_progress050.jpg" alt="Figure 17: In-progress test render of the background forest." /></a></p>
<p>The last step for the main forest was adding some mist and fog, which is common in Pacific Northwest type mountainous conifer forests in the morning.
I didn’t have extensive experience working with volumes in RenderMan before this, so there was definitely something of a learning curve for me, but overall it wasn’t too hard to learn!
I made the mist by just having a Maya Volume Noise node plug into the density field of a PxrVolume; this isn’t anything fancy, but it provided a great start for the mist/fog:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/forest_progress051.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/forest_progress051.jpg" alt="Figure 18: In-progress test render of the background forest with an initial version of mist and fog." /></a></p>
<p><strong>Lighting and Compositing</strong></p>
<p>At this point, I think the entire image together was starting to look pretty good, although, without any final shot lighting, the overall vibe felt more like a spread out of an issue of National Geographic than a more cinematic still out of a film.
Normally my instinct is to go with a more naturalistic look, but since part of the objective for this project was to learn to use RenderMan’s lighting toolset for more cinematic applications, I wanted to push the overall look of the image beyond this point:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/progress099.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/progress099.jpg" alt="Figure 19: In-progress test render with everything together, before final shot lighting." /></a></p>
<p>From this point onwards, following <a href="https://www.youtube.com/watch?v=PWFU-QIljRI">a tutorial made by Jeremy Heintz</a>, I broke out the volumetric mist/fog into a separate layer and render pass in Maya, which allowed for adjusting the mist/fog in comp without having to re-render the entire scene.
This strategy proved to be immensely useful and a huge time saver in final lighting.
Before starting final lighting, I made a handful of small tweaks, which included reworking the moss on the front cabin’s lower support frame to get rid of some visible repetition, tweaking and adding dirt on all of the windows, and dialing in saturation and hue on the clothesline and potted plants a bit more.
I also changed the staircase to have aged wooden steps instead of all black cast iron, which helped blend the staircase into the overall image a bit more, and added some dead trees in the background forest.
Finally, in a last-minute change, I wound up upgrading a lot of the moss on the main tree trunk and on select parts of the cabins to use instanced geometry instead of just being a shading effect.
The geometric moss used atlases from Quixel Megascans, which I bunched into little moss patches and then hand-scattered using Maya’s MASH Placer tool.
Upgrading to geometric moss provided only a subtle change to the overall image, but I think it helped enormously in selling some of the realism and detail; I find it interesting how small visual details like this often can have an outsized impact on selling an overall image.</p>
<p>For final lighting, I added an additional uniform atmospheric haze pass to help visually separate the main treehouse from the background forest a bit more.
I also added a spotlight fog pass to provide some subtle godrays; the spotlight is a standard PxrRectLight oriented to match the angle of the sun, with the cone modifier enabled to provide the spot effect and a <a href="https://rmanwiki.pixar.com/display/REN22/PxrCookieLightFilter">PxrCookieLightFilter</a> with a cucoloris pattern applied to provide the breakup effect that godrays shining through a forest canopy should have.
To provide a stronger key light, I rotated the skydome until I found something I was happy with, and then I split out the sun from the skydome into separate passes.
I split out the sun by painting the sun out of the skydome texture and then creating a PxrDistantLight with an exposure, color, and angle matched to what the sun had been in the skydome.
Splitting out the sun then allowed me to increase the size of the sun (and decrease the exposure correspondingly to maintain the same overall brightness), which helped soften some otherwise pretty harsh, sharp shadows.
I also used a good number of <a href="https://rmanwiki.pixar.com/display/REN22/PxrRodLightFilter">PxrRodLightFilters</a> to help take down highlights in some areas, lighten shadows in others, and provide overall light shaping to areas like the right hand side of the right tree trunk.
I’ve conceptually known why artists like rods for some time now (especially since rods are a heavily used feature in Hyperion at my day job at Disney Animation), but I think this project helped me really understand at a more hands-on level why rods are so great for hitting specific art direction.</p>
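<p>One small detail from the sun-splitting step that is worth making concrete: since a light’s solid angle grows with the square of its angular size, every doubling of the sun’s apparent size needs roughly a two-stop reduction in exposure to keep the overall brightness the same. Here is a quick back-of-the-envelope sketch; the specific angles are made-up example numbers, not the values I actually used:</p>
<pre><code>import math

# Angular sizes in degrees; these are made-up example numbers.
original_angle = 0.53   # roughly the real sun's angular diameter
enlarged_angle = 5.0    # a much bigger sun for softer shadows

scale = enlarged_angle / original_angle
# Solid angle grows with the square of the angular size, so compensate by
# -2 * log2(scale) stops to keep the total emitted energy roughly constant.
compensation_stops = -2.0 * math.log2(scale)
print(f"scale the sun {scale:.1f}x, compensate exposure by {compensation_stops:.2f} stops")
</code></pre>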
<p>After much iteration, here is the final set of render passes I wound up with going into final compositing:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_sun_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_sun.jpg" alt="Figure 19: Final render, sun (key) pass. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_sky_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_sky.jpg" alt="Figure 20: Final render, sky (fill) pass. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_practical_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_practical.jpg" alt="Figure 21: Final render, practical lights pass. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_volumes_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_volumes.jpg" alt="Figure 22: Final render, mist/fog pass. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_atmos_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_atmos.jpg" alt="Figure 23: Final render, atmospheric pass. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_spot_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_spot.jpg" alt="Figure 24: Final render, spotlight pass. Click for 4K version." /></a></p>
<p>In final compositing, since I had everything broken out into separate passes, I was able to quickly make a number of adjustments that otherwise would have been much slower to iterate on if I had done them in-render.
I tinted the sun pass to be warmer (which is equivalent to changing the sun color in-render and re-rendering), and I tweaked the exposure of the sun pass up and the exposures of some of the volumetric passes down to balance out the overall image.
I also applied a cooler color tint to the mist/fog pass, which would have been very slow to experiment with if I had changed the actual fog color in-render.
I did all of the compositing in Photoshop, since I don’t have a Nuke license at home.
Not having a node-based compositing workflow was annoying, so next time I’ll probably try to learn DaVinci Resolve Fusion (which I hear is pretty good).</p>
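<p>To give a rough sense of what these comp operations boil down to mathematically, here is a small NumPy sketch of additive pass recombination with per-pass tints and exposure adjustments. The tint and exposure values below are purely illustrative placeholders, not the numbers I actually used:</p>
<pre><code>import numpy as np

def expose(img, stops):
    # Scale a linear-light image by a number of photographic stops.
    return img * (2.0 ** stops)

def tint(img, rgb):
    # Multiply a linear-light image by a per-channel color.
    return img * np.asarray(rgb, dtype=img.dtype)

# In practice sun_pass, sky_pass, mist_pass, etc. would be loaded from the
# rendered EXR passes; flat placeholder images keep the sketch self-contained.
h, w = 270, 480
sun_pass = np.full((h, w, 3), 0.8, dtype=np.float32)
sky_pass = np.full((h, w, 3), 0.2, dtype=np.float32)
mist_pass = np.full((h, w, 3), 0.05, dtype=np.float32)

comp = tint(expose(sun_pass, 0.3), (1.05, 1.0, 0.9))             # warmer, slightly brighter key
comp = comp + sky_pass
comp = comp + tint(expose(mist_pass, -0.5), (0.9, 0.95, 1.05))   # cooler, dimmer mist
</code></pre>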
<p>For color grading, I mostly just fiddled around in Lightroom.
I also added in a small amount of bloom by just duplicating the sun pass, clipping it to only really bright highlight values by adjusting levels in Photoshop, applying a Gaussian blur, exposing down, and adding back over the final comp.
Finally, I adjusted the gamma by 0.8 and exposed up by half a stop to give some additional contrast and saturation, which helped everything pop a bit more and feel a bit more moody and warm.
Figure 26 shows what all of the lighting, comp, and color grading looks like applied to a 50% grey clay-shaded version of the scene, and if you don’t want to scroll all the way back to the top of this post to see the final image, I’ve included it again as Figure 27.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_grey_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_grey.jpg" alt="Figure 25: Final lighting, comp, and color grading applied to a 50% grey clay shaded version. Click for 4K version." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/woodville_full_4k.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/woodville/preview/woodville_full.jpg" alt="Figure 26: Final image. Click for 4K version." /></a></p>
<p><strong>Conclusion</strong></p>
<p>Overall, I had a lot of fun on this project, and I learned an enormous amount!
This project was probably the most complex and difficult art project I’ve ever done.
I think working on this project has shed a lot of light for me on why artists like certain workflows, which is an incredibly important set of insights for my day job as a rendering engineer.
I won’t grumble as much about having to support rods in production rendering now!</p>
<p>Here is a neat progression video I put together from all of the test and in-progress renders that I saved throughout this entire project:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/376059761" frameborder="0">Woodville Art Challenge Progression</iframe></div>
<p>I owe several people an enormous debt of thanks on this project.
My wife, Harmony Li, deserves all of my gratitude for her patience with me during this project, and also for being my art director and overall sanity checker.
My coworker at Disney Animation, lighting supervisor Jennifer Yu, gave me a lot of valuable critiques, advice, and suggestions, and acted as my lighting director during the final lighting and compositing stage.
Leif Pederson from Pixar’s RenderMan group provided a lot of useful tips and advice on the RenderMan contest forum as well.</p>
<p>Finally, my final image somehow managed to score an honorable mention in <a href="https://renderman.pixar.com/news/renderman-woodville-art-challenge-final-results">Pixar’s Art Challenge Final Results</a>, which was a big, unexpected, pleasant surprise, especially given how amazing all of the other entries in the contest are!
Since the main purpose of this project for me was to serve as a learning exercise, doing well in the actual contest was a nice bonus, and it makes me think I’ll likely give the next RenderMan Art Challenge a shot too, with a more serious focus on putting up a good showing.
If you’d like to see more about my contest entry, check out the <a href="https://renderman.pixar.com/answers/idea/10201/morning-retreat.html">work-in-progress thread I kept up in Pixar’s Art Challenge forum</a>; some of the text for this post was adapted from updates I made in my forum thread.</p>
https://blog.yiningkarlli.com/2019/11/froz2.html
Frozen 2
2019-11-14T00:00:00+00:00
2019-11-14T00:00:00+00:00
Yining Karl Li
<p>The 2019 film from <a href="http://www.disneyanimation.com">Walt Disney Animation Studios</a> is, of course, <a href="http://www.disneyanimation.com/projects/frozen2">Frozen 2</a>, which really does not need any additional introduction.
Instead, here is a brief personal anecdote.
I remember seeing the first Frozen in theaters the day it came out, and at some point halfway through the movie, it dawned on me that what was unfolding on the screen was really something special.
By the end of the first Frozen, I was convinced that I had to somehow get myself a job at Disney Animation some day.
Six years later, here we are, with Frozen 2’s release imminent, and here I am at Disney Animation.
Frozen 2 is my fourth credit at Disney Animation, but somehow seeing my name in the credits at the wrap party for this film was even more surreal than seeing my name in the credits on my first film.
Working with everyone on Frozen 2 was an enormous privilege and thrill; I’m incredibly proud of the work we have done on this film!</p>
<p>Under team lead Dan Teece’s leadership, for Frozen 2 we pushed Disney’s <a href="http://www.disneyanimation.com/technology/innovations/hyperion">Hyperion Renderer</a> harder and further than ever before, and I think the result really shows in the final film.
Frozen 2 is stunningly beautiful to look at; seeing it for the first time in its completed form was a humbling experience, since there were many moments where I realized I honestly had no idea how our artists had managed to push the renderer as far as they did.
During the production of Frozen 2, we also welcomed three superstar rendering engineers to the rendering team: <a href="http://rgba32.blogspot.com">Mark Lee</a>, <a href="https://schuttejoe.github.io">Joe Schutte</a>, and <a href="http://rendering-memo.blogspot.com">Wei-Feng Wayne Huang</a>; their contributions to our team and to Frozen 2 simply cannot be overstated!</p>
<p>On Frozen 2, I got to play a part on several fun and interesting initiatives!
Hyperion’s modern volume rendering system saw a number of major improvements and advancements for Frozen 2, mostly centered around rendering optically thin volumes.
Hyperion’s modern volume rendering system is <a href="https://blog.yiningkarlli.com/2017/07/spectral-and-decomposition-tracking.html">based on null-collision tracking theory</a> <a href="https://dl.acm.org/citation.cfm?id=3073665">[Kutz et al. 2017]</a>, which is exceptionally well suited for dense volumes dominated by high-order scattering (such as clouds and snow).
However, as anyone with experience developing a volume rendering system knows, optically thin volumes (such as mist and fog) are a major weak point for null-collision techniques.
Wayne was responsible for a number of major advancements that allowed us to efficiently render mist and fog on Frozen 2 using the modern volume rendering system, and Wayne was kind enough to allow me to play something of an advisory / consulting role on that project.
Also, Frozen 2 is the first feature film on which we’ve deployed Hyperion’s path guiding implementation into production; this project was the result of some very tight collaboration between Disney Animation and <a href="https://studios.disneyresearch.com">Disney Research Studios</a>.
Last summer, I worked with Peter Kutz, our summer intern <a href="http://omnigraphica.com">Laura Lediaev</a>, and with <a href="https://research.nvidia.com/person/thomas-mueller">Thomas Müller</a> from ETH Zürich / Disney Research Studios to prototype an implementation of <a href="https://tom94.net/pages/publications/mueller17practical-erratum">Practical Path Guiding</a> <a href="https://doi.org/10.1111/cgf.13227">[Müller et al. 2017]</a> in Hyperion.
Joe Schutte then took on the massive task (as one of his first tasks on the team, no less!) of turning the prototype into a production-quality feature, and Joe worked with Thomas to develop a number of improvements to the original paper <a href="https://tom94.net/data/courses/vorba19guiding/vorba19guiding.pdf">[Müller 2019]</a>.
Finally, I worked on some lighting / shading improvements for Frozen 2, which included developing a new spot light implementation for theatrical lighting, and, with Matt Chiang and Brent Burley, a <a href="https://www.yiningkarlli.com/projects/shadowterminator.html">solution to the long-standing normal / bump mapped shadow terminator problem</a> <a href="https://dl.acm.org/citation.cfm?id=3328172">[Chiang et al. 2019]</a>.
We also benefited from more improvements in our denoising tech <a href="https://doi.org/10.1145/3306307.3328150">[Dahlberg et al. 2019]</a> which arose as a joint effort between our own David Adler, ILM, Pixar and the Disney Research Studios rendering team.</p>
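<p>For readers who haven’t encountered null-collision tracking before, here is a textbook-style sketch of the basic free-flight sampling loop that this family of techniques is built on (generic illustrative Python, not Hyperion code): distances are sampled against a constant majorant density, and each tentative collision is probabilistically classified as a real or a null collision.</p>
<pre><code>import math
import random

def sample_free_flight(density_at, majorant, max_distance):
    # Sample a tentative collision distance against the constant majorant,
    # then probabilistically classify it as a real or a null collision.
    # Returns the distance of a real collision, or None if the ray escapes.
    t = 0.0
    while True:
        t -= math.log(1.0 - random.random()) / majorant   # exponential free flight
        if t >= max_distance:
            return None                                    # escaped the medium
        if density_at(t) >= random.random() * majorant:
            return t                                       # real collision
        # Otherwise this was a null collision; keep marching.

# Example: a thin, smoothly varying density field bounded by its majorant.
density = lambda t: 0.02 * (1.0 + math.sin(t))
hit = sample_free_flight(density, majorant=0.04, max_distance=100.0)
</code></pre>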
<p>I think Frozen projects provide an interesting window into how far rendering has progressed at Disney Animation over the past six years.
We’ve basically had some Frozen project going on every few years, and each Frozen project upon completion has represented the most cutting edge rendering capabilities we’ve had at the time.
The original Frozen in 2013 was the studio’s last project rendered using RenderMan, and also the studio’s last project to not use path tracing.
Frozen Fever in 2015, by contrast, was one of the first projects (alongside Big Hero 6) to use Hyperion and full path traced global illumination.
The jump in visual quality between Frozen and Frozen Fever was enormous, especially considering that they were released only a year and a half apart.
Olaf’s Frozen Adventure, which I’ve <a href="https://blog.yiningkarlli.com/2017/11/olafs-frozen-adventure.html">written about before</a>, served as the testbed for a number of enormous changes and advancements that were made to Hyperion in preparation for Ralph Breaks the Internet.
Frozen 2 represents the full extent of what Hyperion can do today, now that Hyperion is a production-hardened, mature renderer backed by a team that is now very experienced.
The original Frozen looked decent when it first came out, but since it was the last non-path-traced film we made, it looked dated visually just a few years later.
Comparing the original Frozen with Frozen 2 is like night and day; I’m very confident that Frozen 2 will still look visually stunning and hold up well long into the future.
A great example is in all of the clothing in Frozen 2; when watching the film, take a close look at all of the embroidery on all of the garments.
In the original Frozen, a lot of the embroidery work is displacement mapped or even just normal mapped, but in Frozen 2, all of the embroidery is painstakingly constructed from actual geometric curves <a href="https://dl.acm.org/doi/10.1145/3388767.3407360">[Liu et al. 2020]</a>, and as a result every bit of embroidery is rendered in incredible detail!</p>
<p>One particular thing in Frozen 2 that makes me especially happy is how all of the water looks in the film, and especially how the water looks in the dark seas sequence.
On Moana, we really struggled with getting whitewater and foam to look appropriately bright and white.
Since that bright white effect comes from high-order scattering in volumes and at the time we were still using our old volume rendering system that couldn’t handle high-order scattering well, the artists on Moana wound up having to rely on a lot of ingenious trickery to get whitewater and foam to look just okay.
I think Moana is a staggeringly beautiful film, but if you know where to look, you may be able to tell that the foam looks just a tad bit off.
On Frozen 2, however, we were able to do high-order scattering, and as a result, all of the whitewater and foam in the dark seas sequence looks just absolutely amazing.
No spoilers, but all I’ll say is that there’s another part in the movie that isn’t in any trailer where my jaw was just on the floor in terms of water rendering; you’ll know it when you see it.
A similar effect has been done before in a previous CG Disney Animation movie, but the effect in Frozen 2 is on a far grander, far more impressive, far more amazing scale <a href="https://dl.acm.org/doi/10.1145/3388767.3407333">[Tollec et al. 2020]</a>.</p>
<p>In addition to the rendering tech advancements we made on Frozen 2, there are a bunch of other cool technical initiatives that I’d recommend reading about!
Each of our films has its own distinct world and look, and the style requirements on Frozen 2 often required really cool close collaborations between the lighting and look departments and the rendering team; the “Show Yourself” sequence near the end of the film was a great example of the amazing work these collaborations can produce <a href="https://doi.org/10.1145/3388767.3407388">[Sathe et al. 2020]</a>.
Frozen 2 had a lot of characters that were actually complex effects, such as the Wind Spirit <a href="https://dl.acm.org/doi/10.1145/3388767.3407346">[Black et al. 2020]</a> and the Nokk water horse <a href="https://dl.acm.org/doi/10.1145/3388767.3407345">[Hutchins et al. 2020]</a>; these characters required tight collaborations between a whole swath of departments ranging from animation to simulation to look to effects to lighting.
Even the forest setting of the film required new tech advancements; we’ve made plenty of forests before, but integrating huge-scale effects into the forest resulted in some cool new workflows and techniques <a href="https://dl.acm.org/doi/10.1145/3388767.3409320">[Joseph et al. 2020]</a>.</p>
<p>To give a sense of just how gorgeous Frozen 2 looks, below are some stills from the movie, in no particular order, 100% rendered using Hyperion.
If you love seeing cutting edge rendering in action, I strongly encourage going to see Frozen 2 on the biggest screen you can find!
The film has wonderful songs, a fantastic story, and developed, complex, funny characters, and of course there is not a single frame in the movie that isn’t stunningly beautiful.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_40.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_40.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_12.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_12.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_24.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_24.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_37.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_37.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_68.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_68.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_77.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_77.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_54.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_54.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_01.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_01.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_23.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_23.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_43.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_43.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_02.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_27.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_27.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_21.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_21.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_03.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_03.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_04.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_17.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_17.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_22.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_22.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_28.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_28.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_05.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_05.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_41.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_41.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_06.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_06.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_20.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_20.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_11.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_11.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_13.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_13.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_14.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_14.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_15.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_15.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_16.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_16.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_18.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_18.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_19.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_19.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_25.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_25.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_10.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_10.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_26.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_26.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_29.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_29.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_30.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_30.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_07.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_07.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_31.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_31.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_32.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_32.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_08.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_08.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_52.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_52.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_33.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_33.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_34.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_34.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_35.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_35.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_36.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_36.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_63.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_63.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_09.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_09.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_38.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_38.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_39.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_39.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_72.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_72.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_42.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_42.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_60.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_60.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_44.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_44.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_46.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_46.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_47.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_47.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_48.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_48.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_49.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_49.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_50.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_50.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_64.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_64.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_51.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_51.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_45.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_45.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_53.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_53.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_56.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_56.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_57.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_57.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_58.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_58.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_59.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_59.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_61.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_61.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_62.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_62.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_65.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_65.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_66.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_66.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_67.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_67.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_69.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_69.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_71.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_71.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_73.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_73.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_74.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_74.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_75.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_75.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_55.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_55.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_76.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_76.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_70.jpg"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_70.jpg" alt="" /></a></p>
<p>Here is the part of the credits with Disney Animation’s rendering team, kindly provided by Disney!
I always encourage sitting through the credits for movies, since everyone in the credits put so much hard work and passion into what you see onscreen, but I especially recommend it for Frozen 2 since there’s also a great post-credits scene.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_credits.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Nov/froz2/FROZ2_credits.png" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p><strong>References</strong></p>
<p>Cameron Black, Trent Correy, and Benjamin Fiske. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3407346">Frozen 2: Creating the Wind Spirit</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 22:1-22:2.</p>
<p>Matt Jen-Yuan Chiang, Yining Karl Li, and Brent Burley. 2019. <a href="https://dl.acm.org/citation.cfm?id=3328172">Taming the Shadow Terminator</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 71:1-71:2.</p>
<p>Henrik Dahlberg, David Adler, and Jeremy Newlin. 2019. <a href="https://dl.acm.org/citation.cfm?id=3328150">Machine-Learning Denoising in Feature Film Production</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 21:1-21:2.</p>
<p>David Hutchins, Cameron Black, Marc Bryant, Richard Lehmann, and Svetla Radivoeva. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3407345">“Frozen 2”: Creating the Water Horse </a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 23:1-23:2.</p>
<p>Norman Moses Joseph, Vijoy Gaddipati, Benjamin Fiske, Marie Tollec, and Tad Miller. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3409320">Frozen 2: Effects Vegetation Pipeline</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 7:1-7:2.</p>
<p>Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. <a href="https://doi.org/10.1145/3072959.3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 111:1-111:16.</p>
<p>Ying Liu, Jared Wright, and Alexander Alvarado. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3407360">Making Beautiful Embroidery for “Frozen 2”</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 73:1-73:2.</p>
<p>Thomas Müller. <a href="https://cgg.mff.cuni.cz/~jaroslav/papers/2019-path-guiding-course/index.htm">Practical Path Guiding in Production</a>. 2019. In <em>ACM SIGGRAPH 2019 Course Notes: <a href="https://cgg.mff.cuni.cz/~jaroslav/papers/2019-path-guiding-course/index.htm">Path Guiding in Production</a></em>. 37-50.</p>
<p>Thomas Müller, Markus Gross, and Jan Novák. 2017. <a href="https://doi.org/10.1111/cgf.13227">Practical Path Guiding for Efficient Light-Transport Simulation</a>. <em>Computer Graphics Forum</em>. 36, 4 (2017), 91-100.</p>
<p>Amol Sathe, Lance Summers, Matt Jen-Yuan Chiang, and James Newland. 2020. <a href="https://doi.org/10.1145/3388767.3407388">The Look and Lighting of “Show Yourself” in “Frozen 2”</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 71:1-71:2.</p>
<p>Marie Tollec, Sean Jenkins, Lance Summers, and Charles Cunningham-Scott. 2020. <a href="https://dl.acm.org/doi/10.1145/3388767.3407333">Deconstructing Destruction: Making and Breaking of ”Frozen 2”’s Dam</a>. In <em>ACM SIGGRAPH 2020 Talks</em>. 24:1-24:2.</p>
https://blog.yiningkarlli.com/2019/08/taming-the-shadow-terminator.html
SIGGRAPH 2019 Talk- Taming the Shadow Terminator
2019-08-01T00:00:00+00:00
2019-08-01T00:00:00+00:00
Yining Karl Li
<p>This year at SIGGRAPH 2019, Matt Jen-Yuan Chiang, Brent Burley, and I had a talk that presents a technique for smoothing out the harsh shadow terminator problem that often arises when high-frequency bump or normal mapping is used in ray tracing.
We developed this technique as part of general development on <a href="https://www.disneyanimation.com/technology/innovations/hyperion">Disney’s Hyperion Renderer</a> for the production of Frozen 2.
This work is mostly Matt’s; Matt was very kind in allowing me to help out and play a small role on this project.</p>
<p>This work is contemporaneous with the recent work on the same shadow terminator problem that was <a href="https://link.springer.com/chapter/10.1007/978-1-4842-4427-2_12">carried out by Estevez et al. from Sony Pictures Imageworks</a> and published in <a href="https://www.realtimerendering.com/raytracinggems/">Ray Tracing Gems</a>.
We actually found out about the Estevez et al. technique at almost exactly the same time that we submitted our SIGGRAPH talk, which proved to be very fortunate, since after our talk was accepted, we were then able to update our short paper with additional comparisons between Estevez et al. and our technique.
I think this is a great example of how having multiple rendering teams in the field tackling similar problems and sharing results provides a huge benefit to the field as a whole; we now have two different, really good solutions to what used to be a big shading problem!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Aug/header.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Aug/preview/header.jpg" alt="A higher-res version of Figure 1 from the paper: (left) <a href="https://blog.yiningkarlli.com/content/images/2019/Aug/header_shadingnormals.png">shading normals</a> exhibiting the harsh shadow terminator problem, (center) <a href="https://blog.yiningkarlli.com/content/images/2019/Aug/header_chiang.png">our technique</a>, and (right) <a href="https://blog.yiningkarlli.com/content/images/2019/Aug/header_estevez.png">Estevez et al.'s technique</a>." /></a></p>
<p>Here is the paper abstract:</p>
<p><em>A longstanding problem with the use of shading normals is the discontinuity introduced into the cosine falloff where part of the hemisphere around the shading normal falls below the geometric surface.
Our solution is to add a geometrically derived shadowing function that adds minimal additional shadowing while falling smoothly to zero at the terminator.
Our shadowing function is simple, robust, efficient and production proven.</em></p>
<p>The paper and related materials can be found at:</p>
<ul>
<li><a href="https://www.yiningkarlli.com/projects/shadowterminator.html">Project Page (Author’s Version and Presentation Slides)</a></li>
<li><a href="https://dl.acm.org/doi/10.1145/3306307.3328172">Official Print Version (ACM Library)</a></li>
</ul>
<p>Matt Chiang presented the paper at SIGGRAPH 2019 in Los Angeles as part of the “Lucy in the Sky with Diamonds - Processing Visuals” Talks session.
A PDF version of the presentation slides, along with presenter notes, is available on my project page for the paper.
I’d also recommend getting the author’s version of the short paper instead of the official version, since the author’s version includes some typo fixes made after the official version was published.</p>
<p>Work on this project started early in the production of Frozen 2, when our look artists started to develop the shading of the dresses and costumes in Frozen 2.
Because intricate woven fabrics and patterns are an important part of the Scandinavian culture that Frozen 2 is inspired by, the shading in Frozen 2 pushed high-resolution, high-frequency displacement and normal mapping further than we ever had before with Hyperion in order to make convincing-looking textiles.
Because of how high-frequency the normal mapping was pushed, the bump/normal mapped shadow terminator problem became worse and worse and proved to be a major pain point for our look and lighting artists.
In the past, our look and lighting artists have worked around shadow terminator issues using a combination of techniques, such as falling back to full displacement, or using larger area lights to try to soften the shadow terminator.
However, these techniques can be problematic when they are in conflict with art direction, and force artists to think about an additional technical dimension when they otherwise would rather be focused on the artistry.</p>
<p>Our search for a solution began with Peter Kutz looking at <a href="https://dl.acm.org/doi/10.1145/3130800.3130806">“Microfacet-based Normal Mapping for Robust Monte Carlo Path Tracing” by Schüssler et al.</a>, which focused on addressing energy loss when rendering shading normals.
The Schüssler et al. 2017 technique solved the energy loss problem by constructing a microfacet surface made up of <em>two</em> facets per shading point, instead of the usual one.
The secondary facet is used to account for things like inter-reflections between the primary and secondary facets.
However, the Schüssler et al. 2017 technique wound up not solving the shadow terminator problems we were facing; using their shadowing function produced a look that was too flat.</p>
<p>Matt Chiang then realized that the secondary microfacet approach could be used to solve the shadow terminator problem using a different secondary microfacet configuration; instead of using a vertical second facet as in Schüssler, Matt made the secondary facet perpendicular to the shading normal.
By making the secondary facet perpendicular, as a light source slowly moves towards the grazing angle relative to the microfacet surface, peak brightness is maintained when the light is parallel to the shading normal, while additional shadowing is introduced beyond the parallel angle.
This solution worked extremely well, and is the technique presented in our talk / short paper.</p>
<p>The final piece of the puzzle was addressing a visual discontinuity produced by Matt’s technique when the light direction reaches and moves beyond the shading normal.
Instead of falling smoothly to zero, the shape of the shadow terminator undergoes a hard shift from a cosine fall-off, formed by the dot product of the shading normal and light direction, to a linear fall-off.
Matt and I played with a number of different interpolation schemes to smooth out this transition, and eventually settled on a custom smooth-step function.
During this process, I made the observation that whatever blending function we used needed to introduce C1 continuity in order to remove the visual discontinuity.
This observation led Brent Burley to realize that instead of a complex custom smooth-step function, a simple Hermite interpolation would be enough; this Hermite interpolation is the one presented in the talk / short paper.</p>
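<p>To make the shape of the final shadowing term a little more concrete, here is a small illustrative sketch of the kind of function described above: a geometrically derived shadowing factor built from the geometric normal, the shading normal, and the light direction, eased into 1 with a cubic Hermite so that the transition at the shading normal is C1-continuous. This is my own shorthand reconstruction from the description in this post, not the verbatim published formulation; see the short paper and slides for the real thing:</p>
<pre><code>import numpy as np

def hermite_blend(g):
    # Cubic with f(0)=0, f(1)=1, f'(0)=1, f'(1)=0, so the extra shadowing
    # eases into "no extra shadowing" with zero slope instead of a hard kink.
    return -g**3 + g**2 + g

def terminator_shadowing(n_geom, n_shade, light_dir):
    # Extra shadowing factor in [0, 1], multiplied on top of the BSDF result.
    ng_l = max(float(np.dot(n_geom, light_dir)), 0.0)
    ns_l = max(float(np.dot(n_shade, light_dir)), 0.0)
    ng_ns = max(float(np.dot(n_geom, n_shade)), 0.0)
    g = min(1.0, ng_l / max(ns_l * ng_ns, 1e-6))
    return hermite_blend(g)
</code></pre>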
<p>For a much more in-depth view at all of the above, complete with diagrams and figures and examples, I highly recommend looking at Matt’s presentation slides and presenter notes.</p>
<p>Here is a test render of the Iduna character’s costume from Frozen 2, from before we had this technique implemented in Hyperion.
The harsh shadow terminator produces an illusion that makes her arms and torso look boxier than the actual underlying geometry is:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Aug/iduna_hardterminator.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Aug/iduna_hardterminator.png" alt="Iduna's costume without our shadow terminator technique. Note how boxy the arms and torso look." /></a></p>
<p>…and here is the same test render, but now with our soft shadow terminator fix implemented and enabled.
Note how her arms and torso now look properly rounded, instead of boxy!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Aug/iduna_softterminator.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Aug/iduna_softterminator.png" alt="Iduna's costume with our shadow terminator technique. The arms and torso look correctly rounded now." /></a></p>
<p>This technique is now enabled by default across the board in Hyperion, and any article of clothing or costume you see in Frozen 2 is using this technique.
So, through this project, we got to play a small role in making Elsa, Anna, Kristoff, and everyone else look like themselves!</p>
https://blog.yiningkarlli.com/2019/07/hyperion-papers.html
Hyperion Publications
2019-07-30T00:00:00+00:00
2019-07-30T00:00:00+00:00
Yining Karl Li
<p>Every year at SIGGRAPH (and sometimes at other points in the year), members of the Hyperion team inevitably get asked if there is any publicly available information about <a href="https://www.disneyanimation.com/technology/hyperion/">Disney’s Hyperion Renderer</a>.
The answer is: yes, there is actually a lot of publicly available information!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Jul/FirstPagesv5.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Jul/FirstPagesv5_prev.png" alt="Figure 1: Previews of the first page of every Hyperion-related publication from Disney Animation, Disney Research Studios, and other research partners." /></a></p>
<p>One amazing aspect of working at Walt Disney Animation Studios is the huge amount of support and encouragement we get from our managers and the wider studio for publishing and sharing our work with the wider academic world and industry.
As part of this sharing, the Hyperion team has had the opportunity to publish a number of papers over the years detailing various interesting techniques used in the renderer.</p>
<p>I think it’s very important to mention here that another one of my favorite aspects of working on the Hyperion team is the deep collaboration we get to engage in with our sister rendering team at <a href="https://studios.disneyresearch.com">Disney Research Studios</a> (formerly known as Disney Research Zürich).
The vast majority of the Hyperion team’s publications are joint works with Disney Research Studios, and I personally think it’s fair to say that all of Hyperion’s most interesting advanced features are just as much the result of research and work from Disney Research Studios as they are the result of our team’s own work.
Without a doubt, Hyperion, and by extension, our movies, would not be what they are today without Disney Research Studios.
Of course, we also collaborate closely with our sister rendering teams at <a href="https://www.pixar.com">Pixar Animation Studios</a> and <a href="https://www.ilm.com">Industrial Light & Magic</a> as well, and there are numerous examples where collaboration between all of these teams has advanced the state of the art in rendering for the whole industry.</p>
<p>So without further ado, below are all of the papers that the Hyperion team has published or worked on or had involvement with over the years, either by ourselves or with our counterparts at Disney Research Studios, Pixar, ILM, and other research groups.
If you’ve ever been curious to learn more about Disney’s Hyperion Renderer, here are 43 publications with a combined 441 pages of material!
For each paper, I’ll link to a preprint version, link to the official publisher’s version, and link any additional relevant resources for the paper.
I’ll also give the citation information, give a brief description, list the teams involved, and note how the paper is relevant to Hyperion.
This post is meant to be a living document; I’ll come back and update it down the line with future publications. Publications are listed in chronological order.</p>
<ol>
<li>
<p><strong>Ptex: Per-Face Texture Mapping for Production Rendering</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a> and <a href="https://www.linkedin.com/in/dylanlacewell/">Dylan Lacewell</a>. Ptex: Per-face Texture Mapping for Production Rendering. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2008)</em>, 27(4), June 2008.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1EdMYHhs4h_ICcSgGfA4GzZoRNI_yVryA">Preprint Version</a></li>
<li><a href="https://doi.org/10.1111/j.1467-8659.2008.01253.x">Official Publisher’s Version</a></li>
<li><a href="http://ptex.us">Open Source Project</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes per-face textures, a UV-free way of texture mapping. Ptex is the texturing system used in Hyperion for all non-procedural texture maps. Every Disney Animation film made using Hyperion is textured entirely with Ptex. Ptex is also available in many commercial renderers, such as <a href="https://renderman.pixar.com">Pixar’s RenderMan</a>!</p>
</li>
<li>
<p><strong>Physically-Based Shading at Disney</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Physically Based Shading at Disney. In <em>ACM SIGGRAPH 2012 Course Notes: Practical Physically-Based Shading in Film and Game Production</em>, August 2012.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1SwEWQadyMPo5m49kIACoFq2R6q0bZJz7">Preprint Version</a> (Updated compared to official version)</li>
<li><a href="https://doi.org/10.1145/2343483.2343493">Official Publisher’s Version</a></li>
<li><a href="https://blog.selfshadow.com/publications/s2012-shading-course/">Physically Based Shading SIGGRAPH 2012 Course</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the Disney BRDF, a physically principled shading model with an artist-friendly parameterization and layering system. The Disney BRDF is the basis of Hyperion’s entire shading system. The Disney BRDF has also gained widespread industry adoption as the basis of a wide variety of physically based shading systems, and has influenced the design of shading systems in a number of other production renderers. Every Disney Animation film made using Hyperion is shaded using the Disney BSDF (an extended version of the Disney BRDF, described in a later paper).</p>
</li>
<li>
<p><strong>Sorted Deferred Shading for Production Path Tracing</strong></p>
<p><a href="https://www.linkedin.com/in/christian-eisenacher-477ab983/">Christian Eisenacher</a>, <a href="https://www.linkedin.com/in/gregory-nichols/">Gregory Nichols</a>, <a href="http://www.andyselle.com">Andrew Selle</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Sorted Deferred Shading for Production Path Tracing. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2013)</em>, 32(4), June 2013.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1zha14cniwtvy8Xkn2Jv9jE5Y1T50VSJS">Preprint Version</a></li>
<li><a href="https://doi.org/10.1111/cgf.12158">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. Won the Best Paper Award at EGSR 2013! This paper describes the sorted deferred shading architecture that is at the very core of Hyperion. Along with the previous two papers in this list, this is one of the foundational papers for Hyperion; every film rendered using Hyperion is rendered using this architecture.</p>
</li>
<li>
<p><strong>Residual Ratio Tracking for Estimating Attenuation in Participating Media</strong></p>
<p><a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, <a href="http://www.andyselle.com">Andrew Selle</a>, and <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>. Residual Ratio Tracking for Estimating Attenuation in Participating Media. <em>ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2014)</em>, 33(6), November 2014.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1b1RkW3eFAgM-i6IZ_m0jmQcPfr7cOLEz">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2661229.2661292">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/RRTracking/">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper described a pair of new, complementary techniques for evaluating transmittance in heterogeneous volumes. These two techniques made up the core of Hyperion’s first and second generation volume rendering implementations, used from <em>Big Hero 6</em> up through <em>Moana</em>.</p>
</li>
<li>
<p><strong>Visualizing Building Interiors using Virtual Windows</strong></p>
<p><a href="https://www.linkedin.com/in/normanmosesjoseph/">Norman Moses Joseph</a>, <a href="https://www.imdb.com/name/nm0009853/">Brett Achorn</a>, <a href="https://www.linkedin.com/in/sean-jenkins-a1352062/">Sean D. Jenkins</a>, and <a href="https://www.linkedin.com/in/hank-driskill-1a7140165/">Hank Driskill</a>. Visualizing Building Interiors using Virtual Windows. In <em>ACM SIGGRAPH Asia 2014 Technical Briefs</em>, December 2014.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1ugDBwIxmYKGCMOyfNSF2fwMhRX6BjR_g">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2669024.2669029">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes Hyperion’s “hologram shader”, which is used for creating the appearance of parallaxed, fully shaded, detailed building interiors without adding additional geometric complexity to a scene. This technique was developed for <em>Big Hero 6</em>. Be sure to check out the supplemental materials on the publisher site for a cool video breakdown of the technique.</p>
</li>
<li>
<p><strong>Path-space Motion Estimation and Decomposition for Robust Animation Filtering</strong></p>
<p><a href="https://graphics.ethz.ch/~hzimmer/">Henning Zimmer</a>, <a href="https://research.nvidia.com/person/fabrice-rousselle">Fabrice Rousselle</a>, <a href="http://rgl.epfl.ch/people/wjakob/">Wenzel Jakob</a>, <a href="http://zurich.disneyresearch.com/~owang/">Oliver Wang</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>, <a href="http://igl.ethz.ch/people/sorkine/">Olga Sorkine-Hornung</a>, and <a href="http://www.ahornung.net/">Alexander Sorkine-Hornung</a>. Path-space Motion Estimation and Decomposition for Robust Animation Filtering. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2015)</em>, 34(4), June 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=19Me6xkA9jBIlydMMgEeC9Uor93MqBhtW">Preprint Version</a></li>
<li><a href="http://doi.org/10.1111/cgf.12685">Official Publisher’s Version</a></li>
<li><a href="https://cs.dartmouth.edu/~wjarosz/publications/zimmer15path.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios, ETH Zürich, and Disney Animation. This paper describes a denoising technique suitable for animated sequences. Not directly used in Hyperion’s denoiser, but both inspired by and influential towards Hyperion’s first generation denoiser.</p>
</li>
<li>
<p><strong>Portal-Masked Environment Map Sampling</strong></p>
<p><a href="https://benedikt-bitterli.me">Benedikt Bitterli</a>, <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, and <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>. Portal-Masked Environment Map Sampling. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2015)</em>, 34(4), June 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=14zzjee1MAhUPsQ2vUzPQyZFo-J8Gud7b">Preprint Version</a></li>
<li><a href="http://doi.org/10.1111/cgf.12674">Official Publisher’s Version</a></li>
<li><a href="https://benedikt-bitterli.me/pmems.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper describes an efficient method for importance sampling environment maps. This paper was actually derived from the technique Hyperion uses for importance sampling lights with IES profiles, which has been used on all films rendered using Hyperion.</p>
</li>
<li>
<p><strong>A Practical and Controllable Hair and Fur Model for Production Path Tracing</strong></p>
<p><a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://benedikt-bitterli.me">Benedikt Bitterli</a>, <a href="https://www.linkedin.com/in/chuck-tappan-40762450/">Chuck Tappan</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. A Practical and Controllable Hair and Fur Model for Production Path Tracing. In <em>ACM SIGGRAPH 2015 Talks</em>, August 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=19k6mnZMJXmgDSwy1Hcb7fFdMstALjTto">Preprint Version</a></li>
<li><a href="http://doi.org/10.1145/2775280.2792559">Official Publisher’s Version</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This short paper gives an overview of Hyperion’s fur and hair model, originally developed for use on <em>Zootopia</em>; a full paper with more details was published later. This model remains Hyperion’s fur/hair model today, used on every film from <em>Zootopia</em> to present.</p>
</li>
<li>
<p><strong>Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering. In <em>ACM SIGGRAPH 2015 Course Notes: Physically Based Shading in Theory and Practice</em>, August 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1KJgmVRZqEI7rCdSSeT6_lZJerTTQ0AiH">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2776880.2787670">Official Publisher’s Version</a></li>
<li><a href="https://blog.selfshadow.com/publications/s2015-shading-course">Physically Based Shading SIGGRAPH 2015 Course</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the full Disney BSDF (sometimes referred to in the wider industry as Disney BRDF v2) used in Hyperion, and also describes a novel subsurface scattering technique called normalized diffusion subsurface scattering. The Disney BSDF is the shading model for everything ever rendered using Hyperion, and normalized diffusion was Hyperion’s subsurface model from <em>Big Hero 6</em> up through <em>Moana</em>. For a public open-source implementation of the Disney BSDF, check out <a href="https://github.com/mmp/pbrt-v3">PBRT v3</a>’s implementation. Also, check out <a href="https://renderman.pixar.com">Pixar’s RenderMan</a> for an implementation in a commercial renderer!</p>
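<p>For a concrete sense of what normalized diffusion looks like, here is the core reflectance profile (albedo factor omitted) as a tiny Python function; this is my own sketch for illustration, not Hyperion’s code, and <code>d</code> is the per-channel shaping distance:</p>
<pre><code>import math

def normalized_diffusion_R(r, d):
    # Normalized diffusion reflectance profile, albedo factor omitted.
    # Assumes r greater than 0; the 1/r singularity at the origin is
    # integrable, and the full profile integrates to 1 over the plane,
    # which is what "normalized" refers to.
    return (math.exp(-r / d) + math.exp(-r / (3.0 * d))) / (8.0 * math.pi * d * r)
</code></pre>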
</li>
<li>
<p><strong>Approximate Reflectance Profiles for Efficient Subsurface Scattering</strong></p>
<p><a href="https://www.seanet.com/~myandper/per.htm">Per H Christensen</a> and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Approximate Reflectance Profiles for Efficient Subsurface Scattering. <em>Pixar Technical Memo</em>, #15-04, August 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1kJfJId-I5DjhUnHH-Q6fgsrNc7ZW1MIq">Preprint Version</a></li>
<li><a href="http://graphics.pixar.com/library/ApproxBSSRDF/">Official Pixar Research Version and Project Page</a></li>
<li><a href="https://www.seanet.com/~myandper/abstract/memo1504.htm">Updates and Errata</a></li>
</ul>
<p>Joint project between Pixar and Disney Animation. This paper presents several useful parameterizations for the normalized diffusion subsurface scattering model presented in the previous paper in this list. These parameterizations are used for the normalized diffusion implementation in <a href="https://rmanwiki.pixar.com/display/REN/PxrSurface">Pixar’s RenderMan 21</a> and later.</p>
</li>
<li>
<p><strong>Big Hero 6: Into the Portal</strong></p>
<p><a href="https://www.linkedin.com/in/david-hutchins-21a9507/">David Hutchins</a>, <a href="https://www.linkedin.com/in/olun-riley/">Olun Riley</a>, <a href="https://www.linkedin.com/in/popsopdop/">Jesse Erickson</a>, <a href="http://alexey.stomakhin.com">Alexey Stomakhin</a>, <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, and <a href="https://www.linkedin.com/in/michael-kaschalk-49b7683/">Michael Kaschalk</a>. Big Hero 6: Into the Portal. In <em>ACM SIGGRAPH 2015 Talks</em>, August 2015.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1cCDmWf6pKDaIarDRK0YkARhm5_kQlF4_">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2775280.2792521">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This short paper describes some interesting volume rendering challenges that Hyperion faced during the production of <em>Big Hero 6</em>’s climax sequence, set in a volumetric fractal portal world.</p>
</li>
<li>
<p><strong>Level-of-Detail for Production-Scale Path Tracing</strong></p>
<p><a href="https://www.lgdv.tf.fau.de/person/magdalena-martinek">Magdalena Martinek</a>, <a href="https://www.linkedin.com/in/christian-eisenacher-477ab983/">Christian Eisenacher</a>, and <a href="https://www.lgdv.tf.fau.de/person/marc-stamminger/">Marc Stamminger</a>. Level-of-Detail for Production-Scale Path Tracing. In <em>VMV 2015: Proceedings of the 20th International Symposium on Vision, Modeling, and Visualization</em>, October 2015.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1Z5OFw1liYDwV9-w-SngKnXBEqyFsEHeh/view?usp=sharing">Preprint Version</a></li>
<li><a href="https://doi.org/10.2312/vmv.20151260">Official Publisher’s Version</a></li>
</ul>
<p>Joint project between Disney Animation and the University of Erlangen-Nuremberg. This paper gives an overview of an SVO-based level-of-detail system for use in production path tracing. This system was originally prototyped in an early version of Hyperion and informed the automatic shading level-of-detail system that was used on <em>Big Hero 6</em>; automatic shading level-of-detail has since been removed from Hyperion.</p>
</li>
<li>
<p><strong>A Practical and Controllable Hair and Fur Model for Production Path Tracing</strong></p>
<p><a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://benedikt-bitterli.me">Benedikt Bitterli</a>, <a href="https://www.linkedin.com/in/chuck-tappan-40762450/">Chuck Tappan</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. A Practical and Controllable Hair and Fur Model for Production Path Tracing. <em>Computer Graphics Forum (Proceedings of Eurographics 2016)</em>, 35(2), May 2016.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1cVxBWddi2yClj_A_bca_emRduPJ6GN8Q">Preprint Version</a></li>
<li><a href="https://doi.org/10.1111/cgf.12830">Official Publisher’s Version</a></li>
<li><a href="https://benedikt-bitterli.me/pchfm/">Project Page</a></li>
<li><a href="https://www.pbrt.org/hair.pdf">Implementation Guide by Matt Pharr</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper gives an overview of Hyperion’s fur and hair model, originally developed for use on <em>Zootopia</em>. This fur/hair model is Hyperion’s fur/hair model today, used on every film beginning with <em>Zootopia</em> to present. This paper is now also implemented in the open source <a href="https://github.com/mmp/pbrt-v3/blob/master/src/materials/hair.h">PBRT v3</a> renderer, and also serves as the basis of the hair/fur shader in Chaos Group’s <a href="https://www.chaosgroup.com/blog/v-ray-next-the-science-behind-the-new-hair-shader">V-Ray Next</a> renderer.</p>
</li>
<li>
<p><strong>Subdivision Next-Event Estimation for Path-Traced Subsurface Scattering</strong></p>
<p><a href="https://www.linkedin.com/in/david-koerner-41233611">David Koerner</a>, <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, <a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, and <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>. Subdivision Next-Event Estimation for Path-Traced Subsurface Scattering. In <em>Proceedings of EGSR 2016, Experimental Ideas & Implementations</em>, June 2016.
2016-06-24,</p>
<ul>
<li><a href="https://drive.google.com/open?id=1iMwNqPr-l-_xTViWqxXYIuP8S_he7t8k">Preprint Version</a></li>
<li><a href="https://doi.org/10.2312/sre.20161214">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/SNEE/index.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios, University of Stuttgart, Dartmouth College, and Disney Animation. This paper describes a method for accelerating brute force path traced subsurface scattering; this technique was developed during early experimentation in making path traced subsurface scattering practical for production in Hyperion.</p>
</li>
<li>
<p><strong>Nonlinearly Weighted First-Order Regression for Denoising Monte Carlo Renderings</strong></p>
<p><a href="https://benedikt-bitterli.me">Benedikt Bitterli</a>, <a href="https://research.nvidia.com/person/fabrice-rousselle">Fabrice Rousselle</a>, <a href="http://sglab.kaist.ac.kr/~bcmoon/">Bochang Moon</a>, <a href="http://www.j4lley.com/">José A. Iglesias-Guitian</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="http://www.disneyresearch.com/people/kenny-mitchel/">Kenny Mitchell</a>, <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>, and <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>. Nonlinearly Weighted First-Order Regression for Denoising Monte Carlo Renderings. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2016)</em>, 35(4), July 2016.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1cwtHef8gq5m-oKbc2yKDY3jwnbJB1iLQ">Preprint Version</a></li>
<li><a href="https://doi.org/10.1111/cgf.12954">Official Publisher’s Version</a></li>
<li><a href="https://benedikt-bitterli.me/nfor/">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios, Edinburgh Napier University, Dartmouth College, and Disney Animation. This paper describes a high-quality, stable denoising technique based on a thorough analysis of previous techniques. This technique came out of a larger project to develop a state-of-the-art successor to Hyperion’s first generation denoiser.</p>
</li>
<li>
<p><strong>Practical and Controllable Subsurface Scattering for Production Path Tracing</strong></p>
<p><a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Practical and Controllable Subsurface Scattering for Production Path Tracing. In <em>ACM SIGGRAPH 2016 Talks</em>, July 2016.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1YzdsAbG60dCUkq6xo_HH8nBseILfHfZW">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2897839.2927433">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This short paper describes the novel parameterization and multi-wavelength sampling strategy used to make path traced subsurface scattering practical for production. Both of these are implemented in Hyperion’s path traced subsurface scattering system and have been in use on all shows beginning with <em>Olaf’s Frozen Adventure</em> to present.</p>
</li>
<li>
<p><strong>Efficient Rendering of Heterogeneous Polydisperse Granular Media</strong></p>
<p><a href="https://tom94.net">Thomas Müller</a>, <a href="https://graphics.ethz.ch/~mpapas/">Marios Papas</a>, <a href="https://la.disneyresearch.com/people/markus-gross/">Markus Gross</a>, <a href="https://cs.dartmouth.edu/~wjarosz/">Wojciech Jarosz</a>, and <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>. Efficient Rendering of Heterogeneous Polydisperse Granular Media. <em>ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2016)</em>, 35(6), November 2016.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1qFwr6_JL29uextdtyNurOFQId0CahvVc">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/2980179.2982429">Official Publisher’s Version</a></li>
<li><a href="https://cs.dartmouth.edu/~wjarosz/publications/muller16efficient.html">Project Page</a></li>
</ul>
<p>External project from Disney Research Studios, ETH Zürich, and Dartmouth College, inspired in part by production problems encountered at Disney Animation related to rendering things like sand, snow, etc. This technique uses shell transport functions to accelerate path traced rendering of massive assemblies of grains. <a href="https://tom94.net">Thomas Müller</a> implemented an experimental version of this technique in Hyperion, along with an interesting extension for applying the shell transport theory to volume rendering.</p>
</li>
<li>
<p><strong>Practical Path Guiding for Efficient Light-Transport Simulation</strong></p>
<p><a href="https://tom94.net">Thomas Müller</a>, <a href="https://la.disneyresearch.com/people/markus-gross/">Markus Gross</a>, and <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>. Practical Path Guiding for Efficient Light-Transport Simulation. <em>Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering 2017)</em>, 36(4), July 2017.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1xJeK76y7BjWHMHpIL31o08f9eJzytGNU">Preprint Version</a> (Updated compared to official version)</li>
<li><a href="https://doi.org/10.1111/cgf.13227">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/PathGuide/index.html">Project Page</a></li>
</ul>
<p>External joint project between Disney Research Studios and ETH Zürich, inspired in part by challenges with handling complex light transport efficiently in Hyperion. Won the Best Paper Award at EGSR 2017! This paper describes a robust, unbiased technique for progressively learning complex indirect illumination in a scene during a render and intelligently guiding paths to better sample difficult indirect illumination effects. Implemented in Hyperion, along with a number of interesting improvements documented in a later paper. In use on <em>Frozen 2</em> and future films.</p>
</li>
<li>
<p><strong>Kernel-predicting Convolutional Networks for Denoising Monte Carlo Renderings</strong></p>
<p><a href="http://www.ece.ucsb.edu/~sbako/">Steve Bako</a>, <a href="https://tvogels.nl/">Thijs Vogels</a>, <a href="https://www.inf.ethz.ch/personal/mcbrian/">Brian McWilliams</a>, <a href="http://graphics.pixar.com/people/mmeyer/">Mark Meyer</a>, <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, <a href="https://graphics.pixar.com/library/indexAuthorAlex_Harvill.html">Alex Harvill</a>, <a href="http://www.ece.ucsb.edu/~psen/">Pradeep Sen</a>, <a href="http://graphics.pixar.com/people/derose/">Tony DeRose</a>, and <a href="https://research.nvidia.com/person/fabrice-rousselle">Fabrice Rousselle</a>. Kernel-predicting Convolutional Networks for Denoising Monte Carlo Renderings. <em>ACM Transactions on Graphics (Proceedings of SIGGRAPH 2017)</em>, 36(4), July 2017.</p>
<ul>
<li><a href="https://drive.google.com/open?id=18jrs2MPiZ5UUqNiSlrerzzb5JX1ba_zM">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3072959.3073708">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/KPCN/index.html">Project Page</a></li>
</ul>
<p>External joint project between University of California Santa Barbara, Disney Research Studios, ETH Zürich, and Pixar, carried out as part of the larger effort to develop a successor to Hyperion’s first generation denoiser. This paper describes a supervised learning approach for denoising filter kernels using deep convolutional neural networks. This technique is the basis of the modern Disney-Research-developed second generation deep-learning denoiser in use by the rendering teams at Pixar and ILM, and by the Hyperion team at Disney Animation.</p>
</li>
<li>
<p><strong>Production Volume Rendering</strong></p>
<p><a href="https://www.linkedin.com/in/jfong">Julian Fong</a>, <a href="http://magnuswrenninge.com">Magnus Wrenninge</a>, <a href="https://fpsunflower.github.io/ckulla/">Christopher Kulla</a>, and <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>. Production Volume Rendering. In <em>ACM SIGGRAPH 2017 Courses</em>, July 2017.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1eFr_4IKzt796Ns4Iv3OjR3ni0Y7QigP5/view?usp=drivesdk">Preprint Version</a> (Updated compared to official version)</li>
<li><a href="https://doi.org/10.1145/3084873.3084907">Official Publisher’s Version</a></li>
<li><a href="https://graphics.pixar.com/library/ProductionVolumeRendering/index.html">Production Volume Rendering SIGGRAPH 2017 Course</a></li>
</ul>
<p>Joint publication from Pixar, Sony Pictures Imageworks, and Disney Animation. This course covers volume rendering in modern path tracing renderers, from basic theory all the way to practice. The last chapter details the inner workings of Hyperion’s first and second generation transmittance estimation based volume rendering system, used from <em>Big Hero 6</em> up through <em>Moana</em>.</p>
</li>
<li>
<p><strong>Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</strong></p>
<p><a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>. Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes. <em>ACM Transactions on Graphics (Proceedings of SIGGRAPH 2017)</em>, 36(4), July 2017.</p>
<ul>
<li><a href="https://drive.google.com/file/d/198A1h93ZE7SuKEidx7FCwspJkAWqYG7e/view?usp=drivesdk">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3072959.3073665">Official Publisher’s Version</a></li>
<li><a href="https://www.yiningkarlli.com/projects/specdecomptracking.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper describes two complementary new null-collision tracking techniques: decomposition tracking and spectral tracking. The paper also introduces to computer graphics an extended integral formulation of null-collision algorithms, originally developed in the field of reactor physics. These two techniques are the basis of Hyperion’s modern third generation null-collision tracking based volume rendering system, in use beginning on <em>Olaf’s Frozen Adventure</em> to present.</p>
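<p>As background for the null-collision framework that decomposition and spectral tracking extend, here is a minimal single-wavelength delta-tracking distance sampler. This is my own illustrative Python, not the paper’s estimators; <code>sigma_t_at</code> and <code>sigma_majorant</code> are hypothetical stand-ins for the heterogeneous extinction lookup and its bounding majorant:</p>
<pre><code>import math, random

def delta_tracking_distance(sigma_t_at, sigma_majorant, t_max):
    # March through the medium with majorant-distributed free-flight jumps;
    # each tentative collision is accepted as a real collision with
    # probability sigma_t / sigma_majorant, otherwise it is a null collision.
    # Decomposition tracking and spectral tracking can be thought of as
    # lower-variance and spectral generalizations of this basic loop.
    t = 0.0
    while True:
        t -= math.log(1.0 - random.random()) / sigma_majorant
        if t >= t_max:
            return None  # the ray escapes the medium without a real collision
        if sigma_t_at(t) > random.random() * sigma_majorant:
            return t     # real collision at distance t
        # otherwise: null collision, keep marching
</code></pre>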
</li>
<li>
<p><strong>The Ocean and Water Pipeline of Disney’s Moana</strong></p>
<p><a href="https://www.linkedin.com/in/seanpalmer/">Sean Palmer</a>, <a href="https://www.imdb.com/name/nm3376120/">Jonathan Garcia</a>, <a href="https://www.linkedin.com/in/sara-drakeley-37290/">Sara Drakeley</a>, <a href="https://www.linkedin.com/in/patrick-kelly-1424b86/">Patrick Kelly</a>, and <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>. The Ocean and Water Pipeline of Disney’s Moana. In <em>ACM SIGGRAPH 2017 Talks</em>, July 2017.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1q4dum1dBhKTBK6fDqiIX-Bm9JrppAamm/view?usp=drivesdk">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3084363.3085067">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This short paper describes the water pipeline developed for <em>Moana</em>, including the level set compositing and rendering system that was implemented in Hyperion. This system has since found additional usage on shows since <em>Moana</em>.</p>
</li>
<li>
<p><strong>Recent Advancements in Disney’s Hyperion Renderer</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, <a href="https://www.linkedin.com/in/patrick-kelly-1424b86/">Patrick Kelly</a>, <a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="https://www.linkedin.com/in/daniel-teece-2650358/">Daniel Teece</a>. Recent Advancements in Disney’s Hyperion Renderer. In <em>ACM SIGGRAPH 2017 Course Notes: Path Tracing in Production Part 1</em>, August 2017.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1kFpp_7I8vH8LHsf1Si94pqMkHxwinMSU/view?usp=drivesdk">Preprint Version</a> (Updated compared to official version)</li>
<li><a href="https://doi.org/10.1145/3084873.3084904">Official Publisher’s Version</a></li>
<li><a href="https://jo.dreggn.org/path-tracing-in-production/2017/index.html">Path Tracing in Production SIGGRAPH 2017 Course</a></li>
</ul>
<p>Publication from Disney Animation. This paper describes various advancements in Hyperion since <em>Big Hero 6</em> up through <em>Moana</em>, with a particular focus towards replacing multiple scattering approximations with true, brute-force path-traced solutions for both better artist workflows and improved visual quality.</p>
</li>
<li>
<p><strong>Denoising with Kernel Prediction and Asymmetric Loss Functions</strong></p>
<p><a href="https://tvogels.nl/">Thijs Vogels</a>, <a href="https://research.nvidia.com/person/fabrice-rousselle">Fabrice Rousselle</a>, <a href="https://www.inf.ethz.ch/personal/mcbrian/">Brian McWilliams</a>, <a href="https://la.disneyresearch.com/people/gerhard-rothlin/">Gerhard Rothlin</a>, <a href="https://graphics.pixar.com/library/indexAuthorAlex_Harvill.html">Alex Harvill</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="http://graphics.pixar.com/people/mmeyer/">Mark Meyer</a>, and <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>. Denoising with Kernel Prediction and Asymmetric Loss Functions. <em>ACM Transactions on Graphics (Proceedings of SIGGRAPH 2018)</em>, 37(4), August 2017.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1qAu5DTDfxPPCFyGGzyoG4ggnz7877BEB">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3197517.3201388">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/KPAL/index.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios, Pixar, and Disney Animation. This paper describes a variety of improvements and extensions made to the 2017 <em>Kernel-predicting Convolutional Networks for Denoising Monte Carlo Renderings</em> paper; collectively, these improvements comprise the modern Disney-Research-developed second generation deep-learning denoiser in use in production at Pixar, ILM, and Disney Animation. At Disney Animation, used experimentally on <em>Ralph Breaks the Internet</em> and in full production beginning with <em>Frozen 2</em>.</p>
</li>
<li>
<p><strong>Plausible Iris Caustics and Limbal Arc Rendering</strong></p>
<p><a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a> and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Plausible Iris Caustics and Limbal Arc Rendering. <em>ACM SIGGRAPH 2018 Talks</em>, August 2018.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1Wibzqi9JIb4-DvXUyYKVfrbfrhu1bpQs">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3214745.3214751">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes a technique for rendering realistic, physically based eye caustics using manifold next-event estimation combined with a plausible procedural geometric eye model. This realistic eye model is implemented in Hyperion and used on all projects beginning with <em>Encanto</em>.</p>
</li>
<li>
<p><strong>The Design and Evolution of Disney’s Hyperion Renderer</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://www.linkedin.com/in/hank-driskill-1a7140165/">Hank Driskill</a>, <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, <a href="https://www.linkedin.com/in/patrick-kelly-1424b86/">Patrick Kelly</a>, <a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="https://www.linkedin.com/in/daniel-teece-2650358/">Daniel Teece</a>. The Design and Evolution of Disney’s Hyperion Renderer. <em>ACM Transactions on Graphics</em>, 37(3), August 2018.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1RbRr_rMJ1CIpcGsGWO4iuZKZ76utgMcd">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3182159">Official Publisher’s Version</a></li>
<li><a href="https://www.yiningkarlli.com/projects/hyperiondesign.html">Project Page</a></li>
</ul>
<p>Publication from Disney Animation. This paper is a systems architecture paper for the entirety of Hyperion. The paper describes the history of Disney’s Hyperion Renderer, the internal architecture, various systems such as shading, volumes, many-light sampling, emissive geometry, path simplification, fur rendering, photon-mapped caustics, subsurface scattering, and more. The paper also describes various challenges that had to be overcome for practical production use and artistic controllability. This paper covers everything in Hyperion beginning from <em>Big Hero 6</em> up through <em>Ralph Breaks the Internet</em>.</p>
</li>
<li>
<p><strong>Clouds Data Set</strong></p>
<p><a href="https://www.disneyanimation.com">Walt Disney Animation Studios</a>. Clouds Data Set, August 2018.</p>
<ul>
<li><a href="https://www.disneyanimation.com/resources/clouds/">Official Page</a></li>
<li><a href="https://disney-animation.s3.amazonaws.com/uploads/production/data_set_asset/6/asset/License_Cloud.pdf">License</a></li>
</ul>
<p>Publicly released data set for rendering research, by Disney Animation. This data set was produced by our production artists as part of the development process for Hyperion’s modern third generation null-collision tracking based volume rendering system.</p>
</li>
<li>
<p><strong><em>Moana</em> Island Scene Data Set</strong></p>
<p><a href="https://www.disneyanimation.com">Walt Disney Animation Studios</a>. <em>Moana</em> Island Scene Data Set, August 2018.</p>
<ul>
<li><a href="https://www.disneyanimation.com/resources/moana-island-scene/">Official Page</a></li>
<li><a href="https://disney-animation.s3.amazonaws.com/uploads/production/data_set_asset/4/asset/License_Moana.pdf">License</a></li>
</ul>
<p>Publicly released data set for rendering research, by Disney Animation.
This data set is an actual production scene from <em>Moana</em>, originally rendered using Hyperion and ported to PBRT v3 for the public release. This data set gives a sense of the typical scene complexity and rendering challenges that Hyperion handles every day in production.</p>
</li>
<li>
<p><strong>Denoising Deep Monte Carlo Renderings</strong></p>
<p><a href="https://rgl.epfl.ch/people/dvicini">Delio Vicini</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, <a href="https://research.nvidia.com/person/fabrice-rousselle">Fabrice Rousselle</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Denoising Deep Monte Carlo Renderings. <em>Computer Graphics Forum</em>, 38(1), February 2019.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1n904HlzXQx_ahiRruyCh9KTQjCLZ9lDM/view?usp=sharing">Preprint Version</a></li>
<li><a href="https://doi.org/10.1111/cgf.13533">Official Publisher’s Version</a></li>
<li><a href="http://drz.disneyresearch.com/~jnovak/publications/DeepZDenoising/index.html">Project Page</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper presents a technique for denoising deep (meaning images with multiple depth bins per pixel) renders, for use with deep-compositing workflows. This technique was developed as part of general denoising research from Disney Research Studios and the Hyperion team.</p>
</li>
<li>
<p><strong>The Challenges of Releasing the <em>Moana</em> Island Scene</strong></p>
<p><a href="https://www.linkedin.com/in/rasmus-tamstorf-22835a1/">Rasmus Tamstorf</a> and <a href="https://www.linkedin.com/in/heather-pritchett-8067592/">Heather Pritchett</a>. The Challenges of Releasing the <em>Moana</em> Island Scene. In <em>Proceedings of EGSR 2019, Industry Track</em>, July 2019.</p>
<ul>
<li><a href="https://drive.google.com/open?id=18jLb3XNqXCvi2R7Yyb2E2aCdJ26zBBF7">Preprint Version</a></li>
<li><a href="https://doi.org/10.2312/sr.20191223">Official Publisher’s Version</a></li>
</ul>
<p>Short paper from Disney Animation’s research department, discussing some of the challenges involved in preparing a production Hyperion scene for public release. The Hyperion team provided various support and advice to the larger studio effort to release the <em>Moana</em> Island Scene.</p>
</li>
<li>
<p><strong>Practical Path Guiding in Production</strong></p>
<p><a href="https://tom94.net">Thomas Müller</a>. Practical Path Guiding in Production. In <em>ACM SIGGRAPH 2019 Course Notes: Path Guiding in Production</em>, July 2019.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1Dxa2Wm4j2Hv40KIUK3K_yg_v-acOU9rt">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3305366.3328091">Official Publisher’s Version</a></li>
<li><a href="https://jo.dreggn.org/path-tracing-in-production/2019/index.html">Path Guiding in Production SIGGRAPH 2019 Course</a></li>
</ul>
<p>Joint project between Disney Research Studios and Disney Animation. This paper presents a number of improvements and extensions made to <em>Practical Path Guiding</em>, developed in Hyperion by <a href="https://tom94.net">Thomas Müller</a> and the Hyperion team. In use in production on <em>Frozen 2</em>.</p>
</li>
<li>
<p><strong>Machine-Learning Denoising in Feature Film Production</strong></p>
<p><a href="https://henrikdahlberg.github.io">Henrik Dahlberg</a>, <a href="https://www.linkedin.com/in/david-adler-5ab7b21/">David Adler</a>, and <a href="https://www.linkedin.com/in/jeremy-newlin-07a87946/">Jeremy Newlin</a>. Machine-Learning Denoising in Feature Film Production. In <em>ACM SIGGRAPH 2019 Talks</em>, July 2019.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1CdUC9caWNSShHNvIj4kge7BWQczXWr79">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3306307.3328150">Official Publisher’s Version</a></li>
</ul>
<p>Joint publication from Pixar, Industrial Light & Magic, and Disney Animation. Describes how the modern Disney-Research-developed second generation deep-learning denoiser was deployed into production at Pixar, ILM, and Disney Animation.</p>
</li>
<li>
<p><strong>Taming the Shadow Terminator</strong></p>
<p><a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Taming the Shadow Terminator. In <em>ACM SIGGRAPH 2019 Talks</em>, August 2019.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1Yb6GUP3pIuNiH9Xgq2P0L99V3JAQ7emM">Preprint Version</a> (Updated compared to official version)</li>
<li><a href="https://doi.org/10.1145/3306307.3328172">Official Publisher’s Version</a></li>
<li><a href="https://www.yiningkarlli.com/projects/shadowterminator.html">Project Page</a></li>
</ul>
<p>Internal project from Disney Animation. This short paper describes a solution to the long-standing “shadow terminator” problem associated with using shading normals. The technique in this paper is implemented in Hyperion and has been in use in production starting on <em>Frozen 2</em> through present.</p>
</li>
<li>
<p><strong>On Histogram-Preserving Blending for Randomized Texture Tiling</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. On Histogram-Preserving Blending for Randomized Texture Tiling. <em>Journal of Computer Graphics Techniques</em>, 8(4), November 2019.</p>
<ul>
<li><a href="https://drive.google.com/open?id=1kiMQUCcX_tEyQXWtsAPWZLVZTQt6OL_i">Preprint Version</a></li>
<li><a href="http://www.jcgt.org/published/0008/04/02/">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes some modifications to the histogram-preserving hex-tiling algorithm of Heitz and Neyret; these modifications make implementing the Heitz and Neyret technique easier and more efficient. The paper also documents Hyperion’s implementation of the technique, in use in production starting on <em>Frozen 2</em> through present.</p>
</li>
<li>
<p><strong>The Look and Lighting of “Show Yourself” in “Frozen 2”</strong></p>
<p><a href="https://dl.acm.org/author/Sathe,%20Amol">Amol Sathe</a>, <a href="https://dl.acm.org/author/Summers,%20Lance">Lance Summers</a>, <a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>, and <a href="https://dl.acm.org/author/Newland,%20James">James Newland</a>. The Look and Lighting of “Show Yourself” in “Frozen 2”. In <em>ACM SIGGRAPH 2020 Talks</em>, August 2020.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1XVyhzCP_RDusyrfrsKlR8hIuq0fs_WJF">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3388767.3407388">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the process that went into achieving the final look and lighting of the “Show Yourself” sequence in <em>Frozen 2</em>, including a new tabulation-based approach implemented in Hyperion to maintain energy conservation in rough dielectric reflection and transmission.</p>
</li>
<li>
<p><strong>Practical Hash-based Owen Scrambling</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>. Practical Hash-based Owen Scrambling. <em>Journal of Computer Graphics Techniques</em>, 9(4), December 2020.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1-avUab_y8UZaM9UlbX95OcXZyMysKFKH">Preprint Version</a></li>
<li><a href="http://www.jcgt.org/published/0009/04/01/">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes a new version of Owen scrambling for the Sobol sequence that is simple to implement, efficient to evaluate, and broadly applicable to various problems.</p>
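<p>To give a sense of what Owen scrambling does, here is a deliberately slow, generic bitwise version in Python; this is my own illustration and not the hash-based construction from the paper, whose contribution is collapsing the per-bit loop below into a single fast, carefully designed hash:</p>
<pre><code>def owen_scramble_bits(x, seed, num_bits=32):
    # Generic (slow) Owen scrambling of a num_bits-wide integer sample:
    # flip each bit based on a pseudorandom function of all of its
    # more-significant bits, so points that share a dyadic prefix are
    # scrambled consistently.
    result = 0
    for i in range(num_bits - 1, -1, -1):    # from MSB down to LSB
        prefix = x >> (i + 1)                # the more-significant bits
        flip = hash((prefix, i, seed)) & 1   # pseudorandom flip decision
        bit = (x >> i) & 1
        result = result * 2 + (bit ^ flip)
    return result
</code></pre>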
</li>
<li>
<p><strong>Unbiased Emission and Scattering Importance Sampling For Heterogeneous Volumes</strong></p>
<p><a href="http://rendering-memo.blogspot.com/">Wei-Feng Wayne Huang</a>, <a href="https://www.linkedin.com/in/peterkutz/">Peter Kutz</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>. Unbiased Emission and Scattering Importance Sampling For Heterogeneous Volumes. In <em>ACM SIGGRAPH 2021 Talks</em>, August 2021.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1YTBp11HBC-TbrRCu_Aoq42eiFVdxFaYy">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3450623.3464644">Official Publisher’s Version</a></li>
<li><a href="https://www.yiningkarlli.com/projects/emissionscattervolumes.html">Project Page</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes a pair of new unbiased distance-sampling methods for production volume path tracing, with a specific focus on sampling emission and scattering. First used on <em>Raya and the Last Dragon</em>.</p>
</li>
<li>
<p><strong>The Atmosphere of Raya and the Last Dragon</strong></p>
<p><a href="https://dl.acm.org/author/Bryant,%20Marc">Marc Bryant</a>, <a href="https://dl.acm.org/author/DeYoung,%20Ryan">Ryan DeYoung</a>, <a href="http://rendering-memo.blogspot.com/">Wei-Feng Wayne Huang</a>, <a href="https://dl.acm.org/author/Longson,%20Joe">Joe Longson</a>, and <a href="https://dl.acm.org/author/Villegas,%20Noel">Noel Villegas</a>. The Atmosphere of Raya and the Last Dragon. In <em>ACM SIGGRAPH 2021 Talks</em>, August 2021.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1ucK1j2mgJpoFf3hvt3-QyAiZyppkqf6p">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3450623.3464676">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the various rendering and workflow improvements that went into rendering atmospheric volumes to produce the highly atmospheric lighting in <em>Raya and the Last Dragon</em>.</p>
</li>
<li>
<p><strong>Practical Multiple-Scattering Sheen Using Linearly Transformed Cosines</strong></p>
<p><a href="https://tizianzeltner.com">Tizian Zeltner</a>, <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, and <a href="http://dl.acm.org/author_page.cfm?id=99658729701&coll=DL&dl=ACM&trk=0">Matt Jen-Yuan Chiang</a>. Practical Multiple-Scattering Sheen Using Linearly Transformed Cosines. In <em>ACM SIGGRAPH 2022 Talks</em>, August 2022.</p>
<ul>
<li><a href="https://drive.google.com/file/d/13LDVa5pYckJMRnHE9ZxIbdSfRriHlPW9/view?usp=sharing">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3532836.3536240">Official Publisher’s Version</a></li>
<li><a href="https://tizianzeltner.com/projects/Zeltner2022Practical/">Project Page</a></li>
</ul>
<p>Joint project between École Polytechnique Fédérale de Lausanne (EPFL) and Disney Animation. This paper describes the new multiple-scattering sheen model used in the Disney Principled BSDF starting with the production of <em>Strange World</em>.</p>
</li>
<li>
<p><strong>“Encanto” - Let’s Talk About Bruno’s Visions</strong></p>
<p><a href="https://www.linkedin.com/in/corey-butler-96aa492/">Corey Butler</a>, <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, <a href="http://rendering-memo.blogspot.com/">Wei-Feng Wayne Huang</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, and <a href="https://www.linkedin.com/in/benjamin-min-huang-94b3011/">Benjamin Huang</a>. “Encanto” - Let’s Talk About Bruno’s Visions. In <em>ACM SIGGRAPH 2022 Talks</em>, August 2022.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1IZOeJrZYciqWaIfQLJr7WOt9AAxH6jzi/view?usp=sharing">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3532836.3536269">Official Publisher’s Version</a></li>
<li><a href="https://www.yiningkarlli.com/projects/teleportshader.html">Project Page</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the process of creating the holographic prophecy shards from <em>Encanto</em>, including a new teleportation shader in Hyperion that was developed specifically to support this effect.</p>
</li>
<li>
<p><strong>Fracture-Aware Tessellation of Subdivision Surfaces</strong></p>
<p><a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a> and <a href="https://www.linkedin.com/in/fjrodriguez/">Francisco Rodriguez</a>. Fracture-Aware Tessellation of Subdivision Surfaces. In <em>ACM SIGGRAPH 2022 Talks</em>, August 2022.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1MS8XehTmdHNHPwHm19t776QjB7owoO12/view?usp=sharing">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3532836.3536262">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes a new tessellation algorithm for fractured subdivision surfaces, used as part of Disney Animation’s destruction FX pipeline and implemented in Hyperion. First used in production on <em>Encanto</em>.</p>
</li>
<li>
<p><strong>Progressive Null-Tracking for Volumetric Rendering</strong></p>
<p><a href="https://www.linkedin.com/in/zackary-misso/">Zackary Misso</a>, <a href="https://www.yiningkarlli.com">Yining Karl Li</a>, <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, <a href="https://www.linkedin.com/in/daniel-teece-2650358/">Daniel Teece</a>, and <a href="https://cs.dartmouth.edu/~wjarosz/index.html">Wojciech Jarosz</a>. Progressive Null Tracking for Volumetric Rendering. <em>SIGGRAPH ‘23: ACM SIGGRAPH 2023 Conference Proceedings</em>. Article 31, August 2023.</p>
<ul>
<li><a href="https://drive.google.com/file/d/11YsHMnJvUhINBpTabGFi48-j69A47Iw_/view?usp=sharing">Preprint Version</a></li>
<li><a href="http://doi.org/10.1145/3588432.3591557">Official Publisher’s Version</a></li>
<li><a href="https://cs.dartmouth.edu/~wjarosz/publications/misso23progressive.html">Project Page</a></li>
</ul>
<p>Joint project between Dartmouth College and Disney Animation. This paper describes a new method to progressively learn bounding majorants when using null-tracking techniques to perform unbiased rendering of general heterogeneous volumes with unknown bounding majorants.</p>
</li>
<li>
<p><strong>Splat: Developing a ‘Strange’ Shader</strong></p>
<p><a href="https://www.linkedin.com/in/klitaker/">Kendall Litaker</a>, <a href="https://www.linkedin.com/in/brent-burley-56972557/">Brent Burley</a>, <a href="https://www.linkedin.com/in/dan-lipson-2ab84916b/">Dan Lipson</a>, and <a href="https://www.linkedin.com/in/mason-khoo-3b490562/">Mason Khoo</a>. Splat: Developing a ‘Strange’ Shader. In <em>ACM SIGGRAPH 2023 Talks</em>, August 2023.</p>
<ul>
<li><a href="https://drive.google.com/file/d/1FY7H-7JmBVL5ZsINGXMP0ourN-fHLBTT/view?usp=share_link">Preprint Version</a></li>
<li><a href="https://doi.org/10.1145/3587421.3595424">Official Publisher’s Version</a></li>
</ul>
<p>Internal project from Disney Animation. This paper describes the unusual challenges encountered when developing the unique shading and look for the Splat character from <em>Strange World</em>.</p>
</li>
</ol>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/Jul/hyperion_logo.png"><img src="https://blog.yiningkarlli.com/content/images/2019/Jul/hyperion_logo.png" alt="Figure 2: Hyperion logo, modeled by Disney Animation artist Chuck Tappan and rendered in Disney's Hyperion Renderer." /></a></p>
<p>Again, this post is meant to be a living document; any new publications with involvement from the Hyperion team will be added here.
Of course, the Hyperion team is not the only team at Disney Animation that regularly publishes; for a full list of publications from Disney Animation, see the <a href="https://www.disneyanimation.com/technology/publications">official Disney Animation publications page</a>.
The <a href="https://www.technology.disneyanimation.com">Disney Animation Technology website</a> is also worth keeping an eye on if you want to keep up on what our engineers and TDs are working on!</p>
<p>If you’re just getting started and want to learn more about rendering in general, the must-read text that every rendering engineer has on their desk or bookshelf is <a href="http://www.pbr-book.org">Physically Based Rendering 3rd Edition</a> by Matt Pharr, Wenzel Jakob, and Greg Humphreys (now available online completely for free!).
Also, the de-facto standard beginner’s text today is the <a href="https://www.amazon.com/gp/product/B01B5AODD8">Ray Tracing in One Weekend</a> series by Peter Shirley, which provides a great, gentle, practical introduction to ray tracing, and is also available completely for free.
Also take a look at <a href="http://www.realtimerendering.com/book.html">Real-Time Rendering 4th Edition</a>, <a href="http://www.realtimerendering.com/raytracinggems/">Ray Tracing Gems</a> (also available online for free), <a href="http://graphicscodex.com">The Graphics Codex</a> by Morgan McGuire, and Eric Haines’s <a href="http://www.realtimerendering.com/raytracing.html">Ray Tracing Resources page</a>.</p>
<p>Many other amazing rendering teams at both large studios and commercial vendors also publish regularly, and I highly recommend keeping up with all of their work too!
For a good starting point into exploring the wider world of production rendering, check out the <a href="https://dl.acm.org/citation.cfm?id=3243123">ACM Transactions on Graphics Special Issue on Production Rendering</a>, which is edited by Matt Pharr and contains extensive, detailed systems papers on <a href="https://dl.acm.org/citation.cfm?id=3182162">Pixar’s RenderMan</a>, <a href="https://dl.acm.org/citation.cfm?id=3182161">Weta Digital’s Manuka</a>, <a href="https://dl.acm.org/citation.cfm?id=3182160">Solid Angle (Autodesk)’s Arnold</a>, <a href="https://dl.acm.org/citation.cfm?id=3180495">Sony Picture Imageworks’ Arnold</a>, and of course <a href="https://dl.acm.org/citation.cfm?id=3182159">Disney Animation’s Hyperion</a>.
A sixth paper that I would group with the five above is the High Performance Graphics 2017 paper detailing the architecture of <a href="http://doi.org/10.1145/3105762.3105768">DreamWorks Animation’s MoonRay</a>.</p>
<p>For even further exploration, extensive course notes are available from SIGGRAPH courses every year. Particularly good recurring courses to look at from past years are the Path Tracing in Production course (<a href="https://jo.dreggn.org/path-tracing-in-production/2017/index.html">2017</a>, <a href="https://jo.dreggn.org/path-tracing-in-production/2018/index.html">2018</a>, <a href="https://jo.dreggn.org/path-tracing-in-production/2019/index.html">2019</a>), the absolutely legendary Physically Based Shading course (<a href="http://renderwonk.com/publications/s2010-shading-course/">2010</a>, <a href="https://blog.selfshadow.com/publications/s2012-shading-course">2012</a>, <a href="https://blog.selfshadow.com/publications/s2013-shading-course">2013</a>, <a href="https://blog.selfshadow.com/publications/s2014-shading-course">2014</a>, <a href="https://blog.selfshadow.com/publications/s2015-shading-course">2015</a>, <a href="https://blog.selfshadow.com/publications/s2016-shading-course">2016</a>, <a href="https://blog.selfshadow.com/publications/s2017-shading-course/">2017</a>), the various incarnations of a volume rendering course (<a href="https://magnuswrenninge.com/productionvolumerendering">2011</a>, <a href="https://graphics.pixar.com/library/ProductionVolumeRendering/">2017</a>, <a href="https://cs.dartmouth.edu/~wjarosz/publications/novak18monte-sig.html">2018</a>), and now due to the dawn of ray tracing in games, <a href="http://advances.realtimerendering.com">Advances in Real-Time Rendering</a> and <a href="https://openproblems.realtimerendering.com">Open Problems in Real-Time Rendering</a>.
Also, Stephen Hill typically collects links to all publicly available course notes, slides, source code, and more for SIGGRAPH each year after the conference on <a href="https://blog.selfshadow.com">his blog</a>; both his blog and the blogs listed on the sidebar of his website are essentially mandatory reading in the rendering world.
Also, interesting rendering papers are always being published in journals and at conferences.
The major journals to check are <a href="https://tog.acm.org">ACM Transactions on Graphics (TOG)</a>, <a href="https://www.eg.org/wp/eurographics-publications/cgf/">Computer Graphics Forum (CGF)</a>, and the <a href="http://www.jcgt.org">Journal of Computer Graphics Techniques (JCGT)</a>; the major academic conferences where rendering stuff appears are SIGGRAPH, SIGGRAPH Asia, EGSR (Eurographics Symposium on Rendering), HPG (High Performance Graphics), MAM (Workshop on Material Appearance Modeling), EUROGRAPHICS, and i3D (ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games); three other industry conferences where interesting work often appears are DigiPro, GDC (Game Developers Conference), and GTC (GPU Technology Conference).
A complete listing of the contents for all of these conferences every year, along with links to preprints, is <a href="http://kesen.realtimerendering.com">compiled by Ke-Sen Huang</a>.</p>
<p>A large number of people have contributed directly to Hyperion’s development since the beginning of the project, in a variety of capacities ranging from core developers to TDs and support staff and all the way to notable interns. In no particular order, including both present and past: Daniel Teece, Brent Burley, David Adler, Yining Karl Li, Mark Lee, Charlotte Zhu, Brian Green, Andrew Bauer, Lea Reichardt, Mackenzie Thompson, Wei-Feng Wayne Huang, Matt Jen-Yuan Chen, Joe Schutte, Andrew Gartner, Jennifer Yu, Peter Kutz, Ralf Habel, Patrick Kelly, Gregory Nichols, Andrew Selle, Christian Eisenacher, Jan Novák, Ben Spencer, Doug Lesan, Lisa Young, Tami Valdez, Andrew Fisher, Noah Kagan, Benedikt Bitterli, Thomas Müller, Tizian Zeltner, Zackary Misso, Magdalena Martinek, Mathijs Molenaar, Laura Lediav, Guillaume Loubet, David Koerner, Simon Kallweit, Gabor Liktor, Ulrich Muller, Norman Moses Joseph, Stella Cheng, Marc Cooper, Tal Lancaster, and Serge Sretschinsky.
Our closest research partners at Disney Research Studios, Pixar Animation Studios, Industrial Light & Magic, and elsewhere include (in no particular order): Marios Papas, Marco Manzi, Tiziano Portenier, Rasmus Tamstorf, Gerhard Roethlin, Per Christensen, Julian Fong, Mark Meyer, André Mazzone, Wojciech Jarosz, Fabrice Rousselle, Christophe Hery, Ryusuke Villemin, and Magnus Wrenninge.
Invaluable support from studio leadership over the years has been provided by (again, in no particular order): Nick Cannon, Munira Tayabji, Bettina Martin, Laura Franek, Collin Larkins, Golriz Fanai, Rajesh Sharma, Chuck Tappan, Sean Jenkins, Darren Robinson, Alex Nijmeh, Hank Driskill, Kyle Odermatt, Adolph Lusinsky, Ernie Petti, Kelsey Hurley, Tad Miller, Mark Hammel, Mohit Kallianpur, Brian Leach, Josh Staub, Steve Goldberg, Scott Kersavage, Andy Hendrickson, Dan Candela, Ed Catmull, and many others.
Of course, beyond this enormous list, there is an even more enormous list of countless artists, technical directors, production supervisors, and other technology development teams at Disney Animation who motivated Hyperion, participated in its development, and contributed to its success.
If anything in this post has caught your interest, keep an eye out for open position listings on <a href="https://www.disneyanimation.com/careers">DisneyAnimation.com</a>; maybe these lists can one day include you!</p>
<p>Finally, here is a list of all publicly released and announced projects to date made using Disney’s Hyperion Renderer:</p>
<p><strong>Feature Films</strong>: <a href="https://www.disneyplus.com/movies/big-hero-6/4AozFbXy3sPw">Big Hero 6</a> (2014), <a href="https://www.disneyplus.com/movies/zootopia/1QOxldhm1sKg">Zootopia</a> (2016), <a href="https://www.disneyplus.com/movies/moana/70GoJHflgHH9">Moana</a> (2016), <a href="https://www.disneyplus.com/movies/ralph-breaks-the-internet/33T1xWWWLhFR">Ralph Breaks the Internet</a> (2018), <a href="https://www.disneyplus.com/movies/frozen-2/28vdy71kJrjb">Frozen 2</a> (2019), <a href="https://www.disneyplus.com/movies/raya-and-the-last-dragon/6dyengbx3iYK">Raya and the Last Dragon</a> (2021), <a href="https://www.disneyplus.com/movies/encanto/33q7DY1rtHQH">Encanto</a> (2021), <a href="https://www.disneyplus.com/movies/strange-world/1OVzv6hnhOFm">Strange World</a> (2022), <a href="https://movies.disney.com/wish">Wish</a> (2023)</p>
<p><strong>Shorts and Featurettes</strong>: <a href="https://www.disneyplus.com/movies/feast/3LXsUWltFatX">Feast</a> (2014), <a href="https://www.disneyplus.com/movies/frozen-fever/5xsCGQz3eJRq">Frozen Fever</a> (2015), <a href="https://www.disneyplus.com/movies/inner-workings/2am4tRzFOOXl">Inner Workings</a> (2016), <a href="https://www.imdb.com/title/tt6467284/">Gone Fishing</a> (2017), <a href="https://www.disneyplus.com/movies/olafs-frozen-adventure/5zrFDkAANpLi">Olaf’s Frozen Adventure</a> (2017), <a href="https://www.disneyplus.com/movies/myth-a-frozen-tale/1N00Fn9eajzi">Myth: A Frozen Tale</a><sup>1</sup> (2019), <a href="https://www.disneyplus.com/movies/once-upon-a-snowman/2tBSdZv6bB4L">Once Upon a Snowman</a> (2020), <a href="https://www.disneyplus.com/movies/us-again/3KPeVueXrxck">Us Again</a> (2021), <a href="https://www.disneyplus.com/movies/far-from-the-tree/4LKsV18kWS9G">Far From the Tree</a> (2021), <a href="https://www.disneyplus.com/movies/once-upon-a-studio/2lskBMjkAn3w">Once Upon A Studio</a> (2023)</p>
<p><strong>Animated Series</strong>: <a href="https://www.youtube.com/playlist?list=PLxnVeUnlga-Eg3hSTyV2GXjiJYdjQl2nt">At Home With Olaf</a> (2020), <a href="https://www.disneyplus.com/series/olaf-presents/6nKDva3ZVCvC">Olaf Presents</a> (2021), <a href="https://www.disneyplus.com/series/baymax/1D141qnxDHLI">Baymax!</a> (2022), <a href="https://www.disneyplus.com/series/zootopia/2CB7CKG729Ou">Zootopia+</a> (2022)</p>
<p><strong>Short Circuit Shorts</strong>: <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Exchange Student</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Just a Thought</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Jing Hua</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Elephant in the Room</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Puddles</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Lightning in a Bottle</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Zenith</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Drop</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Fetch</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Downtown</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Hair-Jitsu</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">The Race</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Lucky Toupée</a> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Cycles</a><sup>2</sup> (2020), <a href="https://twitter.com/disneyanimation/status/1149743115130920960?lang=en">A Kite’s Tale</a><sup>2</sup> (2020), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Going Home</a> (2021), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Crosswalk</a> (2021), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">Songs to Sing in the Dark</a> (2021), <a href="https://www.disneyplus.com/series/walt-disney-animation-studios-short-circuit-experimental-films/3S2DLVtMPA7V">No. 2 to Kettering</a> (2021)</p>
<p><strong>Intern Shorts</strong>: <a href="https://ohmy.disney.com/insider/2017/10/19/you-must-watch-this-beautiful-short-created-by-walt-disney-animation-interns/">Ventana</a> (2017), <a href="https://ohmy.disney.com/news/2018/12/05/voila-walt-disney-animation-studios-interns/">Voilà</a> (2018), <a href="https://ohmy.disney.com/movies/2019/09/19/watch-maestro-a-beautiful-short-from-this-years-walt-disney-animation-studios-interns/">Maestro</a> (2019), <a href="https://twitter.com/DisneyAnimJobs/status/1448007879257067520">June Bug</a> (2021)</p>
<p><strong>Filmmaker Co-op Shorts</strong>: <a href="https://www.imdb.com/title/tt7592274/">Weeds</a> (2017)</p>
<p><sup>1</sup> VR project running on Unreal Engine, with shading and textures baked out of Disney’s Hyperion Renderer.</p>
<p><sup>2</sup> VR project running on Unity, with shading and textures baked out of Disney’s Hyperion Renderer.</p>
https://blog.yiningkarlli.com/2019/05/nested-dielectrics.html
Nested Dielectrics
2019-05-21T00:00:00+00:00
2019-05-21T00:00:00+00:00
Yining Karl Li
<p>A few years ago, I wrote <a href="https://blog.yiningkarlli.com/2015/06/attenuated-transmission.html">a post about attenuated transmission</a> and what I called “deep attenuation” at the time: refraction and transmission through multiple mediums embedded inside of each other, a.k.a. what is usually called nested dielectrics.
What I called “deep attenuation” in that post is, in its essence, just pure interface tracking using a stack.
This post is meant as a revisit and update of that post; I’ll talk about the problems with the ad-hoc pure interface tracking technique I came up with in that previous post and discuss the proper priority-based nested dielectric technique <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2002.10487555">[Schmidt and Budge 2002]</a> that Takua uses today.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_ice.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/nested_ice.0.jpg" alt="Figure 1: Ice cubes floating in tea inside of a glass teacup, rendered in Takua Renderer using priority-based nested dielectrics." /></a></p>
<p>In my 2015 post, I included a diagram showing the overlapping boundaries required to model ice cubes in a drink in a glass, but I didn’t actually include a render of that scenario!
In retrospect, the problems with the 2015 post would have become obvious to me more quickly if I had actually done a render like that diagram.
Figure 1 shows an actual “ice cubes in a drink in a glass” scene, rendered correctly using Takua Renderer’s implementation of priority-based nested dielectrics.
For comparison, Figure 2 shows what Takua produces using the approach in the 2015 post; there are a number of obvious bizarre problems!
In Figure 2, the ice cubes don’t properly refract the tea behind and underneath them, and the ice cubes under the liquid surface aren’t visible at all.
Also, where the surface of the tea interfaces with the glass teacup, there is an odd bright ring.
Conversely, Figure 1 shows a correct liquid-glass interface without a bright ring, shows proper refraction through the ice cubes, and correctly shows the ice cubes under the liquid surface.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_ice_old.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/nested_ice_old.0.jpg" alt="Figure 2: The same scene as in Figure 1, rendered using Takua's old interface tracking system. A number of bizarre physically inaccurate problems are present." /></a></p>
<p><strong>Problems with only Interface Tracking</strong></p>
<p>So what exactly is wrong with using only interface tracking without priorities?
First, let’s quickly summarize how my old interface tracking implementation worked.
Note that here we refer to the side of a surface a ray is currently on as the <em>incident</em> side, and the other side of the surface as the <em>transmit</em> side.
For each path, keep a stack of which Bsdfs the path has encountered; a short code sketch of this bookkeeping follows the list below:</p>
<ul>
<li>When a ray enters a surface, push the encountered surface onto the stack.</li>
<li>When a ray exits a surface, scan the stack from the top down and pop the first instance of a surface in the stack matching the encountered surface.</li>
<li>When hitting the front side of a surface, the incident properties come from the top of the stack (or from the empty default if the stack is empty), and the transmit properties come from the surface being intersected.</li>
<li>When hitting the back side of a surface, the incident properties come from the surface being intersected, and the transmit properties come from the top of the stack (or from the empty default if the stack is empty).</li>
<li>Only push/pop onto the stack when a refraction/transmission event occurs.</li>
</ul>
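<p>To make the bookkeeping above concrete, here’s a minimal sketch of what that stack might look like in C++. The Bsdf type and the nullptr-as-air convention here are just hypothetical placeholders for illustration, not Takua’s actual interfaces:</p>
<pre><code>#include <vector>

struct Bsdf; // opaque stand-in for whatever data describes a surface

struct InterfaceStack {
    std::vector<const Bsdf*> stack;

    // On a refraction/transmission event that enters a surface.
    void push(const Bsdf* surface) { stack.push_back(surface); }

    // On a refraction/transmission event that exits a surface: scan from the
    // top down and pop the first entry matching the encountered surface.
    void pop(const Bsdf* surface) {
        for (int i = int(stack.size()) - 1; i >= 0; --i) {
            if (stack[i] == surface) {
                stack.erase(stack.begin() + i);
                return;
            }
        }
    }

    // Top of the stack, or nullptr for the empty default (e.g. air).
    const Bsdf* top() const { return stack.empty() ? nullptr : stack.back(); }
};
</code></pre>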
<p>Next, as an example, imagine a case where it is ambiguous which surface a ray is currently inside of.
A common example of this case is when two surfaces are modeled as being slightly overlapping, as is often done when modeling liquid inside of a glass since modeling perfectly coincident surfaces in CG is either extremely difficult or impossible due to floating point precision problems.
Even if we could model perfectly coincident surfaces, rendering perfectly coincident surfaces without artifacts is similarly extremely difficult or impossible, also due to floating point precision problems.
Figure 3 shows a diagram of how a glass containing water and ice cubes is commonly modeled; in Figure 3, the ambiguous regions are where the water surface is inside of the glass and inside of the ice cube.
When a ray enters this overlapping region, it is not clear whether we should treat the ray as being inside the water or inside of the glass (or ice)!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_diagram_old.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/nested_diagram_old.png" alt="Figure 3: A diagram of a path through a glass containing water and ice cubes, using only interface tracking without priorities." /></a></p>
<p>Using the pure interface tracking algorithm from my old blog post, below is what happens at each path vertex along the path illustrated in Figure 3.
In this example, we define the empty default to be air.</p>
<ol>
<li>Enter Glass.
<ul>
<li>Incident/transmit IOR: Air/Glass.</li>
<li>Push Glass onto stack. Stack after event: (Glass).</li>
</ul>
</li>
<li>Enter Water.
<ul>
<li>Incident/transmit IOR: Glass/Water.</li>
<li>Push Water onto stack. Stack after event: (Water, Glass).</li>
</ul>
</li>
<li>Exit Glass.
<ul>
<li>Incident/transmit IOR: Glass/Water.</li>
<li>Remove Glass from stack. Stack: (Water).</li>
</ul>
</li>
<li>Enter Ice.
<ul>
<li>Incident/transmit IOR: Water/Ice.</li>
<li>Push Ice onto stack. Stack: (Ice, Water).</li>
</ul>
</li>
<li>Exit Water.
<ul>
<li>Incident/transmit IOR: Water/Ice.</li>
<li>Remove Water from stack. Stack: (Ice).</li>
</ul>
</li>
<li>Exit Ice.
<ul>
<li>Incident/transmit IOR: Ice/Air.</li>
<li>Remove Ice from stack. Stack: empty.</li>
</ul>
</li>
<li>Enter Water.
<ul>
<li>Incident/transmit IOR: Air/Water.</li>
<li>Push Water onto stack. Stack after event: (Water).</li>
</ul>
</li>
<li>Enter Glass.
<ul>
<li>Incident/transmit IOR: Water/Glass.</li>
<li>Push Glass onto stack. Stack after event: (Glass, Water).</li>
</ul>
</li>
<li>Reflect off Water.
<ul>
<li>Incident/transmit IOR: Water/Glass.</li>
<li>No change to stack. Stack after event: (Glass, Water).</li>
</ul>
</li>
<li>Reflect off Glass.
<ul>
<li>Incident/transmit IOR: Glass/Glass.</li>
<li>No change to stack. Stack after event: (Glass, Water).</li>
</ul>
</li>
<li>Exit Water.
<ul>
<li>Incident/transmit IOR: Water/Glass.</li>
<li>Remove Water from stack. Stack after event: (Glass).</li>
</ul>
</li>
<li>Exit Glass.
<ul>
<li>Incident/transmit IOR: Glass/Air.</li>
<li>Remove Glass from stack. Stack after event: empty.</li>
</ul>
</li>
</ol>
<p>Observe events 3 and 5, where the same index of refraction boundary is encountered as in the previous event.
These double events are where some of the weirdness in Figure 2 comes from; specifically the bright ring at the liquid-glass surface interface and the incorrect refraction through the ice cube.
These double events are not actually physically meaningful; in reality, a ray could never be both inside of a glass surface and inside of a water surface simultaneously.
Figure 4 shows a simplified version of the tea cup example above, without ice cubes; even then, the double event still causes a bright ring at the liquid-glass surface interface.
Also note how when following the rules from my old blog post, event 10 becomes a nonsense event where the incident and transmit IOR are the same.
The fix for this case is to modify the rules so that when a ray exits a surface, the transmit properties come from the first surface on the stack that isn’t the same as the incident surface, but even with this fix, the reflection at event 10 is still physically impossible.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_old.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/nested_old.0.jpg" alt="Figure 4: Tea inside of a glass cup, rendered using Takua Renderer's old interface tracking system. Note the bright ring at the liquid-glass surface interface, produced by a physically incorrect double-refraction event." /></a></p>
<p>Really what we want is to model overlapping surfaces, but then in overlapping areas, be able to specify which surface a ray should think it is actually inside of.
Essentially, this functionality would make overlapping surfaces behave like boolean operators; we would be able to specify that the ice cubes in Figure 3 “cut out” a space from the water they overlap with, and the glass cut out a space from the water as well.
This way, the double events never occur since rays wouldn’t see the second event in each pair of double events.
One solution that immediately comes to mind is to simply consider whatever surface is at the top of the interface tracking stack as being the surface we are currently inside, but this causes an even worse problem: the order of surfaces that a ray thinks it is in becomes dependent on what surfaces a ray encounters first, which depends on the direction and location of each ray!
This produces an inconsistent view of the world across different rays.
Instead, a better solution is provided by priority-based nested dielectrics <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2002.10487555">[Schmidt and Budge 2002]</a>.</p>
<p><strong>Priority-Based Nested Dielectrics</strong></p>
<p>Priority-based nested dielectrics work by assigning priority values to geometry, with the priority values determining which piece of geometry “wins” when a ray is in a region of space where multiple pieces of geometry overlap.
A priority value is just a single number assigned as an attribute to a piece of geometry or to a shader; the convention established by the paper is that lower numbers indicate higher priority.
The basic algorithm in <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2002.10487555">[Schmidt and Budge 2002]</a> works using an <em>interior list</em>, which is conceptually similar to an interface tracking stack.
The interior list is exactly what it sounds like: a list of all of the surfaces that a path has entered but not exited yet.
Unlike the interface tracking stack though, the interior list doesn’t necessarily have to be a stack or have any particular ordering, although implementing it as a list always sorted by priority may provide some minor practical advantages.
When a ray hits a surface during traversal, the following rules apply:</p>
<ul>
<li>If the surface has a higher or equal priority (so lower or equal priority number) than anything else on the interior list, the result is a <em>true hit</em> and an intersection has occurred. Proceed with regular shading and Bsdf evaluation.</li>
<li>If the surface has a lower priority (so higher priority number) than the highest-priority value on the interior list, the result is a <em>false hit</em> and no intersection has occurred. Ignore the intersection and continue with ray traversal.</li>
<li>If the hit is a false hit OR if the hit is a true hit that results in a refraction/transmission event:
<ul>
<li>Add the surface to the interior list if the ray is entering the surface.</li>
<li>Remove the surface from the interior list if the ray is exiting the surface.</li>
</ul>
</li>
<li>For a true hit that produces a reflection event, don’t add the surface to the interior list.</li>
</ul>
<p>Note that this approach only works with surfaces that are enclosed manifolds; that is, every surface defines a finite volume.
When a ray exits a surface, the surface it is exiting must already be in the interior list; if not, then the interior list can become corrupted and the renderer may start thinking that paths are in surfaces that they are not actually in (or vice versa).
Also note that a ray can only ever enter into a higher-priority surface through finding a true hit, and can only enter into a lower-priority surface by exiting a higher-priority surface and removing the higher-priority surface from the interior list.
At each true hit, we can figure out the properties of the incident and transmit sides by examining the interior list.
If hitting the front side of a surface, before we update the interior list, the surface we just hit provides the transmit properties and the highest-priority surface on the interior list provides the incident properties.
If hitting the back side of a surface, before we update the interior list, the surface we just hit provides the incident properties and the second-highest-priority surface on the interior list provides the transmit properties.
Alternatively, if the interior list only contains one surface, then the transmit properties come from the empty default.
Importantly, if a ray hits a surface with no priority value set, that surface should always count as a true hit.
This way, we can embed non-transmissive objects inside of transmissive objects and have everything work automatically.</p>
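<p>Here’s a rough sketch of how those rules might translate into code. Everything below is a hypothetical stand-in rather than any renderer’s real implementation: I’m assuming a Surface type that carries an optional priority number and an IOR, and I’m treating the empty default (air) as the absence of any entry in the interior list:</p>
<pre><code>#include <algorithm>
#include <climits>
#include <vector>

struct Surface {
    bool hasPriority = false;
    int priority = INT_MAX; // lower number = higher priority
    float ior = 1.0f;
};

struct InteriorList {
    std::vector<const Surface*> entries; // surfaces entered but not yet exited

    // Lowest priority number present; an empty list behaves like the
    // infinite-priority-number empty default (air).
    int highestPriority() const {
        int best = INT_MAX;
        for (const Surface* s : entries) best = std::min(best, s->priority);
        return best;
    }
    // Highest-priority surface, or nullptr for the empty default.
    const Surface* highestPrioritySurface() const {
        const Surface* best = nullptr;
        for (const Surface* s : entries) {
            if (best == nullptr || s->priority < best->priority) best = s;
        }
        return best;
    }
    void add(const Surface* s) { entries.push_back(s); }
    void remove(const Surface* s) {
        auto it = std::find(entries.begin(), entries.end(), s);
        if (it != entries.end()) entries.erase(it);
    }
};

// First two rules: classify the hit. A surface with no priority value set
// always counts as a true hit, which is what lets opaque objects be embedded
// inside transmissive ones.
bool isTrueHit(const InteriorList& interior, const Surface* s) {
    if (!s->hasPriority) return true;
    return s->priority <= interior.highestPriority();
}

// Remaining rules: false hits always update the interior list; true hits
// only update it when the Bsdf produces a refraction/transmission event.
void updateInteriorList(InteriorList& interior, const Surface* s, bool entering) {
    if (entering) interior.add(s);
    else interior.remove(s);
}

// Incident/transmit IORs at a true hit, evaluated before the interior list
// is updated for this hit.
void incidentTransmitIor(const InteriorList& interior, const Surface* s,
                         bool frontSide, float& incidentIor, float& transmitIor) {
    const float airIor = 1.0f; // the empty default
    if (frontSide) {
        const Surface* current = interior.highestPrioritySurface();
        incidentIor = current ? current->ior : airIor;
        transmitIor = s->ior;
    } else {
        incidentIor = s->ior;
        // The second-highest-priority surface provides the transmit side; if
        // the hit surface is the only entry, fall back to the empty default.
        InteriorList remaining = interior;
        remaining.remove(s);
        const Surface* next = remaining.highestPrioritySurface();
        transmitIor = next ? next->ior : airIor;
    }
}
</code></pre>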
<p>Figure 5 shows the same scenario as in Figure 3, but now with priority values assigned to each piece of geometry.
The path depicted in Figure 5 uses the priority-based interior list; dotted lines indicate parts of a surface that produce false hits due to being embedded within a higher-priority surface:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_diagram_new.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/nested_diagram_new.png" alt="Figure 5: The same setup as in Figure 3, but now using priority values. The path is calculated using a priority-based interior list." /></a></p>
<p>The empty default air surrounding everything is defined as having an infinitely high priority value, which means a lower priority than any surface in the scene.
Using the priority-based interior list, here are the events that occur at each intersection along the path in Figure 5:</p>
<ol>
<li>Enter Glass.
<ul>
<li>Glass priority (1) is higher than ambient air (infinite), so TRUE hit.</li>
<li>Incident/transmit IOR: Air/Glass.</li>
<li>True hit, so evaluate Bsdf and produce refraction event.</li>
<li>Interior list after event: (Glass:1). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Enter Water.
<ul>
<li>Water priority (2) is lower than highest priority in interior list (1), so FALSE hit.</li>
<li>Incident/transmit IOR: N/A.</li>
<li>False hit, so do not evaluate Bsdf and just continue straight.</li>
<li>Interior list after event: (Glass:1, Water:2). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Exit Glass.
<ul>
<li>Glass priority (1) is equal to the highest priority in interior list (1), so TRUE hit.</li>
<li>Incident/transmit IOR: Glass/Water.</li>
<li>True hit, so evaluate Bsdf and produce refraction event. Remove Glass from interior list.</li>
<li>Interior list after event: (Water:2). Inside surface after event: Water.</li>
</ul>
</li>
<li>Enter Ice.
<ul>
<li>Ice priority (0) is higher than the highest priority in interior list (2), so TRUE hit.</li>
<li>Incident/transmit IOR: Water/Ice.</li>
<li>True hit, so evaluate Bsdf and produce refraction event.</li>
<li>Interior list after event: (Water:2, Ice:0). Inside surface after event: Ice.</li>
</ul>
</li>
<li>Exit Water.
<ul>
<li>Water priority (2) is lower than highest priority in interior list (0), so FALSE hit.</li>
<li>Incident/transmit IOR: N/A.</li>
<li>False hit, so do not evaluate Bsdf and just continue straight. Remove Water from interior list.</li>
<li>Interior list after event: (Ice:0). Inside surface after event: Ice.</li>
</ul>
</li>
<li>Exit Ice.
<ul>
<li>Ice is the only surface left in the interior list, so TRUE hit.</li>
<li>Incident/transmit IOR: Ice/Air.</li>
<li>True hit, so evaluate Bsdf and produce refraction event. Remove Ice from interior list.</li>
<li>Interior list after event: empty. Inside surface after event: air.</li>
</ul>
</li>
<li>Enter Water.
<ul>
<li>Water priority (2) is higher than ambient air (infinite), so TRUE hit.</li>
<li>Incident/transmit IOR: Air/Water.</li>
<li>True hit, so evaluate Bsdf and produce refraction event.</li>
<li>Interior list after event: (Water:2). Inside surface after event: Water.</li>
</ul>
</li>
<li>Enter Glass.
<ul>
<li>Glass priority (1) is higher than the highest priority in interior list (2), so TRUE hit.</li>
<li>Incident/transmit IOR: Water/Glass.</li>
<li>True hit, so evaluate Bsdf and produce refraction event.</li>
<li>Interior list after event: (Water:2, Glass:1). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Exit Water.
<ul>
<li>Water priority (2) is lower than highest priority in interior list (1), so FALSE hit.</li>
<li>Incident/transmit IOR: N/A.</li>
<li>False hit, so do not evaluate Bsdf and just continue straight.</li>
<li>Interior list after event: (Glass:1). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Reflect off Glass.
<ul>
<li>Glass priority (1) is equal to the highest priority in interior list (1), so TRUE hit.</li>
<li>Incident/transmit IOR: Glass/Air.</li>
<li>True hit, so evaluate Bsdf and produce reflection event.</li>
<li>Interior list after event: (Glass:1). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Enter Water.
<ul>
<li>Water priority (2) is lower than highest priority in interior list (1), so FALSE hit.</li>
<li>Incident/transmit IOR: N/A.</li>
<li>False hit, so do not evaluate Bsdf and just continue straight.</li>
<li>Interior list after event: (Glass:1, Water:2). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Reflect off Glass.
<ul>
<li>Glass priority (1) is equal to the highest priority in interior list (1), so TRUE hit.</li>
<li>Incident/transmit IOR: Glass/Water.</li>
<li>True hit, so evaluate Bsdf and produce reflection event.</li>
<li>Interior list after event: (Glass:1, Water:2). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Exit Water.
<ul>
<li>Water priority (2) is lower than highest priority in interior list (1), so FALSE hit.</li>
<li>Incident/transmit IOR: N/A.</li>
<li>False hit, so do not evaluate Bsdf and just continue straight.</li>
<li>Interior list after event: (Glass:1). Inside surface after event: Glass.</li>
</ul>
</li>
<li>Exit Glass.
<ul>
<li>Glass priority (1) is equal to the highest priority in interior list (1), so TRUE hit.</li>
<li>Incident/transmit IOR: Glass/Air.</li>
<li>True hit, so evaluate Bsdf and produce refraction event. Remove Glass from interior list.</li>
<li>Interior list after event: empty. Inside surface after event: air.</li>
</ul>
</li>
</ol>
<p>The entire above sequence of events is physically plausible, and produces no weird double-events!
Using priority-based nested dielectrics, Takua generates the correct images in Figure 1 and Figure 6.
Note how in Figure 6 below, the liquid appears to come right up against the glass, without any bright boundary artifacts or anything else.</p>
<p>For actually implementing priority-based nested dielectrics in a ray tracing renderer, I think there are two equally plausible places in the renderer where the implementation can take place.
The first and most obvious location is as part of the standard light transport integration or shading system.
The integrator would be in charge of checking for false hits and tracing continuation rays through false hit geometry.
A second, slightly less obvious location is actually as part of ray traversal through the scene itself.
Including handling of false hits in the traversal system can be more efficient than handling it in the integrator since the false hit checks could be done in the middle of a single BVH tree traversal, whereas handling false hits by firing continuation rays requires a new BVH tree traversal for each false hit encountered.
Also, handling false hits in the traversal system removes some complexity from the integrator.
However, the downside to handling false hits in the traversal system is that it requires plumbing all of the interior list data and logic into the traversal system, which sets up something of a weird backwards dependency between the traversal and shading/integration systems.
I wound up choosing to implement priority-based nested dielectrics in the integration system in Takua, simply to avoid having to do complex, weird plumbing back into the traversal system.
Takua uses priority-based nested dielectrics in all integrators, including unidirectional path tracing, BDPT, PPM, and VCM, and also uses the nested dielectrics system to handle transmittance along bidirectional connections through attenuating mediums.</p>
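<p>As a sketch of what the integrator-side approach can look like, false hit handling essentially becomes a small loop around the intersection call. The Ray, Hit, and Scene types below are hypothetical placeholders, and isTrueHit()/updateInteriorList() stand in for whatever interior list logic the renderer uses:</p>
<pre><code>struct Surface;
struct InteriorList;
bool isTrueHit(const InteriorList& interior, const Surface* s);
void updateInteriorList(InteriorList& interior, const Surface* s, bool entering);

struct Ray {
    // Origin and direction omitted; tMin/tMax bound the parametric range
    // that traversal is allowed to consider.
    float tMin = 0.0f;
    float tMax = 1e30f;
};

struct Hit {
    bool valid = false;
    bool frontSide = false;  // did the ray hit the front (entering) side?
    float t = 0.0f;          // parametric distance along the ray
    const Surface* surface = nullptr;
};

struct Scene {
    Hit intersect(const Ray& ray) const; // one full BVH traversal per call
};

// Keep re-intersecting until a true hit (or a miss) is found. Every false
// hit costs an additional BVH traversal, which is the efficiency argument
// for instead filtering out false hits inside the traversal system itself.
Hit traceNextTrueHit(const Scene& scene, Ray ray, InteriorList& interior) {
    const float kRayEpsilon = 1e-4f; // hypothetical offset to step past a hit
    while (true) {
        Hit hit = scene.intersect(ray);
        if (!hit.valid) return hit;
        if (isTrueHit(interior, hit.surface)) return hit;
        // False hit: record that the ray passed into or out of this surface,
        // then continue from just beyond the intersection point.
        updateInteriorList(interior, hit.surface, hit.frontSide);
        ray.tMin = hit.t + kRayEpsilon;
    }
}
</code></pre>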
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/nested_new.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/nested_new.0.jpg" alt="Figure 6: The same tea in a glass cup scene as in Figure 4, rendered correctly using Takua's priority-based nested dielectrics implementation." /></a></p>
<p>Even though the technique has “nested <em>dielectrics</em>” in the title, this technique is not in principle limited to only dielectrics.
In Takua, I now use this technique to handle all transmissive cases, including for both dielectric surfaces and for surfaces with diffuse transmission.
Also, in addition to just determining the incident and transmit IORs, Takua uses this system to also determine things like what kind of participating medium a ray is currently inside of in order to calculate attenuation.
This technique appears to be more or less the industry standard today; implementations are available for at least <a href="https://rmanwiki.pixar.com/display/REN/Nested+Dielectrics">Renderman</a>, <a href="https://github.com/Psyop/jf-nested-dielectric">Arnold</a>, <a href="https://www.sidefx.com/docs/houdini/render/nested.html">Mantra</a>, and <a href="https://support.nextlimit.com/display/mxdocsv3/Nested+dielectrics">Maxwell Render</a>.</p>
<p>As a side note, during the course of this work, I also upgraded Takua’s attenuation system to use ratio tracking <a href="https://dl.acm.org/citation.cfm?id=2661292">[Novák et al. 2014]</a> instead of ray marching when doing volumetric lookups.
This change results in an important improvement to the attenuation system: ratio tracking provides an <em>unbiased</em> estimate of transmittance, whereas ray marching is inherently biased due to being a quadrature-based technique.</p>
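<p>For reference, here’s a tiny sketch of a basic ratio tracking transmittance estimator in the style of [Novák et al. 2014]; the extinction function and the majorant are hypothetical inputs, and this leaves out the residual part of residual ratio tracking entirely:</p>
<pre><code>#include <cmath>
#include <random>

// Unbiased estimate of transmittance from t0 to t1 along a ray through a
// heterogeneous medium, given sigmaT(t) (extinction at distance t along the
// ray) and a majorant sigmaMaj that bounds sigmaT from above everywhere.
template <typename ExtinctionFn>
float ratioTrackingTransmittance(ExtinctionFn sigmaT, float t0, float t1,
                                 float sigmaMaj, std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    float transmittance = 1.0f;
    float t = t0;
    while (true) {
        // Free-flight sample a tentative collision against the majorant.
        t -= std::log(1.0f - uniform(rng)) / sigmaMaj;
        if (t >= t1) break;
        // Weight by the probability that the tentative collision was a null
        // collision; the running product is the transmittance estimate.
        transmittance *= 1.0f - sigmaT(t) / sigmaMaj;
    }
    return transmittance;
}
</code></pre>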
<p>Figures 7 and 8 show a fancier scene of liquid pouring into a glass with some ice cubes and such.
This scene is the Glass of Water scene from <a href="https://benedikt-bitterli.me">Benedikt Bitterli</a>’s rendering resources page <a href="https://benedikt-bitterli.me/resources/">[Bitterli 2016]</a>, modified with brighter lighting on a white backdrop and with red liquid.
I also had to modify the scene so that the liquid overlaps the glass slightly; providing a clearer read for the liquid-glass interface is why I made the liquid red.
One of the neat features of this scene is the cracks modeled <em>inside</em> of the ice cubes; the cracks are non-manifold geometry.
To render them correctly, I applied a shader with glossy refraction to the crack geometry but did not set a priority value for them; this works correctly because the cracks, being non-manifold, don’t have a concept of inside or outside anyway, so they should not participate in any interior list considerations.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/waterpour.cam0.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/waterpour.cam0.0.jpg" alt="Figure 7: Cranberry juice pouring into a glass with ice cubes, rendered using Takua's priority-based nested dielectrics. The scene is from Benedikt Bitterli's rendering resources page." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2019/May/waterpour.cam1.0.png"><img src="https://blog.yiningkarlli.com/content/images/2019/May/preview/waterpour.cam1.0.jpg" alt="Figure 8: A different camera angle of the scene from Figure 7. The scene is from Benedikt Bitterli's rendering resources page." /></a></p>
<p><strong>References</strong></p>
<p>Benedikt Bitterli. 2016. <a href="https://benedikt-bitterli.me/resources/">Rendering Resources</a>. Retrieved from <a href="https://benedikt-bitterli.me/resources/">https://benedikt-bitterli.me/resources/</a>.</p>
<p>Jan Novák, Andrew Selle and Wojciech Jarosz. 2014. <a href="https://dl.acm.org/citation.cfm?id=2661292">Residual Ratio Tracking for Estimating Attenuation in Participating Media</a>. <em>ACM Transactions on Graphics</em>. 33, 6 (2014), 179:1-179:11.</p>
<p>Charles M. Schmidt and Brian Budge. 2002. <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2002.10487555">Simple Nested Dielectrics in Ray Traced Images</a>. <em>Journal of Graphics Tools</em>. 7, 2 (2002), 1–8.</p>
<p><strong>Some Blog Update Notes</strong></p>
<p>For the past few years, my blog posts covering personal work have trended towards ginormous epic articles tackling huge subjects published only once or twice a year, such as with the <a href="https://blog.yiningkarlli.com/2018/10/bidirectional-mipmap.html">bidirectional mipmapping post</a> and its promised but still unfinished part 2.
Unfortunately, I’m not the fastest writer when working on huge posts, since writing those posts often involves significant learning and multiple iterations of implementation and testing on my part.
Over the next few months, I’m aiming to write more posts similar to this one, covering some relatively smaller topics, so that I can get posts coming out a bit more frequently while I continue to work on several upcoming, ginormous posts on long-promised topics.
Or at least, that’s the plan… we’ll see!</p>
https://blog.yiningkarlli.com/2018/11/wir2.html
Ralph Breaks the Internet
2018-11-15T00:00:00+00:00
2018-11-15T00:00:00+00:00
Yining Karl Li
<p>The <a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a> film for 2018 is <a href="https://disneyanimation.com/projects/ralphbreakstheinternet2">Ralph Breaks the Internet</a>, which is the sequel to 2012’s <a href="https://disneyanimation.com/projects/wreckitralph">Wreck-It Ralph</a>.
Over the past two years, I’ve been fortunate enough to work on a number of improvements to Disney’s <a href="http://www.disneyanimation.com/technology/innovations/hyperion">Hyperion Renderer</a> for Ralph Breaks the Internet; collectively, these improvements make up perhaps the biggest jump in rendering capabilities that Hyperion has seen since the original deployment of Hyperion on <a href="https://disneyanimation.com/projects/bighero6">Big Hero 6</a>.
I got my third Disney Animation credit on Ralph Breaks the Internet!</p>
<p>Over the past two years, the Hyperion team has publicly presented a number of major development efforts and research advancements.
Many of these advancements were put into experimental use on <a href="https://blog.yiningkarlli.com/2017/11/olafs-frozen-adventure.html">Olaf’s Frozen Adventure</a> last year, but Ralph Breaks the Internet is the first time we’ve put all of these new capabilities and features into full-scale production together.
I was fortunate enough to be fairly deeply involved in several of these efforts (specifically, traversal improvements and volume rendering).
One of my favorite things about working at Disney Animation is how production and technology partner together to make our films; we truly would not have been able to pull off any of Hyperion’s new advancements without production’s constant support and willingness to try new things in the name of advancing the artistry of our films.</p>
<p>Ralph Breaks the Internet is our first feature film to use Hyperion’s new spectral and decomposition tracking <a href="https://doi.org/10.1145/3072959.3073665">[Kutz et al. 2017]</a> based null-collision volume rendering system exclusively.
Originally we had planned to use the new volume rendering system side-by-side with Hyperion’s previous residual ratio tracking <a href="https://doi.org/10.1145/2661229.2661292">[Novák 2014]</a> based volume rendering system <a href="https://doi.org/10.1145/3084873.3084907">[Fong 2017]</a>, but the results from the new system were so compelling that the show decided to switch over to the new volume rendering exclusively, which in turn allowed us to deprecate and remove the old volume rendering system ahead of schedule.
This new volume rendering system is the culmination of two years of work from Ralf Habel, Peter Kutz, Patrick Kelly, and myself.
We had the enormous privilege of working with a large number of FX and lighting artists to develop, test, and refine this new system; specifically, I want to call out Jesse Erickson, Henrik Falt, and Alex Nijmeh for really championing the new volume rendering system and encouraging and supporting its development.
We also owe an enormous amount to the rest of the Hyperion development team, which gave us the time and resources to spend two years building a new volume rendering system essentially from scratch.
Finally, I want to underscore that the research that underpins our new volume rendering system was conducted jointly between us and Disney Research Zürich, and that this could not have happened without our colleagues at Disney Research Zürich (specifically, Jan Novák and Marios Papas); I think this entire project has been a huge shining example of the value and importance of having a dedicated blue-sky research division.
Every explosion and cloud and dust plume and every bit of fog and atmospherics you see in Ralph Breaks the Internet was rendered using the new volume rendering system!
Interestingly, we actually found that while the new volume rendering system is much faster and much more efficient at rendering dense volumes (and especially volumes with lots of high-order scattering) compared to the old system, the new system actually has some difficulty rendering thin volumes such as mist and atmospheric fog.
This isn’t surprising, since thin volumes benefit more from good transmittance sampling than from good distance sampling, and null-collision volume rendering is really optimized for distance sampling.
We were able to work with production to come up with workarounds for this problem on Ralph Breaks the Internet, but this area is definitely a good topic for future research.</p>
<p>Ralph Breaks the Internet is also our first feature film to move to exclusively using brute force path-traced subsurface scattering <a href="https://doi.org/10.1145/2897839.2927433">[Chiang 2016]</a> for all characters, as a replacement for Hyperion’s previous normalized diffusion based subsurface scattering <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015]</a>.
This feature was tested on Olaf’s Frozen Adventure in a limited capacity, but Ralph Breaks the Internet is the first time we’ve switched path-traced subsurface to being the default subsurface mode in the renderer.
Matt Chiang, Peter Kutz, and Brent Burley put a lot of effort into developing new sampling techniques to reduce color noise in subsurface scattering, and also into developing a new parameterization that closely matched Hyperion’s normalized diffusion parameterization, which allowed artists to basically just flip a switch between normalized diffusion and path-traced subsurface and get a predictable, similar result.
Many more details on Hyperion’s path-traced subsurface implementation are in our recent system architecture paper <a href="https://dl.acm.org/citation.cfm?id=3182159">[Burley 2018]</a>.
In addition to making characters we already know, such as Ralph and Vanellope, look better and more detailed, path-traced subsurface scattering also proved critical to hitting the required looks for new characters, such as the slug-like Double Dan character.</p>
<p>When Ralph and Vanellope first enter the world of the internet, there are several establishing shots showing vast vistas of the enormous infinite metropolis that the film depicts the internet as.
Early in production, some render tests of the internet metropolis proved to be extremely challenging due to the sheer amount of geometry in the scene.
Although instancing was used extensively, the way the scenes had to be built in our production pipeline meant that Hyperion wasn’t able to leverage the instancing in the scene as efficiently as we would have liked.
Additionally, the way the instance groups were structured made traversal in Hyperion less ideal than it could have been.
After encountering smaller-scale versions of the same problems on Moana, Peter Kutz and I had arrived at an idea that we called “multiple entry points”, which basically lets Hyperion blur the lines between top and bottom level BVHs in a two-level BVH structure.
By inserting mid-level nodes from bottom level BVHs into the top-level BVH, Hyperion can produce a much more efficient top-level BVH, dramatically accelerating rendering of large instance groups and other difficult-to-split pieces of large geometry, such as groundplanes.
This idea is very similar to BVH rebraiding <a href="https://doi.org/10.1145/3105762.3105776">[Benthin et al. 2017]</a>, but we arrived at our approach independently before the publication of BVH rebraiding.
After initial testing on Olaf’s Frozen Adventure proved promising, we enabled multiple entry points by default for the entirety of Ralph Breaks the Internet.
Additionally, Dan Teece developed a powerful automatic geometry de-duplication system, which allows Hyperion to reclaim large amounts of memory in cases where multiple instance groups are authored with separate copies of the same master geometry.
Greg Nichols and I also developed a new multithreading strategy for handling Hyperion’s ultra-wide batched ray traversal, which significantly improved Hyperion’s multithreaded scalability during traversal to near-linear scaling with number of cores.
All of these geometry and traversal improvements collectively meant that by the main production push for the show, render times for the large internet vista shots had dropped from being by far the highest in the show to being indistinguishable from any other normal shot.
These improvements also proved to be timely, since the internet set was just the beginning of massive-scale geometry and instancing on Ralph Breaks the Internet; solving the render efficiency problems for the internet set also made other large-scale instancing sequences, such as the Ralphzilla battle <a href="https://doi.org/10.1145/3306307.3328179">[Byun et al. 2019]</a> at the end of the film and the massive crowds <a href="https://doi.org/10.1145/3306307.3328185">[Richards et al. 2019]</a> in the internet, easier to render.</p>
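<p>As a purely illustrative aside (this is not Hyperion’s implementation, just a toy sketch of the general idea under assumed node types), the heart of a multiple-entry-points style approach is opening each instance’s bottom level BVH by a few levels and letting the top-level build treat the resulting subtrees as separate primitives:</p>
<pre><code>#include <vector>

struct BVHNode {
    // Bounds and leaf payload omitted for brevity in this toy sketch.
    const BVHNode* left = nullptr;
    const BVHNode* right = nullptr;
    bool isLeaf() const { return left == nullptr && right == nullptr; }
};

// A top-level primitive that points at a node *inside* a bottom-level BVH,
// rather than only ever at its root.
struct EntryPoint {
    const BVHNode* node;
    int instanceId; // which instance transform this subtree belongs to
};

// Open the bottom-level BVH up to maxDepth levels and emit one entry point
// per resulting subtree; the top-level build then gets tighter bounds to
// work with for large or awkwardly shaped instances and groundplanes.
void collectEntryPoints(const BVHNode* node, int instanceId, int maxDepth,
                        std::vector<EntryPoint>& out) {
    if (node == nullptr) return;
    if (maxDepth == 0 || node->isLeaf()) {
        out.push_back({node, instanceId});
        return;
    }
    collectEntryPoints(node->left, instanceId, maxDepth - 1, out);
    collectEntryPoints(node->right, instanceId, maxDepth - 1, out);
}
</code></pre>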
<p>Another major advancement we made on Ralph Breaks the Internet, in collaboration with Disney Research Zürich and our sister studio Pixar Animation Studios, is a new machine-learning based denoiser.
To the best of my knowledge, Disney Animation was one of the first studios with a successful widescale deployment of a production denoiser on Big Hero 6.
The Hyperion denoiser used from Big Hero 6 through Olaf’s Frozen Adventure is a hand-tuned denoiser based on and influenced by <a href="https://doi.org/10.1145/2366145.2366213">[Li et al. 2012]</a> and <a href="https://doi.org/10.1111/cgf.12219">[Rousselle et al. 2013]</a>, and has since been adopted by the Renderman team as the production denoiser that ships with Renderman today.
Midway through production on Ralph Breaks the Internet, David Adler from the Hyperion team, in collaboration with Fabrice Rousselle, Jan Novák, Gerhard Röthlin, and others from Disney Research Zürich, was able to deploy a new, next-generation machine-learning based denoiser <a href="https://doi.org/10.1145/3197517.3201388">[Vogels et al. 2018]</a>.
Developed primarily by Disney Research Zürich, the new machine-learning denoiser allowed us to cut render times by up to 75% in some cases.
This example is yet another case of basic scientific research at Disney Research leading to unexpected but enormous benefits to production in all of the wider Walt Disney Company’s various animation studios!</p>
<p>In addition to everything above, many more smaller improvements were made in all areas of Hyperion for Ralph Breaks the Internet. Dan Teece developed a really cool “edge” shader module, which was used to create all of the silhouette edge glows in the internet world, and Dan also worked closely with FX artists to develop render-side support for various fracture and destruction workflows <a href="https://doi.org/10.1145/3214745.3214814">[Harrower et al. 2018]</a>. Brent Burley developed several improvements to Hyperion’s depth of field support, including a realistic cat’s eye bokeh effect.
Finally, as always, the production of Ralph Breaks the Internet has inspired many more future improvements to Hyperion that I can’t write about yet, since they haven’t been published yet.</p>
<p>The original Wreck-It Ralph is one of my favorite modern Disney movies, and I think Ralph Breaks the Internet more than lives up to the original.
The film is smart and hilarious while maintaining the depth that made the first Wreck-It Ralph so good.
Ralph and Vanellope are just as lovable as before and grow further as characters, and all of the new characters are really awesome (Shank and Yesss and the film’s take on the Disney princesses are particular favorites of mine).
More importantly for a rendering blog though, the film is also just gorgeous to look at.
With every film, the whole studio takes pride in pushing the envelope even further in terms of artistry, craftsmanship, and sheer visual beauty.
The number of environments and settings in Ralph Breaks the Internet is enormous and highly varied; the internet is depicted as a massive city that pushed the limits on how much visual complexity we can render (and from our previous three feature films, we can already render an unbelievable amount!), old locations from the first Wreck-It Ralph are revisited with exponentially more visual detail and richness than before, and there’s even a full on musical number with theatrical lighting somewhere in there!</p>
<p>Below are some stills from the movie, in no particular order, 100% rendered using Hyperion.
If you want to see more, or if you just want to see a really great movie, go see Ralph Breaks the Internet on the biggest screen you can find!
There are a TON of easter eggs in the film to look out for, and I highly recommend sticking around after the credits for this one.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_00.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_00.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_01.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_01.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_02.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_03.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_03.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_04.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_05.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_05.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_06.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_06.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_07.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_07.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_08.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_08.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_09.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_09.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_37.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_37.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_10.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_10.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_11.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_11.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_12.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_12.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_46.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_46.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_13.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_13.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_14.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_14.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_15.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_15.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_41.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_41.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_16.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_16.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_17.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_17.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_28.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_28.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_29.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_29.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_32.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_32.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_31.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_31.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_18.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_18.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_19.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_19.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_20.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_20.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_22.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_22.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_30.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_30.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_23.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_23.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_24.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_24.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_38.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_38.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_25.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_25.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_26.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_26.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_21.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_21.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_33.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_33.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_48.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_48.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_45.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_45.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_34.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_34.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_35.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_35.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_36.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_36.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_39.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_39.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_40.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_40.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_42.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_42.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_43.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_43.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_44.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_44.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_47.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_47.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_27.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_27.jpg" alt="" /></a></p>
<p>Here is the part of the credits with Disney Animation’s rendering team!
Also, Ralph Breaks the Internet was my wife Harmony Li’s first credit at Disney Animation (she previously was at Pixar)!
This frame is kindly provided by Disney.
Every person you see in the credits worked really hard to make Ralph Breaks the Internet an amazing film!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_credits.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Nov/WIR2_credits.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p><strong>References</strong></p>
<p>Carsten Benthin, Sven Woop, Ingo Wald, and Attila T. Áfra. 2017. <a href="https://doi.org/10.1145/3105762.3105776">Improved Two-Level BVHs using Partial Re-Braiding</a>. In <em>HPG ‘17 (Proceedings of High Performance Graphics)</em>. 7:1-7:8.</p>
<p>Brent Burley. <a href="https://doi.org/10.1145/2776880.2787670">Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering</a>. 2015. In <em>ACM SIGGRAPH 2015 Course Notes: <a href="https://blog.selfshadow.com/publications/s2015-shading-course">Physically Based Shading in Theory and Practice</a></em>.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182159">The Design and Evolution of Disney’s Hyperion Renderer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 33:1-33:22.</p>
<p>Dong Joo Byun, Alberto Luceño Ros, Alexander Moaveni, Marc Bryant, Joyce Le Tong, and Moe El-Ali. 2019. <a href="https://doi.org/10.1145/3306307.3328179">Creating Ralphzilla: Moshpit, Skeleton Library and Automation Framework</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 66:1-66:2.</p>
<p>Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. <a href="https://doi.org/10.1145/2897839.2927433">Practical and Controllable Subsurface Scattering for Production Path Tracing</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 49:1-49:2.</p>
<p>Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. <a href="https://doi.org/10.1145/3084873.3084907">Production Volume Rendering</a>. In <em>ACM SIGGRAPH 2017 Courses</em>.</p>
<p>Will Harrower, Pete Kyme, Ferdi Scheepers, Michael Rice, Marie Tollec, and Alex Moaveni. 2018. <a href="https://doi.org/10.1145/3214745.3214814">SimpleBullet: Collaborating on a Modular Destruction Toolkit</a>. In <em>ACM SIGGRAPH 2018 Talks</em>. 79:1-79:2.</p>
<p>Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. <a href="https://doi.org/10.1145/3072959.3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 111:1-111:16.</p>
<p>Tzu-Mao Li, Yu-Ting Wu, and Yung-Yu Chuang. 2012. <a href="https://doi.org/10.1145/2366145.2366213">SURE-based Optimization for Adaptive Sampling and Reconstruction</a>. <em>ACM Transactions on Graphics</em>. 31, 6 (2012), 194:1-194:9.</p>
<p>Jan Novák, Andrew Selle, and Wojciech Jarosz. 2014. <a href="https://doi.org/10.1145/2661229.2661292">Residual Ratio Tracking for Estimating Attenuation in Participating Media</a>. <em>ACM Transactions on Graphics</em>. 33, 6 (2014), 179:1-179:11.</p>
<p>Josh Richards, Joyce Le Tong, Moe El-Ali, and Tuan Nguyen. 2019. <a href="https://doi.org/10.1145/3306307.3328185">Optimizing Large Scale Crowds in Ralph Breaks the Internet</a>. In <em>ACM SIGGRAPH 2019 Talks</em>. 65:1-65:2.</p>
<p>Fabrice Rousselle, Marco Manzi, and Matthias Zwicker. 2013. <a href="https://doi.org/10.1111/cgf.12219">Robust Denoising using Feature and Color Information</a>. <em>Computer Graphics Forum</em>. 32, 7 (2013), 121-130.</p>
<p>Thijs Vogels, Fabrice Rousselle, Brian McWilliams, Gerhard Röthlin, Alex Harvill, David Adler, Mark Meyer, and Jan Novák. 2018. <a href="https://doi.org/10.1145/3197517.3201388">Denoising with Kernel Prediction and Asymmetric Loss Functions</a>. <em>ACM Transactions on Graphics</em>. 37, 4 (2018), 124:1-124:15.</p>
https://blog.yiningkarlli.com/2018/10/bidirectional-mipmap.html
Mipmapping with Bidirectional Techniques
2018-10-25T00:00:00+00:00
2018-10-25T00:00:00+00:00
Yining Karl Li
<p>One major feature that differentiates production-capable renderers from hobby or research renderers is a texture caching system.
A well-implemented texture caching system is what allows a production renderer to render scenes with potentially many TBs of textures, but in a reasonable footprint that fits in a few dozen or a hundred-ish GB of RAM.
Pretty much every production renderer today has a robust texture caching system; Arnold famously derives a significant amount of performance from an extremely efficient texture cache implementation, and Vray/Corona/Renderman/Hyperion/etc. all have their own, similarly efficient systems.</p>
<p>In this post and the next few posts, I’ll write about how I implemented a tiled, mipmapped texture caching system in my hobby renderer, Takua Renderer.
I’ll also discuss some of the interesting challenges I ran into along the way.
This post will focus on the mipmapping part of the system.
Building a tiled mipmapping system that works well with bidirectional path tracing techniques was particularly difficult, for reasons I’ll discuss later in this post.
I’ll also review the academic literature on ray differentials and mipmapping with path tracing, and I’ll take a look at what several different production renderers do.
The scene I’ll use as an example in this post is a custom recreation of a forest scene from Evermotion’s Archmodels 182, rendered entirely using Takua Renderer (of course):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest.cam0.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest.cam0.0.jpg" alt="Figure 1: A forest scene in the morning, rendered using Takua Renderer. 6 GB of textures on disk accessed using a 1 GB in-memory texture cache." /></a></p>
<p><strong>Intro: Texture Caches and Mipmaps</strong></p>
<p>Texture caching is typically coupled with some form of a tiled, mipmapped <a href="https://dl.acm.org/citation.cfm?id=801126">[Williams 1983]</a> texture system; the texture cache holds specific tiles of an image that were accessed, as opposed to an entire texture.
These tiles are typically lazy-loaded on demand into a cache <a href="https://graphics.pixar.com/library/TOD/">[Peachey 1990]</a>, which means the renderer only needs to pay the memory storage cost for the parts of a texture that it actually accesses.</p>
<p>The remainder of this section and the next section of this post are a recap of what mipmaps are, mipmap level selection, and ray differentials for the less experienced reader.
I also discuss a bit about what techniques various production renderers are known to use today.
If you already know all of this stuff, I’d suggest skipping down a bit to the section titled “Ray Differentials and Bidirectional Techniques”.</p>
<p>Mipmapping works by creating multiple resolutions of a texture, and for a given surface, only loading the last resolution level where the frequency detail falls below the Nyquist limit when viewed from the camera.
Since textures are often much more high resolution than the final framebuffer resolution, mipmapping means the renderer can achieve huge memory savings, since for objects further away from the camera, most loaded mip levels will be significantly lower resolution than the original texture.
Mipmaps start with the original full resolution texture as “level 0”, and then each level going up from level 0 is half the resolution of the previous level.
The highest level is the level at which the texture can no longer be halved in resolution again.</p>
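<p>Put into code, the length of a full mip chain is just the number of times the resolution can be halved before both dimensions reach a single texel; here’s a quick sketch (not any particular renderer’s API):</p>
<pre><code>#include <algorithm>

int mipLevelCount(int width, int height) {
    int levels = 1; // level 0 is the original full-resolution texture
    while (width > 1 || height > 1) {
        width = std::max(width / 2, 1);
        height = std::max(height / 2, 1);
        ++levels;
    }
    return levels;
}
// For example, a 4096x4096 texture has 13 levels, from 4096x4096 down to 1x1.
</code></pre>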
<p>Below is an example of a mipmapped texture.
The texture below is the diffuse albedo texture for the fallen log that is in the front of the scene in Figure 1, blocking off the path into the woods.
On the left side of Figure 2 is level 1 of this texture (I have omitted level 0 both for image size reasons and because the original texture is from a commercial source, which I don’t have the right to redistribute in full resolution).
On the right side, going from the top on down, are levels 2 through 11 of the mipmap.
I’ll talk about the “tiled” part in a later post.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/texture_miplevels.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/texture_miplevels.jpg" alt="Figure 2: A mipmapped texture. Level 1 of the mipmap is shown on the left, levels 2 through 11 are shown on the right. Level 0 is not shown here. A bit of terminology that is often confusing: the lowest mipmap level is the highest resolution level, while the highest mipmap level is the lowest resolution level." /></a></p>
<p>Before diving into details, I need to make a major note: I’m not going to write too much about texture filtering for now, mainly because I haven’t done much with texture filtering in Takua at all.
Mipmapping was originally invented as an elegant solution to the problem of expensive texture filtering in rasterized rendering; when a texture had detail that was more high frequency than the distance between neighboring pixels in the framebuffer, aliasing would occur when the texture was sampled.
Mip levels above level 0 are typically generated with pre-computed filtering of the original full resolution texture, allowing single texture samples to appear antialiased.
For a comprehensive discussion of texture filtering, how it relates to mipmaps, and more advanced techniques, see <a href="http://www.pbr-book.org/3ed-2018/Texture/Image_Texture.html#MIPMaps">section 10.4.3 in Physically Based Rendering 3rd Edition</a> <a href="http://www.pbr-book.org">[Pharr et al. 2016]</a>.</p>
<p>For now, Takua just uses a point sampler for all texture filtering; my interest in mipmaps is mostly for memory efficiency and texture caching instead of filtering.
My thinking is that in a path tracer that is going to generate hundreds or even thousands of paths for each framebuffer pixel, the need for single-sample antialiasing becomes somewhat lessened, since we’re already basically supersampling.
Good texture filtering is still ideal of course, but being lazy and just relying on supersampling to get rid of texture aliasing in primary visibility is… not necessarily the worst short-term solution in the world.
Furthermore, relying on just point sampling means each texture sample only requires two texture lookups: one from the integer mip level above and one from the integer mip level below the continuous float mip level at a sample point (see the next section for more on this).
Using only two texture lookups per texture sample is highly efficient due to minimized memory access and minimized branching in the code.
Interestingly, the Moonray team at Dreamworks Animation arrived at more or less the same conclusion <a href="https://dl.acm.org/citation.cfm?doid=3105762.3105768">[Lee et al. 2017]</a>; they point out in their paper that geometric complexity, for all intents and purposes, has an infinite frequency, whereas pre-filtered mipmapped textures are already band limited.
As a result, the number of samples required to resolve geometric aliasing should be more than enough to also resolve any texture aliasing.
The Moonray team found that this approach works well enough to be their default mode in production.</p>
<p><strong>Mipmap Level Selection and Ray Differentials</strong></p>
<p>The trickiest part of using mipmapped textures is figuring out what mipmap level to sample at any given point.
Since the goal is to find a mipmap level with a frequency detail as close to the texture sampling rate as possible, we need to have a sense of what the texture sampling rate at a given point in space relative to the camera will be.
More precisely, we want the differential of the surface parameterization (a.k.a. how uv space is changing) with respect to the image plane.
Since the image plane is two-dimensional, we will end up with a differential for each uv axis with respect to each axis of the image plane; we call these differentials dudx/dvdx and dudy/dvdy, where u/v are uv coordinates and x/y are image plane pixel coordinates.
Calculating these differentials is easy enough in a rasterizer: for each image plane pixel, take the texture coordinates of the fragment, subtract from them the texture coordinates of the neighboring fragments to get the gradient of the texture coordinates with respect to the image plane (a.k.a. screen space), and then scale by the texture resolution.
Once we have dudx/dvdx and dudy/dvdy, for a non-fancy box filter all we have to do to get the mipmap level is take the largest of these gradients and calculate its base-2 logarithm.
A code snippet might look something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float mipLevelFromDifferentialSurface(const float dudx,
const float dvdx,
const float dudy,
const float dvdy,
const int maxMipLevel) {
float width = max(max(dudx, dvdx), max(dudy, dvdy));
float level = float(maxMipLevel) + log2(width);
return level;
}
</code></pre></div></div>
<p>Notice that the level value is a continuous float.
Usually, instead of rounding level to an integer, a better approach is to sample both of the integer mipmap levels above and below the continuous level and blend between the two values using the fractional part of level.
Doing this blending helps immensely with smoothing transitions between mipmap levels, which can become very important when rendering an animated sequence with camera movement.</p>
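<p>Putting the level calculation and the blending together, a point-sampled lookup might look something like the following sketch; here, Texture and sampleMipLevel() are hypothetical stand-ins for whatever texture representation and single-level point sampling function a renderer actually has:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: point sample a texture by blending between the two integer mip
// levels bracketing the continuous level value. sampleMipLevel() is a
// hypothetical function that point samples one integer mip level at the
// given uv coordinates.
vec4 sampleTextureWithLevelBlend(const Texture& texture, const vec2& uv, const float level) {
    float clamped = clamp(level, 0.0f, float(texture.maxMipLevel));
    int lower = int(floor(clamped));
    int upper = min(lower + 1, texture.maxMipLevel);
    float blend = clamped - float(lower);
    vec4 lowerSample = sampleMipLevel(texture, uv, lower);
    vec4 upperSample = sampleMipLevel(texture, uv, upper);
    // Linearly interpolate using the fractional part of the continuous level
    return (1.0f - blend) * lowerSample + blend * upperSample;
}
</code></pre></div></div>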
<p>In a ray tracer, however, figuring out dudx/dvdx and dudy/dvdy is not as easy as in a rasterizer.
If we are only considering primary rays, we can do something similar to the rasterization case: fire a ray from a given pixel and fire rays from the neighboring pixels, and calculate the gradient of the texture coordinates with respect to screen space (the screen space partial derivatives) by examining the hit points of each neighboring ray that hits the same surface as the primary ray.
This approach rapidly falls apart though, for the following reasons and more:</p>
<ul>
<li>If a ray hits a surface but none of its neighboring rays hit the same surface, then we can’t calculate any differentials and must fall back to point sampling the lowest mip level. This isn’t a problem in the rasterization case, since rasterization will run through all of the polygons that make up a surface, but in the ray tracing case, we only know about surfaces that we actually hit with a ray.</li>
<li>For secondary rays, we would need to trace secondary bounces not just for a given pixel’s ray, but also its neighboring rays. Doing so would be necessary since, depending on the bsdf at a given surface, the distance between the main ray and its neighbor rays can change arbitrarily. Tracing this many additional rays quickly becomes prohibitively expensive; for example, if we are considering four neighbors per pixel, we are now tracing five times as many rays as before.</li>
<li>We would also have to continue to guarantee that neighbor secondary rays continue hitting the same surface as the main secondary ray, which will become arbitrarily difficult as bxdf lobes widen or narrow.</li>
</ul>
<p>A better solution to these problems is to use <em>ray differentials</em> <a href="https://graphics.stanford.edu/papers/trd/">[Igehy 1999]</a>, which is more or less just a ray along with the partial derivative of the ray with respect to screen space.
Thinking of a ray differential as essentially similar to a ray with a width or a cone, similar to beam tracing <a href="https://dl.acm.org/citation.cfm?id=808588">[Heckbert and Hanrahan 1984]</a>, pencil tracing <a href="https://dl.acm.org/citation.cfm?id=37408">[Shinya et al. 1987]</a>, or cone tracing <a href="https://dl.acm.org/citation.cfm?id=808589">[Amanatides 1984]</a>, is not entirely incorrect, but ray differentials are a bit more nuanced than any of the above.
With ray differentials, instead of tracing a bunch of independent neighbor rays with each camera ray, the idea is to reconstruct dudx/dvdx and dudy/dvdy at each hit point using simulated offset rays that are reconstructed from the ray’s partial derivative.
Ray differentials are generated alongside camera rays; when a ray is traced from the camera, offset rays are generated for a single neighboring pixel vertically and a single neighboring pixel horizontally in the image plane.
Instead of tracing these offset rays independently, however, we always assume they are at some angular width from the main ray.
When the main ray hits a surface, we need to calculate, for later use, the differentials of the surface at the intersection point with respect to uv space, which are called dpdu and dpdv.
Different surface types will require different functions to calculate dpdu and dpdv; for a triangle in a triangle mesh, the code requires the position and uv coordinates at each vertex:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DifferentialSurface calculateDifferentialSurfaceForTriangle(const vec3& p0,
const vec3& p1,
const vec3& p2,
const vec2& uv0,
const vec2& uv1,
const vec2& uv2) {
vec2 duv02 = uv0 - uv2;
vec2 duv12 = uv1 - uv2;
float determinant = duv02[0] * duv12[1] - duv02[1] * duv12[0];
vec3 dpdu, dpdv;
vec3 dp02 = p0 - p2;
vec3 dp12 = p1 - p2;
if (abs(determinant) == 0.0f) {
vec3 ng = normalize(cross(p2 - p0, p1 - p0));
if (abs(ng.x) > abs(ng.y)) {
dpdu = vec3(-ng.z, 0, ng.x) / sqrt(ng.x * ng.x + ng.z * ng.z);
} else {
dpdu = vec3(0, ng.z, -ng.y) / sqrt(ng.y * ng.y + ng.z * ng.z);
}
dpdv = cross(ng, dpdu);
} else {
float invdet = 1.0f / determinant;
dpdu = (duv12[1] * dp02 - duv02[1] * dp12) * invdet;
dpdv = (-duv12[0] * dp02 + duv02[0] * dp12) * invdet;
}
return DifferentialSurface(dpdu, dpdv);
}
</code></pre></div></div>
<p>Calculating surface differentials does add a small bit of overhead to the renderer, but the cost can be minimized with some careful work.
The naive approach to surface differentials is to calculate them with every intersection point and return them as part of the hit point information that is produced by ray traversal.
However, this computation is wasted if the shading operation for a given hit point doesn’t actually end up doing any texture lookups.
In Takua, surface differentials are calculated on demand at texture lookup time instead of at ray intersection time; this way, we don’t have to pay the computational cost for the above function unless we actually need to do texture lookups.
Takua also supports multiple uv sets per mesh, so the above function is parameterized by uv set ID, and the function is called once for each uv set that a texture specifies.
Surface differentials are also cached within a shading operation per hit point, so if a shader does multiple texture lookups within a single invocation, the required surface differentials don’t need to be redundantly calculated.</p>
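<p>As a rough illustration of the on-demand caching idea, the surface differential calculation can be wrapped in a small per-hit-point cache keyed by uv set ID; the HitPoint and ShadingCache types in this sketch are hypothetical stand-ins rather than Takua’s actual interfaces:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: lazily compute and cache surface differentials per hit point, keyed
// by uv set ID, so repeated texture lookups within one shader invocation don't
// redo the work. HitPoint and ShadingCache are illustrative stand-ins.
struct ShadingCache {
    std::unordered_map<int, DifferentialSurface> surfaceDifferentials;
};

const DifferentialSurface& getDifferentialSurface(const HitPoint& hit,
                                                  const int uvSetID,
                                                  ShadingCache& cache) {
    auto iter = cache.surfaceDifferentials.find(uvSetID);
    if (iter != cache.surfaceDifferentials.end()) {
        return iter->second;
    }
    // Only pay for the calculation the first time a texture lookup needs it
    DifferentialSurface ds = calculateDifferentialSurfaceForTriangle(
        hit.p0, hit.p1, hit.p2,
        hit.uv0(uvSetID), hit.uv1(uvSetID), hit.uv2(uvSetID));
    return cache.surfaceDifferentials.emplace(uvSetID, ds).first->second;
}
</code></pre></div></div>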
<p>Sony Imageworks’ variant of Arnold (we’ll refer to it as SPI Arnold to disambiguate from Solid Angle’s Arnold) does something even more advanced <a href="https://dl.acm.org/citation.cfm?id=3180495">[Kulla et al. 2018]</a>.
Instead of the above explicit surface differential calculation, SPI Arnold implements an automatic differentiation system utilizing dual arithmetic <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2004.10504901">[Piponi 2004]</a>.
SPI Arnold extensively utilizes OSL for shading; this means that they are able to trace at runtime what dependencies a particular shader execution path requires, and therefore when a shader needs any kind of derivative or differential information.
The calls to the automatic differentiation system are then JITed into the shader’s execution path, meaning shader authors never have to be aware of how derivatives are computed in the renderer.
The SPI Arnold team’s decision to use dual arithmetic based automatic differentiation is influenced by lessons they had previously learned with BMRT’s finite differencing system, which required lots of extraneous shading computations for incoherent ray tracing <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.1996.10487462">[Gritz and Hahn 1996]</a>.
At least for my purposes, though, I’ve found that the simpler approach I’ve taken in Takua adds negligible overhead and code complexity, so I’ll probably skip something like the SPI Arnold approach for now.</p>
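<p>For readers unfamiliar with dual arithmetic, the core idea is small: a dual number carries a value together with its derivative, and overloaded arithmetic operators propagate both at once using the usual calculus rules. Here is a minimal sketch of the idea (this is just the textbook construction, not SPI Arnold’s actual implementation):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: a minimal dual number type for forward-mode automatic
// differentiation. Each value carries its derivative; operators apply the
// usual differentiation rules.
struct Dual {
    float value;
    float derivative;
};

Dual operator+(const Dual& a, const Dual& b) {
    return { a.value + b.value, a.derivative + b.derivative };
}

Dual operator*(const Dual& a, const Dual& b) {
    // Product rule: (ab)' = a'b + ab'
    return { a.value * b.value, a.derivative * b.value + a.value * b.derivative };
}

// Example: evaluate f(x) = x * x + x and its derivative at x = 3 by seeding
// the derivative of x with 1:
//     Dual x = { 3.0f, 1.0f };
//     Dual f = x * x + x;    // f.value == 12, f.derivative == 7
</code></pre></div></div>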
<p>Once we have the surface differential, we can then approximate the local surface geometry at the intersection point with a tangent plane, and intersect the offset rays with the tangent plane.
To find the corresponding uv coordinates for the offset ray tangent plane intersection points, dpdu/dpdv, the main ray intersection point, and the offset ray intersection points can be used to establish a linear system.
Solving this linear system leads us directly to dudx/dudy and dvdx/dvdy; for the exact mathematical details and explanation, see <a href="http://www.pbr-book.org/3ed-2018/Texture/Sampling_and_Antialiasing.html">section 10.1 in Physically Based Rendering 3rd Edition</a>.
The actual code might look something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// This code is heavily aped from PBRT v3; consult the PBRT book for details!
vec4 calculateScreenSpaceDifferential(const vec3& p, // Surface intersection point
const vec3& n, // Surface normal
const vec3& origin, // Main ray origin
const vec3& rDirection, // Main ray direction
const vec3& xorigin, // Offset x ray origin
const vec3& rxDirection, // Offset x ray direction
const vec3& yorigin, // Offset y ray origin
const vec3& ryDirection, // Offset y ray direction
const vec3& dpdu, // Surface differential w.r.t. u
const vec3& dpdv // Surface differential w.r.t. v
) {
// Compute offset-ray intersection points with tangent plane
float d = dot(n, p);
float tx = -(dot(n, xorigin) - d) / dot(n, rxDirection);
vec3 px = origin + tx * rxDirection;
float ty = -(dot(n, yorigin) - d) / dot(n, ryDirection);
vec3 py = origin + ty * ryDirection;
vec3 dpdx = px - p;
vec3 dpdy = py - p;
// Compute uv offsets at offset-ray intersection points
// Choose two dimensions to use for ray offset computations
ivec2 dim;
if (std::abs(n.x) > std::abs(n.y) && std::abs(n.x) > std::abs(n.z)) {
dim = ivec2(1,2);
} else if (std::abs(n.y) > std::abs(n.z)) {
dim = ivec2(0,2);
} else {
dim = ivec2(0,1);
}
// Initialize A, Bx, and By matrices for offset computation
mat2 A;
A[0][0] = ds.dpdu[dim[0]];
A[0][1] = ds.dpdv[dim[0]];
A[1][0] = ds.dpdu[dim[1]];
A[1][1] = ds.dpdv[dim[1]];
vec2 Bx(px[dim[0]] - p[dim[0]], px[dim[1]] - p[dim[1]]);
vec2 By(py[dim[0]] - p[dim[0]], py[dim[1]] - p[dim[1]]);
float dudx, dvdx, dudy, dvdy;
// Solve two linear systems to get uv offsets
auto solveLinearSystem2x2 = [](const mat2& A, const vec2& B, float& x0, float& x1) -> bool {
float det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
if (abs(det) < (float)constants::EPSILON) {
return false;
}
x0 = (A[1][1] * B[0] - A[0][1] * B[1]) / det;
x1 = (A[0][0] * B[1] - A[1][0] * B[0]) / det;
if (std::isnan(x0) || std::isnan(x1)) {
return false;
}
return true;
};
if (!solveLinearSystem2x2(A, Bx, dudx, dvdx)) {
dudx = dvdx = 0.0f;
}
if (!solveLinearSystem2x2(A, By, dudy, dvdy)) {
dudy = dvdy = 0.0f;
}
return vec4(dudx, dvdx, dudy, dvdy);
}
</code></pre></div></div>
<p>Now that we have dudx/dudy and dvdx/dvdy, getting the proper mipmap level works just like in the rasterization case.
The above approach is exactly what I have implemented in Takua Renderer for camera rays.
Similar to surface differentials, Takua caches dudx/dudy and dvdx/dvdy computations per shader invocation per hit point, so that multiple textures utilizing the same uv set don’t require multiple redundant calls to the above function.</p>
<p>The ray differential approach to mipmap level selection is basically the standard approach in modern production rendering today for camera rays.
PBRT <a href="http://pbrt.org">[Pharr et al. 2016]</a>, Mitsuba <a href="http://www.mitsuba-renderer.org/">[Jakob 2010]</a>, and Solid Angle’s version of Arnold <a href="https://dl.acm.org/citation.cfm?id=3182160">[Georgiev et al. 2018]</a> all use ray differential systems based on this approach for camera rays.
Renderman <a href="https://dl.acm.org/citation.cfm?id=3182162">[Christensen et al. 2018]</a> uses a simplified version of ray differentials that only tracks two floats per ray, instead of the four vectors needed to represent a full ray differential.
Renderman tracks a width at each ray’s origin, and a spread value representing the linear rate of change of the ray width over a unit distance.
This approach does not encode as much information as the full ray differential approach, but nonetheless ends up being sufficient since in a path tracer, every pixel essentially ends up being supersampled.
Hyperion <a href="https://dl.acm.org/citation.cfm?id=3182159">[Burley et al. 2018]</a> uses a similarly simplified scheme.</p>
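<p>To make the simplified representation concrete, here is a sketch of the general width/spread idea; this is just an illustration of the concept, not RenderMan’s or Hyperion’s actual code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: a simplified ray differential that tracks only a width at the ray
// origin and a spread (rate of change of the width per unit distance),
// instead of full offset-ray origins and directions.
struct RayWidth {
    float width;   // footprint width at the ray origin
    float spread;  // linear growth of the width per unit distance traveled
};

// Footprint at a hit point t units along the ray
float footprintAtHit(const RayWidth& rw, const float t) {
    return rw.width + rw.spread * t;
}

// After a scattering event, the footprint at the hit point becomes the new
// width, and the spread is updated based on the scattering event (for
// example, widened for rough or diffuse events).
RayWidth propagateThroughScatter(const RayWidth& rw, const float t, const float newSpread) {
    return { footprintAtHit(rw, t), newSpread };
}
</code></pre></div></div>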
<p>A brief side note: being able to calculate the differential for surface normals with respect to screen space is useful for bump mapping, among other things, and the calculation is directly analogous to the pseudocode above for calculateDifferentialSurfaceForTriangle() and calculateScreenSpaceDifferential(), just with surface normals substituted in for surface positions.</p>
<p><strong>Ray Differentials and Path Tracing</strong></p>
<p>We now know how to calculate filter footprints using ray differentials for camera rays, which is great, but what about secondary rays?
Without ray differentials for secondary rays, path tracing texture access behavior degrades severely, since secondary rays have to fall back to point sampling textures at the lowest mip level.
A number of different schemes exist for calculating filter footprints and mipmap levels for secondary rays; here are a few that have been presented in literature and/or are known to be in use in modern production renderers:</p>
<p><a href="https://graphics.stanford.edu/papers/trd/">Igehy [1999]</a> demonstrates how to propagate ray differentials through perfectly specular reflection and refraction events, which boil down to some simple extensions to the basic math for optical reflection and refraction.
However, we still need a means of handling glossy events (that is, events with non-zero surface roughness), which requires an extended version of ray differentials.
<em>Path differentials</em> <a href="http://graphics.cs.kuleuven.be/publications/PATHDIFF/">[Suykens and Willems 2001]</a> consider more than just partial derivatives for each screen space pixel footprint; with path differentials, partial derivatives can also be taken at each scattering event along a number of dimensions.
As an example, for handling an arbitrarily shaped BSDF lobe, new partial derivatives can be calculated along some parameter of the lobe that describes the shape of the lobe, which takes the form of a bunch of additional scattering directions around the main ray’s scattering direction.
If we imagine looking down the main scattering direction and constructing a convex hull around the additional scattering directions, the result is a polygonal footprint describing the ray differential over the scattering event.
This footprint can then be approximated by finding the major and minor axis of the polygonal footprint.
While the method is general enough to handle arbitrary factors impacting ray directions, unfortunately it can be fairly complex and expensive to compute in practice, and differentials for some types of events can be very difficult to derive.
For this reason, none of the major production renderers today actually use this approach.
However, there is a useful observation that can be drawn from path differentials: generally, in most cases, primary rays have narrow widths and secondary rays have wider widths <a href="https://diglib.eg.org/handle/10.2312/8776">[Christensen et al. 2003]</a>; this observation is the basis of the ad-hoc techniques that most production renderers utilize.</p>
<p>Recently, research has appeared that provides an entirely different, more principled approach to selecting filter footprints, based on <em>covariance tracing</em> <a href="https://dl.acm.org/citation.cfm?id=2487239">[Belcour et al. 2013]</a>.
The high-level idea behind covariance tracing is that local light interaction effects such as transport, occlusion, roughness, etc. can all be encoded as 5D covariance matrices, which in turn can be used to determine ideal sampling rates.
Covariance tracing builds an actual, implementable rendering algorithm on top of earlier, mostly theoretical work on light transport frequency analysis <a href="https://dl.acm.org/citation.cfm?id=1073320">[Durand et al. 2005]</a>.
<a href="https://dl.acm.org/citation.cfm?id=2487239">Belcour et al. [2017]</a> presents an extension to covariance tracing for calculating filter footprints for arbitrary shading effects, including texture map filtering.
The covariance-tracing based approach differs from path differentials in two key areas.
While both approaches operate in path space, path differentials are much more expensive to compute than the covariance-tracing based technique; path differential complexity scales quadratically with path length, while covariance tracing only ever carries a single covariance matrix along a path for a given effect.
Also, path differentials can only be generated starting from the camera, whereas covariance tracing works from the camera and the light; in the next section, we’ll talk about why this difference is critically important.</p>
<p>Covariance tracing based techniques have a lot of promise, and are the best known approach to date for selecting filter footprints along a path.
The original covariance tracing paper had some difficulty with handling high geometric complexity; covariance tracing requires a voxelized version of the scene for storing local occlusion covariance information, and covariance estimates can degrade severely if the occlusion covariance grid is not high resolution enough to capture small geometric details.
For huge production scale scenes, geometric complexity requirements can make covariance tracing either slow due to huge occlusion grids, or degraded in quality due to insufficiently large occlusion grids.
However, the voxelization step is not as much of a barrier to practicality as it may initially seem.
For covariance tracing based filtering, visibility can be neglected, so the entire scene voxelization step can be skipped; <a href="https://dl.acm.org/citation.cfm?id=2990495">Belcour et al. [2017]</a> demonstrates how.
Since covariance tracing based filtering can be used with the same assumptions and data as ray differentials but is both superior in quality and more generalizable than ray differentials, I would not be surprised to see more renderers adopt this technique over time.</p>
<p>As of present, however, instead of using any of the above techniques, pretty much all production renderers today use various ad-hoc methods for tracking ray widths for secondary rays.
SPI Arnold tracks accumulated roughness values encountered by a ray: if a ray either encounters a diffuse event or reaches a sufficiently high accumulated roughness value, SPI Arnold automatically goes to basically the highest available mip level <a href="https://dl.acm.org/citation.cfm?id=3180495">[Kulla et al. 2018]</a>.
This scheme produces very aggressive texture filtering, but in turn provides excellent texture access patterns.
Solid Angle Arnold similarly uses an ad-hoc microfacet-inspired heuristic for secondary rays <a href="https://dl.acm.org/citation.cfm?id=3182160">[Georgiev et al. 2018]</a>.
Renderman handles reflection and refraction using something similar to <a href="https://graphics.stanford.edu/papers/trd/">Igehy [1999]</a>, but modified for the single-float-width ray differential representation that Renderman uses <a href="https://dl.acm.org/citation.cfm?id=3182162">[Christensen et al. 2018]</a>.
For glossy and diffuse events, Renderman uses an empirically determined heuristic where higher ray width spreads are driven by lower scattering direction pdfs.
Weta’s Manuka has a unified roughness estimation system built into the shading system, which uses a mean cosine estimate for figuring out ray differentials <a href="https://dl.acm.org/citation.cfm?id=3182161">[Fascione et al. 2018]</a>.</p>
<p>Generally, roughness driven heuristics seem to work reasonably well in production, and the actual heuristics don’t actually have to be too complicated!
In an experimental branch of PBRT, Matt Pharr found that a simple heuristic that uses a ray differential covering roughly 1/25th of the hemisphere for diffuse events and 1/100th of the hemisphere for glossy events generally worked reasonably well <a href="https://www.pbrt.org/texcache.pdf">[Pharr 2017]</a>.</p>
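<p>A sketch of what such a heuristic might look like is below; the hemisphere fractions follow the rough idea from Pharr’s experiment, but the exact constants and the conversion from a solid angle fraction to a spread angle here are purely illustrative:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: an ad-hoc heuristic for choosing a secondary ray's differential
// spread based on the type of scattering event. Constants are illustrative.
float secondaryRaySpread(const bool isDiffuse, const bool isGlossy, const float incomingSpread) {
    const float pi = 3.14159265358979f;
    const float hemisphereSolidAngle = 2.0f * pi;
    float solidAngle;
    if (isDiffuse) {
        solidAngle = hemisphereSolidAngle / 25.0f;   // ~1/25th of the hemisphere
    } else if (isGlossy) {
        solidAngle = hemisphereSolidAngle / 100.0f;  // ~1/100th of the hemisphere
    } else {
        return incomingSpread;                       // specular: keep the incoming spread
    }
    // The solid angle of a cone with half-angle theta is 2*pi*(1 - cos(theta)),
    // so convert the chosen solid angle back into an approximate cone half-angle.
    float cosTheta = 1.0f - solidAngle / (2.0f * pi);
    return std::acos(cosTheta);
}
</code></pre></div></div>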
<p><strong>Ray Differentials and Bidirectional Techniques</strong></p>
<p>So far everything we’ve discussed has been for unidirectional path tracing that starts from the camera.
What about ray differentials and mip level selection for paths starting from a light, and by extension, for bidirectional path tracing techniques?
Unfortunately, nobody has a good, robust solution for calculating ray differentials for light paths!
Calculating ray differentials for light paths is fundamentally something of an ill defined problem: a ray differential has to be calculated with respect to a screen space pixel footprint, which works fine for camera paths since the first ray starts from the camera, but for light paths, the <em>last</em> ray in the path is the one that reaches the camera.
With light paths, we have something of a chicken-and-egg problem; there is no way to calculate anything with respect to a screen space pixel footprint until a light path has already been fully constructed, but the shading computations required to construct the path are the computations that want differential information in the first place.
Furthermore, even if we did have a good way to calculate a starting ray differential from a light, the corresponding path differential can’t become as wide as in the case of a camera path, since at any given moment the light path might scatter towards the camera and therefore needs to maintain a footprint no wider than a single screen space pixel.</p>
<p>Some research work has gone into this question, but more work is needed on this topic.
The previously discussed covariance tracing based technique <a href="https://dl.acm.org/citation.cfm?id=2990495">[Belcour et al. 2017]</a> does allow for calculating an ideal texture filtering width and mip level once a light path is fully constructed, but again, the real problem is that footprints need to be available during path construction, not afterwards.
With bidirectional path tracing, things get even harder.
In order to keep a bidirectional path unbiased, all connections between camera and light path vertices must be consistent in what mip level they sample; however, this is difficult since ray differentials depend on the scattering events at each path vertex.
<a href="https://dl.acm.org/citation.cfm?id=2487239">Belcour et al. [2017]</a> demonstrates how important
consistent texture filtering between two vertices is.</p>
<p>Currently, only a handful of production renderers have extensive support for bidirectional techniques; of the ones that do, the most common solution to calculating ray differentials for bidirectional paths is… simply not to at all.
Unfortunately, this means bidirectional techniques must rely on point sampling the lowest mip level, which defeats the whole point of mipmapping and destroys texture caching performance.
The Manuka team alludes to using ray differentials for photon map gather widths in VCM and notes that these ray differentials are implemented as part of their manifold next event estimation system <a href="https://dl.acm.org/citation.cfm?id=3182161">[Fascione et al. 2018]</a>, but there isn’t enough detail in their paper to be able to figure out how this actually works.</p>
<p><strong>Camera-Based Mipmap Level Selection</strong></p>
<p>Takua has implementations of standard bidirectional path tracing, progressive photon mapping, and VCM, and I wanted mipmapping to work with all integrator types in Takua.
I’m interested in using Takua to render scenes with very high complexity levels using advanced (often bidirectional) light transport algorithms, but reaching production levels of shading complexity without a mipmapped texture cache simply is not possible without crazy amounts of memory (where crazy is defined as in the range of dozens to hundreds of GB of textures or more).
However, for the reasons described above, standard ray differential based techniques for calculating mip levels weren’t going to work with Takua’s bidirectional integrators.</p>
<p>The lack of a ray differential solution for light paths left me stuck for some time, until late in 2017, when I got to read an early draft of what eventually became the Manuka paper <a href="https://dl.acm.org/citation.cfm?id=3182161">[Fascione et al. 2018]</a> in the ACM Transactions on Graphics special issue on production rendering.
I highly recommend reading all five of the production renderer system papers in the ACM TOG special issue.
However, if you’re already generally familiar with how a modern PBRT-style renderer works and only have time to read one paper, I would recommend the Manuka paper simply because Manuka’s architecture and the set of trade-offs and choices made by the Manuka team are so different from what every other modern PBRT-style production path tracer does.
What I eventually implemented in Takua is directly inspired by Manuka, although it’s not what Manuka actually does (I think).</p>
<p>The key architectural feature that differentiates Manuka from Arnold/Renderman/Vray/Corona/Hyperion/etc. is its <em>shade-before-hit</em> architecture.
I should note here that in this context, shade refers to the pattern generation part of shading, as opposed to the bsdf evaluation/sampling part of shading.
The brief explanation (you really should go read the full paper) is that Manuka completely decouples pattern generation from path construction and path sampling, as opposed to what all other modern path tracers do.
Most modern path tracers use <em>shade-on-hit</em>, which means pattern generation is lazily evaluated to produce bsdf parameters upon demand, such as when a ray hits a surface.
In a shade on hit architecture, pattern generation and path sampling are interleaved, since path sampling depends on the results of pattern generation.
Separating out geometry processing from path construction is fairly standard in most modern production path tracers, meaning subdivision/tessellation/displacement happens before any rays are traced, and displacement usually involves some amount of pattern generation.
However, no other production path tracer separates out <em>all</em> of pattern generation from path sampling the way Manuka does.
At render startup, Manuka runs geometry processing, which dices all input geometry into micropolygon grids, and then runs pattern generation on all of the micropolygons.
The result of pattern generation is a set of bsdf parameters that are baked into the micropolygon vertices.
Manuka then builds a BVH and proceeds with normal path tracing, but at each path vertex, instead of having to evaluate shading graphs and do texture lookups to calculate bsdf parameters, the bsdf parameters are looked up directly from the pre-calculated cached values baked into the micropolygon vertices.
Put another way, Manuka is a path tracer with a REYES-style shader execution model <a href="https://dl.acm.org/citation.cfm?id=37414">[Cook et al. 1987]</a> instead of a PBRT-style shader execution model; Manuka preserves the grid-based shading coherence from REYES while also giving more flexibility to path sampling and light transport, which no longer have to worry about pattern generation making shading slow.</p>
<p>So how does any of this relate to the bidirectional path tracing mip level selection problem?
The answer is: in a shade-before-hit architecture, by the time the renderer is tracing light paths, there is no need for mip level selection because <em>there are no texture lookups required anymore during path sampling</em>.
During path sampling, Manuka evaluates bsdfs at each hit point using pre-shaded parameters that are bilinearly interpolated from the nearest micropolygon vertices; all of the texture lookups were already done in the pre-shade phase of the renderer!
In other words, at least in principle, a Manuka-style renderer can entirely sidestep the bidirectional path tracing mip level selection problem (although I don’t know if Manuka actually does this or not).
Also, in a shade-before-hit architecture, there are no concerns with biasing bidirectional path tracing from different camera/light path vertex connections seeing different mip levels.
Since all mip level selection and texture filtering decisions take place before path sampling, the view of the world presented to path sampling is always consistent.</p>
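<p>To make the contrast with shade-on-hit concrete, here is a rough sketch of what a bsdf parameter lookup at path sampling time could look like in a shade-before-hit world; this is purely illustrative and is not how Manuka is actually implemented (BsdfParameters, BakedVertex, and lerp() here are hypothetical):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: in a shade-before-hit architecture, bsdf parameters are baked into
// micropolygon vertices during a pre-shade phase; at path sampling time, a hit
// point just interpolates the baked parameters instead of evaluating shading
// graphs or doing texture lookups. All names here are illustrative stand-ins.
struct BakedVertex {
    BsdfParameters params;  // computed once during the pre-shade phase
};

BsdfParameters lookupBsdfParameters(const BakedVertex& v00, const BakedVertex& v10,
                                    const BakedVertex& v01, const BakedVertex& v11,
                                    const float u, const float v) {
    // Bilinear interpolation across the four nearest micropolygon vertices;
    // no texture access happens during path sampling at all.
    BsdfParameters a = lerp(v00.params, v10.params, u);
    BsdfParameters b = lerp(v01.params, v11.params, u);
    return lerp(a, b, v);
}
</code></pre></div></div>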
<p>Takua is not a shade-before-hit renderer though, and for a variety of reasons, I don’t plan on making it one.
Shade-before-hit presents a number of tradeoffs which are worthwhile in Manuka’s case because of the problems and requirements the Manuka team aimed to solve and meet, but Takua is a hobby renderer aimed at something very different from Manuka.
The largest drawback of shade-before-hit is the startup time associated with having to pre-shade the entire scene; this startup time can be quite large, but in exchange, the total render time can be faster as path sampling becomes more efficient.
However, in a number of workflows, the time to a full render is not nearly as important as the time to a minimum sample count at which point an artistic decision can be made on a noisy image; beyond this point, full render time is less important as long as it is within a reasonable ballpark.
Takua currently has a fast startup time and reaches a first set of samples quickly, and I wanted to keep this behavior.
As a result, the question then became: in a shade-on-hit architecture, is there a way to emulate shade-before-hit’s consistent view of the world, where texture filtering decisions are separated from path sampling?</p>
<p>The approach I arrived at is to drive mip level selection based on only a world-space distance-to-camera metric, with no dependency at all on the incoming ray at a given hit point.
This approach is… not even remotely novel; in a way, this approach is probably the most obvious solution of all, but it took me a long time and a circuitous path to arrive at for some reason.
Here’s the high-level overview of how I implemented a camera-based mip level selection technique:</p>
<ol>
<li>At render startup time, calculate a ray differential for each pixel in the camera’s image plane. The goal is to find the narrowest differential in each screen space dimension x and y. Store this piece of information for later.</li>
<li>At each ray-surface intersection point, calculate the differential surface.</li>
<li>Create a ‘fake’ ray going from the camera’s origin position to the current intersection point, with a ray differential equal to the minimum differential in each direction found in step 1.</li>
<li>Calculate dudx/dudy and dvdx/dvdy using the usual method presented above, but using the fake ray from step 3 instead of the actual ray.</li>
<li>Calculate the mip level as usual from dudx/dudy and dvdx/dvdy.</li>
</ol>
<p>The rationale for using the narrowest differentials in step 1 is to guarantee that texture frequency remains sub-pixel for all pixels in screen space, even if that means we might sometimes sample at a higher resolution mip level than strictly necessary for whatever screen space pixel we’re accumulating radiance to.
In this case, being overly conservative with our mip level selection is preferable to visible texture blurring from picking a mip level that is too low resolution.</p>
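<p>Here is a condensed sketch of steps 3 through 5; the function names roughly mirror the snippets earlier in this post, but the interfaces are simplified compared to what Takua actually does (the camera basis vectors and the precomputed minimum differentials are passed in explicitly here for illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: camera-based mip level selection. Instead of using the actual
// incoming ray at the hit point, construct a fake ray from the camera origin
// to the hit point, attach the narrowest per-pixel differentials found at
// startup, and then run the usual screen space differential machinery.
float cameraBasedMipLevel(const vec3& cameraOrigin,
                          const vec3& cameraRight,      // camera basis vectors
                          const vec3& cameraUp,
                          const float minDiffX,         // narrowest x differential from step 1
                          const float minDiffY,         // narrowest y differential from step 1
                          const vec3& hitPoint,
                          const vec3& hitNormal,
                          const DifferentialSurface& ds,
                          const int maxMipLevel) {
    // Step 3: 'fake' ray from the camera origin to the current hit point
    vec3 direction = normalize(hitPoint - cameraOrigin);
    vec3 rxDirection = normalize(direction + minDiffX * cameraRight);
    vec3 ryDirection = normalize(direction + minDiffY * cameraUp);
    // Step 4: the usual screen space differential calculation, using the fake ray
    vec4 d = calculateScreenSpaceDifferential(hitPoint, hitNormal,
                                              cameraOrigin, direction,
                                              cameraOrigin, rxDirection,
                                              cameraOrigin, ryDirection,
                                              ds.dpdu, ds.dpdv);
    // Step 5: the usual mip level calculation from dudx/dvdx and dudy/dvdy
    return mipLevelFromDifferentialSurface(d[0], d[1], d[2], d[3], maxMipLevel);
}
</code></pre></div></div>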
<p>Takua uses the above approach for all path types, including light paths in the various bidirectional integrators.
Since the mip level selection is based entirely on distance-to-camera, as far as the light transport integrators are concerned, their view of the world is entirely consistent.
As a result, Takua is able to sidestep the light path ray differential problem in much the same way that a shade-before-hit architecture is able to.
There are some particular implementation details that are slightly complicated by Takua having support for multiple uv sets per mesh, but I’ll write about multiple uv sets in a later post.
Also, there is one notable failure scenario, which I’ll discuss more in the results section.</p>
<p><strong>Results</strong></p>
<p>So how well does camera-based mipmap level selection work compared to a more standard approach based on path differentials or ray widths from the incident ray?
Typically in a production renderer, mipmaps work in conjunction with tiled textures, where tiles are a fixed size (unless a tile is in a mipmap level with a total resolution smaller than the tile resolution).
Therefore, the useful metric to compare is how many texture tiles each approach access throughout the course of a render; the more an approach accesses higher mipmap levels (meaning lower resolution mipmap levels), the fewer tiles in total should be accessed since lower resolution mipmap levels have fewer tiles.</p>
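<p>For what it’s worth, the statistic itself is cheap to gather; a sketch of the bookkeeping (the tuple encoding of a tile’s identity here is just illustrative) looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: track every unique (texture, mip level, tile) touched during the
// render; afterwards, divide the set size by the total number of tiles across
// all mip levels of all textures. The tuple encoding is purely illustrative.
std::set<std::tuple<int, int, int>> accessedTiles;

void recordTileAccess(const int textureIndex, const int mipLevel, const int tileIndex) {
    accessedTiles.insert(std::make_tuple(textureIndex, mipLevel, tileIndex));
}

float accessedTileFraction(const int totalTileCount) {
    return float(accessedTiles.size()) / float(totalTileCount);
}
</code></pre></div></div>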
<p>For unidirectional path tracing from the camera, we can reasonably expect the camera-based approach to perform less well than a path differential or ray width technique (which I’ll call simply ‘ray-based’).
In the camera-based approach, every texture lookup has to use a footprint corresponding to approximately a single screen space pixel footprint, whereas in a more standard ray-based approach, footprints get wider with each successive bounce, leading to access to higher mipmap levels.
Depending on how aggressively ray widths are widened at diffuse and glossy events, ray-based approaches can quickly reach the highest mipmap levels and essentially spend the majority of the render only accessing high mipmap levels.</p>
<p>For bidirectional integrators though, the camera-based technique has the major advantage of being able to provide reasonable mipmap levels for both camera and light paths, whereas the more standard ray-based approaches have to fall back to point sampling the lowest mipmap level for light paths.
As a result, for bidirectional paths we can expect that a ray-based approach should perform somewhere in between how a ray-based approach performs in the unidirectional case and how point sampling only the lowest mipmap level performs in the unidirectional case.</p>
<p>As a baseline, I also implemented a ray-based approach with a relatively aggressive widening heuristic for glossy and diffuse events.
For the forest scene from Figure 1, I got the following results at 1920x1080 resolution with 16 samples per pixel.
I compared unidirectional path tracing from the camera and standard bidirectional path tracing; statistics are presented as total number of texture tiles accessed divided by total number of texture tiles across all mipmap levels.
The lower the percentage, the better:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>16 SPP 1920x1080 Unidirectional (PT)
No mipmapping: 314439/745394 tiles (42.18%)
Ray-based level selection: 103206/745394 tiles (13.84%)
Camera-based level selection: 104764/745394 tiles (14.05%)
16 SPP 1920x1080 Bidirectional (BDPT)
No mipmapping: 315452/745394 tiles (42.32%)
Ray-based level selection: 203491/745394 tiles (27.30%)
Camera-based level selection: 104858/745394 tiles (14.07%)
</code></pre></div></div>
<p>As expected, in the unidirectional case, the camera-based approach accesses slightly more tiles than the ray-based approach, and both approaches significantly outperform point sampling the lowest mipmap level.
In the bidirectional case, the camera-based approach accesses slightly more tiles than in the unidirectional case, while the ray-based approach performs somewhere between the ray-based approach in unidirectional and point sampling the lowest mipmap level in unidirectional.
What surprised me is how closely the camera-based approach performed to the ray-based approach in the unidirectional case, especially since I chose a fairly aggressive widening heuristic (essentially a more aggressive version of the same heuristic that Matt Pharr uses in the texture cache branch of PBRTv3).</p>
<p>To help with visualizing what mipmap levels are being accessed, I implemented a new AOV in Takua that assigns colors to surfaces based on what mipmap level is accessed.
With camera-based mipmap level selection, this AOV shows simply what mipmap level is accessed by all rays that hit a given point on a surface.
Each mipmap level is represented by a different color, with support up to 12 mipmap levels.
The following two images show the accessed mipmap levels at 1080p and 2160p (4K); note how the 2160p render accesses lower mipmap levels more frequently than the 1080p render.
The pixel footprints in the higher resolution render are smaller when projected into world space, since more pixels have to pack into the same field of view.
The key below each image shows what mipmap level each color corresponds to:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_texcache.1080.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_texcache.1080.jpg" alt="Figure 3: Mipmap levels accessed for the forest scene from Figure 1, rendered at 1920x1080 resolution." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_texcache.4k.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_texcache.4k.jpg" alt="Figure 4: Mipmap levels accessed for the forest scene from Figure 2, rendered at 3840x2160 resolution. Note how since the render is higher resolution and therefore pixel footprints are smaller for the same field of view, lower mipmap levels are accessed more frequently compared to Figure 3." /></a></p>
<p>In general, everything looks as we would expect it to look in a working mipmapping system!
Surface points farther away from the camera are generally accessing higher mipmap levels, and surface points closer to the camera are generally accessing lower mipmap levels.
The ferns in the front of the frame access higher mipmap levels than the big fallen log in the center of the frame, even though the ferns are closer to the camera, because the textures for each leaf are extremely high resolution while the fern leaves are very small in screen space.
Surfaces that are viewed at highly glancing angles from the camera tend to access higher mipmap levels than surfaces that are camera-facing; this effect is easiest to see on the rocks in bottom front of the frame.
The interesting sudden shift in mipmap level on some of the tree trunks comes from the tree trunks using two different uv sets; the lower part of each tree trunk uses a different texture than the upper part, and the two main textures are blended using a mask in a different uv space from the main textures. Since the differential surface depends in part on the uv parameterization, different uv sets can result in different mipmap level selection behavior.</p>
<p>I also added a debug mode to Takua that tracks mipmap level access per texture sample.
In this mode, for a given texture, the renderer splats into an image the lowest accessed mipmap level for each texture sample.
The result is sort of a heatmap that can be overlaid on the original texture’s lowest mipmap level to see what parts of the texture are sampled at what resolution.
Figure 5 shows one of these heatmaps for the texture on the fallen log in the center of the frame:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/texture_rawaccess.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/texture_rawaccess.png" alt="Figure 5: Mipmap level access patterns for the texture in Figure 2. Colors correspond to mipmap levels using the same key as in Figures 3 and 4. Dark grey indicates areas of the texture that were not sampled at all." /></a></p>
<p>Just like in Figures 3 and 4, we can see that renders at higher resolutions will tend to access lower mipmap levels more frequently.
Also, we can see that the vast majority of the texture is never sampled at all; with a tiled texture caching system where tiles are loaded on demand, this means there are a large number of texture tiles that we never bother to load at all.
In cases like Figure 5, not loading unused tiles provides enormous memory savings compared to if we just loaded an entire non-mipmapped texture.</p>
<p>So far using a camera-based approach to mipmap level selection combined with just point sampling at each texture sample has held up very well in Takua!
In fact, the <a href="https://blog.yiningkarlli.com/2018/02/scandinavian-room-scene.html">Scandinavian Room</a> scene from earlier this year was rendered using the mipmap approach described in this post as well.
There is, however, a relatively simple type of scene that Takua’s camera-based approach fails badly at handling: refraction near the camera.
If a lens is placed directly in front of the camera that significantly magnifies part of the scene, a purely world-space metric for filter footprints can result in choosing mipmap levels that are too high, which translates to visible texture blurring or pixelation.
I don’t have anything implemented to handle this failure case right now.
One possible solution I’ve thought about is to initially trace a set of rays from the camera using traditional ray differential propagation for specular objects, and cache the resultant mipmap levels in the scene.
Then, during the actual renders, the renderer could compare the camera-based metric against the nearest N cached metrics to infer whether a lower mipmap level is needed than what the camera-based metric produces.
However, such a system would add significant cost to the mipmap level selection logic, and there are a number of implementation complications to consider.
I do wonder how Manuka handles the “lens in front of a camera” case as well, since the shade-before-hit paradigm also fails on this scenario for the same reasons.</p>
<p>Long term, I would like to spend more time looking into (and perhaps implementing) a covariance tracing based approach.
While Takua currently gets by with just point sampling, filtering becomes much more important for other effects, such as glinty microfacet materials, and covariance tracing based filtering seems to be the best currently known solution for these cases.</p>
<p>In an upcoming post, I’m aiming to write about how Takua’s texture caching system works in conjunction with the mipmapping system described in this post.
As mentioned earlier, I’m also planning a (hopefully) short-ish post about supporting multiple uv sets, and how that impacts a mipmapping and texture caching system.</p>
<p><strong>Additional Renders</strong></p>
<p>Finally, since this has been a very text-heavy post, here are some bonus renders of the same forest scene under different lighting conditions.
When I was setting up this scene for Takua, I tried a number of different lighting conditions and settled on the one in Figure 1 for the main render, but some of the alternatives were interesting too.
In a future post, I’ll show a bunch of interesting renders of this scene from different camera angles, but for now, here is the forest at different times of day:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_overcast.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_overcast.0.jpg" alt="Figure 6: The forest early on an overcast morning." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_morning.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_morning.0.jpg" alt="Figure 7: The forest later on a sunnier morning." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_noon.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_noon.0.jpg" alt="Figure 8: The forest at noon on a sunny blue sky day." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Oct/forest_sunset.0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2018/Oct/preview/forest_sunset.0.jpg" alt="Figure 9: The forest at sunset." /></a></p>
<p><strong>References</strong></p>
<p>John Amanatides. 1984. <a href="https://dl.acm.org/citation.cfm?id=808589">Ray Tracing with Cones</a>. <em>Computer Graphics (Proceedings of SIGGRAPH)</em> 18, 3 (1984), 129-135.</p>
<p>Laurent Belcour, Cyril Soler, Kartic Subr, Nicolas Holzschuch, and Frédo Durand. 2013. <a href="https://dl.acm.org/citation.cfm?id=2487239">5D Covariance Tracing for Efficient Defocus and Motion Blur</a>. <em>ACM Transactions on Graphics</em>. 32, 3 (2013), 31:1–31:18.</p>
<p>Laurent Belcour, Ling-Qi Yan, Ravi Ramamoorthi, and Derek Nowrouzezahrai. 2017. <a href="https://dl.acm.org/citation.cfm?id=2990495">Antialiasing Complex Global Illumination Effects in Path-Space</a>. <em>ACM Transactions on Graphics</em>. 36, 1 (2017), 9:1–9:13.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182159">The Design and Evolution of Disney’s Hyperion Renderer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 33:1-33:22.</p>
<p>Per Christensen, Julian Fong, Jonathan Shade, Wayne Wooten, Brenden Schubert, Andrew Kensler, Stephen Friedman, Charlie Kilpatrick, Cliff Ramshaw, Marc Bannister, Brenton Rayner, Jonathan Brouillat, and Max Liani. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182162">RenderMan: An Advanced Path-Tracing Architecture for Movie Rendering</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 30:1–30:21.</p>
<p>Per Christensen, David M. Laur, Julian Fong, Wayne Wooten, and Dana Batali. 2003. <a href="https://diglib.eg.org/handle/10.2312/8776">Ray Differentials and Multiresolution Geometry Caching for Distribution Ray Tracing in Complex Scenes</a>. <em>Computer Graphics Forum</em>. 22, 3 (2003), 543-552.</p>
<p>Robert L. Cook, Loren Carpenter, and Edwin Catmull. 1987. <a href="https://dl.acm.org/citation.cfm?id=37414">The Reyes Image Rendering Architecture</a>. <em>Computer Graphics (Proceedings of SIGGRAPH)</em> 21, 4 (1987), 95-102.</p>
<p>Frédo Durand, Nicolas Holzschuch, Cyril Soler, Eric Chan, and François X Sillion. 2005. <a href="https://dl.acm.org/citation.cfm?id=1073320">A Frequency Analysis of Light Transport</a>. <em>ACM Transactions on Graphics</em>. 24, 3 (2005), 1115-1126.</p>
<p>Luca Fascione, Johannes Hanika, Mark Leone, Marc Droske, Jorge Schwarzhaupt, Tomáš Davidovič, Andrea Weidlich and Johannes Meng. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182161">Manuka: A Batch-Shading Architecture for Spectral Path Tracing in Movie Production</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 31:1–31:18.</p>
<p>Iliyan Georgiev, Thiago Ize, Mike Farnsworth, Ramón Montoya-Vozmediano, Alan King, Brecht van Lommel, Angel Jimenez, Oscar Anson, Shinji Ogaki, Eric Johnston, Adrien Herubel, Declan Russell, Frédéric Servant, and Marcos Fajardo. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182160">Arnold: A Brute-Force Production Path Tracer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 32:1-32:12.</p>
<p>Larry Gritz and James K. Hahn. 1996. <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.1996.10487462">BMRT: A Global Illumination Implementation of the RenderMan Standard</a>. <em>Journal of Graphics Tools</em>. 1, 3 (1996), 29-47.</p>
<p>Paul S. Heckbert and Pat Hanrahan. 1984. <a href="https://dl.acm.org/citation.cfm?id=808588">Beam Tracing Polygonal Objects</a>. <em>Computer Graphics (Proceedings of SIGGRAPH)</em> 18, 3 (1984), 119-127.</p>
<p>Homan Igehy. 1999. <a href="https://graphics.stanford.edu/papers/trd/">Tracing Ray Differentials</a>. In <em>SIGGRAPH ‘99 (Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques)</em>. 179–186.</p>
<p>Wenzel Jakob. 2010. <a href="http://www.mitsuba-renderer.org/"><em>Mitsuba Renderer</em></a>.</p>
<p>Christopher Kulla, Alejandro Conty, Clifford Stein, and Larry Gritz. 2018. <a href="https://dl.acm.org/citation.cfm?id=3180495">Sony Pictures Imageworks Arnold</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 29:1-29:18.</p>
<p>Mark Lee, Brian Green, Feng Xie, and Eric Tabellion. 2017. <a href="https://dl.acm.org/citation.cfm?doid=3105762.3105768">Vectorized Production Path Tracing</a>. In <em>HPG ‘17 (Proceedings of High Performance Graphics)</em>. 10:1-10:11.</p>
<p>Darwyn Peachey. 1990. <a href="https://graphics.pixar.com/library/TOD/"><em>Texture on Demand</em></a>. Technical Report 217. Pixar Animation Studios.</p>
<p>Matt Pharr, Wenzel Jakob, and Greg Humphreys. 2016. <a href="http://www.pbr-book.org"><em>Physically Based Rendering:
From Theory to Implementation</em></a>, 3rd ed. Morgan Kaufmann.</p>
<p>Matt Pharr. 2017. <a href="https://www.pbrt.org/texcache.pdf"><em>The Implementation of a Scalable Texture Cache</em></a>. Physically Based Rendering Supplemental Material.</p>
<p>Dan Piponi. 2004. <a href="https://www.tandfonline.com/doi/abs/10.1080/10867651.2004.10504901">Automatic Differentiation, C++ Templates and Photogrammetry</a>. <em>Journal of Graphics Tools</em>. 9, 4 (2004), 41-55.</p>
<p>Mikio Shinya, Tokiichiro Takahashi, and Seiichiro Naito. 1987. <a href="https://dl.acm.org/citation.cfm?id=37408">Principles and Applications of Pencil Tracing</a>. <em>Computer Graphics (Proceedings of SIGGRAPH)</em> 21, 4 (1987), 45-54.</p>
<p>Frank Suykens and Yves. D. Willems. 2001. <a href="http://graphics.cs.kuleuven.be/publications/PATHDIFF/">Path Differentials and Applications</a>. In <em>Rendering Techniques 2001 (Proceedings of the 12th Eurographics Workshop on Rendering)</em>. 257–268.</p>
<p>Lance Williams. 1983. <a href="https://dl.acm.org/citation.cfm?id=801126">Pyramidal Parametrics</a>. <em>Computer Graphics (Proceedings of SIGGRAPH)</em> 17, 3 (1983), 1-11.</p>
https://blog.yiningkarlli.com/2018/08/hyperion-tog-paper.html
Transactions on Graphics Paper- The Design and Evolution of Disney's Hyperion Renderer
2018-08-17T00:00:00+00:00
2018-08-17T00:00:00+00:00
Yining Karl Li
<p>The August 2018 issue of <a href="https://tog.acm.org">ACM Transactions on Graphics</a> (Volume 37 Issue 3) is partially a special issue on production rendering, featuring five systems papers describing notable, major production renderers in use today.
I got to contribute to one of these papers as part of the Hyperion team at Walt Disney Animation Studios!
Our paper, titled “The Design and Evolution of Disney’s Hyperion Renderer”, discusses exactly what the title suggests.
We present a detailed look inside how Hyperion is designed today, discuss the decisions that went into its current design, and examine how Hyperion has evolved since the original EGSR 2013 “<a href="https://disney-animation.s3.amazonaws.com/uploads/production/publication_asset/70/asset/Sorted_Deferred_Shading_For_Production_Path_Tracing.pdf">Sorted Deferred Shading for Production Path Tracing</a>” paper that was the start of Hyperion.
A number of Hyperion developers contributed to this paper as co-authors, along with Hank Driskill, who was the technical supervisor on Big Hero 6 and Moana and was one of the key supporters of Hyperion’s early development and deployment.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Aug/design_of_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Aug/preview/design_of_hyperion.jpg" alt="Image from paper Figure 1: Production frames from Big Hero 6 (upper left), Zootopia (upper right), Moana (bottom left), and Olaf’s Frozen Adventure (bottom right), all rendered using Disney’s Hyperion Renderer." /></a></p>
<p>Here is the paper abstract:</p>
<p><em>Walt Disney Animation Studios has transitioned to path-traced global illumination as part of a progression of brute-force physically based rendering in the name of artist efficiency. To achieve this without compromising our geometric or shading complexity, we built our Hyperion renderer based on a novel architecture that extracts traversal and shading coherence from large, sorted ray batches. In this article, we describe our architecture and discuss our design decisions. We also explain how we are able to provide artistic control in a physically based renderer, and we demonstrate through case studies how we have benefited from having a proprietary renderer that can evolve with production needs.</em></p>
<p>The paper and related materials can be found at:</p>
<ul>
<li><a href="https://www.yiningkarlli.com/projects/hyperiondesign.html">Project Page (Author’s Version)</a></li>
<li><a href="https://dl.acm.org/citation.cfm?doid=3243123.3182159">Official Print Version (ACM Library)</a></li>
</ul>
<p>We owe a huge thanks to <a href="https://pharr.org/matt/">Matt Pharr</a>, who came up with the idea for a TOG special issue on production rendering and coordinated the writing of all of the papers, and <a href="http://www.cs.cornell.edu/~kb/">Kavita Bala</a>, who as editor-in-chief of TOG supported all of the special issue papers.
This issue has actually been in the works for some time; Matt Pharr contacted us over a year ago about putting together a special issue, and we began work on our paper in May 2017.
Matt and Kavita generously gave all of the contributors to the special issue a significant amount of time to write, and Matt provided a lot of valuable feedback and suggestions to all five of the final papers.
The end result is, in my opinion, something special indeed.
The five rendering teams that contributed papers in the end were Solid Angle’s Arnold, Sony Imageworks’ Arnold, Weta Digital’s Manuka, Pixar’s Renderman, and ourselves.
All five of the papers in the special issue are fascinating, well-written, highly technical rendering systems papers (as opposed to just marketing fluff), and absolutely worth a read!</p>
<p>Something important that I want to emphasize here is that the author lists for all five papers are somewhat deceptive.
One might think that the author lists represent all of the people responsible for each renderers’ success; this idea is, of course, inaccurate.
For Hyperion, the authors on this paper represent just a small fraction of all of the people responsible for Hyperion’s success.
Numerous engineers not on the author list have made significant contributions to Hyperion in the past, and the project relies enormously on all of the QA engineers, managers/leaders, TDs, artists, and production partners that test, lead, deploy, and use Hyperion every day.
We also owe an enormous amount to all of the researchers that we have collaborated directly with, or who we haven’t collaborated directly with but have used their work.
The success of every production renderer comes not just from the core development team, but instead from the entire community of folks that surround a production renderer; this is just as true for Hyperion as it is for Renderman, Arnold, Manuka, etc.
The following is often said in our field but nonetheless true: building an advanced production renderer in a reasonable timeframe really is only possible through a massive team effort.</p>
<p>This summer, in addition to publishing this paper, members of the Hyperion team also presented the following at SIGGRAPH 2018:</p>
<ul>
<li>Peter Kutz was on the “<a href="https://dl.acm.org/citation.cfm?id=3214901">Design and Implementation of Modern Production Renderers</a>” panel put together by Matt Pharr to discuss the five TOG production rendering papers. Originally Brent Burley was supposed to represent the Hyperion team, but due to some outside circumstances, Brent wasn’t able to make it to SIGGRAPH this year, so Peter went in Brent’s place.</li>
<li>Matt Jen-Yuan Chiang presented a talk on rendering eyes, titled “<a href="https://dl.acm.org/citation.cfm?id=3214751">Plausible Iris Caustics and Limbal Arc Rendering</a>”, in the “It’s a Material World” talks session.</li>
</ul>
https://blog.yiningkarlli.com/2018/07/disney-animation-datasets.html
Disney Animation Data Sets
2018-07-03T00:00:00+00:00
2018-07-03T00:00:00+00:00
Yining Karl Li
<p>Today at <a href="https://cg.ivd.kit.edu/egsr18/">EGSR 2018</a>, Walt Disney Animation Studios announced the release of two large, production quality/scale data sets for rendering research purposes.
The data sets are available on a new <a href="https://disneyanimation.com/data-sets/">data sets page on the official Disney Animation website</a>.
The first data set is the Cloud Data Set, which contains a large and highly detailed volumetric cloud data set that we used for our “<a href="https://blog.yiningkarlli.com/2017/07/spectral-and-decomposition-tracking.html">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>” SIGGRAPH 2017 paper, and the second data set is the Moana Island Scene, which is a full production scene from <a href="https://blog.yiningkarlli.com/2016/11/moana.html">Moana</a>.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/shotCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/shotCam_hyperion.jpg" alt="Figure 1: The Moana Island Data Set, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/wdas_cloud_hyperion_render.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/wdas_cloud_hyperion_render.jpg" alt="Figure 2: The Cloud Data Set, rendered using Disney's Hyperion Renderer." /></a></p>
<p>In this post, I’ll share some personal thoughts, observations, and notes.
The release of these data sets was announced by my teammate, Ralf Habel, at EGSR today, but this release has been in the works for a very long time now, and is the product of the collective effort of an enormous number of people across the studio.
A number of people deserve to be highlighted: Rasmus Tamstorf spearheaded the entire effort and was instrumental in getting the resources and legal approval needed for the Moana Island Scene.
Heather Pritchett is the TD that did the actual difficult work of extracting the Moana Island Scene out of Disney Animation’s production pipeline and converting it from proprietary data formats into usable, industry-standard data formats.
Sean Palmer and Jonathan Garcia also helped in resurrecting the data from Moana.
Hyperion developers Ralf Habel and Peter Kutz led the effort to get the Cloud Data Set approved and released; the cloud itself was made by artists Henrik Falt and Alex Nijmeh.
On the management side of things, technology manager Rajesh Sharma and Disney Animation CTO, <a href="https://twitter.com/ncannon?lang=en">Nick Cannon</a>, provided crucial support and encouragement.
Matt Pharr has been crucial in collaborating with us to get these data sets released.
Matt was highly accommodating in helping us get the Moana Island Scene into a PBRT scene; I’ll talk a bit more about this later.
Intel’s Embree team also gave significant feedback.
My role was actually quite small; along with other members of the Hyperion development team, I just provided some consultation throughout the whole process.</p>
<p>Please note the licenses that the data sets come with.
The Cloud Data Set is licensed under a <a href="https://disney-animation.s3.amazonaws.com/uploads/production/data_set_asset/6/asset/License_Cloud.pdf">Creative Commons Attribution ShareAlike 3.0 Unported License</a>; the actual cloud is based on a photograph by Kevin Udy on his <a href="https://coclouds.com/436/cumulus/%202012-07-26/">Colorado Clouds Blog</a>, which is also licensed under the same Creative Commons license.
The Moana Island Scene is licensed under a more restrictive, custom Disney Enterprises <a href="https://disney-animation.s3.amazonaws.com/uploads/production/data_set_asset/4/asset/License_Moana.pdf">research license</a>.
This is because the Moana Island Scene is a true production scene; it was actually used to produce frames in the final film.
As such, the data set is being released only for pure research and development purposes; it’s not meant for use in artistic projects.
Please stick to and follow the licenses these data sets are released under; if people end up misusing these data sets, then it makes releasing more data sets into the community in the future much harder for us.</p>
<p>This entire effort was sparked two years ago at SIGGRAPH 2016, when Matt Pharr made an appeal to the industry to provide representative production-scale data sets to the research community.
I don’t know how many times I’ve had conversations about how well new techniques or papers or technologies will scale to production cases, only to have further discussion stymied by the lack of any true production data sets that the research community can test against.
We decided as a studio to answer Matt’s appeal, and last year at SIGGRAPH 2017, Brent Burley and Rasmus Tamstorf announced our intention to release both the Cloud and Moana Island data sets.
It’s taken nearly a year from announcement to release because the process has been complex, and it was very important to the studio to make sure the release was done properly.</p>
<p>One of the biggest challenges was getting all of the data out of the production pipeline and our various proprietary data formats into something that the research community can actually parse and make use of.
Matt Pharr was extremely helpful here; over the past year, Matt has added support for <a href="http://ptex.us">Ptex</a> textures and implemented the <a href="http://blog.selfshadow.com/publications/s2015-shading-course/burley/s2015_pbs_disney_bsdf_notes.pdf">Disney Bsdf</a> in <a href="https://github.com/mmp/pbrt-v3">PBRT v3</a>.
Having Ptex and the Disney Bsdf available in PBRT v3 made PBRT v3 the natural target for an initial port to a renderer other than Hyperion, since internally all of Hyperion’s shading uses the Disney Bsdf, and all of our texturing is done through Ptex.
Our texturing also relies heavily on procedural <a href="https://www.disneyanimation.com/technology/seexpr.html">SeExpr</a> expressions; all of the expression-driven texturing had to be baked down into Ptex for the final release.</p>
<p>Both the Cloud and Moana Island data sets are, quite frankly, enormous.
The Cloud data set contains a single OpenVDB cloud that weighs in at 2.93 GB; the data set also provides versions of the VDB file scaled down to half, quarter, eighth, and sixteenth scale resolutions.
The Moana Island data set comes in three parts: a base package containing raw geometry and texture data, an animation package containing animated stuff, and a PBRT package containing a PBRT scene generated from the base package.
These three packages combined, uncompressed, weigh in at well over 200 GB of disk space; the uncompressed PBRT package alone weighs in at around 38 GB.</p>
<p>For the Moana Island Scene, the provided PBRT scene requires a minimum of around 90 GB of RAM to render.
This may seem enormous for consumer machines, because it is.
However, this is also what we mean by “production scale”; for Disney Animation, 90 GB is actually a fairly mid-range memory footprint for a production render.
On a 24-core, dual-socket Intel Xeon Gold 6136 system, the PBRT scene took me a little over an hour and 15 minutes to render from the ‘shotCam’ camera.
Hyperion renders the scene faster, but I would caution against using this data set to do performance shootouts between different renderers.
I’m certain that within a short period of time, enthusiastic members of the rendering community will end up porting this scene to Renderman and Arnold and Vray and Cycles and every other production renderer out there, which will be very cool!
But keep in mind, this data set was authored very specifically around Hyperion’s various capabilities and constraints, which naturally will be very different from how one might author a complex data set for other renderers.
Every renderer works a bit differently, so the most optimal way to author a data set for every renderer will be a bit different; this data set is no exception.
So if you want to compare renderers using this data set, make sure you understand how the way this data set is structured impacts the performance of whatever renderers you are comparing.</p>
<p>For example, Hyperion subdivides/tessellates/displaces everything to as close to sub-poly-per-pixel as it can get while still fitting within computational resources.
This means our scenes are usually very heavily subdivided and tessellated.
However, the PBRT version of the scene doesn’t come with any subdivision; as a result, silhouettes in the following comparison images don’t fully match in some areas.
Similarly, PBRT’s lights and lighting model differ from Hyperion’s, and Hyperion has various artistic controls that are unique to Hyperion, meaning the renders produced by PBRT versus Hyperion differ in many ways:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/shotCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/shotCam_hyperion.jpg" alt="Figure 3a: 'shotCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/shotCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/shotCam_pbrt.jpg" alt="Figure 3b: 'shotCam' camera angle, rendered using PBRT v3." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/beachCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/beachCam_hyperion.jpg" alt="Figure 4a: 'beachCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/beachCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/beachCam_pbrt.jpg" alt="Figure 4b: 'beachCam' camera angle, rendered using PBRT v3." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/dunesACam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/dunesACam_hyperion.jpg" alt="Figure 5a: 'dunesACam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/dunesACam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/dunesACam_pbrt.jpg" alt="Figure 5b: 'dunesACam' camera angle, rendered using PBRT v3. Some of the plants are in slightly different locations than the Hyperion render; this was just a small change that happened in data conversion to the PBRT scene." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/flowersCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/flowersCam_hyperion.jpg" alt="Figure 6a: 'flowersCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/flowersCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/flowersCam_pbrt.jpg" alt="Figure 6b: 'flowersCam' camera angle, rendered using PBRT v3. Note that the silhouette of the flowers is different compared to the Hyperion render because the Hyperion render subdivides the flowers, whereas the PBRT render displays the base cage." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/grassCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/grassCam_hyperion.jpg" alt="Figure 7a: 'grassCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/grassCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/grassCam_pbrt.jpg" alt="Figure 7b: 'grassCam' camera angle, rendered using PBRT v3. The sand dune in the background looks particularly different from the Hyperion render due to subdivision and displacement." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/palmsCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/palmsCam_hyperion.jpg" alt="Figure 8a: 'palmsCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/palmsCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/palmsCam_pbrt.jpg" alt="Figure 8b: 'palmsCam' camera angle, rendered using PBRT v3. The palm leaves look especially different due to differences in artistic lighting shaping and curve shading differences. Most notably, the look in Hyperion depends heavily on attributes that vary along the length of the curve, which is something PBRT doesn't support yet. Some more work is needed here to get the palm leaves to look more similar between the two renders." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/rootsCam_hyperion.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/rootsCam_hyperion.jpg" alt="Figure 9a: 'rootsCam' camera angle, rendered using Disney's Hyperion Renderer." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Jul/rootsCam_pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Jul/preview/rootsCam_pbrt.jpg" alt="Figure 9b: 'rootsCam' camera angle, rendered using PBRT v3. Again, the significant difference in appearance in the rocks is probably just due to subdivision/tesselation/displacement." /></a></p>
<p>Another example of a major difference between the Hyperion renders and the PBRT renders is in the water, which Hyperion renders using photon mapping to get the caustics.
The provided PBRT scenes use unidirectional pathtracing for everything including the water, hence the very different caustics.
Similarly, the palm trees in the ‘palmsCam’ camera angle look very different between PBRT and Hyperion because Hyperion’s lighting controls are very different from PBRT; Hyperion’s lights include various artistic controls for custom shaping and whatnot, which aren’t necessarily fully physical.
Also, the palm leaves are modeled using curves, and the shading depends on varying colors and attributes along the length and width of the curve, which PBRT doesn’t support yet (getting the palm leaves to match is actually the top priority if more resources are freed up to improve the data set release).
These differences between renderers don’t necessarily mean that one renderer is better than the other; they simply mean that the renderers are different.
This will be true for any pair of renderers that one wants to compare.</p>
<p>The Cloud Data Set includes an example render from Hyperion, which implements our Spectral and Decomposition Tracking paper in its volumetric rendering system to efficiently render the cloud with thousands of bounces.
This render contains no post-processing; what you see in the provided image is exactly what Hyperion outputs.
The VDB file expresses the cloud as a field of heterogeneous densities.
Also provided is an example <a href="https://www.mitsuba-renderer.org">Mitsuba</a> scene, renderable using the <a href="https://github.com/zhoub/mitsuba-vdb">Mitsuba-VDB plugin that can be found on Github</a>.
Please consult the README file for some modifications in Mitsuba that are necessary to render the cloud.
Also, please note that the Mitsuba example will take an extremely long time to render, since Mitsuba isn’t really meant to render high-albedo heterogeneous volumes.
With proper acceleration structures and algorithms, rendering the cloud only takes us a few minutes using Hyperion, and should be similarly fast in any modern production renderer.</p>
<p>One might wonder just why production data sets in general are so large.
This is an interesting question; the short answer across the industry basically boils down to “artist time is more expensive and valuable than computer hardware”.
We could get these scenes to fit into much smaller footprints if we were willing to make our artists spend a lot of time aggressively optimizing assets and scenes and whatnot so that we could fit these scenes into smaller disk, memory, and compute footprints.
However, this isn’t actually always a good use of artist time; computer hardware is cheap compared to wasting artist time, which often could be better spent elsewhere making the movie better.
Throwing more memory and whatnot at huge data sets is also simply more scalable than using more artist resources, relatively speaking.</p>
<p>Both data sets come with detailed README documents; the Moana Island Scene’s documentation in particular is quite extensive and contains a significant amount of information about how assets are authored and structured at Disney Animation, and how renders are lit, art-directed, and assembled at Disney Animation.
I highly recommend reading all of the documentation carefully if you plan on working with these data sets, or just if you are generally curious about how production scenes are built at Disney Animation.</p>
<p>Personally, I’m very much looking forward to seeing what the rendering community (and the wider computer graphics community at large) does with these data sets!
I’m especially excited to see what the realtime world will be able to do with this data; seeing the Moana Island Scene in its full glory in Unreal Engine 4 or Unity would be something indeed, and I think these data sets should provide a fantastic challenge to research into light transport and ray tracing speed as well.
If you do interesting things with these data sets, please write to us at the email addresses in the provided README files!</p>
<p>Also, Matt Pharr <a href="http://pharr.org/matt/blog/2018/07/08/moana-island-pbrt-1.html">has written on his blog</a> about how the Moana Island Scene has further driven the development of PBRT v3.
I highly recommend giving Matt’s blog a read!</p>
https://blog.yiningkarlli.com/2018/02/scandinavian-room-scene.html
Scandinavian Room Scene
2018-02-23T00:00:00+00:00
2018-02-23T00:00:00+00:00
Yining Karl Li
<p>Almost three years ago, I rendered a small <a href="https://blog.yiningkarlli.com/2015/05/complex-room-renders.html">room interior scene</a> to test an indoor, interior illumination scenario.
Since then, a lot has changed in Takua, so I thought I’d revisit an interior illumination test with a much more complex, difficult scene.
I don’t have much time to model stuff anymore these days, so instead I bought <a href="https://evermotion.org/shop/show_product/archinteriors-vol-48/14307">Evermotion’s Archinteriors Volume 48</a> collection, which is labeled as Scandinavian interior room scenes (I don’t know what’s particularly Scandinavian about these scenes, but that’s what the label said) and ported one of the scenes to Takua’s scene format.
Instead of simply porting the scene as-is, I modified and added various things in the scene to make it feel a bit more customized.
See if you can spot what they are:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam0.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam0.0.jpg" alt="Figure 1: A Scandinavian room interior, rendered in Takua a0.8 using VCM." /></a></p>
<p>I had a lot of fun adding all of my customizations!
I brought over some props from the old complex room scene, such as the purple flowers and vase, a few books, and the Utah teapot tea set, and also added a few new fun models, such as the MacBook Pro in the back and the copy of Physically Based Rendering 3rd Edition in the foreground.
The black and white photos on the wall are crops of my <a href="https://blog.yiningkarlli.com/2016/07/minecraft-in-renderman-ris.html">Minecraft renders</a>, and some of the books against the back wall have fun custom covers and titles.
Even all of the elements that came with the original scene are re-shaded.
The original scene came with Vray’s standard VrayMtl as the shader for everything; Takua’s base shader parameterization draws some influence from Vray, but also from the Disney Bsdf and Arnold’s AlShader, and as a result is sufficiently different that I wound up just re-shading everything instead of trying to write a conversion tool.
For the most part I was able to re-use the textures that came with the scene to drive various shader parameters.
The skydome is from the noncommercial version of <a href="https://www.viz-people.com/shop/hdri-v1/">VizPeople’s HDRi v1 collection</a>.</p>
<p>Speaking of the skydome… the main source of illumination in this scene comes from the sun in the skydome, which presented a huge challenge for efficient light sampling.
Takua has had domelight/environment map importance sampling using CDF inversion sampling for a long time now, which helps a lot, but the indoor nature of this scene still made sampling the sun difficult.
Sampling the sun in an outdoor scene is fairly efficient since most rays will actually reach the sun, but in indoor scenes, importance sampling the sun becomes inefficient without taking occlusion into account since only rays that actually make it outdoors through windows can reach the sun.
The best known method currently for handling domelight importance sampling through windows in an indoor scene is <a href="https://benedikt-bitterli.me/PMEMS.pdf">Portal Masked Environment Map Sampling (PMEMS) by Bitterli et al</a>.
I haven’t actually implemented PMEMS yet though, so the renders in this post all wound up requiring a huge number of samples per pixel to render; I intend on implementing PMEMS at some point in the near future.</p>
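<p>As a rough illustration of the CDF inversion approach mentioned above (this is a self-contained sketch, not Takua’s actual implementation), the basic idea is to build a marginal CDF over the rows of the latitude-longitude environment map and a conditional CDF within each row, both weighted by luminance (and by sin(theta) to account for texel solid angle shrinking towards the poles), and then invert both CDFs with a pair of uniform random numbers; mapping the chosen texel back to a direction and pdf is omitted here for brevity:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of environment map importance sampling via CDF inversion. The input
// is a flattened height x width array of per-texel luminance values from a
// latitude-longitude environment map.
struct EnvMapSampler {
    int width, height;
    std::vector<float> marginalCdf;                 // one entry per row
    std::vector<std::vector<float>> conditionalCdf; // one CDF per row

    EnvMapSampler(const std::vector<float>& luminance, int w, int h)
        : width(w), height(h), marginalCdf(h), conditionalCdf(h, std::vector<float>(w)) {
        const float pi = 3.14159265358979f;
        float marginalTotal = 0.0f;
        for (int y = 0; y < h; y++) {
            // sin(theta) weighting accounts for texel solid angle near the poles
            const float sinTheta = std::sin(pi * (y + 0.5f) / float(h));
            float rowTotal = 0.0f;
            for (int x = 0; x < w; x++) {
                rowTotal += luminance[y * w + x] * sinTheta;
                conditionalCdf[y][x] = rowTotal;
            }
            const float rowNorm = std::max(rowTotal, 1e-8f);
            for (int x = 0; x < w; x++) {
                conditionalCdf[y][x] /= rowNorm;
            }
            marginalTotal += rowTotal;
            marginalCdf[y] = marginalTotal;
        }
        for (int y = 0; y < h; y++) {
            marginalCdf[y] /= std::max(marginalTotal, 1e-8f);
        }
    }

    // Invert the marginal CDF to pick a row, then the row's conditional CDF to
    // pick a column; brighter texels are picked proportionally more often.
    void sample(float u1, float u2, int* x, int* y) const {
        *y = int(std::lower_bound(marginalCdf.begin(), marginalCdf.end(), u1) - marginalCdf.begin());
        *y = std::min(*y, height - 1);
        const std::vector<float>& row = conditionalCdf[*y];
        *x = int(std::lower_bound(row.begin(), row.end(), u2) - row.begin());
        *x = std::min(*x, width - 1);
    }
};
</code></pre></div></div>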
<p>Apart from the skydome, this scene also contains several other practical light sources, such as the lamp’s bulb, the MacBook Pro’s screen, and the MacBook Pro’s glowing Apple logo on the back of the screen (which isn’t even visible to camera, but is still enabled since it provides a tiny amount of light against the back wall!).
In addition to choosing where on a single light to sample, choosing which light to sample is also an extremely important and difficult problem.
Until rendering this scene, I hadn’t really put any effort into efficiently selecting which light to sample.
Most of my focus has been on the integration part of light transport, so Takua’s light selection has just been uniform random selection.
Uniform random selection is terrible for scenes that contain multiple lights with highly varying emission between different lights, which is absolutely the case for this scene.
Like any other importance sampling problem, the ideal solution is to send rays towards lights with a probability proportional to the amount of illumination we expect each light to contribute to each ray origin point.</p>
<p>I implemented a light selection strategy where the probability of selecting each light is weighted by the total emitted power of each light; essentially this boils down to estimating the total emitted power of each light according to the light’s surface texture and emission function, building a CDF across all of the lights using the total emission estimates, and then using standard CDF inversion sampling to pick lights.
This strategy works significantly better than uniform random selection and made a huge difference in render speed for this scene, as seen in Figures 2 through 4.
Figure 2 uses uniform random light selection with 128 spp; note how the area lit by the wall-mounted lamp is well sampled, but the image overall is really noisy.
Figure 3 uses power-weighted light selection with the same spp as Figure 2; the lamp area is more noisy than in Figure 2, but the render is less noisy overall.
Notably, Figure 3 also took a third of the time compared to Figure 2 for the same sample count; this is because in this scene, sending rays towards the lamp is significantly more expensive due to heavier geometry than sending rays towards the sun, even when rays towards the sun get occluded by the walls.
Figure 4 uses power-weighted light selection again, but is equal-time to Figure 2 instead of equal-spp; note the significant noise reduction:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.0.uniform.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.0.uniform.jpg" alt="Figure 2: The same frame from Figure 1, 128 spp using uniform random light selection. Average pixel RMSE compared to Figure 1: 0.439952." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.0.power.equalsample.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.0.power.equalsample.jpg" alt="Figure 3: Power-weighted light selection, with equal spp to Figure 2. Average pixel RMSE compared to Figure 1: 0.371441." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.0.power.equaltime.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.0.power.equaltime.jpg" alt="Figure 4: Power-weighted light selection again, but this time with equal time instead of equal spp to Figure 2. Average pixel RMSE compared to Figure 1: 0.315465." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room_sampling_crops.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/room_sampling_crops.png" alt="Figure 5: Zoomed crops of Figures 2 through 4. From left to right: uniform random sampling, equal sample power-weighted sampling, and equal time power-weighted sampling." /></a></p>
<p>However, power-weighted light selection still is not even close to being the most optimal technique possible; this technique completely ignores occlusion and distance, which are extremely important.
Unfortunately, because occlusion and distance to each light varies for each point in space, creating a light selection strategy that takes occlusion and distance into account is extremely difficult and is a subject of continued research in the field.
In Hyperion, we use a cache point system, which we described on page 97 of our <a href="https://graphics.pixar.com/library/ProductionVolumeRendering/paper.pdf">SIGGRAPH 2017 Production Volume Rendering course notes</a>.
Other published research on the topic includes <a href="https://cgl.ethz.ch/publications/papers/paperMue17a.php">Practical Path Guiding for Efficient Light-Transport Simulation</a> by Muller et al, <a href="http://cgg.mff.cuni.cz/~jaroslav/papers/2014-onlineis/">On-line Learning of Parametric Mixture Models for Light Transport Simulation</a> by Vorba et al, <a href="http://cgg.mff.cuni.cz/~jaroslav/papers/2016-productis/2016-productis-paper.pdf">Product Importance Sampling for Light Transport Path Guiding</a> by Herholz et al, <a href="https://arxiv.org/abs/1701.07403">Learning Light Transport the Reinforced Way</a> by Dahm et al, and more.
At some point in the future I’ll revisit this topic.</p>
<p>For a long time now, Takua has also had a simple interactive mode where the camera can be moved around in a non-shaded/non-lit view; I used this mode to interactively scout out some interesting and fun camera angles for some more renders.
Being able to interactively scout in the same renderer used for final rendering is an extremely powerful tool; instead of guessing at depth of field settings and such, I was able to directly set and preview depth of field with immediate feedback.
Unfortunately some of the renders below are noisier than I would like, due to the previously mentioned light sampling difficulties.
All of the following images are rendered using Takua a0.8 with VCM:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam1.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam1.0.jpg" alt="Figure 6: A MacBook Pro running Takua Renderer to produce Figure 1." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam2.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam2.0.jpg" alt="Figure 7: Physically Based Rendering Third Edition sitting on the coffee table." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam3.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam3.0.jpg" alt="Figure 8: Closeup of the same purple flowers from the old Complex Room scene." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam4.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam4.0.jpg" alt="Figure 9: Utah Teapot tea set on the coffee table." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam5.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam5.0.jpg" alt="Figure 10: A glass globe with mirror-polished metal continents, sitting in the sunlight from the window." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam6.0.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam6.0.jpg" alt="Figure 11: Close-up of two glass and metal mugs filled with tea." /></a></p>
<p>Beyond difficult light sampling, generally complex and difficult light transport with lots of subtle caustics also wound up presenting major challenges in this scene.
For example, note the subtle caustics on the wall in the upper right hand part of Figure 10; those caustics are actually visibly not fully converged, even though the sample count across Figure 10 was in the thousands of spp!
I intentionally did not use adaptive sampling in any of these renders; instead, I wanted to experiment with a common technique used in a lot of modern production renderers for noise reduction: in-render firefly clamping.
My adaptive sampler is already capable of detecting firefly pixels and driving more samples at fireflies in the hopes of accelerating variance reduction on firefly pixels, but firefly clamping is a much more crude, biased, but nonetheless effective technique.
The idea is to detect, as each new sample arrives at a pixel, whether the returned sample is an outlier relative to all of the previously accumulated samples for that pixel, and discard or clamp the sample if it is in fact an outlier.
Picking what threshold to use for outlier detection is a very manual process; even Arnold provides a <a href="https://support.solidangle.com/display/AFMUG/Clamping">tuning max-value parameter</a> for firefly clamping.</p>
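<p>For a sense of what such an outlier test can look like in practice, here is a simplified sketch of per-pixel firefly clamping built around a plain running-mean pixel estimate; the threshold factor is the manually tuned knob discussed above, and this is just an illustrative sketch rather than Takua’s actual accumulation code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <algorithm>
#include <glm/glm.hpp>

// Simplified per-pixel firefly clamping. A new sample whose luminance exceeds
// the running pixel estimate by more than thresholdFactor is scaled down so
// that its luminance sits at the threshold; this introduces bias, but
// suppresses high-energy outlier samples.
struct PixelAccumulator {
    glm::vec3 accum = glm::vec3(0.0f);
    int count = 0;

    static float luminance(const glm::vec3& c) {
        return 0.2126f * c.x + 0.7152f * c.y + 0.0722f * c.z;
    }

    void addSample(const glm::vec3& sample, float thresholdFactor) {
        glm::vec3 accepted = sample;
        if (count > 0) {
            const float estimateLum = luminance(accum / float(count));
            const float sampleLum = luminance(sample);
            const float threshold = thresholdFactor * std::max(estimateLum, 1e-4f);
            if (sampleLum > threshold) {
                // Clamp the outlier sample down to the threshold luminance
                accepted = sample * (threshold / sampleLum);
            }
        }
        accum += accepted;
        count++;
    }

    glm::vec3 estimate() const {
        return (count > 0) ? accum / float(count) : glm::vec3(0.0f);
    }
};
</code></pre></div></div>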
<p>I wanted to be able to directly compare the render with and without firefly clamping, so I implemented firefly clamping on top of Takua’s AOV system.
When enabled, firefly clamping mode produces two images for a single render: one output with firefly clamping enabled, and one with clamping disabled.
I tried re-rendering Figure 10 using unidirectional pathtracing and a relatively low spp count to produce as many fireflies as I could, for a clearer comparison.
For this test, I set the firefly threshold to be samples that are at least 250 times brighter than the estimated pixel value up to that sample.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam5.fireflies.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam5.fireflies.jpg" alt="Figure 12: The same render as Figure 10, but rendered with a lower sample count and using unidirectional pathtracing instead of VCM to draw out more fireflies." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2018/Feb/room.cam5.nofireflies.png"><img src="https://blog.yiningkarlli.com/content/images/2018/Feb/preview/room.cam5.nofireflies.jpg" alt="Figure 13: From the same run of Takua Renderer as Figure 12, but the firefly-clamped render output instead of the raw render." /></a></p>
<p>Note how Figure 13 appears to be completely firefly-free compared to Figure 12, and how Figure 13 doesn’t have visible caustic noise on the walls compared to Figure 10.
However, notice how Figure 13 is also missing significant illumination in some areas, such as in the corner of the walls near the floor behind the wooden step ladder, or in the deepest parts of the purple flower bunch.
Finding a threshold that eliminates all fireflies without losing significant illumination in other areas is very difficult or, in some cases, impossible, since some of these types of light transport essentially manifest as firefly-like high energy samples that only smooth out over time.
For the final renders in Figure 1 and Figures 6 through 11, I wound up not actually using any firefly clamping.
While biased noise-reduction techniques are a necessary evil in actual production, I expect that I’ll try to avoid relying on firefly clamping in the vast majority of what I do with Takua, since Takua is meant to just be a brute-force, hobby kind of thing anyway.</p>
https://blog.yiningkarlli.com/2017/12/lambo-renders-revisited.html
Aventador Renders Revisited
2017-12-03T00:00:00+00:00
2017-12-03T00:00:00+00:00
Yining Karl Li
<p>A long time ago, I made <a href="http://blog.yiningkarlli.com/2013/03/stratified-versus-uniform-sampling.html">some</a> <a href="http://blog.yiningkarlli.com/2013/03/first-progress-on-new-pathtracing-core.html">posts</a> that featured a cool Lamborghini Aventador model.
Recently, I revisited that model and made some new renders using the current version of Takua, mostly just for fun.
To me, one of the most important parts of writing a renderer has always been being able to actually use the renderer to make fun images.
The last time I rendered this model was something like four years ago, and back then Takua was still in a very basic state; the renders in those old posts don’t even have any shading beyond 50% grey lambertian surfaces!
The renders in this post utilize a lot of advanced features that I’ve added since then, such as a proper complex layered Bsdf and texturing system, advanced bidirectional light transport techniques, huge speed improvements to ray traversal, advanced motion blur and generalized time capabilities, and more.
I’m way behind in writing up many of these features and capabilities, but in the meantime, I thought I’d post some for-fun rendering projects I’ve done with Takua.</p>
<p>All of the renders in this post are directly from Takua, with a basic white balance and conversion from HDR EXR to LDR PNG being the only post-processing steps.
Each render took about half a day to render (except for the wireframe render, which was much faster) on a 12-core workstation at 2560x1440 resolution.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_orangered.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_orangered.jpg" alt="Figure 1: An orange-red Lamborghini Aventador, rendered in Takua a0.7 using VCM." /></a></p>
<p>Shading the Aventador model was a fun, interesting exercise.
I went for an orange-red paint scheme since, well, Lamborghinis are supposed to look outrageous and orange-red is a fairly exotic paint scheme (I suppose I could have picked green or yellow or something instead, but I like orange-red).
I ended up making a triple-lobe shader with a metallic base, a dielectric lobe, and a clear-coat lobe on top of that.
The base lobe uses a GGX microfacet metallic Brdf.
Takua’s shading system implements a proper metallic Fresnel model for conductors, where the Fresnel model includes both an <em>Nd</em> component representing the refractive index and a <em>k</em> component representing the extinction coefficient for when an electromagnetic wave propagates through a material.
For conductors, the final Fresnel index of refraction for each wavelength of light is defined by a complex combination of <em>Nd</em> and <em>k</em>.
For the base metallic lobe, most of the color wound up coming from the <em>k</em> component.
The dielectric lobe is meant to simulate paint on top of a car’s metal body; the dielectric lobe is where most of the orange-red color comes from.
The dielectric lobe is again a GGX microfacet Brdf, but with a dielectric Fresnel model, which has a much simpler index of refraction calculation than the metallic Fresnel model does.
I should note that Takua’s current standard material implementation actually only supports a single primary specular lobe and an additional single clear-coat lobe, so for shaders authored with both a metallic and dielectric component, Takua takes a blend weight between the two components and for each shading evaluation stochastically selects between the two lobes according to the blend weight.
The clear-coat layer on top has just a slight amount of extinction to provide a bit more of the final orange look, but is otherwise mostly clear.</p>
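<p>For reference, here is the common approximate form of the unpolarized Fresnel reflectance for a conductor at a single wavelength, written in terms of the <em>Nd</em> (here, eta) and <em>k</em> components discussed above; this is the standard textbook approximation assuming the incident medium has an index of refraction of 1, and isn’t necessarily the exact formulation Takua uses:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Approximate unpolarized Fresnel reflectance for a conductor at a single
// wavelength. eta is the refractive index component (Nd) and k is the
// extinction coefficient; the incident medium is assumed to have an index of 1.
float fresnelConductor(float cosThetaI, float eta, float k) {
    const float cos2 = cosThetaI * cosThetaI;
    const float etaK2 = eta * eta + k * k;

    // Squared reflectance for light polarized parallel and perpendicular to
    // the plane of incidence
    const float rParl2 = (etaK2 * cos2 - 2.0f * eta * cosThetaI + 1.0f) /
                         (etaK2 * cos2 + 2.0f * eta * cosThetaI + 1.0f);
    const float rPerp2 = (etaK2 - 2.0f * eta * cosThetaI + cos2) /
                         (etaK2 + 2.0f * eta * cosThetaI + cos2);

    // Unpolarized light: average of the two polarizations
    return 0.5f * (rParl2 + rPerp2);
}
</code></pre></div></div>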
<p>All of the window glass in the render is tinted slightly dark through extinction instead of through a fixed refraction color.
Using proper extinction to tint glass is more realistic than using a fixed refraction color.
Similarly, the red and yellow glass used in the head lights and tail lights are colored through extinction.
The brake disks use an extremely high resolution bump map to get the brushed metal look.
The branding and markings on the tire walls are done through a combination of bump mapping and adjusting the roughness of the microfacet Brdf; the tire treads are made using a high resolution normal map.
There’s no <a href="http://blog.yiningkarlli.com/2017/05/subdivision-and-displacement.html">displacement mapping</a> at all, although in retrospect the tire treads probably should be displacement mapped if I want to put the camera closer to them.
Also, I actually didn’t really shade the interior of the car much, since I knew I was going for exterior shots only.</p>
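<p>As a quick aside on what tinting through extinction means in practice: transmittance through an absorbing medium follows Beer-Lambert falloff, so the tint deepens with the distance light travels through the glass instead of being a fixed refraction color. A tiny sketch, with a hypothetical helper for deriving the absorption coefficient from a target color at a chosen reference thickness:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <glm/glm.hpp>

// Beer-Lambert transmittance through an absorbing medium: longer paths
// through the glass absorb more light, so the tint deepens with thickness.
glm::vec3 transmittance(const glm::vec3& sigmaA, float distance) {
    return glm::exp(-sigmaA * distance);
}

// One common way to author sigmaA: specify the color the glass should appear
// at some reference thickness, then solve the Beer-Lambert equation for sigmaA.
glm::vec3 absorptionFromColor(const glm::vec3& targetColor, float referenceDistance) {
    return -glm::log(glm::max(targetColor, glm::vec3(1e-4f))) / referenceDistance;
}
</code></pre></div></div>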
<p>Eventually I’ll get around to implementing a proper car paint Bsdf in Takua, but until then, the approach I took here seems to hold up reasonably well as long as the camera doesn’t get super close up to the car.</p>
<p>I lit the scene using two lights: an HDR skydome from <a href="http://hdri-skies.com">HDRI-Skies</a>, and a single long, thin rectangular area light above the car.
The skydome provides the overall soft-ish lighting that illuminates the entire scene, and the rectangular area light provides the long, interesting highlights on the car body that help with bringing out the car’s shape.</p>
<p>For all of the renders in this post, I used my VCM integrator, since the scene contains a lot of subtle caustics and since the inside of the car is lit entirely through glass.
I also wound up modifying my <a href="http://blog.yiningkarlli.com/2015/03/adaptive-sampling.html">adaptive sampler</a>; it’s still the same adaptive sampler that I’ve had for a few years now, but with an important extension.
Instead of simply reducing the total number of paths per iteration as areas reach convergence, the adaptive sampler now keeps the number of paths the same and instead reallocates paths from completed pixels to high-variance pixels.
The end result is that the adaptive sampler is now much more effective at eliminating fireflies and targeting caustics and other noisy areas.
In the above render, some pixels wound up with as few as 512 samples, while a few particularly difficult pixels finished with as many as 20000 samples.
Here is the adaptive sampling heatmap for Figure 1 above; brighter areas indicate more samples. Note how the adaptive sampler found a number of areas that we’d expect to be challenging, such as the interior through the car’s glass windows, and parts of the body with specular inter-reflections.</p>
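<p>To illustrate the reallocation idea in the simplest possible terms (this is a toy sketch, not how Takua’s adaptive sampler is actually structured): keep the per-iteration path budget fixed, drop converged pixels from the active set, and hand out the whole budget to the remaining pixels in proportion to their variance estimates:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include <vector>

// Toy sketch of reallocating a fixed per-iteration path budget away from
// converged pixels. 'variance' holds a per-pixel variance estimate; pixels at
// or below the convergence threshold receive no further samples, and the
// entire budget is redistributed proportionally to variance among the rest.
std::vector<uint32_t> allocateSamples(const std::vector<float>& variance,
                                      float convergenceThreshold,
                                      uint64_t totalPathBudget) {
    std::vector<uint32_t> samples(variance.size(), 0);
    double activeVarianceSum = 0.0;
    for (float v : variance) {
        if (v > convergenceThreshold) {
            activeVarianceSum += v;
        }
    }
    if (activeVarianceSum <= 0.0) {
        return samples; // everything has converged; nothing left to allocate
    }
    for (size_t i = 0; i < variance.size(); i++) {
        if (variance[i] > convergenceThreshold) {
            const double share = variance[i] / activeVarianceSum;
            samples[i] = uint32_t(share * double(totalPathBudget));
        }
    }
    return samples;
}
</code></pre></div></div>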
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_sampleMask.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_sampleMask.jpg" alt="Figure 2: Adaptive sampling heatmap for Figure 1. Brighter areas indicate more samples." /></a></p>
<p>I recently implemented support for arbitrary camera shutter curves, so I thought doing a motion blurred render would be fun.
After all, Lamborghinis are supposed to go fast!
I animated the Lamborghini driving forward in Maya; the animation was very basic, with the main body just translating forward and the wheels both translating and rotating.
Of course Takua has proper rotational motion blur.
The motion blur here is effectively multi-segment motion blur; generating multi-segment motion blur from an animated sequence in Takua is very easy due to how Takua handles and understands time.
I actually think that Takua’s concept of time is one of the most unique things in Takua; it’s very different from how every other renderer I’ve used and seen handles time.
I intend to write more about this later.
Instead of an instantaneous shutter, I used a custom cosine-based shutter curve that places many more time samples near the center of the shutter interval than towards the shutter open and close.
Using a shutter shape like this wound up being important to getting the right look to the motion blur; even though the car is moving extremely quickly, the overall form of the car is still clearly distinguishable and the front and back of the car appear more motion-blurred than the main body.</p>
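<p>The key operation behind a non-uniform shutter is warping uniform random numbers into time samples whose density follows the shutter curve. For a curve proportional to sin(pi*t) over a normalized shutter interval [0, 1], which peaks at the center of the exposure and falls to zero at shutter open and close, the inverse-CDF warp has a simple closed form; the specific curve here is an assumed stand-in rather than the exact curve I used:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>

// Warp a uniform random number u in [0,1) into a shutter time t in [0,1]
// distributed proportionally to sin(pi * t). The CDF of this curve is
// (1 - cos(pi * t)) / 2, so inverting it gives the expression below; most
// time samples land near the middle of the shutter interval.
float sampleShutterTime(float u) {
    const float pi = 3.14159265358979f;
    return std::acos(1.0f - 2.0f * u) / pi;
}
</code></pre></div></div>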
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_orangered_motionblur.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_orangered_motionblur.jpg" alt="Figure 3: Motion blurred render, using multi-segment motion blur with a cosine-based shutter curve." /></a></p>
<p>Since Takua has a procedural wireframe texture now, I also did a wireframe render.
I mentioned my procedural wireframe texture in a previous post, but I didn’t write about how it actually works.
For triangles and quads, the wireframe texture is simply based on the distance from the hitpoint to the nearest edge.
If the distance to the nearest edge is smaller than some threshold, draw one color, otherwise, draw some other color.
The nearest edge calculation can be done as follows (the variable names should be self-explanatory):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float calculateMinDistance(const Poly& p, const Intersection& hit) const {
float md = std::numeric_limits<float>::infinity();
const int verts = p.isQuad() ? 4 : 3;
for (int i = 0; i < verts; i++) {
const glm::vec3& cur = p[i].m_position;
const glm::vec3& next = p[(i + 1) % verts].m_position;
const glm::vec3 d1 = glm::normalize(next - cur);
const glm::vec3 d2 = hit.m_point - cur;
const float l = glm::length((cur + d1 * glm::dot(d1, d2) - hit.m_point));
md = glm::min(md, l * l);
}
return md;
};
</code></pre></div></div>
<p>The topology of the meshes is pretty strange, since the car model came as a triangle mesh, which I then subdivided:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_wireframe.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_wireframe.jpg" alt="Figure 4: Procedural wireframe texture." /></a></p>
<p>The material in the wireframe render only uses the lambertian diffuse lobe in Takua’s standard material; as such, the adaptive sampling heatmap for the wireframe render is interesting to compare to Figure 2.
Overall the sample distribution is much more even, and areas where diffuse inter-reflections are present got more samples:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_wireframe_sampleMask.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_wireframe_sampleMask.jpg" alt="Figure 5: Adaptive sampling heatmap for Figure 4. Brighter areas indicate more samples. Compare with Figure 2." /></a></p>
<p>Takua’s shading model supports layering different materials through parameter blending, similar to how the <a href="https://disney-animation.s3.amazonaws.com/library/s2012_pbs_disney_brdf_notes_v2.pdf">Disney Brdf</a> (and, at this point, <a href="http://blog.selfshadow.com/publications/s2017-shading-course/walster/s2017_pbs_volumetric_notes.pdf">most</a> <a href="http://blog.selfshadow.com/publications/s2017-shading-course/dreamworks/s2017_pbs_dreamworks_notes.pdf">other</a> <a href="http://blog.selfshadow.com/publications/s2017-shading-course/pixar/s2017_pbs_pixar_notes.pdf">shading</a> <a href="http://blog.selfshadow.com/publications/s2017-shading-course/imageworks/s2017_pbs_imageworks_slides.pdf">systems</a>) handles material layering.
I wanted to make an even more outrageous looking version of the Aventador than the orange-red version, so I used the procedural wireframe texture as a layer mask to drive parameter blending between a black paint and a metallic gold paint:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Dec/lambo_gold.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Dec/preview/lambo_gold.jpg" alt="Figure 6: An outrageous Aventador paint scheme using a procedural wireframe texture to blend between black and metallic gold car paint." /></a></p>
https://blog.yiningkarlli.com/2017/11/olafs-frozen-adventure.html
Olaf's Frozen Adventure
2017-11-16T00:00:00+00:00
2017-11-16T00:00:00+00:00
Yining Karl Li
<p>After an amazing 2016, <a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a> is having a bit of a break year this year.
Disney Animation doesn’t have a feature film this year; instead, we made a half-hour featurette called <a href="https://www.disneyanimation.com/projects/olafsfrozenadventure">Olaf’s Frozen Adventure</a>, which will be released in front of Pixar’s <a href="https://www.pixar.com/feature-films/coco#coco-main">Coco</a> during Thanksgiving.
I think this is the first time a Disney Animation short/featurette has accompanied a Pixar film.
Olaf’s Frozen Adventure is a fun little holiday story set in the world of Frozen, and I had the privilege of getting to play a small role in making Olaf’s Frozen Adventure!
I got an official credit as part of a handful of engineers that did some specific, interesting technology development for Olaf’s Frozen Adventure.</p>
<p>Olaf’s Frozen Adventure is really really funny; because Olaf is the main character, the entire story takes on much more of a self-aware, at times somewhat absurdist tone.
The featurette also has a bunch of new songs- there are six new songs in total, which is somehow pretty close to the original film’s count of eight songs, but in a third of the runtime.
Olaf’s Frozen Adventure was originally announced as a TV special, but the wider Walt Disney Company was so happy with the result that they decided to give Olaf’s Frozen Adventure a theatrical release instead!</p>
<p>Something I personally find fascinating about Olaf’s Frozen Adventure is comparing it visually with the original Frozen.
Olaf’s Frozen Adventure is rendered entirely with Disney’s <a href="http://www.disneyanimation.com/technology/innovations/hyperion">Hyperion Renderer</a>, compared with Frozen, which was rendered using pre-RIS Renderman.
While both films used our Disney BRDF <a href="https://doi.org/10.1145/2343483.2343493">[Burley 2012]</a> and Ptex <a href="https://doi.org/10.1111/j.1467-8659.2008.01253.x">[Burley and Lacewell 2008]</a>, Olaf’s Frozen Adventure benefits from all of the improvements and advancements that have been made during Big Hero 6, Zootopia, and Moana.
The original Frozen used dipole subsurface scattering, radiosity caching, and generally had fairly low geometric complexity relative to Hyperion-era films.
In comparison, Olaf’s Frozen Adventure uses brute force subsurface scattering, uses path-traced global illumination, uses the full Disney BSDF (which is significantly extended from the Disney BRDF) <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015]</a>, uses our advanced fur/hair shader developed during Zootopia <a href="https://doi.org/10.1111/cgf.12830">[Chiang et al. 2016]</a>, and has much greater geometric complexity.
A great example of the greater geometric complexity is the knitted scarf sequence <a href="https://doi.org/10.1145/3214745.3214817">[Staub et al. 2018]</a>, where 2D animation was brought into Hyperion as a texture map to drive the colors on a knitted scarf that was modeled and rendered down to the fiber level.
Some shots even utilize an extended version of the photon mapped caustics we developed during Moana; the photon mapped caustics system on Moana only supported distant lights as a photon source, but for Olaf’s Frozen Adventure, the photon mapping system was extended to support all of Hyperion’s existing light types as photon sources.
This extension to our photon mapping system is one of the things I worked on for Olaf’s Frozen Adventure, and it was used for lighting the ice crystal tree that Elsa creates at the end of the film.
Even the water in Arendelle Harbor looks way better than in Frozen, since the FX artists were able to make use of the incredible water systems developed for Moana <a href="https://doi.org/10.1145/3084363.3085067">[Palmer et al. 2017]</a>.
Many of these advancements are discussed in our SIGGRAPH 2017 Course Notes <a href="http://www.yiningkarlli.com/projects/ptcourse2017.html">[Burley et al. 2017]</a>.</p>
<p>One of the huge advantages to working on an in-house production rendering team in a vertically integrated studio is being able to collaborate and partner closely with productions on executing long-term technical visions.
Because of the show leadership’s confidence in our long-term development efforts targeted at later shows, the artists on Olaf’s Frozen Adventure were willing to take on and try out early versions of a number of new features in Hyperion that were originally targeted at later shows.
Some of these “preview” features wound up making a big difference on Olaf’s Frozen Adventure, and lessons learned on Olaf’s Frozen Adventure were instrumental in making these features much more robust and complete on Ralph Breaks the Internet.</p>
<p>One major feature was brute force path-traced subsurface scattering; Peter Kutz, Matt Chiang, and Brent Burley had originally started development during Moana’s production on brute force path-traced subsurface scattering <a href="https://doi.org/10.1145/2897839.2927433">[Chiang 2016]</a> as a replacement for Hyperion’s existing normalized diffusion based subsurface scattering <a href="https://doi.org/10.1145/2776880.2787670">[Burley 2015]</a>.
This feature wasn’t completed in time for use on Moana (although some initial testing was done using Moana assets), but was far enough along by the time Olaf’s Frozen Adventure was in production that artists started to experiment with it.
If I remember correctly, the characters in Olaf’s Frozen Adventure are still using normalized diffusion, but path-traced subsurface wound up finding extensive use in rendering all of the snow in the show, since the additional detail that path-traced subsurface brings out helped highlight the small granular details in the snow.
A lot of lessons learned from using path-traced subsurface scattering on the snow were then applied to making path-traced subsurface scattering more robust and easier to use and control.
These experiences gave us the confidence to go ahead with full-scale deployment on Ralph Breaks the Internet, which uses path-traced subsurface scattering for everything including characters.</p>
<p>Another major development effort that found experimental use on Olaf’s Frozen Adventure were some large overhauls to Hyperion’s ray traversal system.
During the production of Moana, we started running into problems with how large instance groups are structured in Hyperion.
Moana’s island environments featured vast quantities of instanced vegetation geometry, and because of how the instancing was authored, Hyperion’s old strategy for grouping instances in the top-level BVH wound up producing heavily overlapping BVH leaves, which in extreme cases could severely degrade traversal performance.
On Moana, the solution to this problem was to change how instances were authored upstream in the pipeline, but the way that the renderer wanted instances organized was fairly different from how artists and our pipeline like to think about instances, which made authoring more difficult.
This problem motivated Peter Kutz and I to develop a new traversal system that would be less sensitive to how instance groups were authored; the system we came up with allows Hyperion to internally break up top-level BVH nodes with large overlapping bounds into smaller, tighter subbounds based on the topology of the lower-level BVHs.
It turns out this system is conceptually essentially identical to BVH rebraiding <a href="https://doi.org/10.1145/3105762.3105776">[Benthin et al. 2017]</a>, but we developed and deployed this system independently before Benthin 2017 was published.
As part of this effort, we also wound up revisiting Hyperion’s original cone-based packet traversal strategy <a href="https://doi.org/10.1111/cgf.12158">[Eisenacher et al. 2013]</a> and, motivated by extensive testing and statistical performance analysis, developed a new, simpler, higher performance multithreading strategy for handling Hyperion’s ultra-wide batched ray traversal.
Olaf’s Frozen Adventure has a sequence where Olaf and Sven are being pulled down a mountainside through a forest by a burning sled; the enormous scale of the groundplane and large quantities of instanced trees proved to be challenging for Hyperion’s old traversal system.
We were able to partner with the artists to deploy a mid-development prototype of our new traversal system on this sequence, and were able to cut traversal times by close to an order of magnitude in some cases.
As a result, the artists were able to render this sequence with reasonable render times, and we were able to field-test the new traversal system prior to studio-wide deployment and iron out various kinks that were found along the way.</p>
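<p>To give a rough sense of what breaking up top-level nodes based on the lower-level BVH topology means, here is a heavily simplified sketch of the rebraiding idea: instances whose bounds are large relative to the scene are “opened” so that the top-level BVH can be built over subtrees of their lower-level BVHs instead of over whole instances.
This sketch is not Hyperion’s implementation, and the opening heuristic below is just a placeholder; see Benthin et al. 2017 for a production-quality formulation.</p>
<pre><code>#include <cstddef>
#include <vector>

struct Bounds {
    float min[3], max[3];
    float surfaceArea() const {
        float dx = max[0] - min[0], dy = max[1] - min[1], dz = max[2] - min[2];
        return 2.0f * (dx * dy + dy * dz + dz * dx);
    }
};

struct BLASNode {                 // a node in a lower-level (per-mesh) BVH
    Bounds bounds;                // assumed to already be in world space here
    const BLASNode* children[2];  // both null for leaf nodes
};

struct TopLevelPrim {             // what the top-level BVH gets built over
    const BLASNode* entryPoint;   // a subtree of some lower-level BVH
    std::size_t instanceId;       // transforms/shader bindings live elsewhere
};

// Turn a list of instances into top-level build primitives, opening up any
// subtree whose bounds are large relative to the whole scene. A real
// implementation would use a smarter, overlap-aware opening heuristic.
std::vector<TopLevelPrim> rebraid(const std::vector<const BLASNode*>& instanceRoots,
                                  float sceneSurfaceArea) {
    std::vector<TopLevelPrim> prims;
    for (std::size_t i = 0; i < instanceRoots.size(); i++) {
        std::vector<const BLASNode*> stack = {instanceRoots[i]};
        while (!stack.empty()) {
            const BLASNode* node = stack.back();
            stack.pop_back();
            bool open = node->bounds.surfaceArea() > 0.1f * sceneSurfaceArea;
            if (open && node->children[0] && node->children[1]) {
                stack.push_back(node->children[0]);   // descend into the
                stack.push_back(node->children[1]);   // lower-level BVH
            } else {
                prims.push_back({node, i});           // reference this subtree
            }
        }
    }
    return prims;
}
</code></pre>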
<p>The last major mid-development Hyperion feature that saw early experimental use on Olaf’s Frozen Adventure was our new, next-generation spectral and decomposition tracking <a href="https://doi.org/10.1145/3072959.3073665">[Kutz et al. 2017]</a> based null-collision volume rendering system, which was written with the intention of eventually completely replacing Hyperion’s existing residual ratio tracking <a href="https://doi.org/10.1145/2661229.2661292">[Novák et al. 2014]</a> based volume rendering system <a href="https://doi.org/10.1145/3084873.3084907">[Fong et al. 2017]</a>.
Artists on Olaf’s Frozen Adventure ran into some difficulties with rendering loose, fluffy white snow, where the bright white appearance is the result of high-order scattering requiring large numbers of bounces.
We realized that this problem is essentially identical to the problem of rendering white puffy clouds, which also have an appearance dominated by energy from high-order scattering.
Since null-collision volume integration is specifically very efficient at handling high-order scattering, we gave the artists an early prototype version of Hyperion’s new volume rendering system to experiment with rendering loose fluffy snow as a volume.
The initial results looked great; I’m not sure if this approach wound up being used in the final film, but this experiment gave both us and the artists a lot of confidence in the new volume rendering system and provided valuable feedback.</p>
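<p>For a bit of background on the “null-collision” terminology: the classic example of a null-collision method is delta (Woodcock) tracking, which samples free paths through a heterogeneous medium by padding the medium with fictitious “null” particles up to a bounding majorant.
A minimal sketch of that baseline technique (the starting point that the new system generalizes, not the new system itself) looks something like this:</p>
<pre><code>#include <cmath>
#include <functional>
#include <random>

// Sample a free-flight distance along a ray through a heterogeneous medium
// using delta/Woodcock tracking. extinctionAt(t) returns sigma_t at distance t
// along the ray, and majorant must bound sigma_t everywhere along the ray.
// Returns the sampled collision distance, or tMax if the ray escapes.
double deltaTrack(const std::function<double(double)>& extinctionAt,
                  double majorant, double tMax, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double t = 0.0;
    while (true) {
        // Tentative collision against the homogenized (majorant) medium.
        t -= std::log(1.0 - uniform(rng)) / majorant;
        if (t >= tMax) return tMax;  // escaped without a real collision
        // Accept as a real collision with probability sigma_t / majorant;
        // otherwise the collision was with a fictitious null particle.
        if (uniform(rng) < extinctionAt(t) / majorant) return t;
    }
}
</code></pre>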
<p>As usual with Disney Animation projects I get to work on, here are some stills from the film, in no particular order.
Even though Olaf’s Frozen Adventure was originally meant for TV, the whole studio still put the same level of effort into it that goes into full theatrical features, and I think it shows!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_01.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_01.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_02.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_02.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_03.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_03.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_04.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_04.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_05.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_05.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_06.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_06.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_07.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_07.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_08.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_08.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_09.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_09.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_10.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_10.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_11.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_11.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_12.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_12.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_13.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_13.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_14.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_14.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_15.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_15.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_17.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_17.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_18.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_18.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_19.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_19.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_20.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_20.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_22.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_22.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_23.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_23.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_26.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_26.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_24.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_24.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_25.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_25.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_21.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_21.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_27.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_27.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_16.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_16.jpg" alt="" /></a></p>
<p>Here is a credits frame with my name! I wasn’t actually expecting to get a credit on Olaf’s Frozen Adventure, but because I had spent a lot of time supporting the show and working with artists on deploying experimental Hyperion features to solve particularly difficult shots, the show decided to give me a credit! I was very pleasantly surprised by that; my teammate Matt Chiang got a credit as well for similar reasons.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Nov/LOAF_credits.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Nov/preview/LOAF_credits.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p><strong>References</strong></p>
<p>Carsten Benthin, Sven Woop, Ingo Wald, and Attila T. Áfra. 2017. <a href="https://doi.org/10.1145/3105762.3105776">Improved Two-Level BVHs using Partial Re-Braiding</a>. In <em>HPG ‘17 (Proceedings of High Performance Graphics)</em>. 7:1-7:8.</p>
<p>Brent Burley. 2012. <a href="https://doi.org/10.1145/2343483.2343493">Physically Based Shading at Disney</a>. In <em>ACM SIGGRAPH 2012 Course Notes: <a href="https://blog.selfshadow.com/publications/s2012-shading-course/">Practical Physically-Based Shading in Film and Game Production</a></em>.</p>
<p>Brent Burley. 2015. <a href="https://doi.org/10.1145/2776880.2787670">Extending the Disney BRDF to a BSDF with Integrated Subsurface Scattering</a>. In <em>ACM SIGGRAPH 2015 Course Notes: <a href="https://blog.selfshadow.com/publications/s2015-shading-course">Physically Based Shading in Theory and Practice</a></em>.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2017. <a href="https://www.yiningkarlli.com/projects/ptcourse2017.html">Recent Advances in Disney’s Hyperion Renderer</a>. <em><a href="http://dx.doi.org/10.1145/3084873.3084904">Path Tracing in Production Part 1</a>, ACM SIGGRAPH 2017 Course Notes</em>.</p>
<p>Brent Burley and Dylan Lacewell. 2008. <a href="https://doi.org/10.1111/j.1467-8659.2008.01253.x">Ptex: Per-face Texture Mapping for Production Rendering</a>. <em>Computer Graphics Forum</em>. 27, 4 (2008), 1155-1164.</p>
<p>Matt Jen-Yuan Chiang, Benedikt Bitterli, Chuck Tappan, and Brent Burley. 2016. <a href="https://doi.org/10.1111/cgf.12830">A Practical and Controllable Hair and Fur Model for Production Path Tracing</a>. <em>Computer Graphics Forum</em>. 35, 2 (2016), 275-283.</p>
<p>Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. <a href="https://doi.org/10.1145/2897839.2927433">Practical and Controllable Subsurface Scattering for Production Path Tracing</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 49:1-49:2.</p>
<p>Christian Eisenacher, Gregory Nichols, Andrew Selle, and Brent Burley. 2013. <a href="https://doi.org/10.1111/cgf.12158">Sorted Deferred Shading for Production Path Tracing</a>. <em>Computer Graphics Forum</em>. 32, 4 (2013), 125-132.</p>
<p>Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. <a href="https://doi.org/10.1145/3084873.3084907">Production Volume Rendering</a>. In <em>ACM SIGGRAPH 2017 Courses</em>.</p>
<p>Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. <a href="https://doi.org/10.1145/3072959.3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 111:1-111:16.</p>
<p>Jan Novák, Andrew Selle, and Wojciech Jarosz. 2014. <a href="https://doi.org/10.1145/2661229.2661292">Residual Ratio Tracking for Estimating Attenuation in Participating Media</a>. <em>ACM Transactions on Graphics</em>. 33, 6 (2014), 179:1-179:11.</p>
<p>Sean Palmer, Jonathan Garcia, Sara Drakeley, Patrick Kelly, and Ralf Habel. 2017. <a href="https://doi.org/10.1145/3084363.3085067">The Ocean and Water Pipeline of Disney’s Moana</a>. In <em>ACM SIGGRAPH 2017 Talks</em>. 29:1-29:2.</p>
<p>Josh Staub, Alessandro Jacomini, and Dan Lund. 2018. <a href="https://doi.org/10.1145/3214745.3214817">The Handiwork Behind “Olaf’s Frozen Adventure”</a>. In <em>ACM SIGGRAPH 2018 Talks</em>. 26:1-26:2.</p>
https://blog.yiningkarlli.com/2017/08/recent-advances-in-hyperion.html
SIGGRAPH 2017 Course Notes- Recent Advances in Disney's Hyperion Renderer
2017-08-04T00:00:00+00:00
2017-08-04T00:00:00+00:00
Yining Karl Li
<p>This year at SIGGRAPH 2017, Luca Fascione and Johannes Hanika from Weta Digital organized a Path Tracing in Production course.
The course was split into two halves: a first half about production renderers, and a second half about using production renderers to make movies.
Brent Burley presented our recent work on Disney’s Hyperion Renderer as part of the first half of the course.
To support Brent’s section of the course, the entire Hyperion team worked together to put together some course notes describing recent work in Hyperion done for Zootopia, Moana, and upcoming films.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/course_notes_zootopia.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/course_notes_zootopia.jpg" alt="Image from course notes Figure 8: a production frame from Zootopia, rendered using Disney's Hyperion Renderer." /></a></p>
<p>Here is the abstract for the course notes:</p>
<p><em>Path tracing at Walt Disney Animation Studios began with the Hyperion renderer, first used in production on Big Hero 6. Hyperion is a custom, modern path tracer using a unique architecture designed to efficiently handle complexity, while also providing artistic controllability and efficiency.
The concept of physically based shading at Disney Animation predates the Hyperion renderer. Our history with physically based shading significantly influenced the development of Hyperion, and since then, the development of Hyperion has in turn influenced our philosophy towards physically based shading.</em></p>
<p>The course notes and related materials can be found at:</p>
<ul>
<li><a href="https://jo.dreggn.org/path-tracing-in-production/2017/index.html">Official Course Resources Page (Full course notes and supplemental materials)</a></li>
<li><a href="https://www.yiningkarlli.com/projects/ptcourse2017.html">Project Page (Author’s Version)</a></li>
<li><a href="https://dl.acm.org/citation.cfm?doid=3084873.3084904">Official Print Version (ACM Library)</a></li>
</ul>
<p>The course wasn’t recorded due to proprietary content from various studios, but the course notes cover everything that was presented.
The major theme of our part of the course notes (and Brent’s presentation) is replacing multiple scattering approximations with accurate brute-force path-traced solutions.
Interestingly, the main motivator for this move is primarily a desire for better, more predictable and intuitive controls for artists, as opposed to simply wanting better visual quality.
In the course notes, we specifically discuss fur/hair, path-traced subsurface scattering, and volume rendering.</p>
<p>The Hyperion team also had two other presentations at SIGGRAPH 2017:</p>
<ul>
<li>Ralf Habel presented several sections of the “<a href="https://graphics.pixar.com/library/ProductionVolumeRendering/">Production Volume Rendering</a>” course, which was jointly put together by Julian Fong and Magnus Wrenninge from Pixar Animation Studios, Christopher Kulla from Sony Imageworks, and Ralf Habel from Walt Disney Animation Studios.</li>
<li>Peter Kutz presented our “<a href="https://blog.yiningkarlli.com/2017/07/spectral-and-decomposition-tracking.html">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>” technical paper in the “Rendering Volumes” papers session.</li>
</ul>
https://blog.yiningkarlli.com/2017/07/spectral-and-decomposition-tracking.html
SIGGRAPH 2017 Paper- Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes
2017-07-25T00:00:00+00:00
2017-07-25T00:00:00+00:00
Yining Karl Li
<p>Some recent work I was part of at Walt Disney Animation Studios has been published in the July 2017 issue of ACM Transactions on Graphics as part of SIGGRAPH 2017!
The paper is titled “<a href="http://dl.acm.org/citation.cfm?id=3073665">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes</a>”, and the project was a collaboration between the Hyperion development team at <a href="http://disneyanimation.com">Walt Disney Animation Studios</a> (WDAS) and the rendering group at <a href="http://www.disneyresearch.com/research-labs/disney-research-zurich">Disney Research Zürich</a> (DRZ).
From the WDAS side, the authors are <a href="http://peterkutz.com">Peter Kutz</a> (who was at Penn at the same time as me), <a href="https://www.linkedin.com/in/ralf-habel-6a74bb2/">Ralf Habel</a>, and myself.
On the DRZ side, our collaborator was <a href="http://drz.disneyresearch.com/~jnovak/">Jan Novák</a>, the head of DRZ’s rendering research group.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/color_explosion.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/color_explosion.jpg" alt="Image from paper Figure 12: a colorful explosion with chromatic extinction rendered using spectral tracking." /></a></p>
<p>Here is the paper abstract:</p>
<p><em>We present two novel unbiased techniques for sampling free paths in heterogeneous participating media. Our decomposition tracking accelerates free-path construction by splitting the medium into a control component and a residual component and sampling each of them separately. To minimize expensive evaluations of spatially varying collision coefficients, we define the control component to allow constructing free paths in closed form. The residual heterogeneous component is then homogenized by adding a fictitious medium and handled using weighted delta tracking, which removes the need for computing strict bounds of the extinction function. Our second contribution, spectral tracking, enables efficient light transport simulation in chromatic media. We modify free-path distributions to minimize the fluctuation of path throughputs and thereby reduce the estimation variance. To demonstrate the correctness of our algorithms, we derive them directly from the radiative transfer equation by extending the integral formulation of null-collision algorithms recently developed in reactor physics. This mathematical framework, which we thoroughly review, encompasses existing trackers and postulates an entire family of new estimators for solving transport problems; our algorithms are examples of such. We analyze the proposed methods in canonical settings and on production scenes, and compare to the current state of the art in simulating light transport in heterogeneous participating media.</em></p>
<p>The paper and related materials can be found at:</p>
<ul>
<li><a href="https://www.disneyanimation.com/technology/publications/96">Official WDAS Project Page (Preprint paper and supplemental materials)</a></li>
<li><a href="http://www.yiningkarlli.com/projects/specdecomptracking.html">Project Page (Author’s Version)</a></li>
<li><a href="http://dl.acm.org/citation.cfm?doid=3072959.3073665">Official Print Version (ACM Library)</a></li>
</ul>
<p>Peter Kutz will be presenting the paper at <a href="http://s2017.siggraph.org">SIGGRAPH 2017</a> in Los Angeles as part of the <a href="http://s2017.siggraph.org/technical-papers/sessions/rendering-volumes">Rendering Volumes</a> Technical Papers session.</p>
<p>Instead of repeating the contents of the paper here (which is pointless since the paper already says everything we want to say), I thought instead I’d use this blog post to talk about some of the process we went through while writing this paper.
Please note that all opinions and thoughts stated in this post are my own, not Disney’s.</p>
<p>This project started over a year ago, when we began an effort to significantly overhaul and improve Hyperion’s volume rendering system.
Around the same time that we began to revisit volume rendering, we heard a lecture from a visiting professor on multilevel Monte Carlo (MLMC) methods.
Although the final paper has nothing to do with MLMC methods, the genesis of this project was in initial conversations we had about how MLMC methods might be applied to volume rendering.
We concluded that MLMC could be applicable, but weren’t entirely sure how.
However, these conversations eventually gave Peter the idea to develop the technique that would eventually become decomposition tracking (importantly, decomposition tracking does not actually use MLMC though).
Further conversations about weighted delta tracking then led to Peter developing the core ideas behind what would become spectral tracking.
After testing initial implementations of these prototype versions of decomposition and spectral tracking, Peter, Ralf, and I shared the techniques with Jan.
Around the same time, we also shared the techniques with our sister teams, Pixar’s RenderMan development group in Seattle and the Pixar Research Group in Emeryville, who were able to independently implement and verify our techniques.
Being able to share research between Walt Disney Animation Studios, Disney Research, the Renderman group, Pixar Animation Studios, Industrial Light & Magic, and Imagineering is one of the reasons why Disney is such an amazing place to be for computer graphics folks.</p>
<p>At this point we had initial rudimentary proofs for why decomposition and spectral tracking worked separately, but we still didn’t have a unified framework that could be used to explain and combine the two techniques.
Together with Jan, we began by deep-diving into the origins of delta/Woodcock tracking in neutron transport and reactor physics papers from the 1950s and 1960s and working our way forward to the present.
All of the key papers we dug up during this deep-dive are cited in our paper.
Some of these early papers were fairly difficult to find.
For example, the original delta tracking paper, “Techniques used in the GEM code for Monte Carlo neutronics calculations in reactors and other systems of complex geometry” (Woodcock et al. 1965), is often cited in graphics literature, but a cursory Google search doesn’t provide any links to the actual paper itself.
We eventually managed to track down a copy of the original paper in the archives of the United States Department of Commerce, which for some reason hosts a lot of archive material from Argonne National Laboratory.
Since the original Woodcock paper has been in the public domain for some time now but is fairly difficult to find, I’m hosting a <a href="http://yiningkarlli.com/projects/specdecomptracking/references/Woodcock1965.pdf">copy here</a> for any researchers that may be interested.</p>
<p>Several other papers we were only able to obtain by requesting archival microfilm scans from several university libraries.
I won’t host copies here, since the public domain status for several of them isn’t clear, but if you are a researcher looking for any of the papers that we cited and can’t find it, feel free to contact me.
One particularly cool find was “The Relativistic Doppler Problem” (Zerby et al. 1961), which Peter obtained by writing to the Oak Ridge National Laboratory’s research library.
Their staff were eventually able to find the paper in their records/archives, and subsequently scanned and uploaded the paper online.
The paper is now <a href="https://www.osti.gov/scitech/biblio/4836227">publicly available here</a>, on the United States Department of Energy’s Office of Scientific and Technical Information website.</p>
<p>Eventually, through significant effort from Jan, we came to understand Galtier et al.’s 2013 paper, “<a href="https://www.researchgate.net/publication/258211025_Integral_formulation_of_null-collision_Monte_Carlo_algorithms">Integral Formulation of Null-Collision Monte Carlo Algorithms</a>”, and were able to import the integral formulation into computer graphics and demonstrate how to derive both decomposition and spectral tracking directly from the radiative transfer equation using the integral formulation.
This step also allowed Peter to figure out how to combine spectral and decomposition tracking into a single technique.
With all of these pieces in place, we had the framework for our SIGGRAPH paper.
We then put significant effort into working out remaining details, such as finding a good mechanism for bounding the free-path-sampling coefficient in spectral tracking.
Producing all of the renders, results, charts, and plots in the paper also took an enormous amount of time; it turns out that producing all of this stuff can take significantly longer than the amount of time originally spent coming up with and implementing the techniques in the first place!</p>
<p>One major challenge we faced in writing the final paper was finding the best order in which to present the three main pieces of the paper: decomposition tracking, spectral tracking, and the integral formulation of null-collision algorithms.
At one point, we considered first presenting decomposition tracking, since on a general level decomposition tracking is the easiest of the three contributions to understand.
Then, we planned to use the proof of decomposition tracking to expand out into the integral formulation of the RTE with null collisions, and finally derive spectral tracking from the integral formulation.
The idea was essentially to introduce the easiest technique first, expand out to the general mathematical framework, and then demonstrate the flexibility of the framework by deriving the second technique.
However, this approach in practice felt disjointed, especially with respect to the body of prior work we wanted to present, which underpinned the integral framework but wound up being separated by the decomposition tracking section.
So instead, we arrived at the final presentation order, where we first present the integral framework and derive prior techniques such as delta tracking from it, and then demonstrate how to derive the new decomposition and spectral tracking techniques from the same framework.
We hope that presenting the paper in this way will encourage other researchers to adopt the integral framework and derive other, new techniques from the framework.
For Peter’s presentation at SIGGRAPH, however, Peter chose to go with the original order since it made for a better presentation.</p>
<p>Since our final paper was already quite long, we had to move some content into a separate supplemental document.
Although the supplemental content isn’t necessary for implementing the core algorithms presented, I think the supplemental content is very useful for gaining a better understanding of the techniques.
The supplemental content contains, among other things, an extended proof of the minimum-of-exponents mechanism that decomposition tracking is built on, various proofs related to choosing bounds for the local collision weight in spectral tracking, and various additional results and further analysis.
We also provide a nifty interactive viewer for comparing our techniques against vanilla delta tracking; the interactive viewer framework was originally developed by <a href="http://zurich.disneyresearch.com/~fabricer/">Fabrice Rousselle</a>, Jan Novák and <a href="https://benedikt-bitterli.me">Benedikt Bitterli</a> at Disney Research Zürich.</p>
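<p>Without repeating the paper’s derivations, here is a cartoon of what the minimum-of-exponents idea looks like in its simplest, purely analog form: split the extinction into a homogeneous control component plus a heterogeneous residual, sample a free path through each component separately, and keep the nearer of the two collisions.
This sketch leaves out everything that makes the published algorithm efficient and general (weighted tracking, spectral handling, bounding details, early termination), so please treat it as an illustration rather than the paper’s algorithm.</p>
<pre><code>#include <cmath>
#include <functional>
#include <random>

// sigma_t(x) = sigmaC + sigmaR(x): a homogeneous control component plus a
// heterogeneous residual. Sample a free path through each component and take
// the minimum; the result is distributed according to the full medium.
double sampleControl(double sigmaC, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    return -std::log(1.0 - uniform(rng)) / sigmaC;   // closed-form free path
}

double sampleResidual(const std::function<double(double)>& sigmaRAt,
                      double residualMajorant, double tMax, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double t = 0.0;
    while (true) {   // plain delta tracking on the residual component
        t -= std::log(1.0 - uniform(rng)) / residualMajorant;
        if (t >= tMax) return tMax;
        if (uniform(rng) < sigmaRAt(t) / residualMajorant) return t;
    }
}

double decompositionTrack(double sigmaC,
                          const std::function<double(double)>& sigmaRAt,
                          double residualMajorant, double tMax,
                          std::mt19937& rng) {
    double tControl = sampleControl(sigmaC, rng);
    double tResidual = sampleResidual(sigmaRAt, residualMajorant, tMax, rng);
    // Minimum of exponents: the nearer of the two collisions wins.
    return std::fmin(std::fmin(tControl, tResidual), tMax);
}
</code></pre>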
<p>One of the major advantages of doing rendering research at a major animation or VFX studio is the availability of hundreds of extremely talented artists, who are always eager to try out new techniques and software.
Peter, Ralf, and I worked closely with a number of artists at WDAS to test our techniques and produce interesting scenes with which to generate results and data for the paper.
Henrik Falt and Alex Nijmeh had created a number of interesting clouds in the process of testing our general volume rendering improvements, and worked with us to adapt a cloud dataset for use in Figure 11 of our paper.
The following is one of the renders from Figure 11:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/single_cloud.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/single_cloud.jpg" alt="Image from paper Figure 11: an optically thick cloud rendered using decomposition tracking." /></a></p>
<p>Henrik and Alex also constructed the cloudscape scene used as the banner image on the first page of the paper.
After we submitted the paper, Henrik and Alex continued iterating on this scene, which eventually resulted in the more detailed version seen in our SIGGRAPH Fast Forward video.
The version of the cloudscape used in our paper is reproduced below:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/beauty_clouds.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/beauty_clouds.jpg" alt="Image from paper Figure 1: a cloudscape rendered using spectral and decomposition tracking." /></a></p>
<p>To test out spectral tracking, we wanted an interesting, dynamic, colorful dataset.
After describing spectral tracking to Jesse Erickson, we arrived at the idea of a color explosion similar in spirit to certain visuals used in recent <a href="https://www.youtube.com/watch?v=WVPRkcczXCY">Apple</a> and <a href="https://www.youtube.com/watch?v=BzMLA8YIgG0">Microsoft</a> ads, which in turn were inspired by the <a href="https://en.wikipedia.org/wiki/Holi">Holi festival</a> celebrated in India and Nepal.
Jesse authored the color explosion in Houdini and provided a set of VDBs for each color section, which we were then able to shade, light, and render using Hyperion’s implementation of spectral tracking.
The final result was the color explosion from Figure 12 of the paper, seen at the top of this post.
We were honored to learn that the color explosion figure was chosen to be one of the pictures on the back cover of this year’s conference proceedings!</p>
<p>At one point we also remembered that brute force path-traced subsurface scattering is just volume rendering inside of a bounded surface, which led to the translucent heterogeneous Stanford dragon used in Figure 15 of the paper:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/sss_dragon.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/sss_dragon.jpg" alt="Image from paper Figure 15: a subsurface scattering heterogeneous Stanford dragon rendered using spectral and decomposition tracking." /></a></p>
<p>For our video for the SIGGRAPH 2017 Fast Forward, we were able to get a lot of help from a number of artists.
Alex and Henrik and a number of other artists significantly expanded and improved the cloudscape scene, and we also rendered out several more color explosion variants.
The final fast forward video contains work from Alex Nijmeh, Henrik Falt, Jesse Erickson, Thom Wickes, Michael Kaschalk, Dale Mayeda, Ben Frost, Marc Bryant, John Kosnik, Mir Ali, Vijoy Gaddipati, and Dimitre Berberov.
The awesome title effect was thought up and created by Henrik.
The final video is a bit noisy since we were severely constrained on available renderfarm resources (we were basically squeezing our renders in between actual production renders), but I think the end result is still really great:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/229503895" frameborder="0">Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes- SIGGRAPH 2017 Fast Forward Video</iframe></div>
<p>Here are a couple of cool stills from the fast forward video:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/fastforward_01.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/fastforward_01.jpg" alt="An improved cloudscape from our SIGGRAPH Fast Forward video." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/fastforward_02.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/fastforward_02.jpg" alt="An orange-purple color explosion from our SIGGRAPH Fast Forward video." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/Jul/fastforward_03.png"><img src="https://blog.yiningkarlli.com/content/images/2017/Jul/preview/fastforward_03.jpg" alt="A green-yellow color explosion from our SIGGRAPH Fast Forward video." /></a></p>
<p>We owe an enormous amount of thanks to fellow Hyperion teammate Patrick Kelly, who played an instrumental role in designing and implementing our overall new volume rendering system, and who discussed with us extensively throughout the project.
Hyperion teammate David Adler also helped out a lot in profiling and instrumenting our code.
We also must thank Thomas Müller, Marios Papas, Géraldine Conti, and David Adler for proofreading, and Brent Burley, Michael Kaschalk, and Rajesh Sharma for providing support, encouragement, and resources for this project.</p>
<p>I’ve worked on a <a href="http://blog.yiningkarlli.com/2014/11/sky-paper.html">SIGGRAPH Asia paper</a> before, but working on a large scale publication in the context of a major animation studio instead of in school was a very different experience.
The support and resources we were given and the amount of talent and help that we were able to tap into made this project possible.
This project is also an example of the incredible value that comes from companies maintaining in-house industrial research labs; this project absolutely would not have been possible without all of the collaboration from DRZ, in both the form of direct collaboration from Jan and indirect collaboration from all of the DRZ researchers that provided discussions and feedback.
Everyone worked really hard, but overall the whole process was immensely intellectually satisfying and fun, and seeing our new techniques in use by talented, excited artists makes all of the work absolutely worthwhile!</p>
https://blog.yiningkarlli.com/2017/05/subdivision-and-displacement.html
Subdivision Surfaces and Displacement Mapping
2017-05-14T00:00:00+00:00
2017-05-14T00:00:00+00:00
Yining Karl Li
<p>Two standard features that every modern production renderer supports are <a href="https://en.wikipedia.org/wiki/Subdivision_surface">subdivision surfaces</a> and some form of <a href="https://en.wikipedia.org/wiki/Displacement_mapping">displacement mapping</a>.
As we’ll discuss a bit later in this post, these two features are usually very closely linked to each other in both usage and implementation.
Subdivision and displacement are crucial tools for representing detail in computer graphics; from both a technical and authorship point of view, being able to represent more detail than is actually present in a mesh is advantageous.
Applying detail at runtime allows for geometry to take up less disk space and memory than would be required if all detail was baked into the geometry, and artists often like the ability to separate broad features from high frequency detail.</p>
<p>I recently added support for subdivision surfaces and both scalar and vector displacement to Takua; Figure 1 shows an ocean wave rendered in Takua using vector displacement.
The ocean surface is entirely displaced from just a single plane!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/displacement_ocean_0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/displacement_ocean_0.jpg" alt="Figure 1: An ocean surface modeled as a flat plane and rendered using vector displacement mapping." /></a></p>
<p>Both subdivision and displacement originally came from the world of rasterization rendering, where on-the-fly geometry generation was historically both easier to implement and more practical/plausible to use.
In rasterization, geometry is streamed through the renderer and drawn to screen, so each individual piece of geometry could be subdivided, tessellated, displaced, splatted to the framebuffer, and then discarded to free up memory.
The old REYES-based RenderMan was famously efficient at rendering subdivision surfaces and displaced surfaces for precisely this reason.
However, in naive ray tracing, rays can intersect geometry at any moment in any order.
Subdividing and displacing geometry on the fly for each ray and then discarding the geometry is insanely expensive compared to processing geometry once across an entire framebuffer.
The simplest solution to this problem is to just subdivide and displace everything up front and keep it all around in memory during ray tracing.
Historically though, just caching everything was never a practical solution since computers simply didn’t have enough memory to keep that much data around.
As a result, past research work put significant effort into more intelligent ray tracing architectures that made on-the-fly subdivision/displacement affordable again; notable advancements include geometry caching for ray tracing <a href="http://graphics.stanford.edu/papers/displace">[Pharr and Hanrahan 1996]</a>, direct ray tracing of displacement mapped triangles <a href="https://doi.org/10.1007/978-3-7091-6303-0_28">[Smits et al. 2000]</a>, reordered ray tracing <a href="https://jo.dreggn.org/home/2010_rayes.pdf">[Hanika et al. 2010]</a>, and GPU ray traced vector displacement <a href="https://www.crcpress.com/GPU-Pro-6-Advanced-Rendering-Techniques/Engel/p/book/9781482264616">[Harada 2015]</a>.</p>
<p>In the past five years or so though, the story on ray traced displacement has changed.
We now have machines with gobs and gobs of memory (at a number of studios, renderfarm nodes with 256 GB of memory or more are not unusual anymore).
As a result, ray traced renderers don’t need to be nearly as clever anymore about managing displaced geometry; a combination of camera-adaptive tessellation and a simple geometry cache with a least-recently-used eviction strategy is often enough to make ray traced displacement practical.
Heavy displacement is now common in the workflows for a number of production pathtracers, including Arnold, RenderMan/RIS, V-Ray, Corona, Hyperion, Manuka, etc.
With the above in mind, I tried to implement subdivision and displacement in Takua as simply as I possibly could.</p>
<p>Takua doesn’t have any concept of an eviction strategy for cached tessellated geometry; the hope is to just fit in memory and be as efficient as possible with what memory is available.
Admittedly, since Takua is just my hobby renderer instead of a fully in-use production renderer, and I have personal machines with 48 GB of memory, I didn’t think particularly hard about cases where things don’t fit in memory.
Instead of tessellating on-the-fly per ray or anything like that, I simply pre-subdivide and pre-displace everything upfront during the initial scene load.
Meshes are loaded, subdivided, and displaced in parallel with each other.
If Takua discovers that all of the subdivided and displaced geometry isn’t going to fit in the allocated memory budget, the renderer simply quits.</p>
<p>I should note that Takua’s scene format distinguishes between a mesh and a geom; a mesh is the raw vertex/face/primvar data that makes up a surface, while a geom is an object containing a reference to a mesh along with transformation matrices, shader bindings, and so on and so forth.
This separation between the mesh data and the geometric object allows for some useful features in the subdivision/displacement system.
Takua’s scene file format allows for binding subdivision and displacement modifiers either on the shader level, or per each geom.
Bindings at the geom level override bindings on the shader level, which is useful for authoring since a whole bunch of objects can share the same shader but then have individual specializations for different subdivision rates and different displacement maps and displacement settings.
During scene loading, Takua analyzes what subdivisions/displacements are required for which meshes by which geoms, and then de-duplicates and aggregates any cases where different geoms want the same subdivision/displacement for the same mesh.
This de-duplication even works for instances (I should write a separate post about Takua’s approach to instancing someday…).</p>
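<p>As an illustration of what this de-duplication amounts to (sketched with made-up structures, not Takua’s actual ones): each geom effectively requests a (mesh, subdivision level, displacement settings) combination, and requests that resolve to the same key share a single tessellated result.</p>
<pre><code>#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <vector>

struct Mesh { /* raw vertex/face/primvar data */ };
struct TessellatedMesh { /* subdivided + displaced result */ };

struct Geom {
    std::string meshName;
    int subdivisionLevel = 0;
    std::string displacementMap;       // empty string means no displacement
    float displacementScale = 1.0f;
    std::shared_ptr<TessellatedMesh> tessellated;
};

// Stand-in for the actual subdivision/displacement code.
std::shared_ptr<TessellatedMesh> subdivideAndDisplace(const Mesh&, int,
                                                      const std::string&, float) {
    return std::make_shared<TessellatedMesh>();
}

// Geoms requesting the same mesh with the same settings share one result.
void resolveTessellation(const std::map<std::string, Mesh>& meshes,
                         std::vector<Geom>& geoms) {
    using Key = std::tuple<std::string, int, std::string, float>;
    std::map<Key, std::shared_ptr<TessellatedMesh>> cache;
    for (Geom& geom : geoms) {
        Key key{geom.meshName, geom.subdivisionLevel,
                geom.displacementMap, geom.displacementScale};
        auto found = cache.find(key);
        if (found == cache.end()) {
            found = cache.emplace(key, subdivideAndDisplace(
                        meshes.at(geom.meshName), geom.subdivisionLevel,
                        geom.displacementMap, geom.displacementScale)).first;
        }
        geom.tessellated = found->second;   // shared across matching geoms
    }
}
</code></pre>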
<p>Once Takua has put together a list of all meshes that require subdivision, meshes are subdivided in parallel.
For Catmull-Clark subdivision <a href="https://www.sciencedirect.com/science/article/abs/pii/0010448578901100">[Catmull and Clark 1978]</a>, I rely on <a href="https://graphics.pixar.com/opensubdiv/docs/intro.html">OpenSubdiv</a> for calculating subdivision <a href="https://graphics.pixar.com/opensubdiv/docs/far_overview.html#far-stenciltable">stencil tables</a> <a href="https://dl.acm.org/doi/10.1145/166117.166121">[Halstead et al. 1993]</a> for feature adaptive subdivision <a href="https://dl.acm.org/doi/10.1145/2077341.2077347">[Nießner et al. 2012]</a>, evaluating the stencils, and final tessellation.
As far as I can tell, stencil calculation in OpenSubdiv is single threaded, so it can get fairly slow on really heavy meshes.
Stencil evaluation and final tessellation is super fast though, since OpenSubdiv provides a number of <a href="https://graphics.pixar.com/opensubdiv/docs/osd_overview.html#limit-stencil-evaluation">parallel evaluators</a> that can run using a variety of backends ranging from TBB on the CPU to CUDA or OpenGL compute shaders on the GPU.
Takua currently relies on OpenSubdiv’s TBB evaluator.
One really neat thing about the stencil implementation in OpenSubdiv is that the stencil calculation is dependent on only the topology of the mesh and not individual primvars, so a single stencil calculation can then be reused multiple times to interpolate many different primvars, such as positions, normals, uvs, and more.
Currently Takua doesn’t support creases; I’m planning on adding crease support later.</p>
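<p>The property mentioned above, that stencils depend only on topology, is easy to see from what a stencil actually is: each refined vertex is just a weighted sum of control vertices, so the same index/weight lists can be applied to positions, normals, uvs, or any other primvar.
Here is a generic sketch of stencil evaluation (not OpenSubdiv’s actual API):</p>
<pre><code>#include <cstddef>
#include <vector>

struct Stencil {
    std::vector<std::size_t> controlIndices;
    std::vector<float> weights;    // same length as controlIndices
};

// Apply a stencil table to one primvar channel with `width` floats per element
// (e.g. width = 3 for positions or normals, width = 2 for uvs).
std::vector<float> evaluateStencils(const std::vector<Stencil>& stencils,
                                    const std::vector<float>& controlValues,
                                    std::size_t width) {
    std::vector<float> refined(stencils.size() * width, 0.0f);
    for (std::size_t i = 0; i < stencils.size(); i++) {
        const Stencil& s = stencils[i];
        for (std::size_t j = 0; j < s.controlIndices.size(); j++) {
            for (std::size_t c = 0; c < width; c++) {
                refined[i * width + c] +=
                    s.weights[j] * controlValues[s.controlIndices[j] * width + c];
            }
        }
    }
    return refined;
}
</code></pre>
<p>The same evaluation routine can then be run once per primvar channel (or across channels in parallel), which is exactly why computing the stencils once per topology is such a win.</p>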
<p>No writing about subdivision surfaces is complete without a picture of a cube being subdivided into a sphere, so Figure 2 shows a render of a cube with subdivision levels 0, 1, 2, and 3, going from left to right.
Each subdivided cube is rendered with a procedural wireframe texture that I implemented to help visualize what was going on with subdivision.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/subdcube.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/subdcube.jpg" alt="Figure 2: A cube with 0, 1, 2, and 3 subdivision levels, going from left to right." /></a></p>
<p>Each subdivided mesh is placed into a new mesh; base meshes that require multiple subdivision levels for multiple different geoms get one new subdivided mesh per subdivision level.
After all subdivided meshes are ready, Takua then runs displacement.
Displacement is parallelized both by mesh and within each mesh.
Also, Takua supports both on-the-fly displacement and fully cached displacement, which can be specified per shader or per geom.
If a mesh is marked for full caching, the mesh is fully displaced, stored as a separate mesh from the undisplaced subdivision mesh, and then a BVH is built for the displaced mesh.
If a mesh is marked for on-the-fly displacement, the displacement system calculates each displaced face, then calculates the bounds for that face, and then discards the face.
The displaced bounds are then used to build a tight BVH for the displaced mesh without actually having to store the displaced mesh itself; instead, just a reference to the undisplaced subdivision mesh has to be kept around.
When a ray traverses the BVH for an on-the-fly displacement mesh, each BVH leaf node specifies which triangles on the undisplaced mesh need to be displaced to produce final polys for intersection and then the displaced polys are intersected and discarded again.
For the scenes in this post, on-the-fly displacement seems to be about twice as slow as fully cached displacement, which is to be expected, but if the same mesh is displaced multiple different ways, then there are correspondingly large memory savings.
After all displacement has been calculated, Takua goes back and analyzes which base meshes and undisplaced subdivision meshes are no longer needed, and frees those meshes to reclaim memory.</p>
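<p>To make the on-the-fly path a little more concrete, the bounds-only pass boils down to something like the following sketch (my own simplified illustration, not Takua’s actual code): displace each face, grow a bounding box around the result, and throw the displaced vertices away, keeping only the boxes for the BVH build.</p>
<pre><code>#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

struct Bounds {
    Vec3 min{ 1e30f,  1e30f,  1e30f};
    Vec3 max{-1e30f, -1e30f, -1e30f};
    void expand(Vec3 p) {
        min = {std::min(min.x, p.x), std::min(min.y, p.y), std::min(min.z, p.z)};
        max = {std::max(max.x, p.x), std::max(max.y, p.y), std::max(max.z, p.z)};
    }
};

// Stand-in for evaluating the displacement map and offsetting a vertex.
Vec3 displaceVertex(Vec3 undisplaced) {
    return {undisplaced.x, undisplaced.y + 0.1f, undisplaced.z};
}

// One bounding box per triangle; the displaced vertices themselves are
// discarded, so only the undisplaced mesh plus these bounds stay in memory.
std::vector<Bounds> displacedFaceBounds(
        const std::vector<Vec3>& vertices,
        const std::vector<std::array<std::size_t, 3>>& faces) {
    std::vector<Bounds> bounds(faces.size());
    for (std::size_t f = 0; f < faces.size(); f++) {
        for (std::size_t k = 0; k < 3; k++) {
            bounds[f].expand(displaceVertex(vertices[faces[f][k]]));
        }
    }
    return bounds;   // feed these to the BVH builder as leaf bounds
}
</code></pre>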
<p>I implemented support for both scalar displacement via regular grayscale texture maps, and vector displacement from OpenEXR textures.
The ocean render from the start of this post uses vector displacement applied to a single plane.
Figure 3 shows another angle of the same vector displaced ocean:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/displacement_ocean_1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/displacement_ocean_1.jpg" alt="Figure 3: Another view of the vector displaced ocean surface from Figure 1. The ocean surface has a dielectric refractive material complete with colored attenuated transmission. A shallow depth of field is used to lend added realism." /></a></p>
<p>For both ocean renders, the vector displacement OpenEXR texture is borrowed from Autodesk, who generously provide it as part of an <a href="http://area.autodesk.com/learning/rendering-an-ocean-with-displacement1">article</a> about vector displacement in Arnold.
The renders are lit with a skydome using <a href="http://hdri-skies.com/shop/hdri-sky-193/">hdri-skies.com’s HDRI Sky 193</a> texture.</p>
<p>For both scalar and vector displacement, the displacement amount from the displacement texture can be controlled by a single scalar value.
Vector displacement maps are assumed to be in a local tangent space; which axis is used as the basis of the tangent space can be specified per displacement map.
Figure 4 shows three dirt shaderballs with varying displacement scaling values.
The leftmost shaderball has a displacement scale of 0, which effectively disables displacement.
The middle shaderball has a displacement scale of 0.5 of the native displacement values in the vector displacement map.
The rightmost shaderball has a displacement scale of 1.0, which means just use the native displacement values from the vector displacement map.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/displacementscales.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/displacementscales.jpg" alt="Figure 4: Dirt shaderballs with displacement scales of 0.0, 0.5, and 1.0, going from left to right." /></a></p>
<p>Figure 5 shows a closeup of the rightmost dirt shaderball from Figure 4.
The base mesh for the shaderball is relatively low resolution, but through subdivision and displacement, a huge amount of geometric detail can be added in-render.
In this case, the shaderball is tessellated to a point where each individual micropolygon is at a subpixel size.
The model for the shaderball is based on <a href="http://bertrand-benoit.com/blog/free-mat-test-scene/">Bertrand Benoit</a>’s shaderball.
The displacement map and other textures for the dirt shaderball are from <a href="https://megascans.se">Quixel’s Megascans</a> library.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/dirtsphere.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/dirtsphere.jpg" alt="Figure 5: Closeup of the dirt shaderball from Figure 4. In this render, the shaderball is tessellated and displaced to a subpixel resolution." /></a></p>
<p>One major challenge with displacement mapping is cracking.
Cracking occurs when adjacent polygons displace the same shared vertices different ways for each polygon.
This can happen when the normals across a surface aren’t continuous, or if there is a discontinuity either in how the displacement texture is mapped to the surface or in the displacement texture itself.
I implemented an optional, somewhat brute-force solution to displacement cracking.
If crack removal is enabled, Takua analyzes the mesh at displacement time and records how many different ways each vertex in the mesh has been displaced by different faces, along with which faces want to displace that vertex.
After an initial displacement pass, the crack remover then goes back and for every vertex that is displaced more than one way, all of the displacements are averaged into a single displacement, and all faces that use that vertex are updated to share the same averaged result.
This approach requires a fair amount of bookkeeping and pre-analysis of the displaced mesh, but it seems to work well.
Figure 6 is a render of two cubes with geometric normals assigned per face.
The two cubes are displaced using the same checkerboard displacement pattern, but the cube on the left has crack removal disabled, while the cube on the right has crack removal enabled:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/crackedcube.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/crackedcube.jpg" alt="Figure 6: Displaced cubes with and without crack elimination." /></a></p>
<p>In most cases, the crack removal system seems to work pretty well.
However, the system isn’t perfect; sometimes, stretching artifacts can appear, especially with surfaces with a textured base color.
This stretching happens because the crack removal system basically stretches micropolygons to cover the crack.
This texture stretching can be seen in some parts of the shaderballs in Figures 5, 7, and 8 in this post.</p>
<p>Takua automatically recalculates normals for subdivided/displaced polygons.
By default, Takua simply uses the geometric normal as the shading normal for displaced polygons; however, an option exists to calculate smooth normals for the shading normals as well.
I chose to use geometric normals as the default with the hope that for subpixel subdivision and displacement, a different shading normal wouldn’t be as necessary.</p>
<p>In the future, I may choose to implement my own subdivision library, and I should probably also put more thought into some kind of proper combined tessellation cache and eviction strategy for better memory efficiency.
For now though, everything seems to work well and renders relatively efficiently; the non-ocean renders in this post all have sub-pixel subdivision with millions of polygons and each took several hours to render at 4K (3840x2160) resolution on a machine with dual Intel Xeon X5675 CPUs (12 cores total).
The two ocean renders I let run overnight at 1080p resolution; they took longer to converge mostly due to the depth of field.
All renders in this post were shaded using a new, vastly improved shading system that I’ll write about at a later point.
Takua can now render a lot more complexity than before!</p>
<p>In closing, I rendered a few more shaderballs using various displacement maps from the Megascans library, seen in Figures 7 and 8.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/shaderspheres_0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/shaderspheres_0.jpg" alt="Figure 7: A pebble sphere and a leafy sphere. Note the overhangs on the leafy sphere, which are only possible using vector displacement." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2017/May/shaderspheres_1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2017/May/preview/shaderspheres_1.jpg" alt="Figure 8: A compacted sand sphere and a stone sphere. Unfortunately, there is some noticeable texture stretching on the compacted sand sphere where crack removal occured." /></a></p>
<p><strong>References</strong></p>
<p>Edwin E. Catmull and James H. Clark. 1978. <a href="https://www.sciencedirect.com/science/article/abs/pii/0010448578901100">Recursively Generated B-spline Surfaces on Arbitrary Topological Meshes</a>. <em>Computer-Aided Design</em>. 10, 6 (1978), 350-355.</p>
<p>Mark Halstead, Michael Kass, and Tony DeRose. 1993. <a href="https://dl.acm.org/doi/10.1145/166117.166121">Efficient, Fair Interpolation using Catmull-Clark Surfaces</a>. In <em>SIGGRAPH 1993: Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques</em>. 35-44.</p>
<p>Johannes Hanika, Alexander Keller, and Hendrik P A Lensch. 2010. <a href="https://dl.acm.org/citation.cfm?id=1839241">Two-Level Ray Tracing with Reordering for Highly Complex Scenes</a>. In <em>GI 2010 (Proceedings of the 2010 Conference on Graphics Interfaces)</em>. 145-152.</p>
<p>Takahiro Harada. 2015. <a href="https://www.crcpress.com/GPU-Pro-6-Advanced-Rendering-Techniques/Engel/p/book/9781482264616">Rendering Vector Displacement Mapped Surfaces in a GPU Ray Tracer</a>. In <em>GPU Pro 6</em>. 459-474.</p>
<p>Matthias Nießner, Charles Loop, Mark Meyer, and Tony DeRose. 2012. <a href="https://dl.acm.org/doi/10.1145/2077341.2077347">Feature Adaptive GPU Rendering of Catmull-Clark Subdivision Surfaces</a>. <em>ACM Transactions on Graphics</em>. 31, 1 (2012), 6:1-6:11.</p>
<p>Matt Pharr and Pat Hanrahan. 1996. <a href="http://graphics.stanford.edu/papers/displace/">Geometry Caching for Ray-Tracing Displacement Maps</a>. In <em>Rendering Techniques 1996 (Proceedings of the 7th Eurographics Workshop on Rendering)</em>. 31-40.</p>
<p>Brian Smits, Peter Shirley, and Michael M. Stark. 2000. <a href="https://doi.org/10.1007/978-3-7091-6303-0_28">Direct Ray Tracing of Displacement Mapped Triangles</a>. In <em>Rendering Techniques 2000 (Proceedings of the 11th Eurographics Workshop on Rendering)</em>. 307-318.</p>
https://blog.yiningkarlli.com/2016/11/moana.html
Moana
2016-11-17T00:00:00+00:00
2016-11-17T00:00:00+00:00
Yining Karl Li
<p>2016 is the first year ever that <a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a> is releasing two CG animated films. We released <a href="http://www.disneyanimation.com/projects/moanaopia">Zootopia</a> back in March, and next week, we will be releasing our newest film, <a href="http://www.disneyanimation.com/projects/moana">Moana</a>. I’ve spent the bulk of the last year and a half working as part of Disney’s <a href="http://www.disneyanimation.com/technology/innovations/hyperion">Hyperion Renderer</a> team on a long list of improvements and new features for Moana. Moana is the first film I have an official credit on, and I couldn’t be more excited for the world to see what we have made!</p>
<p>We’re all incredibly proud of Moana; the story is fantastic, the characters are fresh and deep and incredibly appealing, and the music is an instant classic. Most important for a rendering guy though, I think Moana is flat out the best looking animated film anyone has ever made. Every single department on this film really outdid themselves. The technology that we had to develop for this film was staggering; we have a whole new distributed fluid simulation package for the endless oceans in the film, we added advanced new lighting capabilities to Hyperion that have never been used in an animated film before to this extent (to the best of my knowledge), we made huge advances in our animation technology for characters such as Maui; the list goes on and on and on. Something like 85% of the shots in this movie have significant FX work in them, which is unheard of for animated features.</p>
<p>Hyperion gained a number of major new capabilities in support of making Moana.
Rendering the ocean was a major concern on Moana, so much of Hyperion’s development during Moana revolved around features related to rendering water.
Our lighters wanted caustics in all shots with shallow water, such as shots set at the beach or near the shoreline; faking caustics was quickly ruled out as an option since setting up lighting rigs with fake caustics that looked plausible and visually pleasing proved to be difficult and laborious.
We found that providing real caustics was vastly preferable to faking things, both from a visual quality standpoint and an artist workflow standpoint, so we wound up adding a photon mapping system to Hyperion.
The design of the photon mapping system is highly optimized around handling sun-water caustics, which allows for some major performance optimizations, such as an adaptive photon distribution system that makes sure that photons are not wasted on off-camera parts of the scene.
Most of the photon mapping system was written by Peter Kutz; I also got to work on the photon mapping system a bit.</p>
<p>Water is in almost every shot in the film in some form, and the number of water effects was extremely varied, ranging from the ocean surface going out for dozens of miles in every direction, to splashes and boat wakes <a href="https://dl.acm.org/citation.cfm?id=3073597">[Stomakhin and Selle 2017]</a> and other finely detailed effects.
Water had to be created using a host of different techniques, from relatively simple procedural wave functions <a href="https://dl.acm.org/citation.cfm?id=3005379">[Garcia et al. 2016]</a>, to hand-animatable rigged wave systems <a href="https://dl.acm.org/citation.cfm?doid=3084363.3085056">[Byun and Stomakhin 2017]</a>, all the way to huge complex fluid simulations using Splash, a custom in-house APIC-based fluid simulator <a href="https://dl.acm.org/citation.cfm?id=2766996">[Jiang et al. 2015]</a>.
We even had to support water as a straight up rigged character <a href="https://dl.acm.org/citation.cfm?id=3085091">[Frost et al. 2017]</a>!
In order to bring the results of all of these techniques together into a single renderable water surface, an enormous amount of effort was put into building a level-set compositing system, in which all water simulation results would be converted into signed distance fields that could then be combined and converted into a watertight mesh.
Having a single watertight mesh was important, since the ocean often also contained a homogeneous volume to produce physically correct scattering.
This is where all of the blues and the greens in ocean water come from.
This entire system could be run by Hyperion at rendertime, or could be run offline beforehand to generate a cached result that Hyperion could load; a whole complex pipeline had to be built to support this capability <a href="https://dl.acm.org/citation.cfm?id=3085067">[Palmer et al. 2017]</a>.
Building this level-set compositing and meshing system involved a large number of TDs and engineers; on the Hyperion side, this project was led by Ralf Habel, Patrick Kelly, and Andy Selle.
Peter and I also helped out at various points.</p>
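<p>To make the level-set compositing idea a bit more concrete, here is a heavily simplified sketch of the most basic version of the operation: several water surfaces, each sampled as a signed distance field on a shared dense grid, are unioned into a single field by taking the per-voxel minimum, after which the composited field can be meshed (for example with marching cubes) into one watertight surface. The grid layout and names here are invented for illustration; the production system described above handles far more than a simple union on a dense grid.</p>

```cpp
// Simplified sketch of level-set compositing via per-voxel union (minimum).
// Illustrative only; assumes all inputs share the same grid dimensions.
#include <algorithm>
#include <cstddef>
#include <vector>

struct LevelSetGrid {
    std::size_t nx, ny, nz;
    std::vector<float> values;  // signed distance per voxel, negative = inside the water
};

// The union of signed distance fields is the pointwise minimum of their values.
LevelSetGrid compositeUnion(const std::vector<LevelSetGrid>& inputs) {
    LevelSetGrid out = inputs.front();
    for (std::size_t i = 1; i < inputs.size(); ++i) {
        for (std::size_t v = 0; v < out.values.size(); ++v) {
            out.values[v] = std::min(out.values[v], inputs[i].values[v]);
        }
    }
    return out;  // ready to be meshed into a single watertight surface
}
```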
<p>At one point early on in the film’s production, we noticed that our lighters were having a difficult time getting specular glints off of the ocean surface to look right.
For artistic controllability reasons, our lighters prefer to keep the sun and the skydome as two separate lights; the skydome is usually an image-based light that is either painted or is from photography with the sun painted out, and the sun is usually a distant infinite light that subtends some solid angle.
After a lot of testing, we found that the look of specular glints on the ocean surface comes partially from the sun itself, but also partially from the atmospheric scattering that makes the sun look hazy and larger in the sky than it actually is.
To get this look, I added a system to analytically add a Mie-scattering halo around our distant lights; we called the result the “halo light”.</p>
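<p>Purely as an illustration of the structure of such a light (the actual analytic halo Hyperion uses is described in the Hyperion TOG paper mentioned in the addendum below, not here), a “sun plus halo” distant light can be thought of as a radiance function of the angle between the query direction and the sun direction: inside the solar disk you return the sun’s radiance, and outside you return an analytic haze term standing in for forward Mie scattering. The exponential falloff below is just a placeholder assumption.</p>

```cpp
// Illustrative "sun + halo" distant light; the exponential halo falloff here is a
// placeholder assumption, not the actual analytic Mie halo used in Hyperion.
#include <cmath>

struct SunHaloLight {
    float sunAngularRadius;  // half-angle of the solar disk, in radians
    float sunRadiance;       // radiance returned for directions inside the disk
    float haloRadiance;      // halo radiance at the edge of the disk
    float haloFalloff;       // how quickly the halo fades with angle (assumed shape)

    // theta is the angle (in radians) between the query direction and the sun direction.
    float emittedRadiance(float theta) const {
        if (theta <= sunAngularRadius) {
            return sunRadiance;
        }
        // Placeholder analytic halo: exponential decay away from the edge of the disk.
        return haloRadiance * std::exp(-haloFalloff * (theta - sunAngularRadius));
    }
};
```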
<p>Up until Moana, Hyperion actually never had proper importance sampling for emissive meshes; we just relied on paths randomly finding their way to emissive meshes and only worried about importance sampling analytical area lights and distant infinite lights.
For shots with the big lava monster Te-Ka <a href="https://dl.acm.org/citation.cfm?id=3085076">[Bryant et al. 2017]</a>, however, most of the light in the frame came from emissive lava meshes, and most of what was being lit were complex, dense smoke volumes.
Peter added a highly efficient system for importance sampling emissive meshes into the renderer, which made Te-Ka shots go from basically un-renderable to not a problem at all.
David Adler also made some huge improvements to our denoiser’s ability to handle volumes to help with those shots.</p>
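<p>For readers curious what importance sampling an emissive mesh involves at all, here is a small sketch of one standard textbook approach: build a discrete distribution over the mesh’s triangles weighted by area times emitted power, pick a triangle proportionally to that weight, and then pick a uniform point on the chosen triangle. This is generic and is not a description of the system Peter built for Hyperion; all of the names below are invented for illustration.</p>

```cpp
// Generic sketch of emissive mesh importance sampling: choose a triangle with
// probability proportional to (area x emitted power), then sample a uniform
// point on that triangle. Illustrative only.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
inline Vec3 operator-(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator+(const Vec3& a, const Vec3& b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(const Vec3& a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline float length(const Vec3& a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

struct EmissiveTriangle { Vec3 p0, p1, p2; float emittedPower; };

struct EmissiveMeshSampler {
    std::vector<EmissiveTriangle> triangles;
    std::vector<float> cdf;  // normalized cumulative (area x power) weights

    void build() {
        cdf.resize(triangles.size());
        float total = 0.0f;
        for (std::size_t i = 0; i < triangles.size(); ++i) {
            const EmissiveTriangle& t = triangles[i];
            float area = 0.5f * length(cross(t.p1 - t.p0, t.p2 - t.p0));
            total += area * t.emittedPower;
            cdf[i] = total;
        }
        for (float& c : cdf) c /= total;
    }

    // u0 selects the triangle; (u1, u2) select a uniform point on it.
    Vec3 samplePoint(float u0, float u1, float u2) const {
        std::size_t i = static_cast<std::size_t>(
            std::lower_bound(cdf.begin(), cdf.end(), u0) - cdf.begin());
        const EmissiveTriangle& t = triangles[std::min(i, triangles.size() - 1)];
        float su1 = std::sqrt(u1);
        float b0 = 1.0f - su1;  // uniform barycentric coordinates
        float b1 = u2 * su1;
        return t.p0 * b0 + t.p1 * b1 + t.p2 * (1.0f - b0 - b1);
    }
};
```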
<p>Hyperion also saw a huge number of other improvements during Moana: Dan Teece and Matt Chiang made numerous improvements to the shading system, I reworked the ribbon curve intersection system to robustly handle Heihei’s and hawk-Maui’s feathers, Greg Nichols made our camera-adaptive tessellation more robust, and the team in general made many speed and memory optimizations.
Throughout the whole production cycle, Hyperion partnered really closely with production to make Moana the most beautiful animated film we’ve ever made.
This close partnership is what makes working at Disney Animation such an amazing, fun, and interesting experience.</p>
<p>The first section of the credits sequence in Moana showcases a number of the props that our artists made for the film. I highly recommend staying and staring at all of the eye candy; our look and modeling departments are filled with some of the most dedicated and talented folks I’ve ever met. The props in the credits have simply preposterous amounts of detail on them; every single prop has stuff like tiny little flyaway fibers or microscratches or imperfections or whatnot on them. In some of the international posters, one can see that all of the human characters are covered with fine peach fuzz (an important part of making their skin catch the sunlight correctly), which we rendered in every frame! Something that we’re really proud of is the fact that <em>none of the credit props were specially modeled for the credits</em>! Those are all the exact props we used in every frame that they show up in, which really is a testament to both how amazing our artists are and how much work we’ve put into every part of our technology. The vast majority of production for Moana happened in essentially the 9 months between Zootopia’s release in March and October of the same year; this timeline becomes even more astonishing given the sheer beauty and craftsmanship in Moana.</p>
<p>Below are a number of stills (in no particular order) from the movie, 100% rendered using Hyperion.
These stills give just a hint at how beautiful this movie looks; definitely go see it on the biggest screen you can find!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_01.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_01.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_20.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_20.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_12.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_12.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_14.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_14.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_13.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_13.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_04.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_05.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_05.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_06.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_06.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_38.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_38.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_07.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_07.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_08.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_08.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_10.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_10.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_11.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_11.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_09.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_09.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_03.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_03.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_02.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_44.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_44.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_16.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_16.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_17.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_17.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_19.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_19.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_35.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_35.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_37.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_37.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_21.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_21.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_22.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_22.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_43.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_43.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_23.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_23.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_24.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_24.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_25.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_25.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_26.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_26.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_27.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_27.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_28.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_28.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_29.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_29.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_30.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_30.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_31.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_31.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_32.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_32.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_15.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_15.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_33.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_33.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_34.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_34.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_18.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_18.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_45.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_45.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_36.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_36.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_39.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_39.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_40.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_40.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_41.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_41.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_42.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_42.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_46.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_46.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_47.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_47.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_48.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_48.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_50.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_50.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_49.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_49.jpg" alt="" /></a></p>
<p>Here is a credits frame with my name that Disney kindly provided! Most of the Hyperion team is grouped under the Rendering/Pipeline/Engineering Services (three separate teams under the same manager) category this time around, although a handful of Hyperion guys show up in an earlier part of the credits instead.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_credits.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Nov/WAKA_credits.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p><strong>Addendum 2018-08-18</strong>: A lot more detailed information about the photon mapping system, the level-set compositing system, and the halo light is now available as part of our recent TOG paper on Hyperion <a href="https://dl.acm.org/citation.cfm?id=3182159">[Burley et al. 2018]</a>.</p>
<p><strong>References</strong></p>
<p>Marc Bryant, Ian Coony, and Jonathan Garcia. 2017. <a href="https://dl.acm.org/citation.cfm?id=3085076">Moana: Foundation of a Lava Monster</a>. In <em>ACM SIGGRAPH 2017, Talks</em>. 10:1-10:2.</p>
<p>Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. <a href="https://dl.acm.org/citation.cfm?id=3182159">The Design and Evolution of Disney’s Hyperion Renderer</a>. <em>ACM Transactions on Graphics</em>. 37, 3 (2018), 33:1-33:22.</p>
<p>Dong Joo Byun and Alexey Stomakhin. 2017. <a href="https://dl.acm.org/citation.cfm?doid=3084363.3085056">Moana: Crashing Waves</a>. In <em>ACM SIGGRAPH 2017, Talks</em>. 41:1-41:2.</p>
<p>Ben Frost, Alexey Stomakhin, and Hiroaki Narita. 2017. <a href="https://dl.acm.org/citation.cfm?id=3085091">Moana: Performing Water</a>. In <em>ACM SIGGRAPH 2017, Talks</em>. 30:1-30:2.</p>
<p>Jonathan Garcia, Sara Drakeley, Sean Palmer, Erin Ramos, David Hutchins, Ralf Habel, and Alexey Stomakhin. 2016. <a href="https://dl.acm.org/citation.cfm?id=3005379">Rigging the Oceans of Disney’s Moana</a>. In <em>ACM SIGGRAPH Asia 2016, Technical Briefs</em>. 30:1-30:4.</p>
<p>Chenfanfu Jiang, Craig Schroeder, Andrew Selle, Joseph Teran, and Alexey Stomakhin. 2015. <a href="https://dl.acm.org/citation.cfm?id=2766996">The Affine Particle-in-Cell Method</a>. <em>ACM Transactions on Graphics</em>. 34, 4 (2015), 51:1-51:10.</p>
<p>Sean Palmer, Jonathan Garcia, Sara Drakeley, Patrick Kelly, and Ralf Habel. 2017. <a href="https://dl.acm.org/citation.cfm?id=3085067">The Ocean and Water Pipeline of Disney’s Moana</a>. In <em>ACM SIGGRAPH 2017, Talks</em>. 29:1-29:2.</p>
<p>Alexey Stomakhin and Andy Selle. 2017. <a href="https://dl.acm.org/citation.cfm?id=3073597">Fluxed Animated Boundary Method</a>. <em>ACM Transactions on Graphics</em>. 36, 4 (2017), 68:1-68:8.</p>
https://blog.yiningkarlli.com/2016/09/pbrtv3.html
Physically Based Rendering 3rd Edition
2016-09-30T00:00:00+00:00
2016-09-30T00:00:00+00:00
Yining Karl Li
<p>Today is the release date for the digital version of the new <a href="https://www.amazon.com/Physically-Based-Rendering-Theory-Implementation-ebook/dp/B01M013UX1/ref=mt_kindle?_encoding=UTF8&me=">Physically Based Rendering 3rd Edition</a>, by <a href="http://pharr.org/matt/">Matt Pharr</a>, <a href="https://rgl.epfl.ch/people/wjakob">Wenzel Jakob</a>, and <a href="https://twitter.com/humper">Greg Humphreys</a>.
As anyone in the rendering world knows, Physically Based Rendering is THE reference book for the field; for novices, Physically Based Rendering is the best introduction one can get to the field, and for experts, Physically Based Rendering is an invaluable reference book to consult and check.
I share a large office with three other engineers on the Hyperion team, and I think between the four of us, we actually have an average of more than one copy per person (of varying editions).
I could not recommend this book enough.
The third edition adds Wenzel Jakob as an author; Wenzel is the author of the research-oriented <a href="http://www.mitsuba-renderer.org">Mitsuba Renderer</a> and is one of the most prominent new researchers in rendering in the past decade.
There is a lot of great new light transport stuff in the third edition, which I’m guessing comes from Wenzel.
Both Wenzel’s work and the previous editions of Physically Based Rendering were instrumental in influencing my path in rendering, so of course I’ve already had the third edition on pre-order since it was announced over a year ago.</p>
<p>Each edition of Physically Based Rendering is accompanied by a major release of the <a href="https://github.com/mmp/pbrt-v3">PBRT renderer</a>, which implements the book.
The PBRT renderer is a major research resource for the community, and basically everyone I know in the field has learned something or another from looking through and taking apart PBRT.
As part of the drive towards PBRT-v3, Matt Pharr made a call for interesting scenes to provide as demo scenes with the PBRT-v3 release.
I offered Matt the PBRT-v2 scene I <a href="http://blog.yiningkarlli.com/2015/03/bsdf-system.html">made a while back</a>, because how could that scene <em>not</em> be rendered in PBRT?
I’m very excited that Matt accepted and included the scene as part of PBRT-v3’s example scenes!
You can find the example scenes <a href="http://pbrt.org/scenes-v3.html">here on the PBRT website</a>.</p>
<p>Converting the scene to PBRT’s format required a lot of manual work, since PBRT’s scene specification and shading system is very different from Takua’s.
As a result, the image that PBRT renders out looks slightly different from Takua’s version, but that’s not a big deal.
Here is the scene rendered using PBRT-v3:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Sep/pbrtv2_pbrtv3.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Sep/pbrtv2_pbrtv3.jpg" alt="Physically Based Rendering 2nd Edition, rendered using PBRT-v3." /></a></p>
<p>…and for comparison, the same scene rendered using Takua:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Sep/pbrtv2_takua.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Sep/pbrtv2_takua.png" alt="Physically Based Rendering 2nd Edition, rendered using Takua Renderer a0.5." /></a></p>
<p>Really, it’s just the lighting that is a bit different; the Takua version is slightly warmer and slightly underexposed in comparison.</p>
<p>At some point I should make an updated version of this scene using the third edition book.
I’m hoping to be able to contribute more of my Takua test scenes to the community in PBRT-v3 format in the future; giving back to such a major influence on my own career is extremely important.
As part of the process of porting the scene over to PBRT-v3, I also wrote a super-hacky render viewer for PBRT that shows the progress of the render as the renderer runs.
Unfortunately, this viewer is mega-hacky, and I don’t have time at the moment to clean it up and release it.
Hopefully at some point I’ll be able to; alternatively, if anyone else wants to take a look and give it a stab, feel free to contact me.</p>
<hr />
<p><strong>Addendum 04/28/2017</strong>: Matt was recently looking for some interesting water-sim scenes to demonstrate dielectrics and glass materials and refraction and whatnot.
I contributed a few frames from <a href="http://yiningkarlli.com/projects/arielflip.html">my PIC/FLIP fluid simulator, Ariel</a>.
Most of the data from Ariel doesn’t exist in meshed format anymore; I still have all of the raw VDBs and stuff, but the meshes took up way more storage space than I could afford at the time.
I can still regenerate all of the meshes though, and I also have a handful of frames in mesh form still from my <a href="http://blog.yiningkarlli.com/2015/06/attenuated-transmission.html">attenuated transmission blog post</a>.
The frame from the first image in that post is now also included in the PBRT-v3 <a href="http://pbrt.org/scenes-v3.html">example scene suite</a>!
The PBRT version looks very different since it is intended to demonstrate and test something very different from what I was doing in that blog post, but it still looks great!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Sep/ariel_pbrtv3.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Sep/ariel_pbrtv3.jpg" alt="A frame from my Ariel fluid simulator, rendered using PBRT-v3." /></a></p>
https://blog.yiningkarlli.com/2016/07/minecraft-in-renderman-ris.html
Rendering Minecraft in Renderman/RIS
2016-07-22T00:00:00+00:00
2016-07-22T00:00:00+00:00
Yining Karl Li
<p>The vast majority of my computer graphics time is spent developing renderers (Disney’s Hyperion renderer as a professional, Takua Renderer as a hobbyist). However, I think having experience using renderers as an artist is an important part of knowing what to focus on as a renderer developer. I also think that knowing how a variety of different renderers work and how they are used is important; a lot of artists are used to using several different renderers, and each renderer has its own vocabulary and tried and true workflows and whatnot. Finally, there are a lot of really smart people working on all of the major production renderers out there, and seeing the cool things everyone is doing is fun and interesting! Because of all of these reasons, I like putting some time aside every once in a while to tinker with other renderers. I usually don’t write about my art projects that much anymore, but this project was particularly fun and produced some nice looking images, so I thought I’d write it up. As usual, before we dive into the post, here is the final image I made, rendered using Pixar’s Photorealistic Renderman 20 in RIS mode:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_final_comp.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_final_comp.jpg" alt="A Minecraft town from the pve.nerd.nu Minecraft server, rendered in Renderman 20/RIS." /></a></p>
<p>About two years ago, Pixar’s Photorealistic Renderman got a new rendering mode called RIS. PRman was one of the first production renderers ever developed, and historically PRman has always been a <a href="http://graphics.pixar.com/library/Reyes/">REYES-style rasterization</a> renderer. Over time though, PRman has gained a whole bunch of added on features. At the time of Monsters University, PRman was actually a kind of hybrid rasterizer and raytracer; the rendering system on Monsters University used raytracing to build a <a href="http://graphics.pixar.com/library/RadiosityCaching/">multiresolution radiosity cache</a> that was then used for calculating GI contributions in the shading part of REYES rasterization. That approach worked well and produced beautiful images, but it was also really complicated and had a number of drawbacks! RIS replaces all of that with a brand new, pure pathtracing system. In fact, while RIS is marketed as a new mode in PRman, RIS is actually a completely new renderer written almost completely from scratch; it just happens to be able to read Renderman RIB files as input.</p>
<p>Recently, I wanted to try rendering a Minecraft world from a Minecraft server that I play on. There are a lot of great Minecraft rendering tools available these days (<a href="http://chunky.llbit.se/gallery.html">Chunky</a> comes to mind), but I wanted much more production-like control over the look of the render, so I decided to do everything using a normal CG production workflow instead of a prebuilt dedicated Minecraft rendering tool. I thought that I would use the project as a chance to give RIS a spin. At Cornell’s Program of Computer Graphics, Pixar was kind enough to provide us with access to the Renderman 19 beta program, which included the first version of RIS. I tinkered with the PRman 19 beta quite a lot at Cornell, and being an early beta, RIS had some bugs and incomplete bits back then. Since then though, the Renderman team has followed up PRman 19 with versions 20 and 21, which introduced a number of new features and speed/stability improvements to RIS. Since joining the Hyperion team, I’ve had the chance to meet and talk to various (really smart!) folks on the Renderman team since they are a sister team to us, but I haven’t actually had time to try the new versions of RIS. This project was a fun way to try the newest version of RIS on my own!</p>
<p>The Minecraft data for this project comes from the <a href="http://nerd.nu">Nerd.nu community Minecraft server</a>, which is run by a collective of players for free. I’ve been playing on the Nerd.nu PvE (Player versus Environment) server for years and years now, and players have built a mind-boggling number of amazing detailed creations. Every couple of months, the server is reset with a fresh map; I wanted to render a town that fellow player Avi_Dangerstein and I built on the previous map revision. Fortunately, all previous Nerd.nu map revisions are available for download in the <a href="http://mcp-dl.com/">server archives</a> (the specific map I used is labeled pve-rev17). Here is an overview of the map revision I wanted to pull data from:</p>
<p><a href="http://redditpublic.com/carto/pve/p17/carto/#/-24/64/176/-6/0/0"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/cartograph.jpg" alt="Cartograph view of Revision 17 of the Nerd.nu PvE server, located at p.nerd.nu. Click through to go to the full, zoomable cartograph." /></a></p>
<p>…and here is a zoomed in view of the part of the map that contains our town. The vast majority of the town was built by two players over the course of about 4 months. Our town is about 250 blocks long; the entire server map is a 6000 block by 6000 block square.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/cartograph_zoomed.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/cartograph_zoomed.jpg" alt="Zoomed cartograph view of our Minecraft town." /></a></p>
<p>The first problem to tackle in this project was just getting Minecraft world data into a usable format. Pixar provides a free, non-commercial version of Renderman for Maya, and I’m very familiar with Maya, so my entire workflow for this project was based around good ol’ Maya. Maya doesn’t know how to read Minecraft data though… in fact, Minecraft’s <a href="http://minecraft.gamepedia.com/Chunk_format">chunked data format</a> is a fascinating rabbit hole to read about in its own right. I briefly entertained the idea of writing my own Minecraft to Maya importer, but then I found a number of Minecraft to Obj exporters that other folks have already written. I first tried <a href="https://github.com/jmc2obj/j-mc-2-obj">jmc2obj</a>, but the section of the Minecraft world that I wanted to export was so large that jmc2obj kept running out of memory and crashing. Instead, I found that <a href="http://erich.realtimerendering.com">Eric Haines</a>’s <a href="http://www.realtimerendering.com/erich/minecraft/public/mineways/">Mineways</a> exporter was able to handle the data load well (incidentally, Eric Haines is also a Cornell Program of Computer Graphics alum; I inherited a pile of his ACM Transactions on Graphics hardcopies while at Cornell). The chunk of the world I wanted to export was pretty large; in the Mineways screenshot below, the area outlined in red is the part of the world that I wanted:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/mineways_section.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/mineways_section.jpg" alt="Section of the map for export is outlined in red." /></a></p>
<p>The area outlined above is significantly larger than the area I wound up rendering… initially I was thinking of a very different camera angle from the ground with the mountains in the background before I picked an aerial view much later. The size of the exported obj mesh was about 1.5 GB. Mineways exports the world as a single mesh, optimized to remove all completely occluded internal faces (so the final mesh is hollow instead of containing useless faces for all of the internal blocks). Each visible block face is uv’d into a corresponding square on a single texture file. This approach produces an efficient mesh, but I realized early on that I would need water in a separate mesh containing completely enclosed volumes for each body of water (Mineways only provides geometry for the top surface of water). Glass had to be handled similarly; both water and glass need special handling for the same reasons that I mentioned immediately after the first diagram in my <a href="http://blog.yiningkarlli.com/2015/06/attenuated-transmission.html">attenuated transmission blog post</a>. Mineways allows for exporting different block types as separate meshes (but still with internal faces removed), so I simply deleted the water and glass meshes after exporting. Luckily, jmc2obj allows exporting individual block types as closed meshes, so I went back to jmc2obj for just the water and glass. Since just the water and glass is a much smaller data set than the whole world, jmc2obj was able to export without a problem. Since rendering refractive interfaces correctly requires expanding out the refractive mesh slightly at the interfaces, I wrote a custom program based on Takua Renderer’s obj mesh processing library to push out all of the vertices of the water and glass meshes slightly along the average of the face normals at each vertex.</p>
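<p>The push-out step itself is conceptually simple; as a rough reconstruction of the idea (not the actual Takua-based tool), the sketch below averages the geometric normals of the faces around each vertex and then offsets the vertex slightly along that averaged direction, so that the water and glass volumes end up slightly overlapping the surrounding blocks at the refractive interfaces. All of the names here are invented for illustration.</p>

```cpp
// Rough reconstruction of the mesh "push out" step: offset every vertex along the
// average of its adjacent face normals. Illustrative only; not the actual tool.
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
inline Vec3 operator-(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator+(const Vec3& a, const Vec3& b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(const Vec3& a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline Vec3 normalize(const Vec3& a) {
    float len = std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
    return len > 0.0f ? a * (1.0f / len) : a;
}

struct Triangle { std::size_t i0, i1, i2; };

void pushOutAlongAveragedNormals(std::vector<Vec3>& vertices,
                                 const std::vector<Triangle>& faces,
                                 float offset) {
    // Accumulate the unit normal of each face onto its three vertices.
    std::vector<Vec3> accumulated(vertices.size(), Vec3{0.0f, 0.0f, 0.0f});
    for (const Triangle& f : faces) {
        Vec3 n = normalize(cross(vertices[f.i1] - vertices[f.i0],
                                 vertices[f.i2] - vertices[f.i0]));
        accumulated[f.i0] = accumulated[f.i0] + n;
        accumulated[f.i1] = accumulated[f.i1] + n;
        accumulated[f.i2] = accumulated[f.i2] + n;
    }
    // Offset each vertex slightly along its averaged (renormalized) normal.
    for (std::size_t v = 0; v < vertices.size(); ++v) {
        vertices[v] = vertices[v] + normalize(accumulated[v]) * offset;
    }
}
```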
<p>Next up was shading everything in Maya. Renderman 20 ships with an implementation of <a href="https://disney-animation.s3.amazonaws.com/library/s2012_pbs_disney_brdf_notes_v2.pdf">Disney’s Principled Brdf</a>, which I’ve gotten very familiar with using, so I went with Renderman’s PxrDisney Bxdf. The Disney Brdf allows for quickly creating very good looking materials using a fairly small parameter set. Overall I tried to stick close to the in-game aesthetic, which meant using all of the standard in-game textures instead of a custom resource pack, and I also wound up having to rein back a bit on making materials look super realistic. Everything basically has some varied roughness and specularity, and that’s pretty much it. I did add a subtle bump map to everything though; I made the bump map by simply making a black and white version of the default texture pack and messing with the brightness and contrast a bit. Below is a render of a <a href="http://www.minecraftforum.net/forums/mapping-and-modding/resource-packs/1243823-qmagnets-test-map-for-resource-packs-and-map">test world</a> created by Minecraft Forum user QMagnet specifically for testing resource packs. I lit the test scene using a single IBL (<a href="http://hdri-skies.com/shop/hdri-sky-141/">HDRI Sky 141 from the HDRI-Skies library</a>). The test render below isn’t using the final specialized water and leaf shaders I created, which I’ll describe a bit further down, and there are also some resolution problems on the alpha masks for the leaf blocks, but overall this test render gives an idea of what my final materials look like:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/materialtest.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/materialtest.jpg" alt="Final materials on a resource pack test world." /></a></p>
<p>One detail worth going into a bit more depth on is the glowing blocks. The glowstone, lantern, and various torch blocks use a trick based on something that I have seen lighters use in production. The basic idea is to decouple the direct and indirect visibility for the light. I got this decoupling to work in RIS by making all of the glowing blocks into pairs of textured PxrMeshLights. Using PxrMeshLights is necessary in order to allow for efficient light sampling; however, the actual exposures the lights are at make the textures blow out in camera. In order to make the textures discernible in camera, a second PxrMeshLight is needed for each glowing object; one of the lights is at the correct exposure but is marked visible only to indirect rays and invisible to direct camera rays, and the other light is at a much lower exposure and is only visible to direct camera rays. This trick is a totally non-physical cheaty-hack, but it allows for a believable visual appearance if the exposures are chosen carefully.</p>
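<p>Conceptually, the decoupling means the renderer sees two different emission values for the same glowing block depending on ray type; the tiny sketch below expresses that idea as plain code rather than as the actual pair of PxrMeshLights and visibility settings used in the RIS scene, and the names and structure are made up purely for illustration.</p>

```cpp
// Conceptual illustration of decoupled direct/indirect emission for a glowing
// block; this stands in for the pair of PxrMeshLights described above and is
// not how the trick is actually authored in RIS.
enum class RayType { Camera, Indirect };

struct GlowingBlockEmission {
    float fullExposureScale;    // used for indirect illumination of the scene
    float cameraExposureScale;  // much lower, used only for what the camera sees

    // textureValue is the emissive texture lookup at the hit point.
    float emittedRadiance(RayType type, float textureValue) const {
        float scale = (type == RayType::Camera) ? cameraExposureScale : fullExposureScale;
        return textureValue * scale;
    }
};
```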
<p>In the final renders a few pictures down, I also used a more specialized shader for leaves and vines and tall grass and whatnot. The leaf block shader uses a PxrLMPlastic material instead of PxrDisney; this is because the leaf block shader has a slight amount of diffuse transmission (translucency) and also has more specialized diffuse/specular roughness maps.</p>
<p>For the water shader in the final render, I used a PxrLMGlass material with an IOR of 1.325, a slightly blue tinted refraction color, and a light blue absorption color. Using slightly different colors for the refraction and absorption colors allows for the water to transition to a slightly different hue at deeper depths than at the surface (as opposed to just a more saturated version of the same color). I also added a simple water surface displacement map to get the wavy surface effect. Combined with the refractive interface stuff mentioned before, the final water looks like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/watertest.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/watertest.jpg" alt="Water test render, using a PxrLMGlass material. Unfortunately, no true caustics here..." /></a></p>
<p>Note the total lack of real caustics in the water… I wound up just using the basic pathtracer built into RIS instead of Pixar’s VCM implementation. Pixar’s VCM implementation is one of the first commercial VCM implementations out there, but as of Renderman 20, it has no adaptivity in its light path distribution whatsoever. As a result, the Renderman 20 VCM integrator is not really suitable for use on huge scenes; most of the light paths end up getting wasted on areas of the scene that aren’t even close to being in-camera, which means that all of the efficiency in rendering caustics is lost. This problem is fundamental to lighttracing-based techniques (meaning that bidirectional techniques inherit the same problem), and solving it remains a relatively open problem (Takua has some basic photon targeting mechanisms for PPM/VCM that I’ll write about at some point). Apparently, this large-scene problem was a major challenge on Finding Dory and is one of the main reasons why Pixar didn’t use VCM heavily on Dory; Dory relied mostly on projected and pre-baked caustics.</p>
<p>I should also note that Renderman 21 does away with the PxrLM and PxrDisney materials entirely and instead introduces the shader set that Christophe Hery and Ryusuke Villemin wrote for Finding Dory. I haven’t tried the Renderman 21 shading system yet, but I would be very curious to compare against our Disney Brdf.</p>
<p>The final lighting setup I used was very simple. There are two main lights in the scene: an IBL dome light for sky illumination, and a 0.5 degree distant light as a sun stand-in. The IBL is another free sky from the HDRI-Skies library; this time, I used <a href="http://hdri-skies.com/shop/hdri-sky-084/">HDRI Sky 84</a>. There is also a third spotlight used for getting long, dramatic shadows out of the fog, which I’ll talk about a bit later. Here is a lighting test with just the dome and distant lights on a grey clay version of the scene:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/clay_lighting.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/clay_lighting.jpg" alt="Grey clay render lit using the final distant and dome light setup." /></a></p>
<p>For efficiency reasons, I broke out the fog into a separate pass entirely and added it back in comp afterwards. At the time that I did this project, Renderman 20’s volume system was still relatively new (Renderman 21 introduces a significantly overhauled, much faster volume system, but Renderman 21 wasn’t out yet when I did this project), and while perfectly capable, wasn’t necessarily super fast. Iterating on the look of the fog separately from the main render was simply a more efficient workflow. Here is the raw render directly out of RIS:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_main_pass.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_main_pass.jpg" alt="Raw render of the main render pass, straight out of RIS." /></a></p>
<p>For the fog, I initially wanted to do fully simulated fog in Houdini. I experimented with using a point SOP to control wind direction and to drive a wind DOP and have fog flow through the scene, but the sheer scale of the scene made this approach impracticable on my home computers. Instead, I wound up just creating a static procedural volume noise field and dumping it out to VDB. I then brought the VDB back into Maya for RIS rendering. Initially I tried rendering the fog pass without the additional spotlight and got something that looked like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_fog_pass_old.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_fog_pass_old.jpg" alt="My initial attempt at the fog pass." /></a></p>
<p>After getting this first fog attempt rendered, I did a first pass at a final comp and color grade. I wound up using a very different color grade on this earlier attempt. This earlier version is the version that I shared in some places, such as the <a href="http://www.reddit.com/r/mcpublic">Nerd.nu subreddit</a> and on Twitter:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_final_comp_oldversion.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_final_comp_oldversion.jpg" alt="First comp and grade attempt, using old version of fog." /></a></p>
<p>This first attempt looked okay, but didn’t quite hit what I was going for. I wanted something with much more dramatic shadow beams, and I also felt that the fog didn’t really look settled into the terrain. Eventually I realized that I needed to make the fog sparser and that the fog should start thinning out after rising just a bit off of the ground. After adjusting the fog and adding in a spotlight with a bit of a cooler temperature than the sun, I got the image below. I’m pretty happy with how the fog looks like it is settling in the river valley and is pouring out of the forested hill in the upper left of the image, even though none of the fog is actually simulated!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_fog_pass_final.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_fog_pass_final.jpg" alt="Final fog pass, with extra spotlight. Note how the fog seems to sit in the lower river valley and pour out of the forest." /></a></p>
<p>Finally, I brought everything together in comp and added a color grading pass in Lightroom. The grade that I went with is much much more heavy-handed than what I usually like to use, but it felt appropriate for this image. I also added a slight amount of vignetting and grain in the final image. The final image is at the top of this post, but here it is again for convenience:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/aerial_shot_final_comp.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/aerial_shot_final_comp.jpg" alt="Final composite with fog, color grading, and vignetting/grain." /></a></p>
<p>I learned a lot about using RIS from this project! By my estimation, RIS is orders of magnitude easier to use than old REYES Renderman; the overall experience was fairly similar to my previous experiences with Vray and Arnold. Both Takua and Hyperion make some similar choices and some very different choices in comparison, but then again, every renderer has large similarities and large differences from every other renderer out there. Rendering a Minecraft world was a lot of fun; I definitely am looking forward to doing more Minecraft renders using this pipeline again sometime in the future.</p>
<p>Also, here’s a shameless plug for the <a href="http://nerd.nu">Nerd.nu</a> Minecraft server that this data set is from. If you like playing Minecraft and are looking for a fast, free, friendly community to build with, you should definitely come check out the Nerd.nu PvE server, located at p.nerd.nu. The little town in this post is not even close to the most amazing thing that people have built on that server.</p>
<p>A final note on the (lack of) activity on my blog recently: we’ve been extremely busy at Walt Disney Animation Studios for the past year trying to release both Zootopia and Moana in the same year. Now that we’re closing in on the release of Moana, hopefully I’ll find time to post more. I have a lot of cool posts about Takua Renderer in various states of drafting; look for them soon!</p>
<hr />
<p><strong>Addendum 10/02/2016</strong>: After I published this post, Eric Haines wrote to me with a few typo corrections and, more importantly, to tell me about a way to get completely enclosed meshes from Mineways using the <a href="http://www.realtimerendering.com/erich/minecraft/public/mineways/mineways.html#schemes">color schemes feature</a>. Serves me right for not reading the documentation completely before starting! The color schemes feature allows assigning a color and alpha value to each block type; the key part of this feature for my use case is that Mineways will delete blocks with a zero alpha value when exporting. Setting all blocks except for water to have an alpha of zero allows for exporting water as a complete enclosed mesh; the same trick applies for glass or really any other block type.</p>
<p>One of the neat things about this feature is that the Mineways UI draws the map respecting assigned alpha values from the color scheme being used. As a result, setting everything except for water to have a zero alpha produces a cool view that shows only all of the water on the map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Jul/mineways_water_only.png"><img src="https://blog.yiningkarlli.com/content/images/2016/Jul/preview/mineways_water_only.jpg" alt="Mineways map view showing only water blocks. This image shows the same exact area of the map as the other Mineways screenshot earlier in the post." /></a></p>
<p>Going forward, I’ll definitely be adopting this technique to get water meshes instead of using jmc2obj. Being able to handle all of the mesh exporting work in a single program makes for a nicer, more streamlined pipeline. Of course both jmc2obj and Mineways are excellent pieces of software, but in my testing Mineways handles large map sections much better, and I also think that Mineways produces better water meshes compared to jmc2obj. As a result, my pipeline is now entirely centered around Mineways.</p>
https://blog.yiningkarlli.com/2016/02/zootopia.html
Zootopia
2016-02-12T00:00:00+00:00
2016-02-12T00:00:00+00:00
Yining Karl Li
<p><a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a>’ newest film, <a href="http://www.disneyanimation.com/projects/zootopia">Zootopia</a>, will be releasing in the United States three weeks from today.
I’ve been working at Walt Disney Animation Studios on the core development team for Disney’s <a href="http://www.disneyanimation.com/technology/innovations/hyperion">Hyperion Renderer</a> since July of last year, and the release of Zootopia is really special for me; Zootopia is the first feature film I’ve worked on.
My actual role on Zootopia was fairly limited; so far, I’ve been spending most of my time and effort on the version of Hyperion for our next film, <a href="http://www.disneyanimation.com/projects/moana">Moana</a> (coming out November of this year).
On Zootopia I basically only did support and bugfixes for Zootopia’s version of Hyperion (and I actually don’t even have a credit in Zootopia, since I hadn’t been at the studio for very long when the credits were compiled).
Nonetheless, I’m incredibly proud of all of the work and effort that has been put into Zootopia, and I consider myself very fortunate to have been able to play even a small role in making the film!</p>
<p>Zootopia is a striking film in every way.
The story is fantastic and original and relevant, the characters are all incredibly appealing, the setting is fascinating and immensely clever, the music is wonderful.
However, on this blog, we are more interested in the technical side of things; luckily, the film is just as unbelievable in its technology.
Quite simply, Zootopia is a breathtakingly beautiful film.
In the same way that Big Hero 6 was several orders of magnitude more complex and technically advanced than Frozen in every way, Zootopia represents yet another enormous leap over Big Hero 6 (which can be hard to believe, considering how gorgeous Big Hero 6 is).</p>
<p>The technical advances made on Zootopia go far beyond what I can cover in detail here, since I don’t think I can describe them in a way that does them justice, but I think I can safely say that Zootopia is the most technically advanced animated film made to date.
The fur and cloth (and cloth on top of fur!) systems on Zootopia are beyond anything I’ve ever seen, the sets and environments are simply ludicrous in both detail and scale, and of course the shading and lighting and rendering are jaw-dropping.
In a lot of ways, many of the technical challenges that had to be solved on Zootopia can be summarized in a single word: complexity.
Enormous care had to be put into creating believable fur and integrating different furry characters into different environments <a href="https://dl.acm.org/doi/10.1145/2936733.2936736">[Burkhard et al. 2016]</a>, and the huge quantities of fur in the movie required developing new level-of-detail approaches <a href="https://dl.acm.org/citation.cfm?id=2927466">[Palmer and Litaker 2016]</a> to make the fur manageable on both the authoring and rendering sides.
The sheer number of crowds characters in the film also required developing a new crowds workflow <a href="https://dl.acm.org/doi/10.1145/2897839.2927467">[El-Ali et al. 2016]</a>, again to make both authoring and rendering tractable, and the complex jungle environments seen throughout most of the film similarly required new approaches to procedural vegetation <a href="https://dl.acm.org/citation.cfm?id=2927469">[Keim et al. 2016]</a>.
Complexity wasn’t just a problem on a large scale though; Zootopia is also incredibly rich in the smaller details.
Zootopia was the first movie that Disney Animation deployed a flesh simulation system on <a href="https://dl.acm.org/citation.cfm?id=2927390">[Milne et al. 2016]</a> in order to create convincing muscular movement under the skin and fur of the animal characters.
Even individual effects such as scooping ice cream <a href="https://dl.acm.org/citation.cfm?id=2927445">[Byun et al. 2016]</a> sometimes required innovative new CG techniques.
On the rendering side, the Hyperion team developed a brand new BSDF for shading hair and fur <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12830">[Chiang et al. 2016]</a>, with a specific focus on balancing artistic controllability, physical plausibility, and render efficiency.
Disney isn’t paying me to write this on my personal blog, and I don’t write any of this to make myself look grand either.
I played only a small role, and really the amazing quality of the film is a testament to the capabilities of the hundreds of artists that actually made the final frames.
I’m deeply humbled to see what amazing things great artists can do with the tools that my team makes.</p>
<p>Okay, enough rambling. Here are some stills from the film, 100% rendered with Hyperion, of course. Go see the film; these images only scratch the surface in conveying how gorgeous the film is.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_01.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_01.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_03.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_03.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_13.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_13.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_14.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_14.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_02.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_04.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_05.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_05.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_40.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_40.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_06.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_06.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_07.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_07.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_16.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_16.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_08.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_08.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_10.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_10.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_11.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_11.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_12.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_12.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_09.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_09.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_33.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_33.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_15.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_15.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_17.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_17.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_18.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_18.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_41.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_41.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_39.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_39.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_19.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_19.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_20.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_20.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_21.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_21.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_31.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_31.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_22.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_22.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_27.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_27.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_23.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_23.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_35.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_35.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_36.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_36.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_37.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_37.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_24.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_24.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_25.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_25.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_28.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_28.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_29.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_29.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_32.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_32.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_34.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_34.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_30.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_30.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_38.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_38.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_26.jpg"><img src="https://blog.yiningkarlli.com/content/images/2016/Feb/zoot_26.jpg" alt="" /></a></p>
<p>All images in this post are courtesy of and the property of Walt Disney Animation Studios.</p>
<p><strong>References</strong></p>
<p>Nicholas Burkard, Hans Keim, Brian Leach, Sean Palmer, Ernest J. Petti, and Michelle Robinson. 2016. <a href="https://dl.acm.org/doi/10.1145/2936733.2936736">From Armadillo to Zebra: Creating the Diverse Characters and World of Zootopia</a>. In <em>ACM SIGGRAPH 2016 Production Sessions</em>. 24:1-24:2.</p>
<p>Dong Joo Byun, James Mansfield, and Cesar Velazquez. 2016. <a href="https://dl.acm.org/citation.cfm?id=2927445">Delicious Looking Ice Cream Effects with Non-Simulation Approaches</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 25:1-25:2.</p>
<p>Matt Jen-Yuan Chiang, Benedikt Bitterli, Chuck Tappan, and Brent Burley. 2016. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12830">A Practical and Controllable Hair and Fur Model for Production Path Tracing</a>. <em>Computer Graphics Forum</em>. 35, 2 (2016), 275-283.</p>
<p>Moe El-Ali, Joyce Le Tong, Josh Richards, Tuan Nguyen, Alberto Luceño Ros, and Norman Moses Joseph. 2016. <a href="https://dl.acm.org/doi/10.1145/2897839.2927467">Zootopia Crowd Pipeline</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 59:1-59:2.</p>
<p>Hans Keim, Maryann Simmons, Daniel Teece, and Jared Reisweber. 2016. <a href="https://dl.acm.org/citation.cfm?id=2927469">Art-Directable Procedural Vegetation in Disney’s Zootopia</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 18:1-18:2.</p>
<p>Andy Milne, Mark McLaughlin, Rasmus Tamstorf, Alexey Stomakhin, Nicholas Burkard, Mitch Counsell, Jesus Canal, David Komorowski, and Evan Goldberg. 2016. <a href="https://dl.acm.org/citation.cfm?id=2927390">Flesh, Flab, and Fascia Simulation on Zootopia</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 34:1-34:2.</p>
<p>Sean Palmer and Kendall Litaker. 2016. <a href="https://dl.acm.org/citation.cfm?id=2927466">Artist Friendly Level-of-Detail in a Fur-Filled World</a>. In <em>ACM SIGGRAPH 2016 Talks</em>. 32:1-32:2.</p>
https://blog.yiningkarlli.com/2015/06/attenuated-transmission.html
Attenuated Transmission
2015-06-18T00:00:00+00:00
2015-06-18T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jun/fluid.2.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jun/preview/fluid.2.jpg" alt="Blue liquid in a glass box, with attenuated transmission. Simulated using PIC/FLIP in Ariel, rendered in Takua a0.5 using VCM." /></a></p>
<p>A few months ago I added attenuation to Takua a0.5’s Fresnel refraction BSDF. Adding attenuation wound up being more complex than originally anticipated because handling attenuation through refractive/transmissive mediums requires volumetric information in addition to the simple surface differential geometry. In <a href="http://blog.yiningkarlli.com/2015/03/bsdf-system.html">a previous post about my BSDF system</a>, I mentioned that the BSDF system only considered surface differential geometry information; adding attenuation meant extending my BSDF system to also consider volume properties and track more information about previous ray hits.</p>
<p>First off, what is attenuation? Within the context of rendering and light transport, attenuation is when light is progressively absorbed within a medium, which results in a decrease in light intensity as one goes further and further into a medium away from a light source. One simple example is deep water: near the surface, most of the light that has entered the water remains unabsorbed, and so the light intensity is fairly high and the water is fairly clear. Going deeper and deeper into the water though, more and more light is absorbed and the water becomes darker and darker. Clear objects gain color when light at different wavelengths is attenuated at different rates. Combined with scattering, attenuation is a major contributing property to the look of transmissive/refractive materials in real life.</p>
<p>Attenuation is described using the <a href="https://en.wikipedia.org/wiki/Beer%E2%80%93Lambert_law">Beer-Lambert Law</a>. The part of the Beer-Lambert Law we are concerned with is the definition of transmittance:</p>
<div>\[ T = \frac{\Phi_{e}^{t}}{\Phi_{e}^{i}} = e^{-\tau}\]</div>
<p>The above equation states that the transmittance of a material is equal to the transmitted radiant flux over the received radiant flux, which in turn is equal to e raised to the power of the negative of the optical depth. If we assume uniform attenuation within a medium, the Beer-Lambert law can be expressed in terms of an attenuation coefficient μ and a path length ℓ through the medium as:</p>
<div>\[ T = e^{-\mu\ell} \]</div>
<p>From these expressions, we can see that light is absorbed exponentially as distance into an absorbing medium increases. Returning to building a BSDF system, supporting attenuation therefore means having to know not just the current intersection point and differential geometry, but also the distance a ray has traveled since the <em>previous</em> intersection point. Also, if the medium’s attenuation rate is not constant, then an attenuating BSDF not only needs to know the distance since the previous intersection point, but also has to sample along the incoming ray at some stepping increment and calculate the attenuation at each step. In other words, supporting attenuation requires BSDFs to know the previous hit point in addition to the current one and also requires BSDFs to be able to ray march from the previous hit point to the current one.</p>
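<p>To make the two evaluation modes above concrete, here is a minimal, self-contained sketch of both the uniform Beer-Lambert case and the ray marched case. This is illustrative C++ rather than Takua a0.5’s actual implementation; the Vec3 type and the mu callback are assumptions for the example.</p>
<pre><code>#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

// Uniform medium: per-channel transmittance T = exp(-mu * distance).
Vec3 uniformTransmittance(const Vec3& mu, float distance) {
    return { std::exp(-mu.x * distance),
             std::exp(-mu.y * distance),
             std::exp(-mu.z * distance) };
}

// Non-uniform medium: ray march from the previous hit point to the current hit
// point, accumulating optical depth tau, then take T = exp(-tau) per channel.
// MuFunc is any callable returning the attenuation coefficient at a point.
template <typename MuFunc>
Vec3 rayMarchedTransmittance(const Vec3& start, const Vec3& dir, float distance,
                             float stepSize, MuFunc mu) {
    Vec3 tau = { 0.0f, 0.0f, 0.0f };
    for (float t = 0.0f; t < distance; t += stepSize) {
        float dt = std::min(stepSize, distance - t);
        Vec3 p = { start.x + dir.x * t, start.y + dir.y * t, start.z + dir.z * t };
        Vec3 m = mu(p);
        tau.x += m.x * dt;
        tau.y += m.y * dt;
        tau.z += m.z * dt;
    }
    return { std::exp(-tau.x), std::exp(-tau.y), std::exp(-tau.z) };
}
</code></pre>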
<p>Adding previous hit information and ray march support to my BSDF system was a very straightforward task. I also added volumetric data support to Takua, allowing for the following attenuation test with a glass Stanford Dragon filled with a checkerboard red and blue medium. The red and blue medium is ray marched through to calculate the total attenuation. Note how the thinner parts of the dragon allow more light through resulting in a lighter appearance, while thicker parts of the dragon attenuate more light resulting in a darker appearance. Also note the interesting red and blue caustics below the dragon:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jun/dragon_vcm.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jun/preview/dragon_vcm.jpg" alt="Glass Stanford Dragon filled with a red and blue volumetric checkerboard attenuating medium. Rendered in Takua a0.5 using VCM." /></a></p>
<p>Things got much more complicated once I added support for what I call “deep attenuation”- that is, attenuation through multiple mediums embedded inside of each other. A simple example is an ice cube floating in a glass of liquid, which one might model in the following way:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jun/fluid_diagram.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jun/preview/fluid_diagram_small.png" alt="Diagram of glass-fluid-ice interfaces. Arrows indicate normal directions." /></a></p>
<p>There are two things in the above diagram that make deep attenuation difficult to implement. First, note that the ice cube is modeled without a corresponding void in the liquid- that is, a ray path that travels through the ice cube records a sequence of intersection events that goes something like “enter water, enter ice cube, exit ice cube, exit water”, as opposed to “enter water, exit water, enter ice cube, exit ice cube, enter water, exit water”. Second, note that the liquid boundary is actually slightly <em>inside</em> of the inner wall of the glass. Intuitively, this may seem like a mistake or an odd property, but this is actually the correct way to model a liquid-glass interface in computer graphics- see <a href="http://adaptivesamples.com/2013/10/19/fluid-in-a-glass/">this article</a> and <a href="http://www.aversis.be/tutorials/vray/vray-20-glass-liquid-02.htm">this other article</a> for details on why.</p>
<p>So why do these two cases complicate things? As a ray enters each new medium, we need to know what medium the ray is in so that we can execute the appropriate BSDF and get the correct attenuation for that medium. We can only evaluate the attenuation once the ray <em>exits</em> the medium, since attenuation is dependent on how far through the medium the ray traveled. The easy solution is to simply remember what the BSDF is when a ray enters a medium, and then use the remembered BSDF to evaluate attenuation upon the next intersection. For example, imagine the following sequence of intersections:</p>
<ol>
<li>Intersect glass upon entering glass.</li>
<li>Intersect glass upon exiting glass.</li>
<li>Intersect water upon entering water.</li>
<li>Intersect water upon exiting water.</li>
</ol>
<p>This sequence of intersections is easy to evaluate. The evaluation would go something like:</p>
<ol>
<li>Enter glass. Store glass BSDF.</li>
<li>Exit glass. Evaluate attenuation from stored glass BSDF.</li>
<li>Enter water. Store water BSDF.</li>
<li>Exit water. Evaluate attenuation from stored water BSDF.</li>
</ol>
<p>So far so good. However, remember that in the first case, sometimes we might not have a surface intersection to mark that we’ve exited one medium before entering another. The following scenario demonstrates how this first case results in missed attenuation evaluations:</p>
<ol>
<li>Intersect water upon entering water.</li>
<li>Exit water, but no intersection!</li>
<li>Intersect ice upon entering ice.</li>
<li>Intersect ice upon exiting ice.</li>
<li>Enter water again, but no intersection either!</li>
<li>Intersect water upon exiting water.</li>
</ol>
<p>The evaluation sequence, however, does not play out correctly:</p>
<ol>
<li>Enter water. Store water BSDF.</li>
<li>Exit water, but no intersection. No BSDF evaluated.</li>
<li>Enter ice. Intersection occurs, so evaluate attenuation from stored water BSDF. Store ice BSDF.</li>
<li>Exit ice. Evaluate attenuation from stored ice BSDF.</li>
<li>Enter water again, but no intersection, so no BSDF stored.</li>
<li>Exit water… but there is no previous BSDF stored! No attenuation is evaluated!</li>
</ol>
<p>Alternatively, in step 6, instead of no previous BSDF, we might still have the ice BSDF stored and evaluate attenuation based on the ice. However, this result is still wrong, since we’re now using the ice BSDF for the water.</p>
<p>One simple solution to this problem is to keep a stack of previously seen BSDFs with each ray instead of just storing the previously seen BSDF. When the ray enters a medium through an intersection, we push a BSDF onto the stack. When the ray exits a medium through an intersection, we evaluate whatever BSDF is on the top of the stack and pop the stack. Keeping a stack works well for the previous example case:</p>
<ol>
<li>Enter water. Push water BSDF on stack.</li>
<li>Exit water, but no intersection. No BSDF evaluated.</li>
<li>Enter ice. Intersection occurs, so evaluate water BSDF from top of stack. Push ice BSDF on stack.</li>
<li>Exit ice. Evaluate ice BSDF from top of stack. Pop ice BSDF off stack.</li>
<li>Enter water again, but no intersection, so no BSDF stored.</li>
<li>Exit water. Intersection occurs, so evaluate water BSDF from top of stack. Pop water BSDF off stack.</li>
</ol>
<p>Excellent, we now have evaluated different medium attenuations in the correct order, haven’t missed any evaluations or used the wrong BSDF for a medium, and as we exit the water and ice our stack is now empty as it should be. The first case from above is now solved… what happens with the second case though? Imagine the following sequence of intersections where the liquid boundary is inside of the two glass boundaries:</p>
<ol>
<li>Intersect glass upon entering glass.</li>
<li>Intersect water upon entering water.</li>
<li>Intersect glass upon exiting glass.</li>
<li>Intersect water upon exiting water.</li>
</ol>
<p>The evaluation sequence using a stack is:</p>
<ol>
<li>Enter glass. Push glass BSDF on stack.</li>
<li>Enter water. Evaluate glass attenuation from top of stack. Push water BSDF.</li>
<li>Exit glass. Evaluate water attenuation from top of stack, pop water BSDF.</li>
<li>Exit water. Evaluate glass attenuation from top of stack, pop glass BSDF.</li>
</ol>
<p>The evaluation sequence is once again in the wrong order- we just used the glass attenuation when we were traveling through water at the end! Solving this second case requires a modification to our stack based scheme. Instead of popping the top of the stack every time we exit a medium, we should scan the stack from the top down and pop the first instance of a BSDF matching the BSDF of the surface we just exited through. This modified stack results in:</p>
<ol>
<li>Enter glass. Push glass BSDF on stack.</li>
<li>Enter water. Evaluate glass attenuation from top of stack. Push water BSDF.</li>
<li>Exit glass. Evaluate water attenuation from top of stack. Scan stack and find first glass BSDF matching the current surface’s glass BSDF and pop that BSDF.</li>
<li>Exit water. Evaluate water attenuation from top of stack. Scan stack and pop first matching water BSDF.</li>
</ol>
<p>At this point, I should mention that pushing/popping onto the stack should only occur when a ray travels <em>through</em> a surface. When the ray simply reflects off of a surface, an intersection has occurred and therefore attenuation from the top of the stack should still be evaluated, but the stack itself should not be modified. This way, we can support diffuse inter-reflections inside of an attenuating medium and get the correct diffuse inter-reflection <em>with</em> attenuation between diffuse bounces! Using this modified stack scheme for attenuation evaluation, we can now correctly handle all deep attenuation cases and embed as many attenuating mediums in each other as we could possibly want.</p>
<p>…or at least, I think so. I plan on running more tests before conclusively deciding this all works. So there may be a followup to this post later if I have more findings.</p>
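<p>For concreteness, here is a rough sketch of the modified stack bookkeeping described above. The Bsdf type and the class and function names here are stand-ins for illustration, not Takua a0.5’s actual code; the key detail is that exiting a medium scans from the top of the stack down and pops the first matching entry.</p>
<pre><code>#include <iterator>
#include <memory>
#include <vector>

struct Bsdf; // assumed to provide attenuation evaluation elsewhere

class MediumStack {
public:
    // Called when a ray passes through a surface into a medium. Push/pop only
    // happen on transmission; pure reflections leave the stack untouched.
    void enter(const std::shared_ptr<Bsdf>& bsdf) { m_stack.push_back(bsdf); }

    // The medium the ray is currently traveling through: the top of the stack.
    // Attenuation since the previous hit is evaluated against this BSDF.
    std::shared_ptr<Bsdf> current() const {
        return m_stack.empty() ? nullptr : m_stack.back();
    }

    // Called when a ray passes through a surface out of a medium. Scan from the
    // top down and pop the first entry matching the surface's BSDF, so that
    // interleaved boundaries (e.g. the glass/water overlap) resolve correctly.
    void exit(const std::shared_ptr<Bsdf>& bsdf) {
        for (auto it = m_stack.rbegin(); it != m_stack.rend(); ++it) {
            if (*it == bsdf) {
                m_stack.erase(std::next(it).base());
                return;
            }
        }
    }

private:
    std::vector<std::shared_ptr<Bsdf>> m_stack;
};
</code></pre>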
<p>A while back, I <a href="http://blog.yiningkarlli.com/2014/01/flip-simulator.html">wrote a PIC/FLIP fluid simulator</a>. However, at the time, Takua Renderer didn’t have attenuation support, so I wound up <a href="http://blog.yiningkarlli.com/2014/02/flip-meshing-pipeline.html">rendering my simulations with Vray</a>. Now that Takua a0.5 has robust deep attenuation support, I went back and used some frames from my fluid simulator as tests. The image at the top of this post is a simulation frame from my fluid simulator, rendered entirely with Takua a0.5. The water is set to attenuate red and green light more than blue light, resulting in the blue appearance of the water. In addition, the glass has a slight amount of hazy green attenuation too, much like real aquarium glass. As a result, the glass looks greenish from the ends of each glass plate, but is clear when looking through each plate, again much like real glass. Here are two more renders:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jun/fluid.0.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jun/preview/fluid.0.jpg" alt="Simulated using PIC/FLIP in Ariel, rendered in Takua a0.5 using VCM." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jun/fluid.1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jun/preview/fluid.1.jpg" alt="Simulated using PIC/FLIP in Ariel, rendered in Takua a0.5 using VCM." /></a></p>
https://blog.yiningkarlli.com/2015/05/complex-room-renders.html
Complex Room Renders
2015-05-30T00:00:00+00:00
2015-05-30T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2015/May/room_angle1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/May/preview/room_angle1.jpg" alt="Rendered in Takua a0.5 using VCM. Model credits in the post below." /></a></p>
<p>I realize I have not posted in some weeks now, which means I still haven’t gotten around to writing up Takua a0.5’s architecture and VCM integrator. I’m hoping to get to that once I’m finished with my thesis work. In the meantime, here are some more pretty pictures rendered using Takua a0.5.</p>
<p>A few months back, I made a high-complexity scene designed to test Takua a0.5’s capability for handling “real-world” workloads. The scene was also designed to have an extremely difficult illumination setup. The scene is an indoor room that is lit primarily from outside through glass windows. Yes, the windows are actually modeled as geometry with a glass BSDF! This means everything seen in these renders is being lit primarily through caustics! Of course, no real production scene would be set up in this manner, but I chose this difficult setup specifically to test the VCM integrator. There is a secondary source of light from a metal cylindrical lamp, but this light source is also difficult since the actual light is emitted from a sphere light inside of a reflective metal cylinder that blocks primary visibility from most angles.</p>
<p>The flowers and glass vase are the same ones from my earlier <a href="http://blog.yiningkarlli.com/2015/02/flower-vase-render.html">Flower Vase Renders post</a>. The original flowers and vase are by <a href="https://www.behance.net/andi_mix">Andrei Mikhalenko</a>, with custom textures of my own. The amazing, colorful Takua poster on the back wall is by my good friend <a href="http://alice-yang.tumblr.com/">Alice Yang</a>. The two main furniture pieces are by <a href="http://odesd2.com.ua/ru">ODESD2</a>, and the Braun SK4 record player model is by one of my favorite archviz artists, <a href="http://bertrand-benoit.com/">Bertrand Benoit</a>. The teapot is, of course, the famous Utah teapot. All textures, shading, and other models are my own.</p>
<p>As usual, all depth of field is completely in-camera and in-renderer. Also, all BSDFs in this scene are fairly complex; there is not a single simple diffuse surface anywhere in the scene! Instancing is used very heavily; the wicker baskets, notebooks, textbooks, chess pieces, teacups, and tea dishes are all instanced from single pieces of geometry. The floorboards are individually modeled but not instanced, since they all vary in length and slightly in width.</p>
<p>A few more pretty renders, all rendered in Takua a0.5 using VCM:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/May/room_angle4.png"><img src="https://blog.yiningkarlli.com/content/images/2015/May/preview/room_angle4.jpg" alt="Closeup of Braun SK4 record player with DOF. Rendered using VCM." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/May/room_angle6.png"><img src="https://blog.yiningkarlli.com/content/images/2015/May/preview/room_angle6.jpg" alt="Flower vase and tea set. Rendered using VCM" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/May/room_angle7.png"><img src="https://blog.yiningkarlli.com/content/images/2015/May/preview/room_angle7.jpg" alt="Floorboards, textbooks, and rough metal bin with DOF. The book covers are entirely made up. Rendered using VCM." /></a></p>
https://blog.yiningkarlli.com/2015/05/note-on-images.html
Note On Images
2015-05-23T00:00:00+00:00
2015-05-23T00:00:00+00:00
Yining Karl Li
<p>Just a quick note on images on this blog. So far, I’ve generally been embedding full resolution, losslessly compressed PNG format images in the blog. I prefer having the full resolution, lossless images available on the blog since they are the exact output from my renderer. However, full resolution lossless PNGs can get fairly large (several MB for a single 1920x1080 frame), which is dragging down the load times for the blog.</p>
<p>Going forward, I’ll be embedding lossy compressed JPG images in blog posts, but the JPGs will link through to the full resolution, lossless PNG originals. Fortunately, high quality JPG compression is quite good these days at fitting an image with nearly imperceptible compression differences into a much smaller footprint. I’ll also be going back and applying this scheme to old posts too at some point.</p>
<hr />
<p><strong>Addendum 04/08/2016</strong>: Now that I am doing some renders in 4K resolution (3840x2160), it’s time for an addendum to this policy.
I won’t be uploading full resolution lossless PNGs for 4K images, due to the overwhelming file size (>30MB for a single image, which means a post with just a handful of 4K images can easily add up to hundreds of MB).
Instead, for 4K renders, I will embed a downsampled 1080P JPG image in the post, and link through to a 4K JPG compressed to balance image quality and file size.</p>
https://blog.yiningkarlli.com/2015/04/hyperion.html
Hyperion
2015-04-24T00:00:00+00:00
2015-04-24T00:00:00+00:00
Yining Karl Li
<p>Just a quick update on future plans. Starting in July, I’m going to be working full time for <a href="http://www.disneyanimation.com/">Walt Disney Animation Studios</a> as a software engineer on their custom, in-house <a href="http://www.fxguide.com/featured/disneys-new-production-renderer-hyperion-yes-disney/">Hyperion Renderer</a>. I couldn’t be more excited about working with everyone on the Hyperion team; ever since the <a href="https://disney-animation.s3.amazonaws.com/uploads/production/publication_asset/70/asset/Sorted_Deferred_Shading_For_Production_Path_Tracing.pdf">Sorted Deferred Shading paper</a> was published two years ago, I’ve thought that the Hyperion team is doing some of the most interesting work there is in the rendering field right now.</p>
<p>I owe an enormous thanks to everyone that’s advised and supported and encouraged me to continue exploring the rendering and graphics world. Thanks, Joe, Don, Peter, Tony, Mark, Christophe, Amy, Fran, Gabriel, Harmony, and everyone else!</p>
<p>Normally as a rule I only post images to this blog that I made or have a contribution to, but this time I’ll make an exception. Here’s one of my favorite stills from Big Hero 6, rendered entirely using Hyperion and lit by Angela McBride, a friend from PUPs 2011! Images like this one are an enormous source of inspiration to me, so I absolutely can’t wait to get started at Disney and help generate more gorgeous imagery like this!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Apr/BH6_still_Baymaxhug.jpg"><img src="https://blog.yiningkarlli.com/content/images/2015/Apr/preview/BH6_still_Baymaxhug.jpg" alt="A still from Big Hero 6, rendered entirely using Hyperion. Property of Walt Disney Animation Studios." /></a></p>
https://blog.yiningkarlli.com/2015/03/bsdf-system.html
BSDF System
2015-03-23T00:00:00+00:00
2015-03-23T00:00:00+00:00
Yining Karl Li
<p>Takua a0.5’s BSDF system was particularly interesting to build, especially because in previous versions of Takua Renderer, I never really had a good BSDF system. Previously, my BSDFs were written in a pretty ad-hoc way and were somewhat hardcoded into the pathtracing integrator, which made BSDF extensibility very difficult and multi-integrator support nearly impossible without significant duplication of BSDF code. In Takua a0.5, I’ve written a new, extensible, modularized BSDF system that is inspired by <a href="http://mitsuba-renderer.org">Mitsuba</a> and <a href="https://renderman.pixar.com/resources/current/RenderMan/bxdfRef.html">Renderman 19/RIS</a>. In this post, I’ll write about how Takua a0.5’s BSDF system works and show some pretty test images generated during development with some interesting models and props.</p>
<p>First, here’s a still-life sort of render showcasing a number of models with interesting materials, all using Takua a0.5’s BSDF system and rendered using my VCM integrator. All of the renders in this post are rendered either using my BDPT integrator or my VCM integrator.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/still_life.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/still_life.jpg" alt="Still-life scene with a number of interesting, complex materials created using Takua a0.5's BSDF system. The chess pieces and notebooks make use of instancing. Rendered in Takua a0.5 using VCM." /></a></p>
<p>BSDFs in Takua a0.5 are designed to support bidirectional evaluation and importance sampling natively. Basically, this means that all BSDFs need to implement five basic functions. These five basic functions are:</p>
<ul>
<li>Evaluate, which takes input and output directions of light and a normal, and returns the BSDF weight, cosine of the angle of the input direction, and color absorption of the scattering event. Evaluate can also optionally return the probability of the output direction given the input direction, with respect to solid angle.</li>
<li>CalculatePDFW, which takes the input and output directions of light and a normal, and returns the forward probability of the output direction given the input direction. In order to make the BSDF operate bidirectionally, this function also needs to be able to return the backwards probability if the input and output are reversed.</li>
<li>Sample, which takes in an input direction, a normal, and a random number generator and returns an output direction, the BSDF weight, the forward probability of the output direction, and the cosine of the input angle.</li>
<li>IsDelta, which returns true if the BSDF’s probability distribution function is a <a href="http://en.wikipedia.org/wiki/Dirac_delta_function">Dirac delta function</a> and false otherwise. This attribute is important for allowing BDPT and VCM to handle perfectly specular BSDFs correctly, since perfectly specular BSDFs are something of a special case.</li>
<li>GetContinuationProbability, which takes in an input direction and normal and returns the probability of continuing a ray path at this BSDF. This function is used for Russian Roulette early path termination.</li>
</ul>
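<p>As a rough sketch, an interface along these lines might look something like the following in C++. The exact types and signatures here are illustrative assumptions on my part rather than Takua a0.5’s real headers, but the five entry points map one-to-one onto the list above.</p>
<pre><code>struct Vec3 { float x, y, z; };
struct Rng;         // assumed random number generator

struct BsdfSample { // assumed bundle of Sample()'s outputs
    Vec3 outputDir;
    Vec3 weight;
    float forwardPdfW;
    float cosInputAngle;
};

class Bsdf {
public:
    virtual ~Bsdf() = default;

    // BSDF weight, cosine of the input angle, and color absorption for a given
    // input/output pair; can optionally also report the solid-angle pdf.
    virtual Vec3 evaluate(const Vec3& input, const Vec3& output, const Vec3& normal,
                          float* cosInput, float* pdfW) const = 0;

    // Forward pdf of the output direction given the input direction; must also
    // be able to report the backward pdf when input and output are reversed.
    virtual float calculatePdfW(const Vec3& input, const Vec3& output,
                                const Vec3& normal, bool reverse) const = 0;

    // Draw an output direction plus its weight, forward pdf, and cosine term.
    virtual BsdfSample sample(const Vec3& input, const Vec3& normal, Rng& rng) const = 0;

    // True if the underlying pdf is a Dirac delta (perfectly specular).
    virtual bool isDelta() const = 0;

    // Russian roulette continuation probability for paths arriving here.
    virtual float getContinuationProbability(const Vec3& input, const Vec3& normal) const = 0;
};
</code></pre>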
<p>In order to be correct and bidirectional, each of these functions should return results that agree with the other functions. For example, taking the output direction generated by Sample and calling Evaluate with the Sample output direction should produce the same color absorption and forward probability and other attributes as Sample. Sample, Evaluate, and CalculatePDFW are all very similar functions and often can share a large amount of common code, but each one is tailored to a slightly different purpose. For example, Sample is useful for figuring out a new random ray direction along a ray path, while Evaluate is used for calculating BSDF weights while importance sampling light sources.</p>
<p>Small note: I wrote that these five functions all take in a normal, which is technically all they need in terms of differential geometry. However, in practice, passing in a surface point and UV and other differential geometry information is very useful since that allows for various properties to be driven by 2D and 3D textures. In Takua a0.5, I pass in a normal, surface point, UV coordinate, and a geom and primitive ID for future PTEX support, and allow every BSDF attribute to be driven by a texture.</p>
<p>One of the test props I made is the <a href="http://www.pbrt.org/">PBRT book</a>, since I thought rendering the Physically Based Rendering book with a physically based renderer and physically based shading would be amusing. The base diffuse color is driven by a texture map, and the interesting rippling and variation in the glossiness of the book cover come from driving additional gloss and specular properties with more texture maps.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/pbrt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/pbrt.jpg" alt="Physically Based Rendering book, rendered with my physically based renderer. Note the texture-driven gloss and specular properties. Rendered using BDPT." /></a></p>
<p>In order to be physically correct, BSDFs should also fulfill the following three properties:</p>
<ul>
<li>Positivity, meaning that the return value of the BSDF should always be positive or equal to 0.</li>
<li>Helmholtz Reciprocity, which means the return value of the BSDF should not be changed by switching the input and output directions (although switching the input and output CAN change how things are calculated internally, such as in perfectly specular refractive materials).</li>
<li>Energy Conservation, meaning the surface cannot reflect more light than arrives.</li>
</ul>
<p>At the moment, my base BSDFs are not actually the best physically based BSDFs in the world… I just have Lambertian diffuse, normalized Blinn-Phong, and Fresnel-based perfectly specular reflection/refraction. At a later point I’m planning on adding Beckmann and Disney’s Principled BSDF, and possibly others such as GGX and Ward. However, for the time being, I can still create highly complex and interesting materials because of the modular nature of Takua a0.5’s BSDF system; one of the most powerful uses of this modular system is combining base BSDFs into more complex BSDFs. For example, I have another BSDF called FresnelPhong, which internally calls normalized Blinn-Phong BSDF but also calls the Fresnel code from my Fresnel specular BSDF to account for an output direction with the Fresnel effect with glossy surfaces. Since the Fresnel specular BSDF handles refractive materials, FresnelPhong allows for creating glossy transmissive surfaces such as frosted glass (albeit not as accurate to reality as one would get with Beckmann or GGX).</p>
<p>Another one of my test props is a glass chessboard, where half of the pieces and board squares are using frosted glass. Needless to say, this scene is very difficult to render using unidirectional pathtracing. I only have one model of each chess piece type, and all of the pieces on the board are instances with varying materials per instance.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/chessboard_0.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/chessboard_0.jpg" alt="Chessboard with ground glass squares and clear glass squares. Rendered using BDPT." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/chessboard_1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/chessboard_1.jpg" alt="Chessboard with ground glass and clear glass pieces. Rendered using BDPT." /></a></p>
<p>Another interesting use of modular BSDFs and embedding BSDFs inside of other BSDFs is in implementing bump mapping. Takua a0.5 implements bump mapping as a simple BSDF wrapper that calculates the bump mapped normal and passes that normal into whatever the underlying BSDF is. This approach allows for any BSDF to have a bump map, and even allows for applying multiple bump maps to the same piece of geometry. In addition to specifying bump maps as wrapper BSDFs, Takua a0.5 also allows attaching bump maps to individual geometry so that the same BSDF can be reused with a number of different bump maps attached to a number of different geometries, but under the hood this system works exactly the same as the BSDF wrapper bump map.</p>
<p>This notebook prop’s leathery surface detail comes entirely from a BSDF wrapper bump map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/notebook.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/notebook.jpg" alt="Notebook with a leathery surface. All surface detail comes from bump mapping. Rendered using BDPT." /></a></p>
<p>Finally, one of the most useful and interesting features of Takua a0.5’s BSDF system is the layered BSDF. The layered BSDF is a special BSDF that allows arbitrary combining, layering, and mixing between different BSDFs, much like Vray’s BlendMtl or Renderman 19/RIS’s LM shader system. Any BSDF can be used as a layer in a layered BSDF, including entire other layered BSDF networks. The Takua layered BSDF consists of a base substrate BSDF, and an arbitrary number of coat layers on top of the substrate. Each coat is given a texture-driven weight which determines how much of the final output BSDF is from the current coat layer versus from all of the layers and substrate below the current coat layer. Since the weight for each coat layer must be between 0 and 1, the resulting layered BSDF maintains physical correctness as long as all of the component BSDFs are also physically correct. Practically, the layered BSDF is implemented so that with each iteration, only one of the component BSDFs is evaluated and sampled, with the particular component BSDF per iteration chosen randomly based on each component BSDF’s weighting.</p>
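<p>One plausible way to implement the per-iteration selection described above is to walk the coat layers from the top down, keeping each coat with probability equal to its texture-driven weight and otherwise falling through toward the substrate. The following is only a sketch of that reading; the names are illustrative and not Takua a0.5’s actual layered BSDF code.</p>
<pre><code>#include <memory>
#include <random>
#include <vector>

struct Bsdf;

struct CoatLayer {
    std::shared_ptr<Bsdf> bsdf;
    float weight; // in [0, 1], already looked up from its texture for this point
};

// Pick exactly one component BSDF for this iteration: each coat, from the top
// down, is chosen with probability equal to its weight; if no coat is chosen,
// the substrate underneath everything is used.
const Bsdf* selectLayer(const std::vector<CoatLayer>& coatsTopDown,
                        const Bsdf* substrate, std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    for (const CoatLayer& coat : coatsTopDown) {
        if (uniform(rng) < coat.weight) {
            return coat.bsdf.get();
        }
    }
    return substrate;
}
</code></pre>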
<p>The layered BSDF system is what allows the creation of truly interesting and complex materials, since objects in reality often have complex materials consisting of a number of different scattering event types. For example, a real object may have a diffuse base with a glossy clear coat, but there may also be dust and fingerprints on top of the clear coat contributing to the final appearance. The globe model seen in my adaptive sampling post uses a complex layered BSDF; the base BSDF is ground glass, with the continents layered on top as a perfectly specular mirror BSDF, and then an additional dirt and fingerprints layer on top made up of diffuse and varying glossy BSDFs:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_0.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/globe_0.jpg" alt="Glass globe using Takua's layered BSDF system. The globe has a base ground glass layer, a mirror layer for continents, and a dirt/fingerprints layer for additional detail. Rendered using VCM." /></a></p>
<p>Here’s an additional close-up render of the globe that better shows off some of the complex surface detail:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/globe_1.jpg" alt="Close-up of the globe. Rendered using VCM." /></a></p>
<p>Going forward, I’m planning on adding a number of better BSDFs to Takua a0.5 (as mentioned before). Since the BSDF system is so modular and extensible, adding new BSDFs should be relatively simple and should require little to no additional work to integrate into the renderer. Because of how I designed BSDF wrappers, any new BSDF I add will automatically work with the bump map BSDF wrapper and the layered BSDF system. I’m also planning on adding interesting effects to the refractive/transmission BSDF, such as absorption based on Beer’s law and spectral diffraction.</p>
<p>After I finish work on my thesis, I also intend on adding more complex materials for subsurface scattering and volume rendering. These additions will be much more involved than just adding GGX or Beckmann, but I have a rough roadmap for how to proceed and I’ve already built a lot of supporting infrastructure into Takua a0.5. The plan for now is to implement a unified SSS/volume system based on the <a href="http://www.cs.dartmouth.edu/~wjarosz/publications/krivanek14upbp.html">Unified Points, Beams, and Paths</a> presented at SIGGRAPH 2014. UPBP can be thought of as extending VCM to combine a number of different volumetric rendering techniques. I can’t wait to get started on that over the summer!</p>
https://blog.yiningkarlli.com/2015/03/adaptive-sampling.html
Adaptive Sampling
2015-03-18T00:00:00+00:00
2015-03-18T00:00:00+00:00
Yining Karl Li
<p>Adaptive sampling is a relatively small and simple but very powerful feature, so I thought I’d write briefly about how adaptive sampling works in Takua a0.5. Before diving into the details though, I’ll start with a picture. The scene I’ll be using for comparisons in this post is a globe of the Earth, made of a polished ground glass with reflective metal insets for the landmasses and with a rough scratched metal stand. The globe is on a white backdrop and is lit by two off-camera area lights. The following render is the fully converged reference baseline for everything else in the post, rendered using VCM:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_globe_baseline_vcm.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_globe_baseline_vcm.jpg" alt="Fully converged reference baseline. Rendered in Takua a0.5 using VCM." /></a></p>
<p>As <a href="http://blog.yiningkarlli.com/2015/02/bidirectional-pathtracing-integrator.html">mentioned before</a>, in pathtracing based renderers, we solve the path integral through Monte Carlo sampling, which gives us an estimate of the total integral per sample thrown. As we throw more and more samples at the scene, we get a better and better estimate of the total integral, which explains why pathtracing based integrators start out producing a noisy image but eventually converge to a nice, smooth image if enough rays are traced per pixel.</p>
<p>In a naive renderer, the number of samples traced per pixel is usually just a fixed number, equal for all pixels. However, not all parts of the image are necessarily equally difficult to sample; for example, in the globe scene, the backdrop should require fewer samples than the ground glass globe to converge, and the ground glass globe in turn should require fewer samples than the two caustics on the ground. This observation means that a fixed sampling strategy can potentially be quite wasteful. Instead, computation can be used much more efficiently if the sampling strategy can adapt and drive more samples towards pixels that require more work to converge, while driving fewer samples towards pixels that have already converged mid-render. Such a sampler can also be used to automatically stop the renderer once the sampler has detected that the entire render has converged, without needing user guesswork for how many samples to use.</p>
<p>The following image is the same globe scene as above, but limited to 5120 samples per pixel using bidirectional pathtracing and a fixed sampler. Note that most of the image is reasonably converged, but there is still noise visible in the caustics:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/fixed_globe_bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/fixed_globe_bdpt.jpg" alt="Fixed sampling, 5120 samples per pixel, BDPT." /></a></p>
<p>Since it may be difficult to see the difference between this image and the baseline image on smaller screens, here is a close-up crop of the same caustic area between the two images:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_fixed_baseline_comparison.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_fixed_baseline_comparison.png" alt="500% crop. Left: converged baseline render. Right: fixed sampling, 5120 samples per pixel, BDPT." /></a></p>
<p>The difficult part of implementing an adaptive sampler is, of course, figuring out a metric for convergence. The <a href="http://www.pbrt.org/">PBRT book</a> presents a very simple adaptive sampling strategy on page 388 of the 2nd edition: for each pixel, generate some minimum number of initial samples and record the radiances returned by each initial sample. Then, take the average of the luminances of the returned radiances, and compute the contrast between each initial sample’s radiance and the average luminance. If any initial sample has a contrast from the average luminance above some threshold (say, 0.5), generate more samples for the pixel up until some maximum number of samples per pixel is reached. If all of the initial samples have contrasts below the threshold, then the sampler can mark the pixel as finished and move onto the next pixel. The idea behind this strategy is to try to eliminate fireflies, since fireflies result from statistically improbable samples that are significantly above the true value of the pixel.</p>
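<p>A minimal sketch of that per-pixel test might look like the following, assuming the initial samples have already been reduced to scalar luminances; the function name and the way the threshold is passed in are illustrative rather than PBRT’s exact code.</p>
<pre><code>#include <cmath>
#include <vector>

// Returns true if the pixel should receive more samples: any initial sample
// whose contrast against the average luminance exceeds the threshold flags
// the pixel as not yet converged.
bool needsMoreSamples(const std::vector<float>& sampleLuminances,
                      float contrastThreshold) {
    if (sampleLuminances.empty()) {
        return true;
    }
    float average = 0.0f;
    for (float l : sampleLuminances) {
        average += l;
    }
    average /= static_cast<float>(sampleLuminances.size());
    if (average <= 0.0f) {
        return false; // all-black pixel; nothing left to resolve
    }
    for (float l : sampleLuminances) {
        float contrast = std::abs(l - average) / average;
        if (contrast > contrastThreshold) {
            return true;
        }
    }
    return false;
}
</code></pre>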
<p>The PBRT adaptive sampler works decently, but has a number of shortcomings. First, the need to draw a large number of samples per pixel simultaneously makes this approach less than ideal for progressive rendering; while this approach is well suited to a bucketed renderer, a progressive renderer prefers to draw a small number of samples per pixel per iteration, and return to each pixel to draw more samples in subsequent iterations. In theory, the PBRT adaptive sampler could be made to work with a progressive renderer if sample information was stored from each iteration until enough samples were accumulated to run an adaptive sampling check, but this approach would require storing a lot of extra information. Second, while the PBRT approach can guarantee some degree of per-pixel variance minimization, each pixel isn’t actually aware of what its neighbours look like, meaning that there still can be visual noise across the image. A better, global approach would have to take into account neighbouring pixel radiance values as a second check for whether or not a pixel is sufficiently sampled.</p>
<p>My first attempt at a global approach (the test scene in this post is a globe, but that pun was not intended) was to simply have the adaptive sampler check the contrast of each pixel with its immediate neighbours. Every N samples, the adaptive sampler would pull the accumulated radiances buffer and flag each pixel as unconverged if the pixel has a contrast greater than some threshold from at least one of its neighbours. Pixels marked unconverged are sampled for N more iterations, while pixels marked as converged are skipped for the next N iterations. After another N iterations, the adaptive sampler would go back and reflag every pixel, meaning that a pixel previously marked as converged could be reflagged as unconverged if its neighbours changed enormously. Generally N should be a rather large number (say, 128 samples per pixel), since doing convergence checks is meaningless if the image is too noisy at the time of the check.</p>
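<p>Sketched in code, the flagging pass might look something like this, with the accumulated radiances reduced to per-pixel luminances; the exact contrast metric and the four-neighbour choice here are my own illustrative assumptions rather than the renderer’s actual code.</p>
<pre><code>#include <algorithm>
#include <cmath>
#include <vector>

// Flag each pixel as unconverged if its contrast against any immediate
// neighbour exceeds the threshold. Run every N iterations over the
// accumulated radiance buffer.
std::vector<bool> flagUnconverged(const std::vector<float>& luminance,
                                  int width, int height, float threshold) {
    std::vector<bool> unconverged(luminance.size(), false);
    const int offsets[4][2] = { { 1, 0 }, { -1, 0 }, { 0, 1 }, { 0, -1 } };
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float center = luminance[y * width + x];
            for (const auto& o : offsets) {
                int nx = x + o[0];
                int ny = y + o[1];
                if (nx < 0 || nx >= width || ny < 0 || ny >= height) {
                    continue;
                }
                float neighbour = luminance[ny * width + nx];
                float mean = 0.5f * (center + neighbour);
                float contrast = std::abs(center - neighbour) / std::max(mean, 1e-6f);
                if (contrast > threshold) {
                    unconverged[y * width + x] = true;
                    break;
                }
            }
        }
    }
    return unconverged;
}
</code></pre>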
<p>Using this strategy, I got the following image, which was set to run for a maximum of 5120 samples per pixel but wound up averaging 4500 samples per pixel, or about a 12.1% reduction in samples needed:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perpixel_globe_bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_perpixel_globe_bdpt.jpg" alt="Adaptive sampling per pixel, average 4500 samples per pixel, BDPT." /></a></p>
<p>At an initial glance, this looks pretty good! However, as soon as I examined where the actual samples went, I realized that this strategy doesn’t work. The following image is a heatmap showing where samples were driven, with brighter areas indicating more samples per pixel:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perpixel.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_perpixel.jpg" alt="Sampling heatmap for adaptive sampling per pixel. Brighter areas indicate more samples." /></a></p>
<p>Generally, my per-pixel adaptive sampler did correctly identify the caustic areas as needing more samples, but a problem becomes apparent in the backdrop areas: the per-pixel adaptive sampler drove samples evenly within clustered “chunks”, but not evenly <em>across</em> different clusters. This behavior happens because while the per-pixel sampler is now taking into account variance across neighbours, it still doesn’t have any sort of global sense across the entire image! Instead, the sampler is finding localized pockets where variance seems even across pixels, but those pockets can be quite disconnected from further out areas. While the resultant render looks okay at a glance, clustered variance patterns become apparent if the image contrast is increased:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perpixel_globe_bdpt_highcontrast.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_perpixel_globe_bdpt_highcontrast.jpg" alt="Adaptive sampling per pixel, with enhanced contrast. Note the local clustering artifacts." /></a></p>
<p>Interestingly, these artifacts are reminiscent of the artifacts that show up in not-fully-converged Metropolis Light Transport renders. This similarity makes sense, since in both cases they arise from uneven localized convergence.</p>
<p>The next approach that I tried is a more global approach adapted from <a href="http://jo.dreggn.org/home/2009_stopping.pdf">Dammertz et al.’s paper, “A Hierarchical Automatic Stopping Condition for Monte Carlo Global Illumination”</a>. For the sake of simplicity, I’ll refer to the approach in this paper as Dammertz for the rest of this post. Dammertz works by considering the variance across an entire block of pixels at once and flagging the entire block as converged or unconverged, allowing for much more global analysis. At the first variance check, the only block considered is the entire image as one enormous block; if the total variance <em>e<sub>b</sub></em> in the entire block is below a termination threshold <em>e<sub>t</sub></em>, the block is flagged as converged and no longer needs to be sampled further. If <em>e<sub>b</sub></em> is greater than <em>e<sub>t</sub></em> but still less than a splitting threshold <em>e<sub>s</sub></em>, then the block will be split into two non-overlapping child blocks for the next round of variance checking after N iterations have passed. At each variance check, this process is repeated for each block, meaning the image eventually becomes split into an ocean of smaller blocks. Blocks are kept inside of a simple unsorted list, require no relational information to each other, and are removed from the list once marked as converged, making the memory requirements very simple. Blocks are split along their major axis, with the exact split point chosen to keep error as equal as possible across the two sides of the split.</p>
<p>The actual variance metric used is also very straightforward; instead of trying to calculate an estimate of variance based on neighbouring pixels, Dammertz stores two framebuffers: one buffer I for all accumulated radiances so far, and a second buffer A for accumulated radiances from every other iteration. As the image approaches full convergence, the differences between I and A should shrink, so an estimation of variance can be found simply by comparing radiance values between I and A. The specific details and formulations can be found in section 2.1 of the paper.</p>
<p>I made a single modification to the paper’s algorithm: I added a lower bound to the block size. Instead of allowing blocks to split all the way to a single pixel, I stop splitting after a block reaches 64 pixels in an 8x8 square. I found that splitting down to single pixels could sometimes cause false positives in convergence flagging, leading to missed pixels, similar to the PBRT approach. Forcing blocks to stop splitting at 64 pixels means there is a chance of false negatives for convergence, but a small amount of unnecessary oversampling is preferable to undersampling.</p>
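<p>As a simplified illustration of the I/A comparison, a per-block error estimate and block classification might look like the following; this uses a plain mean relative difference rather than the exact weighting from section 2.1 of the paper, and the buffer layout, names, and thresholds are assumptions for the example.</p>
<pre><code>#include <algorithm>
#include <cmath>
#include <vector>

struct Block {
    int x0, y0, x1, y1; // pixel bounds, [x0, x1) x [y0, y1)
};

// Estimate a block's error by comparing the full accumulation buffer I against
// the every-other-iteration buffer A (both stored here as per-pixel luminance).
// As the render converges, I and A approach each other and the error shrinks.
float blockError(const Block& b, const std::vector<float>& bufferI,
                 const std::vector<float>& bufferA, int imageWidth) {
    float error = 0.0f;
    int numPixels = 0;
    for (int y = b.y0; y < b.y1; ++y) {
        for (int x = b.x0; x < b.x1; ++x) {
            int i = y * imageWidth + x;
            error += std::abs(bufferI[i] - bufferA[i]) / std::max(bufferI[i], 1e-6f);
            ++numPixels;
        }
    }
    return numPixels > 0 ? error / static_cast<float>(numPixels) : 0.0f;
}

enum class BlockState { Converged, Split, KeepSampling };

// Below the termination threshold the block is done; between the termination
// and splitting thresholds the block is split into two children; otherwise it
// simply keeps sampling as one block.
BlockState classifyBlock(float error, float terminateThreshold, float splitThreshold) {
    if (error < terminateThreshold) return BlockState::Converged;
    if (error < splitThreshold) return BlockState::Split;
    return BlockState::KeepSampling;
}
</code></pre>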
<p>Using this per-block adaptive sampler, I got the following image, which again is superficially extremely similar to the fixed sampler result. This render was also set to run for a maximum of 5120 samples, but wound up averaging just 2920 samples per pixel, or about a 42.9% reduction in samples needed:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perblock_globe_bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_perblock_globe_bdpt.jpg" alt="Adaptive sampling per block, average 2920 samples per pixel, BDPT." /></a></p>
<p>The sample heatmap looks good too! The heatmap shows that the sampler correctly identified the caustic and highlight areas as needing more samples, and doesn’t have clustering issues in areas that needed fewer samples:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perblock.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perblock.png" alt="Sampling heatmap for adaptive sampling per block. Brighter areas indicate more samples." /></a></p>
<p>Boosting the image contrast shows that the image is free of local clustering artifacts and noise is even across the entire image, which is what we would expect:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/adaptive_perblock_globe_bdpt_highcontrast.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/preview/adaptive_perblock_globe_bdpt_highcontrast.jpg" alt="Adaptive sampling per block, with enhanced contrast. Note the even noise spread and lack of local clustering artifacts." /></a></p>
<p>Looking at the same 500% crop area as earlier, the adaptive per-block and fixed sampling renders are indistinguishable:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_fixed_adaptive_comparison.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Mar/globe_fixed_adaptive_comparison.png" alt="500% crop. Left: fixed sampling, 5120 samples per pixel, BDPT. Right: adaptive per-block sampling, average 2920 samples per pixel, BDPT." /></a></p>
<p>So with that, I think Dammertz works pretty well! Also, the computational and memory overhead required for the Dammertz approach is basically negligible relative to the actual rendering process. This approach is the one that is currently in Takua a0.5.</p>
<p>I actually have an additional adaptive sampling trick designed specifically for targeting fireflies. This additional trick works in conjunction with the Dammertz approach. However, this post is already much longer than I originally planned, so I’ll save that discussion for a later post. I’ll also be getting back to the PPM/VCM posts in my series of integrator posts shortly; I have not had much time to write on my blog since the vast majority of my time is currently focused on my thesis, but I’ll try to get something posted soon!</p>
https://blog.yiningkarlli.com/2015/02/flower-vase-render.html
Flower Vase Renders
2015-02-27T00:00:00+00:00
2015-02-27T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/flowers.cam2.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/flowers.cam2.jpg" alt="Rendered in Takua a0.5 using BDPT. Nearly a quarter of a billion triangles." /></a></p>
<p>In order to test Takua a0.5, I’ve been using my renderer on some quick little “pretty picture” projects. I recently ran across a fantastic flower vase model by artist <a href="https://www.behance.net/andi_mix">Andrei Mikhalenko</a> and used Andrei’s model as the basis for a shading exercise. The above and following images are rendered entirely in Takua a0.5 using bidirectional pathtracing. I textured and shaded everything using Takua a0.5’s layered material system, and also made some small modifications to the model (moved some flowers around, extended the stems to the bottom of the vase, and thickened the bottom of the vase). Additionally, I further subdivided the flower petals to gain additional detail and smoothness, meaning the final rendered model weighs in at nearly a quarter of a billion triangles. Obviously using such heavy models is not practical for a single prop in real world production, but I wanted to push the amount of geometry my renderer can handle. Overall, total memory usage for each of these renders hovered around 10.5 GB. All images were rendered at 1920x1080 resolution; click on each image to see the full resolution renders.</p>
<p>For the flowers, I split all of the flowers into five randomly distributed groups and assigned each group a different flower material. Each material is a two-sided material with a different BSDF assigned to each side, with side determined by the surface normal direction. For each flower, the outside BSDF has a slightly darker reflectance than the inner BSDF, which efficiently approximates the subsurface scattering effect real flowers have, but without actually having to use subsurface scattering. In this case, using a two-sided material to fake the effect of subsurface scattering is desirable since the model is so complex and heavy. Also, the stems and branches are all bump mapped.</p>
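<p>As an illustration of the two-sided idea, the per-hit side selection might look something like this; the names and the simple dot-product test are assumptions for the sketch, not Takua a0.5’s actual material system.</p>
<pre><code>struct Vec3 { float x, y, z; };
struct Bsdf;

float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Pick which side's BSDF to shade with: a ray arriving against the surface
// normal sees the outer (slightly darker) side of the petal, while a ray
// arriving along the normal sees the inner side.
const Bsdf* selectSide(const Vec3& normal, const Vec3& rayDirection,
                       const Bsdf* outerBsdf, const Bsdf* innerBsdf) {
    return dot(normal, rayDirection) < 0.0f ? outerBsdf : innerBsdf;
}
</code></pre>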
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/flowers.cam0.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/flowers.cam0.jpg" alt="Rendered in Takua a0.5 using BDPT. Note the complex caustics from the vase and water." /></a></p>
<p>This set of renders was a good test for bidirectional pathtracing because of the complex nature of the caustics in the vase and water; note that the branches inside of the vase and water cannot be efficiently rendered by unidirectional pathtracing since they are in glass and therefore cannot directly sample the light sources. The scene is lit by a pair of rectlights, one warmer and one cooler in temperature. This lighting setup, combined with the thick glass and water volume at the bottom of the vase, produces some interesting caustic on the ground beneath the vase.</p>
<p>The combination of the complex caustics and the complex geometry in the bouquet itself meant that a fairly deep maximum ray path length was required (16 bounces). Using BDPT helped immensely with resolving the complex bounce lighting inside of the bouquet, but the caustics proved to be difficult for BDPT; in all of these renders, everything except for the caustics converged within about 30 minutes on a quad-core Intel Core i7 machine, but the caustics took a few hours to converge in the top image, and a day to converge for the second image. I’ll discuss caustic performance in BDPT compared to PPM and VCM in some upcoming posts.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/flowers.cam1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/flowers.cam1.jpg" alt="Rendered in Takua a0.5 using BDPT. Depth of field and circular bokeh entirely in-camera." /></a></p>
<p>All depth of field is completely in-camera and in-renderer as well. No post processed depth of field whatsoever! For the time being, Takua a0.5 only supports circular apertures and therefore only circular bokeh, but I plan on adding custom aperture shapes after I finish my thesis work. In general, I think that testing my own renderer with plausibly real-world production quality scenes is very important. After all, having just a toy renderer with pictures of spheres is not very fun… the whole point of a renderer is to generate some really pretty pictures! For my next couple of posts, I’m planning on showing some more complex material/scene tests, and then moving onto discussing the PPM and VCM integrators in Takua.</p>
<p>Addendum: I should comment on the memory usage a bit more, since some folks have expressed interest in what I’m doing there. By default, the geometry actually weighs in closer to 30 GB in memory usage, so I had to implement some hackery to get this scene to fit in memory on a 16 GB machine. The hack is really simple: I added an optional half-float mode for geometry storage. In practice, using half-floats for geometry is usually not advisable due to precision loss, but in this particular scene, that precision loss becomes more acceptable due to a combination of depth of field hiding most alignment issues closer to camera, and sheer visual complexity making other alignment issues hard to spot without looking too closely. Additionally, for the flowers I also threw away all of the normals and recompute them on the fly at render-time. Recomputing normals on the fly results in a small performance hit, but is vastly preferable to going out of core.</p>
https://blog.yiningkarlli.com/2015/02/multiple-importance-sampling.html
Multiple Importance Sampling
2015-02-13T00:00:00+00:00
2015-02-13T00:00:00+00:00
Yining Karl Li
<p>A key tool introduced by Veach as part of his bidirectional pathtracing formulation is multiple importance sampling (MIS). As discussed in my <a href="">previous post</a>, the entire purpose of rendering from a mathematical perspective is to solve the light transport equation, which in the case of all pathtracing type renderers means solving the path integral formulation of light transport. Since the path integral does not have a closed form solution in all but the simplest of scenes, we have to estimate the full integral using various sampling techniques in path space, hence unidirectional pathtracing and bidirectional pathtracing and metropolis based techniques, etc. As we saw with the light source in glass case and with SDS paths, often a single path sampling technique is not sufficient for capturing a good estimate of the path integral. Instead, a good estimate often requires a combination of a number of different path sampling techniques; MIS is a critical mechanism for combining multiple sampling techniques in a manner that reduces total variance. Without MIS, directly combining sampling techniques through averaging can often have the opposite effect and <em>increase</em> total variance.</p>
<p>The following image is a recreation of the test scene used in the Veach thesis to demonstrate MIS. The scene consists of four glossy bars going from less glossy at the top to more glossy at the bottom, and four sphere lights of increasing size. The smallest sphere light has the highest emission intensity, and the largest sphere light has the lowest emission. I modified the scene to add in a large rectangular area light off camera on each side of the scene, and I added an additional bar to the bottom of the scene with gloss driven by a texture map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach.bdpt.jpg" alt="Figure 1: Recreation of the Veach MIS test scene. Rendered in Takua." /></a></p>
<p>The above scene is difficult to render using any single path sampling technique because of the various combinations of surface glossiness and emitter size/intensity. For combinations of a large emitter and a high-gloss (narrow) BSDF lobe, importance sampling by the BSDF tends to result in lower variance. In this case, the reason is that a given random ray direction is more likely to hit the large light than it is to fall within the narrow BSDF lobe, so matching the sample distribution to the BSDF lobe is more efficient. However, for combinations of a small emitter and a low-gloss (broad) lobe, the reverse is true. If we take the standard Veach scene and sample only by BSDF and then only by light source, we can see how each strategy fails in different cases. Both of these renders would eventually converge if left to render for long enough, but the rate of convergence in difficult areas would be extremely slow:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach_bsdfsample.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach_bsdfsample.bdpt.jpg" alt="Figure 2: BSDF sampling only, 64 iterations." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach_lightsample.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach_lightsample.bdpt.jpg" alt="Figure 3: Light sampling only, 64 iterations." /></a></p>
<p>MIS allows us to combine <em>m</em> different sampling strategies to produce a single unbiased estimator by weighting each sampling strategy by its probability distribution function (pdf). Mathematically, this is expressed as:</p>
<div>\[ \langle I_{j} \rangle_{MIS} = \sum_{i=1}^{m} \frac{1}{n_{i}} \sum_{j=1}^{n_{i}} w_{i}(X_{i,j}) \frac{f(X_{i,j})}{p_{i}(X_{i,j})} \]</div>
<p>where <em>X<sub>i,j</sub></em> are independent random variables drawn from some distribution function <em>p<sub>i</sub></em> and <em>w<sub>i</sub>(X<sub>i,j</sub>)</em> is some heuristic for weighting each sampling technique with respect to its pdf. The reason MIS is able to significantly lower variance is that a good MIS weighting function dampens contributions with low pdfs. The Veach thesis presents two good weighting heuristics, the <em>power heuristic</em> and the <em>balance heuristic</em>. The power heuristic is defined as:</p>
<div>\[ w_{i}(x) = \frac{[n_{i}p_{i}(x)]^{\beta}}{\sum_{k=1}^{m}[n_{k}p_{k}(x)]^{\beta}}\]</div>
<p>The power heuristic states that the weight for a given sampling technique should be the pdf of the sampling technique raised to a power <em>β</em> divided by the sum of the pdfs of all considered sampling techniques, with each sampling technique also raised to <em>β</em>. For the power heuristic, <em>β</em> is typically set to 2. The balance heuristic is simply the power heuristic for <em>β</em>=1. In the vast majority of cases, the balance heuristic is a near optimal weighting heuristic (and the power heuristic can cover most remaining edge cases), assuming that the base sampling strategies are decent to begin with.</p>
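<p>To make the weighting heuristics concrete, here is a small sketch of the balance and power heuristics for the common two-technique case (for example, BSDF sampling versus light sampling). This is an illustrative helper in the style found in most pathtracers, not Takua’s actual implementation:</p>
<pre><code>// Sketch of the balance heuristic and the power heuristic (beta = 2) for two
// sampling techniques. nA/nB are the number of samples drawn with each
// technique; pdfA/pdfB are the pdfs of the same sample under each technique.
inline float balanceHeuristic(float nA, float pdfA, float nB, float pdfB) {
    return (nA * pdfA) / (nA * pdfA + nB * pdfB);
}

inline float powerHeuristic(float nA, float pdfA, float nB, float pdfB) {
    float a = nA * pdfA;
    float b = nB * pdfB;
    return (a * a) / (a * a + b * b);
}
</code></pre>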
<p>For the standard Veach MIS demo scene, the best result is obtained by using MIS to combine BSDF and light sampling. The following image is the Veach scene again, this time rendered using MIS with 64 iterations. Note that all highlights are now roughly equally converged and the entire image matches the reference render above, apart from noise:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach_bothsample.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach_bothsample.bdpt.jpg" alt="Figure 4: Light and BSDF sampling combined using MIS, 64 iterations." /></a></p>
<p>By itself, BDPT does not necessarily have an improved convergence rate over vanilla unidirectional pathtracing; BDPT gains its significant edge in convergence rate only once MIS is applied, since BDPT’s efficiency comes from being able to extract a large number of path sampling techniques out of a single bidirectional path. To demonstrate the impact of MIS on BDPT, I rendered the following images using BDPT with and without MIS. The scene is a standard Cornell Box, but I replaced the back wall with a more complex scratched, glossy surface. The first image is the fully converged ground truth render, followed by with and without MIS:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/gloss_groundtruth.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/gloss_groundtruth.jpg" alt="Figure 5: Cornell Box with scratched glossy back wall. Rendered using BDPT with MIS." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/gloss_mis.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/gloss_mis.bdpt.jpg" alt="Figure 6: BDPT with MIS, 16 iterations." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/gloss_nomis.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/gloss_nomis.bdpt.jpg" alt="Figure 7: BDPT without MIS, 16 iterations." /></a></p>
<p>As seen above, the version of BDPT without MIS is significantly less converged. BDPT without MIS will still converge to the correct solution, but in practice can often be only as good as, or even worse than, unidirectional pathtracing.</p>
<p>Later on, we’ll discuss MIS beyond bidirectional pathtracing. In fact, MIS is the critical component to making VCM possible!</p>
<hr />
<p><strong>Addendum 01/12/2018</strong>: A reader noticed some brightness inconsistencies in the original versions of Figures 2 and 3, which came from bugs in Takua’s light sampling code without MIS at the time.
I have replaced the original versions of Figures 1, 2, 3, and 4 with new, correct versions rendered using the current version of Takua as of the time of writing for this addendum.</p>
<p>Because of how much noise there is in Figures 2 and 3, it might be slightly harder to see that they converge to the reference image.
To make the convergence clearer, I rendered out each sampling strategy using 1024 samples per pixel, instead of just 64:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach_bsdfsample.bdpt.hq.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach_bsdfsample.bdpt.hq.jpg" alt="Figure 8: BSDF sampling only, 1024 iterations." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/veach_lightsample.bdpt.hq.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/veach_lightsample.bdpt.hq.jpg" alt="Figure 9: Light sampling only, 1024 iterations." /></a></p>
<p>Note how Figures 8 and 9 match Figure 1 exactly, aside from noise.
In Figure 9, the reflection of the right-most sphere light on the top-most bar is still extremely noisy because of the extreme difficulty of finding a random light sample that happens to produce a valid BSDF response for the near-perfect specular lobe.</p>
<p>One last minor note: I’m leaving the main text of this post unchanged, but the updated renders use Takua’s modern shading system instead of the old one from 2015; in the new shading system, the metal bars use roughness instead of gloss, and use GGX instead of a normalized Phong variant.</p>
https://blog.yiningkarlli.com/2015/02/bidirectional-pathtracing-integrator.html
Bidirectional Pathtracing Integrator
2015-02-11T00:00:00+00:00
2015-02-11T00:00:00+00:00
Yining Karl Li
<p>As part of Takua a0.5’s complete rewrite, I implemented the <a href="https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2012/georgiev_sa2012/georgiev_sa2012.pdf">vertex connection and merging</a> (VCM) light transport algorithm. Implementing VCM was one of the largest efforts of this rewrite, and is probably the single feature that I am most proud of. Since VCM subsumes bidirectional pathtracing and progressive photon mapping, I also implemented Veach-style bidirectional pathtracing (BDPT) with multiple importance sampling (MIS) and Toshiya Hachisuka’s stochastic progressive photon mapping (SPPM) algorithm. Since each one of these integrators is fairly complex and interesting by themselves, I’ll be writing a series of posts on my BDPT and SPPM implementations before writing about my full VCM implementation. My plan is for each integrator to start with a longer post discussing the algorithm itself and show some test images demonstrating interesting properties of the algorithm, and then follow up with some shorter posts detailing specific tricky or interesting pieces and also show some pretty real-world production-plausible examples of when each algorithm is particularly useful.</p>
<p>As usual, we’ll start off with an image. Of course, all images in this post are rendered entirely using Takua a0.5. The following image is a Cornell Box lit by a textured sphere light completely encased in a glass sphere, rendered using my bidirectional pathtracer integrator. For reasons I’ll discuss a bit later in this post, this scene belongs to a whole class of scenes at which unidirectional pathtracing is absolutely abysmal; these scenes require a bidirectional integrator to converge in any reasonable amount of time:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/spherelight.bdpt.jpg" alt="Room lit with a textured sphere light enclosed in a glass sphere, converged result rendered using bidirectional pathtracing." /></a></p>
<p>To understand why BDPT is a more robust integrator than unidirectional pathtracing, we need to start by examining the light transport equation and its path integral formulation. The light transport equation was <a href="http://dl.acm.org/citation.cfm?id=15902">introduced by Kajiya</a> and is typically presented using the formulation from <a href="https://graphics.stanford.edu/papers/veach_thesis/">Eric Veach’s thesis</a>:</p>
<div>\[ L_{\text{o}}(\mathbf{x},\, \omega_{\text{o}}) \,=\, L_e(\mathbf{x},\, \omega_{\text{o}}) \ +\, \int_{\mathcal{S}^2} L_{\text{o}}(\mathbf{x}_\mathcal{M}(\mathbf{x},\, \omega_{i}),\, -\omega_{i}) \, f_s(\mathbf{x},\, \omega_{i} \rightarrow \omega_{\text{o}}) \, d \sigma_{\mathbf{x}}^{\perp} (\omega_{i}) \]</div>
<p>Put into words instead of math, the light transport equation simply states that the amount of light leaving any point is equal to the amount of light emitted at that point plus the total amount of light arriving at that point from all directions, weighted by the surface reflectance and absorption at that point. Combined with later extensions to account for effects such as volume scattering, subsurface scattering, and diffraction, the light transport equation serves as the basis for all of modern physically based rendering. In order to solve the light transport equation in a practical manner, Veach presents the path integral formulation of light transport:</p>
<div>\[ I_{j} = \int_{\Omega}^{} L_{e}(\mathbf{x}_{0})G(\mathbf{x}_{0}\leftrightarrow \mathbf{x}_{1})[\prod_{i=1}^{k-1}\rho(\mathbf{x}_{i})G(\mathbf{x}_{i}\leftrightarrow \mathbf{x}_{i+1})]W_{e}(\mathbf{x}_{k}) d\mu(\bar{\mathbf{x}}) \]</div>
<p>The path integral states that for a given pixel on an image, the amount of radiance arriving at that pixel is the integral of all radiance coming in through all paths in path space, where a path is the route taken by an individual photon from the light source through the scene to the camera/eye/sensor, and path space simply encompasses all possible paths. Since there is no closed form solution to the path integral, the goal of modern physically based ray-tracing renderers is to sample a representative subset of path space in order to produce a reasonably accurate estimate of the path integral per pixel; progressive renderers estimate the path integral piece by piece, producing a better and better estimate of the full integral with each new iteration.</p>
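<p>The progressive part is conceptually very simple; as a hedged sketch (names are illustrative, this is not Takua’s actual framebuffer code), each iteration’s per-pixel estimate just gets folded into a running average:</p>
<pre><code>// Minimal sketch of progressive estimation: each iteration produces one new
// per-pixel estimate of the path integral, and the framebuffer keeps a running
// average that converges towards the true integral as iterations accumulate.
#include <vector>
#include <glm/glm.hpp>

void accumulateIteration(std::vector<glm::vec3>& framebuffer,
                         const std::vector<glm::vec3>& iterationResult,
                         int iterationNumber) {  // 1-based iteration count
    for (size_t i = 0; i < framebuffer.size(); ++i) {
        // Incremental mean: avg_n = avg_(n-1) + (x_n - avg_(n-1)) / n
        framebuffer[i] += (iterationResult[i] - framebuffer[i]) /
                          float(iterationNumber);
    }
}
</code></pre>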
<p>At this point, we should take a brief detour to discuss the terms “unbiased” versus “biased” rendering. Within the graphics world, there’s a lot of confusion and preconceptions about what each of these terms mean. In actuality, an unbiased rendering algorithm is simply one where the expected value of each iteration’s estimate is the correct answer for the particular piece of path space being sampled, meaning the only error present is variance. A biased rendering algorithm is one where each iteration’s estimate carries some systematic error in addition to variance. However, biased algorithms are not necessarily a bad thing; a biased algorithm can still be consistent, that is, it converges in the limit to the same result as an unbiased algorithm. So in practice, we should care less about whether or not an algorithm is biased or unbiased so long as it is consistent. BDPT is an unbiased, consistent integrator whereas SPPM is a biased but still consistent integrator.</p>
<p>Going back to the path integral, we can quickly see where unidirectional pathtracing comes from once we view light transport through the path integral. The most obvious way to evaluate the path integral is to do exactly as the path integral says: trace a path starting from a light source, through the scene, and if the path eventually hits the camera, accumulate the radiance along the path. This approach is one form of unidirectional pathtracing that is typically referred to as light tracing (LT). However, since the camera is a fairly small target for paths to hit, unidirectional pathtracing is typically implemented going in reverse: start each path at the camera, and trace through the scene until each path hits a light source or goes off into empty space and is lost. This approach is called backwards pathtracing and is what people usually are referring to when they use the term pathtracing (PT).</p>
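<p>A heavily simplified sketch of the backwards pathtracing loop looks something like the following. The <code>Scene</code>, <code>Ray</code>, <code>Hit</code>, and <code>Rng</code> interfaces here are hypothetical stand-ins, and direct light importance sampling and Russian roulette are left out to keep the loop minimal:</p>
<pre><code>// Sketch of backwards (camera-first) pathtracing for one pixel sample.
#include <cmath>
#include <glm/glm.hpp>

glm::vec3 tracePath(Ray ray, const Scene& scene, Rng& rng, int maxDepth) {
    glm::vec3 L(0.0f);           // accumulated radiance
    glm::vec3 throughput(1.0f);  // running product of BSDF * cos / pdf
    for (int depth = 0; depth < maxDepth; ++depth) {
        Hit hit;
        if (!scene.intersect(ray, hit)) {
            break;  // path went off into empty space and is lost
        }
        L += throughput * hit.emittedRadiance(-ray.dir);  // hit a light source
        glm::vec3 wi;
        float pdf;
        glm::vec3 f = hit.bsdf().sample(-ray.dir, hit.n, rng, wi, pdf);
        if (pdf <= 0.0f) {
            break;
        }
        throughput *= f * std::abs(glm::dot(wi, hit.n)) / pdf;
        ray = Ray(hit.p, wi);  // continue the path in the sampled direction
    }
    return L;
}
</code></pre>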
<p>As I discussed a few years back in <a href="http://blog.yiningkarlli.com/2013/04/importance-sampled-direct-lighting.html">a previous post</a>, pathtracing with direct light importance sampling is pretty efficient at a wide variety of scene types. However, pathtracing with direct light importance sampling will fail for any type of path where the light source cannot be directly sampled; we can easily construct a number of plausible, common setups where this situation occurs. For example, imagine a case where a light source is completely enclosed within a glass container, such as a glowing filament within a glass light bulb. In this case, for any pair consisting of a point in space and a point on the light source, the direction vector to hit the light point from the surface point through glass is not just the light point minus the surface point normalized, but instead has to be at an angle such that the path hits the light point after refracting through glass. Without knowing the exact angle required to make this connection beforehand, the probability of a random direct light sample direction arriving at the glass interface at the correct angle is extremely small; this problem is compounded if the light source itself is very small to begin with.</p>
<p>Taking the sphere light in a glass sphere scene from earlier, we can compare the result of pathtracing without glass around the light versus with glass around the light. The following comparison shows 16 iterations each, and we can see that the version with glass around the light is significantly less converged:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_16_yesglass.pt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_16_yesglass.pt.png" alt="Pathtracing, 16 iterations, with glass sphere." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_16_noglass.pt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/spherelight_16_noglass.pt.jpg" alt="Pathtracing, 16 iterations, without glass sphere." /></a></p>
<p>Generally, pathtracing is terrible at resolving caustics, and the glass-in-light scenario is one where all illumination within the scene is through caustics. Conversely, light tracing is quite good at handling caustics and can be combined with direct sensor importance sampling (same idea as direct light importance sampling, just targeting the camera/eye/sensor instead of a light source). However, light tracing in turn is bad at handling certain scenarios that pathtracing can handle well, such as small distant spherical lights.</p>
<p>The following image again shows the sphere light in a glass sphere scene, but is now rendered for 16 iterations using light tracing. Note how the render is significantly more converged, for approximately the same computational cost. The glass sphere and sphere light render as black since in light tracing, the camera cannot be directly sampled from a specular surface.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_16_yesglass.lt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/spherelight_16_yesglass.lt.jpg" alt="Light tracing, 16 iterations, with glass sphere." /></a></p>
<p>Since bidirectional pathtracing subsumes both pathtracing and light tracing, I implemented pathtracing and light tracing simultaneously and used each integrator as a check on the other, since correct integrators should converge to the same result. Implementing light tracing requires BSDFs and emitters to be a bit more robust than in vanilla pathtracing; emitters have to support both emission and illumination, and BSDFs have to support bidirectional evaluation. Light tracing also requires the ability to directly sample the camera and intersect the camera’s image plane to figure out what pixel to contribute a path to; as such, I implemented a rasterize function for my thin-lens and fisheye camera models. My thin-lens camera’s rasterization function supports the same depth of field and bokeh shape capabilities that the thin-lens camera’s raycast function supports.</p>
<p>The key insight behind bidirectional pathtracing is that since light tracing and vanilla pathtracing each have certain strengths and weaknesses, combining the two sampling techniques should result in a more robust path sampling technique. In BDPT, for each pixel per iteration, a path is traced starting from the camera and a second path is traced starting from a point on a light source. The two paths are then joined into a single path, conditional on an unoccluded line of sight from the end vertices of the two paths to each other. A BDPT path of length <em>k</em> with <em>k+1</em> vertices can then be used to generate up to <em>k+2</em> path sampling techniques by connecting each vertex on each subpath to every other vertex on the other subpath. While BDPT per iteration is much more expensive than unidirectional pathtracing, the much larger number of sampling techniques leads to a significantly higher convergence rate that typically outweighs the higher computational cost.</p>
<p>Below is the same scene as above rendered with 16 iterations of BDPT, and also rendered with the same amount of computation time as the earlier pathtraced comparisons (which works out to about 5 iterations of BDPT). Note how with just 5 iterations, the BDPT result with the glass sphere has about the same level of noise as the pathtraced result for 16 iterations <em>without</em> the glass sphere. At 16 iterations, the BDPT result with the glass sphere is noticeably more converged than the pathtraced result for 16 iterations without the glass sphere.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_16_yesglass.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/spherelight_16_yesglass.bdpt.jpg" alt="BDPT, 16 iterations, with glass sphere." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/spherelight_5_yesglass.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/spherelight_5_yesglass.bdpt.jpg" alt="BDPT, 5 iterations (same compute time as 16 iterations pathtracing), with glass sphere." /></a></p>
<p>A naive implementation of BDPT would be, for each pixel per iteration, to trace a full light subpath, store the result, trace a full camera subpath, store the result, and then perform the connection operations between each vertex pair. However, since this approach requires storing the entirety of both subpaths for the entire iteration, there is room for some improvement. For Takua a0.5, my implementation stores only the full light subpath. At each bounce of the camera subpath, my implementation connects the current vertex to each vertex of the stored light subpath, weights and accumulates the result, and then moves onto the next bounce without having to store previous path vertices.</p>
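<p>As a hedged sketch of that strategy (hypothetical <code>PathVertex</code>, <code>Scene</code>, and helper functions; MIS weighting and visibility testing are hidden inside <code>connectAndWeight</code>), the per-pixel BDPT sample looks roughly like this:</p>
<pre><code>// Sketch: store only the light subpath, then connect to it at every camera
// bounce without ever storing the camera subpath. Paths where the camera
// subpath hits an emitter directly are omitted here for brevity.
#include <cmath>
#include <vector>
#include <glm/glm.hpp>

glm::vec3 bdptSample(const Scene& scene, const Ray& cameraRay, Rng& rng,
                     int maxDepth) {
    // 1. Trace and keep the full light subpath for this pixel and iteration.
    std::vector<PathVertex> lightPath = traceLightSubpath(scene, rng, maxDepth);

    glm::vec3 L(0.0f);
    glm::vec3 throughput(1.0f);
    Ray ray = cameraRay;
    // 2. Walk the camera subpath one bounce at a time.
    for (int t = 1; t <= maxDepth; ++t) {
        Hit hit;
        if (!scene.intersect(ray, hit)) break;
        PathVertex cameraVertex = makeCameraVertex(hit, throughput);

        // Connect this camera vertex to every stored light subpath vertex;
        // each (s, t) pairing is one of the sampling techniques MIS combines.
        for (size_t s = 0; s < lightPath.size(); ++s) {
            L += connectAndWeight(scene, lightPath[s], cameraVertex);
        }

        // Continue the camera subpath; previous camera vertices are not kept.
        glm::vec3 wi;
        float pdf;
        glm::vec3 f = hit.bsdf().sample(-ray.dir, hit.n, rng, wi, pdf);
        if (pdf <= 0.0f) break;
        throughput *= f * std::abs(glm::dot(wi, hit.n)) / pdf;
        ray = Ray(hit.p, wi);
    }
    return L;
}
</code></pre>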
<p>The following image is another example of a scene that BDPT is significantly better at sampling than any unidirectional pathtracing technique. The scene consists of a number of diffuse spheres and spherical lights inside of a glass bunny. In this scene, everything outside of the bunny is being lit using only caustics, while diffuse surfaces inside of the bunny are being lit using a combination of direct lighting, indirect diffuse bounces, and caustics from outside of the bunny reflecting/refracting back <em>into</em> the bunny. This last type of lighting belongs to a category of paths known as <em>specular-diffuse-specular</em> (SDS) paths that are especially difficult to sample unidirectionally.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/bunnylight.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/bunnylight.bdpt.jpg" alt="Various diffuse spheres and sphere lights inside of a glass bunny, rendered using BDPT." /></a></p>
<p>Here is the same scene as above, but with the glass bunny removed just so seeing what is going on with the spheres is a bit easier:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/bunnylight_nobunny.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/bunnylight_nobunny.bdpt.jpg" alt="Same spheres as above, sans bunny. Rendered using BDPT." /></a></p>
<p>Comparing pathtracer versus BDPT performance for 16 iterations, BDPT’s vastly better performance on this scene becomes obvious:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/bunnylight_16.pt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/bunnylight_16.pt.png" alt="16 iterations, rendered using pathtracing." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Feb/bunnylight_16.bdpt.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Feb/preview/bunnylight_16.bdpt.jpg" alt="16 iterations, rendered using BDPT." /></a></p>
<p>In the next post, I’ll write about multiple importance sampling (MIS), how it impacts BDPT, and my MIS implementation in Takua a0.5.</p>
https://blog.yiningkarlli.com/2015/01/consistent-normal-interpolation.html
Consistent Normal Interpolation
2015-01-30T00:00:00+00:00
2015-01-30T00:00:00+00:00
Yining Karl Li
<p>I recently ran into a problem with interpolated normals. Instead of supporting sphere primitives directly, Takua Rev 5 generates polygon mesh spheres and handles them the same way as any other polygon mesh is handled. However, when I ran a test using a glass sphere, a lot of fireflies appeared:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jan/badnormals.0.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jan/badnormals.0.png" alt="Polygon mesh sphere with heavy firefly artifacts." /></a></p>
<p>The fireflies are an artifact arising from how normal interpolation interacts with specular materials. Since the sphere is a polygonal mesh, normal interpolation is required to give the sphere a smooth appearance instead of a faceted one. The interpolation scheme I was using was vanilla Phong normal interpolation: store a smoothed normal at each vertex, and then calculate the smooth shading normal at each point as the barycentric-coordinate-weighted sum of the smooth normals at each vertex of the current triangle. This works well for most cases, but a problem arises at grazing angles: since the smooth shading normal corresponds not to the actual geometry but to a “virtual” smoothed version of the geometry, sometimes outgoing specular rays will end up going below the tangent plane of the current hit point. Because of this, rays hitting a glass sphere with Phong normal interpolation at a grazing angle can sometimes go the wrong way, hence the artifacts in the above image.</p>
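<p>For reference, here is a quick illustrative sketch (not Takua’s actual code) of vanilla Phong normal interpolation, along with the dot-product test that reveals the failure case described above, where a direction generated using the shading normal actually points below the true geometric surface:</p>
<pre><code>// Sketch of Phong normal interpolation and the below-surface check. This is
// only an illustration of the problem; it is not the consistent normal
// interpolation fix from the paper discussed later in this post.
#include <glm/glm.hpp>

// Barycentric-coordinate-weighted blend of the per-vertex smooth normals.
inline glm::vec3 phongNormal(const glm::vec3& n0, const glm::vec3& n1,
                             const glm::vec3& n2, float u, float v) {
    return glm::normalize(n0 * (1.0f - u - v) + n1 * u + n2 * v);
}

// True when a direction generated from the shading normal points below the
// actual geometric surface, which is what produces wrong-way specular rays
// (and therefore fireflies) at grazing angles.
inline bool goesBelowGeometricSurface(const glm::vec3& dir,
                                      const glm::vec3& geometricNormal) {
    return glm::dot(dir, geometricNormal) <= 0.0f;
}
</code></pre>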
<p>Of course, the closer the actual geometry lines up to the virtual smoothed geometry, the less this grazing angle problem occurs. However, in order to completely eliminate artifacting, the polygon geometry needs to approach the limit of the virtual smoothed geometry. In this render, I regenerated the sphere with two more levels of subdivision. As a result, there are fewer fireflies, but fireflies are still present:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jan/badnormals.1.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jan/badnormals.1.png" alt="More heavily subdivided polygon mesh sphere. Fewer but still present firefly artifacts." /></a></p>
<p>Initially I thought about just getting rid of the fireflies by checking pixel intensities and clamping intensities that were significantly brighter than their immediate neighbors, which is a fairly basic/standard firefly reduction strategy. However, since in this case the fireflies occur mostly at grazing angles and therefore on silhouettes, intensity clamping can lead to some unpleasant aliasing on silhouettes.</p>
<p>Fortunately, there was a paper by Alexander Reshetov, Alexei Soupikov, and William R. Mark at SIGGRAPH Asia 2010 about dealing with this exact problem. The paper, <a href="http://dl.acm.org/citation.cfm?id=1866168">“Consistent Normal Interpolation”</a>, presents a simple method for tweaking Phong normal interpolation to guarantee that reflected rays never go below the tangent plane. The method is based on incoming ray direction and the angle between the smooth interpolated normal and true face normal. The actual method presented in the paper is very straightforward to implement, but the derivation of the algorithm is fairly interesting and involves solving a nontrivial optimization problem to find a scaling term.</p>
<p>I implemented a slightly modified version of the algorithm presented on page 5 of the paper. The modification I made is simply to account for rays hitting polygons from below the tangent plane, as in the case of internal refraction. Now interpolated normals at grazing angles no longer produce firefly artifacts:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2015/Jan/consistentnormals.png"><img src="https://blog.yiningkarlli.com/content/images/2015/Jan/consistentnormals.png" alt="Polygon sphere with consistent normal interpolation. Note the lack of firefly artifacts." /></a></p>
<p>I’m working on writing up a lot of stuff, so more soon! Stay tuned!</p>
https://blog.yiningkarlli.com/2014/12/takua-revision-5.html
Takua Render Revision 5
2014-12-28T00:00:00+00:00
2014-12-28T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/xyzrgb_dragon.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/xyzrgb_dragon.png" alt="Rough blue metallic XYZRGB Dragon model in a Cornell Box, rendered entirely with Takua Render a0.5" /></a></p>
<p>I haven’t posted much at all this past year, but I’ve been working on some stuff that I’m really excited about! For the past year and a half, I’ve been building a new, much more advanced version of Takua Render completely from scratch. In this post, I’ll give a brief introduction and runthrough of the new version of Takua, which I’ve numbered as Revision 5 or a0.5. Since I first started exploring the world of renderer construction a few years back, I’ve learned an immense amount about every part of building a renderer, ranging from low-level architecture all the way up to light transport and surface algorithms. I’ve also been fortunate enough to be able to meet and talk to a lot of people working on professional, industry quality renderers and people from some of the best rendering research groups in the world, and so this new version of my own renderer is an attempt at applying everything I’ve learned and building a base for even further future improvement and research projects.</p>
<p>Very broadly, the two things I’m most proud of with Takua a0.5 are the internal renderer architecture and a lot of work on integrators and light transport. Takua a0.5’s internal architecture is heavily influenced by Disney’s <a href="https://disney-animation.s3.amazonaws.com/uploads/production/publication_asset/70/asset/Sorted_Deferred_Shading_For_Production_Path_Tracing.pdf">Sorted Deferred Shading</a> paper, the internal architecture of <a href="http://graphics.cs.williams.edu/papers/OptiXSIGGRAPH10/Parker10OptiX.pdf">NVIDIA’s Optix engine</a>, and the modular architecture of <a href="https://www.mitsuba-renderer.org/">Mitsuba Render</a>. In the light transport area, Takua a0.5 implements not just unidirectional pathtracing with direct light importance sampling (PT), but also correctly implements multiple importance sampled bidirectional pathtracing (BDPT), progressive photon mapping (PPM), and the relatively new <a href="https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2012/georgiev_sa2012/georgiev_sa2012.pdf">vertex connection and merging</a> (VCM) algorithm. I’m planning on writing a series of posts in the next few weeks/months that will dive in depth into Takua a0.5’s various features.</p>
<p>Takua a0.5 has also marked a pretty large shift in strategy in terms of targeted hardware. In previous versions of Takua, I did a lot of exploration with getting the entire renderer to run on CUDA-enabled GPUs. In the interest of increased architectural flexibility, Takua a0.5 does not have a 100% GPU mode anymore. Instead, Takua a0.5 is structured in such a way that certain individual modules can be accelerated by running on the GPU, but overall much of the core of the renderer is designed to make efficient use of the CPU to achieve high performance while bypassing a lot of the complexity of building a pure GPU renderer. Again, I’ll have a detailed post on this decision later down the line.</p>
<p>Here is a list of the some of the major new things in Takua a0.5:</p>
<ul>
<li>Completely modular plugin system
<ul>
<li>Programmable ray/shader queue/dispatch system</li>
<li>Natively bidirectional BSDF system</li>
<li>Multiple geometry backends optimized for different hardware</li>
<li>Plugin systems for cameras, lights, acceleration structures, geometry, viewers, materials, surface patterns, BSDFs, etc.</li>
</ul>
</li>
<li>Task-based concurrency and parallelism via Intel’s TBB library</li>
<li>Mitsuba/PBRT/Renderman 19 RIS style integrator system
<ul>
<li>Unidirectional pathtracing with direct light importance sampling</li>
<li>Lighttracing with camera importance sampling</li>
<li>Bidirectional pathtracing with multiple importance sampling</li>
<li>Progressive photon mapping</li>
<li>Vertex connection and merging</li>
<li>All integrators designed to be re-entrant and capable of deferred operations</li>
</ul>
</li>
<li>Native animation support
<ul>
<li>Renderer-wide keyframing/animation support</li>
<li>Transformational AND deformational motion blur</li>
<li>Motion blur support for all camera, material, surface pattern, light, etc. attributes</li>
<li>Animation/keyframe sequences can be instanced in addition to geometry instancing</li>
</ul>
</li>
</ul>
<p>The blue metallic XYZRGB dragon image is a render that was produced using only Takua a0.5. Since I now have access to the original physical Cornell Box model, I thought it would be fun to use a 100% measurement-accurate model of the Cornell Box as a test scene while working on Takua a0.5. All of these renders have no post-processing whatsoever. Here are some other renders made as tests during development:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/cornellbox.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/cornellbox.png" alt="Vanilla Cornell Box with measurements taken directly off of the original physical Cornell Box model." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/dragon.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/dragon.png" alt="Glass Stanford Dragon producing some interesting caustics on the floor." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/glassball.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/glassball.png" alt="Floating glass ball as another caustics test." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/mirrorcube.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/mirrorcube.png" alt="Mirror cube." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/animblur.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/animblur.png" alt="Deformational motion blur test using a glass rectangular prism with the top half twisting over time." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Dec/uvbox.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Dec/uvbox.png" alt="A really ugly texture test that for some reason I kind of like." /></a></p>
<p>More interesting non-Cornell Box renders coming in later posts!</p>
<p>Edit: Since making this post, I found a weighting bug that was causing a lot of energy to be lost in indirect diffuse bounces. I’ve since fixed the bug and updated this post with re-rendered versions of all of the images.</p>
https://blog.yiningkarlli.com/2014/11/sky-paper.html
SIGGRAPH Asia 2014 Paper- A Framework for the Experimental Comparison of Solar and Skydome Illumination
2014-11-19T00:00:00+00:00
2014-11-19T00:00:00+00:00
Yining Karl Li
<p>One of the projects I worked on in my first year as part of Cornell University’s <a href="http://graphics.cornell.edu/">Program of Computer Graphics</a> has been published in the November 2014 issue of ACM Transactions on Graphics and is being presented at SIGGRAPH Asia 2014! The paper is “<a href="http://dl.acm.org/citation.cfm?doid=2661229.2661259">A Framework for the Experimental Comparison of Solar and Skydome Illumination</a>”, and the team on the project was my junior advisor <a href="http://www.graphics.cornell.edu/~kiderj/">Joseph T. Kider Jr.</a>, my lab-mates <a href="http://www.danknowlton.com/">Dan Knowlton</a> and <a href="http://www.jeremynewlin.info/">Jeremy Newlin</a>, myself, and my main advisor <a href="http://www.graphics.cornell.edu/people/director.html">Donald P. Greenberg</a>.</p>
<p>The bulk of my work on this project was in implementing and testing sky models inside of <a href="http://www.mitsuba-renderer.org">Mitsuba</a> and developing the paper’s sample-driven model. Interestingly, I also did a lot of climbing onto the roof of Cornell’s Rhodes Hall building for this paper; Cornell’s facilities was kind enough to give us access to the roof of Rhodes Hall to set up our capture equipment on. This usually involved Joe, Dan, and myself hauling multiple tripods and backpacks of gear up onto the roof in the morning, and then taking it all back down in the evening. Sunny clear skies can be a rare sight in Ithaca, so getting good captures took an awful lot of attempts!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Nov/siggraphasia2014paper.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Nov/siggraphasia2014paper.png" alt="" /></a></p>
<p>Here is the paper abstract:</p>
<p><em>The illumination and appearance of the solar/skydome is critical for many applications in computer graphics, computer vision, and daylighting studies. Unfortunately, physically accurate measurements of this rapidly changing illumination source are difficult to achieve, but necessary for the development of accurate physically-based sky illumination models and comparison studies of existing simulation models.</em></p>
<p><em>To obtain baseline data of this time-dependent anisotropic light source, we design a novel acquisition setup to simultaneously measure the comprehensive illumination properties. Our hardware design simultaneously acquires its spectral, spatial, and temporal information of the skydome. To achieve this goal, we use a custom built spectral radiance measurement scanner to measure the directional spectral radiance, a pyranometer to measure the irradiance of the entire hemisphere, and a camera to capture high-dynamic range imagery of the sky. The combination of these computer-controlled measurement devices provides a fast way to acquire accurate physical measurements of the solar/skydome. We use the results of our measurements to evaluate many of the strengths and weaknesses of several sun-sky simulation models. We also provide a measurement dataset of sky illumination data for various clear sky conditions and an interactive visualization tool for model comparison analysis available at http://www.graphics.cornell.edu/resources/clearsky/.</em></p>
<p>The paper and related materials can be found at:</p>
<ul>
<li><a href="http://www.graphics.cornell.edu/resources/clearsky/index.htm">Project Page (Preprint paper, supplemental materials, and SIGGRPAGH Asia materials)</a></li>
<li><a href="http://dl.acm.org/citation.cfm?doid=2661229.2661259">Official Print Version (ACM Library)</a></li>
</ul>
<p>Joe Kider will be presenting the paper at <a href="http://sa2014.siggraph.org/en/">SIGGRAPH Asia 2014</a> in Shenzhen as part of the <a href="http://sa2014.siggraph.org/en/attendees/technical-papers.html?view=session&type=techpapers&sessionid=3">Light In, Light Out</a> Technical Papers session. Hopefully our data will prove useful to future research!</p>
<hr />
<p><strong>Addendum 04/26/2017</strong>: I added a personal project page for this paper to my website, <a href="http://www.yiningkarlli.com/projects/skydomecompare.html">located here</a>. My personal page mirrors the same content found on the main site, including an author’s version of the paper, supplemental materials, and more.</p>
https://blog.yiningkarlli.com/2014/02/flip-meshing-pipeline.html
PIC/FLIP Simulator Meshing Pipeline
2014-02-14T00:00:00+00:00
2014-02-14T00:00:00+00:00
Yining Karl Li
<p>In my last post, I gave a summary of how the core of my new PIC/FLIP fluid simulator works and gave some thoughts on the process of building OpenVDB into my simulator. In this post I’ll go over the meshing and rendering pipeline I worked out for my simulator.</p>
<p>Two years ago, when my friend <a href="http://www.danknowlton.com/">Dan Knowlton</a> and I built our semi-Lagrangian fluid simulator, we had an immense amount of trouble with finding a good meshing and rendering solution. We used a standard marching cubes implementation to construct a mesh from the fluid levelset, but the meshes we wound up with had a lot of flickering issues. The flickering was especially apparent when the fluid had to fit inside of solid boundaries, since the liquid-solid interface wouldn’t line up properly. On top of that, we rendered the fluid using Vray, but relied on an irradiance map + light cache approach that wasn’t very well suited for high motion and large amounts of refractive fluid.</p>
<p>This time around, I’ve tried to build a new meshing/rendering pipeline that resolves those problems. My new meshing/rendering pipeline produces stable, detailed meshes that fit correctly into solid boundaries, all with minimal or no flickering. The following video is the same “dambreak” test from my previous test, but fully meshed and rendered using Vray:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/87050516" frameborder="0">PIC/FLIP Simulator Dam Break Test- Final Render</iframe></div>
<p>One of the main issues with the old meshing approach was that marching cubes was run directly on the same level set we were using for the simulation, which meant that the resolution of the final mesh was effectively bound to the resolution of the fluid. In a pure semi-Lagrangian simulator, this coupling makes sense; however, in a PIC/FLIP simulator, the resolution of the simulator is dependent on the particle count and not the projection step grid resolution. This property means that even on a simulation with a grid size of 128x64x64, extremely high resolution meshes should be possible if there are enough particles, as long as the level set is constructed directly from the particles, completely independently of the projection step grid dimensions.</p>
<p>Fortunately, OpenVDB comes with an enormous toolkit that includes tools for constructing level sets from various types of geometry, including particles, and tools for adaptive level set meshing. OpenVDB also comes with a number of level set operators that allow for artistic tuning of level sets, such as tools for dilating, eroding, and smoothing level sets. At the SIGGRAPH 2013 OpenVDB course, <a href="http://www.openvdb.org/download/openvdb_dreamworks.pdf">Dreamworks had a presentation</a> on how they used OpenVDB’s level set operator tools to extract really nice looking, detailed fluid meshes from relatively low resolution simulations. I also integrated Walt Disney Animation Studios’ <a href="http://www.disneyanimation.com/technology/partio.html">Partio</a> library for exporting particle data to standard formats, so that I could output particles in addition to level sets and meshes.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/adaptivemeshing.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/adaptivemeshing.png" alt="Zero adaptive meshing (on the left) versus adaptive meshing with 0.5 adaptivity (on the right). Note the significantly lower poly count in the adaptive meshing, but also the corresponding loss of detail in the mesh." /></a></p>
<p>I started by building support for OpenVDB’s adaptive level set meshing directly into my simulator and dumping out OBJ sequences straight to disk. In order to save disk space, I enabled fairly high adaptivity in the meshing. However, upon doing a first render test, I discovered a problem: since OpenVDB’s adaptive meshing optimizes the adaptivity per frame, the result is not temporally coherent with respect to mesh resolution. By itself this property is not a big deal, but it makes reconstructing temporally coherent normals difficult, which can contribute to flickering in final rendering. So, I decided that disk space was not as big a deal and just disabled adaptivity in OpenVDB’s meshing for smaller simulations; in sufficiently large sims, the scale of the final render usually makes normal issues far less important while disk space demands become much greater, so the tradeoffs of adaptivity become more worthwhile.</p>
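<p>For reference, invoking the adaptive mesher is a single call; here is a hedged sketch against the OpenVDB API as I remember it from the version I was using (check the current OpenVDB headers for exact signatures before copying this):</p>
<pre><code>// Hedged sketch: extract a render mesh from a liquid level set using
// OpenVDB's adaptive mesher. A nonzero adaptivity collapses flat regions into
// fewer, larger polygons, which is what breaks temporal coherence per frame.
#include <vector>
#include <openvdb/openvdb.h>
#include <openvdb/tools/VolumeToMesh.h>

void meshLiquidLevelSet(const openvdb::FloatGrid& liquidSdf,
                        double adaptivity,  // 0.0 = uniform, up to 1.0
                        std::vector<openvdb::Vec3s>& points,
                        std::vector<openvdb::Vec3I>& triangles,
                        std::vector<openvdb::Vec4I>& quads) {
    // Mesh the zero isosurface of the signed distance field.
    openvdb::tools::volumeToMesh(liquidSdf, points, triangles, quads,
                                 /*isovalue=*/0.0, adaptivity);
}
</code></pre>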
<p>The next problem was getting a stable, fitted liquid-solid interface. Even with a million particles and a 1024x512x512 level set driving mesh construction, the produced fluid mesh still didn’t fit the solid boundaries of the sim precisely. The reason is simple: level set construction from particles works by treating each particle as a sphere with some radius and then unioning all of the spheres together. The first solution I thought of was to dilate the level set and then difference it with a second level set of the solid objects in the scene. Since Houdini has full OpenVDB support and I wanted to test this idea quickly with visual feedback, I prototyped this step in Houdini instead of writing a custom tool from scratch. This approach wound up not working well in practice. I discovered that in order to get a clean result, the solid level set needed to be extremely high resolution to capture all of the detail of the solid boundaries (such as sharp corners). Since the output levelset from VDB’s difference operator has to match the resolution of the highest resolution input, that meant the resultant liquid level set was also extremely high resolution. On top of that, the entire process was extremely slow, even on smaller grids.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/edgecleanup.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/edgecleanup.png" alt="The mesh on the left has a cleaned up, stable liquid-solid interface. The mesh on the right is the same mesh as the one on the left, but before going through cleanup." /></a></p>
<p>The solution I wound up using was to process the mesh instead of the level set, since the mesh represents significantly less data, and at the end of the day the mesh is what we want to have a clean liquid-solid interface. The solution is, for every vertex in the liquid mesh, to raycast to find the nearest point on the solid boundary to that vertex (this can be done either stochastically, or a level set version of the solid boundary can be used to inform a good starting direction). If the closest point on the solid boundary is within some epsilon distance of the vertex, move the vertex to be at the solid boundary. Obviously, this approach is far simpler than attempting to difference level sets, and it works pretty well. I prototyped this entire system in Houdini.</p>
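<p>The vertex cleanup step itself is only a few lines; here is an illustrative sketch (not the Houdini prototype itself), with the closest-point query passed in as a stand-in for whichever method answers it, whether stochastic raycasts or a lookup against a solid level set:</p>
<pre><code>// Sketch of the liquid-solid interface cleanup: for each liquid mesh vertex,
// find the closest point on the solid boundary, and snap the vertex onto the
// boundary if it is within some epsilon distance.
#include <functional>
#include <vector>
#include <glm/glm.hpp>

void snapVerticesToSolidBoundary(
        std::vector<glm::vec3>& liquidVertices, float epsilon,
        const std::function<glm::vec3(const glm::vec3&)>& closestPointOnSolid) {
    for (glm::vec3& v : liquidVertices) {
        glm::vec3 boundaryPoint = closestPointOnSolid(v);
        if (glm::distance(v, boundaryPoint) < epsilon) {
            v = boundaryPoint;  // fit the liquid surface flush to the solid
        }
    }
}
</code></pre>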
<p>For rendering, I used Vray’s ply2mesh utility to dump the processed fluid meshes directly to .vrmesh files and rendered the result in Vray using pure brute force pathtracing to avoid flickering from temporally incoherent irradiance caching. The final result is the video at the top of this post!</p>
<p>Here are some still frames from the same simulation. The video was rendered with motion blur, these stills do not have any motion blur.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0105.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0105.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0149.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0149.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0200.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0200.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0236.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0236.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0440.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Feb/dambreak.0440.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2014/01/flip-simulator.html
New PIC/FLIP Simulator
2014-01-15T00:00:00+00:00
2014-01-15T00:00:00+00:00
Yining Karl Li
<p>Over the past month or so, I’ve been writing a brand new fluid simulator from scratch. It started as a project for a course/seminar type thing I’ve been taking with <a href="http://www.cs.cornell.edu/~djames/">Professor Doug James</a>, but I’ve kept working on it for fun since the course ended. I wanted to try out implementing the <a href="http://www.cs.ubc.ca/~rbridson/docs/zhu-siggraph05-sandfluid.pdf">PIC/FLIP method from Zhu and Bridson</a>; in industry, PIC/FLIP has more or less become the de facto standard method for fluid simulation. Houdini and Naiad both use PIC/FLIP implementations as their core fluid solvers, and I’m aware that Double Negative’s in-house simulator is also a PIC/FLIP implementation.</p>
<p>I’ve named my simulator “Ariel”, since I like Disney movies and the name seemed appropriate for a project related to water. Here’s what a “dambreak” type simulation looks like:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/87331839" frameborder="0">PIC/FLIP Simulator Dam Break Test- Ariel View</iframe></div>
<p>That “dambreak” test was run with approximately a million particles, with a 128x64x64 grid for the projection step.</p>
<p>PIC/FLIP stands for Particle-In-Cell/Fluid-Implicit Particles. PIC and FLIP are actually two separate methods that each have certain shortcomings, but when used together in a weighted sum, they produce a very stable fluid solver (my own solver uses approximately a 90% FLIP to 10% PIC ratio). PIC/FLIP is similar to SPH in that it’s fundamentally a particle based method, but instead of attempting to use external forces to maintain fluid volume, PIC/FLIP splats particle velocities onto a grid, calculates a velocity field using a projection step, and then copies the new velocities back onto the particles for each step. This difference means PIC/FLIP doesn’t suffer from the volume conservation problems SPH has. In this sense, PIC/FLIP can almost be thought of as a hybridization of SPH and semi-Lagrangian level-set based methods. From this point forward, I’ll refer to the method as just FLIP for simplicity, even though it’s actually PIC/FLIP.</p>
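<p>The per-particle velocity update that implements the PIC/FLIP blend is quite compact; here is an illustrative sketch (not Ariel’s actual code), where <code>VelocityGrid</code> and <code>interpolateGridVelocity</code> are placeholders for the MAC grid and its trilinear interpolation:</p>
<pre><code>// Sketch of the PIC/FLIP velocity update for a single particle. The PIC part
// replaces the particle velocity with the newly projected grid velocity; the
// FLIP part adds the grid's velocity *change* to the particle's old velocity.
#include <glm/glm.hpp>

glm::vec3 updateParticleVelocity(const glm::vec3& particlePos,
                                 const glm::vec3& oldParticleVel,
                                 const VelocityGrid& gridBeforeProjection,
                                 const VelocityGrid& gridAfterProjection,
                                 float flipRatio) {  // e.g. 0.90f
    glm::vec3 oldGridVel = interpolateGridVelocity(gridBeforeProjection, particlePos);
    glm::vec3 newGridVel = interpolateGridVelocity(gridAfterProjection, particlePos);

    glm::vec3 picVel  = newGridVel;
    glm::vec3 flipVel = oldParticleVel + (newGridVel - oldGridVel);
    return flipRatio * flipVel + (1.0f - flipRatio) * picVel;
}
</code></pre>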
<p>I also wanted to experiment with <a href="http://www.openvdb.org/">OpenVDB</a>, so I built my FLIP solver on top of OpenVDB. OpenVDB is a sparse volumetric data structure library open sourced by Dreamworks Animation, and now integrated into a whole bunch of systems such as Houdini, Arnold, and Renderman. I played with it two years ago during my summer at Dreamworks, but didn’t really get too much experience with it, so I figured this would be a good opportunity to give it a more detailed look.</p>
<p>My simulator uses OpenVDB’s mesh-to-levelset toolkit for constructing the initial fluid volume and solid obstacles, meaning any OBJ meshes can be used to build the starting state of the simulator. For the actual simulation grid, things get a little bit more complicated; I initially started with using OpenVDB to store the grid for the projection step, with the idea that storing the projection grid sparsely should allow for scaling the simulator to really, really large scenes. However, I quickly ran into the ever present memory-speed tradeoff of computer science. I found that while the memory footprint of the simulator stayed very small for large sims, it ran almost ten times slower compared to when the grid was stored using raw floats. The reason is that since OpenVDB under the hood is a B+tree, constant read/write operations against a VDB grid end up being really expensive, especially if the grid is not very sparse. The fact that VDB enforces single-threaded writes due to the need to rebalance the B+tree does not help at all. As a result, I’ve left in a switch that allows my simulator to run in either raw float or VDB mode; VDB mode allows for much larger simulations, but raw float mode allows for faster, multithreaded sims.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0140.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0140.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0218.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0218.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0430.png"><img src="https://blog.yiningkarlli.com/content/images/2014/Jan/longgrid.0430.png" alt="" /></a></p>
<p>Here’s a video of another test scene, this time patterned after a “waterfall” type scenario. This test was done earlier in the development process, so it doesn’t have the wireframe outlines of the solid boundaries:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/88078336" frameborder="0">PIC/FLIP Simulator Waterfall Test- Ariel View</iframe></div>
<p>In the above videos and stills, blue indicates higher density/lower velocity, while white indicates lower density/higher velocity.</p>
<p>Writing the core PIC/FLIP solver actually turned out to be pretty straightforward, and I’m fairly certain that my implementation is correct since it closely matches the result I get out of Houdini’s FLIP solver for a similar scene with similar parameters (although not exactly, since there are bound to be some differences in how I handle certain details, such as slightly jittering particle positions to prevent artifacting between steps). Figuring out a good meshing and rendering pipeline turned out to be more difficult; I’ll write about that in my next post.</p>
https://blog.yiningkarlli.com/2013/12/takua-chair-renders.html
Takua Chair Renders
2013-12-10T00:00:00+00:00
2013-12-10T00:00:00+00:00
Yining Karl Li
<p>A while back, I did some test renders with Takua a0.4 to test out the material system. The test model was a model of an Eames Lounge Chair Wood, and the materials were glossy wood and aluminum. Each render was done with a single large, importance sampled area light and took about two minutes to complete.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Dec/eames_aluminum.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Dec/eames_aluminum.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Dec/eames_wood.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Dec/eames_wood.png" alt="" /></a></p>
<p>These renders were the last tests I did with Takua a0.4 before starting the new version. More on that soon!</p>
https://blog.yiningkarlli.com/2013/11/throwback-holiday-card-2011.html
Throwback- Holiday Card 2011
2013-11-17T00:00:00+00:00
2013-11-17T00:00:00+00:00
Yining Karl Li
<p>Two years ago, I was asked to create <a href="http://cg.cis.upenn.edu/">CG@Penn</a>’s <a href="http://cg.cis.upenn.edu/HappyHolidays2011.htm">2011 Holiday Card</a>. Shortly after finishing that particular project, I started writing a breakdown post but for some reason never finished/posted it. While going through old content for the <a href="http://blog.yiningkarlli.com/2013/11/code-and-visuals-version-4.html">move to Github Pages</a>, I found some of my old unfinished posts, and I’ve decided to finish up some of them and post them over time as sort of a series of throwback posts.</p>
<p>This project is particularly interesting because almost none of the approaches I took two years ago to finish it are approaches I would bother using today. But it’s still interesting to look back on!</p>
<p>Amy and Joe wanted something wintery and nonreligious for the card, since it would be sent to a very wide and diverse audience. They suggested some sort of snowy landscape piece, so I decided to make a snow-covered forest. This particular idea meant I had to figure out three key elements:</p>
<ul>
<li>Conifer trees</li>
<li>Modeling snow ON the trees</li>
<li>Rendering snow</li>
</ul>
<p>Since the holiday card had to be just a single still frame and had to be done in just a few days, I knew right away that I could (and would have to!) cheat heavily with compositing, so I was willing to try more unknown elements than I normally would throw into a single project. Also, since the shot I had in mind would be a wide, far shot, I knew that I could get away with less up-close detail for the trees.</p>
<p>I started by creating a handful of different base conifer tree models in OnyxTree and throwing them directly into Maya/Vray (this was before I had even started working on Takua Render) just to see how they would look. Normally models directly out of OnyxTree need some hand-sculpting and tweaking to add detail for up-close shots, but here I figured if they looked good enough, I could skip those steps. The result looked okay enough to move on:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/basic_trees.jpg"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/basic_trees.jpg" alt="" /></a></p>
<p>The textures for the bark and leaves were super simple. To make the bark texture’s diffuse layer, I pulled a photograph of bark off of Google, modified it to tile in Photoshop, and adjusted the contrast and levels until it was the color I wanted. The displacement layer was simply the diffuse layer converted to black and white and with contrast and brightness adjusted. Normally this method won’t work well for up close shots, but again, since I knew the shot would be far away, I could get away with some cheating. Here’s a crop from the bark textures:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/bark.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/bark.png" alt="" /></a></p>
<p>The pine needles were also super cheatey. I pulled a photo out of one of my reference libraries, dropped an opacity mask on top, and that was all for the diffuse color. Everything else was hacked in the leaf material’s shader; since the tree would be far away, I could get away with basic transparency instead of true subsurface scattering. The diffuse map with opacity flattened to black looks like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/pineleaves.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/pineleaves.png" alt="" /></a></p>
<p>With the trees roughed in, the next problem to tackle was getting snow onto the trees. Today, I would immediately spin up Houdini to create this effect, but back then, I didn’t have a Houdini license and hadn’t played with Houdini enough to realize how quickly it could be done. Not knowing better back then, I used 3dsmax and a plugin called <a href="http://www.zwischendrin.com/en/detail/261">Snowflow</a> (I used the demo version since this project was a one-off). To speed up the process, I used a simplified, decimated version of the tree mesh for Snowflow. Any inaccuracies between the resultant snow layer and the full tree mesh were acceptable, since they would look just like branches and leaves poking through the snow:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/snowflow.jpg"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/snowflow.jpg" alt="" /></a></p>
<p>I tried a couple of different variations on snow thickness, which looked decent enough to move on with:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/snowtest.jpg"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/snowtest.jpg" alt="" /></a></p>
<p>The next step was a fast snow material that would look reasonably okay from a distance, and render quickly. I wasn’t sure if the snow should have a more powdery, almost diffuse look, or if it should have a more refractive, frozen, icy look. I wound up trying both and going with a 50-50 blend of the two:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/snowmaterialtest.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/snowmaterialtest.png" alt="From left to right: refractive frozen ice, powdery diffuse, 50-50 blend" /></a></p>
<p>The next step was to compose a shot, make a very quick, simple lighting setup, and do some test renders. After some iterating, I settled for this render as a base for comp work:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/test4.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/test4.png" alt="" /></a></p>
<p>The base render is very blueish since the lighting setup was a simple, grey-blueish dome light over the whole scene. The shadows are blotchy since I turned Vray’s irradiance cache settings all the way down for faster rendertimes; I decided that I would rather deal with the blotchy shadows in post and have a shot at making the deadline rather than wait for a very long rendertime. I wound up going with the thinner snow at the time since I wanted the trees to be more recognizable as trees, but in retrospect, that choice was probably a mistake.</p>
<p>The final step was some basic compositing. In After Effects, I applied post-processed DOF using a z-depth layer and Frischluft, color corrected the image, cranked up the exposure, and added vignetting to get the final result:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Nov/card.jpg"><img src="https://blog.yiningkarlli.com/content/images/2013/Nov/card.jpg" alt="" /></a></p>
<p>Looking back on this project two years later, I don’t think the final result looks really great. The image looks okay for two days of rushed work, but there is enormous room for improvement. If I could go back and change one thing, I would have chosen to use the much heavier snow cover version of the trees for the final composition. Also, today I would approach this project very very differently; instead of ping-ponging between multiple programs for each component, I would favor an almost pure-Houdini pipeline. The trees could be modeled as L-systems in Houdini, perhaps with some base work done in Maya. The snow could absolutely be simmed in Houdini. For rendering and lighting, I would use either my own Takua Render or some other fast physically based renderer (Octane, or perhaps Renderman 18’s iterative pathtracing mode) to iterate extremely quickly without having to compromise on quality.</p>
<p>So that’s the throwback breakdown of the CG@Penn Holiday 2011 card! I learned a lot from this project, and looking back and comparing how I worked two years ago to how I work today is always a good thing to do.</p>
https://blog.yiningkarlli.com/2013/11/code-and-visuals-version-4.html
Code and Visuals Version 4.0
2013-11-16T00:00:00+00:00
2013-11-16T00:00:00+00:00
Yining Karl Li
<p>I’d like to introduce the newest version of my computer graphics blog, Code and Visuals! On the surface, everything has been redesigned with a new layer of polish; everywhere, the site is now simpler, cleaner, and the layout is now fully responsive. Under the hood, I’ve moved from Blogger to <a href="http://jekyllrb.com/">Jekyll</a>, hosted on <a href="http://pages.github.com/">Github Pages</a>.</p>
<p>As part of the move to Jekyll, I’ve opted to clean up a lot of old posts as well. This blog started as some combination of a devblog, doodleblog, and photoblog, but quickly evolved into a pure computer graphics blog. In the interest of keeping historical context intact, I’ve ported over most of my older non-computer graphics posts, with minor edits and touchups here and there. A handful of posts I didn’t really like I’ve chosen to leave behind, but they can still be found on the <a href="http://yiningkarlli.blogspot.com">old Blogger-based version of this blog</a>.</p>
<p>The Atom feed URL for Code and Visuals is still the same as before, so that should transition over smoothly.</p>
<p>Why the move from Blogger to Jekyll/Github Pages? Here are the main reasons:</p>
<ul>
<li>Markdown/Github support. Blogger’s posting interface is all kinds of terrible. With Jekyll/Github Pages, writing a new post is super nice: simply write a new post in a Markdown file, push to Github, and done. I love Markdown and I love Github, so it’s a great combo for me.</li>
<li>Significantly faster site. Previous versions of this blog have always been a bit pokey speed-wise, since they relied on dynamic page generators (originally my hand-rolled PHP/MySQL CMS, then Wordpress, and then Blogger). However, Jekyll is a static page generator; the site is converted from Markdown and template code into static HTML/CSS once at generation time, and then simply served as pure HTML/CSS.</li>
<li>Easier templating system. Jekyll’s templating system is built on <a href="http://liquidmarkup.org/">Liquid</a>, which made building this new theme really fast and easy.</li>
<li>Transparency. This entire blog’s source is now <a href="https://github.com/betajippity/betajippity.github.io">available on Github</a>, and the theme is separately <a href="https://github.com/betajippity/codeandvisuals-theme">available here</a>.</li>
</ul>
<p>I’ve been looking to replace Blogger for some time now. Before trying out Jekyll, I was tinkering with <a href="https://ghost.org/">Ghost</a>, and even fully built out a working version of Code and Visuals on a self-hosted Ghost instance. In fact, this current theme was originally built for Ghost and then ported to Jekyll after I decided to use Jekyll (both the Ghost and Jekyll versions of this theme are in the Github repo). However, Ghost as a platform is still extremely new and isn’t quite ready for primetime yet; while Ghost’s Markdown support and Node.js underpinnings are nice, Ghost is still missing crucial features like the ability to have an archive page. Plus, at the end of the day, Jekyll is just plain simpler; Ghost is still a CMS, Jekyll is just a collection of text files.</p>
<p>I intend to stay on a Jekyll/Github Pages based solution for a long time; I am very very happy with this system. Over time, I’ll be moving my couple of other, non-computer graphics blogs over to Jekyll as well. I’m still not sure if my main website needs to move to Jekyll though, since it already is coded up as a series of static pages and requires a slightly more complex layout on certain pages.</p>
<p>Over the past few months I haven’t posted much, since over the summer almost all of my Pixar related work was under heavy NDA (and still is and will be for the foreseeable future, with the exception of <a href="http://blog.yiningkarlli.com/2013/07/pixar-optix-lighting-preview-demo.html">our SIGGRAPH demo</a>), and a good deal of my work at Cornell’s Program for Computer Graphics is under wraps as well while we work towards paper submissions. However, I have some new personal projects I’ll write up soon, in addition to some older projects that I never posted about.</p>
<p>With that, welcome to Code and Visuals Version 4.0!</p>
https://blog.yiningkarlli.com/2013/07/pixar-optix-lighting-preview-demo.html
Pixar Optix Lighting Preview Demo
2013-07-27T00:00:00+00:00
2013-07-27T00:00:00+00:00
Yining Karl Li
<p>For the past two months or so, I’ve been working at Pixar Animation Studios as a summer intern with <a href="http://graphics.pixar.com/research/">Pixar’s Research Group</a>. The project I’m on for the summer is a realtime, GPU based lighting preview tool implemented on top of <a href="http://www.nvidia.com/object/optix.html">NVIDIA’s OptiX framework</a>, entirely inside of <a href="http://www.thefoundry.co.uk/products/katana/">The Foundry’s Katana</a>. I’m incredibly pleased to be able to say that our project was demoed at SIGGRAPH 2013 at the NVIDIA booth, and that NVIDIA has a recording of the entire demo online!</p>
<p>The demo was done by our project’s lead, Danny Nahmias, and got an overwhelmingly positive reception. Check out the recording here:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/71150839" frameborder="0">Using NVIDIA® OptiX™ for Lighting Preview in a Katana-Based Production Pipeline</iframe></div>
<p>FXGuide also did a podcast about our demo! Check it out <a href="http://www.fxguide.com/fxpodcasts/fxpodcast-258-siggraph-2013-final-report/">here</a>.</p>
<p>I’m just an intern, and the vast majority of the cool work being done on this project is from Danny Nahmias, Phillip Rideout, Mark Meyer, and others, but I’m very very proud, and consider myself extraordinarily lucky, to be part of this team!</p>
<p>Edit: I’ve replaced the original Ustream embed with a Vimeo mirror since the Ustream embed was crashing Chrome for some people. The original Ustream link is <a href="http://ustream.tv/recorded/36266865">here</a>.</p>
https://blog.yiningkarlli.com/2013/04/giant-mesh-test.html
Giant Mesh Test
2013-04-29T00:00:00+00:00
2013-04-29T00:00:00+00:00
Yining Karl Li
<p>My friend/schoolmate <a href="https://vimeo.com/user10815579">Zia Zhu</a> is an amazing modeler, and recently she was kind enough to lend me a ZBrush sculpt she did for use as a high-poly test model for Takua Render. The model is a sculpture of Venus, and is made up of slightly over a million quads, or about two million triangles once triangulated inside of Takua Render.</p>
<p>Here are some nice, pretty test renders I did. As usual, everything was rendered with Takua Render, and there has been absolutely zero post-processing:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/venus1.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/venus1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/venus21.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/venus21.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/venus31.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/venus31.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/venus41.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/venus41.png" alt="" /></a></p>
<p>Each one of these renders was lit using a single, large area light (with importance sampled direct lighting, of course). The material on the model is just standard lambert diffuse white; I’ll do another set of test renders once I’ve finished rewriting my subsurface scatter system. Each render was set to 2800 samples per pixel and took about 20 minutes to render on a single GTX480. In other words, not spectacular, but not bad either.</p>
<p>The key takeaway from this series of tests was that Takua’s performance still suffers significantly when datasets become extremely large; while the render took about 20 minutes, setup time (including memory transfer, etc) took nearly 5 minutes, which I’m not happy about. I’ll be taking some time to rework Takua’s memory manager.</p>
<p>On a happier note, KD-tree construction performed well! The KD-tree for the Venus sculpt was built out to a depth of 30 and took less than a second to build.</p>
<p>Here’s a bonus image of what the sculpt looks like in the GL preview mode:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/venus_gl.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/venus_gl.png" alt="" /></a></p>
<p>Again, all credit for the actual model goes to the incredibly talented <a href="https://vimeo.com/user10815579">Zia Zhu</a>!</p>
https://blog.yiningkarlli.com/2013/04/importance-sampled-direct-lighting.html
Importance Sampled Direct Lighting
2013-04-26T00:00:00+00:00
2013-04-26T00:00:00+00:00
Yining Karl Li
<p>Takua Render now has correct, fully working importance sampled direct lighting, supported for any type of light geometry! More importantly, the importance sampled direct lighting system is now fully integrated with the overall GI pathtracing integrator.</p>
<p>A naive, standard pathtracing implementation shoots out rays and accumulates colors until a light source is reached, upon which the total accumulated color is multiplied by the emittance of the light source and added to the framebuffer. As a result, even the simplest pathtracing integrator does account for both the indirect and direct illumination within a scene, but since sampling light sources is entirely dependent on the BRDF at each point, correctly sampling the direct illumination component in the scene is extremely inefficient. The canonical example of this inefficiency is a scene with a single very small, very intense, very far away light source. Since the probability of hitting such a small light source is so small, convergence is extremely slow.</p>
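<p>As a rough sketch of what that naive scheme looks like in code (hypothetical type and function names, not Takua’s actual code, with GLM assumed for vector math), the loop just multiplies colors along the path and only ever adds energy if the path happens to stumble into an emitter:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>glm::vec3 naivePathTrace(Ray ray, const Scene&amp; scene, Rng&amp; rng, int maxDepth) {
    glm::vec3 throughput(1.0f);
    for (int depth = 0; depth &lt; maxDepth; ++depth) {
        Hit hit;
        if (!scene.intersect(ray, hit)) {
            break;                                    // ray escaped the scene
        }
        if (hit.isEmitter) {
            return throughput * hit.emittance;        // the only place light enters the estimate
        }
        throughput *= hit.brdfCosOverPdf(ray, rng);   // accumulate color for this bounce
        ray = hit.sampleBrdfRay(ray, rng);            // pure BRDF sampling, no light sampling
    }
    return glm::vec3(0.0f);                           // the path never found a light source
}
</code></pre></div></div>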
<p>To demonstrate/test this property, I made a simple test scene with an extremely bright sun-like object illuminating the scene from a huge distance away:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/directtestscene1.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/directtestscene1.png" alt="" /></a></p>
<p>Using naive pathtracing without importance sampled direct lighting produces an image like this after 16 samples per pixel:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect16.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect16.png" alt="" /></a></p>
<p>Mathematically, the image is correct, but is effectively useless since so few contributing ray paths have actually been found. Even after 5120 samples, the image is still pretty useless:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect5120.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect5120.png" alt="" /></a></p>
<p>A much better approach is to accumulate colors just like before, but not bother waiting until a light source is hit by the ray path through pure BRDF sampling before multiplying in emittance. Instead, at each ray bounce, a new indirect ray is generated via the BRDF like before, AND a new direct ray is generated towards a randomly chosen light source via multiple importance sampling, with the accumulated color multiplied by the resultant emittance. Multiple importance sampled direct lighting works by balancing two different sampling strategies, sampling by light source and sampling by BRDF, and then weighting the two results with some sort of heuristic (such as the power heuristic described in <a href="http://graphics.stanford.edu/papers/veach_thesis/">Eric Veach’s thesis</a>).</p>
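<p>The power heuristic itself is tiny; a minimal sketch with beta = 2 (following the form in Veach’s thesis, where nf/fPdf and ng/gPdf are the sample counts and pdfs of the two strategies being combined) looks something like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Power heuristic with beta = 2: weight for strategy f when combined with strategy g.
float powerHeuristic(int nf, float fPdf, int ng, float gPdf) {
    float f = nf * fPdf;
    float g = ng * gPdf;
    return (f * f) / (f * f + g * g);
}
</code></pre></div></div>
<p>In use, the contribution from a light sample gets weighted by something like powerHeuristic(1, lightPdf, 1, brdfPdf), and the contribution from a BRDF sample by the mirrored call with the arguments swapped.</p>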
<p>Sampling by light source is the trickier part of this technique. The idea is to generate a ray that we know will hit a light source, and then weight the contribution from that ray by the probability of generating that ray to remove the bias introduced by artificially choosing a ray direction. There are a few good ways to do this: one way is to generate an evenly distributed random point on a light source as the target for the direct lighting ray, and then weight the result using the probability density function with respect to surface area, transformed into a PDF with respect to solid angle.</p>
<p>Takua Render at the moment uses a slightly different approach, for the sake of simplicity. The approach I’m using is similar to the one described in my <a href="http://blog.yiningkarlli.com/2013/04/working-towards-importance-sampled-direct-lighting.html">earlier post on the topic</a>, but with a disk instead of a sphere. The approach works like this:</p>
<ol>
<li>Figure out a bounding sphere for the light source</li>
<li>Construct a ray from the point to be lit to the center of the bounding sphere. Let’s call the direction of this ray D.</li>
<li>Find a great circle on the bounding sphere with a normal N, such that N is lined up exactly with D.</li>
<li>Move the great circle along its normal towards the point to be lit by a distance of exactly the radius of the bounding sphere</li>
<li>Treat the great circle as a disk and generate uniformly distributed random points on the disk to shoot rays towards.</li>
<li>Weight light samples by the projected solid angle of the disk on the point being lit.</li>
</ol>
<p>Alternatively, the weighting can simply be based on the normal solid angle instead of the projected solid angle, since the random points are chosen with a cosine weighted distribution.</p>
<p>The nice thing about this approach is that it allows for importance sampled direct lighting even for shapes that are difficult to sample random points on; effectively, the problem of sampling light sources is abstracted away, at the cost of a slight loss in efficiency since some percentage of rays fired at the disk have to miss the light in order for the weighting to remain unbiased.</p>
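<p>Here’s a small sketch of what the disk-based procedure above can look like in code. Everything here is hypothetical and simplified (GLM for vector math, a made-up Rng type, and an assumed concentricSampleDisk helper that returns a uniform point on the unit disk); the returned pdf is with respect to area on the disk, which the integrator then converts into a solid-angle pdf with the usual squared-distance over cosine factor before weighting the sample:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct DiskLightSample {
    glm::vec3 direction;    // direction from the shaded point towards the sampled disk point
    glm::vec3 targetPoint;  // the sampled point on the disk
    float areaPdf;          // 1 / (pi * radius^2), uniform over the disk
};

DiskLightSample sampleBoundingDisk(const glm::vec3&amp; shadePoint,
                                   const glm::vec3&amp; sphereCenter,
                                   float sphereRadius, Rng&amp; rng) {
    // Step 2: direction from the shaded point to the bounding sphere's center
    glm::vec3 D = glm::normalize(sphereCenter - shadePoint);
    // Steps 3-4: great circle with normal D, slid towards the shaded point by the radius
    glm::vec3 diskCenter = sphereCenter - D * sphereRadius;
    // Build an orthonormal basis around D so points can be placed on the disk
    glm::vec3 up = (std::abs(D.x) &lt; 0.9f) ? glm::vec3(1, 0, 0) : glm::vec3(0, 1, 0);
    glm::vec3 tangent = glm::normalize(glm::cross(up, D));
    glm::vec3 bitangent = glm::cross(D, tangent);
    // Step 5: uniform random point on the disk
    glm::vec2 d = concentricSampleDisk(rng.next2D()) * sphereRadius;
    glm::vec3 target = diskCenter + d.x * tangent + d.y * bitangent;
    // Step 6: pdf with respect to disk area; converted to solid angle by the caller
    float areaPdf = 1.0f / (3.14159265f * sphereRadius * sphereRadius);
    return { glm::normalize(target - shadePoint), target, areaPdf };
}
</code></pre></div></div>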
<p>I also started work on the surface area PDF to solid angle PDF method, so I might post about that later too. But for now, everything works! With importance sampled direct lighting, the scene from above is actually renderable in a reasonable amount of time. With just 16 samples per pixel, Takua Render now can generate this image:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/direct18.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/direct18.png" alt="" /></a></p>
<p>…and after 5120 samples per pixel, a perfectly clean render:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/direct5120.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/direct5120.png" alt="" /></a></p>
<p>The other cool thing about this scene is that most of the scene is actually being lit through pure indirect illumination. With only direct illumination and no GI, the render looks like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/directonly.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/directonly.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2013/04/quick-update-on-future-plans.html
Quick Update on Future Plans
2013-04-20T02:00:00+00:00
2013-04-20T02:00:00+00:00
Yining Karl Li
<p>Just a super quick update on my future plans:</p>
<p>Next year, starting in September, I’ll be joining <a href="http://www.graphics.cornell.edu/people/director.html">Dr. Don Greenberg</a> and <a href="http://www.graphics.cornell.edu/~kiderj/index.htm">Dr. Joseph T. Kider</a> and others at <a href="http://www.graphics.cornell.edu/">Cornell’s Program for Computer Graphics</a>. I’ll be pursuing a Master of Science in Computer Graphics there, and will most likely be working on something involving rendering (which I suppose is not surprising).</p>
<p>Between the end of school and September, I’ll be spending the summer at <a href="http://www.pixar.com/">Pixar Animation Studios</a> once again, this time as part of <a href="http://graphics.pixar.com/research/people.html">Pixar’s Research Group</a>.</p>
<p>Obviously I’m quite excited by all of this!</p>
<p>Now, back to working on my renderer.</p>
https://blog.yiningkarlli.com/2013/04/working-towards-importance-sampled-direct-lighting.html
Working Towards Importance Sampled Direct Lighting
2013-04-20T01:00:00+00:00
2013-04-20T01:00:00+00:00
Yining Karl Li
<p>I haven’t made a post in a few weeks now since I’ve been working on a number of different things, none of which are quite done yet. Since it’s been a few weeks, here’s a writeup of one of the things I’m working on and where I am with that.</p>
<p>One of the major features I’ve been working towards for the past few weeks is full multiple importance sampling, which will serve a couple of purposes. First, importance sampling the direct lighting contribution in the image should allow for significantly higher convergence rates for the same amount of compute, allowing for much smoother renders for the same render time. Second, MIS will serve as groundwork for future bidirectional integration schemes, such as Metropolis transport and photon mapping. I’ve been working with my friend <a href="http://www.linkedin.com/pub/xing-du/3a/626/a23">Xing Du</a> on understanding the math behind MIS and figuring out how exactly the math should translate into implementation.</p>
<p>So first off, some ground truth tests. All ground truth tests are rendered using brute force pathtracing with hundreds of thousands of iterations per pixel. Here is the test scene I’ve been using lately, with all surfaces reduced to lambert diffuse for the sake of simplification:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/groundtruth.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/groundtruth.png" alt="Ground truth global illumination render, representing 512000 samples per pixel. All lights sampled by BRDF only." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_montecarlo.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_montecarlo.png" alt="Ground truth for direct lighting contribution only, with all lights sampled by BRDF only." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect_only.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/indirect_only.png" alt="Ground truth for indirect lighting contribution only." /></a></p>
<p>The motivation behind importance sampling lights by directly sampling objects with emissive materials comes from the difficulty of finding useful samples from the BRDF only; for example, for the lambert diffuse case, since sampling from only the BRDF produces outgoing rays in totally random (or, slightly better, cosine weighted random) directions, the probability of any ray coming from a diffuse surface actually hitting a light is relatively low, meaning that the contribution of each sample is likely to be low as well. As a result, finding the direct lighting contribution through BRDF sampling alone is extremely inefficient.</p>
<p>For example, here’s the direct lighting contribution only, after 64 samples per pixel with only BRDF sampling:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_test.2.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_test.2.png" alt="Direct lighting contribution only, all lights sampled by BRDF only, 64 samples per pixel." /></a></p>
<p>Instead of sampling direct lighting contribution by shooting a ray off in a random direction and hoping that maybe it will hit a light, a much better strategy would be to… shoot the ray towards the light source. This way, the contribution from the sample is guaranteed to be useful. There’s one hitch though: the weighting for a sample chosen using the BRDF is relatively simple to determine. For example, in the lambert diffuse case, since the probability of any particular random sample within a hemisphere is the same as any other sample, the weighting per sample is even with all other samples. Once we selectively choose the ray direction specifically towards the light though, the weighting per sample is no longer even. Instead, we must weight each sample by the probability of a ray going in that particular direction towards the light, which we can calculate by the solid angle subtended by the light source divided by the total solid angle of the hemisphere.</p>
<p>So, a trivial example case would be if a point was being lit by a large area light subtending exactly half of the hemisphere visible from the point. In this case, the area light subtends Pi steradians, making its total weight Pi/(2*Pi), or one half.</p>
<p>The tricky part of calculating the solid angle weighting is in calculating the fractional unit-spherical surface area projection for non-uniform light sources. In other words, figuring out what solid angle a sphere subtends is easy, but figuring out what solid angle a Stanford Bunny subtends is…. less easy.</p>
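<p>For reference, the easy sphere case boils down to the spherical cap formula: a sphere of radius r whose center is a distance d away (with d greater than r) subtends a solid angle of 2*Pi*(1 - cos(theta)), where sin(theta) = r/d. A quick sketch:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;

// Solid angle subtended by a sphere of radius r at distance d (assumes d > r).
float sphereSolidAngle(float r, float d) {
    float sinTheta = r / d;
    float cosTheta = std::sqrt(1.0f - sinTheta * sinTheta);
    return 2.0f * 3.14159265f * (1.0f - cosTheta);
}
</code></pre></div></div>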
<p>The initial approach that Xing and I arrived at was to break complex meshes down into triangles and treat each triangle as a separate light, since calculating the solid angle subtended by a triangle is once again easy. However, treating a mesh as a cloud of triangle area lights is potentially very expensive; computing the direct lighting contribution from every light in the scene for each iteration becomes potentially untenable, meaning that each iteration of the render will have to randomly select a small number of lights to directly sample.</p>
<p>As a result, we brainstormed some ideas for potential shortcuts. One shortcut idea we came up with was that instead of choosing an evenly distributed point on the surface of the light to serve as the target for our ray, we could instead shoot a ray at the bounding sphere for the target light and weight the resulting sample by the solid angle subtended not by the light itself, but by the bounding sphere. Our thinking was that this approach would dramatically simplify the work of calculating the solid angle weighting, while still maintaining mathematical correctness and unbiasedness since the number of rays fired at the bounding sphere that will miss the light should exactly offset the overweighting produced by using the bounding sphere’s subtended solid angle.</p>
<p>I went ahead and tried out this idea, and produced the following image:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_test.3.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Apr/direct_test.3.png" alt="Direct lighting contribution only, all lights sampled by direct sampling weighted by subtended solid angle, 64 samples per pixel." /></a></p>
<p>First off, for the most part, it works! The resulting direct illumination matches the ground truth and the BRDF-sampling render, but is significantly more converged than the BRDF-sampling render for the same number of samples. BUT, there is a critical flaw: note the black circle around the light source on the ceiling. That black circle happens to fall exactly within the bounding sphere for the light source, and results from a very simple mathematical fact: calculating the solid angle subtended by the bounding sphere for a point INSIDE of the bounding sphere is undefined. In other words, this shortcut approach will fail for any points that are too close to a light source.</p>
<p>One possible workaround I tried was to have any points inside of a light’s bounding sphere fall back to pure BRDF sampling, but this approach is also undesirable, as a highly visible discontinuity develops between the differently sampled areas due to vastly different convergence rates.</p>
<p>So, while the overall solid angle weighting approach checks out, our shortcut does not. I’m now working on implementing the first approach described above, which should produce a correct result, and will post in the next few days.</p>
https://blog.yiningkarlli.com/2013/03/stratified-versus-uniform-sampling.html
Stratified versus Uniform Sampling
2013-03-06T00:00:00+00:00
2013-03-06T00:00:00+00:00
Yining Karl Li
<p>As part of Takua Render’s new pathtracing core, I’ve implemented a system allowing for multiple sampling methods instead of just uniform sampling. The first new sampling method I’ve added in addition to uniform sampling is stratified sampling. Basically, in stratified sampling, instead of spreading samples per iteration across the entire probability region, the probability region is first divided into a number of equal sized, non-overlapping subregions, and then for each iteration, a sample is drawn with uniform probability from within a single subregion, called a stratum. The result of stratified sampling is that samples are guaranteed to be more evenly spread across the entire probability domain instead of clustered within a single area, resulting in less visible noise for the same number of samples compared to uniform sampling. At the same time, since stratified sampling still maintains a random distribution within each stratum, the aliasing problems associated with a totally even sample distribution are still avoided.</p>
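<p>As a small illustrative sketch (not Takua’s actual sampler code; GLM is assumed for the vec2 type), generating a full set of stratified samples over the unit square looks something like this, with each stratum receiving exactly one uniformly jittered sample:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;random&gt;
#include &lt;vector&gt;

// strataPerAxis^2 samples over the unit square, one per stratum; GLM assumed for vec2.
std::vector&lt;glm::vec2&gt; stratifiedSamples2D(int strataPerAxis, std::mt19937&amp; rng) {
    std::uniform_real_distribution&lt;float&gt; uniform(0.0f, 1.0f);
    std::vector&lt;glm::vec2&gt; samples;
    float invN = 1.0f / float(strataPerAxis);
    for (int y = 0; y &lt; strataPerAxis; ++y) {
        for (int x = 0; x &lt; strataPerAxis; ++x) {
            // one uniformly random sample jittered within the (x, y) stratum
            samples.push_back(glm::vec2((float(x) + uniform(rng)) * invN,
                                        (float(y) + uniform(rng)) * invN));
        }
    }
    return samples;
}
</code></pre></div></div>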
<p>Here’s a video showing a scene rendered in Takua Render with uniform and then stratified sampling. The video also shows a side-by-side comparison in its last third.</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/61209575" frameborder="0">Takua Render Sampler Methods Comparison</iframe></div>
<p>In the renders in the above video, stratified sampling is being used to choose new ray directions from diffuse surface bounces; instead of choosing a random point over the entire cosine-weighted hemisphere at an intersection point, the renderer first chooses a stratum subtending the same solid angle as every other stratum, and then chooses a random sample within that solid angle. The stratum is chosen sequentially for primary bounces, and then chosen randomly for all secondary bounces to maintain unbiased sampling over the whole render. As a result of the sequential stratum selection for primary bounces, images rendered in Takua Render will not converge to an unbiased solution until N iterations have elapsed, where N is the number of strata the probability region is divided into. The number of strata can be set by the user as a value in the scene description, which is then squared to get the total strata count. So, if a user specifies a strata level of 16, then the probability region will be divided into 256 strata and an unbiased result will not be reached until 256 or more samples per pixel have been taken.</p>
<p>Here’s the Lamborghini model from last post at 256 samples per pixel with stratified (256 strata) and uniform sampling, to demonstrate how much less perceptible noise there is with the stratified sampler. From a distance, the uniform sampler renders may seem slightly darker side by side due to the higher percentage of noise, but if you compare them using the lightbox, you can see that the lighting and brightness is the same.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_strat.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_strat.png" alt="Stratified sampling, 256 strata, 256 samples per pixel" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/uniform.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/uniform.png" alt="Uniform sampling, 256 samples per pixel" /></a></p>
<p>…and up-close crops with 400% zoom:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_strat_crop.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_strat_crop.png" alt="Stratified sampling, 256 strata, 256 samples per pixel, 400% crop" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/uniform_zoom.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/uniform_zoom.png" alt="Uniform sampling, 256 samples per pixel, 400% crop" /></a></p>
<p>At some point soon I will also be implementing Halton sequence sampling and [0,2]-sequence sampling, but for the time being, stratified sampling is already providing a huge visual boost over uniform! In fact, I have a small secret to confess: all of the renders in the last post were rendered with the stratified sampler!</p>
https://blog.yiningkarlli.com/2013/03/first-progress-on-new-pathtracing-core.html
First Progress on New Pathtracing Core
2013-03-04T00:00:00+00:00
2013-03-04T00:00:00+00:00
Yining Karl Li
<p>I’ve started work on a completely new pathtracing core to replace the one used in Rev 2. The purpose of totally rewriting the entire pathtracing integrator and brdf systems is to produce something much more modular and robust; as much as possible, I am now decoupling brdf and new ray direction calculation from the actual pathtracing loop.</p>
<p>I’m still in the earliest stages of this rewrite, but I have some test images! Each of the following images was rendered out to somewhere around 25000 samples per pixel (a lot!), at about 5/6 samples per pixel per second. I let the renders run without a hard ending point and terminated them after I walked away for a while and came back, hence the inexact but enormous samples per pixel counts. Each scene was lit with my standard studio-styled lighting setup and in addition to the showcased model, uses a smooth backdrop that consists of about 10000 triangles.</p>
<p>Approximately 100000 face Stanford Dragon:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/dragon.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/dragon.png" alt="" /></a></p>
<p>Approximately 150000 face DeLorean model:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/deloreon.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/deloreon.png" alt="" /></a></p>
<p>Approximately 250000 face Lamborghini Aventador model:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_back.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_back.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_front.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/lambo_front.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2013/03/short-stack-kd-tree-traversal.html
Short-stack KD-Tree Traversal
2013-03-01T00:00:00+00:00
2013-03-01T00:00:00+00:00
Yining Karl Li
<p>In my last post, I talked about implementing history flag based kd-tree traversal. While the history flag based stackless traverse worked perfectly fine in terms of traversing the tree and finding the nearest intersection, I discovered over the past week that its performance is… less than thrilling. Unfortunately, the history flag system results in a huge amount of redundant node visits, since the entire system is state based and therefore necessarily needs to visit every node in a branch of the tree both to move down and up the branch.</p>
<p>So instead, I decided to try out a short-stack based approach. My initial concern with short-stack based approaches was the daunting memory requirement of keeping a short stack for a few hundred threads; however, I realized that realistically, a short stack never needs to be any larger than the maximum depth of the kd-tree being traversed. Since I haven’t yet had a need to test a tree with a depth beyond 100, the memory usage required for keeping short stacks is reasonably predictable and manageable; as a precaution, however, I’ve also decided to allow for the system to fall back to a stackless traverse in the case that a tree’s depth causes short stack memory usage to become unreasonable.</p>
<p>The actual short-stack traverse I’m using is a fairly standard while-while traverse based on the <a href="http://kunzhou.net/2008/kdtree.pdf">2008 Kun Zhou realtime kd-tree paper</a> and the <a href="http://graphics.stanford.edu/papers/i3dkdtree/">2007 Daniel Horn GPU kd-tree paper</a>. I’ve added one small addition though: in addition to keeping a short stack for traversing the kd-tree, I’ve also added an optional second short stack that tracks the last N intersection test objects. The reason for keeping this second short stack is that kd-trees allow for objects to be split across multiple nodes; by tracking which objects we have already encountered, we can safely detect and skip objects that have already been tested. The object tracking short stack is meant to be rather small (say, no more than 10 to 15 objects at a time), and simply loops back and overwrites the oldest values in the stack when it overflows.</p>
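<p>For the curious, here’s a heavily simplified sketch of the while-while, short-stack traversal shape (hypothetical node, ray, and hit types, not Takua’s actual data layout, and with the object-tracking second stack omitted for brevity): the inner while descends towards a leaf, pushing far children onto the short stack along the way, and the outer while pops the stack whenever a leaf fails to produce a usable hit:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct StackEntry { const KdNode* node; float tMin; float tMax; };

bool traverse(const KdNode* root, const Ray&amp; ray, float tMin, float tMax, Hit&amp; hit) {
    StackEntry stack[64];               // short stack; bounded by the maximum tree depth
    int stackPtr = 0;
    const KdNode* node = root;
    bool found = false;
    while (true) {                      // outer while: restart from the short stack
        while (!node->isLeaf) {         // inner while: descend until a leaf is reached
            int axis = node->splitAxis;
            float t = (node->splitPos - ray.origin[axis]) * ray.invDir[axis];
            bool belowFirst = (ray.origin[axis] &lt; node->splitPos) ||
                              (ray.origin[axis] == node->splitPos &amp;&amp; ray.dir[axis] &lt;= 0.0f);
            const KdNode* nearChild = belowFirst ? node->left : node->right;
            const KdNode* farChild  = belowFirst ? node->right : node->left;
            if (t > tMax || t &lt;= 0.0f) {
                node = nearChild;       // split is behind or past the segment: near side only
            } else if (t &lt; tMin) {
                node = farChild;        // split is before the segment: far side only
            } else {
                stack[stackPtr++] = { farChild, t, tMax };   // push far child, descend near
                node = nearChild;
                tMax = t;
            }
        }
        found |= intersectLeafPrimitives(node, ray, tMin, tMax, hit);
        if (found &amp;&amp; hit.t &lt;= tMax) return true;   // no stacked node can hold a closer hit
        if (stackPtr == 0) return found;            // nothing left to visit
        StackEntry e = stack[--stackPtr];           // pop the next subtree off the short stack
        node = e.node; tMin = e.tMin; tMax = e.tMax;
    }
}
</code></pre></div></div>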
<p>The new while-while traversal is significantly faster than the history flag approach, on the order of a 10x or better performance increase in some cases.</p>
<p>In order to validate that the entire kd traversal system works, I did a quick and dirty port of the old Rev 2 pathtracing integrator to run on top of the new Rev 3 framework. The following test images contain about 20000 faces and objects, and clocked in at about 6 samples per pixel per second with a tree depth of 15. Each image was rendered to 1024 samples per pixel:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/greencow.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/greencow.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Mar/glasscow.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Mar/glasscow.png" alt="" /></a></p>
<p>I also attempted to render these images without any kd-tree acceleration as a control. Without kd-tree acceleration, each sample per pixel took upwards of 5 seconds, and I wound up terminating the renders before they got even close to completion.</p>
<p>The use of my old Rev 2 pathtracing core is purely temporary, however. The next task I’ll be tackling is a total rewrite of the entire pathtracing system and associated lighting and brdf evaluation systems. Previously, these systems have basically been monolithic blocks of code, but with this rewrite, I want to create a more modular, robust system that can recycle as much code as possible between GPU and CPU implementations, the GL debugger, and eventually other integration methods, such as photon mapping.</p>
https://blog.yiningkarlli.com/2013/02/stackless-kd-tree-traversal.html
Stackless KD-Tree Traversal
2013-02-22T00:00:00+00:00
2013-02-22T00:00:00+00:00
Yining Karl Li
<p>I have a working, reasonably optimized, speedy GPU stackless kd-tree traversal implementation! Over the past few days, I implemented the history flag-esque approach I outlined in <a href="http://blog.yiningkarlli.com/2012/09/thoughts-on-stackless-kd-tree-traversal.html">this post</a>, and it works quite well!</p>
<p>The following image is a heatmap of a kd-tree built for the Stanford Dragon, showing the cost of tracing a ray through each pixel in the image. Brighter values mean more node traversals and intersection tests had to be done for that particular ray. The image was rendered entirely using Takua Render’s CUDA pathtracing engine, and took roughly 100 milliseconds to complete:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragon.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragon.png" alt="" /></a></p>
<p>…and a similar heatmap, this time generated for a scene containing two mesh cows, two mesh helixes, and some cubes and spheres in a box:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/cow.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/cow.png" alt="" /></a></p>
<p>Although room for even further optimization still exists, as it always does, I am quite happy with the results so far. My new kd-tree construction system and stackless traversal system are both several orders of magnitude faster and more efficient than my older attempts.</p>
<p>Here’s a bit of a cool image: in my OpenGL debugging view, I can now follow the kd-tree traversal for a single ray at a time and visualize the exact path and nodes encountered. This tool has been extremely useful for optimizing… without a visual debugging tool, no wonder my previous implementations had so many problems! The scene here is the same cow/helix scene, but rotated 90 degrees. The bluish green line coming in from the left is the ray, and the green boxes outline the nodes of the kd-tree that traversal had to check to get the correct intersection.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/kdboxes_notree.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/kdboxes_notree.png" alt="" /></a></p>
<p>…and here’s the same image as above, but with all nodes that were skipped drawn in red. As you can see, the system is now efficient enough to cull the vast vast majority of the scene for each ray:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/kdboxes_yestree.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/kdboxes_yestree.png" alt="" /></a></p>
<p>The size of the nodes relative to the density of the geometry in their vicinity also speaks towards the efficiency of the new kd-tree construction system: empty spaces are quickly skipped through with enormous bounding boxes, whereas high density areas have much smaller bounding boxes to allow for efficient culling.</p>
<p>Over the next day or so, I fully expect I’ll be able to reintegrate the actual pathtracing core, and have some nice images! Since the part of Takua that needed rewriting the most was the underlying scene and kd-tree system, I will be able to reuse a lot of the BRDF/emittance/etc. stuff from Takua Rev 2.</p>
https://blog.yiningkarlli.com/2013/02/revision-3-kd-treeobjcore.html
Revision 3 KD-Tree/ObjCore
2013-02-15T00:00:00+00:00
2013-02-15T00:00:00+00:00
Yining Karl Li
<p>The one piece of Takua Render that I’ve been proudest of so far has been the KD-Tree and obj mesh processing systems that I built. So of course, over the past week I completely threw away the old versions of KdCore and ObjCore and totally rewrote new versions entirely from scratch. The motive behind this rewrite came mostly from the fact that over the past year, I’ve learned a lot more about KD-Trees and programming in general; as a result, I’m pleased to report that the new versions of KdCore and ObjCore are significantly faster and more memory efficient than previous versions. KdCore3 is now able to process a million objects into an efficient, optimized KD-Tree with a depth of 20 and a minimum of 5 objects per leaf node in roughly one second.</p>
<p>Here’s my kitchen scene, exported to Takua Render’s AvohkiiScene format, and processed through KdCore3. White lines are the wireframe lines for the geometry itself, red lines represent KD-Tree node boundaries:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/kitchen_kd_wireframe.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/kitchen_kd_wireframe.png" alt="" /></a></p>
<p>…and the same image as above, but with only the KD-Tree. You can use the lightbox to switch between the two images for comparisons:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/kitchen_kd.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/kitchen_kd.png" alt="" /></a></p>
<p>One of the most noticeable improvements in KdCore3 over KdCore2, aside from the speed increases, is in how KdCore3 manages empty space. In the older versions of KdCore, empty space was often repeatedly split into multiple nodes, meaning that ray traversal through empty space was very inefficient, since repeated intersection tests would be required only for a ray to pass through the KD-Tree without actually hitting anything. The images in <a href="http://blog.yiningkarlli.com/2012/06/more-kd-tree-fun.html">this old post</a> demonstrate what I mean. The main source of this problem came from how splits were being chosen in KdCore2; in KdCore2, the chosen split was the lowest cost split regardless of axis. As a result, splits were often chosen that resulted in long, narrow nodes going through empty space. In KdCore3, the best split is chosen as the lowest cost split on the longest axis of the node. As a result, empty space is culled much more efficiently.</p>
<p>Another major change to KdCore3 is that the KD-Tree is no longer built recursively. Instead, KdCore3 builds the KD-Tree layer by layer through an iterative approach that is well suited for adaptation to the GPU. Instead of attempting to guess how deep to build the KD-Tree, KdCore3 now just takes a maximum depth from the user and builds the tree no deeper than the given depth. The entire tree is also no longer stored as a series of nodes with pointers to each other, but instead all nodes are stored in a flat array with a clever indexing scheme to allow nodes to implicitly know where their parent and child nodes are within the array. Furthermore, instead of building as a series of nodes with pointers, the tree builds directly into the array format. This array storage format again makes KdCore3 more suitable to a GPU adaptation, and also makes serializing the Kd-Tree out to disk significantly easier for memory caching purposes.</p>
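<p>One common way to realize this sort of implicit indexing, shown here purely as an illustration of the idea (KdCore3’s actual scheme may differ), is the standard binary-heap style layout, where a node’s parent and children are found through pure arithmetic, so no pointers ever need to be stored or serialized:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Implicit binary-heap style layout over a flat array of nodes:
inline int parentIndex(int i)     { return (i - 1) / 2; }
inline int leftChildIndex(int i)  { return 2 * i + 1; }
inline int rightChildIndex(int i) { return 2 * i + 2; }
// A complete tree of maximum depth d fits in a flat array of (2^(d+1) - 1) nodes,
// which also makes serializing the whole tree out to disk a single contiguous write.
</code></pre></div></div>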
<p>Another major change is how split candidates are chosen; in KdCore2, the candidates along each axis were the median of all contained object center-points, the middle of the axis, and some randomly chosen candidates. In KdCore3, the user can specify a number of split candidates to try along each axis, and then KdCore3 will simply divide each axis into that number of equally spaced points and use those points as candidates. As a result, KdCore3 is far more efficient than KdCore2 at calculating split candidates, can often find a better candidate with more deterministic results due to the removal of random choices, and offers the user more control over the quality of the final split.</p>
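<p>The equally spaced candidate generation itself is about as simple as it sounds; a small sketch (with a hypothetical Aabb type and GLM assumed) might look like the following, keeping the candidates strictly inside the node so that no degenerate zero-volume children are produced:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;vector&gt;

struct Aabb { glm::vec3 min; glm::vec3 max; };   // hypothetical node bounds type

// Generate numCandidates equally spaced split positions along the longest axis of a node.
std::vector&lt;float&gt; splitCandidates(const Aabb&amp; bounds, int numCandidates) {
    glm::vec3 extent = bounds.max - bounds.min;
    int axis = 0;
    if (extent.y > extent.x) axis = 1;
    if (extent.z > extent[axis]) axis = 2;
    std::vector&lt;float&gt; candidates;
    for (int i = 1; i &lt;= numCandidates; ++i) {
        float t = float(i) / float(numCandidates + 1);   // strictly interior fractions
        candidates.push_back(bounds.min[axis] + t * extent[axis]);
    }
    return candidates;
}
</code></pre></div></div>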
<p>The following series of images demonstrate KD-Trees built by KdCore3 for the Stanford Dragon with various settings. Again, feel free to use the lightbox for comparisons.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level02.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level02.png" alt="Max depth 2, min objects per node 20, min volume .0001% of whole tree" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level05.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level05.png" alt="Max depth 5, min objects per node 20, min volume .0001% of whole tree" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level10.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level10.png" alt="Max depth 10, min objects per node 20, min volume .0001% of whole tree" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level15.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level15.png" alt="Max depth 15, min objects per node 20, min volume .0001% of whole tree" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level20.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level20.png" alt="Max depth 20, min objects per node 20, min volume .0001% of whole tree" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level20_2.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/dragonkd_level20_2.png" alt="Max depth 20, min objects per node 5, min volume .0001% of whole tree" /></a></p>
<p>KdCore3 is also capable of figuring out when the number of nodes in the tree makes traversing the tree more expensive than brute force intersection testing all of the objects in the tree, and will stop tree construction beyond that point. I’ve also given KdCore3 an experimental method for finding the best splits based on a semi-Monte-Carlo approach. In this mode, instead of using evenly spaced split candidates, KdCore3 will make three random guesses, and then based on the relative costs of the guesses, begin making additional guesses with a probability distribution weighted towards wherever the lower relative cost is. With this approach, KdCore3 will eventually arrive at the absolute optimal cost split, although getting to this point may take some time. The number of guesses KdCore3 will attempt can be limited by the user, of course.</p>
<p>Finally, another one of the major improvements I made in KdCore3 was simply better use of C++. Over the past two years, my knowledge of how to write fast, effective C++ has evolved immensely, and I now write code very differently than how I did when I wrote KdCore2 and KdCore1. For example, KdCore3 avoids relying on class inheritance and virtual method table lookup (KdCore2 relied on inheritance quite heavily). Normally, virtual method lookup doesn’t add a noticeable amount of execution time to a single virtual method call, but when repeated for a few million objects, the slowdown becomes extremely apparent. In talking with my friend Robert Mead, I realized that virtual method table lookup in almost all, if not all, implementations today necessarily means a minimum of three pointer lookups in memory to find a function, whereas a direct function call is a single pointer lookup.</p>
<p>If I have time later, I’ll post some benchmarks of KdCore3 versus KdCore2. However, for now, here’s a final pair of images showcasing a scene with highly variable density processed through KdCore3. Note the heavy concentration of nodes where large amounts of geometry exist, and the near total emptiness of the KD-Tree in areas where the scene is sparse:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/scene_kd_wireframe.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/scene_kd_wireframe.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/scene_kd.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/scene_kd.png" alt="" /></a></p>
<p>Next up: implementing some form of highly efficient stackless KD-Tree traversal, possibly even using that <a href="http://blog.yiningkarlli.com/2012/09/thoughts-on-stackless-kd-tree-traversal.html">history based approach I wrote about before</a>!</p>
https://blog.yiningkarlli.com/2013/02/bounding-boxes-for-ellipsoids.html
Bounding Boxes for Ellipsoids
2013-02-08T00:00:00+00:00
2013-02-08T00:00:00+00:00
Yining Karl Li
<p>Update (2014): Tavian Barnes has written a <a href="https://tavianator.com/2014/ellipsoid_bounding_boxes.html">far better / more detailed post</a> on this topic; instead of reading my post, I suggest you go read Tavian’s post instead.</p>
<p>Warning: this post is going to be pretty math-heavy.</p>
<p>Let’s talk about spheres, or more generally, ellipsoids. Specifically, let’s talk about calculating axis aligned bounding boxes for arbitrarily transformed ellipsoids, which is a bit of an interesting problem I recently stumbled upon while working on Takua Rev 3. I’m making this post because finding a solution took a lot of searching and I didn’t find any single collected source of information on this problem, so I figured I’d post it for both my own reference and for anyone else who may find this useful.</p>
<p>So what’s so hard about calculating tight axis aligned bounding boxes for arbitrary ellipsoids?</p>
<p>Well, consider a basic, boring sphere. The easiest way to calculate a tight axis aligned bounding box (or AABB) for a mesh is to simply min/max all of the vertices in the mesh to get two points representing the min and max points of the AABB. Similarly, getting a tight AABB for a box is easy: just use the eight vertices of the box for the min/max process. A naive approach to getting a tight AABB for a sphere seems simple then: along the three axes of the sphere, have one point on each end of the axis on the surface of the sphere, and then min/max. Figure 1. shows a 2D example of this naive approach, to extend the example to 3D, simply add two more points for the Z axis (I drew the illustrations quickly in Photoshop, so apologies for only showing 2D examples):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/figure1.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/figure1.png" alt="Figure 1." /></a></p>
<p>This naive approach, however, quickly fails if we rotate the sphere such that its axes are no longer lined up nicely with the world axes. In Figure 2, our sphere is rotated, resulting in a way too small AABB if we min/max points on the sphere axes:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/figure2.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/figure2.png" alt="Figure 2." /></a></p>
<p>If we scale the sphere such that it becomes an ellipsoid, the same problem persists, as the sphere is just a subtype of ellipsoid. In Figures 3 and 4, the same problem found in Figures 1/2 is illustrated with an ellipsoid:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/figure3.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/figure3.png" alt="Figure 3." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/figure4.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/figure4.png" alt="Figure 4." /></a></p>
<p>One possible solution is to continue using the naive min/max axes approach, but simply expand the resultant AABB by some percentage such that it encompasses the whole sphere. However, we have no way of knowing what percentage will give an exact bound, so the only feasible way to use this fix is by making the AABB always larger than a tight fit would require. As a result, this solution is almost as undesirable as the naive solution, since the whole point of this exercise is to create as tight of an AABB as possible for as efficient intersection culling as possible!</p>
<p>Instead of min/maxing the axes, we need to use some more advanced math to get a tight AABB for ellipsoids.</p>
<p>We begin by noting our transformation matrix, which we’ll call M. We’ll also need the transpose of M, which we’ll call MT. Next, we define a sphere S using a 4x4 matrix:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ r 0 0 0 ]
[ 0 r 0 0 ]
[ 0 0 r 0 ]
[ 0 0 0 -1]
</code></pre></div></div>
<p>where r is the radius of the sphere. So for a unit diameter sphere, r = .5. Once we have built S, we’ll take its inverse, which we’ll call SI.</p>
<p>We now calculate a new 4x4 matrix R = M*SI*MT. R should be symmetric when we’re done, such that R = transpose(R). We’ll assign R’s indices the following names:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R = [ r11 r12 r13 r14 ]
[ r12 r22 r23 r24 ]
[ r13 r23 r23 r24 ]
[ r14 r24 r24 r24 ]
</code></pre></div></div>
<p>Using R, we can now get our bounds:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>zmax = (r23 + sqrt(pow(r23,2) - (r33*r22)) ) / r33;
zmin = (r23 - sqrt(pow(r23,2) - (r33*r22)) ) / r33;
ymax = (r13 + sqrt(pow(r13,2) - (r33*r11)) ) / r33;
ymin = (r13 - sqrt(pow(r13,2) - (r33*r11)) ) / r33;
xmax = (r03 + sqrt(pow(r03,2) - (r33*r00)) ) / r33;
xmin = (r03 - sqrt(pow(r03,2) - (r33*r00)) ) / r33;
</code></pre></div></div>
<p>…and we’re done!</p>
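<p>Putting the whole recipe together in code only takes a few lines; here’s a minimal sketch using GLM for the matrix algebra (any 4x4 matrix library works, and the Aabb struct is purely illustrative). Since R is symmetric, it doesn’t matter whether the library is row-major or column-major for the entries we read, and taking an explicit min/max of the two roots per axis sidesteps having to worry about the sign of r33:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;algorithm&gt;
#include &lt;cmath&gt;
#include &lt;glm/glm.hpp&gt;

struct Aabb { glm::vec3 min; glm::vec3 max; };   // illustrative AABB type

// Solve for the two tangent plane offsets along one axis; returns (min, max).
static glm::vec2 axisBounds(float r_i3, float r_ii, float r33) {
    float d = std::sqrt(r_i3 * r_i3 - r33 * r_ii);
    float a = (r_i3 - d) / r33;
    float b = (r_i3 + d) / r33;
    return glm::vec2(std::min(a, b), std::max(a, b));
}

Aabb ellipsoidBounds(const glm::mat4&amp; M, float r) {
    glm::mat4 S(0.0f);
    S[0][0] = 1.0f; S[1][1] = 1.0f; S[2][2] = 1.0f; S[3][3] = -r * r;
    glm::mat4 R = M * glm::inverse(S) * glm::transpose(M);
    glm::vec2 x = axisBounds(R[3][0], R[0][0], R[3][3]);   // R is symmetric, so
    glm::vec2 y = axisBounds(R[3][1], R[1][1], R[3][3]);   // R[3][i] == R[i][3]
    glm::vec2 z = axisBounds(R[3][2], R[2][2], R[3][3]);
    return { glm::vec3(x[0], y[0], z[0]), glm::vec3(x[1], y[1], z[1]) };
}
</code></pre></div></div>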
<p>Just to prove that it works, a screenshot of a transformed ellipse inside of a tight AABB in 3D from Takua Rev 3’s GL Debug view:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Feb/takua_ellipse.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Feb/takua_ellipse.png" alt="" /></a></p>
<p>I’ve totally glossed over the mathematical rationale behind this method in this post and focused just on how to quickly get a working implementation, but if you want to read more about the actual math behind how it works, these are the two sources I pulled this from:</p>
<p><a href="http://stackoverflow.com/a/4369956">Stack Overflow post by user fd</a></p>
<p><a href="http://www.iquilezles.org/www/articles/ellipses/ellipses.htm">Article by Inigo Quilez</a></p>
<p>In other news, Takua Rev 3’s new scene system is now complete and I am working on a brand new, better, faster, stackless KD-tree implementation. More on that later!</p>
https://blog.yiningkarlli.com/2013/01/revision-3-old-renders.html
Revision 3, Old Renders
2013-01-18T00:00:00+00:00
2013-01-18T00:00:00+00:00
Yining Karl Li
<p>At the beginning of the semester, I decided to re-architect Takua again, hence the lack of updates for a couple of weeks now. I’ll talk more in-depth about the details of how this new architecture works in a later post, so for now I’ll just quickly describe the motivation behind this second round of re-architecting. As I <a href="http://blog.yiningkarlli.com/2012/09/takuaavohkii-render.html">wrote about before</a>, I’ve been keeping parallel CPU and GPU branches of my renderer so far, but the two branches have increasingly diverged. On top of that, the GPU branch of my renderer, although significantly better organized than the experimental CUDA renderer from spring 2012, still is rather suboptimal; after TAing <a href="http://cis565-fall-2012.github.com/">CIS565</a> for a semester, I’ve developed what I think are some better ways of architecting CUDA code. Over winter break, I began to wonder if merging the CPU and GPU branches might be possible, and if such a task could be done, how I might go about doing it.</p>
<p>This newest re-structuring of Takua accomplishes that goal. I’m calling this new version of Takua “Revision 3”, as it is the third major rewrite.</p>
<p>My new architecture centers around a couple of observations. First, we can observe that the lowest common denominator (so to speak) for structured data in CUDA and C++ is… a struct. Similarly, the easiest way to recycle code between CUDA and C++ is to implement code as inlineable, C style functions that can either be embedded in a CUDA kernel at compile time, or wrapped within a C++ class for use in C++. Therefore, one possible way to unify CPU C++ and GPU CUDA codebases could be to implement core components of the renderer using structs and C-style, inlineable functions, allowing easy integration into CUDA kernels, and then write thin wrapper classes around said structs and functions to allow for nice, object oriented C++ code. This exact system is how I am building Takua Revision 3; the end result should be a unified codebase that can compile to both CPU and GPU versions, and allow for both versions to develop in near lockstep.</p>
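<p>As a concrete illustration of this idea, here’s a hypothetical sketch (not Takua’s actual code) of what one of these shared core types might look like. The struct and the C-style, inlineable function compile as plain C++ on the CPU and can be inlined directly into a CUDA kernel when compiled with nvcc, while a thin wrapper class provides the object oriented interface on the C++ side; the SHARED_INLINE macro name is just illustrative.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// When compiling with nvcc, mark shared functions as usable on both host and
// device; when compiling as plain C++, they are just ordinary inline functions.
#ifdef __CUDACC__
#define SHARED_INLINE __host__ __device__ inline
#else
#define SHARED_INLINE inline
#endif

// Lowest-common-denominator data layout: a plain struct, usable from C++ and CUDA.
struct RayData {
    float originX, originY, originZ;
    float dirX, dirY, dirZ;
};

// C-style, inlineable core function shared between the CPU and GPU codepaths.
SHARED_INLINE void rayPointAlong(const RayData& ray, float t,
                                 float& x, float& y, float& z) {
    x = ray.originX + t * ray.dirX;
    y = ray.originY + t * ray.dirY;
    z = ray.originZ + t * ray.dirZ;
}

// Thin wrapper class so the CPU-side C++ codebase still reads as nice OO code.
class Ray {
public:
    explicit Ray(const RayData& data) : m_data(data) {}
    void pointAlong(float t, float& x, float& y, float& z) const {
        rayPointAlong(m_data, t, x, y, z);
    }
private:
    RayData m_data;
};
</code></pre></div></div>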
<p>Again, I’ll go into a more detailed explanation once this process is complete.</p>
<p>I’ll leave this post with a slightly orthogonal note; whilst in the process of merging code, I found some images from Takua Revision 1 that I never posted for some reason. Here’s a particularly cool pair of images from when I was implementing depth of field. The first image depicts a glass Stanford dragon without any depth of field, and the second image depicts the same exact scene with some crazy shallow aperture (I don’t remember the exact settings). You can tell these are from the days of Takua Revision 1 by the ceiling; I often made the entire ceiling a light source to speed up renders back then, until Revision 2’s huge performance increases rendered cheats like that unnecessary.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Jan/glassdragon.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Jan/glassdragon.png" alt="Glass Stanford dragon without depth of field" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2013/Jan/glassdragon_dof.png"><img src="https://blog.yiningkarlli.com/content/images/2013/Jan/glassdragon_dof.png" alt="Glass Stanford dragon with depth of field" /></a></p>
https://blog.yiningkarlli.com/2012/12/texture-mapping.html
Texture Mapping
2012-12-18T00:00:00+00:00
2012-12-18T00:00:00+00:00
Yining Karl Li
<p>A few weeks back I started work on another piece of super low-hanging fruit: texture mapping! Before I delve into the details, here’s a test render showing three texture mapped spheres with varying degrees of glossiness in a glossy-walled Cornell box. I was also playing with logos for Takua render and put a test logo idea on the back wall for fun:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/texture3.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/texture3.png" alt="" /></a></p>
<p>…and the same scene with the camera tilted down just to show off the glossy floor (because I really like the blurry glossy reflections):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/texture2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/texture2.png" alt="" /></a></p>
<p>My texturing system can, of course, support textures of arbitrary resolution. The black and white grid and colored UV tile textures in the above render are square 1024x1024, while the Earth texture is rectangular 1024x512. Huge textures are handled just fine, as demonstrated by the following render using a giant 2048x2048, color tweaked version of <a href="http://simoncpage.co.uk/blog/2012/03/ipad-hd-retina-wallpaper/">Simon Page’s Space Janus wallpaper</a>:</p>
<p><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/texture4.png" alt="" /></p>
<p>Of course UV transformations are supported. Here’s the same texture with a 35 degree UV rotation applied and tiling switched on:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/sampleScene_uvrotated.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/sampleScene_uvrotated.png" alt="" /></a></p>
<p>Since memory is always at a premium, especially on the GPU, I’ve implemented textures in a fashion inspired by geometry instancing and node based material systems, such as the system for Maya. Inside of my renderer, I represent texture files as a file node containing the raw image data, streamed from disk via <a href="http://nothings.org/stb_image.c">stb_image</a>. I then apply transformations, UV operations, etc through a texture properties node, which maintains a pointer to the relevant texture file node, and then materials point to whatever texture properties nodes they need. This way, texture data can be read and stored once in memory and recycled as many times as needed, meaning that a well formatted scene file can altogether eliminate the need for redundant texture read/storage in memory. This system allows me to create amusing scenes like the following one, where a single striped texture is reused in a number of materials with varied properties:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/stripes.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/stripes.png" alt="" /></a></p>
<p>Admittedly I made that stripe texture really quickly in Photoshop without too much care for straightness of lines, so it doesn’t actually tile very well. That’s why the sphere in the lower front shows a discontinuity in its texture… it’s not glitchy UVing, just a crappy texture!</p>
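<p>In code, the node-based texture instancing idea described above might look roughly like the following sketch; the type names are hypothetical stand-ins rather than my actual classes, but the ownership structure (many properties nodes sharing one file node, and many materials sharing properties nodes) is the point.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <memory>
#include <string>
#include <vector>

// A file node: raw image data, loaded from disk once (e.g. via stb_image).
struct TextureFile {
    std::string filepath;
    int width = 0;
    int height = 0;
    std::vector<unsigned char> pixels;
};

// A texture properties node: UV transforms and options layered on top of a
// file node. Many properties nodes can point at the same file node.
struct TextureProperties {
    std::shared_ptr<TextureFile> file;
    float uvRotationDegrees = 0.0f;
    bool tile = false;
};

// Materials point at whatever properties nodes they need; any property can be
// texture driven, so texture data is read and stored once and reused freely.
struct Material {
    std::shared_ptr<TextureProperties> diffuseColor;
    std::shared_ptr<TextureProperties> glossiness;
};
</code></pre></div></div>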
<p>I’ve also gone ahead and extended my materials system to allow any material property to be driven with a texture. In fact, the stripe room render above is using the same stripe texture to drive reflectiveness on the side walls, resulting in reflective surfaces where the texture is black and diffuse surfaces where the texture is white. Here’s another example of texture driven material properties showing emission being driven using the same color-adjusted Space Janus texture from before:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/texture_light_big.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/texture_light_big.png" alt="" /></a></p>
<p>Even refractive and reflective index of refraction can be driven with textures, which can yield some weird/interesting results. Here are a pair of renders showing a refractive red cube with uniform IOR, and with IOR driven with a Perlin noise map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass.1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass.1.png" alt="Uniform refractive IOR" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass_uv.0.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass_uv.0.png" alt="Refractive IOR driven with a Perlin noise texture map" /></a></p>
<p>The nice thing about a node-style material representation is that I should be able to easily plug in procedural functions in place of textures whenever I get around to implementing some (that way I can use procedural Perlin noise instead of using a noise texture).</p>
<p>Here’s an admittedly kind of ugly render using the color UV grid texture to drive refractive color:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass_color.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_glass_color.png" alt="" /></a></p>
<p>For some properties, I’ve had to add a requirement that the user specify a range of valid values when using a texture map, since RGB values don’t map well to said properties. An example would be glossiness, where a gloss value range of 0% to 100% leaves little room for detailed adjustment. Of course this issue can be fixed by adding support for floating point image formats such as OpenEXR, which is coming very soon! In the following render, the back wall’s glossiness is being driven using the stripe texture (texture driven IOR is also still in effect on the red refractive cube):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_gloss.0.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/stripe_gloss.0.png" alt="" /></a></p>
<p>Of course, even with nice instancing schemes, textures can potentially take up a gargantuan amount of memory, which poses a huge problem in the GPU world where onboard memory is at a premium. I still need to think more about how I’m going to deal with memory footprints larger than on-device memory, but at the moment my plan is to let the renderer allocate and overflow into pinned host memory whenever it detects that the needed footprint is within some margin of total available device memory. This concern is also a major reason why I’ve decided to stick with CUDA for now… until OpenCL gets support for a unified address space for pinned memory, I’m not wholly sure how I’m supposed to deal with memory overflow issues in OpenCL. I haven’t reexamined OpenCL in a little while now though, so perhaps it is <a href="http://blog.vsampath.com/2012/05/ed-opencl-vs-cuda-mid-2012-edition.html">time to take another look</a>.</p>
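<p>As a sketch of what that overflow strategy could look like with the CUDA runtime (hypothetical code, and assuming the device supports mapped pinned memory), the allocation decision might be as simple as checking free device memory and falling back to cudaHostAlloc:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cuda_runtime.h>
#include <cstddef>

// Allocate texture storage on-device if it fits with some safety margin to
// spare; otherwise fall back to pinned, mapped host memory that the GPU can
// still address directly (at a bandwidth penalty).
void* allocateTextureStorage(size_t bytes, size_t safetyMargin) {
    size_t freeBytes = 0;
    size_t totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    void* devicePointer = nullptr;
    if (bytes + safetyMargin < freeBytes) {
        cudaMalloc(&devicePointer, bytes);
        return devicePointer;
    }
    void* hostPointer = nullptr;
    cudaHostAlloc(&hostPointer, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devicePointer, hostPointer, 0);
    return devicePointer;
}
</code></pre></div></div>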
<p>Unfortunately, something I discovered while in the process of extending my material system to support texture driven properties is that my renderer could probably use a bit of refactoring for the sake of organization and readability. Since I now have some time over winter break and am planning on making my Github repo for Takua-RT public soon, I’ll probably undertake a bit of code refactoring over the next few weeks.</p>
https://blog.yiningkarlli.com/2012/12/blurred-glossy-reflections.html
Blurred Glossy Reflections
2012-12-07T00:00:00+00:00
2012-12-07T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossy_glossy_test.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossy_glossy_test.png" alt="" /></a></p>
<p>Over the past few months I haven’t been making as much progress on my renderer as I would have liked, mainly because another major project has been occupying most of my attention: TAing/restructuring the <a href="http://cis565-fall-2012.github.com/">GPU Programming</a> course here at Penn. I’ll probably write a post at the end of the semester with detailed thoughts and comments about that experience.</p>
<p>I recently had a bit of extra time, which I used to tackle a piece of super low hanging fruit: blurred glossy reflections. The simplest brute force approach to blurred glossy reflections is to take the reflected ray direction from specular reflection, construct a lobe around that ray, and sample across the lobe instead of only along the reflected direction. The wider the lobe, the blurrier the glossy reflection gets. The following diagram, borrowed from Wikipedia, illustrates this property:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossylobe.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossylobe.png" alt="" /></a></p>
<p>In a normal raytracer or rasterization based renderer, blurred glossy reflections require something of a compromise between speed and visual quality (much like many other effects!), since using a large number of samples within the glossy specular lobe to achieve a high quality reflection can be prohibitively expensive. This cost-quality tradeoff is therefore similar to the tradeoffs that must be made in any distributed raytracing effect. However, in a pathtracer, we’re already using a massive number of samples, so we can fold the blurred glossy reflection work into our existing high sample count. In a GPU renderer, we have massive amounts of compute as well, making blurred glossy reflections far more tractable than in a traditional system.</p>
<p>The image at the top of this post shows three spheres of varying gloss amounts in a modified Cornell box with a glossy floor and reflective colored walls, rendered entirely inside of Takua-RT. Glossy to glossy light transport is an especially inefficient scenario to resolve in pathtracing, but throwing brute force GPU compute at it allows for arriving at a good solution reasonably quickly: the above image took around a minute to render at 800x800 resolution. Here is another test of blurred glossy reflections, this time in a standard diffuse Cornell box:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_1.png" alt="" /></a></p>
<p>…and some tests showing varying degrees of gloss, within a modified Cornell box with glossy left and right walls. Needless to say, all of these images were also rendered entirely inside of Takua-RT.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_4.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_4.png" alt="Full specular reflection" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_3.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_3.png" alt="Approximately 10% blurred glossy reflection" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossytest_2.png" alt="Approximately 30% blurred glossy reflection" /></a></p>
<p>Finally, here’s another version of the first image in this post, but with the camera in the wrong place. You can see a bit of the stand-in sky I have right now. I’m working on a sun & sky system, but since it’s not ready yet, I have a simple gradient serving as a stand-in. I’ll post more about sun & sky when I’m closer to finishing with it… I’m not doing anything <a href="http://skyrenderer.blogspot.com/">fancy like Peter Kutz is doing</a> (his sky renderer blog is definitely worth checking out, by the way), just standard Preetham et al. style.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Dec/glossy_glossy_sky.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Dec/glossy_glossy_sky.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/09/thoughts-on-stackless-kd-tree-traversal.html
Thoughts on Stackless KD-Tree Traversal
2012-09-15T00:00:00+00:00
2012-09-15T00:00:00+00:00
Yining Karl Li
<p>Edit: <a href="http://yiningkarlli.blogspot.com/2012/09/thoughts-on-stackless-kd-tree-traversal.html?showComment=1353951085399#c9086262641390319736">Erwin Coumans</a> in the comments section has pointed me to a <a href="http://twvideo01.ubm-us.net/o1/vault/gdc09/slides/takahiroGDC09s_1.pdf">GDC 2009 talk by Takahiro Harada</a> proposing something called Tree Traversal using History Flags, which is essentially the same as the idea proposed in this post, with the small exception that Harada’s technique uses a bit field to track previously visited nodes on the up traverse. I think that Harada’s technique is actually better than the pointer check I wrote about in this post, since keeping a bit field would allow for tracking the previously visited node without having to go back to global memory to do a node check. In other words, the bit field method allows for less thrashing of global memory, which I should think allows for a nice performance edge. So, much as I suspected, the idea in this post is one that folks smarter than me have arrived at previously, and my knowledge of the literature on this topic is indeed incomplete. Much thanks to Erwin for pointing me to the Harada talk! The original post is preserved below, in case anyone still has an interest in reading it.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_05.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_05.png" alt="" /></a></p>
<p>Of course, one of the biggest challenges to implementing a CUDA pathtracer is the lack of recursion on pre-Fermi GPUs. Since I intend for Takua-RT to be able to run on any CUDA enabled GPU, I necessarily have to work with the assumption that I won’t have recursion support. Getting around this problem in the core pathtracer is not actually a significant issue, as building raytracing systems that operate in an iterative fashion as opposed to in a recursive fashion is a well-covered topic.</p>
<p>Traversing a kd-tree without recursion, however, is a more challenging proposition. So far as I can tell from a very cursory glance at existing literature on the topic, there are presently two major approaches: fully stack-less methods that require some amount of pre-processing of the kd-tree, such as the <a href="http://graphics.cs.uni-sb.de/fileadmin/cguds/papers/2007/popov_07_GPURT/Popov_et_al._-_Stackless_KD-Tree_Traversal_for_High_Performance_GPU_Ray_Tracing.pdf">rope-based method presented in Popov et al. [2007]</a>, and methods utilizing a short stack or something similar, such as the <a href="http://www.kunzhou.net/2008/kdtree.pdf">method presented in Zhou et al. [2008]</a>. I’m in the process of reading both of these papers more carefully, and will probably explore at least one of these approaches soon. In the meantime, however, I thought it might be a fun exercise to try to come up with some solution of my own, which I’ll summarize in this post. I have to admit that I have no idea if this is actually a novel approach, or if it’s something that somebody also came up with and rejected a long time ago and I just haven’t found yet. My coverage of the literature in this area is highly incomplete, so if you, the reader, are aware of a pre-existing version of this idea, please let me know so that I can attribute it properly!</p>
<p>The basic idea I’m starting with is that when traversing a KD-tree (or any other type of tree, for that matter), at a given node, there’s only a finite number of directions one can go in, and a finite number of previous nodes one could have arrived at the current node from. In other words, one could conceivably define a finite-state machine type system for traversing a KD-tree, given an input ray. I say finite-state machine type, because what I shall define here isn’t actually strictly a FSM, as this method requires storing information about the previous state in addition to the current state. So here we go:</p>
<p>We begin by tracking two pieces of information: what the current node we are at is, and what direction we had to take from the previous node to get to the current node. There are three possible directions we could have come from:</p>
<ol>
<li>Down from the current node’s parent node</li>
<li>Up from the current node’s left child</li>
<li>Up from the current node’s right child</li>
</ol>
<p>Similarly, there are only three directions we can possibly travel in from the current node:</p>
<ol>
<li>Up to the current node’s parent node</li>
<li>Down to the current node’s left child</li>
<li>Down to the current node’s right child</li>
</ol>
<p>When we travel up from the current node to its parent, we can easily figure out if we are traveling up from the right or the left by looking at whether the current node is the parent node’s left or right child.</p>
<p>Now we need a few rules on which direction to travel in given the direction we came from and some information on where our ray currently is in space:</p>
<ol>
<li>If we came down from the parent node and if the current node is not a leaf node, intersection test our ray with both children of the current node. If the ray only intersects one of the children, traverse down to that child. If the ray intersects both of the children, traverse down to the left child.</li>
<li>If we came down from the parent node and if the current node is a leaf node, carry out intersection tests between the ray and the contents of the node and store the nearest intersection.</li>
<li>If we came up from the left child, intersection test our ray with the right child of the current node. If we have an intersection, traverse down the right child. If we don’t have an intersection, traverse upwards to the parent.</li>
<li>If we came up from the right child, traverse upwards to the parent.</li>
</ol>
<p>That’s it. With those four rules, we can traverse an entire KD-Tree in a DFS fashion, while skipping branches that our ray does not intersect for a more efficient traverse, and avoiding any use of recursion or the use of a stack in memory.</p>
<p>There is, of course, the edge case that our ray is coming in to the tree from the “back”, so that the right child of each node is “in front” of the left child instead of “behind”, but we can easily deal with this case by simply testing which side of the KD-tree we’re entering from and swapping left and right in our ruleset accordingly.</p>
<p>I haven’t actually gotten around to implementing this idea yet (as of September 15th, when I started writing this post, although this post may get published much later), so I’m not sure what the performance looks like. There are some inefficiencies in how many nodes our traverse will end up visiting, but on the flipside, we won’t need to keep much of anything in memory except for two pieces of state information and the actual KD-tree itself. On the GPU, I might run into implementation level problems that could impact performance, such as too many branching statements or memory thrashing if the KD-tree is kept in global memory and a gazillion threads try to traverse it at once, so these issues will need to be addressed later.</p>
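<p>For concreteness, here’s a rough, untested sketch of how the four rules above might translate into code; KDNode, the bounding box test, and the leaf intersection test here are just illustrative stand-ins, not an actual implementation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Traversal state: which direction we arrived at the current node from.
enum class CameFrom { Parent, LeftChild, RightChild };

struct KDNode {
    KDNode* parent = nullptr;
    KDNode* left = nullptr;
    KDNode* right = nullptr;
    bool isLeaf = false;
};

bool rayHitsNode(const KDNode*) { return false; }   // stand-in: ray vs. node bounds
void intersectLeafContents(const KDNode*) {}        // stand-in: ray vs. leaf primitives

void traverse(KDNode* root) {
    KDNode* current = root;
    CameFrom state = CameFrom::Parent;
    // Move to the parent, recording whether we came up from its left or right child.
    auto goUp = [&]() {
        KDNode* parent = current->parent;
        state = (parent != nullptr && current == parent->left) ? CameFrom::LeftChild
                                                               : CameFrom::RightChild;
        current = parent;
    };
    while (current != nullptr) {
        if (state == CameFrom::Parent) {
            if (current->isLeaf) {                      // rule 2
                intersectLeafContents(current);
                goUp();
            } else if (rayHitsNode(current->left)) {    // rule 1: prefer the left child
                current = current->left;
            } else if (rayHitsNode(current->right)) {
                current = current->right;
            } else {
                goUp();                                 // ray misses both children
            }
        } else if (state == CameFrom::LeftChild) {      // rule 3
            if (rayHitsNode(current->right)) {
                state = CameFrom::Parent;
                current = current->right;
            } else {
                goUp();
            }
        } else {                                        // rule 4: came up from the right
            goUp();
        }
    }
}
</code></pre></div></div>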
<p>Again, if you, the reader, knows of this idea from a pre-existing place, please let me know! Also, if you see a gaping hole in my logic, please let me know too!</p>
<p>Since this has been a very text heavy post, I’ll close with some pictures of a KD-tree representing the scene from the <a href="http://blog.yiningkarlli.com/2012/09/takuaavohkii-render.html">Takua-RT post</a>. They don’t really have much to do with the traverse method presented in this post, but they are KD-tree related!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53735317" frameborder="0">"Orbital" KD-Tree</iframe></div>
<p>Vimeo’s compression really does not like thin narrow lines, so here are some stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_02.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_02.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_03.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_03.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_04.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/orbital_kd_04.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/09/takuaavohkii-render.html
TAKUA/Avohkii Render
2012-09-10T00:00:00+00:00
2012-09-10T00:00:00+00:00
Yining Karl Li
<p><div class="embed-container"><iframe src="https://player.vimeo.com/video/53735318" frameborder="0">Takua-RT "Orbital" Demo</iframe></div></p>
<p>One question I’ve been asking myself ever since my friend <a href="http://peterkutz.com/">Peter Kutz</a> and I wrapped our little <a href="http://gpupathtracer.blogspot.com/">GPU Pathtracer experiment</a> is “why am I writing Takua Render as a CPU-only renderer?” One of the biggest lessons learned from the GPU Pathtracer experiment was that GPUs can indeed provide vast quantities of compute suitable for use in pathtracing rendering. After thinking for a bit at the beginning of the summer, I’ve decided that since I’m starting my renderer from scratch and don’t have to worry about the tons of legacy that real-world big boy renderers like RenderMan have to deal with, there is no reason why I shouldn’t architect my renderer to use whatever compute power is available.</p>
<p>With that being said, from this point forward, I will be concurrently developing CPU and GPU based implementations of Takua Render. I call this new overall project TAKUA/Avohkii, mainly because Avohkii is a cool name. Within this project, I will continue developing the C++ based x86 version of Takua, which will retain the name of just Takua, and I will also work on a CUDA based GPU version, called Takua-RT, with full feature parity. I’m also planning on investigating the idea of an ARM port, but that’s an idea for later. I’m going to stick with CUDA for the GPU version now since I know CUDA better than OpenCL and since almost all of the hardware I have access to develop and test on right now is NVIDIA based (the SIG Lab runs on NVIDIA cards…), but that could change down the line. The eventual goal is to have a set of renderers that together cover as many hardware bases as possible, and can all interoperate and intercommunicate for farming purposes.</p>
<p>I’ve already gone ahead and finished the initial work of porting Takua Render to CUDA. One major lesson learned from the GPU Pathtracer experiment was that enormous CUDA kernels tend to run into a number of problems, much like massive monolithic GL shaders do. One problem in particular is that enormous kernels tend to take a long time to run and can result in the GPU driver terminating the kernel, since NVIDIA’s drivers by default assume that device threads taking longer than 2 seconds to run are hanging and cull said threads. In the GPU Pathtracer experiment, we used a giant monolithic kernel for a single ray bounce, which ran into problems as geometry count went up and subsequently intersection testing and therefore kernel execution time also increased. For Takua-RT, I’ve decided to split a single ray bounce into a sequence of micro-kernels that launch in succession. Basically, each operation is now a kernel; each intersection test is a kernel, BRDF evaluation is a kernel, etc. While I suppose I lose a bit of time in having numerous kernel launches, I am getting around the kernel time-out problem.</p>
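<p>Structurally, the micro-kernel approach looks something like the following hypothetical sketch (kernel names and the per-ray state layout are illustrative, not Takua-RT’s actual code): each stage of a bounce is its own short-lived kernel launch, so no single launch runs long enough to trip the driver’s watchdog.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Per-ray state kept in device memory between micro-kernel launches.
struct RayState {
    float origin[3];
    float direction[3];
    float throughput[3];
    int   pixelIndex;
    bool  alive;
};

__global__ void intersectSceneKernel(RayState* rays, int numRays) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRays) return;
    // ...intersection testing for ray i goes here...
}

__global__ void evaluateBRDFKernel(RayState* rays, int numRays) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRays) return;
    // ...BRDF evaluation and the next bounce direction for ray i goes here...
}

// One ray bounce becomes a sequence of small kernel launches instead of one
// monolithic kernel.
void traceOneBounce(RayState* rays, int numRays) {
    const int blockSize = 256;
    const int numBlocks = (numRays + blockSize - 1) / blockSize;
    intersectSceneKernel<<<numBlocks, blockSize>>>(rays, numRays);
    evaluateBRDFKernel<<<numBlocks, blockSize>>>(rays, numRays);
}
</code></pre></div></div>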
<p>Another important lesson learned was that culling useless kernel launches is extremely important. I’m checking for empty threads at the end of each ray bounce and culling via stream compaction for now, but this can of course be further extended to the micro-kernels for intersection testing later.</p>
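<p>One common way to do that kind of compaction is with thrust; here’s a hypothetical sketch (with a minimal illustrative per-ray struct) that removes dead rays after a bounce and returns how many rays are still alive:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <thrust/device_ptr.h>
#include <thrust/remove.h>

// Minimal per-ray state for illustration; only the alive flag matters here.
struct CompactRayState {
    bool alive;
    int  pixelIndex;
};

// Predicate: true for rays that should be removed from the working set.
struct IsDead {
    __host__ __device__ bool operator()(const CompactRayState& ray) const {
        return !ray.alive;
    }
};

// Compact the ray array in device memory and return how many rays remain.
int compactRays(CompactRayState* deviceRays, int numRays) {
    thrust::device_ptr<CompactRayState> begin(deviceRays);
    thrust::device_ptr<CompactRayState> newEnd =
        thrust::remove_if(begin, begin + numRays, IsDead());
    return static_cast<int>(newEnd - begin);
}
</code></pre></div></div>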
<p>Anyhow, enough text. Takua-RT, even in its super-naive unoptimized CUDA-port state right now, is already so much faster than the CPU version that I can render frames with fairly high convergence in seconds to minutes. That means the renderer is now fast enough for use on rendering animations, such as the one at the top of this post. No post-processing whatsoever was applied, aside from my name watermark in the lower right hand corner. The following images are raw output frames from Takua-RT, this time straight from the renderer, without even watermarking:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.60.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.60.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.100.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.100.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.169.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/animTest.169.png" alt="" /></a></p>
<p>Each of these frames represents 5000 iterations of convergence, and took about a minute to render on an NVIDIA Geforce GTX480. The flickering in the glass ball in the animated version comes from having a low trace depth of 3 bounces, including for glass surfaces.</p>
https://blog.yiningkarlli.com/2012/09/jello-kd-tree.html
Jello KD-Tree
2012-09-09T00:00:00+00:00
2012-09-09T00:00:00+00:00
Yining Karl Li
<p>I’ve started an effort to clean up, rewrite, and enhance my ObjCore library, and part of that effort includes taking my <a href="http://blog.yiningkarlli.com/2012/06/more-kd-tree-fun.html">KD-Tree viewer from Takua Render</a> and making it just a standard component of ObjCore. As a result, I can now plug the latest version of ObjCore into any of my projects that use it and quickly wire up support for viewing the KD-Tree view for that project. Here’s the <a href="http://blog.yiningkarlli.com/2012/05/more-fun-with-jello.html">jello sim from a few months back</a> visualized as a KD-Tree:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53735319" frameborder="0">Jello Sim KD-Tree</iframe></div>
<p>I’ve adopted a new standard grey background for OpenGL tests, since I’ve found that the higher amount of contrast this darker grey provides plays nicer with Vimeo’s compression for a clearer result. But of course I’ll still post stills too.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello0.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello0.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/kd_jello2.png" alt="" /></a></p>
<p>Hopefully at the end of this clean up process, I’ll have ObjCore in a solid enough of a state to post to Github.</p>
https://blog.yiningkarlli.com/2012/09/volumetric-renderer-revisited.html
Volumetric Renderer Revisited
2012-09-05T00:00:00+00:00
2012-09-05T00:00:00+00:00
Yining Karl Li
<p>I’ve been meaning to add animation support to my <a href="http://blog.yiningkarlli.com/2011/10/a-volumetric-renderer-for-rendering-volumes.html">volume renderer</a> for demoreel purposes for a while now, so I did that this week! Here’s a rotating cloud animation:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53634239" frameborder="0">Animated Cloud Render Test</iframe></div>
<p>…and of course, a still or two:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/cloud1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/cloud1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/cloud2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/cloud2.png" alt="" /></a></p>
<p>Instead of just rotating the camera around the cloud, I wanted the cloud itself to rotate but have the noise field it samples stay stationary, resulting in a cool kind of morphing effect with the cloud’s actual shape. In order to author animations easily, I implemented a fairly rough, crude version of Maya integration. I wrote a script that will take spheres and point lights in Maya and build a scene file for my volume renderer using the Maya spheres to define cloud clusters and the point lights to define… well… lights. With an easy bit of scripting, I can do this for each frame in a keyframed animation in Maya and then simply call the volume renderer once for each frame. Here’s what the above animation’s Maya scene file looks like:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/maya.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/maya.png" alt="" /></a></p>
<p>Also, since my pseudo-blackbody trick was originally intended to simulate the appearance of a fireball, I tried creating an animation of a fireball by just scaling a sphere:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53714601" frameborder="0">Animated Pseudo-Blackbody Test</iframe></div>
<p>…and as usual again, stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/blackbody1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/blackbody1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Sep/blackbody2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Sep/blackbody2.png" alt="" /></a></p>
<p>So that’s that for the volume renderer for now! I think this might be the end of the line for this particular incarnation of the volume renderer (it remains the only piece of tech I’m keeping around that is more or less unmodified from its original CIS460/560 state). I think the next time I revisit the volume renderer, I’m either going to port it entirely to CUDA, as my good friend <a href="https://vimeo.com/user6054073">Adam Mally</a> did with his, or I’m going to integrate it into my renderer project, <a href="http://peterkutz.com/">Peter Kutz</a> style.</p>
https://blog.yiningkarlli.com/2012/08/more-experiments-with-trees.html
More Experiments with Trees
2012-08-16T00:00:00+00:00
2012-08-16T00:00:00+00:00
Yining Karl Li
<p>Every once in a while, I <a href="http://blog.yiningkarlli.com/2011/03/autumn-tree.html">return</a> to <a href="http://blog.yiningkarlli.com/2011/03/vray-tree.html">trying</a> to make a good looking tree. Here’s a frame from my latest attempt:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/leaves.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/leaves.png" alt="" /></a></p>
<p>Have I finally managed to create a tree that I’m happy with? Well….. no. But I do think this batch comes closer than previous attempts! I’ve had a workflow for creating base tree geometry for a while now that I’m fairly pleased with, which is centered around using OnyxTREE as a starting point and then custom sculpting in Maya and Mudbox. However, I haven’t tried actually animating trees before, and shading trees properly has remained a challenge. So, my goal this time around was to see if I could make any progress in animating and shading trees.</p>
<p>As a starting point, I played with just using the built in wind simulation tools in OnyxTREE, which was admittedly difficult to control. I found that having medium to high windspeeds usually led to random branches glitching out and jumping all over the place. I also wanted to make a weeping willow style tree, and even medium-low windspeeds often resulted in hilarious results like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/crazywind.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/crazywind.png" alt="It turns out OnyxTREE runs fine in Wine on OSX. Huh" /></a></p>
<p>A bigger problem though was the sheer amount of storage space exporting animated tree sequences from Onyx to Maya requires. The only way to bring Onyx simulations into programs that aren’t 3ds Max is to export the simulation as an obj sequence from Onyx and then import the sequence into whatever program. Maya doesn’t have a native method to import obj sequences, so I wrote a custom Python script to take care of it for me. Here’s a short compilation of some results:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53572074" frameborder="0">Windy Tree Maya Tests</iframe></div>
<p>One important thing I discovered was that the vertex numbering in each obj frame exported from Onyx remains consistent; this fact allowed for an important improvement. Instead of storing a gazillion individual frames of obj meshes, I experimented with dropping a large number of intermediate frames and leaving a relatively smaller number of keyframes which I then used as blendshape frames with more scripting hackery. This method works rather well; in the above video, the weeping willow at the end uses this approach. There is, however, a significant flaw with this entire Onyx based animation workflow: geometry clipping. Onyx’s system does not resolve cases where leaves and entire branches clip through each other… while from a distance the trees look fine, up close the clipping can become quite apparent. For this reason, I’m thinking about abandoning the Onyx approach altogether down the line and perhaps experimenting with building my own tree rigs and procedurally animating them. That’s a project for another day, however.</p>
<p>On the shading front, my basic approach is still the same: use a Vray double sided material with a waxier, more specular shader for the “front” of the leaves and a more diffuse shader for the “back”. In real life, leaves of course display an enormous amount of subsurface scattering, but leaves are a special case for subsurface scatter: they’re really really thin! Normally subsurface scattering is a rather expensive effect to render, but for thin material cases, the Vray double sided material can quite efficiently approximate the subsurface effect for a fraction of the rendertime.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/doublesidedmat.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/doublesidedmat.png" alt="" /></a></p>
<p>Bark is fairly straightforward too; it all comes down to the displacement and bump mapping. Unfortunately, the limbs in the tree models I made this time around were straight because I forgot to go in and vary them up/sculpt them. Because of the straightness, my tree twigs don’t look very good this time, even with a decent shader. Must remember for next time! Creating displacement bark maps from photographs or images sourced from Google Image Search or whatever is really simple; take your color texture into Photoshop, slam it to black and white, and adjust contrast as necessary:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/barkmaps.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/barkmaps.png" alt="" /></a></p>
<p>Here’s a few seconds of rendered output with the camera inside of the tree’s leaf canopy, pointed skyward. It’s not exactly totally realistic looking, meaning it needs more work of course, but I do like the green-ness of the whole thing. More importantly, you can see the subsurface effect on the leaves from the double sided material!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53569412" frameborder="0">Windy Tree Render Test</iframe></div>
<p>Something that continues to prove challenging is how my shaders hold up at various distances. The same exact shader (with a different leaf texture) looks great from a distance, but loses realism when the camera is closer. I did a test render of the weeping willow from further away using the same shader, and it looks a lot better. Still not perfect, but closer than previous attempts:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53569411" frameborder="0">Willow Wind Test</iframe></div>
<p>…and of course, a pretty still or two:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/willow1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/willow1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/willow2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/willow2.png" alt="" /></a></p>
<p>A fun experiment I tried was building a shader that can imitate the color change that occurs as fall comes around. This shader is in no way physically based, it’s using just a pure mix function controlled through keyframes. Here’s a quick test showing the result:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/46474571" frameborder="0">Tree Color Test</iframe></div>
<p>Eventually building a physically based leaf BSSRDF system might be a fun project for my own renderer. Speaking of which, I couldn’t resist throwing the weeping willow model through my KD-tree library to get a tree KD-tree:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/53546737" frameborder="0">Tree KD-Tree</iframe></div>
<p>Since the Vimeo compression kind of borks thin lines, here’s a few stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/kd1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/kd1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Aug/kd2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Aug/kd2.png" alt="" /></a></p>
<p>Alright, that’s all for this time! I will most likely return to trees yet again perhaps a few weeks or months from now, but for now, much has been learned!</p>
https://blog.yiningkarlli.com/2012/07/random-point-sampling-on-surfaces.html
Random Point Sampling On Surfaces
2012-07-14T00:00:00+00:00
2012-07-14T00:00:00+00:00
Yining Karl Li
<p>Just a heads up, this post is admittedly more of a brain dump for myself than it is anything else.</p>
<p>A while back I implemented a couple of fast methods to generate random points on geometry surfaces, which will be useful for a number of applications, such as direct lighting calculations involving area lights.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint0.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint0.png" alt="" /></a></p>
<p>The way I’m sampling random points varies by geometry type, but all methods are pretty simple. Right now the system is implemented such that I can give the renderer a global point density to follow, and points will be generated according to that density value. This means the number of points generated on each piece of geometry is directly linked to the geometry’s surface area.</p>
<p>For spheres, the method I use is super simple: get the surface area of the sphere, generate random UV coordinates, and map those coordinates back to the surface of the sphere. This method is directly pulled from <a href="http://mathworld.wolfram.com/SpherePointPicking.html">this Wolfram Mathworld page</a>, which also describes why the most naive approach to point picking on a sphere is actually wrong.</p>
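<p>For reference, here’s a minimal sketch of that sphere point picking method (hypothetical code, with the sphere given as a center and radius); the acos term is what keeps points from bunching up at the poles:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>
#include <random>

// Pick a uniformly distributed random point on the surface of a sphere.
void randomPointOnSphere(const float center[3], float radius, std::mt19937& rng,
                         float outPoint[3]) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    const float u = uniform(rng);
    const float v = uniform(rng);
    const float theta = 2.0f * 3.14159265f * u;     // longitude
    const float phi = std::acos(2.0f * v - 1.0f);   // latitude, area preserving
    outPoint[0] = center[0] + radius * std::sin(phi) * std::cos(theta);
    outPoint[1] = center[1] + radius * std::sin(phi) * std::sin(theta);
    outPoint[2] = center[2] + radius * std::cos(phi);
}
</code></pre></div></div>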
<p>My approach for ellipsoids unfortunately is a bit brute force. Since getting the exact surface area of an ellipsoid is fairly mathematically tricky, I just <a href="http://en.wikipedia.org/wiki/Ellipsoid#Approximate_formula">approximate it</a> and then use plain old rejection sampling to get a point.</p>
<p>Boxes are the easiest of the bunch; find the surface area of each face, randomly select a face weighted by the proportion of the total surface area that face comprises, and then pick a random x and y coordinate on that face. The method I use for meshes is similar, just on potentially a larger scale: find the surface area of all of the faces in the mesh and select a face randomly weighted by the face’s proportion of the total surface area. Then instead of generating random cartesian coordinates, I generate a random barycentric coordinate, and I’m done.</p>
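<p>A minimal sketch of that mesh case follows (hypothetical code; the sqrt term is one standard way to keep the barycentric sample uniformly distributed over the triangle’s area):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>
#include <random>
#include <vector>

struct BarycentricSample {
    int faceIndex;
    float b0, b1, b2;
};

// Pick a face with probability proportional to its area, then pick a uniformly
// distributed barycentric coordinate on that face.
BarycentricSample samplePointOnMesh(const std::vector<float>& faceAreas,
                                    std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    float totalArea = 0.0f;
    for (float area : faceAreas) totalArea += area;
    float pick = uniform(rng) * totalArea;
    int faceIndex = 0;
    for (; faceIndex < static_cast<int>(faceAreas.size()) - 1; ++faceIndex) {
        pick -= faceAreas[faceIndex];
        if (pick <= 0.0f) break;
    }
    const float r1 = std::sqrt(uniform(rng));
    const float r2 = uniform(rng);
    return {faceIndex, 1.0f - r1, r1 * (1.0f - r2), r1 * r2};
}
</code></pre></div></div>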
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint3.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint3.png" alt="" /></a></p>
<p>The method that I’m using right now is purely random, so there’s no guarantee of equal spacing between points initially. Of course, as one picks more and more points, the spacing between any given set of points will converge on something like equally spaced, but that would take a lot of random points. I’ve been looking at this <a href="http://peterwonka.net/Publications/2009.EGF.Cline.PoissonOnSurfaces.pdf">Dart Throwing On Surfaces Paper</a> for ideas, but at least for now, this solution should work well enough for what I want it for (direct lighting). But we shall see!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint4.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/randpoint4.png" alt="" /></a></p>
<p>Also, as I am sure you can guess from the window chrome on that last screenshot, I’ve successfully tested Takua Render on Linux! Specifically, on Fedora!</p>
https://blog.yiningkarlli.com/2012/07/thoughts-on-ray-bounce-depth.html
Thoughts on Ray Bounce Depth
2012-07-05T00:00:00+00:00
2012-07-05T00:00:00+00:00
Yining Karl Li
<p>I finally got around to doing a long overdue piece of analysis on Takua Render: looking at the impact of ray bounce depth on performance and on the final image.</p>
<p>Of course, in real life, light can bounce around (almost) indefinitely before it is either totally absorbed or enters our eyeballs. Unfortunately, simulating this behavior completely is extremely difficult in any type of raytracing solution, because letting a ray bounce around indefinitely until it does something interesting can lead to extremely long render times. Thus, one of the first shortcuts that most raytracing (and therefore pathtracing) systems take is cutting off rays after they bounce a certain number of times. This strategy should not have much of an impact on the final visual quality of a rendered image, since the more a light ray bounces around, the less each successive bounce contributes to the final image anyway.</p>
<p>With that in mind, I did some tests with Takua Render in hopes of finding a good balance between ray bounce depth and quality/speed. The following images of a glossy white sphere in a Cornell Box were rendered on a quad-core 2.5 GHz Core i5 machine.</p>
<p>For a reference, I started with a render with a maximum ray bounce depth of 50 and 200 samples per pixel:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_50_1325.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_50_1325.png" alt="Max Bounce Depth of 50, 200 iterations, took 1325 seconds to render." /></a></p>
<p>Then I ran a test render with a maximum of just 2 bounces; essentially, this represents the direct lighting part of the solution only, albeit generated in a Monte Carlo fashion. Since I made the entire global limit 2 bounces, no reflections of the walls show up on the sphere, just the light overhead. Note the total lack of color bleeding and the dark shadow under the ball.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_02_0480.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_02_0480.png" alt="Max Bounce Depth of 2, 200 iterations, took 480 seconds to render." /></a></p>
<p>The next test was with a maximum of 5 bounces. In this test, nice effects like color bleeding and indirect illumination are back! However, compared to the reference render, the area under the sphere still has a bit of dark shadowing, much like what one would expect if an ambient occlusion pass had been added to the image. While not totally accurate to the reference render, this image under certain artistic guidelines might actually be acceptable, and renders considerably faster.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_05_0811.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_05_0811.png" alt="Max Bounce Depth of 5, 200 iterations, took 811 seconds to render." /></a></p>
<p>Differencing the 5 bounce render from the reference 50 bounce render shows that the 5 bounce one is ever so slightly dimmer and that most of the difference between the two images is in the shadow area under the sphere. Ignore the random fireflying pixels, which are just a result of standard pathtracing variance in the renders:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/05-50_diff.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/05-50_diff.png" alt="5 bounce test differenced with the 50 bounce reference." /></a></p>
<p>The next test was 10 bounces. At 10 bounces, the resultant image is essentially visually indistinguishable from the 50 bounce reference, as shown by the differenced image included. This result implies that beyond 10 bounces, the contributions of successive bounces to the final image are more or less negligible.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_10_0995.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_10_0995.png" alt="Max Bounce Depth of 10, 200 iterations, took 995 seconds to render." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/10-50_diff.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/10-50_diff.png" alt="10 bounce test differenced with the 50 bounce reference. Note that there is essentially no difference." /></a></p>
<p>Finally, a test with a maximum of 20 bounces is still essentially indistinguishable from both the 10 bounce test and the 50 bounce reference:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_20_1277.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jul/depthtest_20_1277.png" alt="Max Bounce Depth of 20, 200 iterations, took 1277 seconds to render." /></a></p>
<p>Interestingly, render times do not scale linearly with maximum bounce depth! The reason for this relationship (or lack thereof) can be found in the fact that the longer a ray bounces around, the more likely it is to find a light source and terminate. At 20 bounces, the odds of a ray finding a light source is very very close to the odds of a ray finding a light source at 50 bounces, explaining the smallness of the gap in render time between 20 and 50 bounces (especially compared to the difference in render time between, say, 2 and 5 bounces).</p>
https://blog.yiningkarlli.com/2012/06/more-kd-tree-fun.html
More KD-Tree Fun
2012-06-16T00:00:00+00:00
2012-06-16T00:00:00+00:00
Yining Karl Li
<p>Lately progress on my <a href="http://www.yiningkarlli.com/projects/takuarender">Takua Render</a> project has slowed down a bit, since over this summer I am interning at <a href="http://www.dreamworksanimation.com/">Dreamworks Animation</a> during weekdays. However, in the evenings and on weekends I am still working on stuff!</p>
<p>Something that I never got around to doing for no particularly good reason was visualizing my KD-tree implementation. As a result, while I’ve known for a long time that my KD-tree is suboptimal, I haven’t actually been able to quickly determine to what degree it is inefficient. However, since I now have a number of OpenGL based diagnostic views for Takua Render, I figured I no longer had a good excuse to not visualize my KD-tree. So last night I did just that! Here is what I got for the Stanford Dragon:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jun/kd1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jun/kd1.png" alt="" /></a></p>
<p>Just as I suspected, my KD-tree implementation was far from perfect. Some rough statistics I had my renderer output told me that even with the KD-tree, the renderer was still performing hundreds to even thousands of intersection tests against meshes. The above image explains why: each of those KD-tree leaf nodes is enormous, and therefore contains an enormous number of objects!</p>
<p>Fortunately, after a bit of tinkering, I discovered that there’s nothing actually wrong with the KD-tree implementation itself. Instead, the sparseness of the tree is coming from how I tuned the tree building operation. With a bit of tinkering, I managed to get a fairly improved tree:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jun/kd2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jun/kd2.png" alt="" /></a></p>
<p>…and with a bit more tuning and playing with maximum recursion depths:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jun/kd3.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jun/kd3.png" alt="" /></a></p>
<p>Previously, my KD-tree construction routine terminated based only on a maximum recursion depth; after the tree reached a certain height, the construction would stop. I’ve now modified the construction routine to use three separate criteria: a maximum recursion depth, minimum node bounding box volume, and a minimum number of objects per node. If any node meets any of the above three conditions, it is turned into a leaf node. As a result, I can now get extremely dense KD-trees that only have on average a low-single-digit number of objects per leaf node, as opposed to the average hundreds of objects per leaf node before:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Jun/kd4.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Jun/kd4.png" alt="" /></a></p>
<p>In theory, this improvement should allow for a fairly significant speedup, since the number of intersections per mesh should now be dramatically lower, leading to much higher ray throughput! I’m currently running some benchmarks to determine just how much of a performance boost better KD-trees will give me, and I’ll post about those results soon!</p>
https://blog.yiningkarlli.com/2012/05/subsurface-scattering-and-new-name.html
Subsurface Scattering and New Name
2012-05-20T00:00:00+00:00
2012-05-20T00:00:00+00:00
Yining Karl Li
<p>I implemented subsurface scattering in my renderer!</p>
<p>Here’s a Stanford Dragon in a totally empty environment with just one light source providing illumination. The dragon is made up of a translucent purple jelly-like material, showing off the subsurface scattering effect:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/dragonsss_bright.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/dragonsss_bright.png" alt="" /></a></p>
<p><a href="http://en.wikipedia.org/wiki/Subsurface_scattering">Subsurface scattering</a> is an important behavior that light exhibits upon hitting some translucent materials; normal transmissive materials will simply transport light through the material and out the other side, but subsurface scattering materials will attenuate and scatter light before releasing the light somewhere not necessarily along a line from the entry point. This is what gives skin and translucent fruit and marble and a whole host of other materials their distinctive look.</p>
<p>There are currently a whole host of methods to rapidly approximate subsurface scattering, including some screen-space techniques that are actually fast enough for use in realtime renderers. However, my implementation at the moment is purely brute-force Monte Carlo; while extremely physically accurate, it is also very very slow. In my implementation, when a ray enters a subsurface scattering material, I generate a random scatter direction via isotropic scattering, and then attenuate the accumulated light based on an absorption coefficient defined for the material. This approach is very similar to the one taken by <a href="http://peterkutz.com/">Peter</a> and me in our <a href="http://www.blogger.com/gpupathtracer.blogspot.com">GPU pathtracer</a>.</p>
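<p>For concreteness, here’s a minimal sketch of that brute-force step (hypothetical code; the exponential attenuation model and the names are illustrative): once a path is inside the material, each scattering event picks a uniformly random new direction and attenuates the path throughput based on the distance travelled and the absorption coefficient.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>
#include <random>

// One isotropic scattering event inside a subsurface scattering material.
void scatterInsideMedium(float direction[3], float throughput[3],
                         const float absorptionCoefficient[3],
                         float distanceTravelled, std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    // Isotropic scattering: pick a uniformly random direction on the unit sphere.
    const float z = 2.0f * uniform(rng) - 1.0f;
    const float phi = 2.0f * 3.14159265f * uniform(rng);
    const float r = std::sqrt(1.0f - z * z);
    direction[0] = r * std::cos(phi);
    direction[1] = r * std::sin(phi);
    direction[2] = z;
    // Attenuate each color channel based on how far the ray travelled inside.
    for (int i = 0; i < 3; ++i) {
        throughput[i] *= std::exp(-absorptionCoefficient[i] * distanceTravelled);
    }
}
</code></pre></div></div>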
<p>At some point in the future I might try out a faster approximation method, but for the time being, I’m pretty happy with the visual result that brute-force Monte Carlo scattering produces.</p>
<p>Here’s the same subsurface scattering dragon from above, but now in the Cornell Box. Note the cool colored soft shadows beneath the dragon:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/subsurfacetest.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/subsurfacetest.png" alt="" /></a></p>
<p>Also, I’ve finally settled on a name for my renderer project: Takua Render! So, that is what I shall be calling my renderer from now on!</p>
https://blog.yiningkarlli.com/2012/05/more-fun-with-jello.html
More Fun with Jello
2012-05-05T02:00:00+00:00
2012-05-05T02:00:00+00:00
Yining Karl Li
<p>At <a href="http://www.graphics.cornell.edu/~kiderj/">Joe</a>’s request, I made another jello video! Joe suggested I make a video that shows the simulation both in the actual simulator’s GL view, and rendered out from Maya, so this video does just that. The starting portion of the video shows what the simulation looks like in the simulator GL view, and then shifts to the final render (done with Vray, my pathtracer still is not ready yet!). The GL and final render views don’t quite line up with each other perfectly, but its close enough that you get the idea.</p>
<p>There is a slight change in the tech involved too: I’ve upgraded my jello simulator’s spring array so that simulations should be more stable now. The change isn’t terribly dramatic; all I did was add in more bend and shear springs, so jello cubes now “try” harder to return to a perfect cube shape.</p>
<p>This video is making use of my Vray white-backdrop studio setup! The pitcher was just a quick 5 minute model, nothing terribly interesting there.</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/41545296" frameborder="0">Fun with Jello</iframe></div>
<p>…and of course, some stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_01.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_01.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_02.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_02.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_03.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_03.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_04.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_04.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_05.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_05.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/jello_06.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/jello_06.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/05/smoke-sim-volumetric-renderer.html
Smoke Sim + Volumetric Renderer
2012-05-05T01:00:00+00:00
2012-05-05T01:00:00+00:00
Yining Karl Li
<p>Something I’ve had on my list of things to do for a few weeks now is mashing up my <a href="http://blog.yiningkarlli.com/2011/10/a-volumetric-renderer-for-rendering-volumes.html">volumetric renderer</a> from CIS460 with my <a href="http://blog.yiningkarlli.com/2012/03/smoke-sim-preconditioning-and-huge.html">smoke simulator</a> from CIS563.</p>
<p>Now I can cross that off of my list! Here is a 100x100x100 grid smoke simulation rendered out with pseudo Monte-Carlo black body lighting (described in my volumetric renderer post):</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/41543438" frameborder="0">Smoke Simulator Pseudo-Blackbody Test</iframe></div>
<p>The actual approach I took to integrating the two was to simply pipeline them instead of actually merging the codebases. I added a small extension to the smoke simulator that lets it output the smoke grid to the same voxel file format that the volumetric renderer reads in, and then wrote a small Python script that just iterates over all voxel files in a folder and calls the volumetric renderer over and over again.</p>
<p>I’m actually not entirely happy with the render… I don’t think I picked very good settings for the pseudo-black body, so a lot of the render is overexposed and too bright. I’ll probably tinker with that some later and re-render the whole thing, but before I do that I want to move the volumetric renderer onto the GPU with CUDA. Even with multithreading via OpenMP, the rendertimes per frame are still too high for my liking… Anyway, here are some stills!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/smoke_vr1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/smoke_vr1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/May/smoke_vr2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/May/smoke_vr2.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/04/april-23rd-cis565-progress-summary.html
April 23rd CIS565 Progress Summary- Speed and Refraction
2012-04-23T00:00:00+00:00
2012-04-23T00:00:00+00:00
Yining Karl Li
<p>This post is the third update for the GPU Pathtracer project Peter and I are working on!</p>
<p>Over the past few weeks, the GPU Pathtracer has gained two huge improvements: refraction and major speed gains! In just 15 seconds on Peter’s NVIDIA GTX530 (on a more powerful card in the lab, we get even better speeds), we can now get something like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Apr/JustFifteenSeconds.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Apr/JustFifteenSeconds.png" alt="" /></a></p>
<p>Admittedly Peter has been contributing more interesting code than I have, which makes sense since in this project Peter is clearly the veteran rendering expert and I am the newcomer. But, I am learning a lot, and Peter is getting more cool stuff done since I can get other stuff done and out of the way!</p>
<p>The posts for this update are:</p>
<ol>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/thirty-seconds.html">Performance Optimization</a>: Speed boosts through zero-weight ray elimination</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/cool-error-render.html">Cool Error Render</a>: Fun debug images from getting refraction to work</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/transmission.html">Transmission</a>: Glass spheres!</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/convergence.html">Fast Convergence</a>: Tricks for getting more raw speed</li>
</ol>
<p>As always, check the posts for details and images!</p>
https://blog.yiningkarlli.com/2012/04/april-14th-cis563-progress-summary.html
April 14th CIS563 Progress Summary- Meshes and Meshes and Meshes
2012-04-14T00:00:00+00:00
2012-04-14T00:00:00+00:00
Yining Karl Li
<p>This post is the second update for the <a href="http://chocolatefudgesyrup.blogspot.com/">MultiFluids project</a>!</p>
<p>The past week for Dan and me has been all about meshes: mesh loading, mesh interactions, and mesh reconstruction! We integrated an OBJ to Signed Distance Field converter, which allowed us to then implement liquid-against-mesh interactions and use meshes to define starting liquid positions. We also figured out how to run marching cubes on signed distance fields, allowing us to export OBJ mesh sequences of our fluid simulation and bring our sims into Maya for rendering!</p>
<p>Here is a really cool render from this week:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Apr/reddragon.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Apr/reddragon.png" alt="" /></a></p>
<p>The posts for this week are:</p>
<ol>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/04/surface-reconstruction-via-marching.html">Surface Reconstruction via Marching Cubes</a>: Level set goes in, OBJ comes out</li>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/04/mesh-interactions.html">Mesh Interactions</a>: Using meshes as interactable objects</li>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/04/meshes-as-starting-liquid-volumes-and.html">Meshes as Starting Liquid Volumes and Maya Integration</a>: Cool tests with a liquid Stanford Dragon</li>
</ol>
<p>Check out the posts for details, images, and videos!</p>
https://blog.yiningkarlli.com/2012/04/april-5th-cis565-progress-summary.html
April 5th CIS565 Progress Summary- Interactivity, Alpha Review, Fresnel Reflections, Antialiasing
2012-04-05T00:00:00+00:00
2012-04-05T00:00:00+00:00
Yining Karl Li
<p>This post is the second update for the <a href="http://gpupathtracer.blogspot.com/">GPU Pathtracer project</a>!</p>
<p>Since the last update, Peter and I added an interactive camera to the renderer to allow realtime movement around the scene! We also had our Alpha Review, which went quite well, and Peter implemented a reflection model. Initially the reflection model used was <a href="http://en.wikipedia.org/wiki/Schlick's_approximation">Schlick’s Approximation</a>, but later Peter replaced that with the full <a href="http://en.wikipedia.org/wiki/Fresnel_equations">Fresnel equations</a>. I also added super-sampled anti-aliasing for a smoother image.</p>
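<p>For reference, Schlick’s Approximation itself is tiny; it estimates the Fresnel reflectance from the two indices of refraction and the cosine of the incident angle. A generic sketch (not our project’s actual code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Schlick's approximation of the Fresnel reflectance.
// n1 and n2 are the indices of refraction on either side of the interface;
// cosTheta is the cosine of the angle between the incident ray and the normal.
float schlickReflectance(float n1, float n2, float cosTheta) {
    float r0 = (n1 - n2) / (n1 + n2);
    r0 = r0 * r0;                          // reflectance at normal incidence
    float m = 1.0f - cosTheta;
    return r0 + (1.0f - r0) * m * m * m * m * m;   // r0 + (1 - r0) * (1 - cosTheta)^5
}
</code></pre></div></div>
<p>The full Fresnel equations compute the exact reflectance for each polarization instead of this single approximate curve.</p>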
<p>The posts for this update:</p>
<ol>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/interactivity-and-moveable-camera.html">Interactivity and Moveable Camera</a>: We can move around the scene!</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/alpha-review-presentation.html">Alpha Review Presentation</a>: Slides and other stuff from our Alpha Review</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/specular-reflection-test.html">Specular Reflection Test</a>: The first test with Shlick’s Approximation</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/fresnel-reflections.html">Fresnel Reflections</a>: Some details on our reflection model</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/abstract-art.html">Abstract Art</a>: Some fun buggy renders Peter produced while debugging</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/anti-aliasing.html">Anti-Aliasing</a>: Super-sampled anti-aliasing!</li>
</ol>
<p>A nice image from the last post:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Apr/GreenBallAntiAliasing.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Apr/GreenBallAntiAliasing.png" alt="" /></a></p>
<p>Check the posts for tons of details, images, and even some video!</p>
https://blog.yiningkarlli.com/2012/04/april-1st-cis563-progress-summary.html
April 1st CIS563 Progress Summary- Framework Improvements and Bounding Volumes
2012-04-01T02:00:00+00:00
2012-04-01T02:00:00+00:00
Yining Karl Li
<p>Here’s the first progress update/blog digest for the <a href="http://chocolatefudgesyrup.blogspot.com/">MultiFluids project</a>!</p>
<p>Dan and I started by taking our starting framework and tearing it down to its core. We then rebuilt the base code up with our own custom additions, leaving just the core solver intact. From there, we started building some of the basic features our project will require!</p>
<p>Here are the posts for this update:</p>
<ol>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/03/framework-improvements-and-particles.html">Framework Improvements and Particles with Properties</a>: Tearing the base code down to the ground and rebuilding it better, faster, and with more features</li>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/03/bounding-volumes-lesson-1-dont-just.html">Bounding Volumes & Lesson 1: Don’t just assume base code is perfect</a>: Dan discovers some flaws in the base code!</li>
<li><a href="http://chocolatefudgesyrup.blogspot.com/2012/04/multiple-arbitrary-bounding-volumes.html">Multiple Arbitrary Bounding Volumes</a>: All-important object interaction</li>
</ol>
<p>A frame from one of our test videos:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Apr/sphereinbox.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Apr/sphereinbox.png" alt="" /></a></p>
<p>Check the posts for details and videos!</p>
https://blog.yiningkarlli.com/2012/04/april-1st-cis565-progress-summary.html
April 1st CIS565 Progress Summary- Camera and Pathtracing
2012-04-01T01:00:00+00:00
2012-04-01T01:00:00+00:00
Yining Karl Li
<p>Here’s the first progress summary/blog digest for the <a href="http://gpupathtracer.blogspot.com/">GPU Pathtracer project</a>!</p>
<p>Over the past few days, Peter and I established our framework, got random number generation working on the GPU, built an accumulator, figured out parallelized camera ray projection, got spherical intersection tests working, and got a basic path-traced image!</p>
<p>Here are the posts for this update:</p>
<ol>
<li><a href="http://gpupathtracer.blogspot.com/2012/04/random-number-generator.html">Random Number Generation</a>: Fun with parallelized random number generators and seeding</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/03/first-rays-on-gpu.html">First Rays on the GPU</a>: Parallel raycasting!</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/03/accumulating-iterations.html">Accumulating Iterations</a>: The heart of any monte-carlo based renderer</li>
<li><a href="http://gpupathtracer.blogspot.com/2012/03/we-have-path-tracing.html">We Have Path Tracing</a>: First working renders!</li>
</ol>
<p>Here’s an image from our very first working render! More soon!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Apr/First.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Apr/First.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/03/cis563cis565-final-project-github-repos.html
CIS563/CIS565 Final Project Github Repos!
2012-03-27T00:00:00+00:00
2012-03-27T00:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/octocat.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/octocat.png" alt="" /></a></p>
<p>For both <a href="http://chocolatefudgesyrup.blogspot.com/">MultiFluids</a> and the <a href="http://gpupathtracer.blogspot.com/">GPU Pathtracer</a>, we will be making our source code accessible to all on Github!</p>
<p>Of course, commercial coding projects and whatnot have very good reasons for keeping their source code locked down and proprietary, but open source is something I very strongly believe in. Open code allows other people to see what one does and give feedback and suggestions for improvement, and also allows other people interested in similar projects to potentially learn from and build off of it. Everybody wins!</p>
<p>The MultiFluids repository can be found here: <a href="https://github.com/betajippity/MultiFluids">https://github.com/betajippity/MultiFluids</a></p>
<p>The GPU Pathtracer repository can be found here: <a href="https://github.com/peterkutz/GPUPathTracer/">https://github.com/peterkutz/GPUPathTracer/</a></p>
<p>…and of course, the relevant blog posts:</p>
<p>GPU Pathtracer: <a href="http://gpupathtracer.blogspot.com/2012/03/github-repository.html">http://gpupathtracer.blogspot.com/2012/03/github-repository.html</a></p>
<p>MultiFluids: <a href="http://chocolatefudgesyrup.blogspot.com/2012/03/github-and-windowsosx.html">http://chocolatefudgesyrup.blogspot.com/2012/03/github-and-windowsosx.html</a></p>
https://blog.yiningkarlli.com/2012/03/cis563cis565-final-projects-multiple.html
CIS563/CIS565 Final Projects- Multiple Interacting Fluids and GPU Pathtracing
2012-03-19T00:00:00+00:00
2012-03-19T00:00:00+00:00
Yining Karl Li
<p>Over the next month and a half, I will be working on a pair of final projects for two of my classes, CIS565 (<a href="http://cis565-spring-2012.github.com/">GPU Programming</a>, taught by <a href="http://www.seas.upenn.edu/~pcozzi/">Patrick Cozzi</a>), and CIS563 (<a href="http://www.seas.upenn.edu/~cis563/">Physically Based Animation</a>, taught by <a href="http://www.graphics.cornell.edu/~kiderj/">Joe Kider</a>).</p>
<p>For CIS563, I will be teaming up with my fellow classmate and good friend <a href="http://www.danknowlton.com/">Dan Knowlton</a> to develop a liquid fluid simulator capable of simulating multiple fluids interacting against each other. Dan is without a doubt one of the best in our class and easily my equal or superior in all things graphics, so working with him should be a lot of fun. Our project is going to be based primarily on the paper <a href="http://dl.acm.org/citation.cfm?id=1141960">Multiple Interacting Fluids</a> by Losasso et al., and as a starting point we will be using <a href="http://www.cs.columbia.edu/~batty/">Chris Batty</a>’s <a href="https://github.com/christopherbatty/Fluid3D">Fluid 3D</a> framework.</p>
<p>For CIS565, I will be working with my fellow Pixarian and friend <a href="http://peterkutz.com/">Peter Kutz</a>, who is somewhat of a physically based rendering titan at Penn. Working with Peter should be a very interesting and exciting learning experience. Peter and I will be developing a CUDA based GPU Pathtracer with the goal of generating convincing photorealistic images extremely rapidly. We will be developing our GPU pathtracer from scratch, although we will obviously draw inspiration from both Peter’s <a href="http://photorealizer.blogspot.com/">Photorealizer</a> project and my own CPU pathtracer project.</p>
<p>For both projects, we will be keeping blogs where we will post development updates, so I won’t post too much about development details to this here personal blog. Instead, I’m thinking about posting a weekly digest of progress on both projects with links to interesting highlights on the project blogs.</p>
<p>Dan and I will be blogging at <a href="http://chocolatefudgesyrup.blogspot.com/">http://chocolatefudgesyrup.blogspot.com/</a>. We’ve titled our project “Chocolate Syrup” for two reasons: firstly, Dan likes to codename his project with types of confectionaries, and secondly, chocolate syrup is one type of highly viscous fluid we aim for our simulator to be able to handle!</p>
<p>Peter and I will be blogging at <a href="http://gpupathtracer.blogspot.com/">http://gpupathtracer.blogspot.com/</a>. For now we have decided to call our project “Peter and Karl’s GPU Pathtracer”, for obvious reasons.</p>
<p>Details for each project can be found in the first post of each blog, which are the project proposals.</p>
<p>Multiple Interacting Fluids Proposal: <a href="http://chocolatefudgesyrup.blogspot.com/2012/03/project-proposal.html">http://chocolatefudgesyrup.blogspot.com/2012/03/project-proposal.html</a></p>
<p>GPU Pathtracer Proposal: <a href="http://gpupathtracer.blogspot.com/2012/03/project-proposal.html">http://gpupathtracer.blogspot.com/2012/03/project-proposal.html</a></p>
<p>Both of these projects should be very very cool, and I’ll be posting often to both development blogs!</p>
https://blog.yiningkarlli.com/2012/03/pathtracer-with-kd-tree.html
Pathtracer with KD-Tree
2012-03-12T00:00:00+00:00
2012-03-12T00:00:00+00:00
Yining Karl Li
<p>I have finished my KD-Tree rewrite! My new KD-Tree implements the Surface-Area Heuristic for finding optimal splitting planes, and stops splitting once a node has either reached a certain sufficiently small surface area, or has a sufficiently small number of elements contained within itself. Basically, very standard KD-Tree stuff, but this time, properly implemented. As a result, I can now render meshes much quicker than before.</p>
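<p>The Surface-Area Heuristic itself reduces to a simple expected-cost estimate for each candidate split: the cost of traversing the node, plus the cost of intersecting the primitives in each child, weighted by the probability that a ray passing through the parent also passes through that child (which is proportional to the child’s surface area). Here is a generic sketch of that cost function, with made-up constants rather than my renderer’s actual values:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Expected cost of a candidate split under the Surface-Area Heuristic.
// The cost constants and function name are illustrative placeholders.
const float TRAVERSAL_COST    = 1.0f;
const float INTERSECTION_COST = 4.0f;

float sahSplitCost(float parentSurfaceArea,
                   float leftSurfaceArea,  int leftObjectCount,
                   float rightSurfaceArea, int rightObjectCount) {
    // Probability of a ray hitting each child, given that it already hits the
    // parent, is proportional to the ratio of the surface areas.
    float pLeft  = leftSurfaceArea  / parentSurfaceArea;
    float pRight = rightSurfaceArea / parentSurfaceArea;
    return TRAVERSAL_COST +
           INTERSECTION_COST * (pLeft * leftObjectCount + pRight * rightObjectCount);
}

// The split with the lowest cost wins; if no split beats the cost of simply
// intersecting everything in the node (INTERSECTION_COST * totalObjectCount),
// the node is made a leaf.
</code></pre></div></div>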
<p>Here’s a cow in a Cornell Box. Each iteration of the cow render took about 3 minutes, which is a huge improvement over my old raytracer, but still leaves a lot of room for improvement:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/bovinetest.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/bovinetest.png" alt="" /></a></p>
<p>…and of course, the obligatory Stanford Dragon test. Each iteration took about 4 minutes for both of these images (the second one I let converge for a bit longer than the first one), and I made these renders a bit larger than the cow one:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/dragon2.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/dragon2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/dragon1.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/dragon1.png" alt="" /></a></p>
<p>So! Of course the KD-Tree could still use even more work, but for now it works well enough that I think I’m going to start focusing on other things, such as more interesting BSDFs and other performance enhancements.</p>
https://blog.yiningkarlli.com/2012/03/first-pathtraced-image.html
First Pathtraced Image!
2012-03-11T00:00:00+00:00
2012-03-11T00:00:00+00:00
Yining Karl Li
<p>Behold, the very first image produced using my pathtracer!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/frame_3.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/frame_3.png" alt="" /></a></p>
<p>Granted, the actual image is not terribly interesting- just a cube inside of a standard <a href="http://en.wikipedia.org/wiki/Cornell_Box">Cornell box</a> type setup, but it was rendered entirely using my own pathtracer! Aside from being converted from a BMP file to a PNG, this render has not been modified in any way whatsoever outside of my renderer (I have yet to name it). This render is the result of a thousand iterations. Here are some comparisons of the variance in the render at various iteration levels (click through to the full size versions to get an actual sense of the variance levels):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/pass0-15.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/pass0-15.png" alt="Upper Left: 1 iteration. Upper Right: 5 iterations. Lower Left: 10 iterations. Lower Right: 15 iterations." /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/pass0-750.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/pass0-750.png" alt="Upper Left: 1 iteration. Upper Right: 250 iterations. Lower Left: 500 iterations. Lower Right: 750 iterations." /></a></p>
<p>Each iteration took about 15 seconds to finish.</p>
<p>Unfortunately, I have not been able to move as quickly with this project as I would like, due to other schoolwork and TAing for CIS277. Nonetheless, here’s where I am right now:</p>
<p>Currently the renderer is in a very very basic primitive state. Instead of extending my raytracer, I’ve opted for a completely from scratch start. The only piece of code brought over from the raytracer was the OBJ mesh system I wrote, since that was written to be fairly modular anyway. Right now my pathtracer works entirely through indirect lighting and only supports diffuse surfaces… like I said, very basic! Adding direct lighting should speed up render convergence, especially for scenes with small light sources. Also, right now the pathtracer only uses single direction pathtracing from the camera into the scene… adding bidirectional pathtracing should lead to another performance boost.</p>
<p>I’m still working on rewriting my KD-tree system, that should be finished within the next few days.</p>
<p>Something that is fairly high on my list of things to do right now is redesigning the architecture of my renderer. Right now, for each iteration, the renderer traces a path through a pixel all the way to its recursion depth before moving on to the next pixel. As soon as possible I want to move the renderer to use an iterative (as opposed to recursive) accumulated approach for each iteration (slightly confusing terminology: here I mean iteration as in each render pass), which, oddly enough, is something that my old raytracer already does. I’ve already started moving towards the accumulated approach; right now, I store the first set of raycasts from the camera and reuse those rays in each iteration.</p>
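<p>The accumulation part of that plan is just a running average over render passes; something along these lines (a sketch, not my renderer’s actual code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Progressive accumulation: after each render pass, fold the pass's samples
// into a running per-pixel average, so the image refines as passes pile up.
// Both buffers are flat arrays of pixelCount * 3 floats (RGB).
void accumulatePass(float* accumulated, const float* passSamples,
                    int pixelCount, int passNumber) {
    // passNumber is 1 for the first pass, 2 for the second, and so on.
    float blend = 1.0f / (float)passNumber;
    for (int i = 0; i &lt; pixelCount * 3; ++i) {
        accumulated[i] = accumulated[i] * (1.0f - blend) + passSamples[i] * blend;
    }
}
</code></pre></div></div>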
<p>One cool thing that storing the initial ray cast allows me to do is to generate a z-depth version of the render for “free”:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Mar/frame_3z.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Mar/frame_3z.png" alt="" /></a></p>
<p>Okay, hopefully by my next post I’ll have the KD-tree rewrite done!</p>
https://blog.yiningkarlli.com/2012/03/smoke-sim-preconditioning-and-huge.html
Smoke Sim- Preconditioning and Huge Grids
2012-03-07T00:00:00+00:00
2012-03-07T00:00:00+00:00
Yining Karl Li
<p>I have added preconditioning to my <a href="http://blog.yiningkarlli.com/2012/03/smoke-simulation-basics.html">smoke simulator</a>! For the preconditioner, I am using <a href="http://en.wikipedia.org/wiki/Incomplete_Cholesky_factorization">Incomplete Cholesky</a>, which is the preconditioner recommended in chapter 4 of the <a href="http://www.cs.ubc.ca/~rbridson/fluidsimulation/fluids_notes.pdf">Bridson Fluid Course Notes</a>. I’ve also debugged my vorticity implementation, so the simulation should produce more interesting/stable vortices now.</p>
<p>The key reason for implementing the preconditioner is simple: speed. Faster convergence brings an added bonus too: since each solve takes less time, larger grids become practical. Because of that speed increase, I can now run my simulations on 3D grids.</p>
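<p>For context, the preconditioned conjugate gradient loop itself is quite compact; the preconditioner only changes the step that approximately solves Mz = r, which is where Incomplete Cholesky comes in. Here is a generic sketch of the solver loop, with the matrix apply and preconditioner apply passed in as placeholders rather than my simulator’s actual code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;
#include &lt;vector&gt;

typedef std::vector&lt;double&gt; Vec;

double dot(const Vec&amp; a, const Vec&amp; b) {
    double sum = 0.0;
    for (size_t i = 0; i &lt; a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Solve A x = b with preconditioned conjugate gradient. The caller supplies
// applyA (multiply the pressure matrix by a vector) and applyPreconditioner
// (approximately solve M z = r, e.g. with an Incomplete Cholesky factor).
Vec pcgSolve(const Vec&amp; b,
             Vec (*applyA)(const Vec&amp;),
             Vec (*applyPreconditioner)(const Vec&amp;),
             int maxIterations, double tolerance) {
    Vec x(b.size(), 0.0);                 // initial guess: all zeros
    Vec r = b;                            // residual b - A x
    Vec z = applyPreconditioner(r);
    Vec p = z;                            // search direction
    double rz = dot(r, z);
    for (int iter = 0; iter &lt; maxIterations; ++iter) {
        Vec Ap = applyA(p);
        double alpha = rz / dot(p, Ap);
        for (size_t i = 0; i &lt; x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        if (std::sqrt(dot(r, r)) &lt; tolerance) break;   // converged
        z = applyPreconditioner(r);
        double rzNew = dot(r, z);
        double beta = rzNew / rz;
        for (size_t i = 0; i &lt; x.size(); ++i) p[i] = z[i] + beta * p[i];
        rz = rzNew;
    }
    return x;
}
</code></pre></div></div>
<p>A better preconditioner means fewer of those iterations per pressure solve, which is exactly where the speedup comes from.</p>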
<p>In previous years, the CIS563 smoke simulator framework usually hit a performance cliff at grids beyond around 50x50x50, but last year <a href="http://peterkutz.com/">Peter Kutz</a> managed to push his smoke simulator to 90x90x36 by implementing a sparse A-Matrix structure, as opposed to storing every single data point, including empty ones, for the grid. This year’s smoke simulation framework was updated to include some of Peter’s improvements, and so Joe reckons that we should be able to push our smoke simulation grids pretty far. I’ve been scaling up starting from 10x10x10, and now I’m at 100x100x50:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/38057955" frameborder="0">Smoke Simulator 100x100x50 Test</iframe></div>
<p>This simulation took about 24 hours to run on a 2008 MacBook Pro with a 2.8 GHz Core 2 Duo, but that is actually pretty good for fluid simulation! According to my rather unscientific estimates, the simulation would take about 4 or 5 days without the preconditioner, and even longer without the sparse A-Matrix. I bet I can still push this further, and I’m starting to think about multithreading the simulation with <a href="http://openmp.org/wp/">OpenMP</a> to get even more performance and even larger grids. We shall see.</p>
<p>One more thing: rendering this thing. So far I have not been doing any fancy rendering, just using the default OpenGL render that our framework came with. However, I want to get this into my <a href="http://blog.yiningkarlli.com/2011/10/a-volumetric-renderer-for-rendering-volumes.html">volumetric renderer</a> at some point and maybe even try out the pseudo-black body stuff with it. Eventually I want to try rendering this out with my pathtracer too!</p>
https://blog.yiningkarlli.com/2012/03/smoke-simulation-basics.html
Smoke Simulation Basics!
2012-03-03T00:00:00+00:00
2012-03-03T00:00:00+00:00
Yining Karl Li
<p>For <a href="http://www.seas.upenn.edu/~cis563/">CIS563</a> (Physically Based Animation), our current assignment is to write a fluid simulator capable of simulating smoke inside of a box. For this assignment, we’re using a semi-lagrangian approach based on <a href="http://www.cs.ubc.ca/~rbridson/">Robert Bridson</a>’s 2007 SIGGRAPH <a href="http://www.cs.ubc.ca/~rbridson/fluidsimulation/fluids_notes.pdf">Course Notes on Fluid Simulation</a>.</p>
<p>I won’t go into the nitty-gritty details of the math behind the simulation (for that, consult the Bridson notes), but I’ll give a quick summary. Basically, we start with a specialized grid structure called the MAC (marker and cell) grid, where each grid cell stores information relevant to the point in space the cell represents, such as density, velocity, temperature, etc. We update values across the grid by pretending that a particle carried the cell’s values into the cell: we use the velocity field to trace backwards in time to the particle’s previous position, and look up the values from the grid cell the particle was previously in. We then use that information to perform advection and projection, and solve the resulting system with a <a href="http://en.wikipedia.org/wiki/Preconditioned_conjugate_gradient_method#The_preconditioned_conjugate_gradient_method">preconditioned conjugate gradient solver</a>.</p>
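<p>For a single cell, the semi-Lagrangian lookup described above is only a few lines. Here is a rough 2D sketch of the idea; the grid class and its interpolation method are assumed placeholders, not the actual assignment framework:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Semi-Lagrangian advection of one quantity (e.g. density) for a single cell:
// trace an imaginary particle backwards through the velocity field, then look
// up the old value wherever that particle came from.
struct Grid2D {
    int width, height;
    // Bilinearly interpolates the stored values at a fractional grid position;
    // assumed to exist on the grid class for this sketch.
    float sampleInterpolated(float x, float y) const;
};

float advectCell(const Grid2D&amp; oldField,
                 const Grid2D&amp; velocityU, const Grid2D&amp; velocityV,
                 int i, int j, float dt, float cellSize) {
    // Velocity at this cell, converted to grid cells per unit time.
    float u = velocityU.sampleInterpolated((float)i, (float)j) / cellSize;
    float v = velocityV.sampleInterpolated((float)i, (float)j) / cellSize;
    // Backtrace: where was the particle that ends up in this cell one dt ago?
    float prevX = (float)i - dt * u;
    float prevY = (float)j - dt * v;
    // The advected value is the old field sampled at that previous position.
    return oldField.sampleInterpolated(prevX, prevY);
}
</code></pre></div></div>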
<p>So far I have implemented density advection, projection, buoyancy (via temperature advection), and vorticity. For the integration scheme I’m just using basic forward Euler, which was the default for the framework we started with. Forward Euler seems stable enough for the smoke sim, but I might try to go ahead and implement RK4 later anyway, since I suspect RK4 won’t smooth out details as much as basic forward Euler does.</p>
<p>I’m still missing the actual preconditioner, so for now I’m only testing the simulation on a 2D grid, since otherwise the simulation times will be really really long.</p>
<p>Here is a test on a 100x100 2D grid!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/37842004" frameborder="0">Smoke Simulator 100x100x1 Test</iframe></div>
https://blog.yiningkarlli.com/2012/02/jello-sim-maya-integration.html
Jello Sim Maya Integration
2012-02-25T00:00:00+00:00
2012-02-25T00:00:00+00:00
Yining Karl Li
<p>I ported my <a href="http://blog.yiningkarlli.com/2012/02/multijello-simulation.html">jello simulation</a> to Maya!</p>
<p>Well, sort of.</p>
<p>Instead of building a full Maya plugin like my good friend <a href="http://www.danknowlton.com/blog.php?id=295">Dan Knowlton did</a>, I opted for a simpler approach: I write out the vertex positions for each jello cube for each time step to a giant text file, and then use a custom Python script in Maya to read the vertex positions from the text file and animate a cube inside of Maya. It is a bit hacky and not nearly as elegant as the full-Maya-plugin approach, but it works in a pinch.</p>
<p>I think being able to integrate my coding projects into artistic projects is very important, since at the end of the day, the main point of computer graphics is to be able to produce a good looking image. As such, I thought putting some jello into my kitchen scene would be fun, so here is the result, rendered out with Vray (some day I want to replace Vray with my own renderer though!):</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/37534077" frameborder="0">Jello Test</iframe></div>
<p>The rendering process I’m using isn’t perfect yet… the fact that the jello cubes are being simulated with relatively few vertices is extremely apparent in the above video, as can be seen in how angular the edges of the jello become when it wiggles. At the moment, I can think of two possible fixes: one, simply run the simulation with a higher vertex count, or two, render the jello as a subdivision surface with creased edges. Since the second option should in theory allow for better looking renders without impacting simulation time, I think I will try the subdivision method first.</p>
<p>But for now, here are some pretty still frames:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_01.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_01.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_021.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_021.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_03.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_03.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_04.png"><img src="https://blog.yiningkarlli.com/content/images/2012/Feb/jello_kitchen_04.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2012/02/multijello-simulation.html
Multijello Simulation
2012-02-18T00:00:00+00:00
2012-02-18T00:00:00+00:00
Yining Karl Li
<p>The first assignment of the semester for <a href="http://www.seas.upenn.edu/~cis563/">CIS563</a> is to write a jello simulator using a particle-mass-spring system. The basic jello system involves building a particle grid where all of the particles are connected using a variety of springs, such as bend and shear springs, and then applying forces across the spring grid. In order to step the entire simulation forward in time, we also have to implement a stable integration scheme, such as RK4. For each step forward in time, we have to do intersection tests for each particle against solid objects in the simulation, such as the ground plane or boxes or spheres.</p>
<p>The particle-mass-spring system we used is based directly on the <a href="http://www.pixar.com/companyinfo/research/pbm2001/">Baraff/Witkin 2001 SIGGRAPH Physically Based Animation Course Notes</a>.</p>
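<p>Structural, bend, and shear springs all share the same damped Hooke’s-law force from those course notes; they differ only in which pair of particles they connect, and in their rest lengths and constants. A self-contained sketch of that force (the vector helpers and names here are just for illustration, not my simulator’s actual code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;

struct Vec3 { float x, y, z; };

Vec3  operator-(const Vec3&amp; a, const Vec3&amp; b) { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3  operator*(const Vec3&amp; a, float s)       { return Vec3{a.x * s, a.y * s, a.z * s}; }
float dot(const Vec3&amp; a, const Vec3&amp; b)       { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Damped spring force acting on particle A from the spring connecting A and B.
// The spring pulls A back toward its rest length, while the damping term
// resists the particles' relative motion along the spring direction.
Vec3 springForceOnA(const Vec3&amp; posA, const Vec3&amp; posB,
                    const Vec3&amp; velA, const Vec3&amp; velB,
                    float restLength, float stiffness, float damping) {
    Vec3 delta   = posA - posB;
    float length = std::sqrt(dot(delta, delta));
    Vec3 dir     = delta * (1.0f / length);              // unit vector from B to A
    float stretch     = length - restLength;             // displacement from rest
    float relVelocity = dot(velA - velB, dir);            // relative speed along the spring
    return dir * (-stiffness * stretch - damping * relVelocity);
}
</code></pre></div></div>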
<p>For the actual assignment, we were only required to support a single jello interacting against boxes, spheres, cylinders, and the ground. However, I think basic primitives are a tad boring… so I went ahead and integrated mesh collisions as well. The mesh collision stuff is actually using the same OBJ mesh system and KD-Tree system that I am using for my pathtracer! I am planning on cleaning up my OBJ/KD-Tree system and releasing it on Github or something soon, as I think I will still find even more uses for it in graphics projects.</p>
<p>Of course, a natural extension of mesh support is jello-on-jello interaction, which is why I call my simulator “multijello” instead of just singular jello. For jello-on-jello, my approach is to update one jello at a time, and for each jello, treat all other jellos in the simulation as just more OBJ meshes. This solution yields pretty good results, although some interpenetration happens if the time step is too large or if jello meshes are too sparse.</p>
<p>Here’s a video showcasing some things my jello simulator can do:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/37098929" frameborder="0">Experiments in Jello Simulation</iframe></div>
https://blog.yiningkarlli.com/2012/01/pathtracer-time.html
Pathtracer Time
2012-01-04T00:00:00+00:00
2012-01-04T00:00:00+00:00
Yining Karl Li
<p>This semester I am setting out on an independent study under the direction of <a href="http://www.graphics.cornell.edu/~kiderj/">Joe Kider</a> to build a pathtracer (obviously inspired by my friend and fellow DMD student <a href="http://peterkutz.com/computergraphics/">Peter Kutz</a>). Global illumination rendering techniques are becoming more and more relevant in industry today, as hardware performance in the past few years has begun to reach a point where GI in commercial productions is suddenly no longer prohibitively expensive. Some houses like Sony Imageworks have already moved to full GI renderers like Arnold, while other studios like Pixar are in the process of adopting GI based renderers or extending their existing renderers to support GI lighting. This industry move, coupled with the fact that GI quite simply produces <a href="http://vimeo.com/15630517">very</a> <a href="http://vimeo.com/7809605">pretty</a> <a href="http://vimeo.com/5407991">results</a>, sparked my initial interest in GI techniques like pathtracing. Having built a basic raytracer last semester, I decided in typical over-confident style: “how hard could it be?”</p>
<p>Here’s my project abstract:</p>
<p><em>Both path tracing and bidirectional scatter distribution functions (BSDFs) are ideas that have existed within the field of computer graphics for many years and have seen numerous implementations in a variety of rendering packages. Similarly, creating images of convincing plant life is a technical challenge that a host of solutions now exist for. However, achieving dynamic plant effects such as the change of a plant’s coloring during the transition from summer to fall is a task that to date has mostly been accomplished using procedural techniques and various compositing tricks.</em></p>
<p><em>The goal of this project is to build a path tracing based renderer that is designed specifically with the intent to facilitate achieving dynamic plant effects with a more physically based approach by introducing a new time component to the existing bidirectional scatter distribution model. By allowing BSDFs to vary over not only space but also over time, plant effects such as leaf decay could be achieved through shaders with appearances that are driven through physically based mathematical models instead of procedural techniques. In other words, this project has two main prongs: develop a robust path tracer with at least basic functionality, and then develop and implement a time-dependent BSDF model within the path tracer.</em></p>
<p>…and here’s some background that I wrote up for my proposal…</p>
<p><em><strong>1. INTRODUCTION</strong></em></p>
<p><em>Efficiently rendering convincing images with direct and indirect lighting has been a major problem in the field of computer graphics since the field’s very inception, as convincingly realistic graphics in games and movies depend upon lighting that can accurately mimic that of reality. Known generally as global illumination, the indirect lighting problem has in the past decade seen a number of solutions such as path tracing and photon mapping that can generate convincingly realistic images with reasonable computational resource consumption and efficiency.</em></p>
<p><em>One of the key discoveries that enabled the development of modern global illumination techniques is the concept of Bidirectional Scattering Distribution Functions, or BSDFs. Developed as a superset and generalization of two other concepts known as bidirectional reflectance distribution functions (BRDFs) and bidirectional transmittance distribution functions (BTDFs), a BSDF is a general mathematical function that describes how light is scattered by a certain surface, given the material properties of the surface. BSDFs are useful today for representing the material properties of an object at a single point in time; however, in reality material properties can change and morph over time, as exemplified by the natural phenomenon of leaf color changes from summer to fall.</em></p>
<p><em>This project will attempt to build a prototype of a path tracing renderer with a BSDF model modified to include an additional time component to allow for material properties to change over time in a way representative of how material properties change over time in reality. The hope is that such a renderer will prove to be useful in future attempts to recreate natural phenomena using physically based models, such as leaf decay.</em></p>
<p>…and the actual goal of the project…</p>
<p><em><strong>1.1 Design Goals</strong></em></p>
<p><em>The project’s goal is to develop a reasonably robust and efficient path tracing renderer with a BSDF model modified to include an additional time component. In order to prove the feasibility of such a modified BSDF model, the end goal is to be able to use the renderer to produce images of plant life with changing surface material properties, in addition to standard test images such as Cornell Box tests that validate the functionality of the underlying basic path tracer.</em></p>
<p>…and finally, what I’m hoping I’ll actually be able to produce at the end of this independent study:</p>
<p><em>1.2 <strong>Project’s Proposed Features and Functionality</strong></em></p>
<p><em>The proposed renderer should allow a user to load a scene with an arbitrary number of lights, materials, and objects and render out a realistic, global illumination based render. The renderer should be able to render implicitly defined objects such as spheres and cubes in addition to meshes defined in the .obj format. The renderer should also allow users to specify changes in object/light/camera transformations over time in addition to changes in materials and BSDFs over time, and render out a series of frames showing the scene at various points in time. A graphical interface would be a nice additional feature, but is not a priority of this project.</em></p>
<p>I’ll be posting at least weekly updates to this blog showing my progress. In my next post, I’ll go over some of the papers and sources Joe gave me to look over and explain some of the basic mechanics of how a pathtracer works. Apologies to the casual reader for this particular post being extremely text heavy; I shall have images to show soon!</p>
https://blog.yiningkarlli.com/2011/12/basic-raytracer-and-fun-with-kd-trees.html
Basic Raytracer and Fun with KD-Trees
2011-12-22T00:00:00+00:00
2011-12-22T00:00:00+00:00
Yining Karl Li
<p>The last assignment of the year for CIS460/560 (I’m still not sure what I’m supposed to call that class) is the dreaded RAYTRACER ASSIGNMENT.</p>
<p>The assignment is actually pretty straightforward: implement a recursive, direct lighting only raytracer with support for <a href="http://en.wikipedia.org/wiki/Blinn%E2%80%93Phong_shading_model">Blinn-Phong shading</a> and support for basic primitive shapes (spheres, boxes, and polygon extrusions). In other words, pretty much a barebones implementation of the <a href="http://dl.acm.org/citation.cfm?id=358882">original Turner Whitted raytracing paper</a>.</p>
<p>I’ve been planning on writing a global-illumination renderer (perhaps based on pathtracing or photon mapping?) for a while now, so my own personal goal with the raytracer project was to use it as a testbed for some things that I know I will need for my GI renderer project. With that in mind, I decided from the start that my raytracer should support rendering OBJ meshes and include some sort of acceleration system for OBJ meshes.</p>
<p>The idea behind the acceleration system goes like this: in the raytracer, one obviously needs to cast rays into the scene and track how they bounce around to get a final image. That means that every ray needs to have intersection tests against objects in the scene in order to determine what ray is hitting what object. Intersection testing against mathematically defined primitives is simple, but OBJ meshes present more of a problem; since an OBJ mesh is composed of a bunch of triangles or polygons, the naive way to intersection test against an OBJ mesh is to check for ray intersections with every single polygon inside of the mesh. This naive approach can get extremely expensive extremely quickly, so a better approach would be to use some sort of spatial data structure to quickly figure out what polygons are within the vicinity of the ray and therefore need intersection testing.</p>
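<p>To make the cost concrete, the naive approach is literally “test this ray against every triangle and keep the closest hit,” so the work per ray grows linearly with the polygon count. A small sketch (the per-triangle intersection routine is assumed to exist elsewhere):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;vector&gt;

struct Ray      { float origin[3]; float direction[3]; };
struct Triangle { float v0[3]; float v1[3]; float v2[3]; };

// Assumed to exist: returns true on a hit and writes the hit distance.
bool intersectTriangle(const Ray&amp; ray, const Triangle&amp; tri, float* outDistance);

// Naive mesh intersection: test the ray against every single triangle and keep
// the nearest hit. With meshes of hundreds of thousands of polygons and
// millions of rays, this linear scan per ray is exactly what a spatial data
// structure like a KD-tree is meant to avoid.
bool intersectMeshNaive(const Ray&amp; ray, const std::vector&lt;Triangle&gt;&amp; triangles,
                        float* outNearestDistance) {
    bool  hitAnything = false;
    float nearest     = 1e30f;
    for (size_t i = 0; i &lt; triangles.size(); ++i) {
        float distance;
        if (intersectTriangle(ray, triangles[i], &amp;distance) &amp;&amp; distance &lt; nearest) {
            nearest     = distance;
            hitAnything = true;
        }
    }
    *outNearestDistance = nearest;
    return hitAnything;
}
</code></pre></div></div>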
<p>After talking with Joe and trawling around on Wikipedia for a while, I picked a <a href="http://en.wikipedia.org/wiki/K-d_tree">KD-Tree</a> as my spatial data structure for accelerated mesh intersection testing. I won’t go into the details of how KD-Trees work, as the Wikipedia article does a better job of it than I ever could. I will note, however, that the main resources I ended up pulling information from while looking up KD-Tree stuff were Wikipedia, Jon McCaffrey’s old CIS565 slides on spatial data structures, and the fantastic <a href="http://www.pbrt.org/">PBRT book</a> that Joe pointed me towards.</p>
<p>Implementing the KD-Tree for the first time took me the better part of two weeks, mainly because I was misunderstanding how the surface area splitting heuristic works. Unfortunately, I probably can’t post actual code for my raytracer, since this is a class assignment that will be repeated in future incarnations of the class. However, I can show images!</p>
<p>The KD-Tree meant I could render meshes in a reasonable amount of time, so I rendered an airplane:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/2.png" alt="" /></a></p>
<p>The airplane took about a minute or so to render, which got me wondering how well my raytracer would work if I threw the full 500000+ poly <a href="http://en.wikipedia.org/wiki/Stanford_Dragon">Stanford Dragon</a> at it. This render took about five or six minutes to finish (without the KD-Tree in place, this same image takes about 30 minutes to render):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/5.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/5.png" alt="" /></a></p>
<p>Of course, the natural place to go after one dragon is three dragons. Three dragons took about 15 minutes to render, which is pretty much exactly a three-fold increase over one dragon. That means my renderer’s performance scales more or less linearly, which is good.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/4.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/4.png" alt="" /></a></p>
<p>For fun, and because I like space shuttles, here is a space shuttle. Because the space shuttle has a really low poly count, this image took under a minute to render:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/6.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/6.png" alt="" /></a></p>
<p>For reflections, I took a slightly different approach from the typical recursive method. The normal recursive approach to a raytracer is to begin with one ray, and trace that ray completely through recursion to its recursion depth limit before moving onto the next pixel and ray. However, such an approach might not actually be ideal in a GI renderer. For example, from what I understand, in pathtracing a better approach is to trace everything iteratively; that is, trace the first bounce for all rays and store where the rays are, then trace the second bounce for all rays, then the third, and so on and so forth. Basically, such an approach allows one to set an unlimited trace depth and just let the renderer trace and trace and trace until one stops the renderer, but the corresponding cost of such a system is slightly higher memory usage, since ray positions need to be stored from the previous bounce.</p>
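<p>In rough form, that iterative structure looks like the sketch below; this is only the shape of the idea, with the per-ray bounce step left as a placeholder rather than real renderer code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;vector&gt;

struct PathState {
    float origin[3];
    float direction[3];
    float throughput[3];   // color weight still carried by this path
    int   pixelIndex;      // which pixel the path contributes to
    bool  alive;           // false once the path has terminated
};

// Assumed to exist: advances one path by a single bounce, depositing any light
// found into the image and updating the path's origin/direction/throughput.
void traceOneBounce(PathState&amp; path, float* image);

// Iterative tracing: instead of recursing each ray to its full depth before
// moving on, process every ray's first bounce, then every ray's second bounce,
// and so on. The trade-off is that all ray states must be kept in memory.
void traceAllPaths(std::vector&lt;PathState&gt;&amp; paths, float* image, int maxBounces) {
    for (int bounce = 0; bounce &lt; maxBounces; ++bounce) {
        for (size_t i = 0; i &lt; paths.size(); ++i) {
            if (paths[i].alive) traceOneBounce(paths[i], image);
        }
    }
}
</code></pre></div></div>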
<p>Adding reflections did impact my render times pretty dramatically. I have a suspicion that both my intersection code and my KD-Tree are actually far from ideal, but I’ll have to look at that later. Here’s a test with reflections with the airplane:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/0.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/0.png" alt="" /></a></p>
<p>…and here is a test with three reflective dragons. This image took foooorrreeevvveeeerrrr to render…. I actually do not know how long, as I let it run overnight:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/render_test.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/render_test.png" alt="" /></a></p>
<p>I also added support for multiple lights with varying color support:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/1.png" alt="" /></a></p>
<p>Here are some more images rendered with my raytracer:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/7.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/7.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/render_test1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/render_test1.png" alt="" /></a></p>
<p>In conclusion, the raytracer was a fun final project. I don’t think my raytracer is even remotely suitable for actual production use, and I don’t plan on using it for any future projects (unlike my <a href="http://blog.yiningkarlli.com/2011/10/a-volumetric-renderer-for-rendering-volumes.html">volumetric renderer</a>, which I think I will definitely be using in the future). However, I will definitely be using stuff I learned from the raytracer in my future GI renderer project, such as the KD-tree stuff and the iterative raytracing method. I will probably have to give my KD-tree a total rewrite, since it is really really far from optimal here, so that is something I’ll be starting over winter break! Next stop, GI renderer, CIS563, and CIS565!</p>
<p>As an amusing parting note, here is the first proper image I ever got out of my raytracer. Awww yeeeaaahhhhhh:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Dec/supersweet_raytraced_image.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Dec/supersweet_raytraced_image.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2011/10/a-volumetric-renderer-for-rendering-volumes.html
A Volumetric Renderer for Rendering Volumes
2011-10-14T00:00:00+00:00
2011-10-14T00:00:00+00:00
Yining Karl Li
<p>The first assignment of the semester for CIS460 was to write, from scratch in C++, a volumetric renderer. Quite simply, a volumetric renderer is a program that can create a 2D image from a 3D discretized data set. Such a data set is more often referred to as a voxel grid. In other words, a volumetric renderer makes pictures from voxels. Such renderers are useful in visualizing medical imaging data and some forms of 3D scans and blah blah blah…</p>
<p>…or you can make pretty clouds.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/cloud07.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/cloud07.png" alt="" /></a></p>
<p>One of the first things I ever tried to make when I first was introduced to Maya was a cloud. I quickly learned that there simply is no way to get a nice fluffy cloud using polygonal modeling techniques. Ever since then I’ve kept the idea of making clouds parked in the back of my head, so when we were assigned the task of writing a volumetric renderer that could produce clouds, obviously I was pretty excited.</p>
<p>The coolest part of studying computer graphics from the computer science side of things has got to be the whole idea of “well, I want to make X, but I can’t seem to find any tool that can do X, so I guess…. I’LL JUST WRITE MY OWN PROGRAM TO MAKE X.”</p>
<p>I won’t go into detailed specifics about implementing the volumetric renderer, as that is a topic well covered by many papers written by authors much smarter than me. Also, future CIS460 students may stumble across this blog, and half the fun of the assignment is figuring out the detailed implementation for oneself. I don’t want to ruin that for them ;) Instead, I’ll give a general run-through of how this works.</p>
<p>The way the volumetric renderer works is pretty simple. You start with a big ol’ grid of voxels, called… the voxel grid or voxel buffer. From the camera, you shoot an imaginary ray through each pixel of what will be the final picture and trace that ray to see if it enters the voxel buffer. If the ray does indeed hit the voxel buffer, then you slowly sample along the ray a teeny step at a time and accumulate the color of the pixel based on the densities of the voxels traveled through. Lighting information is easy too: for each voxel reached, figure out how much stuff there is between that voxel and any light sources, and use a fancy equation to weight the amount of shadow a voxel receives. “But where does that voxel grid come from?”, you may wonder. In the case of my renderer, the voxel grid can either be loaded in from text files containing voxel data in a custom format, or the grid can be generated by sampling a Perlin noise function for each voxel in the grid.</p>
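<p>Stripped of all the details, the accumulation loop for a single pixel’s ray looks something like the sketch below. The density and lighting lookups are placeholders, and the attenuation math is deliberately generic; this is the shape of the raymarch, not the assignment’s solution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cmath&gt;

// Assumed to exist: the interpolated density of the voxel buffer at a world
// position (zero outside the grid), and how much light from the light sources
// reaches that position after attenuation by the stuff in between.
float densityAt(float x, float y, float z);
float lightReaching(float x, float y, float z);

// March along a ray through the voxel buffer in small steps, accumulating
// radiance front to back while tracking how much transmittance remains.
float raymarch(float ox, float oy, float oz,        // ray origin
               float dx, float dy, float dz,        // normalized ray direction
               float stepSize, int maxSteps) {
    float transmittance = 1.0f;    // how much of the background still shows through
    float accumulated   = 0.0f;    // accumulated (grayscale) radiance for the pixel
    for (int step = 0; step &lt; maxSteps &amp;&amp; transmittance &gt; 0.001f; ++step) {
        float t  = step * stepSize;
        float px = ox + dx * t, py = oy + dy * t, pz = oz + dz * t;
        float density = densityAt(px, py, pz);
        if (density &lt;= 0.0f) continue;
        float stepOpacity = 1.0f - std::exp(-density * stepSize);
        accumulated   += transmittance * stepOpacity * lightReaching(px, py, pz);
        transmittance *= 1.0f - stepOpacity;
    }
    return accumulated;
}
</code></pre></div></div>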
<p>So obviously volumetric renderers are pretty good for rendering clouds, as one can simply represent a cloud as a bunch of discrete points where each point has some density value. However, discretizing the world has a distinct disadvantage: artifacting. In the above render, some pixel-y artifacting is visible because the voxel grid I used wasn’t high enough resolution to make individual voxels indistinguishable. The problem is even more obvious in this render, where I stuck the camera right up into a cloud:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/newtest2_smooth.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/newtest2_smooth.png" alt="" /></a></p>
<p>(sidenote for those reading out of interest in CIS460: I implemented multiple arbitrary light sources in my renderer, which is where those colors are coming from)</p>
<p>There are four ways to deal with the artifacting issue. The first is to simply move the camera further away. Once the camera is sufficiently far away, even a relatively low resolution grid will look pretty smooth:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/newtest_withnoise.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/newtest_withnoise.png" alt="" /></a></p>
<p>A second way is to simply dramatically increase the resolution of the voxel grid. This technique can be very, very memory expensive though. Imagine a 100x100x100 voxel grid where each voxel requires 4 bytes of memory… the total memory required is about 3.8 MB, which isn’t bad at all. But let’s say we want a grid 5 times higher in resolution… a 500^3 grid needs 476 MB! Furthermore, a 1000x1000x1000 grid requires 3.72 GB! Of course, we could try to save memory by only storing non-empty voxels through the use of a hashmap or something, but that is more computationally expensive and gives no benefit in the worst case scenario of every voxel having some density.</p>
<p>A third alternative is to use trilinear interpolation or some other interpolation scheme to smooth out the voxel grid as it’s being sampled. This technique can lead to some fairly nice results:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/cloud10.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/cloud10.png" alt="" /></a></p>
<p>At least in the case of my renderer, there is a fourth way to deal with the artifacting: instead of preloading the voxel buffer with values from Perlin noise, why not just get rid of the notion of a discretized voxel buffer altogether and directly sample the Perlin noise function when raymarching? The result would indeed be a perfectly smooth, artifact free render, but the computational cost is extraordinarily high compared to using a voxel buffer.</p>
<p>Of course, one could just box blur the render afterwards as well. But doing so is sort of cheating.</p>
<p>I also played with trying to get my clouds to self illuminate, with the hope of possibly eventually making explosion type things. Ideally I would have done this by properly implementing a physically accurate black body system, but I did not have much time before the finished assignment was due to implement such a system. So instead, my friend Stewart Hills and I came up with a fake black body system where the emittance of each voxel was simply determined by how far the voxel is from the outside of the cloud. For each voxel, simply raycast in several random directions until each raymarch hits zero density, pick the shortest distance, and plug that distance into some exponential falloff curve to get the voxel’s emittance. Here’s a self-glowing cloud:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/blackbody_tril.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/blackbody_tril.png" alt="" /></a></p>
<p>…not even close to physically accurate, but pretty good looking for a hack that was cooked up in a few hours! A closeup shot:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Oct/blackbody18.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Oct/blackbody18.png" alt="" /></a></p>
<p>So! The volumetric renderer was definitely a fun assignment, and now I’ve got a cool way to make clouds! Hopefully I’ll be able to integrate this renderer into some future projects!</p>
https://blog.yiningkarlli.com/2011/10/buildinginstalling-alembic-for-osx.html
Building/Installing Alembic for OSX
2011-10-06T00:00:00+00:00
2011-10-06T00:00:00+00:00
Yining Karl Li
<p><a href="http://www.alembic.io/">Alembic</a> is a new open-source computer graphics interchange framework being developed by <a href="http://opensource.imageworks.com/">Sony Imageworks</a> and <a href="http://www.ilm.com/">ILM</a>. The basic idea is that moving animation rigs and data and whatnot between packages can be a very tricky procedure since every package has its own way to handle animation, so why not bake out all of that animation data into a common interchange format? So, for example, instead of having to import a Maya rig into Houdini, you could rig/animate in Maya, bake out the animation to Alembic, bring that into Houdini to conduct simulations with, and then bake out the animation and bring it back into Maya. This is a trend that a number of studios including Sony, ILM, Pixar, etc. have been moving toward for some time.</p>
<p>I’ve been working on a project lately (more on that later) that makes use of Alembic, but I found that the only way to actually get Alembic is to build it from source. That’s not terribly difficult, but there aren’t really any guides out there for folks who might not be as comfortable with building things from source. So, I wrote up a little guide!</p>
<p>Here’s how to build Alembic for OSX (10.6 and 10.7):</p>
<ol>
<li>Alembic has a lot of dependencies that can be annoying to build/install by hand, so we’re going to cheat and use Homebrew. To install Homebrew:</li>
</ol>
<code class="language-plaintext highlighter-rouge">/usr/bin/ruby -e "$(curl -fsSL https://raw.github.com/gist/323731)"</code>
<ol>
<li>Get/build/install cmake with Homebrew:</li>
</ol>
<code class="language-plaintext highlighter-rouge">brew install cmake</code>
<ol>
<li>Get/build/install Boost with Homebrew:</li>
</ol>
<code class="language-plaintext highlighter-rouge">brew install Boost</code>
<ol>
<li>Get/build/install HDF5 with Homebrew:</li>
</ol>
<code class="language-plaintext highlighter-rouge">brew install HDF5</code>
Homebrew builds HDF5 from source and runs make install itself, so this step may take some time. Be patient.
<ol>
<li>Unfortunately, ilmbase is not a standard UNIX package, so we can’t use Homebrew. We’ll have to build ilmbase manually. Get it from:</li>
</ol>
http://download.savannah.nongnu.org/releases/openexr/ilmbase-1.0.2.tar.gz
Untar/unzip to a readily accessible directory and cd into the ilmbase directory. Run:
<code class="language-plaintext highlighter-rouge">./configure</code>
After that finishes, we get to the annoying part: ilmbase by default makes use of a deprecated GCC 3.x compiler flag called -Wno-long-double, which no longer exists in GCC 4.x. We’ll have to remove this flag from ilmbase’s makefiles manually in order to build correctly. In each of the following files:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>`/Half/Makefile
/HalfTest/Makefile
/Iex/Makefile
/IexTest/Makefile
/IlmThread/Makefile
/Imath/Makefile
/ImathTest/Makefile`
</code></pre></div></div>
Find the following line:
<code class="language-plaintext highlighter-rouge">CXXFLAGS = -g -O2 -D_THREAD_SAFE -Wno-long-double</code>
and delete it from the makefile.
Once all of that is done, you can make and then make install like normal.
Now move the ilmbase folder to somewhere safe. Something like /Developer/Dependencies might work, or alternatively /usr/include/.
<ol>
<li>
<p>Time to actually build Alembic. Get the source tarball from:</p>
<p><code class="language-plaintext highlighter-rouge">http://code.google.com/p/alembic/wiki/GettingAlembic</code></p>
<p>Untar/unzip into a readily accessible directory and then create a build root directory parallel to the source root you just created:</p>
<p><code class="language-plaintext highlighter-rouge">mkdir ALEMBIC_BUILD</code></p>
<p>The build root doesn’t necessarily have to be parallel, but here we’ll assume it is for the sake of consistency.</p>
</li>
<li>
<p>Now cd into ALEMBIC_BUILD and bootstrap the Alembic build process. The bootstrap script is a python script:</p>
<p><code class="language-plaintext highlighter-rouge">python ../[Your Alembic Source Root]/build/bootstrap/alembic_bootstrap.py</code></p>
<p>The script will ask you for a whole bunch of paths:</p>
<p>For “Please enter the location where you would like to build the Alembic”, enter the full path to your <code class="language-plaintext highlighter-rouge">ALEMBIC_BUILD</code> directory.</p>
<p>For “Enter the path to lexical_cast.hpp:”, enter the full path to your lexical_cast.hpp, which should be something like <code class="language-plaintext highlighter-rouge">/usr/local/include/boost/lexical_cast.hpp</code></p>
<p>For “Enter the path to libboost_thread:”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/local/lib/libboost_thread-mt.a</code></p>
<p>For “Enter the path to zlib.h”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/include/zlib.h</code></p>
<p>For “Enter the path to libz.a”, we’re actually not going to link against libz.a. We’ll be using libz.dylib instead, which should be at something like <code class="language-plaintext highlighter-rouge">/usr/lib/libz.dylib</code></p>
<p>For “Enter the path to hdf5.h”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/local/include/hdf5.h</code></p>
<p>For “Enter the path to libhdf5.a”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/local/Cellar/hdf5/1.x.x/lib/libhdf5.a </code>(unless you did not use Homebrew for installing hdf5, in which case libhdf5.a will be in whatever lib directory you installed it to)</p>
<p>For “Enter the path to ImathMath.h”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/local/include/OpenEXR/ImathMath.h</code></p>
<p>For “Enter the path to libImath.a”, your path should be something like <code class="language-plaintext highlighter-rouge">/usr/local/lib/libImath.a</code></p>
<p>Now hit enter, and let the script finish running!</p>
</li>
<li>
<p>If everything is bootstrapped correctly, you can now run make. This will take a while, so be patient.</p>
</li>
<li>
<p>Once the make finishes successfully, run make test to check for any problems.</p>
</li>
<li>
<p>Finally, run make install, and we’re done! Alembic should install to something like <code class="language-plaintext highlighter-rouge">/usr/bin/alembic-1.x.x/.</code></p>
</li>
</ol>
https://blog.yiningkarlli.com/2011/09/installing-numpy-for-maya-2012-64-bit-on-osx-10-7.html
Installing Numpy for Maya 2012 64-bit on OSX 10.7
2011-09-05T00:00:00+00:00
2011-09-05T00:00:00+00:00
Yining Karl Li
<p>On OSX 10.6, installing <a href="http://numpy.scipy.org/">Numpy</a> for Maya 2012 was <a href="http://animateshmanimate.com/2011/03/30/python-numpy-and-maya-osx-and-windows/">simple enough</a>. You could do it either by directly copying the Numpy install folder into the site-packages folder of Maya’s bundled Python, or by adding a sys.path.append line to Maya’s userSetup.py. The process was quite simple since OSX 10.6’s default preinstalled version of Python was 2.6.x and Maya 2012 uses Python 2.6.x as well.</p>
<p>However, OSX 10.7 comes with Python 2.7.x, so a few extra steps are needed:</p>
<p>For Maya 2012 64-bit:</p>
<ol>
<li>
<p>OSX 10.7 comes with Python 2.7.x, but we need 2.6.x, so install 2.6.x using the official installer from here: http://www.python.org/ftp/python/2.6.6/python-2.6.6-macosx10.3.dmg</p>
</li>
<li>
<p>Since we’re using 64-bit Maya with 64-bit Python, we’ll need a 64-bit build of Numpy. The official version distributed on numpy.scipy.org is 32-bit, though. Thankfully, there is an unofficial 64-bit build in the form of the <a href="http://stronginference.com/scipy-superpack/">Scipy Superpack for Mac OSX</a>. Even though we’re on OSX 10.7, we’ll want the OSX 10.6 variety of the script, since the OSX 10.7 version depends on Python 2.7.x: <a href="http://idisk.mac.com/fonnesbeck-Public/superpack_10.6_2011.07.10.sh">http://idisk.mac.com/fonnesbeck-Public/superpack_10.6_2011.07.10.sh</a></p>
<p>EDIT (01/12/2012): I’ve been informed by Michael Frederickson that the link originally posted to the unofficial 64 bit Scipy Superpack build for 10.6 no longer works. Fortunately, I’ve backed up both the script and the required dependencies. The install script can be found here: <a href="http://yiningkarlli.com/files/osx10.7numpy2.6/superpack_10.6_2011.07.10.sh">http://yiningkarlli.com/files/osx10.7numpy2.6/superpack_10.6_2011.07.10.sh</a></p>
</li>
<li>
<p>Go to where the script downloaded to and in Terminal:</p>
<p><code class="language-plaintext highlighter-rouge">chmod +x superpack_10.6_2011.07.10.sh
./superpack_10.6_2011.07.10.sh </code></p>
<p>If you don’t already have GNU Fortran, make sure to answer ‘yes’ when the script asks.</p>
</li>
<li>
<p>Once the script is done installing, in Terminal:</p>
<p><code class="language-plaintext highlighter-rouge">ls /Library/Python/2.7/site-packages/ | grep numpy </code></p>
<p>You should get something like: <code class="language-plaintext highlighter-rouge">numpy-2.0.0.dev_b5cdaee_20110710-py2.6-macosx-10.6-universal.egg</code></p>
<p>Even though we installed Numpy for Python 2.6.x, on Lion it installs to the 2.7 folder for some reason. No matter, you can either leave it there or move it to 2.6.</p>
</li>
<li>
<p>Go to <code class="language-plaintext highlighter-rouge">/Users/[your username]/Library/Preferences/Autodesk/maya/2012-x64/scripts </code></p>
</li>
<li>
<p>If you don’t have a file named <code class="language-plaintext highlighter-rouge">userSetup.py</code>, make one and open it in a text editor. If you already have one, just open it.</p>
</li>
<li>
<p>Add these lines to the file:</p>
<p><code class="language-plaintext highlighter-rouge">import os
import sys
sys.path.append('/Library/Python/2.7/site-packages/[thing you got from step 4]') </code></p>
</li>
<li>
<p>Sidenote: installing Python 2.6.x sets your default OSX Python to 2.6.x, but if you want to go back to 2.7.x, just edit your <code class="language-plaintext highlighter-rouge">~/.bash_profile</code> and remove these lines:</p>
<p><code class="language-plaintext highlighter-rouge">PATH="/Library/Frameworks/Python.framework/Versions/2.6/bin:${PATH}"
export PATH </code></p>
</li>
</ol>
<p>…and you should be done! In Maya, you should be able to just use <code class="language-plaintext highlighter-rouge">import numpy</code> and you’ll be good to go!</p>
https://blog.yiningkarlli.com/2011/09/why-backups-are-important.html
GH House Project, a.k.a. Why Backups are Important
2011-09-01T00:00:00+00:00
2011-09-01T00:00:00+00:00
Yining Karl Li
<p>Here is a cautionary tale about why backing up one’s harddrive is EXTREMELY IMPORTANT.</p>
<p>Over the summer, I started making a little scene based off of the <a href="http://www.ronenbekerman.com/challenges/architectural-visualization-challenge-i-the-gh-house/">GH House Challenge from RonenBekerman.com</a>, partially as a way to learn Vray and partially just for fun. I was working off of my laptop for the entire project, since I was in California at the time and didn’t have access to more powerful machines at home. Being out in California for the summer, I brought as little stuff with me as possible.</p>
<p>One of the things I decided to leave home was my backup Time Machine drive. “Oh, I won’t need this over the summer, what are the odds of file corruption or harddrive issues anyhow? I’ll be fine”, I thought to myself.</p>
<p>Which means, of course, that halfway through the summer a bunch of my files got corrupted and were therefore lost forever, and of course that block of lost data included my in-progress GH House project. NEVER ASSUME THAT YOU DO NOT NEED BACKUP.</p>
<p>What follows are some random in-progress renders that survived through being in posts I made to Facebook and Tumblr.</p>
<p>Here are a series of small in-progress renders showing shading and lighting tests:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse01.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse01.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse02.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse02.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse03.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/GHHouse03.png" alt="" /></a></p>
<p>I also started playing with some ideas for the interior:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/5.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/5.jpg" alt="" /></a></p>
<p>…and finally, some larger in-progress renders. These renders represent where the project was when I lost all of the data:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/1.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/2.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/3.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/3.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Sep/4.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Sep/4.jpg" alt="" /></a></p>
<p>In the end, the fact that I lost the project isn’t as important as the fact that I learned quite a lot from tinkering with this project. However, losing all of the data for this project was definitely a major bummer. But, lesson learned: BACK UP ALL THE TIME.</p>
https://blog.yiningkarlli.com/2011/05/animation-final-project-stills.html
Animation Final Project Stills
2011-05-08T00:00:00+00:00
2011-05-08T00:00:00+00:00
Yining Karl Li
<p>For my Computer Animation class’s final, I decided to go for a change of pace and work in 2D instead of in Maya. I want to tweak a few things before I post the finished animation, but I have two more finals to get through first. So for now, here are some stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s4.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s4.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s5.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s5.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/May/s6.png"><img src="https://blog.yiningkarlli.com/content/images/2011/May/s6.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2011/05/why-cd-when-you-can-go.html
Why cd when you can go?
2011-05-05T00:00:00+00:00
2011-05-05T00:00:00+00:00
Yining Karl Li
<p>I learned a sweet trick from fellow Penn CIS student <a href="http://alexeymk.com/">Alexey Komissarouk</a>’s blog today: the ‘go’ command!</p>
<p>So in a standard *nix bash CLI, you have your typical cd command. We all know how to use cd.</p>
<p>But have you ever accidentally cd’d a file? “cd /stuff/blah.txt” makes no sense and just gets you a “Not a directory” error. So then you have to backtrack and use vim or emacs or nano or whatever… blarg. If you’re using emacs or vim, you like efficiency and you’ve already lost efficiency by wasting a perfectly good moment trying to cd into a file.</p>
<p>Enter the ‘go’ command!</p>
<p>Add this bit of code to your .bashrc file and replace $EDITOR with the CLI text editor of your choice:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>go()
{
if [ -f $1 ]
then
$EDITOR $1
else
cd $1 && ls
fi
}
</code></pre></div></div>
<p>and you’re all done! Now when you go to a directory, bash will cd and when you go to a file, bash will fire up vim or emacs or whatever.</p>
<p>As a side note, it might be fun to modify the ‘go’ command even further to automatically launch actions for other filetypes as well, like running javac whenever a .java file is encountered, launching .jar files, or running gcc or make whenever C++ makefiles are encountered. That’s left as an exercise for the reader though!</p>
https://blog.yiningkarlli.com/2011/04/chairs-now-with-balloons.html
Chairs…. now with Balloons!
2011-04-29T00:00:00+00:00
2011-04-29T00:00:00+00:00
Yining Karl Li
<p>Oops, I haven’t posted in a while…</p>
<p>A few weeks back I decided to try out overhauling one of my previous projects with VRay. I figured the chairs project would be fun, so…</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/shot3.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/shot3.jpg" alt="" /></a></p>
<p>Wwwwaaaayyyy prettier than before. I really like VRay, although I feel that setting it up is a bit more involved than MentalRay is. I still haven’t made many inroads with Photorealistic Renderman, so I can’t comment on that quite yet.</p>
<p>Oh, also, as you can see, I added balloons too. I like balloons.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/shot0.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/shot0.jpg" alt="" /></a></p>
<p>I decided to add balloons after seeing an article on <a href="http://www.ronenbekerman.com/">RonenBekerman.com</a> a while back about shading balloons using VRay in 3DSMax. I’m using VRay in Maya, however, so I had to figure out how to recreate the shader in Maya’s Hypershade. The shader network wound up looking like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/balloonshadernetwork.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/balloonshadernetwork.png" alt="" /></a></p>
<p>It’s *almost* fully procedural, minus that one black and white ramp image that I wound up using for a lot of things. Replacing that image with a procedural ramp shader to make the entire shader fully procedural probably wouldn’t be very hard at all, but I got lazy :p</p>
<p>I was originally going to post breakdowns of all of the settings for each node in the shading network as well, but again, I’m lazy. So instead, <a href="http://www.yiningkarlli.com/files/BalloonShader.zip">here’s the shader in a Maya .ma file</a>!</p>
<p>A few more renders:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/shot1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/shot1.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/shot2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/shot2.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Apr/shot4.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Apr/shot4.jpg" alt="" /></a></p>
<p>As soon as my last finals are over in about a week, I’ll catch up with my backlog of things that need to be posted. I’m planning on posting a series of posts introducing some concepts in graphics programming that I learned in CIS277 this semester. I’m not going to go super duper in depth (for that, take CIS277! Dr. Norm Badler is an awesome professor.), but at the very least I’ll highlight some of the cooler things I learned. That class was really neat, we wound up writing our own 2D animation software from scratch and our final team project assignment was to build our own 3D modeling software. Basically, we made mini-Maya. My team (Adam Mally, Stewart Hills, and me) got some really neat stuff to work.</p>
<p>Speaking of Stewart, Stewart and I both will be interning at Pixar this summer! We got into their Pixar Undergraduate Program… uh… program. PUP is essentially a 10-week crash course on Pixar’s production pipeline, so we’ll be learning about everything from modeling to simulation to using Photorealistic Renderman. I’m really looking forward to that.</p>
https://blog.yiningkarlli.com/2011/03/vray-tree.html
VRay Tree
2011-03-28T00:00:00+00:00
2011-03-28T00:00:00+00:00
Yining Karl Li
<p>After being frustrated with Mentalray for a few weeks, I’ve decided to start experimenting with VRay. VRay is… pretty amazing.</p>
<p>I’ve been continuing my tree experiments using VRay. VRay’s Sun&Sky system is much nicer than Mentalray’s system and VRay has this crazy useful two-sided material for flat two dimensional planes… such as leaves. Here’s what I managed to cook up over the weekend:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/tree.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/tree.jpg" alt="" /></a></p>
<p>I’m still working out some kinks in my new tree workflow. I’ll post a full breakdown in a few days.</p>
https://blog.yiningkarlli.com/2011/03/mo-tree-and-grass-experimenting.html
Mo’ Tree (and Grass) Experimenting
2011-03-19T00:00:00+00:00
2011-03-19T00:00:00+00:00
Yining Karl Li
<p>I experimented with subsurface scatter based shaders for plant leaves today! I’m still working on it, so I won’t be writing up what I’ve found until a bit later. But for now, here’s what I’ve managed to get!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest0.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest0.png" alt="" /></a></p>
<p>The grass is just Maya fur with a custom shader (woot subsurface scatter!) and the tree is modeled after a Japanese Maple and was made the same way as the one I posted a few days back.</p>
<p>Back to exploring!</p>
https://blog.yiningkarlli.com/2011/03/autumn-tree.html
Autumn Tree!
2011-03-17T00:00:00+00:00
2011-03-17T00:00:00+00:00
Yining Karl Li
<p>Every couple of months I find myself trying to make trees again in Maya. Today I found myself tackling the tree problem yet again…</p>
<p>I’ve found that using XFrog’s plant modeler program is my favorite way to create base meshes for plants. It sure as heck beats hand modeling all of those leaves… Speaking of models and leaves, the method I’ve settled on for tree leaves is to just use planes where the leaves should go and then make the planes look like leaves through alpha mapping.</p>
<p>Anyhoo, here’s where I managed to get tonight:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest4.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Mar/treetest4.png" alt="" /></a></p>
<p>The displacement on the bark is really really weird right now and the color of the leaves is weird too. I think I’m going to try subsurface scattering on the leaves… see if that helps. More updates later…</p>
https://blog.yiningkarlli.com/2011/03/demoreel-update.html
Demoreel Update!
2011-03-11T00:00:00+00:00
2011-03-11T00:00:00+00:00
Yining Karl Li
<p>So… after interviewing with Paul Kanyuk from Pixar, I’ve decided to update my reel a bit…</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/20909195" frameborder="0">Demoreel Spring 2011 v2.1</iframe></div>
<p><a href="http://yiningkarlli.com/demoreel/spring2011/spring2011breakdown.pdf">Breakdown is here (PDF)</a>.</p>
<p>So why the updated reel? Interviewing with Paul and chatting with the other two people from Pixar that visited Penn was really interesting. Paul had a lot of suggestions for my work during our interview, so I’ve decided to go ahead and incorporate a lot of the changes that Paul suggested.</p>
<p>So changelog time!</p>
<p>Overall Changes:</p>
<ul>
<li>New song! The new song is an instrumental version of Baby Universe from the We Love Katamari OST. As usual, the version of the reel I’m actually sending out to studios for internships has no music, though.</li>
<li>I’ve replaced “Postcards from Prague” with a new project, “Chairs”</li>
<li>Shuffled around the order of some pieces.</li>
</ul>
<p>Apples:</p>
<ul>
<li>Recomposited with slightly better z-depth using a new depth of field plugin for After Effects I found called Frischluft Lenscare. Apparently Alex Roman uses it, and if Alex Roman uses it, then gosh golly I’d better give it a try. Hahaha.</li>
<li>Slightly tweaked color grading</li>
<li>There’s a little more footage of the second shot of the apples bouncing than there was in the previous reel</li>
</ul>
<p>Hermit Crab:</p>
<ul>
<li>Recomposited with tweaked ambient occlusion in the turntable. Paul pointed out that there were some odd light leaking issues on the underside of the shell’s opening, so I’ve increased the intensity of the AO there to try to make it a bit darker.</li>
<li>Fixed a small problem with the transition between the Untextured Lambert and the Fully Textured parts of the turntable</li>
</ul>
<p>White Room:</p>
<ul>
<li>Every shot’s depth of field has been redone using Frischluft Lenscare</li>
<li>The first shot of the underside of the stairs was lengthened, rerendered, recomposited, and re color graded.</li>
<li>The second shot has new color grading and altered AO.</li>
<li>The third shot was rerendered with new contrast settings and re color graded.</li>
</ul>
<p>Clock:</p>
<ul>
<li>Paul pointed out that a major flaw with the clock was that the highlight on the glass washed out everything under the glass, so I fixed that by changing the reflective properties of the glass slightly and giving the glass more of a curve to break up the highlight</li>
<li>Turntables were sped up to help the reel’s overall pacing</li>
<li>Textures on the clock face are sharper than before</li>
</ul>
<p>Raincoat Girls:</p>
<ul>
<li>Turntable was rerendered to get rid of the light blue band that appeared partway through the turntables in the previous reel</li>
<li>Environment shots were re color graded</li>
</ul>
https://blog.yiningkarlli.com/2011/02/demoreel-and-new-site.html
Demoreel and New Site!
2011-02-21T00:00:00+00:00
2011-02-21T00:00:00+00:00
Yining Karl Li
<p>Look! I finally cut together a demoreel!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/20205051" frameborder="0">Demoreel Spring 2011 v1.2</iframe></div>
<p>I’m getting interviewed by PIXAR! Actually, Pixar gets a slightly different version of my reel. The Pixar version simply has no music… they’re not very big fans of music in demoreels, apparently.</p>
<p>Which is why I got off of my lazy bum and finally cut a reel together.</p>
<p>Oh, I have a reel breakdown too! <a href="http://yiningkarlli.com/demoreel/spring2011/spring2011_v1.pdf">Check it out here (PDF)</a>.</p>
<p>The song in my reel is “I Like Van Halen Because My Sister Says They Are Cool” by El Ten Eleven. I’ve recut it slightly to fit the length of the reel.</p>
<p>I also finally put together a personal site thing at <a href="http://www.yiningkarlli.com">www.yiningkarlli.com</a>.</p>
<p>Speaking of new sites… I should probably get around to redesigning Omjii.com soon. Hm.</p>
https://blog.yiningkarlli.com/2011/02/how-bout-them-apples.html
How 'Bout Them Apples?
2011-02-19T00:00:00+00:00
2011-02-19T00:00:00+00:00
Yining Karl Li
<p>Earlier this week my mom gave my roommates and me a ginormous sack of apples, so I’ve been eating apples all week. Which is good, because I love apples.</p>
<p>So I had an apple sitting on my desk, and I had Maya open, and I was a little bit bored, so… I made some apples in Maya!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/21169670" frameborder="0">Falling Apples</iframe></div>
<p>Over the past few months I’ve developed a bit of an… odd… workflow for texturing/shading irregularly shaped objects (apples… muddy boots… hermit crabs…). I start with modeling and whatnot in Maya, as usual:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_wireframe.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_wireframe.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_flat.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_flat.png" alt="" /></a></p>
<p>Then I go into Photoshop and use various reference (images found online, photos taken with my Nikon D60, etc) to paint a tile of the texture I want. For example, for the apples I took some photos of the apples and then extracted textures from the photos to create this texture tile:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apple_stencil.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apple_stencil.jpg" alt="" /></a></p>
<p>Next, I bring the object mesh and the texture tile into Mudbox and use Mudbox’s projection stencil tool to paint the mesh using the texture tile as the stencil. The nice thing about bringing things into Mudbox for texturing is that I don’t really have to worry too much about UV mapping. Mudbox will automagically take care of all of the UV stuff as long as the imported mesh doesn’t have any overlapping UV coordinates. So instead of messing with the UV editor in Maya before texturing, I can just use Maya’s Automatic UV mapping tool to make sure that no UVs overlap and bring that into Mudbox. After painting in Mudbox, I got a texture image like this:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apple_texture.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apple_texture.jpg" alt="" /></a></p>
<p>After texture painting in Mudbox, deriving spec and bump maps in Photoshop is a relatively straightforward affair. Once texturing and shading is done, I render out the beauty pass and z-depth pass and other passes…</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_z.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples_z.png" alt="" /></a></p>
<p>…and bring all those passes into After Effects for compositing and color grading, and I’m done!</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/apples3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/apples3.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2011/02/watermelon-smash.html
Watermelon Smash
2011-02-04T00:00:00+00:00
2011-02-04T00:00:00+00:00
Yining Karl Li
<p>For animation class, we were given an assignment where we each had to pick a random mixed drink name and use that name as the basis of a 10 second animation in After Effects. I picked something called a Watermelon Smash (it contains…. watermelon cubes… and I don’t remember what else). So… here’s a watermelon smashing something!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/21053019" frameborder="0">Watermelon Smash</iframe></div>
<p>We were actually allowed to use anything we wanted for the actual animation, the only rule was that we had to composite the final result together in After Effects. I wound up doing the ocean, splashes, and sparks in Flash (with tons of help from <a href="http://elementalmagic.blogspot.com/">Joseph Gilland</a>’s book <a href="http://www.amazon.com/Elemental-Magic-Special-Effects-Animation/dp/0240811631">Elemental Magic</a>) and painting the boat and the watermelon in Photoshop. The watermelon itself is animated entirely using the puppet warp tool in After Effects.</p>
<p>I’m not terribly happy with the sound. That needs some reworking probably…</p>
<p>Here’s a few stills breaking down how the entire thing was composited:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown4.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/breakdown4.png" alt="" /></a></p>
<p>…and here’s some random stills:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/shot1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/shot1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/shot2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/shot2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/shot3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/shot3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Feb/shot4.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Feb/shot4.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2011/01/recent-stuff.html
Recent Stuff
2011-01-27T00:00:00+00:00
2011-01-27T00:00:00+00:00
Yining Karl Li
<p>This is just a quick dump of recent things I’ve been working on, detailed posts to come later.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/test10.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/test10.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/render3.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/render3.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/render1.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/render1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/render2.png"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/render2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/trainstation_testrender011.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/trainstation_testrender011.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2011/Jan/trainstation_testrender02.jpg"><img src="https://blog.yiningkarlli.com/content/images/2011/Jan/trainstation_testrender02.jpg" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/12/raincoat-girl-turntable.html
Raincoat Girl Turntable
2010-12-15T00:00:00+00:00
2010-12-15T00:00:00+00:00
Yining Karl Li
<p>I’m done with my character model! Until I can think of a better name, I’m just calling her “Raincoat Girl”.</p>
<p>Here’s a still:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/character.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/character.png" alt="" /></a></p>
<p>Getting the hair and cloth sims to work right took aaaggggeeesssss. Thank goodness <a href="http://www.marissakrupen.blogspot.com/">Marissa Krupen</a> knows so much and helped me out a lot.</p>
<p>I wound up cheating on the subsurface scatter for the skin. I couldn’t get it to look right on its own, so I wound up using a layered shader with the subsurface on one layer and a normal texture map on the other layer. I think the end result looks okay.</p>
<p>Turntable!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/17840556" frameborder="0">Raincoat Girl</iframe></div>
<p>I’ll post more stills later, but now I really need to study for that Finance final that I’ve been avoiding… I also have to make an environment for my character to go in to, and I still have that final project for 3D modeling to finish (I haven’t even started…).</p>
https://blog.yiningkarlli.com/2010/12/character-model-face.html
Character Model Face
2010-12-12T00:00:00+00:00
2010-12-12T00:00:00+00:00
Yining Karl Li
<p>Okay, I’m close to finished with the face…</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/test_render.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/test_render.png" alt="" /></a></p>
<p>I’ve decided not to give her eyelashes. They didn’t look very good. I also noticed that in some Pixar characters, Pixar chose to keep the lips the same color as the rest of the skin… I kind of like that style choice, so I’m going to steal it (read: Karl is too lazy to paint lips).</p>
https://blog.yiningkarlli.com/2010/12/cloth-simulation-progress.html
Cloth Simulation Progress
2010-12-04T00:00:00+00:00
2010-12-04T00:00:00+00:00
Yining Karl Li
<p>I’m working on a character model right now loosely based on <a href="http://blog.yiningkarlli.com/2010/10/puddle-redux.html">this sketch</a>. Here’s how the cloth simulation stuff is looking right now:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_4.jpeg"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_4.jpeg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_3.jpeg"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_3.jpeg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_2.jpeg"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_2.jpeg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_1.jpeg"><img src="https://blog.yiningkarlli.com/content/images/2010/Dec/cloth_test_1.jpeg" alt="" /></a></p>
<p>There is still much work to be done.</p>
https://blog.yiningkarlli.com/2010/11/city-street-playing-with-z-depth-and-ambient-occlusion.html
City Street- Playing with Z-Depth and Ambient Occlusion
2010-11-19T00:00:00+00:00
2010-11-19T00:00:00+00:00
Yining Karl Li
<p>I haven’t managed to make any progress on actually finishing this project since my last post, but I have had a bit of time to play with ambient occlusion and z-depth mapping. So… same render as before, but now with depth of field and some ambient occlusion:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender7_composite_ao_zv2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender7_composite_ao_zv2.jpg" alt="" /></a></p>
<p>…and the z-depth map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/z.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/z.jpg" alt="" /></a></p>
<p>…and the ambient occlusion map. I did the leaves on the trees by transparency mapping the planes where the leaves went on the model, but because of that I wasn’t sure how I was supposed to ambient occlude the trees. So I removed them for the ambient occlusion map:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/a_o.jpeg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/a_o.jpeg" alt="" /></a></p>
<p>I actually found an alternate way to render out the z-depth map, but I’m not entirely sure this is as physically accurate as the standard way Maya does z-depth:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/z_alt.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/z_alt.jpg" alt="" /></a></p>
<p>Hopefully more soon!</p>
https://blog.yiningkarlli.com/2010/11/city-street-progress.html
City Street Progress
2010-11-17T00:00:00+00:00
2010-11-17T00:00:00+00:00
Yining Karl Li
<p>I’ve been working on a little city street for a few days now. I want to capture the kind of old European feel that one can find in places like Edinburgh.</p>
<p>Right now this is about 65% done. I think I’m going to try to make it look like an old postcard.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender7_compositev2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender7_compositev2.jpg" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/11/clock-miniproject.html
Clock Miniproject
2010-11-08T00:00:00+00:00
2010-11-08T00:00:00+00:00
Yining Karl Li
<p>Over the weekend I decided to do a little mini-project to try out some new tricks I’ve learned with rendering. I decided to try to make as photorealistic of an image as possible of a clock. Here’s what I came up with:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender_composite.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/testrender_composite.jpg" alt="" /></a></p>
<p>The clock face is noticeably pixelated; I’m not entirely sure why that is. For some reason Mental Ray is not sampling the texture file at a very high frequency; I’ll work on that next, I suppose.</p>
<p>A little breakdown video of the compositing that went into the clock:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/16631563" frameborder="0">Clock Rendering/Compositing Breakdown</iframe></div>
https://blog.yiningkarlli.com/2010/11/hermit-crab.html
Hermit Crab
2010-11-06T00:00:00+00:00
2010-11-06T00:00:00+00:00
Yining Karl Li
<p>The hermit crab is complete!</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/16569708" frameborder="0">Hermit Crab Redux</iframe></div>
<p>Modeled in Maya, textured in Mudbox, rendered with MentalRay.</p>
<p>I’m perfectly aware that no hermit crab would ever actually live in a conch shell that large, but I thought the image of a small crab in a huge shell was amusing.</p>
<p>Some stills from some different angles…</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/1.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/2.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Nov/3.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Nov/3.png" alt="" /></a></p>
<p>As the title “redux” suggests, the crab above is actually the second version of the hermit crab I’ve made. I originally finished about a week earlier with a different version, but then after getting some suggestions from Professor Scott White, my 3D modeling professor, I decided to redesign the conch shell. Here’s what Hermit Crab Mark I looked like:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/16490011" frameborder="0">Hermit Crab</iframe></div>
<p>I actually still want to change some things. If I have time, I’m going to go back and change the displacement mapping on the conch to get the grooves to all go in a more uniform direction. Also, 9 seconds into the turntable, you might notice there’s a slight shiny spot on the conch. That’s a mistake I made in the specular map that I definitely want to fix. I also want to try placing some small low intensity lights really close to the crab’s eyes to bring out the gloss that’s visible in the Mark I crab. In the Mark II crab, the shadow from the flaring part of the conch makes the crab’s eyes look matte. The crab’s claws need some color tweaking as well; the color doesn’t quite perfectly match the rest of the crab.</p>
<p>The DMD director, Amy Calhoun, told me that no modeler is ever satisfied with a model. So true.</p>
https://blog.yiningkarlli.com/2010/10/hermit-crab-ready-for-texturing.html
Hermit Crab Ready For Texturing!
2010-10-31T00:00:00+00:00
2010-10-31T00:00:00+00:00
Yining Karl Li
<p>My hermit crab is ready for texturing and lighting and rendering! I’m going with Mudbox for texture painting for sure. I’m still not entirely sure how I’m going to get all the prickly parts of the legs done… I’ll probably just do a displacement map or something.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/front.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/front.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/right.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/right.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/left.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/left.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/back.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/back.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/top.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/top.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/10/hermit-crab-progress.html
Hermit Crab Progress
2010-10-28T00:00:00+00:00
2010-10-28T00:00:00+00:00
Yining Karl Li
<p>I’m working on a hermit crab in 3D Modeling class! The shell was really hard to make… I wound up making a small segment, duplicating it with Duplicate Special, and then stitching all the segments together by hand. So… the crab itself is only some legs right now. I have a lot of work to do on this still…</p>
<p>I’m thinking about trying Mudbox for texturing this thing. The UVs on the shell aren’t pretty, and I don’t want to spend a gazillion hours unwrapping those UVs….</p>
<p>More later.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress1.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress1.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress2.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress2.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress3.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/crabprogress3.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/10/puddle-redux.html
Puddle! Redux
2010-10-23T02:00:00+00:00
2010-10-23T02:00:00+00:00
Yining Karl Li
<p><a href="http://floatingdoor.blogspot.com/">Ana</a> suggested a few changes. Much credit to her, the painting looks much much better now:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_take3.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_take3.png" alt="" /></a></p>
<p>I also played with giving the girl goggles of some sort and a snorkel, but I’m not sure this idea works so well.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_take2.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_take2.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/10/puddle.html
Puddle!
2010-10-23T01:00:00+00:00
2010-10-23T01:00:00+00:00
Yining Karl Li
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_final.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/jump_final.png" alt="" /></a></p>
<p>I felt like painting (in Photoshop) today. I’m pretty happy with how this one turned out.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Oct/20.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/20.jpg" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/10/little-kiddo.html
Little Kiddo
2010-10-18T00:00:00+00:00
2010-10-18T00:00:00+00:00
Yining Karl Li
<p>In Penn’s SIGGRAPH chapter, we’re spending the next few months designing our own little characters in Maya!</p>
<p>I drew a little girl complete with little kid size coat and rubber rain boots:</p>
<p><img src="https://blog.yiningkarlli.com/content/images/2010/Oct/characterdesign.jpg" alt="" /></p>
<p>I don’t really have any idea yet of what kind of adventure she’ll go on. I’ll figure that out as I go along, I suppose. I picked a little kiddo mainly because I love how crazy exaggerated little kids often make their expressions. Just check out the absolutely beautifully animated short <a href="http://vimeo.com/15731659">Playing with light - Mon ami le robot</a>.</p>
<p>More later!</p>
https://blog.yiningkarlli.com/2010/09/give-gifi.html
Give Gifi!
2010-09-29T00:00:00+00:00
2010-09-29T00:00:00+00:00
Yining Karl Li
<p>A few weeks ago I joined a startup founded by a few Penn alums called <a href="http://www.venmo.com/">Venmo</a>! My project at Venmo for the past few weeks has been helping my friend and co-worker, <a href="http://twitter.com/ayanonagon">Ayaka Nonaka</a>, with a new app from Venmo called <a href="http://www.givegifi.com/">Gifi</a>, which is a Foursquare/Venmo mashup that lets people leave Venmo money at geographic locations. I’ve been working on Gifi’s <a href="http://www.givegifi.com/">website</a> and overall look. Here’s some of the artwork I did for Gifi:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Sep/quizzical.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Sep/quizzical.png" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Sep/panels_small.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Sep/panels_small.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/07/playing-with-maya.html
Playing with Maya
2010-07-10T00:00:00+00:00
2010-07-10T00:00:00+00:00
Yining Karl Li
<p>I’ve been playing with Maya for the past few months. 3D animation is a direction I’d like to start moving in.</p>
<p>I’m starting to get the hang of lighting things and whatnot, although I still do not know much. I’m taking a 3D modeling course in the fall, hopefully I’ll get much better by then.</p>
<p>Some glasses and strawberries and grapes on a table:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/12256221" frameborder="0">A Table With Some Stuff</iframe></div>
https://blog.yiningkarlli.com/2010/04/sneak-peak.html
A Sneak Peak...
2010-04-08T00:00:00+00:00
2010-04-08T00:00:00+00:00
Yining Karl Li
<p>A lot of you guys already know what I’ve been up to for the past few weeks, but for anybody who I haven’t told, here’s a peek:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Apr/maya_progress.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Apr/maya_progress.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/04/the-foyer.html
The Foyer
2010-04-04T00:00:00+00:00
2010-04-04T00:00:00+00:00
Yining Karl Li
<p>For most of March, our Digital Design Foundations assignment was to create a room from an odd perspective using Illustrator and Photoshop. Here’s what I came up with:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04.jpg" alt="" /></a></p>
<p>Almost all of this is Illustrator, including the wood, which took ages to do. I loosely based this off of the foyer back at home. My intent with this piece was to practice doing lighting work, and I must say I’m rather happy with it.</p>
<p>I made a quick little video of all the stages this piece went through:</p>
<div class="embed-container"><iframe src="https://player.vimeo.com/video/10676852#" frameborder="0">The Foyer</iframe></div>
<p>Just for fun, here are some color variations that I made by putting it through Lightroom:</p>
<p>I call this variant the “Coraline” version:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_coraline.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_coraline.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_greenglow.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_greenglow.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_purple.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Apr/room_textured_all-04_purple.jpg" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/02/george-harrison-portrait.html
George Harrison Portrait
2010-02-23T00:00:00+00:00
2010-02-23T00:00:00+00:00
Yining Karl Li
<p>For Digital Design Foundations, our latest project was to do a portrait of a famous person with interesting looking hair. We were supposed to do these in Illustrator with black, white, and two colors of our choosing (the two colors allowed us to use as many opacity settings as we wanted for each color).</p>
<p>I decided to do George Harrison from the Beatles:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Feb/HarrisonFinalv2_full.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Feb/HarrisonFinalv2.png" alt="" /></a></p>
<p>The background pattern is influenced by the psychedelic pattern on the inner sleeve of Sgt. Pepper’s Lonely Hearts Club Band.</p>
<p>Here are some studies that show the progression of the portrait and some of the variations his mustache went through:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Feb/HarrisonStudiesv2_full.png"><img src="https://blog.yiningkarlli.com/content/images/2010/Feb/HarrisonStudiesv2.png" alt="" /></a></p>
https://blog.yiningkarlli.com/2010/01/elemental-magic-workshop-with-joseph-gilland.html
Elemental Magic Workshop with Joseph Gilland
2010-01-27T00:00:00+00:00
2010-01-27T00:00:00+00:00
Yining Karl Li
<p>Last week over the Martin Luther King Jr. Day weekend, I attended a workshop on effects animation at Penn’s School of Design. The workshop was run by <a href="http://elementalmagic.blogspot.com/">Joseph Gilland</a>, who ran effects animation at Walt Disney Feature Animation for a while and worked on films such as Lilo and Stitch and Mulan.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_6.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_6.jpg" alt="" /></a></p>
<p>The workshop ran for three days and focused on Mr. Gilland’s “organic approach” to visual effects animation: basically, his idea is that effects animation should focus on more traditional, hand-animated techniques rather than the complex CGI simulation stuff that’s all the rage today. After his workshop, I think I agree with him; the stuff he showed us was simply breathtaking.</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_68.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_68.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_7.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/penn_7.jpg" alt="" /></a></p>
<p>During the workshop, we did some studies and prototyping for various visual effects ideas we had. I chose to do an exploding aquarium (I think Mr. Gilland started referring to me as “the crazy guy” after I chose that). I had another idea as well: a hand reaching through smoke into a bank vault or something. Here’s a sketch of the two initial ideas:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/concepts.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/concepts.jpg" alt="" /></a></p>
<p>Mr. Gilland was kind enough to talk through the concept with me though. He sketched this initial concept for me (mad cool!):</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/jgsketch.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/jgsketch.jpg" alt="" /></a></p>
<p>After talking with Mr. Gilland and looking at some of his suggestions, I went about doing three separate studies of what the glass, water, and fireball might look like. Some pencil sketches:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/pencilsketch1.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/pencilsketch1.jpg" alt="" /></a></p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/pencilsketch2.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/pencilsketch2.jpg" alt="" /></a></p>
<p>Then I scanned the three studies and composited/colorized them together in Photoshop. The blue is water, the yellow/orange is the fireball, and the red represents the shattering glass:</p>
<p><a href="https://blog.yiningkarlli.com/content/images/2010/Jan/finalcolor.jpg"><img src="https://blog.yiningkarlli.com/content/images/2010/Jan/finalcolor.jpg" alt="" /></a></p>
<p>I’m going to try to actually do some animation tests for this, although obviously the end result will have to be much simpler visually than the above sketches. I have a feeling I’m going to be consulting the effects library Mr. Gilland gave us a LOT.</p>
<p>Mr. Gilland has a book on visual effects animation titled <a href="http://www.amazon.com/Elemental-Magic-Special-Effects-Animation/dp/0240811631">Elemental Magic</a>. I really recommend checking it out.</p>
https://blog.yiningkarlli.com/2009/12/experimenting-with-time-lapse.html
Experimenting with Time Lapse
2009-12-21T00:00:00+00:00
2009-12-21T00:00:00+00:00
Yining Karl Li
<p>Only two days left until my last final for the semester, so what do I do? Not study! I SHOULD be studying, but then the entire Northeast coast got slammed with a snowstorm. The snow looked really cool outside my window, which meant… photography experiment time!</p>
<p>Recently I’ve been experimenting with Adobe After Effects and Adobe Premiere Pro. More on that later. I also recently got Nikon Camera Control Pro 2, which is Nikon’s tool for remote controlling their DSLRs from computers, which means I can now remote trigger my Nikon D60 over USB from my MacBook Pro. Awesomeness. Time for some snowstorm time lapse experimenting!</p>
<p>So on Friday night/Saturday morning, I pointed my D60 out the window and set Camera Control to take a picture every 40 seconds for 5 hours, starting at 5 AM. Unfortunately, I forgot to charge the battery, so the camera died 80 minutes into the experiment. Also, apparently the movement of the camera’s internal mirror is enough to shift the camera a bit if it isn’t stabilized. As a result, the video is really short and not very stable. It’s not particularly good, but it’s a start:</p>
<div class="embed-container"><iframe src="http://www.youtube.com/embed/IdzW27ydfqo" frameborder="0">Snowstorm Sunrise Time Lapse Test- 12/19/2009</iframe></div>
<p>This time lapse experiment also served a secondary purpose: to test out the planned workflow that we’re going to try using with the upcoming Omjii Show. I composited all of the video in Adobe After Effects and Adobe Premiere Pro and then used Apple Color to color grade the video. If you’re wondering why I’m using Apple Color alongside Premiere Pro instead of Final Cut Pro: I tend to favor tools that plug into Adobe’s Creative Suite workflow, but Adobe doesn’t have a color grader, whereas Apple has a really nice one.</p>
<p>Later in the afternoon, I decided to give the time lapse another shot. This time I remembered to charge the battery and stabilize the camera. The result:</p>
<div class="embed-container"><iframe src="http://www.youtube.com/embed/r5ICBpAj_uI" frameborder="0">Snowstorm Sunset Time Lapse- 12/19/2009</iframe></div>
<p>The problem with attempting time-lapses with a DSLR is that the length of time you can cover is limited by your battery, unless you have an extended battery or something. Another attempt, this time from Sunday:</p>
<div class="embed-container"><iframe src="http://www.youtube.com/embed/7XRyl3fAiqg" frameborder="0">Sunset Over UPenn Time Lapse- 12/20/2009</iframe></div>
<p>I’m still working on getting the technique down, but I’ll post improved attempts and a detailed run-through of the process once I figure out how to stabilize better, among other things.</p>
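<p>One approach to better stabilization that I might try is to estimate each frame’s drift relative to the first frame and then shift it back before compositing. Here’s a minimal, untested sketch of that idea in Python with OpenCV; the <code>frames/</code> and <code>stabilized/</code> folder names are just placeholders, and it only handles translation, not rotation:</p>
<pre><code># Minimal translation-only stabilization sketch for a time lapse sequence.
# Folder names and frame naming are placeholders, not an actual pipeline.
import glob
import os

import cv2
import numpy as np

os.makedirs("stabilized", exist_ok=True)
paths = sorted(glob.glob("frames/*.jpg"))

# Use the first frame as the alignment reference.
reference = cv2.imread(paths[0])
ref_gray = np.float32(cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY))

for i, path in enumerate(paths):
    frame = cv2.imread(path)
    gray = np.float32(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))

    # Estimate the (dx, dy) translation between this frame and the reference
    # via phase correlation, then warp the frame back by that offset.
    # (The sign of the shift may need flipping depending on how the drift
    # comes out in practice; I haven't verified the convention.)
    (dx, dy), _response = cv2.phaseCorrelate(ref_gray, gray)
    warp = np.float32([[1, 0, -dx], [0, 1, -dy]])
    stabilized = cv2.warpAffine(frame, warp, (frame.shape[1], frame.shape[0]))

    cv2.imwrite(os.path.join("stabilized", f"frame_{i:04d}.jpg"), stabilized)
</code></pre>
<p>Phase correlation only picks up translation, so rotation from actually bumping the camera would need a fancier alignment method, but for small drift like the mirror-slap shift I mentioned above, something like this should be a reasonable first pass.</p>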
https://blog.yiningkarlli.com/2009/12/an-introduction.html
An Introduction
2009-12-01T00:00:00+00:00
2009-12-01T00:00:00+00:00
Yining Karl Li
<p>Welcome to Code and Visuals, my blog for tracking my exploration of the world of computer graphics!</p>
<p>This post says December 2009 on it, but it’s actually backdated. I’m adding this post backdated in order to serve as a bit of an introduction. This blog began elsewhere but eventually became my computer graphics blog. Upon moving the hosting of this blog to Github Pages, I’ve decided to clear out some older off-topic posts, although those posts will remain available on the <a href="http://yiningkarlli.blogspot.com">old Blogger version of this blog</a>.</p>
<p>I started this blog around the time I joined Penn’s Digital Media Design program in 2009. Most of the older posts on this blog are pretty silly, but hopefully they show that I’ve made progress since then!</p>