Moana 2

This fall marked the release of Moana 2, Walt Disney Animation’s 63rd animated feature and the 10th feature film rendered entirely using Disney’s Hyperion Renderer. Moana 2 brings us back to the beautiful world of Moana, but this time on a larger adventure with a larger canoe, a crew to join our heroine, bigger songs, and greater stakes. The first Moana was at the time of its release one of the most beautiful animated films ever made, and Moana 2 lives up to that visual legacy with frames that match or often surpass what we did in the original movie. I got to join Moana 2 about two years ago and this film proved to be an incredibly interesting project!

While we’ve now used Disney’s Hyperion Renderer to make several sequels to previous Disney Animation films, Moana 2 is the first time we’ve used Hyperion to make a sequel to a previous film that also used Hyperion. From a technical perspective, the time between the first and second Moana films is filled with almost a decade of continual advancement in our rendering technology and in our wider production pipeline. At the time that we made the first Moana, Hyperion was only a few years old and we spent a lot of time on the first Moana fleshing out various still-underdeveloped features and systems in the renderer. Going into the second Moana, Hyperion is now an extremely mature, extremely feature rich, battle-tested production renderer with which we can make essentially anything we can imagine. Almost every single feature and system in Hyperion today has seen enormous advancement and improvement over what we had on the first Moana; many of these advancements were in fact driven by hard lessons that we learned on the first Moana! Compared with the first Moana, here’s a short, very incomplete laundry list of improvements made over the past decade that we were able to leverage on Moana 2:

  • Moana 2 uses a completely new water rendering system that represents an enormous leap in both render-time efficiency and easier artist workflows compared with what we used on the first Moana; more on this later in this post.
  • After the first Moana, we completely rewrote Hyperion’s previous volume rendering subsystem [Habel 2017] from scratch; the modern system is a state-of-the-art delta-tracking system that required us to make foundational research advancements in order to implement [Kutz et al. 2017, Huang et al. 2021].
  • Our traversal system was completely rewritten to better handle thread scalability and to incorporate a form of rebraiding to efficiently handle gigantic world-spanning geometry; this was inspired directly by problems we had rendering the huge ocean surfaces and huge islands in the first Moana [Burley et al. 2018].
  • On the original Moana, ray self-intersection with things like Maui’s feathers presented a major challenge; Moana 2 is the first film using our latest ray self-intersection prevention system that notably does away with any form of ray bias values.
  • We introduced a limited form of photon mapping on the first Moana that only worked between the sun and water surfaces [Burley et al. 2018].; Moana 2 uses an evolved version of our photon mapper that supports all of our light types, many or our standard lighting features, and even has advanced capabilities like a form of spectral dispersion.
  • We’ve made a number of advancements [Burley et al. 2017, Chiang et al. 2016, Chiang at al. 2019, Zeltner et al. 2022] to various elements of the Disney BSDF shading model.
  • Subsurface scattering on the first Moana was done using normalized diffusion; since then we’ve moved all subsurface scattering to use a state-of-the-art brute force path tracing approach [Chiang et al. 2016].
  • Eyes on the first Moana used our old ad-hoc eye shader; eyes on Moana 2 use our modern physically plausible eye shader that includes state-of-the-art iris caustics calculated using manifold next event estimation [Chiang & Burley 2018].
  • The emissive mesh importance sampling system that we implemented on the first Moana and our overall many-lights sampling system has seen many efficiency improvements [Li et al. 2024].
  • Hyperion has gained many more powerful features granting artists an enormous degree of artistic control both in the renderer and post-render in compositing [Burley 2019, Burley et al. 2024].
  • Since the first Moana, Hyperion’s subdivision/tessellation system has gained an advanced fractured mesh system that makes many of the huge-scale effects in the first Moana movie much easier for us to create today [Burley & Rodriguez 2022].
  • We’ve introduced path guiding into Hyperion to handle particularly difficult light transport cases [Müller et al. 2017, Müller 2019].
  • The original Moana used our somewhat ad-hoc first-generation denoiser, while Moana 2 uses our best-in-industry second-generation deep learning denoiser jointly developed by Disney Research Studios, Disney Animation, Pixar, and ILM [Vogels et al. 2018, Dahlberg et al. 2019].
  • Even Hyperion’s internal architecture has changed enormously; Hyperion originally was famous for being a batched wavefront renderer, but this has evolved significantly since then and continues to evolve.

There are many many more changes to Hyperion that there simply isn’t room to list here. To give you some sense of how far Hyperion has evolved between Moana and Moana 2: the Hyperion used on Moana was internally versioned as Hyperion 3.x; the Hyperion used on Moana 2 is internally versioned as Hyperion 16.x, with each version number in between representing major changes. In addition to improvements in Hyperion, our rendering team has also been working for the past few years on a next-generation interactive lighting system that extensively leverages hardware GPU ray tracing; Moana 2 saw the widest deployment yet of this system; I can’t say much more on this topic yet but hopefully we’ll have more to share soon.

Outside of the rendering group, literally everything else about our entire studio production pipeline has changed as well; the first Moana was made mostly on proprietary internal data formats, while Moana 2 was made using the latest iteration of our cutting-edge modern USD pipeline [Miller et al. 2022, Vo et al. 2023, Li et al. 2024]. The modern USD pipeline has granted our pipeline many amazing new capabilities and far more flexibility, to the point where it became possible to move our entire lighting workflow to a new DCC for Moana 2 without needing to blow up the entire pipeline. Our next-generation interactive lighting system is similarly made possible by our modern USD pipeline. I’m sure we’ll have much more about this at SIGGRAPH!

While I get to work on every one of our feature films and get to do fun and interesting things every time, Moana 2 is the most direct and deep I’ve worked on one of our films probably since the original Moana. There are two specific projects I worked on for Moana 2 that I am particularly proud of: a completely new water rendering system that is part of Moana 2’s overall new water FX workflow, and the volume rendering work that was done for the storm battle in the movie’s third act.

On the original Moana, we had to develop a lot of custom water simulation and rendering technology because commercial tools at the time couldn’t quite handle what the movie required. On the simulation side, the original Moana required Disney Animation to invent new techniques such as the APIC (affine particle-in-cell) fluid simulation model [Jiang et al. 2015] and the FAB (fluxed animated boundary) method for integrating procedural and simulated water dynamics [Stomakhin and Selle 2017]. Disney Animation’s general philosophy towards R&D is that we will develop and invent new methods when needed, but will then aim to publish our work with the goal of allowing anything useful we invent to find its way into the wider graphics field and industry; a great outcome is when our publications are adopted by the commercial tools and packages that we build on top of. APIC and FAB were both published and have since become a part of the stock toolset in Houdini, which in turn allowed us to build more on top of Houdini’s built-in SOPs for Moana 2’s water FX workflow.

On the rendering side, the main challenge on the original Moana for rendering water was the need to combine water surfaces from many different sources (procedural, manually animated, and simulated) into a single seamless surface that could be rendered with proper refraction, internal volumetric effects, caustics, and so on. Our solution to combine different water surfaces on the original Moana was to convert all input water elements into signed distance fields, composite all of the signed distance fields together into a single world-spanning levelset, and then mesh that levelset into a triangle mesh for ray intersection [Palmer et al. 2017]. While this process produced great visual results, running this entire world-spanning levelset compositing and meshing operation at renderer startup for each frame proved to be completely untenable due to how slow it made interaction for artists, so an extensive system for pre-caching ocean surfaces overnight to disk had to be built out. All in all, the water rendering and caching system on the first Moana required a dedicated team of over half a dozen developers and TDs to maintain, and setting up the levelset compositing system correctly proved to be challenging for artists.

For Moana 2, we decided to revisit water rendering with the goal of coming up with something much easier for artists to use, much faster to render, and much easier to maintain by a smaller group of engineers and TDs. Over the course of about half a year, we completely replaced the old levelset compositing and meshing system with a new ray-intersection-time CSG system. Our new system requires almost zero work for artists to set up, requires zero preprocessing time before renderer startup and zero on-disk caching, renders with negligible impact on ray tracing speed, and required zero dedicated TDs and only part of my time as an engineer to support once primary development was completed. In addition to all of the above, the new system also allows for both better looking and more memory efficient water because the level of detail that water meshes have to exist at is no longer constrained by the resolution of a world-size meshed levelset, even with an adaptive levelset meshing. I think this was a great example where by revisiting a world that we already knew how to make, we were given an opportunity to reevaluate what we learned on Moana in order to come up with something better by every metric for Moana 2.

We knew that returning to the world of Moana was likely going to require a heavy lift from a volume rendering perspective. With a mind towards this, we worked closely with Disney Research Studios in Zürich to implement next-generation volume path guiding techniques in Hyperion, which wound up not seeing wide deployment this time but nonetheless proved to be a fun and interesting project from which we learned a lot. We also realized that the third act’s storm battle was going to be incredibly challenging from both an FX and rendering perspective. My last few months on Moana 2 were spent helping get the storm battle sequences finished; one extremely unusual thing we wound up doing was providing custom builds of Hyperion with specific optimizations tailored to the specific requirements of the storm sequence, sometimes going as far as to provide specific builds and settings tailored on a per-shot basis. Normally this is something any production rendering team tries to avoid if possible, but one of the benefits of having our own in-house team and our own in-house renderer is that we are still able to do this when the need arises. From a personal perspective, being able to point at specific shots and say “I wrote code for that specific thing” is pretty neat!

From both a story and a technical perspective, Moana 2 is everything we loved from Moana brought back, plus a lot of fun, big, bold new stuff. Making Moana 2 both gave us new challenges to solve and allowed us to revisit and come up with better solutions to old challenges from Moana. I’m incredibly proud of the work that my teammates and I were able to do on Moana 2; I’m sure we’ll have a lot more to share at SIGGRAPH 2025, and in the meantime I strongly encourage you to see Moana 2 on the biggest screen you can find!

To give you a taste of how beautiful this film looks, here are some frames from Moana 2, 100% created using Disney’s Hyperion Renderer by our amazing artists. These are presented in no particular order:

All images in this post are courtesy of and the property of Walt Disney Animation Studios.

References

Brent Burley, David Adler, Matt Jen-Yuan Chiang, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2017. Recent Advancements in Disney’s Hyperion Renderer. In ACM SIGGRAPH 2017 Course Notes: Path Tracing in Production Part 1. 26-34.

Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. The Design and Evolution of Disney’s Hyperion Renderer. ACM Transactions on Graphics. 37, 3 (2018), 33:1-33:22.

Brent Burley. 2019. On Histogram-Preserving Blending for Randomized Texture Tiling. Journal of Computer Graphics Techniques, 8, 4 (2019), 31-53.

Brent Burley and Francisco Rodriguez. 2022. Fracture-Aware Tessellation of Subdivision Surfaces. In ACM SIGGRAPH 2022 Talks, 10:1-10:2.

Brent Burley, Brian Green, and Daniel Teece. 2024. Dynamic Screen Space Textures for Coherent Stylization. In ACM SIGGRAPH 2024 Talks, 50:1-50:2.

Matt Jen-Yuan Chiang, Peter Kutz, and Brent Burley. 2016. Practical and Controllable Subsurface Scattering for Production Path Tracing. In ACM SIGGRAPH 2016 Talks. 49:1-49:2.

Matt Jen-Yuan Chiang and Brent Burley. 2018. Plausible Iris Caustics and Limbal Arc Rendering. In ACM SIGGRAPH 2018 Talks, 15:1-15:2.

Matt Jen-Yuan Chiang, Yining Karl Li, and Brent Burley. 2019. Taming the Shadow Terminator. In ACM SIGGRAPH 2019 Talks. 71:1-71:2.

Henrik Dahlberg, David Adler, and Jeremy Newlin. 2019. Machine-Learning Denoising in Feature Film Production. In ACM SIGGRAPH 2019 Talks. 21:1-21:2.

Ralf Habel. 2017. Volume Rendering in Hyperion. In ACM SIGGRAPH 2017 Course Notes: Production Volume Rendering. 91-96.

Wei-Feng Wayne Huang, Peter Kutz, Yining Karl Li, and Matt Jen-Yuan Chiang. 2021. Unbiased Emission and Scattering Importance Sampling for Heterogeneous Volumes. In ACM SIGGRAPH 2021 Talks. 3:1-3:2.

Chenfafu Jiang, Craig Schroeder, Andrew Selle, Joseph Teran, and Alexey Stomakhin. 2015. The Affine Particle-in-Cell Method. ACM Transactions on Graphics. 34, 4 (2015), 51:1-51:10.

Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes. ACM Transactions on Graphics. 36, 4 (2017), 111:1-111:16.

Harmony M. Li, George Rieckenberg, Neelima Karanam, Emily Vo, and Kelsey Hurley. 2024. Optimizing Assets for Authoring and Consumption in USD. In ACM SIGGRAPH 2024 Talks. 30:1-30:2.

Yining Karl Li, Charlotte Zhu, Gregory Nichols, Peter Kutz, Wei-Feng Wayne Huang, David Adler, Brent Burley, and Daniel Teece. 2024. Cache Points for Production-Scale Occlusion-Aware Many-Lights Sampling and Volumetric Scattering. In DigiPro 2024. 6:1-6:19.

Tad Miller, Harmony M. Li, Neelima Karanam, Nadim Sinno, and Todd Scopio. 2022. Making Encanto with USD: Rebuilding a Production Pipeline Working from Home. In ACM SIGGRAPH 2022 Talks. 12:1-12:2.

Thomas Müller. 2019. Practical Path Guiding in Production. In ACM SIGGRAPH 2019 Course Notes: Path Guiding in Production. 37-50.

Thomas Müller, Markus Gross, and Jan Novák. 2017. Practical Path Guiding for Efficient Light-Transport Simulation. Computer Graphics Forum. 36, 4 (2017), 91-100.

Sean Palmer, Jonathan Garcia, Sara Drakeley, Patrick Kelly, and Ralf Habel. 2017. The Ocean and Water Pipeline of Disney’s Moana. In ACM SIGGRAPH 2017 Talks. 29:1-29:2.

Alexey Stomakhin and Andy Selle. 2017. Fluxed Animated Boundary Method. ACM Transactions on Graphics. 36, 4 (2017), 68:1-68:8.

Emily Vo, George Rieckenberg, and Ernest Petti. 2023. Honing USD: Lessons Learned and Workflow Enhancements at Walt Disney Animation Studios. In ACM SIGGRAPH 2023 Talks. 13:1-13:2.

Thijs Vogels, Fabrice Rousselle, Brian McWilliams, Gerhard Röthlin, Alex Harvill, David Adler, Mark Meyer, and Jan Novák. 2018. Denoising with Kernel Prediction and Asymmetric Loss Functions. ACM Transactions on Graphics. 37, 4 (2018), 124:1-124:15.

Tizian Zeltner, Brent Burley, and Matt Jen-Yuan Chiang. 2022. Practical Multiple-Scattering Sheen Using Linearly Transformed Cosines. In ACM SIGGRAPH 2022 Talks. 7:1-7:2.

Porting Takua Renderer to Windows on Arm

A few years ago I ported Takua Renderer to build and run on arm64 systems. Porting to arm64 proved to be a major effort (see Parts 1, 2, 3, and 4) which wound up paying off in spades; I learned a lot, found and fixed various longstanding platform-specific bugs in the renderer, and wound up being perfectly timed for Apple transitioning the Mac to arm64-based Apple Silicon. As a result, for the past few years I have been routinely building and running Takua Renderer on arm64 Linux and macOS, in addition to building and runninng on x86-64 Linux/Mac/Windows. Even though I take somewhat of a Mac-first approach for personal projects since I daily drive macOS, I make a point of maintaining robust cross-platform support for Takua Renderer for reasons I wrote about in the first part of this series.

Up unti recently though, my supported platforms list for Takua Renderer notably did not include Windows on Arm. There are two main reasons why I never ported Takua Renderer to build and run on Windows on Arm. The first reason is that Microsoft’s own support for Windows on Arm has up until recently been in a fairly nascent state. Windows RT added Arm support in 2012 but only for 32-bit processors, and Windows 10 added arm64 support in 2016 but lacked a lot of native applications and developer support; notably, Visual Studio didn’t gain native arm64 support until late in 2022. The second reason I never got around to adding Windows on Arm support is simply that I don’t have any Windows on Arm hardware sitting around and generally there just have not been many good Windows on Arm devices available in the market. However, with the advent of Qualcomm’s Oryon-based Snapdragon X SoCs and Microsoft’s push for a new generation of arm64 PCs using the Snapdragon X SoCs, all of the above finally seems to be changing. Microsoft also authorized arm64 editions of Windows 11 for use in virtual machines on Apple Silicon Macs at the beginning of this year. With Windows on Arm now clearly signaled as a major part of the future of Windows and clearly signaled as here to stay, and now that spinning up a Windows 11 on Arm VM is both formally supported and easy to do, a few weeks ago I finally got around to getting Takua Renderer up and running on native arm64 Windows 11.

Overall this process was very easy compared with my previous efforts to add support for arm64 Mac and Linux. This was not because porting architectures is easier on Windows but rather is a consequence of the fact that I had already solved all of the major architecture-related porting problems for Mac and Linux; the Windows 11 on Arm port just piggy-backed on those efforts. Because of how relatively straightforward this process was, this will be a shorter post, but there were a few interesting gotchas and details that I think are worth noting in case they’re useful to anyone else porting graphics stuff to Windows on Arm.

Note that everything in this post uses arm64 Windows 11 Pro 23H2 and Visual Studio 2022 17.10.x. Noting the specific versions used here is important since Microsoft is still actively fleshing out arm64 support in Windows 11 and Visual Studio 2022; later versions will likely see improvements to problems discussed in this post.

Figure 1: Takua Renderer running on arm64 Windows 11, in a virtual machine on an Apple Silicon Mac.

OpenGL on arm64 Windows 11

Takua has two user interface systems: a macOS-specific UI written using a combination of Dear Imgui, Metal, and AppKit, and a cross-platform UI written using a combination of Dear Imgui, OpenGL, and GLFW. On macOS, OpenGL is provided by the operating system itself as part of the standard system frameworks. On most desktop Linux distributions, OpenGL can be provided by several different sources: one option is entirely through the operating system’s provided Mesa graphics stack, another option is through a combination of Mesa for the graphics API and a proprietary driver for the backend hardware support, and the last option is entirely through a proprietary driver (such as with Nvidia’s official drivers). On Windows, however, the operating system does not provide modern OpenGL (“modern” meaning OpenGL 3.3 or newer), support whatsoever and the OpenGL 1.1 support that is available is a wrapper around Direct3D; modern OpenGL support on Windows has to be provided entirely by the graphics driver.

I don’t actually have any native arm64 Windows 11 hardware, so for this porting project, I ran arm64 Windows 11 as a virtual machine on two of my Apple Silicon Macs. I used the excellent UTM app (which under the hood uses QEMU) as the hypervisor. However, UTM does not provide any kind of GPU emulation/virtualization to Windows virtual machines, so the first problem I ran into was that my arm64 Windows 11 environment did not have any kind of modern OpenGL support due to the lack of a GPU driver with OpenGL. Therefore, I had no way to build and run Takua’s UI system.

Fortunately, because OpenGL is so widespread in commonly used applications and games, this is a problem that Microsoft has already anticipated and come up with a solution for. A few years ago, Microsoft developed and released an OpenGL/OpenCL Compatability Pack for Windows on Arm, and they’ve since also added Vulkan support to the compatability pack as well. The compatability pack is available for free on the Windows Store. Under the hood, the compatability pack uses a combination of Microsoft-developed client drivers and a bunch of components from Mesa to translate from OpenGL/OpenCL/Vulkan to Direct3D [Jiang 2020]. This system was originally developed to provide support for specifically Photoshop on arm64 Windows, but has since been expanded to provide general OpenGL 3.3, OpenCL 3.0, and Vulkan 1.2 support to all applications on arm64 Windows. Installing the compatability pack allowed me to get GLFW building and to get GLFW’s example demos working.

Takua’s cross-platform UI is capable of running either using OpenGL 4.5 on systems with support for the latest fanciest OpenGL API version, or using OpenGL 3.3 on systems that only have older OpenGL support (examples include macOS when not using the native Metal-based UI and include many SBC Linux devices such as Raspberry Pi). Since the arm64 Windows compatability pack only fully supports up to OpenGL 3.3, I set up Takua’s arm64 Windows build to fall back to only use the OpenGL 3.3 code path, which was enough to get things up and running. However, I immediately noticed that everything in the UI looked wrong; specifically, everything was clearly not in the correct color space.

The problem turned out to be that the Windows OpenGL/OpenCL/Vulkan compatability pack doesn’t seem to correctly implement GL_FRAMEBUFFER_SRGB; calling glEnable(GL_FRAMEBUFFER_SRGB) did not have any impact on the actual color space that the framebuffer rendered with. To work around this problem, I simply added software sRGB emulation to the output fragment shader and added some code to detect if GL_FRAMEBUFFER_SRGB was working or not and if not, fall back to the fragment shader’s implementation. Implementing the sRGB transform is extremely easy and is something that every graphics program inevitably ends up doing a bunch of times throughout one’s career:

float sRGB(float x) {
    if (x <= 0.00031308)
        return 12.92 * x;
    else
        return 1.055*pow(x,(1.0 / 2.4) ) - 0.055;
}

With this fix, Takua’s UI now fully works on arm64 Windows 11 and displays renders correctly:

Figure 2: The left window shows Takua running using glEnable(GL_FRAMEBUFFER_SRGB) and not displaying the render correctly, while the right window shows Takua running using sRGB emulation in the fragment shader.

Building Embree on arm64 Windows 11

Takua has a moderately sized dependency base, and getting all of the dependency base compiled during my ports to arm64 Linux and arm64 macOS was a very large part of the overall effort since arm64 support across the board was still in an early stage in the graphics field three years ago. However, now that libraries such as Embree and OpenEXR and even TBB have been building and running on arm64 for years now, I was expecting that getting Takua’s full dependency base brought up on Windows on Arm would be straightforward. Indeed this was the case for everything except Embree, which proved to be somewhat tricky to get working. I was surprised that Embree proved to be difficult, since Embree for a few years now has had excellent arm64 support on macOS and Linux. Thanks to a contribution from Apple’s Developer Ecosystem Engineer team, arm64 Embree now even has a neat double-pumped NEON option for emulating AVX2 instructions.

As of the time of writing this post, compiling Embree 4.3.1 for arm64 using MSVC 19.x (which ships with Visual Studio 2022) simply does not work. Initially just to get the renderer up and running in some form at all, I disabled Embree in the build. Takua has both an Embree-based traversal system and a standalone traversal system that uses my own custom BVH implementation; I keep both systems at parity with each other because Takua at the end of the day is a hobby renderer that I work on for fun, and writing BVH code is fun! However, a secondary reason for keeping both traversal systems around is because in the past having a non-Embree code path has been useful for getting the renderer bootstrapped on platforms that Embree doesn’t fully support yet, and this was another case of that.

Right off the bat, building Embree with MSVC runs into a bunch of problems with detecting the platform as being a 64-bit platform and also runs into all kinds of problems with including immintrin.h, which is where vector data types and other x86-64 intrinsics stuff is defined. After hacking my way through solving those problems, the next issue I ran into is that MSVC really does not like how Embree carries out static initialisation of NEON datatypes; this is a known problem in MSVC. Supposedly this issue was fixed in MSVC some time ago, but I haven’t been able to get it to work at all. Fixing this issue requires some extensive reworking of how Embree does static initialisation of vector datatypes, which is not a very trivial task; Anthony Roberts previously attempted to actually make these changes in support of getting Embree on Windows on Arm working for use in Blender, but eventually gave up since making these changes while also making sure Embree still passes all of its internal tests proved to be challenging.

In the end, I found a much easier solution to be to just compile Embree using Visual Studio’s version of clang instead of MSVC. This has to be done from the command line; I wasn’t able to get this to work from within Visual Studio’s regular GUI. From within a Developer PowerShell for Visual Studio session, the following worked for me:

cmake -G "Ninja" ../../ -DCMAKE_C_COMPILER="clang-cl" `
                        -DCMAKE_CXX_COMPILER="clang-cl" ` 
                        -DCMAKE_C_FLAGS_INIT="--target=arm64-pc-windows-msvc" `
                        -DCMAKE_CXX_FLAGS_INIT="--target=arm64-pc-windows-msvc" `
                        -DCMAKE_BUILD_TYPE=Release `
                        -DTBB_ROOT="[TBB LOCATION HERE]" `
                        -DCMAKE_INSTALL_PREFIX="[INSTALL PREFIX HERE]"

To do the above, of course you will need both CMake and Ninja installed; fortunately both come with pre-built arm64 Windows binaries on their respective websites. You will also need to install the “C++ Clang Compiler for Windows” component in the Visual Studio Installer application if you haven’t already.

Just building with clang is also the solution that Blender eventually settled on for Windows on Arm, although Blender’s version of this solution is a bit more complex since Blender builds Embree using its own internal clang and LLVM build instead of just using the clang that ships with Visual Studio.

An additional limitation in compiling Embree 4.3.1 for arm64 on Windows right now is that ISPC support seems to be broken. On arm64 macOS and Linux this works just fine; the ISPC project provides prebuilt arm64 binaries on both platforms, and even without a prebuilt arm64 binary, I found that running the x86-64 build of ISPC on arm64 macOS via Rosetta 2 worked without a problem when building Embree. However, on arm64 Windows 11, even though the x86-64 emulation system ran the x86-64 build of ISPC just fine standalone, trying to run it as part of the Embree build didn’t work for me despite me trying a variety of ways to get it to work. I’m not sure if this works with a native arm64 build of ISPC; building ISPC is a sufficiently involved process that I decided it was out of scope for this project.

Running x86-64 code on arm64 Windows 11

Much like how Apple provides Rosetta 2 for running x86-64 applications on arm64 macOS, Microsoft provides a translation layer for running x86 and x86-64 applications on arm64 Windows 11. In my post on porting to arm64 macOS, I included a lengthy section discussing and performance testing Rosetta 2. This time around, I haven’t looked as deeply into x86-64 emulation on arm64 Windows, but I did do some basic testing. Part of why I didn’t go as deeply into this area on Windows is because I’m running arm64 Windows 11 in a virtual machine instead of on native hardware- the comparison won’t be super fair anyway. Another part of why I didn’t go in as deeply is because x86-64 emulation is something that continues to be in an active state of development on Windows; Windows 11 24H2 is supposed to introduce a new x86-64 emulation system called Prism that Microsoft promises to be much faster than the current system in 23H2 [Mehdi 2024]. As of writing though, little to no information is available yet on how Prism works and how it improves on the current system.

The current system for emulating x86 and x86-64 on arm64 Windows is a fairly complex system that differs greatly from Rosetta 2 in a lot of ways. First, arm64 Windows 11 supports emulating both 32-bit x86 and 64-bit x86-64, whereas macOS dropped any kind of 32-bit support long ago and only needs to support 64-bit x86-64 on 64-bit arm64. Windows actually handles 32-bit x86 and 64-bit x86-64 through two basically completely different systems. 32-bit x86 is handled through an extension of the WoW64 (Windows 32-bit on Windows 64-bit) system, while 64-bit x86-64 uses a different system. The 32-bit system uses a JIT compiler called xtajit.dll [Radich et al. 2020, Beneš 2018] to translate blocks of x86 assembly to arm64 assembly and has a caching mechanism for JITed code blocks similar to Rosetta 2 to speed up execution of x86 code that has already been run through the emulation system before [Cylance Research Team 2019]. In the 32-bit system, overall support for providing system calls and whatnot are handled as part of the larger WoW64 system.

The 64-bit system relies on a newer mechanism. The core binary translation system is similar to the 32-bit system, but providing system calls and support for the rest of the surrounding operatin system doesn’t happen through WoW64 at all and instead relies on something that is in some ways similar to Rosetta 2, but is in other crucial ways radically different from Rosetta 2 or the 32-bit WoW64 approach. In Rosetta 2, arm64 code that comes from translation uses a completely different ABI from native arm64 code; the translated arm64 ABI contains a direct mapping between x86-64 and arm64 registers. Microsoft similarly uses a different ABI for translated arm64 code compared with native arm64 code; in Windows, translated arm64 code uses the arm64EC (EC for “Emulation Compatible”) ABI. Here though we find the first major difference between the macOS and Windows 11 approaches. In Rosetta 2, the translated arm64 ABI is an internal implementation detail that is not exposed to users or developers whatsoever; by default there is no way to compile source code against the translated arm64 ABI in Xcode. In the Windows 11 system though, the arm64EC ABI is directly available to developers; Visual Studio 2022 supports compiling source code against either the native arm64 or the translation-focused arm64EC ABI. Code built as arm64EC is capable of interoperating with emulated x86-64 code within the same process, the idea being that this approach allows developers to incrementally port applications to arm64 piece-by-piece while leaving other pieces as x86-64 [Sweetgall et al. 2023]. This… is actually kind of wild if you think about it!

The second major difference between the macOS and Windows 11 approaches is even bigger than the first. On macOS, application binaries can be fat binaries (Apple calls these universal binaries), which contain both full arm64 and x86-64 versions of an application and share non-code resources within a single universal binary file. The entirety of macOS’s core system and frameworks ship as universal binaries, such that at runtime Rosetta 2 can simply translate both the entirety of the user application and all system libraries that the application calls out to into arm64. Windows 11 takes a different approach- on arm64, Windows 11 extends the standard Windows portable executable format (aka .exe files) to be a hybrid binary format called arm64X (X for eXtension). The arm64X format allows for arm64 code compiled against the arm64EC ABI and emulated x86-64 code to interoperate within the same binary; x86-64 code in the binary is translated to arm64EC as needed. Pretty much every 64-bit system component of Windows 11 on Arm ships as arm64X binaries [Niehaus 2021]. Darek Mihocka has a fantastic article that goes into extensive depth about how arm64EC and arm64X work, and Koh Nakagawa has done an extensive analysis of this system as well.

One thing that Windows 11’s emulation system does not seem to be able to do is make special accomodations for TSO memory ordering. As I explored previously, Rosetta 2 gains a very significant performance boost from Apple Silicon’s hardware-level support for emulating x86-64’s strong memory ordering. However, since Microsoft cannot control and custom tailor the hardware that Windows 11 will be running on, arm64 Windows 11 can’t make any guarantees about hardware-level TSO memory ordering support. I don’t know if this situation is any different with the new Prism emulator running on the Snapdragon X Pro/Elite, but in the case of the current emulation framework, the lack of hardware TSO support is likely a huge problem for performance. In my testing of Rosetta 2, I found that Takua typically ran about 10-15% slower as x86-64 under Rosetta 2 with TSO mode enabled (the default) compared with native arm64, but ran 40-50% slower as x86-64 under Rosetta 2 with TSO mode disabled compared with native arm64.

Below are some numbers comparing running Takua on arm64 Windows 11 as a native arm64 application versus as an emulated x86-64 application. The tests used are the same as the ones I used in my Rosetta 2 tests, with the same settings as before. In this case though, because this was all running in a virtual machine (with 6 allocated cores) instead of directly on hardware, the absolute numbers are not as important as the relative difference between native and emulated modes:

  CORNELL BOX  
  1024x1024, PT  
Test: Wall Time: Core-Seconds:
Native arm64 (VM): 60.219 s approx 361.314 s
Emulated x86-64 (VM): 202.242 s approx 1273.45 s
  TEA CUP  
  1920x1080, VCM  
Test: Wall Time: Core-Seconds:
Native arm64 (VM): 244.37 s approx 1466.22 s
Emulated x86-64 (VM): 681.539 s approx 4089.24 s
  BEDROOM  
  1920x1080, PT  
Test: Wall Time: Core-Seconds:
Native arm64 (VM): 530.261 s approx 3181.57 s
Emulated x86-64 (VM): 1578.76 s approx 9472.57 s
  SCANDINAVIAN ROOM  
  1920x1080, PT  
Test: Wall Time: Core-Seconds:
Native arm64 (VM): 993.075 s approx 5958.45 s
Emulated x86-64 (VM): 1745.5 s approx 10473.0 s

The emulated results are… not great; for compute-heavy workloads like path tracing, x86-64 emulation on arm64 Windows 11 seems to to be around 1.7x to 3x slower than native arm64 code. These results are much slower compared with how Rosetta 2 performs, which generally sees only a 10-15% performance penalty over native arm64 when running Takua Renderer. However, a critical caveat has to be pointed out here: reportedly Windows 11’s x86-64 emulation works worse in a VM on Apple Silicon than it does on native hardware because Arm RCpc instructions on Apple Silicon are relatively slow. For Rosetta 2 this behavior doesn’t matter because Rosetta 2 uses TSO mode instead of RCpc instructions for emulating strong memory ordering, but since Windows on Arm does rely on RCpc for emulating strong memory ordering, this means that the results above are likely not fully representative of emulation performance on native Windows on Arm hardware. Nonetheless though, having any form of x86-64 emulation at all is an important part of making Windows on Arm viable for mainstream adoption, and I’m looking forward to see how much of an improvement the new Prism emulation system in Windows 11 24H2 brings. I’ll update these results with the Prism emulator once 24H2 is released, and I’ll also update these results to show comparisons on real Windows on Arm hardware whenever I actually get some real hardware to try out.

Conclusion

I don’t think that x86-64 is going away any time soon, but at the same time, the era of mainstream desktop arm64 adoption is here to stay. Apple’s transition to arm64-based Apple Silicon already made the viability of desktop arm64 unquestionable, and now that Windows on Arm is finally ready for the mainstream as well, I think we will now be living in a multi-architecture world in the desktop computing space for a long time. Having more competitors driving innovation ultimately is a good thing, and as new interesting Windows on Arm devices enter the market alongside Apple Silicon Macs, Takua Renderer is ready to go!

References

ARM Holdings. 2022. Load-Acquire and Store-Release instructions. Retrieved June 7, 2024.

Petr Beneš. 2018. Wow64 Internals: Re-Discovering Heaven’s Gate on ARM. Retrieved June 5, 2024.

Cylance Research Team. 2019. Teardown: Windows 10 on ARM - x86 Emulation. In BlackBerry Blog. Retrieved June 5, 2024.

Angela Jiang. 2020. Announcing the OpenCL™ and OpenGL® Compatibility Pack for Windows 10 on ARM. In DirectX Developer Blog. Retrieved June 5, 2024.

Yusuf Mehdi. 2024. Introducing Copilot+ PCs. In Official Microsoft Blog. Retrieved June 5, 2024.

Derek Mihocka. 2024. ARM64 Boot Camp. Retrieved June 5, 2024.

Koh M. Nakagawa. 2021. Discovering a new relocation entry of ARM64X in recent Windows 10 on Arm. In Project Chameleon. Retrieved June 5, 2024.

Koh M. Nakagawa. 2021. Relock 3.0: Relocation-based obfuscation revisited in Windows 11 on Arm. In Project Chameleon. Retrieved June 5, 2024.

Michael Niehaus. 2021. Running x64 on Windows 10 ARM64: How the heck does that work?. In Out of Office Hours. Retrieved June 5, 2024.

Quinn Radich, Karl Bridge, David Coulter, and Michael Satran. 2020. WOW64 Implementation Details. In Programming Guide for 64-bit Windows. Retrieved June 5, 2024.

Marc Sweetgall, Drew Batchelor, Scott Jones, and Matt Wojciakowski. 2023. Arm64EC - Build and port apps for native performance on ARM. Retrieved June 5, 2024.

Wikipedia. 2024. WoW64. Retrieved June 5, 2024.

SIGGRAPH 2023 Conference Paper- Progressive Null-tracking for Volumetric Rendering

This year at SIGGRAPH 2023, we have a conference-track technical paper in collaboration with Zackary Misso and Wojciech Jarosz from Dartmouth College! The paper is titled “Progressive Null-tracking for Volumetric Rendering” and is the result of work that Zackary did while he was a summer intern with the Hyperion development team last summer. On the Disney Animation side, Brent Burley, Dan Teece, and I oversaw Zack’s internship work, while on the the Dartmouth side, Wojciech was involved in the project as both Zack’s PhD advisor and as a consultant to Disney Animation.

Figure 1 from the paper: Most existing unbiased null-scattering methods for heterogeneous participating media require knowledge of a maximum density (majorant) to perform well. Unfortunately, bounding majorants are difficult to guarantee in production, and existing methods like ratio tracking and weighted delta tracking (top, left) suffer from extreme variance if the “majorant” (𝜇𝑡 =0.01) significantly underestimates the maximum density of the medium (𝜇𝑡 ≈3.0). Starting with the same poor estimate for a majorant (𝜇𝑡 = 0.01), we propose to instead clamp the medium density to the chosen majorant. This allows fast, low-variance rendering, but of a modified (biased) medium (top, center). We then show how to progressively update the majorant estimates (bottom row) to rapidly reduce this bias and ensure that the running average (top right) across multiple pixel samples converges to the correct result in the limit.

Here is the paper abstract:

Null-collision approaches for estimating transmittance and sampling free-flight distances are the current state-of-the-art for unbiased rendering of general heterogeneous participating media. However, null-collision approaches have a strict requirement for specifying a tightly bounding total extinction in order to remain both robust and performant; in practice this requirement restricts the use of null-collision techniques to only participating media where the density of the medium at every possible point in space is known a-priori. In production rendering, a common case is a medium in which density is defined by a black-box procedural function for which a bounding extinction cannot be determined beforehand. Typically in this case, a bounding extinction must be approximated by using an overly loose and therefore computation- ally inefficient conservative estimate. We present an analysis of how null-collision techniques degrade when a more aggressive initial guess for a bounding extinction underestimates the true maximum density and turns out to be non-bounding. We then build upon this analysis to arrive at two new techniques: first, a practical, efficient, consistent progressive algorithm that allows us to robustly adapt null-collision techniques for use with procedural media with unknown bounding extinctions, and second, a new importance sampling technique that improves ratio-tracking based on zero-variance sampling.

The paper and related materials can be found at:

One cool thing about this project is that this project both served as a direct extension of Zack’s PhD research area and served as a direct extension of the approach we’ve been taking to volume rendering in Disney’s Hyperion Renderer over the past 6 years. Hyperion has always used unbiased transmittance estimators for volume rendering (as opposed to biased ray marching) [Fong et al. 2017], and Hyperion’s modern volume rendering system is heavily based on null-collision theory [Woodcock et al. 1965]. We’ve put significant effort into making a null-collision based volume rendering system robust and practical in production, which led to projects such as residual ratio tracking [Novák et al. 2014], spectral and decomposition tracking [Kutz et al. 2017] and approaches for unbiased emission and scattering importance sampling in heterogeneous volumes [Huang et al. 2021]. Over the past decade, many other production renderers [Christensen et al. 2018, Gamito 2018, Novák et al. 2018] have similarly made the shift to null-collision based volume rendering because of the many benefits that the null-collision framework brings, such as unbiased volume rendering and efficient handling of volumes with lots of high-order scattering due to the null-collision framework’s ability to cheaply perform distance sampling. Vanilla null-collision volume rendering does have shortcomings, such as difficulty in efficiently sampling optically thin volumes due to the fact that null-collision tracking techniques produce a binary transmittance estimate that is super noisy. A lot of progress has been made in improving null-collision volume rendering’s efficiency and robustness in these thin volumes cases [Villemin and Hery 2013, Villemin et al. 2018, Herholz et al. 2019, Miller et al. 2019]; the intro to the paper goes into much more extensive detail about these advancements.

However, one major limitation of null-collision volume rendering that remained unsolved until this paper is that the null-collision framework requires knowing the maximum density, or bounding majorant of a heterogeneous volume beforehand. This is a fundamental requirement of null-collision volume rendering that makes using procedurally defined volumes difficult, since the maximum possible density value of a procedurally defined volume cannot be known a-priori without either putting into place a hard clamp or densely evaluating the procedural function. As a result, renderers that use null-collision volume rendering typically only support procedurally defined volumes by pre-rasterizing the procedural function onto a fixed voxel grid, à la the volume pre-shading in Manuka [Fascione et al. 2018]. The need to pre-rasterize procedural volumes negates a lot of the workflow and artistic benefits of using procedural volumes; this is one of several reasons why other renderers continue to use ray-marching based integrators for volumes despite the bias and loss of efficiency at handling high-order scattering. Inspired by ongoing challenges we were facing with rendering huge volume-scapes on Strange World at the time, we gave Zack a very open-ended challenge for his internship: brainstorm and experiment with ways to lift this limitation in null-collision volume rendering.

Zack’s PhD research coming into this internship revolved around deeply investigating the math behind modern volume rendering theory, and from these investigations, Zack had previously found deep new insights into how to formulate volumetric transmittance [Georgiev et al. 2019] and cool new ways to de-bias previously biased techniques such as ray marching [Misso et al. 2022]. Zack’s solution to the procedural volumes in null-collision volume rendering problem very much follows in the same trend as his previous papers; after initially attempting to find ways to adapt de-biased ray marching to fit into a null-collision system, Zack went back to first principles and had the insight that a better solution was to find a way to de-bias the result that one gets from clamping the majorant of a procedural function. This idea really surprised me when he first proposed it; I had never thought about the problem from this perspective before. Dan, Brent, and I were highly impressed!

In addition to the acknowledgements in the paper, I wanted to acknowledge here Henrik Falt and Jesse Erickson from Disney Animation, who spoke with Zack and us early in the project to help us better understand how better procedural volumes support in Hyperion could benefit FX artist workflows. We are also very grateful to Disney Animation’s CTO, Nick Cannon, for granting us permission to include example code implemented in Mitsuba as part of the paper’s supplemental materials.

One of my favorite images from this paper: a procedurally displaced volumetric Stanford bunny rendered using the progressive null tracking technique from the paper.

A bit of a postscript: during the Q&A session after Zack’s paper presentation at SIGGRAPH, Zack and I had a chat with Wenzel Jakob, Merlin Nimier-David, Delio Vicini, and Sébastien Speierer from EPFL’s Realistic Graphics Lab. Wenzel’s group brought up a potential use case for this paper that we hadn’t originally thought of. Neural radiance fields (NeRFs) [Mildenhall et al. 2020, Takikawa et al. 2023] are typically rendered using ray marching, but this is often inefficient. Rendering NeRFs using null tracking instead of ray marching is an interesting idea, but the neural networks that underpin NeRFs are essentially similar to procedural functions as far as null-collision tracking is concerned because there’s no way to know a tight bounding majorant for a neural network a-priori without densely evaluating the neural network. Progressive null tracking solves this problem and potentially opens the door to more efficient and interesting new ways to render NeRFs! If you happen to be interested in this problem, please feel free to reach out to Zack, Wojciech, and myself.

Getting to work with Zack and Wojciech on this project was an honor and a blast; I count myself as very lucky that working at Disney Animation continues to allow me to meet and work with rendering folks from across our field!

References

Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. The Design and Evolution of Disney’s Hyperion Renderer. ACM Transactions on Graphics. 37, 3 (2018), 33:1-33:22.

Per H. Christensen, Julian Fong, Jonathan Shade, Wayne L Wooten, Brenden Schubert, Andrew Kensler, Stephen Friedman, Charlie Kilpatrick, Cliff Ramshaw, Marc Bannister, Brenton Rayner, Jonathan Brouillat, and Max Liani. 2018. RenderMan: An Advanced Path Tracing Architecture for Movie Rendering. ACM Transactions on Graphics. 37, 3 (2018), 30:1-30:21.

Luca Fascione, Johannes Hanika, Mark Leone, Marc Droske, Jorge Schwarzhaupt, Tomáš Davidovič, Andrea Weidlich, and Johannes Meng. 2018. Manuka: A Batch-Shading Architecture for Spectral Path Tracing in Movie Production. ACM Transactions on Graphics. 37, 3 (2018), 31:1-31:18.

Julian Fong, Magnus Wrenninge, Christopher Kulla, and Ralf Habel. 2017. Production Volume Rendering. In ACM SIGGRAPH 2021 Courses. 2:1-2:97.

Manuel Gamito. Path Tracing the Framestorian Way. In SIGGRAPH 2018 Course Notes: Path Tracing in Production. 52-61.

Sebastian Herholz, Yangyang Zhao, Oskar Elek, Derek Nowrouzezahrai, Hendrik P A Lensch, and Jaroslav Křivánek. Volume Path Guiding Based on Zero-Variance Random Walk Theory. ACM Transactions on Graphics. 38, 3 (2019), 24:1-24:19.

Wei-Feng Wayne Huang, Peter Kutz, Yining Karl Li, and Matt Jen-Yuan Chiang. 2021. Unbiased Emission and Scattering Importance Sampling For Heterogeneous Volumes. In ACM SIGGRAPH 2021 Talks. 3:1-3:2.

Peter Kutz, Ralf Habel, Yining Karl Li, and Jan Novák. 2017. Spectral and Decomposition Tracking for Rendering Heterogeneous Volumes. ACM Transactions on Graphics. 36, 4 (2017), 111:1-111:16.

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV 2020: Proceedings of the 16th European Conference on Computer Vision. 405-421.

Bailey Miller, Iliyan Georgiev, and Wojciech Jarosz. 2019. A Null-Scattering Path Integral Formulation of Light Transport. ACM Transactions on Graphics. 38, 4 (@019), 44:1-44:13.

Jan Novák, Iliyan Georgiev, Johannes Hanika, and Wojciech Jarosz. 2018. Monte Carlo Methods for Volumetric Light Transport Simulation. Computer Graphics Forum. 37, 2 (2018), 551-576.

Jan Novák, Andrew Selle and Wojciech Jarosz. 2014. Residual Ratio Tracking for Estimating Attenuation in Participating Media. ACM Transactions on Graphics. 33, 6 (2014), 179:1-179:11.

Towaki Takikawa, Shunsuke Saito, James Tompkin, Vincent Sitzmann, Srinath Sridhar, Or Litany, and Alex Yu. Neural Fields for Visual Computing. In ACM SIGGRAPH 2023 Courses. 10:1-10:227.

Ryusuke Villemin and Christophe Hery. 2013. Practical Illumination from Flames. Journal of Computer Graphics Techniques. 2, 2 (2013), 142-155.

Ryusuke Villemin, Magnus Wrenninge, and Julian Fong. 2018. Efficient Unbiased Rendering of Thin Participating Media. Journal of Computer Graphics Techniques. 7, 3 (2018), 50-65.

E. R. Woodcock, T. Murphy, P. J. Hemmings, and T. C. Longworth. 1965. Techniques used in the GEM code for Monte Carlo neutronics calculations in reactors and other systems of complex geometry. In Applications of Computing Methods to Reactor Problems. Argonne National Laboratory.

SIGGRAPH 2022 Talk- "Encanto" - Let's Talk About Bruno's Visions

This year at SIGGRAPH 2022, Corey Butler, Brent Burley, Wei-Feng Wayne Huang, Benjamin Huang, and I have a talk that presents the technical and artistic challenges and solutions that went into creating the holographic look for Bruno’s visions in Encanto. In Encanto, Bruno is a character who has a magical gift of being able to see into the future, and the visions he sees of the future get crystalized into a sort of glassy emerald tablet with the vision embedded in the glassy surface with a holographic effect. Coming up with this unique look and an efficient and robust authoring workflow required a tight collaboration between visual development, lookdev, lighting, and the Hyperion rendering team to develop a custom solution in Disney’s Hyperion Renderer. On the artist side, Corey was the main lighter and Benjamin was the main lookdev artist for this project, while on the rendering team side, Wayne and I worked closely together to develop a series of prototype shaders that were instrumental in defining how the effect should look and then Brent came up with the implementation approach for the final production version of the shader. This project was a lot of fun to be a part of and in my opinion really demonstrates the benefits of having an in-house rendering team that works closely with and embedded within a production context.

An alternate, higher-res version of Figure 1 from the paper: creating the holographic look for Bruno’s visions required close collaboration between visdev, look, lighting, and technology. The final look for Bruno's visions required a new, bespoke teleportation shader developed in Disney's Hyperion Renderer

Here is the paper abstract:

In Walt Disney Animation Studios’ “Encanto”, Mirabel discovers the remnants of her Uncle Bruno’s mysterious visions of the future. Developing the look and lighting for the emerald shards required close collaboration between our Visual Development, Look Development, Lighting, and Technology departments to create a holographic effect. With an innovative new teleporting holographic shader, we were able to bring a unique and unusual effect to the screen.

The paper and related materials can be found at:

When Corey first came to the rendering team with the request for a more efficient way to create the hologram effect that lighting had prototyped using camera mapping, our initial instinct actually wasn’t to develop a new shader at all. Hyperion has an existing “hologram” shader that was developed for use on Big Hero 6 [Joseph et al. 2014], and our initial instinct was to tell Corey that they should use the hologram shader. The way the Big Hero 6 era hologram shader works is: upon hitting a surface that has the hologram shader applied, the ray is moved into a virtual space containing a bunch of imaginary parallel planes, with each plane textured with a 2D slice of a 3D interior. In some ways the hologram shader can be thought of as raymarching through a sparse volumetric representation of a 3D interior, but the sparse volumetric interior really is just a stack of 2D slices. This technique works really well for things like building interiors seen through glass windows. However, our artists… really dislike using the hologram shader, to put things lightly. The problem with the hologram shader is that setting up the 2D slices that are inputs to the shader is an incredibly annoying and difficult process, and since the 2D slice baker has to be run as an offline process before the shader can be authored and rendered, making changes and iterating on the contents of the hologram shader is a slow process. Furthermore, if the inside of the hologram shader has to be animated, the slice baker needs to be run for every frame. We were told in no uncertain terms that the hologram shader was likely more work to set up and iterate on than the already painful manual camera mapping approach that the artists had prototyped the effect with. This request also came to us fairly late in Encanto’s production schedule, so easy setup and fast iteration times along with an extremely accelerated development timeline were hard requirements for whatever approach we took.

Upon receiving this feedback, Wayne and I set out to prototype a version of the teleportation shader that Pixar came up with for the portals in Incredibles 2 [Coleman et al. 2014]. This process was a lot of fun; Wayne and I spent a few days rapidly iterating on several different ideas for both how to implement ray teleportation in Hyperion and on how the artist workflow and interface for this new teleportation system should work. At the same time that we were prototyping, we started giving test builds of our latest prototypes to Corey to try out, which produced a feedback loop where Corey would use our prototypes to further iterate on how the final effect would look and go back and forth with the movie’s production designer and we would use Corey’s feedback to further improve the prototype. One example of where our prototype directly informed the final look was in how the prophecies fade away towards the edges of the emerald tablet- Wayne and I threw in a feature where artists could use a map to paint in the ratio of teleportation effect versus normal surface BSDF that would be applied at each surface point, and this feature wound up driving the faded edges.

The key thing that made our new approach work better than the old hologram shader was in simplicity of setup. Instead of having to run a pre-bake process and then wire up a whole bunch of texture slices into the renderer, our new approach was designed so that all an artist had to do was set up the 3D geometry that they wanted to put inside of the hologram in a target space hidden somewhere in the overall scene (typically below the ground plane in a black box or something), and then select the geometry in the main scene that they wanted to act as the “entrance” portal, select the geometry in the target space that they wanted to act as the “exit” portal, and link the two using the teleportation shader. The renderer then did all of the rest of the work of figuring out how each point on the entrance portal corresponded to the surface of the exit portal, how transforms needed to be calculated, and so on and so forth. Multiple portal pairs could be set up in a single scene too, and the contents of a world seen through a portal could contain more portals, all of which was important because in the movie, Mirabel initially finds Bruno’s prophecy broken into shards, which had to be set up as a separate entrance portal per shard all into the same interior world. Since all of this just piggy-backed off of the normal way artists set up scenes, things like animation just worked out-of-the-box with no additional code or effort.

The last piece of the puzzle fell into place when Wayne and I discussed our progress with Brent. One of the big remaining challenges for us was that tracking correspondences between entrance and exit geometry and transforms was prone to easy breakage if input geometry wasn’t set up exactly the way we expected. At the time Brent was working on a new fracture-aware tessellation system for subdivision surfaces in Hyperion [Burley and Rodriguez 2022], and Brent quickly realized that the approach we were using for figuring out the transform from the entrance to the exit portal could be replaced with something he had already developed for the fracture-aware tessellation system. Specifically, the fracture-aware tessellation system has to be able to calculate correspondences between undeformed unfractured reference points and corresponding points in a deformed fractured fragment space; this is done using a best-fit process to find orthonormal transforms [Horn et al. 1998]. Brent realized that the problem we were trying to solve was actually the same problem he that he had already solved in the fracture system, so he took our latest prototype and reworked the internals to use the same best-fit orthonormal transform solution as in the fracturing system. With Brent’s improvements, we arrived at the final production version of the teleportation shader used on Encanto.

Going from the start of brainstorming and prototyping to delivering the final production version of the shader took us a little over a week, which anyone who has worked in an animation/VFX production setting before will know is very fast for a large new rendering feature. Working tightly with Corey and Benjamin to simultaneously iterate on the art and the software and inform each other was key to this project’s fast development time and key to achieving an amazing looking effect in the film. At Disney Animation, we have a mantra that goes “art challenges technology and technology inspires the art”- this project was a case that exemplifies how we carry out that mantra in real-world filmmaking and demonstrates the amazing results that come out of such a process. Bruno’s visions in Encanto are every bit a case where the artistic vision challenged us to develop new technology, and the process of iterating on the new technology between engineers and artists in turn informed the final artwork that made it into the movie; for me, projects like these are one of the things that makes Disney Animation such a fun and amazing place to be.

A short GIF showing two examples of the final effect. For many more examples, go watch Encanto on Disney+!

References

Brent Burley, David Adler, Matt Jen-Yuan Chiang, Hank Driskill, Ralf Habel, Patrick Kelly, Peter Kutz, Yining Karl Li, and Daniel Teece. 2018. The Design and Evolution of Disney’s Hyperion Renderer. ACM Transactions on Graphics. 37, 3 (2018), 33:1-33:22.

Brent Burley and Francisco Rodriguez. 2022. Fracture-Aware Tessellation of Subdivision Surfaces. In ACM SIGGRAPH 2022 Talks. 10:1-10:2.

Patrick Coleman, Darwyn Peachey, Tom Nettleship, Ryusuke Villemin, and Tobin Jones. 2018. Into the Voyd: Teleportation of Light Transport in Incredibles 2. In DigiPro ‘18: Proceedings of the 8th Annual Digital Production Symposium. 12:1-12:4.

Berthold K. P. Horn, Hugh M. Hilden, and Shahriar Negahdaripour. 1988. Close-Form Solution of Absolute Orientation using Orthonormal Matrices. Journal of the Optical Society of America A. 5, 7 (July 1988), 1127–1135.

Norman Moses Joseph, Brett Achorn, Sean D. Jenkins, and Hank Driskill. Visualizing Building Interiors Using Virtual Windows. In ACM SIGGRAPH Asia 2014 Technical Briefs. 18:1-18:4.

Encanto

For the first time since 2016, Walt Disney Animation Studios is releasing not just one animated feature in a year, but two! The second Disney Animation release of 2021 is Encanto, which marks a major milestone as Disney Animation’s 60th animated feature film. Encanto is a musical set in Colombia about a girl named Mirabel and her family: the amazing, fantastical, magical Madrigals. I’m proud of every Disney Animation project that I’ve had the privilege to work on, but I have to admit that this year was something different and something very special to me, because this year we completed both Raya and the Last Dragon and Encanto, which are together two of my favorite Disney Animation projects so far. Earlier this year, I wrote about the amazing work that went into Raya and the Last Dragon and why I loved working on that project; with Encanto now in theaters, I now get to share why I’ve loved working on Encanto so much as well!

Disney Animation feature films take many years and hundreds of people to make, and often the film’s story can remain in a state of flux for much of the film’s production. All of the above isn’t unusual; large-scale creative endeavors like filmmaking often entail an extremely complex and challenging process. More often than not, a film requires time and many iterations to really find its voice and gain that spark that makes it a great film. Encanto, however, is a film that a lot of my coworkers and I realized was going to be really special very early on in production. Now obviously, that hunch didn’t mean that making Encanto was easy by any means; every film requires tons of hard work from the most amazing, inspiring, talented artists and engineers that I know. But, I think in the end, that initial hunch about Encanto was proven correct: the finished Encanto has a story that is bursting with warmth and meaning, has one of Disney Animation’s best main characters to date with a huge cast of charming supporting characters, has the most beautiful, magical animation and visuals we’ve ever done, and sets all of the above to a wonderful soundtrack with a bunch of catchy, really cleverly written new songs. Both the production process and final film for Encanto were a strong reminder for me of why I love working on Disney Animation films in the first place.

From a technical perspective, Encanto also represents something very special in the history of Disney Animation’s continual advancements in animation technology. To understand why this is, a very brief history review about Disney Animation’s modern production pipeline and toolset is helpful. In retrospect, Disney Animation’s 50th animated feature film, Tangled, was probably one of the most important films the studio has ever made from a technical perspective, because the production of Tangled required a near-total ground-up rebuild of the studio’s production pipeline and tools that wound up laying the technical foundations for Disney Animation’s modern era. While every film we’ve made since Tangled has seen us make enormous technical strides in a variety of eras, the starting point of the production pipeline we’ve used and evolved for every CG film up until Encanto were put into place during Tangled. The fact that Encanto is Disney Animation’s 60th animated feature film is therefore fitting; Encanto is the first film made using the successor to the production pipeline that was first built for Tangled, and just like how Tangled laid the technical foundations for the subsequent ten films that followed, Encanto lays the technical foundations for many more future films to come! As presented in the USD Birds of a Feather session at SIGGRAPH 2021, this new production pipeline is built on the open-source Universal Scene Description project and brings massive upgrades to almost every piece of software and every custom tool that our artists use. An absolutely monumental amount of work was put into building a new USD-based world at Disney Animation, but I think the effort was extremely worthwhile: thanks to the work done on Encanto, Disney Animation is now well set up for another decade of technical innovation and another decade of pushing animation as a medium forward. I’m hoping that we’ll be able to present much more on this topic at SIGGRAPH 2022!

Moving to a new production pipeline meant also moving Disney’s Hyperion Renderer to work in the new production pipeline. To me, one of the biggest advantages of an in-house production renderer is the ability for the renderer development team to work extremely closely with other teams in the studio in an integrated fashion, and moving Hyperion to work well in the new USD-based world exemplifies just how important this collaboration is. We couldn’t have pulled off this effort without the huge amount of amazing work that engineers and TDs and artists from many other departments pitched in. However, having to move an existing renderer to a new pipeline isn’t the only impact on rendering that the new USD-based world has had. One of the most exciting things about the new pipeline is all of the new possibilities and capabilities that USD and Hydra unlocks; one of the biggest projects our rendering team worked on during Encanto’s production was a new, very exciting next-generation rendering project. I can’t talk too much about this project yet; all I can say is that we see it as a major step towards the future of rendering at Disney Animation, and that even in its initial deployment on Encanto, we’ve already seen huge fundamental improvements to how our lighters work every day. Hopefully we’ll be able to reveal more soon!

Of course, just because Encanto saw huge foundational changes to how we make movies doesn’t mean that there weren’t the usual fun and interesting show-specific challenges as well. Encanto presented many new, weird, fun problems for the rendering team to think about. Geometry fracturing was a major effect used extensively throughout Encanto, and in order to author and render fractured geometry as efficiently as possible, the rendering team had to devise some really clever new geometry-processing features in Hyperion. Encanto’s cinematography direction called for a beautiful, really colorful look that required pushing artistic controllability in our lighting capabilities even further, and to that end our team developed a bunch of cool new artistic control enhancements in Hyperion’s volume rendering and light shaping systems. One of my favorite show-specific challenges that I got to work on for Encanto was for the holographic effect in Bruno’s emerald crystal prophecies. For a variety of reasons, the artists wanted this effect done completely in-render; coming up with an in-render solution required many iterations and prototypes and experiments carried out over several months through a close collaboration between a number of artists and TDs and the rendering team. Encanto also saw continued advancements to Hyperion’s state-of-the-art deep-learning denoiser and stereo rendering solutions and saw continued advancements in Hyperion’s shading models and traversal system. These advancements helped us tackle many of the interesting complexity and scaling challenges that Encanto presented; effects like Isabella’s flowers and the glowing magical particles associated with the Madrigal family’s miracle pushed instancing counts to incredible new record levels, and for the first time ever on a Disney Animation film, we actually rendered some of the gorgeous costumes in the movie not as displaced triangle meshes with fuzz on top, but as actual woven curves at the thread-level. The latter proved crucial to creating the chiffon and tulle in Isabella’s outfit and was a huge part in creating the look of Mirabel’s characteristic custom-embroidered skirt. My mind was thoroughly blown when I saw those renders for the first time; on every film, I’m constantly amazed and impressed by what our artists can do with the tools we provide them with. Again, I’m hoping that we’ll be able to share much more about all of these things later; keep an eye on SIGGRAPH 2022!

Encanto also saw rendering features that we first developed for previous films pushed even further and used in interesting new ways. We first deployed a path guiding implementation in Hyperion back on Frozen 2, but path guiding wound up not seeing too much use on Raya and the Last Dragon since Raya’s setting was mostly outdoors, and path guiding doesn’t help as much in direct-lighting dominant scenarios such as outdoor scenes. However, since a huge part of Encanto takes place inside of the magical Madrigal casita, indoor indirect illumination was a huge component of Encanto’s lighting. We found that path guiding provided enormous benefits to render times in many indoor scenes, and especially in settings like the Madrigal family’s kitchen at night, where lighting was almost entirely provided by outdoor light sources coming in through windows and from candles and stuff. I think this case was a great example of how we benefit from how closely our lighting artists and our rendering engineers work together on many shows over time; because we had all worked together on similar problems before, we all had shared experiences with past solutions that we were able to draw on together to quickly arrive at a common understanding of the new challenges on Encanto. Another good example of how this collaboration continues to pay dividends over time is in the choices of lens and bokeh effects that were used on Encanto. For Raya and the Last Dragon, we learned a lot about creating non-uniform bokeh and interesting lensing effects, and what we learned on Raya in turn helped further inform early cinematography and lensing experiments on Encanto.

In addition to all of the cool renderer development work that I usually do, I also got to take part in something a little bit different on Encanto. Every year, the lighting department brings on a handful of trainees, who are put through several months of in-studio “lighting school” to learn our tools and pipeline and approach to lighting before lighting real shots on the film itself. This year, I got to join in with the lighting trainees while they were going through lighting training; this experience wound up being one of my favorites from the past year. I think that having to sit down and actually learn and use software the same way that the users have to is an extraordinarily valuable experience for any software engineer that is building tools for users. Even though I’ve been working at Disney Animation for six years now, and even though I know the internals of how our renderer works extensively, I still learned a ton from having to actually use Hyperion to light shots and address notes from lighting supervisors and stuff! Encanto’s lighting style required really leaning on the tools that we have for art-directing and pushing and modifying fully physical lighting, which really changed my perspective on some of these tools. For most rendering engineers and researchers, features that allow for breaking purely physical light transport are often seen as annoying and difficult to implement but necessary concessions to the artists. Having now used these features in order to hit artistic notes on short time frames though, I now have a better understanding of just how critical a component these features can be in an artist’s toolbox. I owe a huge amount of thanks to Disney Animation’s technology department leadership and to the lighting department for having made this experience possible and for having strongly supported this entire “exchange program”; I’d strongly recommend that every rendering engineer should go try lighting some shots sometime!

Finally, here are a handful of stills from the movie, 100% created using Disney’s Hyperion Renderer by our amazing artists. I’ve ordered the frames randomly, to try to prevent spoiling anything important. These frames showcase just how gorgeous Encanto looks, but since they’re pulled from only the marketing materials that have been released so far, they only represent a small fraction of how breathtakingly beautiful and colorful the total film is. Hopefully I’ll be able to share a bunch more cool and beautiful stills closer to SIGGRAPH 2022. I highly recommend seeing Encanto on the biggest screen you can; if you are a computer graphics enthusiast, go see it twice: the first time for the wonderful, magical story and the second time for the incredible artistry that went into every single shot and every single frame! I love working on Disney Animation films because Disney Animation is a place where some of the most amazing artists and engineers in the world work together to simultaneously advance animation as a storytelling medium, as a visual medium, and as a technology goal. Art being inspired by technology and technology being challenged by art is a legacy that is deeply baked into the very DNA of Disney Animation, and that approach is exemplified by every single frame in Encanto:

All images in this post are courtesy of and the property of Walt Disney Animation Studios.

Also, be sure to catch our new short, Far From the Tree, which is accompanying Encanto in theaters. Far From the Tree deserves its own discussion later; all I’ll write here is that I’m sure it’s going to be fascinating for rendering and computer graphics enthusiasts to see! Far From the Tree tells the story of a parent and child raccoon as they explore a beach; the short has a beautiful hand-drawn watercolor look that is actually CG rendered out of Disney’s Hyperion Renderer and extensively augmented with hand-crafted elements. Be sure to see Far From the Tree in theaters with Encanto!

Rendering on the Apple M1 Max Chip

Over the past year, I ported my hobby renderer, Takua Renderer, to 64-bit ARM. I wrote up the entire process and everything I learned as a three-part blog post series covering topics ranging from assembly-level comparison between x86-64 and arm64, to deep dives into various aspects of Apple Silicon, to a comparison of x86-64’s SSE and arm64’s Neon vector instructions. In the intro to part 1 of my arm64 series, I wrote about my motivation for exploring arm64, and in the conclusion to part 2 of my arm64 series, I wrote the following about the Apple M1 chip:

There’s really no way to understate what a colossal achievement Apple’s M1 processor is; compared with almost every modern x86-64 processor in its class, it achieves significantly more performance for much less cost and much less energy. The even more amazing thing to think about is that the M1 is Apple’s low end Mac processor and likely will be the slowest arm64 chip to ever power a shipping Mac; future Apple Silicon chips will only be even faster.

Well, those future Apple Silicon chips are now here! Last week (relative to the time of posting), Apple announced new 14 and 16-inch MacBook Pro models, powered by the new Apple M1 Pro and Apple M1 Max chips. Apple reached out to me last week immediately after the announcement of the new MacBook Pros, and as a result, for the past week I’ve had the opportunity to use a prerelease M1 Max-equipped 2021 14-inch MacBook Pro as my daily computer. So, to my extraordinary surprise, this post is the unexpected Part 4 to what was originally supposed to be a two-part series about Takua Renderer on arm64. This post will serve as something of a coda to my Takua Renderer on arm64 series, but will also be fairly different in structure and content to the previous three parts. While the previous three parts dove deep into extremely technical details about arm64 assembly and Apple Silicon and such, this post will focus on a single question: now that professional-grade Apple Silicon chips exist in the wild, how well do high-end rendering workloads run on workstation-class arm64?

Figure 1: The new 2021 14-inch MacBook Pro with an Apple M1 Max chip, running Takua Renderer.

Before we dive in, I want to get a few important details out of the way. First, this post is not really a product review or anything like that, and I will not be making any sort of endorsement or recommendation on what you should or should not buy; I’ll just be writing about my experiences so far. Many amazing tech reviewers exist out there, and if what you are looking for is a general overview and review of the new M1 Pro and M1 Max based MacBook Pros, I would suggest you go check out reviews by The Verge, Anandtech, MKBHD, Dave2D, LinusTechTips, and so on. Second, as with everything in this blog, the contents of this post represent only my personal opinion and do not in any way represent any kind of official or unofficial position, endorsement, or opinion on any matter from my employer, Walt Disney Animation Studios. When Apple reached out to me, I received permission from Disney Animation to go ahead on a purely personal basis, and beyond that nothing with this entire process involves Disney Animation. Finally, Apple is not paying me or giving me anything for this post; the 14-inch MacBook Pro I’ve been using for the past week is strictly a loaner unit that has to be returned to Apple at a later point. Similarly, Apple has no say over the contents of this post; Apple has not even seen any version of this post before publishing. What is here is only what I think!

Now that a year has passed since the first Apple Silicon arm64 Macs were released, I do have my hobby renderer up and running on arm64 with everything working, but I’ve only rendered relatively small scenes so far on arm64 processors. The reason I’ve stuck to smaller scenes is because high-end workstation-class arm64 processors so far just have not existed; while large server-class arm64 processors with large core counts and tons of memory do exist, these server-class processors are mostly found in huge server farms and supercomputers and are not readily available for general use. For general use, the only arm64 options so far have been low-power single-board computers like the Raspberry Pi 4 that are nowhere near capable of running large rendering workloads, or phones and tablets that don’t have software or operating systems or interfaces suitable for professional 3D applications, or M1-based Macs. I have been using an M1 Mac Mini for the past year, but while the M1 performance-wise punches way above what a 15 watt TDP typically would suggest, the M1 only supports up to 16 GB of RAM and only represents Apple’s entry into Apple Silicon based Macs. The M1 Pro and M1 Max, however, are are Apple’s first high powered arm64-based chips targeted at professional workloads, meant for things like high-end rendering and many other creative workloads; by extension, the M1 Pro and M1 Max are also the first arm64 chips of their class in the world with wide general availability. So, in this post, answering the question “how well do high-end rendering workloads run on workstation-class arm64” really means examining how well the M1 Pro and M1 Max can do rendering.

Spoiler: the answer is extremely well; all of the renders in the post were rendered on the 14-inch MacBook Pro with an M1 Max chip. Here is a screenshot of Takua Renderer running on the 14-inch MacBook Pro with an M1 Max chip:

Figure 2: Takua Renderer running on arm64 macOS 12 Monterey, on a 14-inch MacBook Pro with an M1 Max chip.

The 14-inch MacBook Pro I’ve been using for the past week is equipped with the maximum configuration in every category: a full M1 Max chip with a 10-core CPU, 32-core GPU, 64 GB of unified memory, and 8 TB of SSD storage. However, for this post, I’ll only focus on the 10-core CPU and 64 GB of RAM, since Takua Renderer is currently CPU-only (more on that later); for a deep dive into the M1 Pro and M1 Max’s entire system-on-a-chip, I’d suggest taking a look at Anandtech’s great initial impressions and later in-depth review.

The first M1 Max spec that jumped out at me is the 64 GB of unified memory; having this amount of memory meant I could finally render some of the largest scenes I have for my hobby renderer. To test out the M1 Max with 64 GB of RAM, I chose the forest scene from my Mipmapping with Bidirectional Techniques post. This scene has enormous amounts of complex geometry; almost every bit of vegetation in this scene has highly detailed displacement mapping that has to be stored in memory, and the large amount of textures in this scene is what drove me to implement a texture caching system in my hobby renderer in the first place. In total, this scene requires just slightly under 30 GB of memory just to store all of the subdivided, tessellated, and displaced scene geometry, and requires an additional few more GB for the texture caching system (the scene can render with just a 1 GB texture cache, but having a larger texture cache helps significantly with performance).

I have only ever published two images from this scene: the main forest path view in the mipmapping blog post, and a closeup of a tree stump as the title image on my personal website. I originally had several more camera angles set up that I wanted to render images from, and I actually did render out 1080p images. However, to showcase the detail of the scene better, I wanted to wait until I had 4K renders to share, but unfortunately I never got around to doing the 4K renders. The reason I never did the 4K renders is because I only have one large personal workstation that has both enough memory and enough processing power to actually render images from this scene in a reasonable amount of time, but I needed this workstation for other projects. I also have a few much older spare desktops that do have just barely enough memory to render this scene, but unfortunately, those machines are so loud and so slow and produce so much heat that I prefer not to run them at all if possible, and I especially prefer not running them on long render jobs when I have to work-from-home in the same room! However, over the past week, I have been able to render a bunch of 4K images from my forest scene on the M1 Max 14-inch MacBook Pro; quite frankly, being able to do this on a laptop is incredible to me. Here is the title image from my personal website, but now rendered at 4K resolution on the M1 Max 14-inch MacBook Pro:

Figure 3: Forest scene title image from my personal website. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version.

The M1 Max-based MacBook Pro is certainly not the first laptop to ever ship with 64 GB of RAM; the previous 2019 16-inch MacBook Pro was also configurable up to 64 GB of RAM, and there are crazy PC laptops out there that can be configured up even higher. However, this is where the M1 Max and M1 Pro’s CPU performance comes into play: while previous laptops could support 64 GB of RAM and more, actually utilizing large amounts of RAM was difficult since previous laptop CPUs often couldn’t keep up! Being able to fit a large data set into memory is one thing, but being able to run processing fast enough to actually make use of large data sets in a reasonable amount of time is the other half of the puzzle. My wife has a 2019 16-inch MacBook Pro with 32 GB of memory, which is just enough to render my forest scene. However, as seen in the benchmark results later in this post, the 2019 16-inch MacBook Pro’s Intel Core-i7 9750H CPU with 6 cores and 12 threads is over twice as slow as the M1 Max at rendering this scene at best, and can be even slower depending on thermals, power, and more. Rendering each of the images in this post took a few hours on the M1 Max, but on the Core-i7 9750H, the renders have to become overnight jobs with the 16-inch MacBook Pro’s fans running at full speed. With only a week to write this post, a few hours per image versus an overnight job per image made the difference between having images ready for this post versus not having any interesting renders to show at all!

Actually, the M1 Max isn’t just fast for a chip in a laptop; the M1 Max is stunningly competitive even with desktop workstation CPUs. For the past few years, the large personal workstation that I offload large projects onto has been a machine with dual Intel Xeon E5-2680 workstation processors with 8 cores / 16 threads each for a total of 16 cores and 32 threads. Even though the Xeon E5-2680s are ancient at this point, this workstation’s performance is still on-par with that of the current Intel-based 2020 27-inch iMac. The M1 Max is faster then the dual-Xeon E5-2680 workstation at rendering my forest scene, and considerably so. But of course, a comparison with aging Sandy Bridge era Xeons isn’t exactly a fair sporting competition; the M1 Max has almost a decade of improved processor design and die shrinks to give it an advantage. So, I also tested the M1 Max against… the current generation 2019 Mac Pro, which uses a Intel Xeon W-3245 CPU with 16 cores and 32 threads. As expected, the M1 Max loses to the 2019 Mac Pro… but not by a lot, and for a fraction of the power used. The Intel Xeon W-3245 has a 205 watt TDP just for the CPU alone and has to be utilized in a huge desktop tower with an extremely elaborate custom-engineered cooling solution, whereas the M1 Max 14-inch MacBook Pro has a reported whole-system TDP of just 60 watts!

How does Apple pack so much performance with such little energy consumption into their arm64 CPU designs? A number of factors come into play here, ranging from partnering with TSMC to manufacture on cutting-edge 5 nm process nodes to better microarchitecture design to better software and hardware integration; outside of Apple’s processor engineering labs, all anyone can really do is just hypothesize and guess. However, there are some good guesses out there! Several plausible theories have to do with the choice to use the arm64 instruction set; the argument goes that having been originally designed for low-power use cases, arm64 is better suited for efficient energy consumption than x86-64, and scaling up a more efficient design to huge proportions can mean more capable chips that use less power than their traditional counterparts. Another theory revolving around the arm64 instruction set has to do with microarchitecture design considerations. The M1, M1 Pro, and M1 Max’s high-performance “Firestorm” cores have been observed to have an absolutely humongous reorder buffer, which enables extremely deep out-of-order execution capabilities; modern processors attain a lot of their speed by reordering incoming instructions to do things like hide memory latency and bypass stalled instruction sequences. The M1 family’s high-performance cores posses an out-of-order window that is around twice as large as that in Intel’s current Willow Cove microarchitecture and around three times as large as that in AMD’s current Zen3 microarchitectures. Having a huge reordering buffer supports the M1 family’s high-performance cores also having a high level of instruction-level parallelism enabled by extremely wide instruction execution and extremely wide instruction decoding. While wide instruction decoding is certainly possible on x86-64 and other architectures, scaling wide instruction-issue designs in a low power budget is generally accepted to be a very challenging chip design problem. The theory goes that arm64’s fixed instruction length and relatively simple instructions make implementing extremely wide decoding and execution far more practical for Apple, compared with what Intel and AMD have to do in order to decode x86-64’s variable length, often complex compound instructions.

So what does any of the above have to do with ray tracing? One concrete application has to do with opacity mapping in a ray tracing renderer. Opacity maps are used to produce finer geometric detail on surfaces by using a texture map to specify whether a part of a given surface should actually exist or not. Implementing opacity mapping in a ray tracer creates a surprisingly large number of design considerations that need to be solved for. For example, texture lookups are usually done as part of a renderer’s shading system, which in a ray tracer only runs after ray intersection has been carried out. However, evaluating whether or not a given hit point against a surface should be ignored or not after exiting the entire ray traversal system leads to massive inefficiencies due to the need to potentially re-enter the entire ray traversal system from scratch again. As an example: imagine a tree where all of the leaves are modeled as rectangular cards, and the shape of each leaf is produced using an opacity map on each card. If the renderer wants to test if a ray hits any part of the tree, and the renderer is architected such that opacity map lookups only happen in the shading system, then the renderer may need to cycle back and forth between the traversal and shading systems for every leaf encountered in a straight line path through the tree (and trees have a lot of leaves!). An alternative way to handle opacity hits is to allow for direct texture map lookups or to evaluate opacity procedurally from within the traversal system itself, such that the renderer can immediately decide whether to accept a hit or not without having to exit out and run the shading system; this approach is what most renderers use and is what ray tracing libraries like Embree and Optix largely expect. However, this method produces a different problem: tight inner loop ray traversal code is now potentially dependent on slow texture fetches from memory! Both of these approaches to implementing opacity mapping have downsides and potential performance impacts, which is why often times just modeling detail into geometry instead of using opacity mapping can actually result in faster ray tracing performance, despite the heavier geometry memory footprint. However, opacity mapping is often a lot easier to set up compared with modeling detail into geometry, and this is where a deep out-of-order buffer coupled with good branch prediction can make a big difference in ray tracing performance; these two tools combined can allow the processor to proceed with a certain amount of ray traversal work without having to wait for opacity map decisions. Problems similar to this, coupled with the lack of out-of-order and speculative execution on GPUs, play a large role in why GPU ray tracing renderers often have to be architecture fairly differently from CPU ray tracing renderers, but that’s a topic for another day.

I give the specific example above because it turns out that the M1 Max’s deep reordering capabilities seem to make a fairly noticeable difference in my Takua Renderer’s performance when opacity maps are used extensively! In the following rendered image, the ferns have an extremely detailed, complex appearance that depends heavily on opacity maps to cut out leaf shapes from simple underlying geometry. In this case, I found that the slowdown introduced by using opacity maps in a render on the M1 Max is proportionally much lower than the slowdown introduced when using opacity maps in a render on the x86-64 machines that I tested. Of course, I have no way of knowing if the above theory for why the M1 Max seems to handle renders that use opacity maps better is correct, but whichever way, the end results look very nice and renders faster than on any other computer that I have!

Figure 4: Detailed close-up of a fern in the forest scene. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version.

In terms of whether the M1 Pro or the M1 Max is better for CPU rendering, I only have the M1 Max to test, but my guess is that there shouldn’t actually be too large of a difference as long as the scene fits in memory. However, the above guess comes with a major caveat revolving around memory bandwidth. Where the M1 Pro and M1 Max differ is in the maximum number of GPU cores and maximum amount of unified memory configurable; the M1 Pro can go up to 16 GPU cores and 32 GB of RAM, while the M1 Max can go up to 32 GPU cores and 64 GB of RAM. Outside of the GPU and maximum amount of memory, the M1 Pro and M1 Max chips actually share identical CPU configurations: both of them have a 10-core arm64 CPU with 8 high-performance cores and 2 energy-efficient cores, implementing a custom in-house Apple-designed microarchitecture. However, for some workloads, I would not be surprised if the M1 Max is actually slightly faster since the M1 Max also has twice the memory bandwidth over the M1 Pro (400 GB/s on M1 Max versus 200 GB/s M1 Pro); this difference comes from the M1 Max having twice the number of memory controllers. While consumer systems such as game consoles and desktop GPUs often do ship with memory bandwidth numbers comparable or even better than the M1 Max’s 400 GB/s, seeing these levels of memory bandwidth in even workstation CPUs is relatively unheard of. For example, AMD’s monster flagship Ryzen Threadripper 3990X is currently the most powerful high-end desktop CPU on the planet (outside of server processors), but the 3990X’s maximum memory bandwidth tops out at 95.37 GiB/s, or 165.944 GB/s; seeing the M1 Max MacBook Pro ship with over twice the memory bandwidth compared to the Threadripper 3990X is pretty wild. The M1 Max also has twice the amount of system-level cache as the M1 Pro; on the M1 family of chips, the system-level cache is loosely analogous to L3 cache on other processors, but serves the entire system instead of just the CPU cores.

Production-grade CPU ray tracing is a process that depends heavily on being able to pin fast CPU cores at close to 100% utilization for long periods of time, while accessing extremely large datasets from system memory. In an ideal world, intensive computational tasks should be structured in such a way that data can be pulled from memory in a relatively coherent, predictable manner, allowing the CPU cores to rely on data in cache over fetching from main memory as much as possible. Unfortunately, making ray tracing coherent enough to utilize cache well is an extremely challenging problem. Operations such as BVH traversal, which finds the closest point in a scene that a ray intersects, essentially represent an arbitrarily random walk through potentially vast amounts of geometry stored in memory, and any kind of incoherent walk through memory makes overall CPU performance dependent on memory performance. As a result, operations like BVH traversal tend to be heavily bottlenecked by memory latency and memory bandwidth. I expect that the M1 Max’s strong memory bandwidth numbers should provide a some performance boost for rendering compared to the M1 Pro. A complicating factor, however, is how the additional memory bandwidth on the M1 Max is utilized; not all of it is available to just the CPU, since the M1 Max’s unified memory needs to also serve the system’s GPU, neural processing systems, and other custom onboard logic blocks. The actual real-world impact should be easily testable by rendering the same scene on a M1 Pro and a M1 Max chip both with 32 GB of RAM, but in the week that I’ve had to test the M1 Max so far, I haven’t had the time or ability to be able to carry out this test on my own. Stay tuned; I’ll update this post if I am able to try this test soon!

I’m very curious to see if the increased memory bandwidth on the M1 Max will make a difference over the M1 Pro on this forest scene in particular, due to how dense some of the geometry is and therefore how deep some of the BVHs have to go. For example, every single pine needle in this next image is individually modeled geometry, and every tree trunk has sub-pixel-level tessellation and displacement; being able to render this image on a MacBook Pro instead of a giant workstation is incredible:

Figure 5: Forest canopy made up of pine trees, with every pine needle modeled as geometry. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version.

In the previous posts about running Takua Renderer on arm64 processors, I included performance testing results across a wide variety of machines ranging from the Raspberry Pi 4B to the M1 Mac Mini all the way up to my dual Intel Xeon E5-2680 workstation. However, all of those tests weren’t necessarily indicative of what real world rendering performance on huge scenes would be like, since all of those tests had to use scenes that were small enough to fit in to a M1 Mac Mini’s 16 GB memory footprint. Now that I have access to a M1 Max MacBook Pro with 64 GB of memory, I can present some initial performance comparisons with larger machines rendering my forest scene. I think these results are likely more indicative of what real-world production rendering performance looks like, since the forest scene is the closest thing I have to true production complexity (I haven’t ported the Disney’s Moana Island data set to work in my renderer yet).

The machines I tested this time are a 2021 14-inch MacBook Pro with an Apple M1 Max chip with 10 cores (8 performance, 2 efficiency) and 10 threads, a 2019 16-inch MacBook Pro with an Intel Core i7-9750H CPU with 6 cores and 12 threads, a 2019 Mac Pro with an Intel Xeon W-3245 CPU with 16 cores and 32 threads, and a Linux workstation with dual Intel Xeon E5-2680 CPUs with 8 cores and 16 threads per CPU for a total of 16 cores and 32 threads. The Xeon E5-2680 workstation is, quite franky, ancient, and makes for something of a strange comparison point, but it’s the main workstation that I use for personal rendering projects at the moment, so I included it. I don’t exactly have piles of the latest server and workstation chips just laying around my house, so I had to work with what I got! However, I was also able to borrow access to a Windows workstation with an AMD Threadripper 3990X CPU, which weighs in with 64 cores and 128 threads. I figured that the Threadripper 3990X system is not at all a fair comparison point for the exact opposite reason why the Xeon E5-2680 is not a fair comparison point, but I thought I’d throw it in anyway out of sheer curiosity. Notably, the regular Apple M1 chip does not make an appearance in these tests, since the forest scene doesn’t fit in memory on the M1. I also borrowed a friend’s Razer Blade 15 to test, but wound up not using it since I discovered that it has the same Intel Core i7-9750H CPU as the 2019 16-inch MacBook Pro, but only has half the memory and therefore can’t fit the scene.

In the case of the two MacBook Pros, I did all tests twice: once with the laptops plugged in, and once with the laptops running entirely on battery power. I wanted to compare plugged-in versus battery performance because of Apple’s claim that the new M1 Pro/Max based MacBook Pros perform the same whether plugged-in or on battery. This claim is actually a huge deal; laptops traditionally have had to throttle down CPU performance when unplugged to conserve battery life, but the energy efficiency of Apple Silicon allows Apple to no longer have to do this on M1-family laptops. I wanted to verify this claim for myself!

In the results below, I present three tests using the forest scene. The first test measures how long Takua Renderer takes to run subdivision, tessellation, and displacement, which has to happen before any pixels can actually be rendered. The subdivision/tessellation/displacement process has an interesting performance profile that looks very different from the performance profile of the main path tracing process. Subdivision within a single mesh is not easily parallelizable, and even with a parallel implementation, scales very poorly beyond just a few threads. Takua Renderer attempts to scale subdivision widely by running subdivision on multiple meshes in parallel, with each mesh’s subdivision task only receiving an allocation of at most four threads. As a result, the subdivision step actually benefits slightly more from single-threaded performance over a larger number of cores and greater multi-threaded performance. The second test is rendering the main view of the forest scene from my mipmapping blog post, at 1920x1080 resolution. I chose to use 1920x1080 resolution since most of the time this is a more common maximum resolution to be using while working on artistic iteration. The third test is rendering the fern view of the forest scene from Figure 2 of this post, at final 4K 3840x2160 resolution. For both of the main rendering tests, I only ran the renderer for 8 samples per pixel, since I didn’t want to sit around for days to collect all of the data. For each test, I did five runs, discarded the highest and lowest results, and averaged the remaining three results to get the numbers below. Wall time (as in a clock on a wall) measures the actual amount of real-world time that each test took, while core-seconds is an approximation of how long each test would have taken running on a single core. So, wall time can be thought of as a measure of total computation power, whereas core-seconds is more a measure of computational efficiency; in both cases, lower numbers are better:

  Forest Subdivision/Displacement  
Processor: Wall Time: Core-Seconds:
Apple M1 Max (Plugged in): 128 s approx 1280 s
Apple M1 Max (Battery): 128 s approx 1280 s
Intel Core i7-9750H (Plugged in): 289 s approx 3468 s
Intel Core i7-9750H (Battery): 307 s approx 3684 s
Intel Xeon W-3245: 179 s approx 5728 s
Intel Xeon E5-2680 x2: 222 s approx 7104 s
AMD Threadripper 3990X: 146 s approx 18688 s
  Forest Rendering (Main Camera)  
  1920x1080, 8 spp, PT  
Processor: Wall Time: Core-Seconds:
Apple M1 Max (Plugged in): 127.143 s approx 1271.4 s
Apple M1 Max (Battery): 126.421 s approx 1264.2 s
Intel Core i7-9750H (Plugged in): 288.089 s approx 3457.1 s
Intel Core i7-9750H (Battery): 347.898 s approx 4174.8 s
Intel Xeon W-3245: 106.332 s approx 3402.6 s
Intel Xeon E5-2680 x2: 158.255 s approx 5064.2 s
AMD Threadripper 3990X: 38.887 s approx 4977.5 s
  Forest Rendering (Fern Camera)  
  3840x2160, 8 spp, PT  
Processor: Wall Time: Core-Seconds:
Apple M1 Max (Plugged in): 478.247 s approx 4782.5 s
Apple M1 Max (Battery): 496.384 s approx 4963.8 s
Intel Core i7-9750H (Plugged in): 1084.504 s approx 13014.0 s
Intel Core i7-9750H (Battery): 1219.59 s approx 14635.1 s
Intel Xeon W-3245: 345.292 s approx 11049.3 s
Intel Xeon E5-2680 x2: 576.279 s approx 18440.9 s
AMD Threadripper 3990X: 108.2596 s approx 13857.2 s

When rendering the main camera view, the 2021 14-inch MacBook Pro used on average about 7% of its battery charge, while the 2019 16-inch MacBook Pro used on average about 39% of its battery charge. When rendering the fern view, the 2021 14-inch MacBook Pro used on average about 19% of its battery charge, while the 2019 16-inch MacBook Pro used on average about 48% of its battery charge. Overall by every metric, the 2021 14-inch MacBook Pro achieves an astounding victory over the 2019 16-inch MacBook Pro: a little over twice the performance for a fraction of the total power consumption. The 2021 14-inch MacBook Pro also lives up to Apple’s claim of identical performance plugged in and on battery power, whereas in the results above, the 2019 16-inch MacBook Pro suffers anywhere between a 25% to 50% performance hit just from switching to battery power. The 2021 14-inch MacBook Pro’s performance win is even more astonishing when considering that the 2019 16-inch MacBook Pro is the previous flagship that the new M1 Pro/Max MacBook Pros are the direct successors to. Seeing this kind of jump in a single hardware generation is unheard of in modern tech and represents a massive win for both Apple and for the arm64 ISA. The M1 Max also handily beats the old dual Intel Xeon E5-2680 that I am currently using by a comfortable margin; for my personal workflow, this means that I can now do everything that I previously needed a large loud power-hungry workstation for on the 2021 14-inch MacBook Pro, and I can do everything faster on the 2021 14-inch MacBook Pro too.

The real surprises to me came with the 2019 Mac Pro and the Threadripper 3990X workstation. In both of those cases, I expected the M1 Max to lose, but the 2021 14-inch MacBook Pro came surprisingly close to the 2019 Mac Pro’s performance in terms of wall time. Even more importantly as a predictor of future scalability, the M1 Max’s efficiency as measured by core-seconds comes in at far far superior to both the Intel Xeon W-3245 and the AMD Threadripper 3900X. Imagining what a hypothetical future Apple Silicon iMac or Mac Pro with an even more scaled up M1 variant, or perhaps some kind of multi-M1 Max chiplet or multisocket solution, is extremely exciting! I think that with the upcoming Apple Silicon based large iMac and Mac Pro, Apple has a real shot at beating both Intel and AMD’s highest end CPUs to win the absolute workstation performance crown.

Of course, what makes the M1 Max’s performance numbers possible is the M1 Max’s energy efficiency; this kind of performance-per-watt is simply unparalleled in the desktop (meaning non-mobile, not desktop form factor) processor world. The M1 architecture’s energy efficiency is what allows Apple to scale the design out into the M1 Pro and M1 Max and hopefully beyond. Below is a breakdown of energy utilization for each of the rendering tests above; the total energy used for each render is the wall clock render time multiplied by the maximum TDP of each processor to get watt-seconds, which is then translated to watt-hours. I assume maximum TDP for each processor since I ran Takua Renderer with processor utilization set to 100%. For the two MacBook Pros, I’m just reporting the plugged-in results.

  Forest Rendering (Main Camera)  
  1920x1080, 8 spp, PT  
Processor: Max TDP: Total Energy Used:
Apple M1 Max: 60 W 2.1191 Wh
Intel Core i7-9750H: 45 W 3.6011 Wh
Intel Xeon W-3245: 205 W 6.0550 Wh
Intel Xeon E5-2680 x2: 260 W 11.4295 Wh
AMD Threadripper 3990X: 280 W 3.0246 Wh
  Forest Rendering (Fern Camera)  
  3840x2160, 8 spp, PT  
Processor: Max TDP: Total Energy Used:
Apple M1 Max: 60 W 7.9708 Wh
Intel Core i7-9750H: 45 W 13.5563 Wh
Intel Xeon W-3245: 205 W 19.6625 Wh
Intel Xeon E5-2680 x2: 260 W 41.6202 Wh
AMD Threadripper 3990X: 280 W 8.4202 Wh

At least for my rendering use case, the Apple M1 Max is easily the most energy efficient processor, even without taking into account that the 60 W TDP of the M1 Max is for the entire system-on-a-chip including CPU, GPU, and more, while the TDPs for all of the other processors are just for a CPU and don’t take into account the rest of the system. The M1 Max manages to beat the 2019 16-inch MacBook Pro’s Intel Core i7-9750H in absolute performance by a factor of two whilst using anywhere between a two-thirds to half of the energy, and the M1 Max comes close to matching the 2019 Mac Pro’s absolute performance while using about a third of the energy. Of course the comparison with the Intel Xeon E5-2680 workstation isn’t exactly fair since the M1 Max is manufactured using a 5 nm process while the ancient Intel Xeon E5-2580s were manufactured on a 35 nm process a decade ago, but I think the comparison still underscores just how far processors have advanced over the past decade leading up to the M1 Max. The only processor that really comes near the M1 Max in terms of energy efficiency is the AMD Threadripper 3990X, which makes sense since the AMD Threadripper 3990X and the M1 Max are the closest cousins in this list in terms of manufacturing process; both are using leading-edge TSMC photolithography. However, on a whole, the M1 Max is still more efficient than the AMD Threadripper 3990X, and again, the AMD Threadripper 3990X TDP is for just a CPU, not an entire SoC! Assuming near-linear scaling, a hypothetical M1-derived variant that is scaled up 4.5 times to a 270 W TDP should be able to handily defeat the AMD Threadripper 3990X in absolute performance.

The wider takeaway here though is that in order to give the M1 Max some real competition, one has to skip laptop chips entirely and reach for not just high end desktop chips, but for server-class workstation hardware to really beat the M1 Max. For workloads that push the CPU to maximum utilization for sustained periods of time, such as production-quality path traced rendering, the M1 Max represents a fundamental shift in what is possible in a laptop form factor. Something even more exciting to think about is how the M1 Max really is the middle tier Apple Silicon solution; presumably the large iMac and Mac Pro will push things into even more absurd territory.

So those are my initial thoughts on the Apple M1 Max chip and my initial experiences with getting my hobby renderer up and running on the 2021 14-inch MacBook Pro. I’m extremely impressed, and not just with the chip! This post mostly focused on the chip itself, but the rest of the 2021 MacBook Pro lineup is just as impressive. For rendering professionals and enthusiasts alike, one aspect of the 2021 MacBook Pros that will likely be just as important as the processor is the incredible screen. The 2021 MacBook Pros ship with what I believe is an industry first: a micro-LED backlit 120 Hz display with an extended dynamic range that can go up to 1600 nits peak brightness. The screen is absolutely gorgeous, which is a must for anyone who spends their time generating pixels with a 3D renderer! One thing on my to-do list was to add extended dynamic range support to Thomas Müller’s excellent tev image viewer, which is a popular tool in the rendering research community. However, it turns out that Thomas already added extended dynamic range support, and it looks amazing on the 2021 MacBook Pro’s XDR display.

In this post I didn’t go into the M1 Max’s GPU at all, even though the GPU in many ways might actually be even more interesting than the CPU (which is saying a lot considering how interesting the CPU is). On paper at least, the M1 Max’s GPU aims for roughly mobile NVIDIA GeForce RTX 3070 performance, but how the M1 Max and a mobile NVIDIA GeForce RTX 3070 actually will compare for ray traced rendering is difficult to say without actually conducting some tests. On one hand, the M1 Max’s unified memory architecture grants its GPU far more memory than any NVIDIA mobile GPU by a huge margin, and the M1 Max’s unified memory architecture opens up a wide variety of interesting optimizations that are otherwise difficult to do when managing separate pools of CPU and GPU memory. On the other hand though, the M1 Max’s GPU lacks the dedicated hardware ray tracing acceleration that modern NVIDIA and AMD GPUs and the upcoming Intel discrete GPUs all have, and in my experience so far, dedicated hardware ray tracing acceleration makes a huge difference in GPU ray tracing performance. Maybe Apple will add hardware ray tracing acceleration in the future; Metal already has software ray tracing APIs, and there already is a precedent for Apple Silicon including dedicated hardware for accelerating relatively niche, specific professional workflows. As an example, the M1 Pro and M1 Max include hardware ProRes acceleration for high-end video editing. Over the next year, I am undertaking a large-scale effort to port the entirety of Takua Renderer to work on GPUs through CUDA on NVIDIA GPUs, and through Metal on Apple Silicon devices. Even though I’ve just gotten started on this project, I’ve already learned a lot of interesting things comparing CUDA and Metal compute; I’ll have much more to say on the topic hopefully soon!

Beyond the CPU and GPU and screen, there are still even more other nice features that the new MacBook Pros have for professional workflows like high-end rendering, but I’ll skip going through them in this post since I’m sure they’ll be thoroughly covered by all of the various actual tech reviewers out on the internet.

To conclude for now, here are two more bonus images that I rendered on the M1 Max 14-inch MacBook Pro. I originally planned on just rendering the earlier three images in this post, but to my surprise, I found that I had enough time to do a few more! I think that kind of encapsulates the M1 Pro and M1 Max MacBook Pros in a nutshell: I expected incredible performance, but was surprised to find even my high expectations met and surpassed.

Figure 6: A mossy log, ferns, and debris on the forest floor. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version.

Figure 7: Sunlight transmitting through pine leaves in the forest canopy. Rendered using Takua Renderer on a M1 Max 14-inch MacBook Pro. Click through for full 4K version.

A huge thanks to everyone at Apple that made this post possible! Also a big thanks to Rajesh Sharma and Mark Lee for catching typos and making some good suggestions.