Storage Informer

Optimizing FaceFX 2009

by on May.27, 2009, under Storage

In May of 2009, OC3 Entertainment released FaceFX 2009, a major revision of the popular facial animation package for video game developers. FaceFX 2009 introduced several major new features to the SDK, including playing animations on multiple channels and blending multiple facial animations together. FaceFX has been used in well over a hundred games, and so it's always a delicate challenge to push the boundaries of the technology to help power licensees, while retaining blazing-fast performance to support developers targeting legacy systems.

Once the product was feature-complete, we set about doing our performance tuning pass. Since FaceFX is not a full game, we can't use the simple frames per second metric to judge performance. Also, microarchitecture-specific metrics like branch mispredictions or CPI don't fully capture the product's in-game performance. We use two main metrics: one relative and one absolute. The relative metric is the current version's execution time to complete a common task as compared to the previous version's execution time to complete the same task. In this case, the task is to compute all the frames at 60 fps of a given animation. The absolute metric is one that's intuitive for all our licensees to understand when they ask us about performance: the percentage of a frame at 30 or 60 fps that is required for FaceFX to compute a full frame of facial animation for a character, including blending all the bones in the character's face. This makes it easy to judge if FaceFX will fit in their per-frame budget.
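
As a rough illustration, both metrics reduce to a few lines of arithmetic. The Python sketch below is not part of the FaceFX SDK; the function names are hypothetical and the timings are made-up values chosen only to show the calculation.

```python
def relative_change(current_seconds, previous_seconds):
    """Fractional change in execution time versus the previous version.

    Positive means the current version is slower; 0.25 means 25% slower.
    """
    return current_seconds / previous_seconds - 1.0


def frame_budget_percent(per_frame_seconds, fps):
    """Percentage of one frame's time budget consumed at the given frame rate."""
    frame_seconds = 1.0 / fps
    return 100.0 * per_frame_seconds / frame_seconds


# Hypothetical timings, not FaceFX measurements.
slowdown = relative_change(current_seconds=1.25, previous_seconds=1.00)
budget_60 = frame_budget_percent(per_frame_seconds=10e-6, fps=60)
budget_30 = frame_budget_percent(per_frame_seconds=10e-6, fps=30)
```

With these inputs, `slowdown` is 0.25 (25% slower), and the same 10-microsecond evaluation uses twice the share of a 60 fps frame as it does of a 30 fps frame, since the 60 fps budget is half as long.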

FaceFX Studio 2009

When we first compared FaceFX 2009 to the previous version, we found that it was 25% slower. We were willing to make some performance sacrifice to add the new features, but we were hoping for something in the neighborhood of 5% slower overall, not 25%. We used VTune to see where we could make improvements to get our performance under control. With VTune's guidance, we tweaked several functions in the critical path, mostly to reduce branch mispredictions and cache misses. This gave us a decent win, bringing 2009 to 20% slower than the previous version. However, that wasn't in line with our targets, so we turned to VTune's tuning assistant to see if it would offer some good advice.

We spent a while poring over the results, which boiled down to "you're doing too much floating-point math." This was a bit frustrating, because we were nearing our ship date and our performance numbers were still well off where we wanted them. We realized the only way to get further gains was to actually do less.

Of course, doing less means precomputing more. It was the classic memory/CPU tradeoff, but because we were happy with the memory footprint, we decided to precompute quite a bit of data which, in theory at that point, would allow us to bypass a large swath of the per-frame evaluation algorithm. By taking the idea of a precomputed potentially-visible set (PVS) and applying it to the directed acyclic graph, we thought we could quickly, with some bit fiddling, come up with the set of nodes in the graph that would need the full evaluation algorithm for a given frame. The rest of the nodes would use the precomputed default state.
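
The description above leaves out the implementation details, so the following Python sketch shows only one way such a scheme could work, not the actual FaceFX code. It assumes nodes are stored in topological order, that each node's downstream set is precomputed as a bitmask, and that a node's value is a simple `bias + sum(inputs)`; the graph, the node function, and all names are hypothetical.

```python
def downstream_masks(inputs):
    """For each node, a bitmask of itself plus every node that depends on it.

    `inputs[n]` lists the nodes feeding node n; indices are assumed to be in
    topological order, so every input index is smaller than n.
    """
    masks = [1 << n for n in range(len(inputs))]
    # Visit consumers before producers so each consumer's mask is complete
    # when it is folded into the nodes that feed it.
    for n in reversed(range(len(inputs))):
        for src in inputs[n]:
            masks[src] |= masks[n]
    return masks


def full_evaluate(inputs, bias, driven):
    """Evaluate every node: driven value if present, else bias plus inputs."""
    values = []
    for n, srcs in enumerate(inputs):
        if n in driven:
            values.append(driven[n])
        else:
            values.append(bias[n] + sum(values[s] for s in srcs))
    return values


def pvs_evaluate(inputs, bias, defaults, masks, driven):
    """Evaluate only nodes downstream of a driven node; reuse defaults elsewhere."""
    dirty = 0
    for n in driven:
        dirty |= masks[n]  # the per-frame "bit fiddling": OR the precomputed sets
    values = list(defaults)
    for n, srcs in enumerate(inputs):
        if not (dirty >> n) & 1:
            continue  # untouched this frame: keep the precomputed default
        if n in driven:
            values[n] = driven[n]
        else:
            values[n] = bias[n] + sum(values[s] for s in srcs)
    return values


# Hypothetical 6-node graph: nodes 0-2 are inputs, nodes 3-5 combine them.
inputs = [[], [], [], [0, 1], [2], [3]]
bias = [0.0, 0.0, 0.0, 0.5, 0.25, 1.0]
masks = downstream_masks(inputs)                      # precomputed once
defaults = full_evaluate(inputs, bias, driven={})     # precomputed once
frame = pvs_evaluate(inputs, bias, defaults, masks, driven={0: 0.8})
```

In this example only node 0 is driven, so only nodes 0, 3, and 5 are re-evaluated; nodes 1, 2, and 4 keep their precomputed defaults, and the result matches a full evaluation of the graph. The cost is the extra memory for the masks and default values.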

This major restructuring worked better than we ever imagined. With one (large) change, 2009's performance went from 20% worse to 30% better than the previous version in our test case.

We broke out VTune again to verify our changes, and realized we could extract a little more performance by separating the hot data from the cold data in our new structures. This final tweak gave us an additional boost, with 2009 now outperforming its predecessor by 35%. Computing one frame of a facial animation for a character was now taking just five one-hundredths of a percent of a frame at 60 fps, or just 8.37 microseconds, on average.

The Face Graph

Looking back on our performance pass, the PVS optimization was by far the biggest win. It never fails that the best way to speed up an algorithm is not to do it in the first place. By precomputing large parts of our per-frame algorithm, we drastically improved FaceFX's general performance at the expense of a few KB of memory. We also had a big win by separating hot data from cold data in the per-frame evaluation, an opportunity that VTune helped identify.

Although VTune didn't tell us how to restructure the code for the major PVS optimization, it helped us understand where we were spending our time, thereby enabling us to design an improved algorithm. VTune is an important tool in our arsenal for fine-tuning performance-critical code and for finding places where algorithmic change can yield major performance improvements.

