Testing Graphics Code, 4 years later

Almost four years ago I wrote about how we test rendering code at Unity. Did it stand the test of time and, more importantly, the company growing from fewer than 10 people to more than 100?

I’m happy to say it did! That’s it, move on to read the rest of the internets.

The earlier post focused more on hardware compatibility (differences between platforms, GPUs, driver versions, driver bugs and their workarounds etc.). In addition to that, we do regression tests on a bunch of actual Unity-made games. All that is good and works, so let’s talk instead about the tests the rendering team at Unity uses in its daily life.

Graphics Feature & Regression Testing

In the daily life of a graphics programmer, you care about two things related to testing:

1. Whether a new feature you are adding, more or less, works.
2. Whether something new you added or something you refactored broke or changed any existing features.

Now, “works” is a vague term. Definitions can range from equally vague

Works For Me!

to something like

It has been battle tested on thousands of use cases, hundreds of shipped games, dozens of platforms, thousands of platform configurations and within each and every one of them there’s not a single wrong pixel, not a single wasted memory byte and not a single wasted nanosecond! No kittehs were harmed either!

In an ideal world we’d only consider the latter as “works”; however, that’s quite hard to achieve.

So instead we settle for small “functional tests”, where each feature has a small scene setup that exercises said feature (much as described in the previous post). It’s the graphics programmer’s responsibility to add tests like that for their own work.

For example, Fog handling might be tested by a couple scenes like this:

Another example, tests for various corner cases of Deferred Lighting:


So that’s basic testing for “it works” that the graphics programmers themselves do. Beyond that, features are tested by QA and a large beta testing group, tried, profiled and optimized on real actual game projects and so on.

The good thing is, doing these basic tests also provides you with point 2 (did I break or change something?) automatically. If after your changes, all the graphics tests still pass, there’s a pretty good chance you did not break anything. Of course this testing is not exhaustive, but any time a regression is spotted by QA, beta testers or reported by users, you can add a new graphics test to check for that situation.

How do we actually do it?

We use TeamCity for the build/test farm. It has several build machines set up as graphics test agents (unlike most other build machines, they need an actual GPU, or an iOS device connected to them, or a console devkit etc.) that run graphics test configurations for all branches automatically. Each branch has its graphics tests run daily, and branches with “high graphics code activity” (i.e. branches that the rendering team is actually working on) have them run more often. Of course, you can always start the tests manually by clicking a button. What you want to see at any time is this:

The basic approach is the same as 4 years ago: there is a “game level” (“scene” in Unity speak) for each test; it runs for a defined number of frames at a fixed timestep, and a screenshot is taken at the end of each frame. Each screenshot is compared with a “known good” image for that platform; any difference means “FAIL”. On many platforms you have to allow a couple of wrong pixels, because many consumer GPUs are not fully deterministic, it seems.
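A minimal sketch of that comparison in Python (the real test controller is a C# application, so this is an illustration rather than the actual code). The function names, the in-memory image representation and the default tolerance of 2 pixels are all assumptions made for the example:

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (r, g, b) pixels; a hypothetical in-memory format


def count_wrong_pixels(actual: Image, expected: Image) -> int:
    """Count pixels that differ at all between two same-sized images."""
    if len(actual) != len(expected) or any(
        len(row_a) != len(row_e) for row_a, row_e in zip(actual, expected)
    ):
        raise ValueError("screenshot and reference image differ in size")
    return sum(
        1
        for row_a, row_e in zip(actual, expected)
        for pixel_a, pixel_e in zip(row_a, row_e)
        if pixel_a != pixel_e
    )


def frame_passes(actual: Image, expected: Image, max_wrong_pixels: int = 2) -> bool:
    """A frame passes if no more than max_wrong_pixels pixels differ.

    The tolerance value is an assumption; the post only says "a couple".
    """
    return count_wrong_pixels(actual, expected) <= max_wrong_pixels


if __name__ == "__main__":
    golden = [[(10, 20, 30)] * 4 for _ in range(4)]
    shot = [row[:] for row in golden]
    shot[1][2] = (11, 20, 30)  # one "noisy" pixel, within tolerance
    print(frame_passes(shot, golden))
    shot[0] = [(0, 0, 0)] * 4  # a whole row is wrong, well over tolerance
    print(frame_passes(shot, golden))
```

Because every reference image belongs to one specific platform, an exact per-pixel comparison with a small count-based tolerance is enough here; no perceptual metric is involved.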

So you have this bunch of “this is the golden truth” images for all the tests:

And each platform automatically tested on TeamCity has its own set:
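One way to organize such per-platform sets is a directory per platform, then per test, then one file per captured frame. The sketch below assumes exactly that; the layout, the platform names and the file naming are hypothetical and not taken from the actual Unity setup:

```python
from pathlib import PurePosixPath


def reference_image_path(root: str, platform: str, test_name: str, frame: int) -> PurePosixPath:
    """Build the path of the known-good image for one frame of one test.

    Assumed (hypothetical) layout: <root>/<platform>/<test_name>/frameNNN.png
    """
    return PurePosixPath(root) / platform / test_name / f"frame{frame:03d}.png"


if __name__ == "__main__":
    # Platform and test names here are made up for the example.
    for platform in ("d3d9", "ios"):
        print(reference_image_path("golden", platform, "Fog", 3))
```

Keeping the platform as the top-level directory means one platform's whole set can be regenerated or reviewed without touching the others.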

Since the “test controller” can run on a different device than the actual tests (the case for iOS, Xbox 360 etc.), the test executable opens a socket connection to transfer the screenshots. The test controller is a relatively simple C# application that listens on a socket, fetches the screenshots and compares them with the template ones. Its result is output that TeamCity can understand, along with “build artifacts” that consist of the failed tests (for each failed test: expected image, failed image, and a difference image with increased contrast).
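A rough Python sketch of those three pieces follows (the real controller is written in C#). The wire format (a 4-byte big-endian length followed by the payload), the raw-RGB byte representation, the contrast gain of 8 and all function names are assumptions for illustration. The `##teamcity[...]` lines use TeamCity's service message syntax for reporting tests from build output:

```python
import socket
import struct


def recv_exact(conn: socket.socket, size: int) -> bytes:
    """Read exactly size bytes, or raise if the peer closes early."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = conn.recv(remaining)
        if not chunk:
            raise ConnectionError("connection closed mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_blob(conn: socket.socket, payload: bytes) -> None:
    """Test executable side: send a 4-byte length prefix, then the data."""
    conn.sendall(struct.pack(">I", len(payload)) + payload)


def recv_blob(conn: socket.socket) -> bytes:
    """Controller side: read one length-prefixed payload."""
    (size,) = struct.unpack(">I", recv_exact(conn, 4))
    return recv_exact(conn, size)


def tc_escape(text: str) -> str:
    """Escape a value for a TeamCity service message ('|' must go first)."""
    for raw, escaped in (("|", "||"), ("'", "|'"), ("\n", "|n"),
                         ("\r", "|r"), ("[", "|["), ("]", "|]")):
        text = text.replace(raw, escaped)
    return text


def report(test_name: str, wrong_pixels: int, max_wrong_pixels: int = 2) -> list:
    """Build the service messages TeamCity parses from the build output."""
    name = tc_escape(test_name)
    lines = [f"##teamcity[testStarted name='{name}']"]
    if wrong_pixels > max_wrong_pixels:
        message = tc_escape(f"{wrong_pixels} pixels differ from the reference image")
        lines.append(f"##teamcity[testFailed name='{name}' message='{message}']")
    lines.append(f"##teamcity[testFinished name='{name}']")
    return lines


def difference_image(actual: bytes, expected: bytes, gain: int = 8) -> bytes:
    """Per-channel absolute difference, amplified so small errors are visible."""
    return bytes(min(255, abs(a - e) * gain) for a, e in zip(actual, expected))


if __name__ == "__main__":
    device, controller = socket.socketpair()  # stands in for the real network link
    send_blob(device, bytes([10, 22, 130]))   # a fake one-pixel "screenshot"
    shot = recv_blob(controller)
    device.close()
    controller.close()
    print(list(difference_image(shot, bytes([10, 20, 30]))))
    print("\n".join(report("Fog", wrong_pixels=5)))
```

The length prefix is there because a stream socket has no message boundaries: without it the controller could not tell where one screenshot ends and the next begins.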

That’s pretty much it! And of course, automated tests are nice and all, but that should not get too much in the way of actual programming manifesto.

7 Responses to 'Testing Graphics Code, 4 years later'

  1. martinsm

    How are you guys automating iOS testing? We are using Hudson to run CI tests, and so far we have found no way to automatically run a simple application on a device and send back the printf output it makes. That’s on a non-jailbroken device; on a jailbroken one we can of course use ssh to do that.

  2. Aras Pranckevičius

    On iOS, open a socket connection to the IP+port you specify somewhere (we add a little “connection configuration file” among the app’s data files when building). On the testing machine, open a listening socket. Route any logging & screenshot transfers over that. The biggest downside is that it requires both machines to be on the same wireless network all the time (and tests start failing if the wireless goes down).

    Looks like using actual USB connection (which is what Xcode uses to deploy & get info back) is very much undocumented. These might help: http://theiphonewiki.com/wiki/index.php?title=MobileDevice_Library and https://bitbucket.org/tristero/mobiledeviceaccess/wiki/Home – but I haven’t tried them personally.

  3. Florian Link

    We have the same kind of testing framework for our Volume Renderer, but since the blending/shader precision differs across GFX cards, a simple pixel-by-pixel compare is not sufficient and we are looking for a good image comparison algorithm for that task.

    You mentioned that you allow “some wrong pixels” on the consoles.
    It would be interesting to know what kind of image comparison algorithms you currently use for that.

  4. Aras Pranckevičius

    Oh, in our case the set of images is totally tied to the specific GPU (we do allow several wrong pixels on PCs, even when using exactly the same GPU and driver version… because they happen once in a while).

    The differences in filtering, precision, AA etc. are too big between various GPUs to use same set of images on them, in my opinion. So if everything is tied to a single GPU then you don’t need a fancy image diff algorithm; just go over pixels and compare them. That’s what we do at least.

  5. martinsm

    Ah, so you run a special test client there to run the tests. That currently won’t work for us, but thanks for those libraries, I’ll look into them.

  6. Arseny Kapoulkine

    Hey Aras!

    What’s the avg number of screenshots for a test (i.e. for one image, how many gpu/platform/driver variations are there)?

    If you change something that *should* change the result (improving shadow filtering, improving matrix inversion precision – horror!), do you launch an automatic process that regenerates all affected image variations and then just verify them visually before submitting? How long does the process take?

  7. Nico Galoppo

    All of you interested in non-exact image difference checking may be interested in this:

    Perceptual Image Differencing
    http://pdiff.sourceforge.net/
