Affectionately Titled: Rock Band Calibration, a Love/Hate Story.
Part 1, the Fight
When we started Rock Band, TV technology was exploding in all different directions, due to the desire to have large, high definition displays. A 36-inch CRT-based TV weighs well over 200 pounds, I kid you not, I had one. That weight wasn’t reasonable, or practical, for normally muscled humans. So kind of suddenly, we had Plasma, LCD, and DLP based TVs, all of which could be big, but lightweight.
For our previous games we had relied on the user having a CRT monitor. CRT technology is super simple. You’re looking at one end of an enormous vacuum tube. The back of the tube is spitting out electrons, which smack into colored phosphors at the front. The electrons are steered left to right and top to bottom by magnetic fields, at the rate of 60 times a second. The input going into the back of the TV directly modulates those electrons as they hit your screen. Because of this simplicity, there’s zero delay between the TV getting the input and the user seeing it. Since there was zero visual delay, we only had to worry about when the user would hear the music. We knew that, like in other types of systems, the amount of audio delay could vary by user because people were putting their audio through amplifiers and real speaker systems. This could easily delay the audio by 30 milliseconds. That might not sound like much, but it’s roughly two frames of video animation, and we only give you 100ms before and after the gem hits what we call the “now bar” to strum the controller. We call that 100ms the “slop window.” 30 milliseconds is a sizable chunk of that slop, enough to make you score badly if we don’t try to account for it.
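To put numbers on that, here is a quick back-of-the-envelope sketch in Python. The 60 Hz refresh rate, the 30 ms audio delay, and the 100 ms slop window come from the description above; the function and constant names are made up purely for illustration.

```python
FRAME_RATE_HZ = 60        # refresh rate described above
SLOP_WINDOW_MS = 100.0    # allowed error on each side of the "now bar"


def delay_in_frames(delay_ms, frame_rate_hz=FRAME_RATE_HZ):
    """Convert a delay in milliseconds into a number of video frames."""
    return delay_ms * frame_rate_hz / 1000.0


def fraction_of_slop(delay_ms, slop_ms=SLOP_WINDOW_MS):
    """Return how much of the slop window a given delay uses up."""
    return delay_ms / slop_ms


audio_delay_ms = 30.0
print(delay_in_frames(audio_delay_ms))   # 1.8 frames, i.e. roughly two
print(fraction_of_slop(audio_delay_ms))  # 0.3, almost a third of the slop
```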
In earlier games, we would have you do a typical single step calibration to figure out how delayed your audio was. That was enough to let us make the audio line up better with the video, and score you accurately.
But now I realized that with these new high def TVs, we were in trouble. These new TVs weren’t like CRTs. The signal going into these TVs was getting processed and filtered and upsampled. This could add an additional delay of 35, 50, or in the case of one of our DLP TVs, more than 100ms. A very long time in terms of scoring. Longer than our scoring slop window. We couldn’t ignore that. We would have to introduce two-part calibration to figure out the audio delay AND the video delay, since both were unpredictable, and could be greater than our scoring slop window. That we needed this new calibration step was entirely obvious, to me.
I was one of three lead programmers on the project. I was nominally in charge of character and venue animation and rendering, but this calibration revelation struck me as really important. And coming from a signal processing background, I was well trained and eager to deal with it.
I approached our Tech Director, Eric Malafeew, who was answerable only to Eran Egozy, the cofounder and CTO. All three of us went to M.I.T., but Eran is pleasant, reasonable, and rational, whereas both Eric and I know we’re always right, are super opinionated, egotistical, etc. Eric is usually the advocate for simplicity in design, and I am usually the advocate for functionality. I love Eric like a brother, but we would routinely have yelling fights about code design and technology. The fight we were about to have wasn’t as bad as the famous “Day of the Proxy,” in which Eric, Jason Booth and I had a giant and protracted three-way fight about how best to encapsulate an entity in a scene, but it was close...
I walked into Eric’s office and the conversation went something like this:
Me: Hey Eric, I’ve been looking at these new high def TVs. They have unpredictable latencies. (latency is engineering-speak for “delay”)
Eric: Yeah, what a pain.
Me: So I’m adding a two-step calibration process for the user to determine them separately.
Eric: (looking up, alarmed) No way, not going to happen.
Me: Er… we really need it.
Eric: That’s crazy, people already ignore the one step calibration, no one is going to do a two step process.
Me: (getting surly) Eric, it’s a mathematical necessity, there’s no way around it. It is required.
Eric: (voice rising) That’s silly, one number is good enough, no one will have that big a difference between the audio and visual latencies.
Me: Jesus Christ Eric, what the hell, it’s obvious we need two step calibration. Are you trying to be stupid?
Anyway, you get the picture. In the end he wouldn’t agree with me, so I had to run to Eran. As cofounder and CTO, he was the only one who could break the impasse. Eran called us into his office. I felt a bit like we were two annoying children trying to appeal to our parent. I’ll let Eran describe what that was like, but suffice it to say that in the end we all agreed it was necessary, and that I should look into it some more.
Part 2, the Makening
So I had a blast. That’s the stuff I live for. How was I going to investigate this? How was I going to measure video and audio latencies across the various consoles and TVs? How was I going to verify that our calibration process actually worked, and brought the video and audio together, and made the scoring super accurate?
I thought a bit, and then drove down to U-Do-It Electronics, which is like a giant nerdly supermarket in Needham, MA. Instead of vegetables and meats, you can get useful things there: resistors, capacitors, transistors, ICs, soldering irons, oscilloscopes, etc. It’s basically exactly like heaven, but better. I bought (with my own money mind you) an analog oscilloscope, and what at M.I.T. we called a “nerd kit.” Basically, a little briefcase sized thing that when you open it up has breadboards that you can temporarily wire chips into, and includes power taps at different voltages, signal generators, display LEDs, and toggle switches. I bought it in kit form, since it was cheaper as a kit, and spent the whole weekend soldering it together.
Once I had that, I bought a photoresistor (to detect light), and a tiny piezo microphone (to detect sound), and wired those up to 741 op amps and transistors to amplify the signal. Then I modified the Rock Band code to draw a big white rectangle in the lower left of the screen exactly at the start of each beat. I asked Eric Brosius, lead audio guy, to author me a Rock Band “song” that consisted of nothing but 1000 Hz beeps, each played exactly on the beat at 120 bpm. In a properly calibrated system, the beep should be audible exactly when the white rectangle is shown on screen. To check, I physically taped the photoresistor to the TV screen where the white rectangle showed up and put the microphone next to the TV speaker. Then I hooked both up to the oscilloscope, and set it up so that the light from the white rectangle would trigger the oscilloscope. Then I loaded and played the beeping song. And voila, you could see that the video and the audio occurred at different times! The signal jumped around a bit, but on average it was clear that on my TV, the audio was arriving around 50 ms later than the video. Since we derived the track display time from the audio time, I just subtracted 50 ms from where the audio said it was, and that fixed it, because now the track was shifted 50 ms later, and the two lined up.
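In code terms, that fix amounts to something like the sketch below. This is not the actual Rock Band code: the function names are hypothetical and the readings are invented to stand in for the jittery oscilloscope trace. It only illustrates the idea of averaging a noisy offset and subtracting it from the audio clock before drawing the track.

```python
from statistics import mean


def estimate_av_offset_ms(readings_ms):
    """Average several noisy readings of (audio arrival - video arrival)."""
    return mean(readings_ms)


def track_display_time_ms(audio_time_ms, av_offset_ms):
    """Derive the track position from the audio clock, minus the offset.

    Drawing the track as if the song were av_offset_ms earlier makes each
    gem reach the now bar av_offset_ms later, matching the delayed audio.
    """
    return audio_time_ms - av_offset_ms


# Invented readings that jump around an average of 50 ms.
readings = [48.0, 53.0, 50.0, 47.0, 52.0]
offset = estimate_av_offset_ms(readings)
print(offset)                                 # 50.0
print(track_display_time_ms(2000.0, offset))  # 1950.0
```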
But that was only half the battle. We also needed to figure out the round trip delay, i.e., how long the total delay is from computer to calibrated video and audio and then back to the computer again through the controller. We need that to score you accurately. So I opened up an Xbox 360 controller, scraped the insulation away from the traces coming off of one of the buttons, and wired a relay to it, so that when my circuit board triggered the relay, it would effectively press the button. To close the loop, I made the photoresistor drive the transistor that would trigger the relay. This made it so that the white rectangle showing up on screen on the beat would press the button, so I could see when the game thought the button was getting pressed, and could adjust the “score time” accordingly to make the scoring as perfect as possible.
So those are the two numbers we need, one is the difference between the audio and video signals, and the other is the total round trip delay from the console to the screen and then back through the controller back into the game.
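As a rough sketch of how the second number might feed into scoring (an assumption about the general approach, not the shipped Rock Band code), the game can subtract the measured round trip delay from the time it sees a button press, and only then compare against the gem time using the 100 ms slop window. The function names and the example timings here are hypothetical.

```python
SLOP_WINDOW_MS = 100.0  # slop window on each side of the now bar


def score_time_ms(press_time_ms, round_trip_ms):
    """Shift the observed press time back by the measured round trip delay."""
    return press_time_ms - round_trip_ms


def is_hit(press_time_ms, gem_time_ms, round_trip_ms, slop_ms=SLOP_WINDOW_MS):
    """Return True if the corrected press lands inside the slop window."""
    error_ms = score_time_ms(press_time_ms, round_trip_ms) - gem_time_ms
    return abs(error_ms) <= slop_ms


# Made-up example: the gem is at 2000 ms and the game sees the press at 2120 ms.
print(is_hit(2120.0, 2000.0, round_trip_ms=0.0))   # False: 120 ms late, a miss
print(is_hit(2120.0, 2000.0, round_trip_ms=80.0))  # True: only 40 ms late
```

With a zero round trip delay the press looks 120 ms late and falls outside the window; with an 80 ms round trip accounted for, the same press counts as a hit.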
I made a tidy little package out of all the sensors, including nice plugs to go into the oscilloscope and for the sensors. I took a picture of it this morning, so you can see it. It all still works.
In the end, this setup was basically the first version of what we now call “Calbert”. In Rock Band 2, we included a little microphone and little light receiving diode in the guitar controller itself, to make life easy on the user when calibrating. You just hold it up to the screen and speaker, and it does the rest.
And that, my friends, is how signal processing, electrical engineering, and what kids these days call “making” sit at the core of the Rock Band play experience. And for you Rock Band 4 players out there, calibrate your system! It’s right under Settings on the main menu, can’t miss it. :)
- James Fleming
[Ed Note: It’s okay if a lot of this went over your head. It went over mine too. Just trust that James is a genius and is the main reason that you can play Rock Band on your fancy 1080p flatscreen TV.]