How does Shazam work?
Everyone loves Shazam, a mobile app that can name almost any tune you play to it. It’s the most miraculous app since sliced bread. I just had to find out, once and for all, what on earth the magic behind Shazam is. So I did some research and would like to share the secret…
Shazam is an application that analyzes and matches music. When you install it on your phone and hold the microphone to some music for about 20 to 30 seconds, it will tell you which song it is.
When I first used it, I was instantly intrigued by its magic. “How did it do that!?” Even after using it a lot, it still has a bit of magical feel to it.
There are a couple of ways to use Shazam, but one of the more convenient is to install the free app on an iPhone. Just hit the “tag now” button, hold the phone’s mic up to a speaker, and it will usually identify the song and provide artist information, lyrics, a link to purchase the album, and options to share on social networks.
What is so remarkable about the service is that it works on very obscure songs and will do so even with extraneous background noise. I’ve gotten it to work while sitting inside a crowded coffee shop and a pizzeria.
(ARTICLE FROM SLATE.COM)
Shazam is the closest a cell phone can come to magic. Say you’re in a restaurant, a song comes on, and you can’t quite place the tune. In the past, your options were limited; you could try asking your spouse or the waiter for a clue, but that approach risked revealing your ignorance. (That’s “Sex Machine,” dumb ass.) Shazam—which launched in the United Kingdom in 2002 as a call-in service and became widely known in the United States last year when it hit the iPhone—solves the dilemma in a few clicks. Press a button on your phone, and in seconds you’ll get the artist and song title. Other than playing video games, it’s the most useful thing you can do on your phone.
Last week, Shazam announced that more than 50 million people worldwide have used the service—up from 35 million at the start of the year. The company also said that it’s received an undisclosed investment from the fabled Silicon Valley venture-capital firm KPCB. Shazam’s success seems justified—it’s the one app you can show to iPhone skeptics to get them to reconsider their position (though Shazam is also available on Android, BlackBerry, Windows Mobile, and pretty much any other phone). Yet for all the acclaim it garners, Shazam’s inner workings are pretty mysterious. How does it actually ID your song? How does the company make money? (Here’s one hint: iPhone users should expect to see a pay version soon.) And what are the long-term prospects for a firm whose sole purpose is satisfying an acute, very occasional need?
First, a short explanation of how Shazam works. The company has a library of more than 8 million songs, and it has devised a technique to break down each track into a simple numeric signature—a code that is unique to each track. “The main thing here is creating a ‘fingerprint’ of each performance,” says Andrew Fisher, Shazam’s CEO. When you hold your phone up to a song you’d like to ID, Shazam turns your clip into a signature using the same method. Then it’s just a matter of pattern-matching—Shazam searches its library for the code it created from your clip; when it finds that bit, it knows it’s found your song.
OK, but how does Shazam make these fingerprints? As Avery Wang, Shazam’s chief scientist and one of its co-founders, explained to Scientific American in 2003, the company’s approach was long considered computationally impractical—there was thought to be too much information in a song to compile a simple signature. But as he wrestled with the problem, Wang had a brilliant idea: What if he ignored nearly everything in a song and focused instead on just a few relatively “intense” moments? Thus Shazam creates a spectrogram for each song in its database—a graph that plots three dimensions of music: frequency vs. amplitude vs. time. The algorithm then picks out just those points that represent the peaks of the graph—notes that contain “higher energy content” than all the other notes around it, as Wang explained in an academic paper he published to describe how Shazam works (PDF). In practice, this seems to work out to about three data points per second per song.
You’d think that ignoring nearly all of the information in a song would lead to inaccurate matches, but Shazam’s fingerprinting technique is remarkably immune to disturbances—it can match songs in noisy environments over bad cell connections. Fisher says that the company has also recently found a way to match music that has been imperceptibly sped up (as club DJs sometimes do to match a specific tempo, or as radio DJs do to fit in a song before an ad break). And it can tell the difference between different versions of the same song. I just tried it on three different versions of “Landslide”—the original by Fleetwood Mac and covers by the Smashing Pumpkins and the Dixie Chicks—and it nailed each one.
Fisher declined to tell me Shazam’s overall hit-and-miss rate. All he would say is that the service is good enough to keep people coming back for more—the average user looks for songs eight times a month. The most common reason Shazam fails to identify a song is that it doesn’t have enough data. The system needs at least five seconds of music to make a match, and sometimes people turn it on just as the song is ending. There are also frequently errors when people look up live performances—if you hold up your phone to your TV during the musical segment on Saturday Night Live, Shazam will most probably fail to ID the song. (If you do get a match from SNL, you’re probably watching that episode with Ashlee Simpson—Shazam is a great way to catch lip-syncers in the act.) Fisher says that Shazam is technically capable of working on live performances, but they’ve turned off that ability for what he terms “business reasons.” “Right now people trust the brand—trying to match live songs wouldn’t get very high accuracy,” he says. (If you’ve got a tune stuck in your head, try using Midomi, a rival of Shazam’s that can ID songs based on your humming or singing.)
Shazam’s iPhone version has been a blockbuster, but it still represents just 20 percent of the service’s customer base, which spans more than 150 countries and pretty much every mobile carrier in the world. The iPhone version also marked a departure for the company—it was the first version that Shazam offered for free. Fisher says this proved to be a good idea; it brought Shazam instant renown, and the company now has enough of a customer base that it can make decent money through in-app ads and by getting a cut of each song purchase people make through the app. But staying fully free forever isn’t sustainable, Fisher says. The company recently unveiled a Windows Mobile version of its app that operates under a “freemium” pricing model—users who download the free version can search for five songs a month, while a premium version that goes for a one-time fee of $5 will allow unlimited song searches. Fisher says that the $5 version for the iPhone (and most other platforms) will launch by the end of the year.
The company is also planning to add a lot more services to its apps—a recommendations engine, a way to let you share your musical tastes with your friends, and charts that show the songs that people are searching for. Every Monday, Shazam sends out its charts to record labels, and execs have been known to sign artists based on the data. This has led to a new way for artists to break into the mainstream: getting featured in TV ads. In 2005, for instance, Volkswagen ran an ad in Europe for the Golf GTI that featured a remixed version of “Singin’ in the Rain” by Mint Royale. The song inspired a lot of searching on Shazam—and prompted the band’s label to release the track, which then shot to the top of the European charts. “We probably see that at least once a month around the world,” Fisher says. In other words, Shazam doesn’t only help an audience find music. Sometimes it helps music find an audience.
Luckily, I found a paper written by one of the developers explaining just how Shazam works. It’s worth checking out. Of course, they leave out some of the details, but the basic idea is exactly what you would expect: it relies on fingerprinting music based on the spectrogram.
Here are the basic steps:
1. Beforehand, Shazam fingerprints a comprehensive catalog of music, and stores the fingerprints in a database.
2. A user “tags” a song they hear, which fingerprints a 10-second sample of audio.
3. The Shazam app uploads the fingerprint to Shazam’s service, which runs a search for a matching fingerprint in their database.
4. If a match is found, the song info is returned to the user, otherwise an error is returned.
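The four steps above can be sketched in a few lines of Python. This is a toy stand-in, not Shazam’s actual code: the `fingerprint` function, the catalog contents, and the subset-matching rule are all made up purely to show the shape of the flow.

```python
# A minimal toy sketch of the tag-and-match flow described above.
# fingerprint() and the catalog are hypothetical stand-ins,
# not Shazam's actual implementation.

def fingerprint(samples):
    """Toy fingerprint: the set of indices where the signal peaks."""
    return frozenset(
        i for i in range(1, len(samples) - 1)
        if samples[i] > samples[i - 1] and samples[i] > samples[i + 1]
    )

# Step 1: fingerprint a catalog of "songs" (here, short toy signals).
catalog = {
    "Song A": [0, 3, 1, 0, 2, 0],
    "Song B": [1, 0, 4, 0, 0, 5],
}
index = {name: fingerprint(s) for name, s in catalog.items()}

def identify(sample):
    """Steps 2-4: fingerprint the sample and search for a match."""
    fp = fingerprint(sample)
    for name, song_fp in index.items():
        if fp and fp <= song_fp:   # all sample peaks found in the song
            return name
    return None                    # the "error" case: no match found

print(identify([1, 0, 4, 0, 0, 5]))  # matches "Song B"
```

In the real system, of course, the fingerprint is computed from the spectrogram (as described below) and the search runs server-side against millions of tracks.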
Here’s how the fingerprinting works:
You can think of any piece of music as a time-frequency graph called a spectrogram. On one axis is time, on another is frequency, and on the third is intensity. Each point on the graph represents the intensity of a given frequency at a specific point in time. Assuming time is on the x-axis and frequency is on the y-axis, a horizontal line would represent a continuous pure tone, and a vertical line would represent an instantaneous burst of white noise. Here’s one example of how a song might look:
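A spectrogram like this can be computed by slicing the audio into short windows and taking a Fourier transform of each one. The sketch below uses a deliberately naive DFT in pure Python to keep it self-contained; real implementations use an FFT, windowing functions, and overlapping frames.

```python
# A minimal spectrogram sketch: rows are time windows, columns are
# frequency bins, values are intensity. Naive DFT, no audio libraries.
import cmath
import math

def spectrogram(samples, window=8):
    frames = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        frame = []
        for k in range(window // 2):  # naive DFT magnitude per bin k
            s = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / window)
                    for n in range(window))
            frame.append(abs(s))
        frames.append(frame)
    return frames  # frames[t][k] = intensity of frequency k at time t

# A pure tone concentrates its energy in one frequency bin over time,
# i.e. the "horizontal line" on the spectrogram described above.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = spectrogram(tone)
peak_bin = max(range(4), key=lambda k: spec[0][k])
print(peak_bin)  # the energy lands in bin 2
```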
The Shazam algorithm fingerprints a song by generating this 3D graph and identifying frequencies of “peak intensity.” For each of these peak points it keeps track of the frequency and the amount of time from the beginning of the track. Based on the paper’s examples, I’m guessing they find about 3 of these points per second. [Update: A commenter below notes that in his own implementation he needed more like 30 points/sec.] So an example of a fingerprint for a 10-second sample might be:
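Peak picking on a spectrogram boils down to finding local maxima in a 2D grid. The sketch below is one simple way to do it, assuming the grid layout `spec[t][f]` from above; the paper itself doesn’t specify the exact neighborhood or thresholding rules, so treat the details here as illustrative.

```python
# A sketch of "peak intensity" picking: a point is kept if it is
# strictly louder than its four neighbours. Real systems also apply
# a threshold and limit peak density (roughly 3 peaks/sec per the
# paper's figures).
def find_peaks(spec):
    peaks = []
    for t in range(1, len(spec) - 1):
        for f in range(1, len(spec[t]) - 1):
            v = spec[t][f]
            if (v > spec[t - 1][f] and v > spec[t + 1][f] and
                    v > spec[t][f - 1] and v > spec[t][f + 1]):
                peaks.append((t, f))  # (time index, frequency bin)
    return peaks

spec = [
    [0, 1, 0, 0],
    [1, 9, 1, 0],  # 9 is a local maximum at (t=1, f=1)
    [0, 1, 0, 7],  # 7 sits on the edge, so this sketch skips it
    [0, 0, 1, 0],
]
print(find_peaks(spec))  # [(1, 1)]
```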
| Frequency in Hz | Time in seconds |
| --- | --- |
Shazam builds its fingerprint catalog out as a hash table, where the key is the frequency. When Shazam receives a fingerprint like the one above, it uses the first key (in this case 823.44) and searches for all matching songs. Their hash table might look like the following:
| Frequency in Hz | Time in seconds, song information |
| --- | --- |
| 823.43 | 53.352, “Song A” by Artist 1 |
| 823.44 | 34.678, “Song B” by Artist 2 |
| 823.45 | 108.65, “Song C” by Artist 3 |
| 1892.31 | 34.945, “Song B” by Artist 2 |
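Assuming the table above is stored as a plain dictionary keyed by frequency (a simplification of whatever Shazam actually uses), the lookup step is a constant-time dictionary access:

```python
# A sketch of the catalog-as-hash-table idea: frequencies map to
# (time, song) entries, and a sample's key is looked up in O(1).
# The entries come straight from the example table above.
catalog = {
    823.43:  [(53.352, '"Song A" by Artist 1')],
    823.44:  [(34.678, '"Song B" by Artist 2')],
    823.45:  [(108.65, '"Song C" by Artist 3')],
    1892.31: [(34.945, '"Song B" by Artist 2')],
}

def candidates(frequency):
    """Return every (time, song) entry filed under this frequency."""
    return catalog.get(frequency, [])

print(candidates(823.44))  # [(34.678, '"Song B" by Artist 2')]
```

Each key maps to a *list* of entries because many songs (and many moments within one song) can share the same peak frequency.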
[Some extra detail: They do not just mark a single point in the spectrogram; rather, they mark a pair of points: the “peak intensity” plus a second “anchor point.” So their key is not just a single frequency, it is a hash of the frequencies of both points. This leads to fewer hash collisions, which in turn speeds up catalog searching by several orders of magnitude by allowing them to take greater advantage of the table’s constant (O(1)) look-up time. There are many interesting things to say about hashing, but I’m not going to go into them here, so just read around the links in this paragraph if you’re interested.]
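The pair-hashing idea can be sketched as follows. The key layout (both frequencies plus the time gap between the two points) and the `fan_out` parameter are assumptions for illustration, not Shazam’s actual encoding:

```python
# A sketch of anchor-pair hashing: instead of one frequency, each key
# combines an anchor peak with a nearby target peak, making keys far
# more distinctive and collisions far rarer.
def pair_hashes(peaks, fan_out=3):
    """peaks: list of (time, frequency) tuples, sorted by time."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            key = (f1, f2, t2 - t1)  # combined key for the pair
            hashes.append((key, t1)) # stored with the anchor's time
    return hashes

peaks = [(0, 823), (1, 1892), (3, 712)]
for key, t in pair_hashes(peaks):
    print(key, t)
# (823, 1892, 1) 0
# (823, 712, 3) 0
# (1892, 712, 2) 1
```

Three peaks already yield three distinct keys; with single frequencies as keys, two of them would be far likelier to collide with unrelated songs.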
If a specific song is hit multiple times (based on examples in the paper I think it needs about 1 frequency hit per second), it then checks to see if these frequencies correspond in time. They actually have a clever way of doing this. They create a 2D plot of frequency hits: on one axis is the time those frequencies appear in the song, measured from the beginning of the track; on the other axis is the time they appear in the sample. If there is a temporal relation between the two sets of points, the points will align along a diagonal. They use another signal-processing method to find this line, and if it exists with some certainty, they label the song a match.
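A diagonal on that plot is equivalent to many matched points sharing the same offset (time in the song minus time in the sample). So counting offsets in a histogram is a simple stand-in for the paper’s line-finding step, and that is what the sketch below does:

```python
# A sketch of the time-alignment check: a true match puts the
# (time-in-song, time-in-sample) pairs on a diagonal, i.e. many
# pairs share one offset. We vote with a histogram of offsets.
from collections import Counter

def best_offset(matches):
    """matches: list of (time_in_song, time_in_sample) pairs."""
    offsets = Counter(round(ts - tu, 2) for ts, tu in matches)
    offset, votes = offsets.most_common(1)[0]
    return offset, votes

# Three hits share the offset 50.0 (the diagonal); one is a stray hit.
matches = [(53.0, 3.0), (55.5, 5.5), (58.2, 8.2), (91.0, 4.0)]
offset, votes = best_offset(matches)
print(offset, votes)  # 50.0 3
```

A song is declared a match when the winning offset collects enough votes relative to the sample length; a non-matching song’s offsets scatter, so no single bin dominates.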
- developify posted this