Monday, November 28, 2005

The billing disaster

The AdWords launch went fairly smoothly, and I spent most of the next two weeks just monitoring the system, fixing miscellaneous bugs, and answering emails from users. (Yes, I was front-line AdWords support for the first month or so.)

The billing system that I had written ran as a cron job (for you non-programmers, that means that it ran automatically on a set schedule) and the output scrolled by in a window on my screen. Everything was working so well I didn't really pay much attention to it any more, until out of the corner of my eye I noticed that something didn't look quite right.

I pulled up the biller window and saw that a whole bunch of credit card charges were being declined one after another. The reason was immediately obvious: the amounts being charged were outrageous, tens of thousands, hundreds of thousands, millions of dollars. Basically random numbers, most of which no doubt exceeded people's credit limits by orders of magnitude.

But a few didn't. Some charges, for hundreds or thousands of dollars, were getting through. Either way it was bad. For the charges that weren't getting through, the biller was automatically shutting down the accounts, suspending all their ads, and sending out nasty emails telling people that their credit cards had been rejected.

I got a sick feeling in the pit of my stomach, killed the biller, and started trying to figure out what the fsck was going on. (For you non-programmers out there, that's a little geek insider joke. Fsck is a unix command. It's short for File System ChecK.)

It quickly became evident that the root cause of the problem was some database corruption. The ad servers which actually served up the ads would keep track of how many times a particular ad had been served and periodically dump those counts into a database. The biller would then come along and periodically collect all those counts, roll them up into an invoice, and bill the credit cards. The database was filled with entries containing essentially random numbers. No one had a clue how they got there.
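
For the programmers: the roll-up step described above can be sketched roughly as follows. This is an illustration only, not the original AdWords code. The function name, the row layout, and the flat per-impression price are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical price per ad impression, in cents. The real pricing is not
# given in this post.
PRICE_PER_IMPRESSION_CENTS = 1


def roll_up(count_rows):
    """Sum per-ad serve counts into one invoice total (in cents) per account.

    count_rows is an iterable of (account_id, ad_id, times_served) tuples,
    standing in for the rows the ad servers dumped into the database.
    """
    totals = defaultdict(int)
    for account_id, _ad_id, times_served in count_rows:
        totals[account_id] += times_served * PRICE_PER_IMPRESSION_CENTS
    return dict(totals)
```

Note that nothing in a roll-up like this questions its input: a single corrupt `times_served` value flows straight through into the invoice total, which is how random numbers in the database turn into outrageous charges.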

I began the process of manually going through the database to clean up the bad entries, roll back the erroneous transactions, and send out apologetic emails to all the people who had been affected. Fortunately, there weren't a huge number of users back then, and I had caught the problem early enough that only a small number of them were affected. Still, it took several days to finally clean up the mess.

Now, it's a complete no-brainer that when something like that happens you add some code to detect the problem if it ever happens again, especially when you don't know why the problem happened in the first place. But I didn't. It's probably the single biggest professional mistake I've ever made. In my defense I can only say that I was under a lot of stress (more than I even realized at the time), but that's no excuse. I dropped the ball. And it was just pure dumb luck that the consequences were not more severe. If the problem had waited a year to crop up instead of a couple of weeks, or if I hadn't just happened to be there watching the biller window (both times!) when the problem cropped up, Google could have had a serious public relations problem on its hands. As it happened, only a few dozen people were affected and we were able to undo the damage fairly easily.

You can probably guess what happened next. Yep. One week later. Same problem. This time I added a sanity check to the billing code and kicked myself black and blue for not thinking to do it earlier. At least the cleanup went a little faster this time because by now I had a lot of practice in what to do.
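
For the programmers: a sanity check of this kind might look something like the sketch below. It is not the original billing code. The ceiling value, the exception name, and the `charge_card` callback are all hypothetical, chosen only to show the idea of refusing to bill when an amount looks implausible.

```python
# Hypothetical ceiling for a single charge. The limit actually used is not
# given in this post.
MAX_PLAUSIBLE_CHARGE_CENTS = 500_000  # $5,000


class ImplausibleCharge(Exception):
    """Raised when an invoice amount fails the sanity check."""


def check_charge(account_id, amount_cents, max_cents=MAX_PLAUSIBLE_CHARGE_CENTS):
    """Reject amounts that are negative or above the plausible ceiling."""
    if amount_cents < 0 or amount_cents > max_cents:
        raise ImplausibleCharge(
            f"account {account_id}: refusing to charge {amount_cents} cents"
        )


def bill_all(invoices, charge_card):
    """Charge every account, but only if every invoice passes the check.

    invoices maps account_id to an amount in cents. charge_card is a
    caller-supplied function standing in for the credit card processor.
    """
    # Validate everything first, so one corrupt row halts the whole run
    # before any card is charged or any account is suspended.
    for account_id, amount_cents in invoices.items():
        check_charge(account_id, amount_cents)
    for account_id, amount_cents in invoices.items():
        charge_card(account_id, amount_cents)
```

Halting the entire run on the first bad amount is a deliberate choice in this sketch: when the cause of the corruption is unknown, a stopped biller that someone has to look at is cheaper than a batch of bogus charges and suspended accounts.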

Stress.

And we still didn't know where the random numbers were coming from despite the fact that everyone on the ads team was trying to figure it out.

6 Comments:

Anonymous Anonymous said...

There's no need to apologise for your geekiness -- as if we don't all know what cron and fsck are.

I'd save the apologies and explanations for the "Steven Levy" style book. I'd guess that most know to go look at things like this for definitions.

12:20 PM  
Anonymous Jan said...

There are still problems with credit cards being declined without any reason - perhaps it's all your fault :)

PS: support-team still can't explain the problem

12:48 PM  
Blogger Philipp said...

Uhm, just to contrast the first comment, yeah thanks for letting us know what Fsck means right here and now without us having to jump around websites too much :)
Who says you can't read a Steven Levy book online? Plus, I've got a suspicion this will turn into a book one day or another, and then it will come in handy to not rely on links for crucial points :)

2:26 PM  
Blogger Greg Linden said...

I'm curious, what database were you using at that time?

3:16 PM  
Anonymous Anonymous said...

The problem is still there: they are supposed to bill every other day or so, but sometimes they will let it accumulate and then bill the card. But many times the transactions get declined (as many cards have a 10k limit for a single transaction).

6:02 PM  
Anonymous Anonymous said...

PS. fsck is file system consistency checker.

7:50 AM  
