Building a New Transcoder Service (Part 1)

The company I work for allows users to record videos of their tests. These recordings are stored on Amazon S3, and users can download and re-watch them whenever they like.

Unfortunately, the way this all works requires a lot of resources; VNC sessions are recorded and transcoded on the same boxes that serve those VNC sessions to users. This can start to overwhelm those boxes, which also handle other services. With usage even slightly above what we typically see in a day, the whole system can come to a grinding halt. Obviously, we don't want that to happen.

So I built a system that offloads the transcoding. For performance reasons we can't move the recording itself off those boxes, but recording is the lightest part of the whole pipeline, so it isn't really a problem.

Videos still get uploaded to S3, but there's a feature that makes this whole process easier: S3 can publish a message to SQS with information about each new file. On top of that, we can have it send these messages only for certain file types; in this case, we only care about new .flv files.
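Setting that up is a one-time piece of bucket configuration (you can also click it together in the S3 console). Here's a sketch with boto3, where the bucket name and queue ARN are placeholders rather than our real values:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder names: substitute your own bucket and the ARN of the SQS queue
# that S3 should publish to.
s3.put_bucket_notification_configuration(
    Bucket="my-recordings-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:new-flv-uploads",
                "Events": ["s3:ObjectCreated:*"],
                # Only fire for new .flv objects; everything else is ignored.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".flv"}]}
                },
            }
        ]
    },
)
```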

So when an FLV file is uploaded, a message gets added to the queue saying that the file now exists. On its own, this is actually pretty useless: we know the file is there, but we can't do anything useful with it just yet.

I ended up writing a tiny Python application that works off this queue to do the transcoding. It pulls a message from the queue, which contains the file's key within the S3 bucket, retrieves that file from the bucket, and runs it through ffmpeg to convert it to a web-ready MP4. The MP4 is then uploaded to S3 beside the source FLV, a message is sent to our API to report that the conversion completed successfully, and the process loops.
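The real worker has more error handling around it, but the core loop looks roughly like this; the queue URL, notification endpoint, and exact ffmpeg flags below are stand-ins, not our actual setup:

```python
import json
import subprocess

import boto3
import requests

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-flv-uploads"  # placeholder
NOTIFY_URL = "https://api.example.com/videos/transcoded"                        # placeholder

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

while True:
    # Long-poll SQS for the next "new .flv uploaded" event from S3.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for message in resp.get("Messages", []):
        record = json.loads(message["Body"])["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]          # e.g. recordings/test-1234.flv

        local_flv = "/tmp/input.flv"
        local_mp4 = "/tmp/output.mp4"
        s3.download_file(bucket, key, local_flv)

        # Convert to a web-ready MP4 (H.264/AAC, moov atom up front for streaming).
        subprocess.run(
            ["ffmpeg", "-y", "-i", local_flv, "-c:v", "libx264", "-c:a", "aac",
             "-movflags", "+faststart", local_mp4],
            check=True,
        )

        # Upload the MP4 next to the source FLV and tell our API it's ready.
        mp4_key = key.rsplit(".", 1)[0] + ".mp4"
        s3.upload_file(local_mp4, bucket, mp4_key)
        requests.post(NOTIFY_URL, json={"key": mp4_key})

        # Only delete the message once the whole pipeline has succeeded.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```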

Compared to other services that do transcoding, this is *insanely* cheap. Amazon's Elastic Transcoder, for example, charges 3 cents per minute of HD video (all of our videos are at a resolution of at least 720p), and other services run between 60 and 80% of that price. It doesn't seem like much at first, but take into account that our users generate roughly 300,000 videos a month: even if every single video were only 30 seconds long, which is a laughably low estimate, that works out to $4,500 a month just for the transcoding service. A more reasonable average of 2 minutes, since many of our videos sit at the extremes of 10 minutes or under a minute, gives a total of $18,000/month.

Our transcoder service runs on two t2.xlarge instances. Total price per month for us to keep up with 300k videos? About $300. I assure you, there are no digits missing from that number; each instance costs right around $140/month. Even against the extremely low estimate of 30 seconds per video, that's a savings of about 94%. A week of development and testing, and we were able to offload a CPU-intensive task from sensitive infrastructure to a place that, really, doesn't care how much we throw at it. Those two boxes, despite being fairly small, churn through the videos and keep up quite handily. Usually, anyway; in the next part, I'll show what happens when videos come in too fast for those two boxes to keep up.
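For anyone who wants to check my math, here's the back-of-the-envelope version of those numbers:

```python
# A quick back-of-the-envelope check of the figures above.
videos_per_month = 300_000
per_minute = 0.03  # Elastic Transcoder's rate for HD output, USD per minute

for avg_minutes in (0.5, 2.0):
    hosted = videos_per_month * avg_minutes * per_minute
    print(f"{avg_minutes:g} min average: hosted transcoding ~${hosted:,.0f}/month")

ec2 = 2 * 140  # two t2.xlarge instances
low_estimate = videos_per_month * 0.5 * per_minute
print(f"our own workers: ~${ec2}/month, a {1 - ec2 / low_estimate:.0%} savings at the low estimate")
```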

Writing a library in Go

At work, we do a lot with HAProxy. I mean, a lot: our HAProxy configuration is over 6,000 lines. We've been looking into ways to pare that down, but it still leaves us with one issue: how can we even start to track what's going on in this system at any given moment, when there are so many moving parts?

We’ve started using the ELK stack (though it’s now called the Elastic Stack by Elastic, the company that really makes most of it), but only for logging API calls in our stack.

HAProxy allows you to create a "stats socket", a unix domain socket you can connect to and send commands over, either to control parts of HAProxy itself or, more usefully for this case, to get a list of statistics for each server, listener, backend, and frontend. The problem is that in some setups, where HAProxy runs multiple processes to handle high load, we get multiple stats sockets, and the numbers from all of them have to be aggregated.
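To make the rest of this concrete, here's roughly what talking to one of those sockets looks like. I'm using Python here just to show the raw protocol, and the socket paths are made up for the example:

```python
import socket

def fetch_stats_csv(socket_path):
    """Ask one HAProxy stats socket for its raw CSV stats dump."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(socket_path)
    sock.sendall(b"show stat\n")
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break  # HAProxy closes the connection once the dump is complete
        chunks.append(data)
    sock.close()
    return b"".join(chunks).decode()

# One call per stats socket; these paths are assumptions for illustration.
dumps = [fetch_stats_csv(f"/var/run/haproxy/stats{i}.sock") for i in range(1, 5)]
print(dumps[0].splitlines()[0])  # header line: "# pxname,svname,qcur,qmax,..."
```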

I've been experimenting with Go lately, so I started writing a tool to handle this data, and that turned into a library, HAProxyGoStat, that makes the data easier to work with. It handles parsing the CSV format HAProxy outputs by default (the other formats, such as the JSON output and the split-out format it can produce, are both more verbose and, for the most part, harder to parse, in addition to being unsupported on older versions of HAProxy). You can find the library here: https://github.com/hmschreck/HAProxyGoStat.
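Once you have the raw dump, parsing it is mostly a matter of stripping the "# " prefix off the header line. A rough Python illustration (again, not the library's actual API):

```python
import csv
import io

def parse_stats_csv(dump):
    """Turn one raw 'show stat' dump into a list of dicts keyed by column name."""
    # The first line is the CSV header, prefixed with "# "; strip that so
    # DictReader sees clean field names.
    return list(csv.DictReader(io.StringIO(dump.lstrip("# "))))

# fetch_stats_csv is the helper from the previous sketch; the path is again an assumption.
snapshot = parse_stats_csv(fetch_stats_csv("/var/run/haproxy/stats1.sock"))
backends = [row for row in snapshot if row["svname"] == "BACKEND"]
print(len(snapshot), "stat lines,", len(backends), "of them backends")
```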

As a demonstration of just how fast this library is: in our current test environment, HAProxy has 4 stats sockets, and each one reports about 1,550 stat lines (every server, listener, backend, and frontend gets one). Each line carries 82 fields, which means a single set of socket 'snapshots' contains over half a million individual values (4 × 1,550 × 82 ≈ 508,000).

The aggregation step combines the per-socket snapshots field by field, either passing a value straight through or taking its average, max, or sum, and returns a single merged snapshot.
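To illustrate the idea (not the library's actual API), here's a toy version of that merge for a single stat line, with the per-field rules picked purely for the example:

```python
# Hypothetical rules; the real library decides per-field how values get combined.
RULES = {
    "pxname": "first",   # identifying fields pass straight through
    "svname": "first",
    "scur":   "sum",     # current sessions add up across processes
    "smax":   "max",     # peaks take the maximum
    "rate":   "sum",
    "qtime":  "avg",     # timing fields average out
}

def aggregate(records):
    """Combine the same stat line, as reported by several sockets, into one record."""
    merged = {}
    for field, rule in RULES.items():
        values = [r[field] for r in records]
        if rule == "first":
            merged[field] = values[0]
        elif rule == "sum":
            merged[field] = sum(int(v or 0) for v in values)
        elif rule == "max":
            merged[field] = max(int(v or 0) for v in values)
        elif rule == "avg":
            merged[field] = sum(int(v or 0) for v in values) // len(values)
    return merged

# e.g. the same backend line as seen by four HAProxy processes:
per_socket = [
    {"pxname": "api", "svname": "BACKEND", "scur": "12", "smax": "40", "rate": "7", "qtime": "3"},
    {"pxname": "api", "svname": "BACKEND", "scur": "9",  "smax": "35", "rate": "5", "qtime": "5"},
    {"pxname": "api", "svname": "BACKEND", "scur": "11", "smax": "42", "rate": "6", "qtime": "4"},
    {"pxname": "api", "svname": "BACKEND", "scur": "10", "smax": "38", "rate": "8", "qtime": "4"},
]
print(aggregate(per_socket))
# {'pxname': 'api', 'svname': 'BACKEND', 'scur': 42, 'smax': 42, 'rate': 26, 'qtime': 4}
```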

In a test that creates a parser, reaches out to each of the four sockets simultaneously to build a set of snapshots, aggregates them, and then filters the result, the entire program runs in 0.22-0.27 seconds, hovering around 0.25. Keep in mind that this includes several initialization steps, such as creating the parser, that a properly daemonized version wouldn't have to repeat.

That boils half a million values down to a single snapshot in a quarter of a second. I can literally run this every second and it will not back up. That's *fast*.