October 2, 2014

Scaling SAM Embeds

by Sean Solbak

One of the new features we added at SAM was the ability to embed Stories into a series of different templates. We call these outputs SAM Publishing. They are ideal for placing curated social content in web pages, displaying on big-screen TVs at events, and many other use cases going forward. These embeds are delivered and hosted as a SAM service for all of our users, many of whom are among the largest publishers in the world and receive large volumes of traffic each day. As a result, SAM needed to scale to handle the load of all of our users, and by extension all of their page views. It was crucial that we test our ability to scale with load. We want our clients to trust us, and in order to do that we have to provide a service that is available at all times. This article is an overview of the process we followed and the journey to get each of our web server instances to handle hundreds of millions of requests per month.

At SAM our core app technologies are Node.js and MongoDB, hosted primarily on AWS. Instead of listing all of our technologies, the following diagram provides a good overview of our infrastructure.

[Diagram: SAM infrastructure v0.2]

Throughout our journey of scaling SAM we worked with a company called RunWithIT, which has a proprietary set of technologies for spinning up multiple AWS instances that run load tests and gather the results into graphs. Our team worked closely with Dean Bittner (founder of RunWithIT), whose specialty is “bringing down servers to identify bottlenecks and achieve scale.” We’d highly recommend that anyone who wants to release with confidence reach out to the RunWithIT team. We also used a combination of async’s queue and hyperquest to run our baseline test of continuous load from 50 concurrent users. Once we felt we had solved the current bottleneck, we’d get Dean to run the full load against our test servers.
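For reference, a baseline test like this can be put together in a few lines of Node.js. The sketch below is illustrative only (the target URL is a placeholder and our real script differed), but it shows the idea: keep 50 requests in flight continuously using async’s queue and hyperquest.

    // Illustrative baseline load script: keep 50 requests in flight
    // against a test endpoint using async.queue and hyperquest.
    var async = require('async');
    var hyperquest = require('hyperquest');

    var TARGET = 'http://test-env.example.com/embed/some-story'; // placeholder
    var CONCURRENCY = 50;

    var queue = async.queue(function (task, done) {
      var started = Date.now();
      var req = hyperquest(TARGET);

      req.on('response', function (res) {
        res.resume(); // drain the body so the socket is released
        res.on('end', function () {
          console.log(res.statusCode, (Date.now() - started) + 'ms');
          queue.push({}); // re-queue to keep constant pressure on the server
          done();
        });
      });

      req.on('error', function (err) {
        console.error(err.message);
        queue.push({});
        done();
      });
    }, CONCURRENCY);

    // Prime the queue with one task per simulated user.
    for (var i = 0; i < CONCURRENCY; i++) {
      queue.push({});
    }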

Our initial alpha launch target was 500 million requests per month per instance. This would safely let us handle any of our alpha clients’ monthly load if it were all to go through just one SAM embed story.

It seemed easy: bundle the CSS and JS files in a grunt build and ship it, right?!? Well, the initial results failed miserably…

Our initial bundled JS file was 234 KB. As soon as we applied a basic load of 20 concurrent users, Amazon Elastic Beanstalk (EB) started spinning up new servers. Surely a server could handle more than 20 users! We soon realized the default EB load balancing policy was to scale based on network traffic. The solution: reduce the amount of data sent per request. To do this we offloaded the vendor (non-SAM) JS files onto CDNs. This reduced our bundled JS file to 4 KB and also offered quicker delivery of our vendor dependencies to clients outside North America. A substantial gain.
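As a rough sketch of the change (file names and paths here are made up for illustration), the grunt build went from bundling everything to minifying only our own embed code, with the vendor libraries referenced from CDNs in the embed template instead.

    // Gruntfile.js (illustrative only; the real build contains more tasks)
    module.exports = function (grunt) {
      grunt.initConfig({
        uglify: {
          embed: {
            files: {
              // Bundle only SAM code; vendor libraries (jQuery, Handlebars, etc.)
              // are no longer concatenated here and are loaded from CDNs instead.
              'dist/sam-embed.min.js': ['src/embed/**/*.js']
            }
          }
        }
      });

      grunt.loadNpmTasks('grunt-contrib-uglify');
      grunt.registerTask('build', ['uglify:embed']);
    };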

At this point we had a 10x increase in results. Or at least we thought we did, because the acceptable results would only last for 60 seconds of load. What we noticed was that MongoDB wait times started growing over time – to the point where response times were over 30s. This was not acceptable. We thought Mongo should be able to perform better, but we set that aside and added a caching mechanism. Since the content of the stories is somewhat static (except for live event style stories), we could cache requests for around 10s and use Pusher for real-time updating. We implemented a basic in-memory cache that would expire content after a specified time. This worked, but the results were the same: 30s response times under load. This pointed to a new bottleneck though – the web server itself. Our test results began showing a consistent up-and-down pattern. We needed to dig deeper.
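The cache itself was nothing fancy. A minimal sketch of the idea (the function and variable names here are hypothetical, not our actual code):

    // Minimal in-memory cache with a ~10 second TTL (illustrative sketch).
    var cache = {};
    var TTL_MS = 10 * 1000;

    function getCached(key, fetchFn, callback) {
      var entry = cache[key];

      // Serve from memory while the entry is still fresh.
      if (entry && (Date.now() - entry.storedAt) < TTL_MS) {
        return callback(null, entry.value);
      }

      // Otherwise hit the database (whatever fetchFn wraps) and cache the result.
      fetchFn(function (err, value) {
        if (err) return callback(err);
        cache[key] = { value: value, storedAt: Date.now() };
        callback(null, value);
      });
    }

    // Usage: getCached('story:' + storyId, loadStoryFromMongo, sendResponse);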

[Graph: TCP limit]

The problem was that our web servers were running out of TCP connections. As a former .NET developer and non-sysadmin, it was time to learn how to scale a Linux box to handle more requests. Terms like ulimit were new concepts. A base AWS instance allows 1024 TCP connections per 60 seconds. Once all the TCP connections were used up, the applications would wait for an available one, resulting in the spikes in the graph. Inside AWS, the servers would show 100% CPU, which was a bit misleading because they were simply waiting, not running at capacity. 1024 connections per 60 seconds meant that each server could only handle 1024 requests per minute, or about 1.5 million embed views per day, far short of our 500-million-per-month target. FAIL. Serving traffic at the scale our users get page views would be very expensive. We were surprised the default instances from Amazon only allowed that many requests. The upside is that it lets you fail quickly and understand what you are doing. For SAM to scale, we needed to modify the following servers in AWS.

  1. Node.js application servers – this is a good starting point for learning how to increase system resources on a Linux box. Within Elastic Beanstalk, you can use ebextensions to customize the EB containers.
    1. Machine TCP limits – increase the nofile and nproc limits
    2. NGINX – look at the worker_rlimit_nofile, worker_processes and events.worker_connections settings (see the sketch after this list)
  2. MongoDB – MongoDB provides a good article with a set of production release notes including their recommended production configuration settings. The whole article is a good read.
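For the NGINX piece, the relevant directives live in the main nginx.conf. The values below are illustrative only and should be tuned to your instance size; they simply show the settings mentioned above:

    # nginx.conf (illustrative values; tune per instance size)
    worker_processes auto;          # one worker per CPU core
    worker_rlimit_nofile 65535;     # raise the per-worker open-file (socket) limit

    events {
        worker_connections 4096;    # max simultaneous connections per worker
    }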

It took a while to realize that all three applications/servers needed to be set up properly. In hindsight, it would have been better to scale the MongoDB servers before adding caching, but the end result is a highly performant request that barely burdens the Mongo instances at all.

At this point we proudly proclaimed that the problem was solved.

We weren’t done.

We started noticing that our response times would gradually decay over time. Our New Relic results said that server processing was taking 13s. How could that be possible? The code is simple and does almost no processing. Thankfully, New Relic breaks down each request by resource usage.

[Screenshot: New Relic request breakdown]

It sure was taking a long time to render our simple pages via Handlebars. We had neglected to set up Handlebars caching in our production environments, a simple fix. In development it is nice to be able to edit the Handlebars templates and hit refresh without having to restart the server. In production, an I/O request per web request is a nightmare, with the end result being an ever-growing I/O request queue.
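The fix amounts to a couple of lines. A minimal sketch, assuming an Express app with a Handlebars-based view engine such as express-handlebars (the exact module and setup in our codebase may differ):

    var express = require('express');
    var exphbs = require('express-handlebars'); // assumed view-engine module

    var app = express();
    app.engine('handlebars', exphbs());
    app.set('view engine', 'handlebars');

    // In production, cache compiled templates so each render no longer
    // re-reads the template file from disk. Express enables this itself
    // when NODE_ENV=production, but it doesn't hurt to be explicit.
    if (app.get('env') === 'production') {
      app.enable('view cache');
    }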

Finally, even though more optimizations could be made, we had hit our initial goals. We confidently started rolling out SAM Publishing to our customers and testing in live environments, keeping a very close eye on our stats. Success! We achieved smooth scalability, comfortably handling our users’ traffic.

Last week we fully launched SAM Publishing to all users, and we’re thrilled to see all the stories being published and powered by SAM.

It was fun to document our journey, and hopefully this article will help some developer or sysadmin who is staring at their load test results in frustration. Other than the Linux sysadmin skills acquired, the biggest thing we’ve learned from this experience is the importance of solving the right problem. In achieving scale there are many variables, and it’s crucial to get to the initial problem and not “the problem from the problem”. Dean’s insight and experience, paired with New Relic’s ability to break down requests, was a valuable combination for SAM. If anyone would like more clarification on any of these performance improvement steps, comment below or email DEV at SAMDESK dot IO and we’ll gladly elaborate.