This post is the first in a two-part series about migrating to DynamoDB by Runscope Engineer Garrett Heel (see Part 2). You can also catch Principal Infrastructure Engineer Ryan Park at the AWS Pop-up Loft on January 26 to learn more about our migration. Note: This event has passed.
At Runscope, we have a small but mighty DevOps team of three, so we’re constantly looking for better ways to manage and support our ever-growing infrastructure requirements. We rely on several AWS products to achieve this, and we recently finished a large migration over to DynamoDB. During this process we made a few missteps and learned a bunch of useful lessons that we hope will help you and others in a similar position.
Outgrowing Our Old Database
Our customers use Runscope to run a wide variety of API tests: on local dev environments, private APIs, public APIs and third-party APIs from all over the world. Every time an API test is run, we store the results of those tests in a database. Customers can then review the logs and debug API problems or share results with other team members or stakeholders.
When we first launched API tests at Runscope two years ago, we stored the results of these tests in a PostgreSQL database that we managed on EC2. It didn’t take long for scaling issues to arise as usage grew rapidly, with many tests being run on a by-the-minute schedule and generating millions of test runs. We considered a few alternatives, such as HBase, but ended up choosing DynamoDB since it was a good fit for the workload and we already had some operational experience with it.
The initial migration to DynamoDB involved a few tables, but we’ll focus on one in particular which holds test results. Take, for instance, a “Login & Checkout” test which makes a few HTTP calls and verifies the response content and status code of each. Every time a run of this test is triggered, we store data about the overall result - the status, timestamp, pass/fail, etc.
For this table, test_id and result_id were chosen as the partition key and range key respectively. From the DynamoDB documentation:
To achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the partition key values.
We realized that our partition key wasn’t perfect for maximizing throughput but it gave us some indexing for free. We also had a somewhat idealistic view of DynamoDB being some magical technology that could “scale infinitely”. Besides, we weren’t having any issues initially, so no big deal right?
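To make the original schema concrete, here is an illustrative sketch of the table definition. The key and attribute names (`test_id`, `result_id`) come from the post; the table name, attribute types, and capacity numbers are assumptions for illustration, and the request shape follows boto3's `create_table` API:

```python
def build_create_table_request(table_name="test_results"):
    """Build a DynamoDB CreateTable request for the original schema:
    test_id as the partition (HASH) key, result_id as the range key.
    All values besides the key names are illustrative assumptions."""
    return {
        "TableName": table_name,
        "KeySchema": [
            {"AttributeName": "test_id", "KeyType": "HASH"},    # partition key
            {"AttributeName": "result_id", "KeyType": "RANGE"},  # sort key
        ],
        "AttributeDefinitions": [
            {"AttributeName": "test_id", "AttributeType": "S"},
            {"AttributeName": "result_id", "AttributeType": "S"},
        ],
        "ProvisionedThroughput": {
            "ReadCapacityUnits": 100,   # assumed
            "WriteCapacityUnits": 300,  # assumed
        },
    }

# With boto3 this would be submitted as:
#   boto3.client("dynamodb").create_table(**build_create_table_request())
```

The important part is the `KeySchema`: every item for the same test shares one `test_id`, so every write for that test hashes to the same partition.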
Challenges with Moving to DynamoDB
Over time, a few not-so-unusual things compounded to cause us grief.
1. Product changes
First, some quick background: a Runscope API test can be scheduled to run up to once per minute, and we do a small fixed number of writes for each. Additionally, these can be configured to run from up to 12 locations simultaneously. So the number of writes generated by each run, within a small timeframe, is:
<number_of_locations> × <fixed>
Shortly after our migration to DynamoDB, we released a new feature named Test Environments. This made it much easier to run a test with different/reusable sets of configuration (i.e local/test/production). This had a great response in that customers were condensing their tests and running more now that they were easier to configure.
Unfortunately, this also had the impact of further amplifying the writes going to a single partition key, since there are fewer tests (on average) being run more often. Our equation grew to:
<number_of_locations> × <number_of_environments> × <fixed>
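Plugging in some numbers makes the amplification obvious. A quick sketch (the per-run fixed write count isn't given in the post, so the value below is a placeholder):

```python
def writes_per_test_run(num_locations, num_environments=1, fixed_writes=3):
    """Writes generated by a single scheduled run of one test.
    fixed_writes is a placeholder; the real per-run count isn't
    stated in the post. Under the old schema, ALL of these writes
    land on the same partition key (the test_id)."""
    return num_locations * num_environments * fixed_writes

# Before Test Environments: one environment per run, 12 locations.
before = writes_per_test_run(num_locations=12)

# After: the same test run against, say, 3 environments.
after = writes_per_test_run(num_locations=12, num_environments=3)
```

A test scheduled every minute multiplies this again, and every one of those writes targets a single partition key.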
Today we have about 400GB of data in this table (excluding indexes), which continues to grow rapidly. We’re also up over 400% on test runs since the original migration. Due to the table size alone, we estimate having grown from around 16 to 64 partitions (note that determining this is not an exact science).
So let’s recap:
- Each write for a test run is guaranteed to go to the same partition, due to our partition key
- The number of partitions has increased significantly
- Some tests are run far more frequently than others
Discovering a Solution
After examining the throttled requests by sending them to Runscope with a Boto plugin we built, the issue became clear. We were writing to some partitions far more frequently than others due to our schema design, causing a severely imbalanced distribution of writes. This is commonly referred to as the "hot partition" problem, and it resulted in us getting throttled. A lot.
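Our Boto plugin isn't public, but the core of detecting throttling is just inspecting the error code on a failed request. A minimal sketch, assuming botocore-style error responses (the function name and the `record_throttle` hook are ours):

```python
# DynamoDB signals an over-capacity table or partition with this code.
THROTTLE_CODE = "ProvisionedThroughputExceededException"

def is_throttled(error_response):
    """Return True if a botocore-style error response dict indicates
    DynamoDB throttling."""
    code = error_response.get("Error", {}).get("Code", "")
    return code == THROTTLE_CODE

# In practice you would wrap calls roughly like this (boto3 sketch):
# try:
#     table.put_item(Item=item)
# except botocore.exceptions.ClientError as e:
#     if is_throttled(e.response):
#         record_throttle(table_name, item["test_id"])  # hypothetical hook
#     else:
#         raise
```

Logging the partition key alongside each throttle is what let us see that a handful of keys were absorbing most of the writes.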
One might say, “That’s easily fixed, just increase the write throughput!” The fact that we can do this quickly is one of the big advantages of using DynamoDB, and it’s something that we did use liberally to get us out of a jam.
The thing to keep in mind here is that any additional throughput is evenly distributed across every partition. We were steadily doing 300 writes/second but needed to provision for 2,000 in order to give a few hot partitions just 25 extra writes/second - and we still saw throttling. This is not a long-term solution and quickly becomes very expensive.
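The arithmetic here is worth spelling out: because provisioned throughput is split evenly, the share any single partition gets is simply the table-wide figure divided by the partition count. A sketch using the numbers above (assuming a perfectly even split, which is a simplification):

```python
def per_partition_writes(provisioned_wcu, num_partitions):
    """Write capacity available to any single partition, assuming an
    even split of provisioned throughput across partitions."""
    return provisioned_wcu / num_partitions

# Provisioning 2,000 writes/second table-wide across ~64 partitions
# leaves each partition only a small slice of that capacity.
share = per_partition_writes(2000, 64)
```

A hot partition gets only a few dozen writes/second no matter how busy it is, which is why even a large table-wide increase buys so little headroom for a single hot key.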
It didn’t take us long to figure out that using the result_id as the partition key was the correct long-term solution. This would afford us truly distributed writes to the table at the expense of a little extra index work.
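To see why result_id distributes writes so much better, recall that DynamoDB hashes the partition key to pick a partition. A quick simulation makes the contrast visible; the bucket count, key formats, and use of md5 are all illustrative stand-ins (DynamoDB's real hash function is internal):

```python
import hashlib
from collections import Counter

def bucket_for(key, num_buckets=8):
    """Mimic partition selection: hash the partition key and map it
    to one of num_buckets partitions. md5 is illustrative only."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Old schema: one busy test writes 10,000 results under a single
# test_id, so every write hashes to the same partition.
old = Counter(bucket_for("test-123") for _ in range(10_000))

# New schema: each result carries a unique result_id, so writes
# spread across all partitions.
new = Counter(bucket_for(f"result-{i}") for i in range(10_000))

print(sorted(old.values()))  # a single bucket takes every write
print(sorted(new.values()))  # writes land roughly evenly
```

The trade-off, as noted above, is that looking up all results for a given test now requires extra index work rather than a single-partition query.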
Learn More In Person & On the Blog
If you’re interested in learning more about our migration to DynamoDB and hearing from our DevOps team in person, Runscope Principal Infrastructure Engineer Ryan Park will be at the AWS Pop-up Loft in San Francisco on Tuesday, January 26 presenting Behind the Scenes with Runscope—Moving to DynamoDB: Practical Lessons Learned. Make sure to reserve your spot! [Note: This event has passed.]
In Part 2 of our journey migrating to DynamoDB, we’ll talk about how we actually changed the partition key (hint: it involves another migration) and our experiences with, and the limitations of, Global Secondary Indexes. If you have any questions about what you've read so far, feel free to ask in the comments section below and we're happy to answer them.