
3 Easy Steps to Cloud Operational Excellence

By Bruce Wang

This is a guest post by Bruce Wang and Glen Semino from SYNQ, a video API built for developers. In this post, they explain the tools and processes they use to keep the company's API operations running smoothly, and share a real-world story of how they found an API bug before launching a new feature.

You can also read this blog post in our Medium publication!

There are a lot of tools out there, and sometimes it’s hard to sift through them all. Here’s a simple guide to combining 3 tools (Runscope, PagerDuty, and StatusPage) into a powerful cloud operational workflow that gives you peace of mind and gives your customers and internal teams clear visibility into your application!

In case you’re not familiar with the tools, here’s a quick rundown:

  • Runscope — highly flexible API testing and monitoring service
  • PagerDuty — incident management system
  • StatusPage — Customer-facing API health status page

The workflow

It’s important to implement tools for specific purposes, and we wanted to integrate these 3 tools to help manage our operational process better. In the following examples, we’re going to show you how we added a new feature to our product (Live Streaming APIs), and added operational visibility to it. First let’s walk you through our workflow:

  • A Runscope test monitors the service on a schedule and, depending on how it has been configured, tests from different geographic locations
  • If a Runscope test fails, PagerDuty creates an incident, alerts our Slack channel, and alerts the appropriate engineers
  • PagerDuty also updates the service status on StatusPage to alert our customers the service is having problems
  • Once the problem is resolved and the Runscope test that was failing starts to pass, the incident on PagerDuty will resolve itself and the service on StatusPage will revert back to operational status automatically
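
The workflow above behaves like a small state machine. The Python sketch below only illustrates that logic; it is not how Runscope, PagerDuty, or StatusPage are implemented, and the `Monitor` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Hypothetical model of the Runscope -> PagerDuty -> StatusPage flow."""

    incident_open: bool = False
    component_status: str = "operational"
    notifications: list = field(default_factory=list)

    def record_test_result(self, passed: bool) -> None:
        if not passed and not self.incident_open:
            # First failure: open an incident, alert people, degrade the status.
            self.incident_open = True
            self.component_status = "degraded"
            self.notifications += ["slack", "on-call engineer"]
        elif passed and self.incident_open:
            # Test passes again: resolve the incident and restore the status.
            self.incident_open = False
            self.component_status = "operational"


monitor = Monitor()
for result in [True, False, False, True]:
    monitor.record_test_result(result)
    print(result, monitor.incident_open, monitor.component_status)
```

Repeated failures do not open a second incident in this sketch, and the status only returns to operational once a later run passes.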

Step 1: Create a Runscope test for the Live Streaming API

Runscope provides an easy way to make POST requests on an API and then make assertions on the response.

Here is what our POST request to our live streaming service looks like in Runscope:

Detail view of a Runscope API request test step, showing a POST request to {{url}}/v1/video/stream, and two query parameters of api_key and video_id set to their respective variables

Note: {{xxx}} is a variable that can be set from previous tests or configured via “environment”-specific settings. You may hard code values in the beginning, but variables are invaluable for creating richer tests across your various service environments.
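
To make the idea of per-environment variables concrete, here is a minimal Python sketch of `{{name}}` substitution. It imitates the behaviour described in the note and is not Runscope's implementation; the `render` function, the `staging` dictionary, and the example.com URL are made up for illustration.

```python
import re


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value from `variables`."""

    def lookup(match):
        name = match.group(1)
        # Leave unknown placeholders untouched so a missing variable is easy to spot.
        return str(variables.get(name, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lookup, template)


# Hypothetical environment settings; a production environment would use other values.
staging = {"url": "https://staging.example.com", "video_id": "abc123"}
print(render("{{url}}/v1/video/stream?video_id={{video_id}}", staging))
```

Swapping in a different dictionary lets the same request definition run against another environment.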

When our live stream API is called, the JSON response we expect should include a playback and stream url, so we just need to add some simple assertions in Runscope:

Assertions tab view of a Runscope request step. It has three assertions, one for the status code equals to 200, and that the JSON body property of playback_url and stream_url are not empty

We check that the HTTP response is 200 and then we check that playback_url and stream_url are not empty. We also save the values that are in playback_url and stream_url:

Variables tab view of a Runscope request step, saving the JSON body contents of the properties stream_url and playback_url to variables for subsequent use

The reason for saving the values is that we will then call our video details API and assert that the values stream_url and playback_url are present:

Runscope request step detail view of a POST request to {{url}}/v1/video/details, with the query parameters of api_key and video_id

We then make the assertion on the details API that the playback_url and stream_url are the values we expect:

Assertions tab view of a Runscope request step showing three assertions: status code equals 200, and the JSON Body properties of stream_info.playback_url and stream_info.stream_url are equal to the saved values from the previous request of playback_url and stream_url
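
For readers who prefer code to screenshots, the two request steps can be sketched in Python roughly as follows. This is an approximation of the checks described above, not an export of our Runscope test. The `check_live_stream` function and the injected `post` transport are hypothetical, and `fake_post` returns canned data rather than calling the real API.

```python
from typing import Callable

# A transport takes (path, params) and returns (status_code, parsed_json_body).
Transport = Callable[[str, dict], tuple]


def check_live_stream(post: Transport, api_key: str, video_id: str) -> None:
    params = {"api_key": api_key, "video_id": video_id}

    # Step 1: create the stream and assert on the response.
    status, body = post("/v1/video/stream", params)
    assert status == 200, f"stream returned {status}"
    assert body.get("playback_url"), "playback_url is empty"
    assert body.get("stream_url"), "stream_url is empty"

    # Save the values for the next step, like Runscope variables.
    playback_url, stream_url = body["playback_url"], body["stream_url"]

    # Step 2: the details endpoint should report the same URLs.
    status, details = post("/v1/video/details", params)
    assert status == 200, f"details returned {status}"
    info = details.get("stream_info", {})
    assert info.get("playback_url") == playback_url, "playback_url mismatch"
    assert info.get("stream_url") == stream_url, "stream_url mismatch"


def fake_post(path, params):
    """Stand-in transport with canned responses; not the real SYNQ API."""
    urls = {
        "playback_url": "https://example.com/play",
        "stream_url": "rtmp://example.com/live",
    }
    if path == "/v1/video/stream":
        return 200, urls
    return 200, {"stream_info": urls}


check_live_stream(fake_post, api_key="test-key", video_id="test-video")
print("live stream check passed")
```

Passing the transport in as a parameter keeps the sketch runnable without network access; a real version would replace `fake_post` with an HTTP client call.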

After we built this test, we put it on a schedule using the ‘Schedules’ menu in Runscope and we were ready to add a PagerDuty alert so that we could be notified if the test for the live streaming API fails.

Step 2: Setting up PagerDuty with Runscope

Luckily, Runscope and PagerDuty have a pre-built integration. So all we had to do was go to PagerDuty and create a new service under the ‘Configuration’ menu. When adding the service for ‘Integration Type’ we specified ‘Runscope’:

PagerDuty's Services > Add Service screen showing the integration settings for Runscope

Then we configured the ‘Incident Settings’ and ‘Incident Behavior’ and simply clicked ‘Add Service’. Once the service was added, we were able to see it under our ‘Services’ in PagerDuty:

PagerDuty's Services tab showing a Runscope integration row setup

To then connect our live stream test in Runscope to PagerDuty, we went into Runscope under ‘Connected Services’ and clicked the button that said ‘Connect PagerDuty’:

Runscope's Integration page showing the PagerDuty integration highlighted

Then Runscope asked us to authorize our PagerDuty account, so we entered our PagerDuty credentials and clicked ‘Authorize Integration’. Finally, we chose the PagerDuty service that we wanted to integrate with Runscope and clicked ‘Finish Integration’:

Runscope and PagerDuty integration setup page, showing the selected service from the PagerDuty account

Once we did that, inside of ‘Connected Services’ in Runscope we could see our PagerDuty integration:

The Connected Services tab inside the user's Runscope account displaying a list of existing services, including Ghost Inspector and the PagerDuty integration setup in the previous steps

As you can see from the screenshot, our PagerDuty service called ‘SYNQ Live Stream Check’ is now integrated into Runscope. The last step was connecting the PagerDuty service to our Runscope test for the live streaming service. To do that, we went to the live stream Runscope test, opened the ‘Editor’, and modified the integrations for the environment we were using. Then we just flipped the integration to ‘ON’:

Runscope's Environment Settings for an API test of SYNQ's Live Stream feature, showing the Integrations tab with the newly created PagerDuty integration set to "on"

Note: Again, this notification is configured per environment; as you can see, this environment is “Production”.

We now had the live stream test from Runscope connected to PagerDuty, so we would get alerted by text message or phone call if the Runscope test failed. In addition, we connected PagerDuty to our Slack channel following this guide, so that if a PagerDuty incident is triggered by Runscope, we get alerted on our Slack channel. The last piece left was to connect PagerDuty to StatusPage, so that our clients could be alerted if the live streaming service fails.

Step 3: Adding the Live Streaming Service to StatusPage

Now that we have a way to monitor our live streaming service and alert on failures, we need to expose its status to our clients. We do this with our public-facing StatusPage (having a transparent operational status is very important, and you can read more about that here).

To connect PagerDuty and StatusPage, we followed this PagerDuty guide. Once we had both of the accounts connected, the rest of the setup occurred on StatusPage. Inside of our StatusPage configuration, we now had a section for PagerDuty. Inside that section, to connect a component to a PagerDuty service, we needed to add a rule:

StatusPage's Inactive Services tab, showing a list of connected PagerDuty options and the "Add Rules" link on the right-hand side

Under ‘SYNQ Live Stream Check’, we clicked ‘Add Rules’, which brought us to another page where we were able to connect the ‘Live Stream’ component on our StatusPage to the PagerDuty ‘SYNQ Live Stream Check’ service:

The "Add Rules" detail page of the PagerDuty integration, showing the settings connecting the Live Stream feature on PagerDuty to the setting that StatusPage should show in case of an incident, and also template rules to be set

We clicked on ‘Save Rules’ and we were done. On StatusPage under ‘PagerDuty Setup’ and ‘Active Services’ we could now see our ‘SYNQ Live Stream Check’ present:

StatusPage's Active Services tab showing the newly created PagerDuty integration on the list

Now our public-facing StatusPage shows our ‘Live Stream’ status!

SYNQ's StatusPage public page displaying all systems as Operational

If our live stream service test fails on Runscope, the ‘Live Stream’ component on our status page goes from ‘Operational’ to ‘Degraded’.

Mayday! Mayday! We have a Problem

Although our live stream service was still in alpha, we had seen no issues and our Runscope tests for the service were all green. Then one day, we got a text message from PagerDuty alerting us that our Runscope test for the live stream service was failing. At the same time, we were alerted on Slack, and our ‘Live Stream’ component on StatusPage went from ‘Operational’ to ‘Degraded’.

Next, we immediately went into our live stream Runscope test and noticed that we were not getting the expected HTTP response code from our live stream API. We knew at this point that our live stream service was having an actual failure. We then checked the logs for our streaming servers in Amazon CloudWatch and noticed that they were not taking any requests to create new streams. We eventually traced this to a backend service we depended on that had run out of resources.

There were two issues we discovered. One, we were not deleting old and unused streams, which resulted in excessive streams and running out of resources. The second issue was that our Runscope tests were running too often, thus exacerbating the issue by creating 288 unused streams a day. We learned that in some cases running a Runscope test too often is not ideal and that building a test and monitoring model around new features can help you find bugs in your platform.
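
As a back-of-the-envelope check, 288 streams a day is what a schedule of one run every five minutes would produce, assuming each run creates exactly one stream. The post does not state the actual schedule, so the five-minute interval is an inference. The `runs_per_day` helper below is purely illustrative:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def runs_per_day(interval_minutes: int) -> int:
    """Number of scheduled runs in 24 hours for a given interval in minutes."""
    return SECONDS_PER_DAY // (interval_minutes * 60)


# Assumes one new stream per test run.
for minutes in (1, 5, 15, 60):
    print(f"every {minutes:>2} min -> {runs_per_day(minutes)} streams/day")
```

Under the same assumption, stretching the interval to 15 minutes would cut the number of leftover streams to 96 a day, and hourly runs would leave 24.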

Conclusion

Thanks for sticking with us for the whole article. Hopefully you got a lot of value from it. Feel free to ask us any questions about our process or the individual services we use in the comments below. Happy Service Building!

