Trustpilot is a review platform where people can read, write, and share reviews for all kinds of businesses. They have also been Runscope users since 2015, and we have featured them in other blog posts, such as how they monitor over 600 microservices.
One of the most important parts of API monitoring is getting notified when something breaks. Trustpilot uses Slack as their main communication hub, so they rely on notifications sent from Runscope to Slack whenever a test fails.
The default Runscope-Slack integration was enough for the Trustpilot team for a while. They used the threshold feature to receive a notification only after a test failed 3 times in a row, and again when the test returned to passing, which kept the overall number of notifications under control.
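To make the threshold behavior concrete, here is a minimal sketch of the logic: stay quiet until a test has failed 3 times in a row, then notify once, and notify again when it recovers. This is an illustration only, not Runscope's implementation; the `ThresholdNotifier` class and the event names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdNotifier:
    """Decides when to notify, given a stream of pass/fail results."""

    threshold: int = 3             # consecutive failures needed before alerting
    consecutive_failures: int = 0
    alerted: bool = False          # True once a failure alert has been sent

    def record(self, passed: bool) -> Optional[str]:
        """Return "failed", "recovered", or None when no alert is due."""
        if passed:
            self.consecutive_failures = 0
            if self.alerted:
                self.alerted = False
                return "recovered"
            return None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerted:
            self.alerted = True
            return "failed"
        return None


notifier = ThresholdNotifier(threshold=3)
results = [True, False, False, True, False, False, False, False, True]
events = [notifier.record(passed) for passed in results]
```

In this run, the two failures that are followed by a pass produce no alert at all. Only the third consecutive failure and the later recovery generate notifications.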
But as their architecture evolved and their Runscope usage grew, the number of notifications grew as well. Combined with alerts from other third-party services, they really started to add up. Getting too many alerts can be just as bad as getting none, and the team started suffering from notification fatigue.
Alerts and Visibility
When Trustpilot started scaling their API monitors to cover more use cases, they started having a visibility problem. Sometimes a system would fail and they would get dozens of notifications. Then they would ask themselves:
What is the source of the problem?
And who is working on each alert?
The team needed an easier way to manage each API failure notification, so that no failure would be missed and left unfixed.
How to Solve Alert Fatigue
The engineering team at Trustpilot decided to look into how to build custom Slack notifications. Slack has extensive API documentation, and it allows users to send messages that include buttons for user interaction.
So the team set out to build a custom Slack notification for when a Runscope test failed. They decided they would need:
The name of the test that failed
The exact step that failed
The environment that the test was running from (staging, production)
A button to assign/unassign who is working on the failure
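As a rough illustration of the requirements listed above, the sketch below builds a Slack message payload carrying those four pieces of information. It is not Trustpilot's code. It assumes Slack's legacy attachment format (which provides the colored bar and attachment buttons), and the function name, parameters, and `callback_id` scheme are all hypothetical. Posting the payload to Slack is left out.

```python
import json


def build_failure_message(test_name, failed_step, environment, run_id):
    """Build a Slack message dict for a failed test run.

    The parameter names and the callback_id format are assumptions
    made for this example.
    """
    return {
        "text": f"API test failed: {test_name}",
        "attachments": [
            {
                "fallback": f"API test failed: {test_name}",
                "color": "danger",  # Slack's named color for a red bar
                "callback_id": f"runscope_failure:{run_id}",
                "fields": [
                    {"title": "Test", "value": test_name, "short": True},
                    {"title": "Environment", "value": environment, "short": True},
                    {"title": "Failed step", "value": failed_step, "short": False},
                ],
                "actions": [
                    # Button a team member clicks to take ownership of the failure
                    {"name": "assign", "text": "Assign to me",
                     "type": "button", "value": "assign"},
                ],
            }
        ],
    }


message = build_failure_message(
    test_name="Create review",
    failed_step="POST /reviews",
    environment="production",
    run_id="example-run-1",
)
payload_json = json.dumps(message, indent=2)
```

Clicking a button like this causes Slack to call back to a URL you configure for your app, so a real setup also needs a small service to receive that interaction and update the message.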
Here's an example of what they have built:
When a test fails, the first notification that shows up in Trustpilot's Slack has a red bar on the left. In the example above, a user has acknowledged the failure so the bar on the left turns yellow. The notification includes the name of the team member who acknowledged the failure, and also has a button for the user to "Unassign" themselves. That way the team can make sure that two people are not doing the same work, or worse, that no one is actually looking into what caused the API failure.
Once the API test starts passing again, they receive a notification with a green bar on the left and a "Resolved" status:
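The three notification states described here (red for a new failure, yellow once someone has acknowledged it, green once the test passes again) can be modeled as a small state machine. The sketch below is one possible way to do it, not Trustpilot's implementation. The state names, event names, and the "Assign" button label are assumptions, and the colors use Slack's named attachment colors.

```python
from dataclasses import dataclass
from typing import Optional

# Slack's named attachment colors: danger is red, warning is yellow, good is green.
STATE_COLORS = {"failing": "danger", "acknowledged": "warning", "resolved": "good"}
STATE_BUTTONS = {"failing": "Assign", "acknowledged": "Unassign"}


@dataclass(frozen=True)
class Alert:
    state: str = "failing"
    assignee: Optional[str] = None

    @property
    def color(self) -> str:
        return STATE_COLORS[self.state]

    @property
    def button(self) -> Optional[str]:
        # A resolved alert has no button.
        return STATE_BUTTONS.get(self.state)


def apply_event(alert: Alert, event: str, user: Optional[str] = None) -> Alert:
    """Return the alert that results from an event.

    Events (hypothetical names): "assign", "unassign", "passed".
    """
    if event == "passed":
        return Alert("resolved", None)
    if alert.state == "failing" and event == "assign":
        return Alert("acknowledged", user)
    if alert.state == "acknowledged" and event == "unassign":
        return Alert("failing", None)
    return alert  # ignore events that do not apply in the current state


new_alert = Alert()
acked = apply_event(new_alert, "assign", user="dana")
unassigned = apply_event(acked, "unassign")
resolved = apply_event(acked, "passed")
```

Keeping the transitions in one function makes it easy to re-render the Slack message from the current state each time a button is clicked or a test result arrives.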
Going Further with Subtests
Trustpilot also decided to leverage another Runscope feature: subtest steps. They standardized their test structure so that every test follows the same template: create data, run tests, cleanup.
Since the create data and cleanup parts are usually the same across multiple workflows, they were broken off into their own separate tests. So whenever a workflow needs to make those same API calls, they can just call that separate workflow as a subtest:
And they also created a custom Slack notification for when a subtest step fails:
An application's architecture is always evolving, and usually increasing in complexity. One of the side effects is a growing number of systems and points of failure, which can also lead to an increase in complexity in your API monitoring workflow.
Trustpilot's team did an amazing job in understanding how alert fatigue was affecting their team, how to customize their Runscope notifications to fit into their process, and how to effectively fix problems in their workflow.
Our built-in Slack integration is enough for most workflows, and can be customized with threshold and retry-on-failure options. But we also know that our audience can have different processes, and we've built our tools with flexibility in mind.
If your Runscope usage is growing and you want to start building something like Trustpilot did, you can do it by leveraging our custom webhooks integration, as well as the Runscope API. Also, if you do build something custom or would like to learn more about how to build a custom Slack notification, please let us know. We'd love to hear from you!