On Wednesday, June 29, 2022, Semgrep App experienced degraded availability for approximately 90 minutes. During this window, CI jobs and scans failed with internal errors for multiple customers. Our engineers were paged as soon as customers began encountering the issue. The team opened an incident response Slack channel and started troubleshooting and recovery procedures.
The team found that the failure was caused by an in-progress infrastructure upgrade, which meant that approximately 37.5% of requests to the app could not be processed. This affected requests to semgrep.dev and requests made from within CI. We have completed an internal postmortem and have taken steps to reduce the impact that this kind of migration could have in the future. These steps include increasing our proactive pre-production monitoring, improving infrastructure playbooks, and decoupling aspects of our architecture to make CI jobs more independent and stable.
To keep internal Semgrep errors from blocking CI scans, we encourage you to configure Semgrep in CI to fail open. With this setting, a Semgrep job still succeeds when it hits an internal error, and it blocks a build only when Blocking findings (as configured in the App) are present. We have also released a new version of the Semgrep Status page, which is intended to help customers track the uptime and availability of the core Semgrep services and view outage incidents and updates. Customers can subscribe to receive these notifications.
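As a rough illustration of what failing open means for a CI step, the sketch below wraps a scan command and decides the job's exit status. It is not Semgrep's own mechanism for this setting. The function names are hypothetical, and the mapping assumes that exit code 1 means blocking findings while any other non-zero code means an internal error, which you should check against the Semgrep version you run.

```python
from __future__ import annotations

import subprocess
import sys

# Assumption: exit code 1 signals blocking findings; every other non-zero
# code is treated here as an internal error. Verify for your Semgrep version.
BLOCKING_FINDINGS_EXIT_CODE = 1


def fail_open_exit_code(returncode: int) -> int:
    """Map the scanner's exit code to the exit code the CI step should use."""
    if returncode == BLOCKING_FINDINGS_EXIT_CODE:
        return 1  # blocking findings still fail the build
    return 0  # success, or an internal error that we let through


def run_fail_open(command: list[str]) -> int:
    """Run the scan command (hypothetical wrapper) and fail open on errors."""
    try:
        completed = subprocess.run(command, check=False)
    except OSError as exc:
        # The command could not be started at all; treat it as internal.
        print(f"scan could not start, failing open: {exc}", file=sys.stderr)
        return 0
    if completed.returncode not in (0, BLOCKING_FINDINGS_EXIT_CODE):
        print(
            f"scan exited with {completed.returncode}, failing open",
            file=sys.stderr,
        )
    return fail_open_exit_code(completed.returncode)
```

In a CI script this could be called as `sys.exit(run_fail_open(["semgrep", "ci"]))`. The warning on stderr matters: a fail-open job that stays silent can hide an outage, so internal errors should remain visible in the job log even when they do not block the build.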
Finally, we want to apologize for the impact this event caused for our customers. We know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.