For several hours on Sunday morning, the Pushpay platform suffered from performance problems. Users may have experienced slow page load times, error pages, or an inability to load pages at all.
On Sunday morning at 7am PDT, an automatic process caused one of our front-end web servers to stop accepting traffic for a few seconds. This confused our load balancers and created a cycle of "flapping": all traffic was routed to one server until it was overloaded, then all traffic was directed to another server until it too was overloaded, and the cycle repeated. For the first 36 minutes of the incident, traffic thresholds were within acceptable limits.
At 7:36am PDT, our automated alerts paged our on-call engineers because some thresholds were outside normal operating parameters. At that point, no symptoms were visible to end users. As engineers worked to understand the cause of these alerts, performance continued to degrade.
At 8:22am PDT, we alerted other members of the engineering team to support the investigation. The pattern of requests and system behavior produced conflicting information, making the root cause difficult to determine. Engineers continued to work to stabilize the system over the next hour by disabling unnecessary background processes and reconfiguring our load balancers.
At around 10:15am PDT, this work was complete and traffic began to be distributed evenly across our web servers, resulting in a rapid improvement in performance for users and a return to normal.
The automatic process that triggered the incident ran during the preliminary phases of transitioning our transaction processing infrastructure to an upgraded platform built on top of AWS. During the transition, the shared environment created an unanticipated scenario that both caused the slowdown and made the underlying source of the problem more difficult to determine.
We’ve been working for the last twelve months to prepare for this transition, which significantly improves scalability, resiliency, and security. We’re pleased to report that the transition is now complete, and the load testing protocol we employed in the new environment before the transition gives us a high degree of confidence that we will not experience a similar slowdown in the future. We believe we will continue to maintain best-in-class uptime in excess of 99.99% and are well positioned to serve you and your community with excellence.
This statuspage has always allowed you to subscribe for email alerts of status updates. In response to customer feedback, we've just enabled text message alerts too. Subscribe for SMS updates at https://pushpay.statuspage.io now for even faster notifications.