Slow response times
Incident Report for Pushpay
Postmortem

Incident Summary

For several hours on Sunday morning the Pushpay platform suffered from performance problems. Users would have experienced slow page load times, error pages, or the inability to load pages at all.

What Happened

On Sunday morning at 7am PDT, an automatic process caused one of our front-end web servers to stop accepting traffic for a few seconds. This confused our load balancers and created a cycle of "flapping": all traffic was routed to one server until it was overloaded, then all traffic was directed to another server until it too was overloaded, and the process repeated from there. For the first 36 minutes of the incident, traffic thresholds remained within acceptable limits.
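
For readers unfamiliar with flapping, the cycle can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of our actual load balancer configuration: the server capacities, the load figure, and the function names are all hypothetical.

```python
def simulate_flapping(capacities, total_load, ticks):
    """Return the index of the server that received ALL traffic on each tick.

    Toy model: the balancer sends every request to a single target. If the
    load exceeds that server's capacity, the server is treated as overloaded
    and all traffic shifts to the next server in the list.
    """
    target = 0
    targets = []
    for _ in range(ticks):
        targets.append(target)
        if total_load > capacities[target]:
            # Overloaded: move the whole load to the next server.
            target = (target + 1) % len(capacities)
    return targets


def even_share(capacities, total_load):
    """Per-server load when the same traffic is spread evenly."""
    return total_load / len(capacities)


if __name__ == "__main__":
    capacities = [100, 100, 100]  # hypothetical requests per tick
    total_load = 240              # hypothetical total requests per tick

    # Each server is overloaded in turn, so the target keeps rotating.
    print(simulate_flapping(capacities, total_load, 6))  # [0, 1, 2, 0, 1, 2]

    # Spread evenly, the same load fits within every server's capacity.
    print(even_share(capacities, total_load))  # 80.0
```

In this model the total load is well within the combined capacity of the three servers, yet every server is overloaded in turn because it receives all of the traffic at once.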

Response Timeline

At 7:36am PDT our automated alerts paged our on-call engineers because some thresholds were outside normal operating parameters. At that point, no symptoms were visible to end users. As engineers attempted to understand the cause of these alerts, performance continued to degrade.

At 8:22am PDT we alerted other members of the engineering team to support the investigation. The pattern of requests and behavior produced conflicting information, making the root cause difficult to determine. Engineers continued to work to stabilize the system over the next hour by disabling unnecessary background processes and reconfiguring our load balancers.

At around 10:15am PDT this work was complete and traffic began to be evenly distributed across our web servers, resulting in a rapid improvement in performance for users and a return to normal.

What We’re Doing About It

The automatic process that triggered the incident occurred in the preliminary phases of transitioning our transaction processing infrastructure to an upgraded platform built on top of AWS. In the midst of the transition, the shared environment created an unanticipated scenario that both caused the slowdown and made the underlying source of the problem more difficult to determine.

We’ve been working for the last twelve months to prepare for this transition, which significantly improves scalability, resiliency, and security. We’re pleased to report that the transition is now complete, and the load-testing protocol we employed in the new environment before the transition gives us a high degree of confidence that we will not experience a similar slowdown in the future. We believe we will continue to maintain best-in-class uptime in excess of 99.99% and are well positioned to serve you and your community with excellence.

Subscribe to status notifications

This status page has always allowed you to subscribe to email alerts for status updates. In response to customer feedback, we've just enabled text message alerts too. Subscribe to SMS updates at https://pushpay.statuspage.io now for even faster notifications.

Posted Aug 23, 2017 - 11:34 PDT

Resolved
Payments are processing as expected. We will continue to monitor the system.
Posted Aug 20, 2017 - 11:50 PDT
Monitoring
Payments are now working, but we are still experiencing slower than usual load times. Our engineers continue to work towards a full resolution.
Posted Aug 20, 2017 - 10:48 PDT
Investigating
We are currently investigating issues with slow page loads and interruptions to all payment experiences. Engineers are investigating the problem and hope to have it resolved shortly.
Posted Aug 20, 2017 - 08:58 PDT