Last month, Mandrill suffered a significant outage due to a database failure, which impaired our ability to send emails. The outage began at 05:35 UTC on February 4, and we resolved the issue on February 5 at 22:09 UTC. During the outage, we were sending only 80% of queued emails.
Once the outage was resolved, we committed to conducting a full review of the incident and our response, and to sharing what we learned with our customers. We held a series of formal debriefs over the course of a week, gathering input from dozens of individuals involved in our response. We’re ready to share what we learned.
Background: Postgres Transaction IDs
In order to understand the underlying issue, we need to explain briefly how transaction concurrency works inside Postgres. At a high level, every write (UPDATE, INSERT or DELETE) is assigned a monotonically incrementing transaction ID (XID). This value is stored on each row as XMIN, and subsequently used to determine which rows are visible during the execution of a given transaction, where visible rows satisfy XMIN < XID.
In practice it’s more complicated, but the important detail is that the XID is a globally incrementing counter critical to the operation of the database. XIDs are stored as 32-bit values and compared using modulo 2^32 arithmetic, so you can think of them as a circular space of numbers: for any given XID, around 2 billion XIDs count as greater (newer) and around 2 billion as less (older). Because the current XID is constantly incrementing, you need to do something with older XIDs. If 2 billion new transactions occur, an old XID will wrap around: instead of representing a transaction in the past, it suddenly appears to be in the future. The feature that Postgres uses to combat this issue is a background process called autovacuum, which runs periodically and clears out old XIDs, protecting against wraparound. Tuning this is important, as there can be significant performance impacts while the vacuum is running.
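The circular comparison can be sketched in a few lines of Python. This is an illustration only, not Postgres's actual source: the function name `xid_precedes` is made up for this example, and the sketch ignores the few reserved special XIDs that Postgres handles separately.

```python
XID_SPACE = 2 ** 32   # XIDs are 32-bit values
HALF_SPACE = 2 ** 31  # about 2 billion XIDs on either side of any XID


def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID `a` counts as older than XID `b` on the circle.

    Taking (a - b) modulo 2^32 and checking whether it lands in the upper
    half is equivalent to checking that the signed 32-bit difference is
    negative.
    """
    return (a - b) % XID_SPACE >= HALF_SPACE


# Ordinary case: 100 was assigned before 200.
print(xid_precedes(100, 200))

# The counter has wrapped past 2^32, but a recent XID still compares as older.
print(xid_precedes(2 ** 32 - 5, 10))

# More than 2^31 transactions later, XID 100 no longer compares as older.
print(xid_precedes(100, 100 + 2 ** 31 + 1))
```

The last case is the failure mode described above: once more than roughly 2 billion transactions have passed, a row stamped with an old XID stops comparing as "in the past".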
For more on this Postgres failure mode, see these posts from Sentry and Joyent about their own XID wraparound outages. The PostgreSQL documentation also does a great job explaining the underlying mechanisms.
A Potential Problem
In November of 2018, engineers on our Mandrill team identified the potential to reach wraparound, as the XIDs were climbing to approximately half their total limit during peak load. Our team determined wraparound was not an immediate threat, but we added a ticket to the backlog to set up additional monitoring.
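Monitoring of this kind usually comes down to tracking the age of each database's oldest unfrozen XID. The sketch below pairs a standard catalog query (`age(datfrozenxid)` from `pg_database`) with a small classifier. The threshold fractions, the function names, and the database names in the usage lines are assumptions for illustration; they are not the monitoring Mandrill put in place.

```python
# Standard PostgreSQL catalog query: age(datfrozenxid) reports how many XIDs
# have been consumed since the oldest unfrozen XID in each database.
XID_AGE_QUERY = """
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
"""

WRAPAROUND_LIMIT = 2 ** 31  # roughly 2 billion XIDs before wraparound


def classify_xid_age(xid_age, warn_fraction=0.5, critical_fraction=0.75):
    """Map an XID age to 'ok', 'warning', or 'critical'.

    The default fractions are hypothetical; pick values that leave enough
    time for a vacuum to finish on your largest tables.
    """
    fraction = xid_age / WRAPAROUND_LIMIT
    if fraction >= critical_fraction:
        return "critical"
    if fraction >= warn_fraction:
        return "warning"
    return "ok"


def check_rows(rows):
    """Classify (datname, xid_age) rows as returned by XID_AGE_QUERY."""
    return {datname: classify_xid_age(xid_age) for datname, xid_age in rows}


# Hypothetical rows; in practice these would come from running XID_AGE_QUERY.
print(check_rows([("shard1", 200_000_000), ("shard4", 1_100_000_000)]))
```

An age near half the limit, like the one observed during peak load, would land in the "warning" band with these defaults.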
Wraparound Protection Triggered
Mandrill has a set of distinct databases which act as a unified K/V store sharded by key. The hashing algorithm used to balance load across these databases favored shard4, making it hotter than the others, with a higher-than-normal write load.
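This post does not describe the hashing scheme itself, so the following is only a generic sketch of how key-to-shard skew can be measured. The MD5-modulo `shard_for` function, the shard count, and the load figures are all hypothetical stand-ins, not Mandrill's implementation.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Hypothetical shard function: hash the key, then take it modulo N."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def load_by_shard(keys, num_shards):
    """Count how many keys land on each shard."""
    counts = Counter(shard_for(key, num_shards) for key in keys)
    return [counts.get(i, 0) for i in range(num_shards)]


def hottest_shard(loads):
    """Return (shard index, load relative to the mean) for the busiest shard."""
    mean = sum(loads) / len(loads)
    if mean == 0:
        return 0, 0.0
    index = max(range(len(loads)), key=loads.__getitem__)
    return index, loads[index] / mean


# Made-up write counts per shard, with the last shard running hot.
index, ratio = hottest_shard([100, 100, 100, 100, 300])
print(f"shard index {index} carries {ratio:.2f}x the mean load")
```

Tracking a ratio like this per shard is one way to notice that a single database is absorbing a disproportionate share of writes, and therefore consuming XIDs faster than its peers.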
The autovacuum process on shard4 likely fell behind due to the higher load, though it may have failed outright—we are unable to determine for certain which occurred. Either way, at 05:35 UTC on February 4, the XID hit the upper limit and the database went into safety shutdown mode.
The first exception was logged overnight on Sunday. Unrelated exceptions coming in from other systems drowned it out in our centralized reporting, which slowed our detection time. Mandrill engineers opened an investigation at 11:58 UTC. By then, attempted retries of failed writes had caused a significantly elevated load on shard4, and responders began looking for the source of the spiking load. An hour later, at 12:56 UTC, the wraparound was discovered.
When Postgres shuts down to protect against wraparound, it needs to be put into standalone mode and cleaned using the VACUUM command. At 14:16 UTC we started the vacuum.
Worst Case Scenario
It quickly became apparent that the full vacuum would take many days to complete. We stopped and restarted it with configuration tweaks to speed it up significantly. Even so, we determined the runtime would extend to days or even weeks. Our highest estimate was 40 days—far too long for our customers to have reduced ability to send emails.
We began to dump and restore the database while the vacuum was in progress. The dump and restore operation looked likely to complete at a faster rate but was still disastrously slow. To speed up the dump and restore option, we omitted the Search and Url tables, which were on the order of terabytes, compared to most other tables, which were in the megabyte range.
A secondary effect of shard4 no longer accepting writes was that jobs which should, as part of their execution, write rows to shard4 would instead fail. These jobs then queued on disk on the Mandrill app servers. The increased volume of jobs being written to disk, as well as logs detailing the failures of those jobs, caused disk space to run low on every Mandrill app server. We spun up a separate track of work to replace storage volumes across app servers in Mandrill.
Shard4 also held some of the operational data for Mandrill. Specifically, it held the tables we use to keep track of a few of the job queues. We moved these to other live shards to allow normal processing of jobs. This did not end up helping much, as the jobs which were going to be queued on shard4 generally had to write some other data to shard4.
At 16:30 UTC on February 5, we convened a meeting to assess more radical recovery options. By this point, we had determined it would be possible to run the app without the Search and Url tables. We then discussed whether it would be possible to drop those tables on the locked database. After verifying that this would free the associated XIDs, we decided to truncate the tables. We executed the operation at 19:12 UTC. It worked! The vacuum completed roughly an hour after the truncation. Sending from the backlog began at 21:36 UTC. By 22:40 UTC, we had new alerting in place for XID wraparound, and by 01:00 UTC on February 6 the queues were down and we were back to normal.
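The intervals between these milestones are easy to lose track of, so here is a small sketch that computes them from the timestamps given in this post. The year 2019 is inferred from the refund dates mentioned elsewhere in the post. The 48-hour figure is the default deletion window for queued mail, and applying it from the moment of the shutdown is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone


def utc(day, hour, minute):
    """Build a UTC timestamp in February 2019 (year inferred, see above)."""
    return datetime(2019, 2, day, hour, minute, tzinfo=timezone.utc)


def fmt(delta):
    """Format a timedelta as hours and minutes."""
    hours, minutes = divmod(int(delta.total_seconds()) // 60, 60)
    return f"{hours}h{minutes:02d}m"


shutdown = utc(4, 5, 35)          # shard4 entered safety shutdown
investigation = utc(4, 11, 58)    # engineers opened an investigation
diagnosed = utc(4, 12, 56)        # wraparound discovered
truncated = utc(5, 19, 12)        # Search and Url tables truncated
sending_resumed = utc(5, 21, 36)  # backlog began sending
resolved = utc(5, 22, 9)          # resolution announced

time_to_detect = investigation - shutdown
time_to_diagnose = diagnosed - investigation
truncate_to_sending = sending_resumed - truncated
total_outage = resolved - shutdown

# Assumption: mail queued right at the shutdown would hit the 48-hour limit here.
oldest_mail_expiry = shutdown + timedelta(hours=48)
margin_at_resolution = oldest_mail_expiry - resolved

print("time to detect:     ", fmt(time_to_detect))       # 6h23m
print("time to diagnose:   ", fmt(time_to_diagnose))     # 0h58m
print("truncate to sending:", fmt(truncate_to_sending))  # 2h24m
print("total outage:       ", fmt(total_outage))         # 40h34m
print("margin before 48h:  ", fmt(margin_at_resolution)) # 7h26m
```

Under these assumptions, detection took far longer than diagnosis, and the outage was resolved with only a few hours to spare before the oldest queued mail would have aged out.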
We announced on the Mandrill Twitter account, @Mandrillapp, that we had resolved the outage at 22:09 UTC on February 5. Many users replied to ask whether their queued mail would be sent automatically. We replied to individual users, letting them know that most queued mail was now sending as expected. We also tweeted that some users would see residual effects from the outage, but most customers would start to see features working as expected. We directed users with more detailed questions or questions specific to their account to our Support team.
Finally, we committed to following up with users about refunds and committed to sharing this blog post once our post-event review was completed.
We know that our customers rely on our service to send transactional emails when they need them. We expressed our sincere apologies to our customers during and immediately following the outage, and we refunded affected users for Mandrill purchases made between January 1, 2019 and February 13, 2019. We also credited affected users Mandrill blocks to cover future purchases in February. We’d like to take the opportunity to express again our regret and our apologies for the inconvenience this caused our customers, and by extension, their customers.
We’ve been working to update and strengthen our Incident Response protocol over the past year. This was our largest incident in that period, providing a good opportunity for us to reflect and examine how the systems we put in place have helped.
We’ve been training a bench of strong incident commanders (ICs), whose primary function is orchestrating communications and centralizing information during incidents. For an incident with such far-reaching impact to our customers, we tend to want to throw the full weight of our teams toward finding a fix. This can lead to an unorganized response that ultimately keeps us from efficiently working together toward a resolution. Having a well-trained team of ICs available to help streamline the response allowed us to bring together teams across the company to strategically address the technical issues at hand, as well as coordinate communication to our users across email, social media, and our support channels.
Our Incident Response structure also allowed us to make sure we were aware of and taking care of the health of our responders. Because this incident lasted so long, we ended up establishing 3-hour rotations of ICs, with a short meeting to transfer the necessary context. This provided ICs a predictable schedule around which they could rest, and left us with an up-to-date incident record and notes. A deep bench of ICs worked in our favor, allowing us to give ICs a long rest between periods of activity.
In addition to our usual IC role, we also designated other roles to respond to this outage, including scribes, technical liaisons, and points of contact between teams, which allowed us to disseminate information clearly among the large number of teams coordinating their efforts. We also identified a need for our Support agents to provide more technical details than usual to Mandrill users. Our technical liaisons stepped up to make sure the information was clear to both the agents and the users, which both groups appreciated.
What We Learned
At Mailchimp, we take a blameless approach to post-incident reviews. Our primary objective is to learn as much as we can, not to point fingers at individuals or teams. Incidents usually involve several systems, both technical and human, interacting and overlapping in unexpected ways. Hearing multiple perspectives on the same system helps us to clarify our understanding of the behavior of the system as a whole.
Because of the scope of this incident and the impact it had on our customers, we ended up doing a series of formal debriefs over the course of a week. Through these conversations, we learned more about what worked well, and what didn’t.
We identified some other factors which slowed down our response:
- Knowledge of and access to Mandrill systems and monitoring is concentrated in a small number of individuals.
- Logging of the Mandrill system is dependent on having a locally writable disk.
- Differentiating between logs for malformed API requests and actual application errors was not easy. This made it difficult or impossible to separate failures while inspecting logs.
- We had limited visibility into job health.
- Partial code deployments to aid triage were risky because of disk space issues.
Things that worked well and which we’ll continue to do going forward:
- The incident command muscles that we've been building paid off—close coordination of the response was critical to our recovery.
- We ultimately landed on the truncation strategy because of a video conference sync. Taking the time to regroup and reassess was valuable and led to a faster solution.
- Having clearly designated roles was essential: ICs, scribes, technical liaisons, and points of contact. These roles helped us communicate across many different teams and across a long stretch of time.
- A periodic cadence of summaries and a list of recent/outstanding decisions kept responders on track and aligned.
- Lots of folks volunteered to help, and they ramped up quickly.
- Mandrill by default deletes queued mail older than 48 hours. As we approached this limit, more drastic solutions were considered. We were forced to get creative, which was good.
- Responders felt well taken care of, and we were lucky that our volunteers were able to prioritize this work over other things.
Some things went surprisingly well during the incident:
- A lot of second-order impacts were related to disk queues filling up—running in the cloud added flexibility to our response, and we were able to expand storage for these instances quickly.
- We were lucky that we had data we could delete at all, and even more lucky that it included the largest table holding up the vacuum.
We know this won’t be the last incident we encounter, and there’s always more to learn. We’re pleased that many of the changes we’ve made in the last year have served us well, but we’re still working to build more resilient systems to serve our customers as best we can. We hope that sharing our findings with you is helpful. If you have questions, please reach out to our Support team. We’d love to hear from you.