
Eric Crooks

Software Engineer. Developer Advocate. Open Source Contributor.


Communicate Clearly, Change Confidently: Lessons from Breaking Changes

Non-breaking API changes are not negotiable. They are foundational to your team's credibility and your system's survival, including all of your upstream and downstream connections. Versioning, communicating changes, continuous integration, communicating changes, documentation, and communicating changes prevent you and other teams from accidentally breaking each other's systems. Of all the things to get right, communicating changes matters most, because even if everything else goes wrong, at least everyone is on the same page.

Witnessing Enterprise Chaos

I've personally witnessed breaking changes occur in enterprise-grade software. Breaking changes happen EVERYWHERE, not just at startups and medium-sized businesses. Fortunately for end-users, breaking changes can go unnoticed thanks to multiple layers of code designed to prevent user experience degradation. End-users see a nice message that masks the underlying chaos—teams trying to figure out what's happening and what broke. Sadly, in some cases, they're trying to figure out who broke it rather than what processes failed everyone.

Cisco

Situation

My time at Cisco was a good learning experience. APIs were versioned between teams, but there wasn't always enough communication between teams to prevent breaking changes. The team I joined worked on an API-driven web app. In addition to its own APIs, it sent and received data from APIs being developed by other Cisco teams. For the most part, the APIs we talked to were always up and running, sending and receiving the expected data. When an API responded with a 4xx or 5xx error, it was difficult to debug. Our starting point for debugging the issue was, "Let's get a meeting scheduled with the team that maintains the API," and that's where it would escalate into chaos.

The most memorable meetings for me were the ones that started with "It's not us, it's them," and that set the tone for the rest of the debugging sessions on the issue at hand. I can't speak for other teams' API development, but my team didn't have unit or integration tests to help us when I joined. There was no way to know for sure whether we were doing something incorrectly (sending a bad request or similar) or whether the issue stemmed from the other team's system. We (including the other team) couldn't answer the following questions simply because we didn't have integration tests:

  • Is the 4xx due to a bad request on our side?
  • Is the 4xx due to a new required field on their side?
  • Are we receiving a 5xx because their side is experiencing an outage?
  • Are we actually receiving a 200, but can't parse it because the response format changed, causing us to throw a 5xx?

If we had integration tests, we could've answered the above quickly and shut down that "it's not us, it's them" statement with unbiased facts. It was clear we had a process issue in our debugging workflow in addition to having a communication issue with the other team. Not to mention the psychologically unsafe debugging sessions. Just an overall counterproductive, dysfunctional debugging workflow that caused stress for everyone involved.
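As a minimal sketch of the kind of check that would have answered those questions, here is one way to classify an API response into an actionable diagnosis. The field names and messages are hypothetical stand-ins for the real contract between the two teams, not Cisco's actual API.

```python
import json

# Hypothetical contract: the fields our web app relied on.
EXPECTED_FIELDS = {"id", "status", "updated_at"}

def classify_response(status_code: int, body: str) -> str:
    """Turn a raw API response into an actionable diagnosis."""
    if status_code >= 500:
        return "their outage: escalate to the owning team"
    if status_code >= 400:
        return "bad request or new required field: compare payload to contract"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A 200 we can't parse usually means the response format changed.
        return "200 but unparseable: response format changed"
    missing = EXPECTED_FIELDS - payload.keys()
    if missing:
        return f"200 but contract drift: missing fields {sorted(missing)}"
    return "ok"

# Run in CI against recorded or live responses:
assert classify_response(200, '{"id": 1, "status": "ACTIVE", "updated_at": "2020-01-01"}') == "ok"
assert classify_response(503, "") == "their outage: escalate to the owning team"
```

Checks like these, run on every build against the other team's API (or a recorded copy of its responses), replace "it's not us, it's them" with a specific, unbiased answer.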

Forward with Resiliency

From my perspective, what we needed was a process aligned with preventive maintenance, more error handling, and a way to raise concerns about psychologically unsafe meetings. Before I left Cisco, we did get some automated tests in place and more error handling (logs and alerts), so our MTTR did improve when breaking changes happened. The error handling I added lasted about a month before being removed from the codebase. The alerting system it triggered generated too much noise from frequent (intermittent, but frequent) errors, and this was frowned upon by some parties involved. I can't say much about the meetings, but catch-ups with colleagues showed they were improving.

Lessons

  • Keep psychological safety top of mind in meetings. This ensures everyone can move forward cohesively. A meeting to handle an outage is already stressful; there's no need for more.
  • When failures happen, question the process and not the people executing it. Most of the time, it's the processes that are failing everyone and need improvement.
  • Document what went wrong so it doesn't happen again. Do the boring work and keep that report in your internal documentation pages so parties involved can reference it. This could also prevent unnecessary meetings.
  • Write your integration tests and run them in your CI. Running them there is a high-performing operation that (1) future-proofs your integrations, (2) keeps your MTTR metric low, and (3) prevents uncertainty about what's breaking.

LUMA

Situation

Many breaking changes happened during my time with LUMA. One of them was a simple back-end change to improve user experience. There were worse ones, but I highlight this one because it had good intentions and still caused chaos. The breaking change was to enum values that mapped to user-friendly status messages in the web app. From what I recall in a meeting, an on-prem team updated the values for more clarity (e.g., PENDING vs. IN_REVIEW). The change wasn't communicated to my team, so we didn't make an update to the web app. The web app continued to use PENDING and defaulted to UNKNOWN when it received IN_REVIEW. These UNKNOWN statuses triggered customer support calls from confused users, which then escalated to our team for corrective maintenance (a low-performing operation—opposite of preventive maintenance).

Correct and Prevent

Although the on-prem change was simple and had good intentions, it created enough chaos to pull me and my colleague into a war room. The war room was psychologically safe, so everyone involved worked cohesively. We quickly figured out the issue after checking the entire flow and corrected the front-end to handle the new values. The change to the enum values and the correction to the front-end did improve user experience, successfully achieving the intended outcome. During our investigation, we figured out a preventive maintenance solution so that future enum value changes would be captured in the back-end and not surfaced to users. Simply put, our solution:

  • defaulted to PENDING status instead of UNKNOWN to not confuse users;
  • added a new check that wrote error logs if we pulled an unrecognized enum value from the on-prem system; and
  • set up an alert to notify us so we could fix the problem behind the scenes and maintain user experience.
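The three points above can be sketched in a few lines. The status names and messages here are illustrative; the real enum lived in the on-prem system.

```python
import logging

logger = logging.getLogger("status-mapper")

# Illustrative mapping of back-end enum values to user-friendly messages.
STATUS_MESSAGES = {
    "PENDING": "Your request is pending.",
    "APPROVED": "Your request was approved.",
    "REJECTED": "Your request was rejected.",
}

def user_facing_status(raw_status: str) -> str:
    """Map a back-end enum value to a user-friendly message, falling back
    to the PENDING message (not UNKNOWN) when the value is unrecognized."""
    if raw_status not in STATUS_MESSAGES:
        # Log (and, in production, alert on) the unrecognized value so the
        # team can react behind the scenes without degrading the UX.
        logger.error("Unrecognized status from on-prem system: %r", raw_status)
        return STATUS_MESSAGES["PENDING"]
    return STATUS_MESSAGES[raw_status]
```

With this in place, a renamed enum value like IN_REVIEW would surface as a sensible PENDING message to users while the error log and alert give the team time to ship the real mapping.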

Lessons

  • Always communicate changes—even simple UX improvements—to avoid customer-facing issues and unnecessary corrective maintenance.
  • User-facing fallback logic should always make sense to the user. Something like an UNKNOWN status causes confusion.
  • Add logging/alerts when handling unknown values in request/response payloads. This buys time for your team to investigate and make changes.
  • Good intentions don't prevent chaos—checks do. Pair your good intentions with checks to ensure everything works as expected.

Payoneer

When I was on the Payoneer team, we had a strict policy where we needed to communicate breaking changes to partners N days before the breaking change occurred. We also had migration guides to make the migration process smooth and stress-free. Communication was top of mind on this team. I encountered just one breaking change. I don't recall the exact issue, but fundamentally it boils down to a pull request's blast radius. That is, how much of the system a pull request could break if it's merged and released into production.

In a fast-paced environment, small pull requests (even one-liners) can be overlooked and cause serious damage. It's because of Payoneer that I'm overly cautious about small pull requests. The first question I ask myself when presented with a pull request is, "What's the blast radius of this code being changed?" More often than not, I've found small pull requests to be the most fatal changes. This is mostly due to a collective "it's a small change, we're good to merge" mindset. In other companies, teammates got frustrated with my "slow" pull request reviews. I could approve and merge the code quickly if (1) there were unit tests and (2) they passed in the CI. However, most of the projects I've worked on didn't have unit tests when I joined. Payoneer required unit, API, and/or browser tests before release.
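Here is a hypothetical one-liner with a large blast radius: changing the default rounding mode in a shared money helper. Every caller that relied on the old behavior is affected, even though the diff is one line. (The helper and its name are made up for illustration, not from any Payoneer codebase.)

```python
from decimal import Decimal, ROUND_HALF_UP

def to_cents(amount: str) -> int:
    """Convert a decimal amount string to an integer number of cents."""
    # A one-line "improvement" here, e.g. swapping ROUND_HALF_UP for
    # ROUND_DOWN, would silently change totals across the whole system.
    return int(Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP) * 100)

# The unit test that makes the blast radius visible in CI:
assert to_cents("19.995") == 2000  # half-up rounding; ROUND_DOWN would give 1999
```

A reviewer eyeballing the diff might wave it through; a failing assertion in CI cannot be waved through.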

Lessons

  • Scrutinize small pull requests. A small pull request doesn't mean its blast radius is small. It still has the potential to be fatal.
  • Write tests. They future-proof your code. It takes time to write tests, but it takes more time to figure out what's breaking in your system without them.
  • Add automated tests in your workflow. It will save you future debugging time. It's better to have your CI say your code is faulty rather than hearing, "There's a bug in production," from someone outside of your team.
  • Have a communication policy in place if you're developing/maintaining an API. Make sure everyone knows it and it's enforced.
Tags: cross-functional teams, distributed systems, API, versioning