July 14, 2025

About six months ago, I started a new job on a team that was building a new debit card fraud detection platform. About six days ago, I found out the parent company chose to lay off our division. Bummer. My team is being kept on for a few more months to transition our product to a new team, but we’ll all be job hunting soon.

A poorly cropped photo of ten guys posing in front of a sign that says FIS, on a white tile floor, in front of a white and gray wall.

The Wolverine Inference team, a few days before the layoffs

I’ve worked on some good engineering teams in the past, but this team is one of the smartest, kindest, and highest-performing teams it’s been my pleasure to be a part of.

What we did

The product was called Wolverine. We would receive debit card transactions from an upstream provider, collect a bunch of extra information about each one, run it through one of several machine learning models to score the likelihood that it was fraudulent, apply some business rules, and return a decision (Approve or Deny).

Importantly, all of this happens live, between the moment the user swipes their card and when the machine beeps. A bunch of stuff happens in those few seconds, so we had to do everything fast.
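
The flow above can be sketched in a few lines. This is a hypothetical illustration, not Wolverine's actual code: `Transaction`, `enrich`, `model`, and `rules` are all names I'm inventing here to show the shape of the enrich → score → rules → decision pipeline.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    card_id: str
    amount_cents: int
    merchant_id: str

def score_transaction(txn, enrich, model, rules):
    """Hypothetical pipeline: enrich the transaction, score it,
    apply business rules, and return a decision."""
    features = enrich(txn)         # collect extra context about the transaction
    fraud_score = model(features)  # likelihood of fraud, 0.0..1.0
    # Business rules get the final say, informed by the model's score
    return "Deny" if rules(txn, fraud_score) else "Approve"
```

Every step in that chain eats into the latency budget, which is why the enrichment and model-serving layers matter so much.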

I also won office contests to name the new freezer and some patio furniture.

I’m proud of what we accomplished

When I joined, the team had most of a solution ready to go. They needed help with observability. My first major contribution was sitting down, listing failure modes, identifying the gaps in our metrics and alerts, and filling them. Then I took it a step further by identifying an aggressive but achievable service level objective:

Our objective was to return a score for every transaction in under 150ms (from the moment the POST hit one of our servers to the moment we returned a payload).

How did we do?

A Grafana graph titled “Wolverine SLO Met (avg for time range)”. The big number is 99.982%, and the line is mostly straight across with a few dips here and there.

This is the percent of transactions for which we hit that SLO between April 2025 (when I started measuring) and July (today). Eight people, one EKS cluster, building on top of legacy banking systems, having just turned it on, and handling ~50% of the debit card traffic in the United States. I’m damn pleased with that number.
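
The metric itself is simple to state: the fraction of transactions whose end-to-end latency (POST received to payload returned) came in under the 150ms threshold. A minimal sketch of that calculation, with a function name of my own invention (in production this would be a Grafana query over recorded latencies, not a Python loop):

```python
def slo_met_percent(latencies_ms, threshold_ms=150.0):
    """Percent of requests whose latency met the SLO threshold."""
    if not latencies_ms:
        return 100.0  # no traffic means no violations
    met = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return 100.0 * met / len(latencies_ms)
```

At 99.982%, roughly 2 in every 10,000 transactions missed the 150ms target.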

Bringing big tech to legacy finance

I worked for a new acquisition of a 60 year old company. They’ve got mainframes. They’ve got datacenters. They definitely run production COBOL, although I didn’t see any. I’ve largely worked at smaller, significantly newer tech companies, and so has most of my immediate team. We had two guiding principles that our counterparts at the big company didn’t:

  1. Prod stays up
  2. When prod goes down, we figure out what we need to change so prod stays up in the future.

Imagine our surprise when we found out that some teams take 30+ minutes every month to test the failover between datacenters. Imagine our surprise when we had to explain to the SRE team that, while we appreciate their diligence fixing a problem in the dev environment, they brought down Prod in doing so.

They had monitoring and observability, but we stepped in and made sure they were measuring the right things, had the right alerts, and were notifying the right people when something went wrong. When I fat-fingered a config and caused a significant accident, I created the postmortem template, filled it out, and called a meeting with other department leads where I explained exactly what happened and what we were doing to prevent it from happening again. The subtext was: this is the level of commitment you get from us.