Engineering blog · Quality Engineering

Testing 420 bots: a love letter to pipeline engineers.

When we had 50 bots, we tested them by reviewing the test scripts and approving them individually. When we had 200 bots, we kept doing that and started to notice we were the bottleneck. When we crossed 400 bots, we stopped doing it entirely and rebuilt our testing around end-to-end behavior. The bank did not fall over. The opposite happened. This is what we learned.

Why script reviews stop working

For the first few years of our RPA program, every bot change went through a manual test script review. An engineer would write the bot. A QE would write the tests. A senior QE would review both. Everyone could point to the test script and say "yes, this is what we expect to happen". It worked. It also did not scale.

The moment we crossed a few hundred bots, the script review queue became a full-time job for two people, then three, and the actual information content of those reviews dropped off a cliff. Most reviews were box-ticking: does the script cover the happy path, does it cover three obvious unhappy paths, does it use our fixtures. Review caught typos and missing assertions. It did not catch production failures.

Production failures were almost always integration drift, environmental flakiness, or unexpected interactions between bots that were individually well tested. Things that cannot be caught by reviewing a single script because they do not live inside a single script.

What we do now

Our testing discipline now has three layers, and the middle one is where we spend the most attention.

Layer one is automated tests per bot. These are owned by the bot author, generated largely by our coding agent, and required for deploy. No human reviews the script. Pipeline enforces that the tests exist, pass, and achieve a coverage threshold. Humans do not gate on them.
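The layer-one gate is mechanical enough to sketch. Here is a minimal illustration in Python; the `TestReport` shape, the function names, and the 80% threshold are all invented for this example, not our real tooling:

```python
from dataclasses import dataclass

@dataclass
class TestReport:
    """Summary a hypothetical pipeline step emits after running a bot's tests."""
    tests_found: int
    tests_failed: int
    coverage_pct: float

def deploy_gate(report: TestReport, min_coverage: float = 80.0) -> bool:
    """Layer-one gate: tests must exist, all must pass, and coverage must
    meet the bar. No human reviews the script; the pipeline decides."""
    if report.tests_found == 0:
        return False  # tests must exist
    if report.tests_failed > 0:
        return False  # tests must pass
    return report.coverage_pct >= min_coverage  # coverage threshold

print(deploy_gate(TestReport(tests_found=12, tests_failed=0, coverage_pct=91.5)))  # True
print(deploy_gate(TestReport(tests_found=0, tests_failed=0, coverage_pct=100.0)))  # False
```

The point of the sketch is the absence of any human step: every condition is a predicate the pipeline can evaluate on its own.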

Layer two is end-to-end behavioral tests. This is the layer that matters. These tests simulate a realistic workflow end-to-end, across multiple bots and sometimes multiple orchestration processes. They run in a near-production environment. They run nightly. They run on every platform-level change. When a behavioral test fails, we know something is wrong at the bank level, not the bot level.
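For a sense of the shape, here is a toy behavioral test in Python. Everything in it (the invoice workflow, the three bot functions, the in-memory ledger) is invented to illustrate one idea: assert on the end-to-end outcome, not on any single bot.

```python
# Toy stand-ins for three independently deployed bots. In a real suite these
# would drive a near-production environment, not in-process functions.
def intake_bot(document: dict, queue: list) -> None:
    queue.append({"vendor": document["vendor"], "amount": document["amount"]})

def validation_bot(queue: list, approved: list) -> None:
    while queue:
        record = queue.pop(0)
        if record["amount"] > 0:  # reject malformed amounts
            approved.append(record)

def posting_bot(approved: list, ledger: dict) -> None:
    for record in approved:
        ledger[record["vendor"]] = ledger.get(record["vendor"], 0) + record["amount"]
    approved.clear()

def test_invoice_flow_end_to_end() -> None:
    """Assert on the workflow's final side effect, not on any one bot."""
    queue, approved, ledger = [], [], {}
    intake_bot({"vendor": "acme", "amount": 120}, queue)
    intake_bot({"vendor": "acme", "amount": -5}, queue)  # should be rejected
    validation_bot(queue, approved)
    posting_bot(approved, ledger)
    assert ledger == {"acme": 120}  # the bank-level outcome we care about

test_invoice_flow_end_to_end()
print("behavioral test passed")
```

Note that the test never inspects an individual bot's internals; a bot could be rewritten wholesale and the test would still be valid, which is exactly what makes this layer durable.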

Layer three is production observability with autonomous verification. Every bot emits structured telemetry. A separate tier of synthetic transactions runs continuously in production, exercising end-to-end flows with canary data, and verifies that the flows produce the expected side effects. This is the layer that catches the silent vendor drift.
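A synthetic-transaction check reduces to: inject a canary, then verify that every expected side effect appeared. A hedged sketch, with invented names (`verify_canary`, the side-effect labels) standing in for real integrations:

```python
def verify_canary(canary_id: str,
                  observed_side_effects: set,
                  expected_side_effects: set) -> list:
    """Return the side effects the canary flow failed to produce.
    An empty list means the end-to-end flow is healthy. canary_id is
    kept for logging/alerting in a real probe."""
    return sorted(expected_side_effects - observed_side_effects)

# Example: a payment canary is expected to leave three traces.
expected = {"ledger_entry", "audit_log", "ops_notification"}
healthy = verify_canary("canary-001", {"ledger_entry", "audit_log", "ops_notification"}, expected)
drifted = verify_canary("canary-002", {"ledger_entry"}, expected)
print(healthy)  # []
print(drifted)  # ['audit_log', 'ops_notification']
```

The "silent vendor drift" case is the second call: the flow still runs and nothing throws, but downstream side effects quietly stop appearing, and only a probe that checks for them will notice.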

The role of our coding agent in tests

Our coding agent drafts the per-bot tests when a bot is authored. It does this well because tests are a constrained output and the agent has every example we have ever written to learn from. It is not great at writing end-to-end behavioral tests, because those require understanding the workflow beyond any single bot. That is where our human QE engineers spend their time now.

This is the shift. We used to spend QE time reviewing what the agent can now produce perfectly well on its own. We now spend QE time on the thing humans are actually better at: thinking about the system as a whole.

We stopped caring about incremental test script reviews and started caring about what the bank does end to end. The bank got more reliable.

What the numbers look like

420 bots under test
37 behavioral tests covering critical workflows
99.94% platform uptime for P1 processes

The uptime number is the outcome we care about. Going from "every script is reviewed by a human" to "behavioral tests are the gate" did not cost us reliability. It improved reliability, because we caught more of the failures that actually occur in production.

The behavioral test count is the most interesting number. 37 is far fewer tests than 420 bots, but those 37 tests exercise every important bot behavior at least once in context. This is the trade we made: fewer total tests, each harder to write, each giving us more information.

If you are testing automation, stop celebrating script counts

Test count is a vanity metric. Test script review throughput is a vanity metric. Coverage percentage is a vanity metric. The metric that matters is "does the bank work end to end when we deploy a change". That is the only measure that survives contact with 420 bots.

If you are testing automation at scale, invest in behavioral tests that run on a production-like environment, invest in synthetic transactions that exercise your flows in production continuously, and stop spending senior QE time reviewing per-bot scripts. Your engineers can write those. Your agent can draft those. Your behavioral test suite is where the quality lives.

Tags: Testing · Quality Engineering · RPA