Engineering blog · Quality Engineering

Testing 420 bots: a love letter to pipeline engineers.

When we had 50 bots, we tested them by reviewing the test scripts and approving them individually. When we had 200 bots, we kept doing that and started to notice we were the bottleneck. When we crossed 400 bots, we stopped doing it entirely and rebuilt our testing around end-to-end behavior. The bank did not fall over. The opposite happened. This is what we learned.

Why script reviews stop working

For the first few years of our RPA program, every bot change went through a manual test script review. An engineer would write the bot. A QE would write the tests. A senior QE would review both. Everyone could point to the test script and say "yes, this is what we expect to happen". It worked. It also did not scale.

The moment we crossed a few hundred bots, the script review queue became a full-time job for two people, then three, and the actual information content of those reviews dropped off a cliff. Most reviews were box-ticking: does the script cover the happy path, does it cover three obvious unhappy paths, does it use our fixtures. Review caught typos and missing assertions. It did not catch production failures.

Production failures were almost always integration drift, environmental flakiness, or unexpected interactions between bots that were individually well tested. Things that cannot be caught by reviewing a single script because they do not live inside a single script.

What we do now

Our testing discipline now has three layers, and the middle one is where we spend the most attention.

Layer one is automated tests per bot. These are owned by the bot author, generated largely by our coding agent, and required for deploy. No human reviews the script. Pipeline enforces that the tests exist, pass, and achieve a coverage threshold. Humans do not gate on them.
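The layer-one gate is mechanical enough to sketch. Here is a minimal illustration in Python; the `TestReport` shape, the function names, and the 80% threshold are all invented for this example, not our real tooling:

```python
from dataclasses import dataclass

@dataclass
class TestReport:
    """Summary a hypothetical pipeline step emits after running a bot's tests."""
    tests_found: int
    tests_failed: int
    coverage_pct: float

def deploy_gate(report: TestReport, min_coverage: float = 80.0) -> bool:
    """Layer-one gate: tests must exist, all must pass, and coverage must
    meet the bar. No human reviews the script; the pipeline decides."""
    if report.tests_found == 0:
        return False  # tests must exist
    if report.tests_failed > 0:
        return False  # tests must pass
    return report.coverage_pct >= min_coverage  # coverage threshold

print(deploy_gate(TestReport(tests_found=12, tests_failed=0, coverage_pct=91.5)))  # True
print(deploy_gate(TestReport(tests_found=0, tests_failed=0, coverage_pct=100.0)))  # False
```

The point of the sketch is the absence of any human step: every condition is a predicate the pipeline can evaluate on its own.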

Layer two is end-to-end behavioral tests. This is the layer that matters. These tests simulate a realistic workflow end-to-end, across multiple bots and sometimes multiple orchestration processes. They run in a near-production environment. They run nightly. They run on every platform-level change. When a behavioral test fails, we know something is wrong at the bank level, not the bot level.
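For a sense of the shape, here is a toy behavioral test in Python. Everything in it (the invoice workflow, the three bot functions, the in-memory ledger) is invented to illustrate one idea: assert on the end-to-end outcome, not on any single bot.

```python
# Toy stand-ins for three independently deployed bots. In a real suite these
# would drive a near-production environment, not in-process functions.
def intake_bot(document: dict, queue: list) -> None:
    queue.append({"vendor": document["vendor"], "amount": document["amount"]})

def validation_bot(queue: list, approved: list) -> None:
    while queue:
        record = queue.pop(0)
        if record["amount"] > 0:  # reject malformed amounts
            approved.append(record)

def posting_bot(approved: list, ledger: dict) -> None:
    for record in approved:
        ledger[record["vendor"]] = ledger.get(record["vendor"], 0) + record["amount"]
    approved.clear()

def test_invoice_flow_end_to_end() -> None:
    """Assert on the workflow's final side effect, not on any one bot."""
    queue, approved, ledger = [], [], {}
    intake_bot({"vendor": "acme", "amount": 120}, queue)
    intake_bot({"vendor": "acme", "amount": -5}, queue)  # should be rejected
    validation_bot(queue, approved)
    posting_bot(approved, ledger)
    assert ledger == {"acme": 120}  # the bank-level outcome we care about

test_invoice_flow_end_to_end()
print("behavioral test passed")
```

Note that the test never inspects an individual bot's internals; a bot could be rewritten wholesale and the test would still be valid, which is exactly what makes this layer durable.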

Layer three is production observability with autonomous verification. Every bot emits structured telemetry. A separate tier of synthetic transactions runs continuously in production, exercising end-to-end flows with canary data, and verifies that the flows produce the expected side effects. This is the layer that catches the silent vendor drift.
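A synthetic-transaction check reduces to: inject a canary, then verify that every expected side effect appeared. A hedged sketch, with invented names (`verify_canary`, the side-effect labels) standing in for real integrations:

```python
def verify_canary(canary_id: str,
                  observed_side_effects: set,
                  expected_side_effects: set) -> list:
    """Return the side effects the canary flow failed to produce.
    An empty list means the end-to-end flow is healthy. canary_id is
    kept for logging/alerting in a real probe."""
    return sorted(expected_side_effects - observed_side_effects)

# Example: a payment canary is expected to leave three traces.
expected = {"ledger_entry", "audit_log", "ops_notification"}
healthy = verify_canary("canary-001", {"ledger_entry", "audit_log", "ops_notification"}, expected)
drifted = verify_canary("canary-002", {"ledger_entry"}, expected)
print(healthy)  # []
print(drifted)  # ['audit_log', 'ops_notification']
```

The "silent vendor drift" case is the second call: the flow still runs and nothing throws, but downstream side effects quietly stop appearing, and only a probe that checks for them will notice.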

The role of our coding agent in tests

Our coding agent drafts the per-bot tests when a bot is authored. It does this well because tests are a constrained output and the agent has every example we have ever written to learn from. It is not great at writing end-to-end behavioral tests, because those require understanding the workflow beyond any single bot. That is where our human QE engineers spend their time now.

This is the shift. We used to spend QE time reviewing what the agent can now produce perfectly well on its own. We now spend QE time on the thing humans are actually better at: thinking about the system as a whole.

We stopped caring about incremental test script reviews and started caring about what the bank does end to end. The bank got more reliable.

What the numbers look like

420 bots under test
37 behavioral tests covering critical workflows
99.94% platform uptime for P1 processes

The uptime number is the outcome we care about. Going from "every script is reviewed by a human" to "behavioral tests are the gate" did not cost us reliability. It improved reliability, because we caught more of the failures that actually occur in production.

The behavioral test count is the most interesting number. 37 is far fewer tests than 420 bots, but those 37 tests exercise every important bot behavior at least once in context. This is the trade we made: fewer total tests, each harder to write, each giving us more information.

If you are testing automation, stop celebrating script counts

Test count is a vanity metric. Test script review throughput is a vanity metric. Coverage percentage is a vanity metric. The metric that matters is "does the bank work end to end when we deploy a change". That is the only measure that survives contact with 420 bots.

If you are testing automation at scale, invest in behavioral tests that run on a production-like environment, invest in synthetic transactions that exercise your flows in production continuously, and stop spending senior QE time reviewing per-bot scripts. Your engineers can write those. Your agent can draft those. Your behavioral test suite is where the quality lives.

Tags: Testing · Quality Engineering · RPA