Test Automation Maturity Model: 5 Levels & 4 Dimensions

Abhilash

Industry Analyst, Test Automation

Published on June 2, 2026

From TMMi to adaptive maturity: what the classical test automation model misses, why it matters, and how to assess your team in twenty minutes.

Imagine a quality team at a global insurance company. They just received a Level 4 score on a well-known test automation maturity model. The consultant signed off. The metrics look good. The nightly test pipeline runs without issues. The slides for the executive meeting are ready.

Eighteen months later, that same team is struggling. AI coding tools have rewritten parts of the application three times. Buttons have moved. Workflows have new steps. Forty-two percent of their automated tests no longer work against the live system.

The Level 4 score did not help them. The maturity model did not warn them.

This guide explains what the test automation maturity model is, why the traditional version is struggling to keep up with modern development, what a better approach looks like, and how to figure out where your team actually stands today.

What is a Test Automation Maturity Model?

A test automation maturity model is a way to measure how well an organisation manages its test automation. It defines stages that a team can move through, describes what good looks like at each stage, and gives teams a path to improve over time.

The best-known reference is TMMi (Test Maturity Model integration). It covers the whole test process rather than automation alone, and many large organisations and consulting firms use it as a benchmark. Most test maturity models follow a similar five-level shape, regardless of the specific framework.

The idea behind any test automation maturity model is useful. It gives quality leaders a shared language, a way to ask for investment, and a way to compare where they are against other teams. The problem comes when teams chase the score instead of using it as a tool for improvement.

The Five Levels of Test Automation Maturity

The classical test automation maturity model has five levels. Each level builds on the one before it.

Level 1: Initial

Testing happens, but it is unplanned and inconsistent. There are no formal processes. Results depend on whoever is doing the testing that day. Defects are tracked loosely, if at all.

Level 2: Managed

Testing has a basic structure. There is a plan, a way to track defects, and defined environments to test in. Tests are written before execution starts, though most testing is still done manually.

Level 3: Defined

Testing is built into the development process. There is a testing strategy that the whole organisation follows. Code reviews happen. Automation starts to be used at scale.

Level 4: Measured

Data drives decisions about testing. Metrics are collected and used. Release quality is checked against clear criteria. The team can predict outcomes more reliably.

Level 5: Optimisation

The team focuses on preventing defects, not just finding them. Processes improve continuously. Lessons from incidents are fed back into how the team works.

This is the standard test automation maturity model shape. It is logical, it rewards discipline, and for a long time it worked well.

Why the Classical Test Automation Maturity Model is Struggling

The test automation maturity model most teams use was designed for a specific kind of world. Releases happened every few months. Codebases were relatively stable. Teams had time to document everything and optimise their processes gradually.

That world is changing fast. Here is why the classical model has trouble keeping up.

1. It Assumes the Code Does Not Change Much

The classical test maturity model rewards thorough documentation, traceability matrices, and well-defined coverage targets. All of these things take time to build and maintain. They only stay useful if the underlying code is not changing constantly.

Today, many teams are using AI coding tools that can rewrite entire sections of an application overnight. A test plan written on Monday can be out of date by Friday. The test automation maturity model was not designed for that pace.

2. It Measures How You Work, Not What You Achieve

A team can score Level 5 on a test maturity model and still ship software with serious workflow failures. The classical model checks whether you have documented processes, defined metrics, and peer reviews in place. It does not check whether customers can actually complete the journeys they rely on.

Process quality and outcome quality are different things. A mature test automation practice needs to measure both.

3. It Treats Quality as One Thing That Grows Linearly

The five-level ladder implies that quality capability moves in a straight line. You earn Level 2, then Level 3, then Level 4. But in practice, a team can be excellent at one thing and struggling at another at the same time.

A team might have deep end-to-end test coverage (good verification depth) but break tests every time the UI changes (poor adaptation speed). The single-number score from a classical test maturity model hides that kind of imbalance.

4. It Counts Inputs Instead of Results

Many of the checks in a traditional test automation maturity assessment are about whether things exist. Does a test plan exist? Are defects being tracked? Have coverage targets been written down?

These are reasonable starting points. But they do not tell you whether the test plan was right, whether defects were actually fixed before release, or whether the coverage targets covered the right things. The classical model measures the presence of inputs. What matters is what those inputs produce.

5. It Was Built for a Different Job

The original test automation maturity model was designed to help QA teams become more efficient and predictable. The goal was to reduce rework, run tests faster, and deliver software at lower cost.

That is still part of the job. But when AI tools are writing a large chunk of the code, the QA team's job is increasingly about trust and governance. Can we prove this release is safe to ship? Can we show an auditor what was verified and when? The classical model has no level for that kind of work.

A Better Way to Think About Test Automation Maturity

A growing number of quality leaders are moving away from the single-number score. Instead of one five-level ladder, they are using four separate dimensions. Each one is measured independently. A team reports its score on all four, which gives a much clearer picture of where to invest next.

The four dimensions are verification depth, adaptation speed, composability, and defensibility.

Dimension 1: Verification Depth

Verification depth is about how close your tests are to what customers actually do. Are you testing individual functions, individual features, or complete customer journeys from start to finish?

Emerging: Most tests check individual functions. If a customer journey breaks, customers usually find out before the test suite does.

Scaling: A mix of unit, integration, and feature tests exists. Some end-to-end tests are in place but they break often and are not always kept up to date.

Mature: The test estate is built around the customer journeys that matter most. Unit and integration tests are there to support that coverage, not replace it.

A team that has never run an automated end-to-end test on its ten most important customer journeys is at the Emerging stage on this dimension, no matter how many unit tests it has.

Dimension 2: Adaptation Speed

Adaptation speed is about how well the test estate keeps up when the application changes. Every team deals with code changes. What separates mature teams is how much time those changes cost them.

Emerging: When a developer moves a button or renames a field, multiple tests break. Engineers spend time fixing them manually. A notable chunk of each release goes to test repair rather than new work.

Scaling: Self-healing tools are in place and reduce the number of manual fixes needed. But test maintenance still consumes more engineering time than the team would like.

Mature: Tests adapt to structural changes automatically. Selector changes, layout shifts, and code refactors are handled without anyone having to step in. When something does need attention, it is because the behaviour actually changed, not just the code.

This dimension does not appear in the classical test automation maturity model at all. In an AI-accelerated environment it is one of the most important things to measure.
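As a rough illustration of the Mature stage, the sketch below shows one way a locator could fall back from a brittle id to a role-and-label match while logging the fix for review. It is a toy model in Python: the page structure, the `find_with_healing` function, and the element attributes are all hypothetical and do not represent how any particular self-healing tool works.

```python
from typing import Dict, List, Optional

# A page is modelled as a list of elements, each a dict of attributes.
# This is a toy model, not a real browser or a real self-healing engine.
Page = List[Dict[str, str]]


def find_with_healing(page: Page, strategies: List[Dict[str, str]],
                      heal_log: List[str]) -> Optional[Dict[str, str]]:
    """Try each locator strategy in order; record any fallback that was used."""
    for index, wanted in enumerate(strategies):
        for element in page:
            if all(element.get(key) == value for key, value in wanted.items()):
                if index > 0:
                    # The primary locator failed, so leave a reviewable trail.
                    heal_log.append(
                        f"primary locator {strategies[0]} failed; matched via {wanted}"
                    )
                return element
    return None  # Nothing matched: a genuine failure that needs a human.


# The submit button's id changed in a refactor, but its visible label did not.
page_after_refactor = [{"id": "btn-send-v2", "role": "button", "label": "Submit claim"}]
strategies = [{"id": "btn-submit"}, {"role": "button", "label": "Submit claim"}]

heal_log: List[str] = []
element = find_with_healing(page_after_refactor, strategies, heal_log)
print(element["id"], len(heal_log))  # prints: btn-send-v2 1
```

The log is the important part of the design: a fix that nobody can review is hard to tell apart from a silent failure.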

Dimension 3: Composability

Composability is about how reusable your test work is. If you build a test for one application or one environment, can you use it again for another? Or do you start from scratch every time?

Emerging: Every test is purpose-built. Moving to a new application or environment means rebuilding the test coverage from zero.

Scaling: Common steps and data sets are shared across tests. Some reuse happens at the snippet level. Full workflow-level reuse is not yet in place.

Mature: Tests are assembled from a library of reusable workflow modules. Work done in one quarter carries forward. Coverage grows over time instead of being rebuilt each cycle.

Composability is what turns test automation from something a team spends money on into something a team builds value with.

Dimension 4: Defensibility

Defensibility is about whether you can show your work. If an auditor, a regulator, or a senior executive asks what was tested before the last release, how quickly can you answer and how confident are you in that answer?

Emerging: Evidence is pulled together after the fact from logs, screenshots, and old message threads. It takes time and the result is not always complete.

Scaling: Evidence is captured during testing but stored across multiple tools. Someone can reconstruct the picture with effort, but it is not quick or simple.

Mature: Every test run produces a full record automatically. Screenshots, steps, video, and traceability links are all captured as part of normal testing. A release report is ready to share without anyone having to build it from scratch.

In regulated industries like financial services and healthcare, defensibility is the dimension that determines whether the next audit is straightforward or stressful.

A Self-Assessment: Twelve Questions in Twenty Minutes

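Score each question from one to four, honestly. Where only the endpoints of a scale are described, use two and three for the positions in between. Add up the total, and keep the subtotal for each dimension: the dimension with the lowest subtotal is where to invest next.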

Verification Depth

What proportion of your most important customer journeys has automated end-to-end test coverage?
(1 = under 25%, 2 = 25 to 49%, 3 = 50 to 80%, 4 = above 80%)

Do your tests check whether the business outcome happened, like a claim being submitted or an order being shipped, or do they mainly check UI states and field values?
(1 = mostly UI states, 4 = mostly business outcomes)

When a customer reports a broken workflow, can you point to the test that should have caught it?
(1 = rarely, 4 = almost always)

Adaptation Speed

When a developer changes a screen, what proportion of related tests needs manual fixing?
(1 = above 50%, 4 = under 10%)

After a major application change, how long does it take to get the test suite back to a passing state?
(1 = days or weeks, 4 = hours)

When a structural change breaks a test, does it heal automatically, flag for review, or fail silently?
(1 = fail silently, 4 = heal automatically with a log the team can review)

Composability

What proportion of the test estate can be reused across different applications or environments?
(1 = under 10%, 4 = above 60%)

At the start of a new release cycle, does the team reuse existing test logic or rebuild it?
(1 = mostly rebuild, 4 = mostly reuse)

Can a tester take an existing test module and apply it to a new part of the application without rewriting it?
(1 = rarely, 4 = routinely)

Defensibility

Can you produce a release verification report that a non-engineer can read without explanation?
(1 = no, 4 = yes, within an hour)

Is every test linked to a requirement, journey, or business risk?
(1 = no, 4 = yes, and the links are maintained automatically)

If a regulator asked what was verified before the last release, how long would it take to produce that answer?
(1 = days, 4 = minutes)

What Your Self-Assessment Score Means

36 to 48: Mature overall, provided no single dimension lags well behind the others. The focus now is on consistency and extending the practice to new applications and teams.

24 to 35: Scaling. Find the weakest dimension and put investment there before strengthening the others further. A profile with low scores on defensibility is a regulatory risk even if the other three look strong.

12 to 23: Emerging. Start with verification depth. Getting the ten most important customer journeys covered end to end is the first investment that compounds.

The score is a starting point. The point of the exercise is to know where to focus next, not to write a number on a slide.
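The scoring above can be sketched in a few lines of Python. The function name, the dictionary layout, and the sample answers are illustrative assumptions; only the band thresholds and the "weakest dimension" rule come from the text.

```python
from typing import Dict, List

# Bands taken from the article: 12-23 Emerging, 24-35 Scaling, 36-48 Mature.
BANDS = [(36, "Mature"), (24, "Scaling"), (12, "Emerging")]


def score_assessment(answers: Dict[str, List[int]]) -> dict:
    """Summarise a twelve-question self-assessment (three answers per dimension)."""
    if len(answers) != 4:
        raise ValueError("expected exactly four dimensions")
    for dimension, scores in answers.items():
        if len(scores) != 3 or any(s not in (1, 2, 3, 4) for s in scores):
            raise ValueError(f"{dimension}: expected three scores between 1 and 4")

    subtotals = {dimension: sum(scores) for dimension, scores in answers.items()}
    total = sum(subtotals.values())
    band = next(label for floor, label in BANDS if total >= floor)
    # On a tie, the first dimension listed is reported as the weakest.
    weakest = min(subtotals, key=subtotals.get)
    return {"subtotals": subtotals, "total": total, "band": band, "weakest": weakest}


# Hypothetical answers, for illustration only.
example = {
    "verification_depth": [3, 3, 4],
    "adaptation_speed": [2, 1, 2],
    "composability": [3, 3, 3],
    "defensibility": [4, 3, 3],
}
result = score_assessment(example)
print(result["total"], result["band"], result["weakest"])
# prints: 34 Scaling adaptation_speed
```

Reporting the subtotals alongside the total keeps the imbalance visible: this example team is one point short of the top band while scoring only 5 out of 12 on adaptation speed.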

What to Invest in, Based on Where You Are

Building Verification Depth

Teams that are weak on verification depth usually have plenty of unit tests but very few end-to-end tests. That is not because they are careless. End-to-end tests are harder to write and harder to maintain with traditional tools.

What to Do:

List the ten customer journeys that drive revenue, keep customers, or have compliance requirements

Build end-to-end test coverage for all ten at the workflow level, not just the UI level

Write assertions that check business outcomes, not intermediate screen states

Review and update the journey list each quarter as the product changes

Unit tests still do their job. End-to-end behaviour coverage is what protects the release.
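To make the difference between a screen-state assertion and a business-outcome assertion concrete, here is a minimal Python sketch. The `FakeClaimsBackend` class and both check functions are hypothetical stand-ins, and the backend deliberately simulates a defect where the confirmation banner appears but nothing is saved.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeClaimsBackend:
    """Hypothetical stand-in for the system of record behind a claims UI."""
    claims: Dict[str, str] = field(default_factory=dict)

    def submit(self, claim_id: str) -> str:
        # Simulated defect: the banner text is returned, but nothing is persisted.
        return "Claim submitted"

    def status_of(self, claim_id: str) -> str:
        return self.claims.get(claim_id, "NOT_FOUND")


def check_ui_state(backend: FakeClaimsBackend, claim_id: str) -> bool:
    """Intermediate check: only looks at the confirmation banner text."""
    banner = backend.submit(claim_id)
    return banner == "Claim submitted"


def check_business_outcome(backend: FakeClaimsBackend, claim_id: str) -> bool:
    """Outcome check: the claim must exist in the system of record."""
    backend.submit(claim_id)
    return backend.status_of(claim_id) == "SUBMITTED"


backend = FakeClaimsBackend()
ui_passed = check_ui_state(backend, "CLM-1")
outcome_passed = check_business_outcome(backend, "CLM-1")
print(ui_passed, outcome_passed)  # prints: True False
```

The screen-level check passes against the broken backend, while the outcome check fails, which is the behaviour a journey-level test is meant to provide.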

Improving Adaptation Speed

Teams weak on adaptation speed are paying a maintenance tax on every release. Every code change breaks a share of the test suite. Every release cycle absorbs time that should go toward improving coverage.

What to Do:

Move to a test design approach that does not depend on specific UI selectors or implementation details

Use self-healing tools for structural changes, with a reviewable log of every automatic fix

Run a regular review of stale tests so that outdated tests are retired rather than kept on life support

Connect change-impact analysis to the CI pipeline so only the relevant tests run for each pull request

Adaptation speed requires changing how tests are written. Patching around brittle tests does not move this dimension.
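The change-impact step in the list above could look something like the following Python sketch. The `IMPACT_MAP` contents, the path prefixes, and the test names are invented for illustration; a real mapping might be derived from coverage data or a maintained manifest.

```python
from typing import Dict, Iterable, Set

# Hypothetical mapping from source directories to the journey tests that exercise them.
IMPACT_MAP: Dict[str, Set[str]] = {
    "src/claims/": {"test_submit_claim", "test_track_claim"},
    "src/payments/": {"test_pay_premium"},
    "src/shared/": {"test_submit_claim", "test_track_claim", "test_pay_premium"},
}


def select_tests(changed_files: Iterable[str]) -> Set[str]:
    """Return the tests mapped to any directory prefix a changed file falls under."""
    selected: Set[str] = set()
    for path in changed_files:
        for prefix, tests in IMPACT_MAP.items():
            if path.startswith(prefix):
                selected |= tests
    return selected


print(sorted(select_tests(["src/payments/card.py", "README.md"])))
# prints: ['test_pay_premium']
```

In this sketch a file with no mapping selects nothing. A more cautious pipeline might fall back to the full suite whenever a changed file is unmapped.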

Building Composability

Teams weak on composability rebuild their test coverage from scratch too often. New applications, new environments, and acquired businesses all mean starting over.

What to Do:

Build a library of reusable workflow modules at the behaviour level

Standardise test data and how it is managed across the practice

Treat test assets the same way the engineering team treats code: versioned, owned, and maintained

Apply the same test modules across environments such as staging, UAT, and production smoke tests

Composability is what lets a small team protect a large application estate.
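One way to picture a library of reusable workflow modules is as small functions that are assembled into journeys. The Python sketch below assumes hypothetical step names and a simple shared context dictionary; it does not drive a real application.

```python
from typing import Any, Callable, Dict, List

# Each module is a small function that reads and updates a shared context.
Step = Callable[[Dict[str, Any]], None]


def log_in(ctx: Dict[str, Any]) -> None:
    ctx["user"] = ctx["credentials"]["username"]
    ctx["trace"].append("log_in")


def search_policy(ctx: Dict[str, Any]) -> None:
    ctx["policy_found"] = True
    ctx["trace"].append("search_policy")


def submit_claim(ctx: Dict[str, Any]) -> None:
    ctx["claim_status"] = "SUBMITTED"
    ctx["trace"].append("submit_claim")


def run_journey(steps: List[Step], environment: str) -> Dict[str, Any]:
    """Run the given modules in order against one (simulated) environment."""
    ctx: Dict[str, Any] = {
        "environment": environment,
        "credentials": {"username": "demo"},  # placeholder test data
        "trace": [],
    }
    for step in steps:
        step(ctx)
    return ctx


# Two journeys built from the same modules, run against different environments.
claim_journey = [log_in, search_policy, submit_claim]
lookup_journey = [log_in, search_policy]

staging = run_journey(claim_journey, "staging")
uat = run_journey(lookup_journey, "uat")
print(staging["trace"], uat["trace"])
```

The second journey costs one line because the modules already exist, which is the compounding effect described above.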

Building Defensibility

Teams weak on defensibility often do not realise it until an audit or an incident makes it obvious. The testing happened. The evidence was not kept in a usable form.

What to Do:

Link requirements, user journeys, and business risks to specific tests as first-class connections, not manual notes

Generate verification reports as a normal output of each test run, not as a quarterly catch-up exercise

Use a consistent format that someone outside the QA team can read without a guide

Capture screenshots, video, and step-by-step traces on every run and keep them indexed for retrieval

Defensibility is worth building once. Every release benefits from it.
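As a sketch of what "first-class" traceability might mean in practice, the Python example below stores the requirement link and the evidence on each run record, then groups runs into a per-requirement report. The `RunRecord` fields, the requirement identifiers, and the file paths are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    test_name: str
    requirement_id: str  # hypothetical identifier format
    passed: bool
    evidence: List[str]  # e.g. paths to screenshots, video, or step traces


def release_report(runs: List[RunRecord]) -> Dict[str, dict]:
    """Group runs by requirement so each one shows its status and evidence."""
    report: Dict[str, dict] = {}
    for run in runs:
        entry = report.setdefault(
            run.requirement_id, {"verified": True, "tests": [], "evidence": []}
        )
        # A requirement counts as verified only if every linked test passed.
        entry["verified"] = entry["verified"] and run.passed
        entry["tests"].append(run.test_name)
        entry["evidence"].extend(run.evidence)
    return report


runs = [
    RunRecord("test_submit_claim", "REQ-101", True, ["runs/42/submit.png"]),
    RunRecord("test_track_claim", "REQ-101", True, ["runs/42/track.mp4"]),
    RunRecord("test_pay_premium", "REQ-205", False, ["runs/42/pay.png"]),
]
report = release_report(runs)
for requirement_id, entry in sorted(report.items()):
    status = "verified" if entry["verified"] else "NOT verified"
    print(f"{requirement_id}: {status} by {len(entry['tests'])} test(s)")
```

Because the report is built from the run records themselves, it can be regenerated for any release without anyone reconstructing evidence by hand.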

How AI Development Changes Which Dimension Matters Most

Three things shift when a significant share of the codebase is being written or rewritten by AI tools.

Verification Depth Matters More

When individual functions are being rewritten frequently, unit-level test coverage becomes less reliable as a signal. End-to-end behaviour coverage is the layer that survives code rewrites because it checks the outcome, not the implementation.

Adaptation Speed Becomes Essential

When refactors happen weekly, a test suite that needs manual repair after each one is not sustainable. Adaptation speed is no longer a nice improvement. It is a basic requirement for keeping the suite useful.

Defensibility Becomes a Compliance Question

Regulators in financial services, healthcare, and other regulated industries are starting to ask teams to show evidence of what was tested, not just confirm that testing happened. Evidence captured during testing is the only reliable way to answer that question.

Composability multiplies the value of the other three. Without it, every investment in depth, speed, and evidence has to be made again for each new application or environment. With it, the work builds on itself.

How Virtuoso QA Supports Each Dimension

1. Verification Depth

Tests in Virtuoso QA are written in plain English against the expected behaviour of a customer journey. The default starting point is the workflow the customer follows, not the selector on the screen. Behaviour coverage is a natural output of how the platform works.

2. Adaptation Speed

Self-healing in Virtuoso QA handles selector changes, layout shifts, and DOM restructuring automatically when the application changes. Tests keep verifying the journey they were written for even when the code underneath changes. Every automatic fix is logged and reviewable.

3. Composability

Virtuoso QA uses composable test libraries so a journey built once can be reused across releases, environments, and applications. Coverage built in one quarter carries forward rather than being rebuilt each time.

4. Defensibility

Every test run in Virtuoso QA produces a full record including step-by-step traces, screenshots, video, and traceability links. Release reports are produced automatically and written in a format that any stakeholder can read without needing an engineer to explain them.

A mature test automation practice means verifying what matters, keeping up with how the application changes, reusing the work, and being able to prove it all on demand. Those four things are what Virtuoso QA is built around.

Frequently Asked Questions

What are the five levels of test maturity?

The five levels are Initial (testing is ad hoc), Managed (basic processes are in place), Defined (testing is integrated into the development process), Measured (data drives decisions), and Optimisation (continuous improvement is built into how the team works).

What is TMMi?

TMMi stands for Test Maturity Model Integration. It is a widely used framework maintained by the TMMi Foundation. It defines five maturity levels and the specific disciplines expected at each one. Teams can be formally assessed and certified against it.

Why is the classical test automation maturity model being questioned now?

Five issues come up regularly. The model was built for stable codebases, but AI tools now change code much faster. It rewards having good processes rather than producing good outcomes. It treats quality capability as a single linear scale when it is actually several different things. It measures whether inputs exist rather than what they produce. And it was designed for QA as a cost-reduction function rather than as a trust and governance function.

Should every team aim for Level 5?

Not necessarily. A team that has invested heavily in process documentation but cannot keep tests aligned with weekly code changes is spending more effort than it gets back in confidence. A better target is to be strong on all four dimensions to a level that matches the actual business risk, starting with verification depth and adaptation speed.

Is a test automation maturity assessment still worth doing?

Yes, when it is used as a diagnostic tool rather than a certificate. A maturity assessment is useful when it identifies specific gaps and helps the team decide where to invest. It becomes counterproductive when it produces a single score that hides more than it reveals or when the team focuses on improving the score rather than improving the practice.

What is the difference between TMMi and TPI Next?

TMMi is built around five sequential levels. Teams work through them in order. TPI Next uses a set of key areas that can each be improved independently without following a fixed sequence. Both are traditional frameworks that share the limitations described in this guide when applied to fast-moving, AI-accelerated development environments.
