AI-driven mainframe modernization

Using AI to accelerate legacy COBOL mainframe modernization with confidence

Challenge

For decades, a federal healthcare agency has relied on a large COBOL-based system to process national healthcare claims at scale. Millions of Americans depend on this system every day, even though few ever see it. To ensure long-term reliability, the agency launched a multi-year payment systems modernization program to update its claims processing systems. One initiative within that program, the claims adjudication modernization project, is being implemented by Flexion. At the core of the system sit decades of payment rules and policy decisions embedded in millions of lines of COBOL and JCL, with the code itself serving as the primary source of truth.

As modernization efforts moved forward, a fundamental risk emerged. Engineers who once understood both the COBOL language and the policy intent behind the code were retiring. What remained was a system everyone relied on, but few fully understood. For the agency, this created higher costs, longer timelines, and the risk that critical payment rules could be misinterpreted or lost. Engineers faced uncertainty when making changes, while business and policy stakeholders struggled to confirm whether system behavior still matched intent. Answers existed, but they lived deep inside legacy code written decades ago.

This challenge could not be solved through code translation alone. Modernization required restoring understanding before any code changed. Teams needed a way to make the system explain itself in clear, human terms, including what each program does, what data it uses, and how components depend on one another. Without this shared understanding, modernization would remain slow, risky, and dependent on scarce expertise.

To move from hypothesis to evidence, Flexion launched a focused two-week proof of concept. The goal was to determine whether AI could make legacy COBOL systems explainable at scale, reducing risk and uncertainty before modernization began.

The hypothesis was simple. If large language models (LLMs) could reliably explain what legacy COBOL programs do, teams could reduce modernization risk before touching production code. The proof of concept was designed to operate safely outside production systems. The goal was not to automate modernization, but to answer a focused question: Can AI help engineers and business stakeholders understand complex COBOL systems well enough to make confident, informed modernization decisions?

Approach

Flexion approached the proof of concept with the same rigor applied to production modernization efforts. A small, cross-functional team from the claims adjudication modernization project worked in a tightly scoped two-week window, operating outside production systems to ensure safety. The team aligned daily on goals, acceptance criteria, and results to move quickly without sacrificing discipline or traceability.

The work focused on a single technical objective: determining whether large language models could reliably explain legacy COBOL programs in plain language. Success was defined by clarity and consistency, not automation. The goal was to generate structured explanations that described what a COBOL program does, what data it reads and updates, what outputs it produces, and how it depends on other components. In large legacy systems, this level of understanding is the primary barrier to modernization.

Because machine learning outcomes are probabilistic rather than deterministic, the team established objective evaluation criteria early. Rather than building a custom dataset, which would exceed the proof of concept timeline, the team selected an existing COBOL-to-natural-language dataset from Hugging Face to evaluate model performance. While a client-specific dataset remains necessary for production use, this approach allowed rapid, repeatable testing within the time constraints.

To automate evaluation, the team developed a Python-based pipeline to run candidate models against the dataset and score results using ROUGE (Recall-Oriented Understudy for Gisting Evaluation), a family of metrics that measures text overlap between a generated explanation and a reference. This provided a consistent way to compare models and select those best suited for explanation rather than code generation. Ten models were evaluated, revealing meaningful differences in performance. Some models excelled at code assistance, while others performed better at reasoning and translation. IBM’s Granite Code and Wizard Coder emerged as the strongest candidates, with Wizard Coder selected for its performance, output quality, and execution efficiency.
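The scoring step described above can be illustrated with a minimal sketch. It is not the team's pipeline: it assumes a simplified ROUGE-L (longest common subsequence over lowercased, whitespace-split tokens), and the model names and sample texts are hypothetical. Published ROUGE implementations tokenize and stem differently, so their scores will not match this sketch exactly.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> dict[str, float]:
    """Simplified ROUGE-L precision, recall, and F1 for one text pair."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def rank_models(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Sort model names by ROUGE-L F1 against one reference, best first."""
    scores = {name: rouge_l(reference, text)["f1"] for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical reference explanation and model outputs, for illustration only.
    reference = "reads the claims file and writes approved claims to the payment file"
    outputs = {
        "model_a": "reads the claims file and writes approved claims to the payment file",
        "model_b": "this program prints a report",
    }
    for name, score in rank_models(reference, outputs):
        print(f"{name}: {score:.3f}")
```

In a fuller evaluation, scores like these would be averaged over every example in the dataset before models are compared, since a single pair says little about overall quality.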

Initial experiments showed that prompt-only approaches produced inconsistent results. To address this, the team introduced a defined JSON schema to constrain model output. This shift significantly improved the consistency, structure, and usefulness of the explanations, making them easier for engineers and stakeholders to review and validate.
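The case study does not publish the schema itself, so the following is only a sketch of how a fixed output shape can be enforced. The field names are assumptions drawn from the explanation goals stated earlier (what the program does, what data it reads and updates, what outputs it produces, and what it depends on), the program and file names are hypothetical, and the check is hand-rolled with the Python standard library rather than a JSON Schema validator.

```python
import json

# Hypothetical field names and types; the team's actual schema is not published.
EXPECTED_FIELDS = {
    "program_name": str,
    "summary": str,
    "data_read": list,
    "data_updated": list,
    "outputs": list,
    "dependencies": list,
}


def parse_explanation(raw: str) -> dict:
    """Parse a model response and reject anything outside the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")

    missing = sorted(set(EXPECTED_FIELDS) - set(data))
    if missing:
        raise ValueError(f"missing fields: {missing}")
    unexpected = sorted(set(data) - set(EXPECTED_FIELDS))
    if unexpected:
        raise ValueError(f"unexpected fields: {unexpected}")

    for field, expected_type in EXPECTED_FIELDS.items():
        value = data[field]
        if not isinstance(value, expected_type):
            raise ValueError(f"field {field!r} must be {expected_type.__name__}")
        # List fields are expected to hold plain strings such as file names.
        if expected_type is list and not all(isinstance(item, str) for item in value):
            raise ValueError(f"field {field!r} must contain only strings")
    return data


if __name__ == "__main__":
    # Hypothetical model response, for illustration only.
    response = json.dumps({
        "program_name": "CLMPAY01",
        "summary": "Reads adjudicated claims and writes payment records.",
        "data_read": ["CLAIMS-FILE"],
        "data_updated": ["PAYMENT-FILE"],
        "outputs": ["PAYMENT-REPORT"],
        "dependencies": ["CLMVAL01"],
    })
    print(parse_explanation(response)["summary"])
```

Rejecting malformed responses at this boundary means reviewers see explanations in one predictable shape, and a failed check can trigger a retry instead of passing free-form text downstream.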

All supporting code for the proof of concept was generated with assistance from a generative AI coding agent, Claude Code. Engineers reviewed and refined the output, allowing the team to focus on experimentation and problem-solving rather than boilerplate implementation. This approach enabled rapid iteration and kept costs low while maintaining engineering oversight.

This approach demonstrated that AI-assisted understanding of legacy COBOL systems is both feasible and practical when applied with clear goals, structured output, and objective evaluation. By the end of the two weeks, the team had shown that AI could reliably turn complex COBOL programs into clear, structured explanations that engineers and business stakeholders could immediately use.

Outcomes

The proof of concept validated the core hypothesis. LLMs can meaningfully accelerate modernization by improving understanding of legacy COBOL systems without increasing risk or cost. The team successfully generated plain-language, structured descriptions of COBOL programs that explained system behavior, data usage, outputs, and dependencies in a form teams could trust and act on.

What changed for clients?

Shared understanding replaces guesswork: Before this work, understanding legacy COBOL systems depended on scarce expertise and manual analysis. The proof of concept showed that AI can translate complex code into clear explanations that engineers and business stakeholders review together. This common understanding becomes the foundation for every modernization decision and removes reliance on a shrinking pool of specialists.

Lower risk before any code changes: By making system behavior explainable early, teams can validate intent before modernization begins. Business and policy stakeholders confirm that rules match expectations, while engineers gain clarity on what must be preserved. Risk shifts left, to the point where it is cheaper and safer to address.

A practical and trustworthy use of AI: The approach demonstrated responsible AI use in a high-stakes environment. Structured outputs, deliberate model selection, and objective evaluation produced results that teams could rely on. AI supported engineering judgment rather than replacing it, accelerating insight while keeping humans in control.

A repeatable path forward: Most importantly, the proof of concept established a scalable capability. Teams can automate understanding across large legacy systems, define modernization targets with confidence, write tests earlier, and move forward knowing they are modernizing the right behavior.

This proof of concept demonstrates how Flexion approaches modernization for systems at federal healthcare program scale and beyond. Many federal agencies rely on large, policy-driven mainframe systems where risk, complexity, and public impact demand caution. Flexion applies AI only where evidence shows it reduces uncertainty and improves outcomes, using disciplined experimentation to validate ideas before they reach production. Rather than leading with tools, Flexion leads with understanding, helping agencies modernize critical systems with confidence, clarity, and control.


Client

A federal healthcare agency

Project Name

Claims adjudication modernization project

Duration

2 weeks

Tech Stack

Testing

Pytest

Version Control and Collaboration

GitHub

CI/CD

GitHub Actions

Application Development

Python
UV

AI Model Development

Wizard Coder (primary)
IBM Granite Code (evaluated)
OpenAI GPT models
Anthropic Claude Sonnet and Opus
DeepSeek
Hugging Face
Ollama
OpenAI APIs

Generative AI Development Tools

Claude Code
