A team of computer science researchers at the University of Massachusetts Amherst has introduced a method named Baldur to significantly reduce software bugs and improve code verification. Combining large language models (LLMs) with the existing Thor tool, Baldur achieved an efficacy rate of nearly 66%.

Traditional manual methods for code verification are error-prone and impractical for complex systems, while machine-checking, a more rigorous approach, is laborious and time-consuming. To address these limitations, the researchers developed Baldur, which uses an LLM called Minerva, trained on mathematical scientific papers and fine-tuned on proofs written in the Isabelle/HOL language. Baldur works with a theorem prover in a feedback loop: when a generated proof fails, the prover's error message is fed back to the model so it can attempt a repair. Integrated with Thor, Baldur automatically generated correct proofs for 65.7% of theorems.
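The generate-and-repair loop described above can be sketched in outline. This is a minimal illustration, not Baldur's actual implementation: `generate_proof` and `check_proof` are hypothetical stand-ins for an LLM call and an Isabelle/HOL proof checker.

```python
def prove_with_repair(theorem, generate_proof, check_proof, max_attempts=3):
    """Try to prove `theorem`, feeding checker errors back to the model.

    `generate_proof(theorem, error)` stands in for the LLM: given the
    theorem statement and the previous attempt's error message (or None
    on the first try), it returns a candidate proof script.
    `check_proof(theorem, proof)` stands in for the theorem prover: it
    returns (ok, error_message).
    """
    error = None
    for _ in range(max_attempts):
        proof = generate_proof(theorem, error)   # model proposes a proof
        ok, error = check_proof(theorem, proof)  # prover verifies it
        if ok:
            return proof                         # machine-checked proof
    return None                                  # give up after max_attempts
```

The key design point is that the prover's error message becomes part of the model's next prompt, so each attempt is conditioned on what went wrong before rather than being an independent guess.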

The researchers claim Baldur represents the most effective means for software verification, earning them a Distinguished Paper award at a recent conference. The project received support from the US Defense Advanced Research Projects Agency and the US National Science Foundation.



Whilst the UMass Amherst team is to be congratulated and this appears to be a significant improvement over current methods, the "elephant in the room" might be what the remaining approx. 34% failure rate says about the robustness (or lack of it) of software in the 21st century, about relying on such software without manual checks, and about using AI-generated software systems with little or no human input.

We may, of course, also wonder whether AI and LLMs are being developed to find exploits in existing systems ..

Technology is neither good nor bad, nor is it neutral.

Melvin Kranzberg