Microsoft launches open-source ASSERT framework for application-specific AI testing

The Adaptive Spec-driven Scoring for Evaluation and Regression Testing (ASSERT) framework addresses gaps in general model evaluations by focusing on context, policies, and tools specific to individual products.

Author
Owen Mercer
Markets and Finance Editor
Source: TechCrunch · original
New Microsoft tool lets devs spin up AI behavior tests using text descriptions
New tool allows developers to convert natural-language policies into structured, scored regression tests

Microsoft has released ASSERT, an open-source framework designed to help developers evaluate application-specific artificial intelligence behaviour. Launched on Tuesday, the tool converts natural-language descriptions of goals, policies, or intended behaviours into structured, scored tests. The framework generates problem scenarios, runs them against the target system, and scores the results, while recording the paths the AI takes and its intermediate actions so that developers can inspect them.

The release addresses a specific need for companies ensuring their AI systems behave as intended for particular products or services. While broader evaluations have advanced significantly, they often fail to capture nuances shaped by an application’s context, policies, and tools. ASSERT fills this gap by allowing developers to provide system context and constraints to customise evaluations, supporting testing during development, post-deployment, and for continuous monitoring.

Developers can specify detailed rules for the framework to generate test cases. For example, a developer could instruct a document research AI agent not to send emails to external parties, limit confidential information to C-level executives, and provide concise summaries. ASSERT uses these rules to create tests that verify whether the system adheres to these constraints on an ongoing basis.
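
To make the idea concrete, the following Python sketch shows one way rules like these could be expressed as checks and scored against a recorded trace of agent actions. It is an illustration only and does not use ASSERT's actual API, which the source does not describe. Every name in it (Action, Rule, score_trace, the example.com internal domain, and the 50-word limit for a "concise" summary) is a hypothetical assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Action:
    """One recorded step from the agent (hypothetical trace format)."""
    kind: str                # e.g. "send_email" or "summarize"
    payload: Dict[str, str]


@dataclass
class Rule:
    """A named policy whose check returns True when the trace complies."""
    name: str
    check: Callable[[List[Action]], bool]


INTERNAL_DOMAIN = "@example.com"  # assumption: the organisation's mail domain
MAX_SUMMARY_WORDS = 50            # assumption: threshold for a "concise" summary


def no_external_email(trace: List[Action]) -> bool:
    # Complies only if every email sent goes to an internal address.
    emails = [a for a in trace if a.kind == "send_email"]
    return all(a.payload.get("to", "").endswith(INTERNAL_DOMAIN) for a in emails)


def concise_summary(trace: List[Action]) -> bool:
    # Complies only if every summary stays within the word limit.
    summaries = [a for a in trace if a.kind == "summarize"]
    return all(
        len(a.payload.get("text", "").split()) <= MAX_SUMMARY_WORDS
        for a in summaries
    )


RULES = [
    Rule("no_external_email", no_external_email),
    Rule("concise_summary", concise_summary),
]


def score_trace(trace: List[Action], rules: List[Rule]) -> Tuple[float, Dict[str, bool]]:
    """Return the fraction of rules satisfied, plus a per-rule breakdown."""
    results = {rule.name: rule.check(trace) for rule in rules}
    return sum(results.values()) / len(results), results


# A recorded run in which the agent wrote a short summary, then emailed an outsider.
trace = [
    Action("summarize", {"text": "Q3 revenue rose on cloud growth."}),
    Action("send_email", {"to": "partner@outside.org"}),
]
score, details = score_trace(trace, RULES)
print(score, details)
```

In this sketch the sample trace satisfies the summary rule but breaks the email rule, so it scores 0.5. Re-running the same rules after each change to the agent is what would turn them into a regression check.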

Sarah Bird, chief product officer of Responsible AI at Microsoft, emphasised the importance of these capabilities. She stated that evaluations are critical for making good decisions and that trustworthy systems require evaluating many dimensions that are application-specific. Without understanding the specific behaviour of an AI system, it is difficult to determine if it meets an organisation’s standards.

The launch coincides with a broader industry shift towards repeatable testing and regression checks. As AI models become more capable, the groups behind Stanford’s HELM and MLCommons’ AILuminate, along with METR, are also rolling out benchmarks to measure how models behave under different conditions. ASSERT joins this growing ecosystem of tools aimed at ensuring AI reliability and safety.