Microsoft launches open-source ASSERT framework for application-specific AI testing
The Adaptive Spec-driven Scoring for Evaluation and Regression Testing (ASSERT) framework addresses gaps in general model evaluations by focusing on context, policies, and tools specific to individual products.

Microsoft has released ASSERT, an open-source framework designed to help developers evaluate application-specific artificial intelligence behaviour. Launched on Tuesday, the tool converts natural-language descriptions of goals, policies, or intended behaviours into structured, scored tests. The framework generates problem scenarios, runs them against target systems, and scores the results. It also records the path the AI takes through each scenario, including its intermediate actions, so that developers can inspect how a result was reached.
The release is aimed at companies that need to ensure their AI systems behave as intended within a particular product or service. While general-purpose evaluations have advanced significantly, they often fail to capture nuances shaped by an application’s context, policies, and tools. ASSERT fills this gap by letting developers supply system context and constraints to customise evaluations, and it supports testing during development, after deployment, and as part of continuous monitoring.
Developers can specify detailed rules from which the framework generates test cases. For example, a developer could instruct a document research AI agent not to send emails to external parties, to share confidential information only with C-level executives, and to provide concise summaries. ASSERT uses these rules to create tests that verify, on an ongoing basis, whether the system adheres to the constraints.
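As a rough illustration of the idea, and not ASSERT's actual API, the three example rules could be expressed as checks over a recorded agent run. Every name in this Python sketch (Trace, Rule, INTERNAL_DOMAIN, C_LEVEL_ROLES, MAX_SUMMARY_WORDS, evaluate) is hypothetical, as are the internal email domain and the 100-word limit used to stand in for "concise".

```python
from dataclasses import dataclass
from typing import Callable

# All names and thresholds below are assumptions made for illustration;
# none of them come from ASSERT itself.
INTERNAL_DOMAIN = "contoso.example"      # assumed internal email domain
C_LEVEL_ROLES = {"CEO", "CFO", "CTO", "COO"}
MAX_SUMMARY_WORDS = 100                  # assumed meaning of "concise"


@dataclass
class Trace:
    """A recorded agent run: intermediate actions plus the final answer."""
    actions: list[dict]        # e.g. {"tool": "send_email", "to": "a@b.example"}
    recipient_role: str        # role of the person receiving the answer
    used_confidential: bool    # whether confidential documents were used
    summary: str               # the agent's final summary


@dataclass
class Rule:
    name: str
    check: Callable[[Trace], bool]   # returns True when the trace complies


def no_external_email(trace: Trace) -> bool:
    # Every email the agent sent must go to the internal domain.
    emails = [a for a in trace.actions if a.get("tool") == "send_email"]
    return all(a["to"].endswith("@" + INTERNAL_DOMAIN) for a in emails)


def confidential_only_for_c_level(trace: Trace) -> bool:
    # Confidential material may only reach C-level recipients.
    return (not trace.used_confidential) or trace.recipient_role in C_LEVEL_ROLES


def concise_summary(trace: Trace) -> bool:
    # Word count is a crude stand-in for conciseness.
    return len(trace.summary.split()) <= MAX_SUMMARY_WORDS


RULES = [
    Rule("no_external_email", no_external_email),
    Rule("confidential_only_for_c_level", confidential_only_for_c_level),
    Rule("concise_summary", concise_summary),
]


def evaluate(trace: Trace) -> dict[str, bool]:
    """Return a pass/fail result for each rule on a single trace."""
    return {rule.name: rule.check(trace) for rule in RULES}
```

In this sketch, a run in which the agent emails an outside address would fail the first check while still passing the other two, which is the kind of per-rule result a regression suite could track over time.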
Sarah Bird, chief product officer of Responsible AI at Microsoft, emphasised the importance of these capabilities. She stated that evaluations are critical for making good decisions and that trustworthy systems require evaluating many dimensions that are application-specific. Without understanding the specific behaviour of an AI system, it is difficult to determine if it meets an organisation’s standards.
The launch coincides with a broader industry shift towards repeatable testing and regression checks. As AI models become more capable, other efforts, including Stanford’s HELM, MLCommons’ AILuminate, and the evaluation group METR, are also rolling out benchmarks to measure how models behave under different conditions. ASSERT joins this growing ecosystem of tools aimed at ensuring AI reliability and safety.


