SecureBio AI: 2025 in Review
2025 was a significant year for the SecureBio AI team: we tripled our headcount, adding to our roster of world-class research scientists and engineers. By deepening our interdisciplinary talent pool across virology, AI/ML, software engineering, and policy, we expanded our capacity to run multiple large projects in parallel. This allowed us to make major strides in turning AI-bio risk evaluation from a set of bespoke projects into something closer to an ecosystem.
A central focus of our work last year was building evaluation tools that move beyond “does the model know biology?” toward “does the model meaningfully expand harmful biological capabilities?” In practice, this meant building more rigorous benchmarks, developing deeper agentic evaluations, and integrating more directly with real-world safety pipelines. We increased the reach of our technical outputs by briefing senior decision-makers on how third-party evaluations are run and interpreted. This year we will intensify our work on mitigations, push the envelope on understanding frontier capabilities across agents and biological AI models, and work to deepen our understanding of how advances in AI translate to actual risk.
If you’re working on adjacent problems (especially evaluation standards, mitigation tools, or safety audit readiness), we’re always keen to compare notes and collaborate.
Benchmarks and Evaluations
Virology Capabilities Test (VCT) was our flagship effort in 2025. We designed and executed a large-scale benchmark and published the primary research paper, helping establish VCT as the leading reference point for AI-bio risk discussions and enabling large-scale expert/model comparison.1 In addition, we expanded coverage across different parts of the biological landscape with non-public benchmarks like the World Class Bio benchmark (WCB), the Molecular Biology Capabilities Test (MBCT), and the Human Pathogen Capabilities Test (HPCT), each aimed at capturing distinct slices of capability that matter for real-world misuse, not just textbook knowledge.
We also pushed further into agentic and long-form evaluation of biological AI models (BAIMs). The Agentic Bio-Capabilities Benchmark (ABC-Bench) and Agentic BAIM-LLM Evaluation Benchmark (ABLE) were designed to test whether agentic systems can complete key components of dual-use workflows, such as using biological AI models to redesign viral proteins. ABC-Bench shows that AI agents can increasingly undertake biosecurity-relevant tasks across both in-silico design and wet-lab experiments, while ABLE shows that agents can effectively utilize AI protein design tools, but remain inconsistent at applying their knowledge across a multi-step computational design workflow. Several of these efforts were presented at NeurIPS and used in multiple real assessments, helping inform discussions about how agentic systems change the risk landscape.2
Our benchmarks and evaluations have been cited in model cards or risk management frameworks for major releases from all the frontier labs, including Anthropic, Google DeepMind, Meta, OpenAI, and xAI. VCT was also referenced throughout the House Energy and Commerce Committee hearing on Examining Biosecurity at the Intersection of AI and Biology3, and received coverage in Time. It remains an open question how model performance on benchmarks translates to changes in the real-world risk landscape; addressing this uncertainty is a key focus of our 2026 efforts.
Mitigations, Cross-Team Collaborations, and Safety Pipeline Deliverables
A major milestone for us this year was not just researching the capabilities of models once they are released, but actually working with frontier labs to make models safer. We delivered training datasets and lists of dangerous biological topics for pretraining data filtering that were directly fed into the design of several frontier models. This work sits at the interface of research and operational safety: it is empirically grounded, hard to game, and compatible with how frontier labs actually train and deploy safeguards. It reduces models’ capabilities for catastrophic bio misuse while preserving their beneficial capabilities.
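To give a rough sense of the general shape of topic-based pretraining data filtering, here is a minimal sketch. Everything in it is illustrative: the pattern list, function names, and threshold are hypothetical placeholders, not the artifacts we delivered (which are expert-curated and not public).

```python
# Minimal sketch of topic-based pretraining data filtering.
# The patterns and threshold below are illustrative placeholders,
# not the filtering artifacts we delivered to frontier labs.
import re

# Hypothetical examples of flagged-topic patterns. Real lists are
# far larger, expert-curated, and non-public.
FLAGGED_PATTERNS = [
    re.compile(r"\breverse genetics\b", re.IGNORECASE),
    re.compile(r"\bgain[- ]of[- ]function\b", re.IGNORECASE),
]

def flag_score(document: str) -> int:
    """Count how many flagged-topic patterns a document matches."""
    return sum(1 for pattern in FLAGGED_PATTERNS if pattern.search(document))

def filter_corpus(documents, threshold: int = 1):
    """Split a corpus into (kept, held_out) by flag score.

    Held-out documents are routed to expert review rather than
    silently deleted, preserving beneficial content.
    """
    kept, held_out = [], []
    for doc in documents:
        (held_out if flag_score(doc) >= threshold else kept).append(doc)
    return kept, held_out

if __name__ == "__main__":
    corpus = [
        "A textbook chapter on enzyme kinetics.",
        "Protocol notes on reverse genetics for influenza.",
    ]
    kept, held_out = filter_corpus(corpus)
    print(f"{len(kept)} kept; {len(held_out)} held out for expert review")
```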
We also contributed to work on AI-bio jailbreaking mitigations, helping to characterize how safety systems fail under pressure and what kinds of mitigations appear most promising. Complementing this, we secured dedicated funding for operational security research and follow-on methods development. This reflects a broader shift in the field toward treating misuse prevention as an end-to-end systems problem rather than a single refusal metric.
The team also conducted an exciting collaboration with the NAO, helping to build an AI tool that triages metagenomic sequences flagged by the NAO’s detection system for further investigation. The tool analyzes concerning sequences, enriches them with relevant facts and context, and surfaces the most important ones for human-expert review. We’re excited to undertake further such work that leverages each team’s strengths.
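As a sketch of what this kind of triage flow looks like (all names and the scoring heuristic here are hypothetical; the real tool’s interfaces and scoring are not public):

```python
# Minimal sketch of a sequence-triage flow: enrich flagged hits with
# context, score them, and surface the top ones for expert review.
# Hit, enrich, score, and triage are hypothetical names for illustration.
from dataclasses import dataclass, field

@dataclass
class Hit:
    sequence_id: str
    sequence: str
    context: dict = field(default_factory=dict)
    score: float = 0.0

def enrich(hit: Hit) -> Hit:
    """Attach relevant facts and context to a flagged sequence (stub)."""
    hit.context["length"] = len(hit.sequence)
    # A real tool would add database matches, literature context, etc.
    return hit

def score(hit: Hit) -> Hit:
    """Assign a priority score (placeholder heuristic, not the real model)."""
    hit.score = float(hit.context["length"])
    return hit

def triage(hits: list[Hit], top_k: int = 5) -> list[Hit]:
    """Enrich and score flagged hits, returning the top_k for human review."""
    ranked = sorted(
        (score(enrich(hit)) for hit in hits),
        key=lambda h: h.score,
        reverse=True,
    )
    return ranked[:top_k]
```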
Funding, Delivering, and Scaling our Work
On the funding side, we secured a multi-year Coefficient Giving grant that enables longer-horizon planning, deeper technical investment, and hiring. We also received multiple grants from the Foundation Model Forum, including support for research into agentic AI, operational security work, and follow-on research from earlier pilots.
We delivered several major evaluation projects with frontier labs, including expert baselining, quality control for evals, and holistic prerelease assessments. We also moved toward a more sustainable model by licensing evaluations to multiple frontier labs, creating a durable pathway for our tools to be used in real decision-making contexts rather than remaining purely academic artifacts.
Finally, we broadened our government-facing portfolio with a US CAISI Bio R&D contract and participation in an EU AI Office contract to deliver bio-evals, both steps toward institutionalizing evaluation as part of emerging governance and standards ecosystems.
Strategy and Policy Developments
As our technical evaluation capacity grows, the question of “what should decision-makers do with these results?” becomes more pressing. We put substantial effort into ensuring our work was aligned with the institutions that shape audit expectations and safety norms.
We delivered a national security briefing on frontier model capabilities, helping bring empirical evaluation results into senior biosecurity decision-making contexts. The team also presented to export-control policymakers through the BIS Technology Advisory Committee and briefed US CAISI staff working on bio-related standards, both efforts to translate technical work into governance-ready inputs.
Looking Ahead
The through-line of 2025 was a shift from one-off evaluations to a mature ecosystem posture: credible benchmarks, agentic evaluation methods, mitigation artifacts that plug into real safety pipelines, and growing institutional relevance with governments and standards bodies.
In 2026, we plan to keep pushing in four directions:
Mitigation strategies that measurably reduce risk in deployed systems and in contexts where malicious actors can use multiple models in combination.
Deeper work on measuring and understanding the “hard cases”: agentic systems, integrated toolchains, frontier BAIM capabilities, and most challenging of all, super-expert capability uplift.
Systematic and routine evaluations that are fast, reliable, and decision-relevant.
Better understanding of how increases in AI model performance translate to changes in real-world risks.
If you’re building adjacent infrastructure or want to pressure-test your own evaluation/mitigation approach, please reach out.
Ben Mueller, Executive Director, and Seth Donoughe, Director of AI