Hey, I'm Anna. I'm currently an Anthropic safety fellow and a PhD student at Imperial College London. Previously (Summer '25), I was a MATS scholar with Neel Nanda. I'm interested in interpretability and what it can teach us about making language models safer.
Before I started working in AI safety, I studied Design Engineering, had a job designing and building experimental tilt-wing aircraft, and started a clean-tech company aimed at reducing household energy consumption. Outside research, I climb rocks and mountains.
News
- Sep. 2025 We had three papers accepted to NeurIPS workshops. I'll be in San Diego for the conference and the FAR Alignment Workshop.
- Sep. 2025 I was interviewed for a Financial Times piece discussing Emergent Misalignment and broader misalignment risks.
- July 2025 I'm helping to organise the NeurIPS 2025 Workshop on Mechanistic Interpretability. We're looking for reviewers: please volunteer here!
- June 2025 We had two papers accepted to ICML workshops. More info in the research section below!
- June 2025 Our work on emergent misalignment was featured in MIT Tech Review, alongside OpenAI's recent paper.
- May 2025 Our paper on "Inducing, Detecting and Characterising Neural Modules" was accepted to ICML.
Research
- [arXiv, blog-post] June 2025 Convergent Linear Representations of Emergent Misalignment. ICML 2025 Workshop on Actionable Interpretability
- [arXiv, blog-post] June 2025 Model Organisms for Emergent Misalignment. ICML 2025 Workshop on Reliable and Responsible Foundation Models
- [arXiv] May 2025 Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning. Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)