Mechanistic Interpretability in Large Language Models
🔍 Research Focus: Understanding the Inner Workings of LLMs
I am a Research Assistant at the University of Tübingen, working with Dr. Thomas Wolfers and Dr. Çağatay Yıldız on the mechanistic interpretability of large language models. My research centers on developing novel approaches to understand how these complex systems acquire, represent, and access knowledge.
📰 Recent News
- June 2025: New preprint released: "Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping" on arXiv
- April 2025: Paper "Investigating Continual Pretraining in Large Language Models" accepted at TMLR
- April 2025: Started a Research Assistant position at the University of Tübingen under Dr. Thomas Wolfers and Dr. Çağatay Yıldız
- March 2025: Successfully defended my Master's thesis, "Mechanistic Understanding of Factual Knowledge in LLMs," at the Bethge Lab
- 2024: Awarded the Deutschlandstipendium scholarship for outstanding academic achievement
🧪 Current Research
My work focuses on two complementary areas:
Knowledge Measurement & Evaluation: Developing contamination-free evaluation frameworks for domain-specific knowledge in LLMs, moving beyond traditional perplexity metrics to capture genuine domain understanding.
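For context, the sketch below shows the traditional perplexity baseline that such frameworks extend: the exponentiated mean per-token loss of a document under the model. The model choice ("gpt2") and the example sentence are placeholders for illustration, not parts of the actual evaluation pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small public model used purely for illustration; the framework itself
# is model-agnostic and does not depend on this choice.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Exponentiated mean per-token cross-entropy of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model compute the shifted LM loss itself.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("Mechanistic interpretability studies model internals."))
```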
Activation Engineering: Investigating how domain knowledge emerges as targetable directions in a model's activation space, enabling systematic control without fine-tuning.
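To make the idea concrete, here is a minimal, self-contained sketch of one common activation-steering recipe: a difference-in-means steering vector added back at inference time via a forward hook. This illustrates the general technique rather than my specific method; the model, layer index, prompt sets, and scale factor are all arbitrary choices for the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6  # arbitrary intermediate transformer block

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the output of block LAYER, last token."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output
    # lives at index LAYER + 1; shape (1, seq_len, d_model).
    return out.hidden_states[LAYER + 1][0, -1]

def mean_activation(prompts):
    return torch.stack([last_token_activation(p) for p in prompts]).mean(0)

# Toy contrastive prompt sets; the "domain direction" is the mean difference.
domain = ["The patient presented with acute", "The differential diagnosis includes"]
neutral = ["The weather this afternoon is", "My favorite holiday memory is"]
steering_vec = mean_activation(domain) - mean_activation(neutral)

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the steering vector at every position. The scale is a free knob.
    return (output[0] + 4.0 * steering_vec,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("In my experience,", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()  # restore the unmodified model
```

Scaling or negating `steering_vec` strengthens or suppresses the targeted behavior, which is what makes such directions "targetable."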
💡 Research Philosophy
Following Richard Feynman’s principle “What I cannot create, I do not understand,” my research aims to reverse-engineer the internal mechanisms of language models. By understanding how these systems process and represent knowledge, we can build more reliable, controllable, and interpretable AI systems.
🔬 Key Contributions
- Domain-Specific Evaluation: Created deterministic pipelines for contamination-free LLM evaluation using large-scale datasets (arXiv: 1.56M documents, M2D2: 8.5B tokens)
- Continual Learning: Investigated how model size affects knowledge acquisition and retention during continual pretraining across diverse domains
- Activation Engineering: Developed techniques to access latent knowledge through steering vectors and activation patterns
🎯 Impact & Applications
This research enables practical advances in:
- Model Efficiency: Targeted knowledge editing without full retraining
- Safety & Control: Direct model steering through activation engineering
- Robust Evaluation: Better understanding of what models truly know versus what they merely memorize
- Knowledge Transfer: Optimizing how models adapt to new domains
Research conducted at the Bethge Lab and the Vernade Lab, in collaboration with the Mental Health Mapping Lab at the University of Tübingen.