Mechanistic Interpretability in Large Language Models

🔍 Research Focus: Understanding the Inner Workings of LLMs

I am a Research Assistant at the University of Tübingen, working with Dr. Thomas Wolfers and Dr. Çağatay Yıldız on the mechanistic interpretability of large language models. My research centers on developing novel approaches to understand how these complex systems acquire, represent, and access knowledge.

📰 Recent News

  • January 2026: Extended benchmarking work submitted to TMLR
  • December 2025: Abstract submitted to OHBM 2026 on normative modeling for neuroimaging (acceptance expected late January)
  • August 2025: Started supervising PhD student Jijia Xing on extending the benchmarking framework to medical and mental health domains
  • June 2025: New preprint released: "Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping" on arXiv
  • April 2025: Paper "Investigating Continual Pretraining in Large Language Models" accepted at TMLR
  • April 2025: Started a Research Assistant position at the University of Tübingen under Dr. Thomas Wolfers and Dr. Çağatay Yıldız
  • March 2025: Received my Master's degree in Neural Information Processing (grade: 1.17) from the University of Tübingen; thesis "Mechanistic Understanding of Factual Knowledge in LLMs", completed at the Bethge Lab
  • 2024: Awarded Deutschlandstipendium scholarship for outstanding academic achievements
  • November 2023: Received the Best Presentation Award for the essay rotation "Large Language Models and Psychotherapy: Bridging the Gap with Mechanistic Interpretability" from the Graduate Training Centre of Neuroscience, Tübingen
  • 2022: Graduated with the Department Gold Medal and the Best Bachelor Thesis Award from the Physics Department, Indian Institute of Technology Roorkee

🧪 Current Research

My work focuses on two complementary areas:

Steering Vectors for Knowledge Access: I develop activation engineering techniques, using the tuned lens, logit lens, causal tracing, and activation patching, to localize domain-specific knowledge representations across model layers. Through hook-based interventions I analyze attribute extraction rates and layer-wise knowledge evolution, identifying targetable directions for systematic model control.
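
To illustrate the kind of hook-based intervention this involves, the sketch below adds a contrastive steering vector to one transformer block of GPT-2 at generation time. The model ("gpt2"), layer index, scaling factor, and prompts are illustrative assumptions, not settings from my experiments.

```python
# Minimal sketch of a hook-based steering intervention (illustrative assumptions
# throughout: GPT-2 as a stand-in model, arbitrary layer, strength, and prompts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed stand-in for the models actually studied
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6    # hypothetical target layer
ALPHA = 4.0  # hypothetical steering strength

def hidden_at_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the last token after block LAYER."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]  # index 0 holds the embeddings

# Contrastive steering direction: difference of activations for two prompts.
steer = hidden_at_layer("That movie was wonderful") - hidden_at_layer("That movie was terrible")

def steering_hook(module, inputs, output):
    # GPT2Block returns a tuple; element 0 holds the hidden states.
    return (output[0] + ALPHA * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok("I thought the restaurant was", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls are unaffected
```

The same hook mechanism underlies activation patching: instead of adding a direction, one overwrites the activation with a value cached from a different run and measures the effect on the output.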

Domain-Specific LLM Evaluation: I built a deterministic pipeline for contamination-free benchmark generation from raw corpora, tested on large-scale datasets (arXiv: 1.56M documents; M2D2: 8.5B tokens). I am currently extending this framework to medical and mental health domains, with a focus on safety-critical evaluation and data contamination effects.
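
The pipeline itself is described in the preprint; the toy example below only sketches the two ingredients such a pipeline relies on: determinism via content hashing rather than random state, and an n-gram overlap filter against a reference corpus as a crude contamination check. All function names, the masking heuristic, the n-gram size, and the threshold are illustrative assumptions.

```python
# Toy sketch: deterministic cloze-item generation plus a simple contamination filter.
import hashlib

def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(candidate: str, reference_ngrams: set, thresh: float = 0.2) -> bool:
    """Flag documents whose n-grams overlap too heavily with the reference corpus."""
    cand = ngrams(candidate)
    if not cand:
        return False
    return len(cand & reference_ngrams) / len(cand) > thresh

def cloze_item(document: str) -> dict:
    """Deterministically mask one sentence-final word, keyed by a content hash."""
    sentences = [s.strip() for s in document.split(".") if len(s.split()) > 5]
    if not sentences:
        return {}
    # Hash of the document picks the sentence: same input -> same benchmark item.
    idx = int(hashlib.sha256(document.encode()).hexdigest(), 16) % len(sentences)
    words = sentences[idx].split()
    return {"prompt": " ".join(words[:-1]) + " ____", "answer": words[-1]}

# Usage: build items only from documents that do not overlap the reference set.
reference = ngrams("text of the (assumed) pretraining corpus goes here " * 3)
corpus = ["Transformer layers route factual attributes through mid-depth MLPs. "
          "Later layers increasingly specialise in next-token prediction."]
items = [cloze_item(d) for d in corpus if not is_contaminated(d, reference)]
print(items)
```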

💡 Research Philosophy

Following Richard Feynman’s principle “What I cannot create, I do not understand,” my research aims to reverse-engineer the internal mechanisms of language models. By understanding how these systems process and represent knowledge, we can build more reliable, controllable, and interpretable AI systems.

🔬 Key Contributions

  • Domain-Specific Evaluation: Created deterministic pipelines for contamination-free LLM evaluation using large-scale datasets (arXiv: 1.56M documents, M2D2: 8.5B tokens)
  • Continual Learning: Investigated how model size affects knowledge acquisition and retention during continual pretraining across diverse domains (TMLR, 55+ citations)
  • Activation Engineering: Developing techniques to access latent knowledge through steering vectors, causal interventions, and activation pattern analysis
  • Layer-wise Knowledge Representation: Revealed that initial-to-mid layers are primarily responsible for attribute extraction, while later layers focus on next-token prediction, with implications for targeted fine-tuning and mitigating catastrophic forgetting (see the logit-lens sketch below)
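
A minimal logit-lens sketch of this layer-wise analysis, assuming GPT-2 as a stand-in model rather than the models or prompts from the paper: each layer's residual stream is decoded through the final layer norm and unembedding to see at which depth the correct attribute becomes the top prediction.

```python
# Logit-lens sketch (GPT-2 assumed as an illustrative model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

for layer, hidden in enumerate(out.hidden_states):
    # Project the last token's hidden state into vocabulary space (logit lens).
    logits = model.lm_head(model.transformer.ln_f(hidden[0, -1]))
    top = tok.decode(logits.argmax().item())
    print(f"layer {layer:2d}: top prediction = {top!r}")
```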

🎯 Impact & Applications

This research has practical applications in:

  • Model Efficiency: Targeted knowledge editing without full retraining; early stopping strategies during model training
  • Safety & Control: Direct model steering through activation engineering
  • Robust Evaluation: Better understanding of what models truly know vs. memorize
  • Knowledge Transfer: Optimizing how models adapt to new domains while preserving critical knowledge

🤝 Opportunities & Collaboration

I am actively seeking collaborations and am open to supervising research projects in mechanistic interpretability.

Research Collaboration: If you’re working on related topics in mechanistic interpretability, circuit discovery, activation engineering, or knowledge representation in LLMs, I’d be delighted to discuss potential synergies and collaborative opportunities.

Student Supervision: I welcome Bachelor’s and Master’s students for theses, rotations, and internships on topics including:

  • Knowledge localization and steering vectors in language models
  • Circuit discovery and causal interventions
  • Domain adaptation and continual learning mechanisms
  • Activation engineering for model control
  • Safety-critical evaluation in specialized domains

Please reach out via email at nitinsharma3150@gmail.com to discuss how we might work together.


Research conducted at the Bethge Lab, Vernade Lab, and in collaboration with the Mental Health Mapping Lab, University of Tübingen.