• Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping

    I claim that supervised fine-tuning of the existing largest LLMs is likely path-dependent: different random seeds and initialisations affect final performance and model behaviour. The claim rests on results from fine-tuning smaller LLMs: models pretrained close to convergence produce fine-tuned models with similar mechanisms, while models pretrained far from convergence do not. Since current large LLMs are very far from convergence at the end of pretraining, the latter regime is the relevant analogy. The argument is supported by linking together existing work on model souping, linear mode connectivity, mechanistic similarity and path dependence.
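The souping and linear-mode-connectivity results the claim builds on both reduce to simple operations in weight space: averaging two fine-tuned checkpoints, and walking the straight line between them to see whether loss stays low. A minimal sketch (toy NumPy parameter dicts standing in for real checkpoints; all names here are illustrative, not from the post):

```python
import numpy as np

def soup(weights_a, weights_b):
    """Model souping: average the parameters of two fine-tuned checkpoints."""
    return {k: (weights_a[k] + weights_b[k]) / 2 for k in weights_a}

def interpolate(weights_a, weights_b, alpha):
    """A point on the linear path between two checkpoints (alpha in [0, 1]).

    Two models are linearly mode connected if loss stays low for every
    alpha along this path; a loss spike mid-path is a 'barrier'.
    """
    return {k: (1 - alpha) * weights_a[k] + alpha * weights_b[k]
            for k in weights_a}

# Toy "checkpoints": tiny parameter dicts instead of real LLM weights.
wa = {"w": np.array([1.0, 2.0])}
wb = {"w": np.array([3.0, 4.0])}

souped = soup(wa, wb)                 # the soup is the alpha = 0.5 point
path = [interpolate(wa, wb, a) for a in np.linspace(0.0, 1.0, 5)]
```

In practice one would evaluate the actual task loss at each point on `path`; the post's claim is roughly that checkpoints fine-tuned from a near-converged base land in the same loss basin (no barrier, so the soup works), while checkpoints from a far-from-converged base need not.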
  • Generalisation in Reinforcement Learning

    Reinforcement Learning (RL) could power applications such as autonomous vehicles and robotics, but to fulfil this potential we need RL algorithms that work in the real world, not just in their training environments. However, reading RL generalisation research can be challenging: there is confusion about both the exact problem being tackled and the terminology used. To address this confusion, we've written a survey and critical review of the field of generalisation in RL. This post summarises that survey.
  • How can Interpretability Help Alignment?

    We've previously written about what interpretability research might be. In this post we consider how different kinds of interpretability research (even loosely formulated) can help different AI alignment research agendas and proposals. There are meaningful differences in the tools and research each agenda would benefit from, and we aim to make these differences clearer. This helps prioritise which kinds of interpretability research are likely worth doing.
  • What is Interpretability?

    In this post we lay out some ideas for framing interpretability research which we have found quite useful. Our framing is goal-oriented, which we believe is important for ensuring interpretability research is meaningful. We also go over several dimensions that are useful to consider when thinking about interpretability research. We wanted a shared vocabulary for talking about this kind of research, and found that these ideas helped us communicate effectively.