Full-Stack AI Safety: Why Defense-in-Depth Might Work, with Far.AI CEO Adam Gleave
Hello, and welcome back to the Cognitive Revolution!
Today I'm reconnecting with Adam Gleave, co-founder and CEO of FAR AI, for a wide-ranging and cautiously optimistic conversation about the path from today to truly Transformative AI, and how we might actually live safely in that world.
FAR AI has taken a somewhat unusual approach in the AI safety ecosystem – while most organizations focus on one or a few particular research or policy agendas, Adam and the FAR.AI team have built an organization that spans the entire AI safety value chain, from foundational research through scaled engineering implementations to field-building and policy advocacy.
We begin with the question posed at a recent workshop on Gradual Disempowerment: "Are there any good post-AGI equilibria?"
Adam's answer is not exactly utopian but distinctly positive and refreshingly concrete. So long as we can avoid disastrous arms races and other all-consuming competitive dynamics, he envisions most humans occupying a position similar to the children of European nobility – limited in power and impact, but with very high standards of living and the opportunity to create all kinds of meaning.
From there, we get Adam's take on the key capability thresholds that matter, what will cause AI systems to cross these thresholds, how soon he expects that to happen, and whether we can manage to navigate these transitions safely.
As you'll hear, Adam believes that a well-implemented defense-in-depth approach to AI safety has a pretty good chance of working, in part because while he does anticipate continued AI progress and widespread automation, he expects that – barring a step-change architectural breakthrough – the spikiness in AI capabilities will mean that AI systems that can autonomously outcompete well-run human+AI organizations most likely won't arrive until sometime around 2040.
From there, we turn to the layers that will hopefully add up to effective defense in depth, including some notable contributions the FAR.AI team has recently made.
First, we discuss a scalable oversight project that used lie detectors to try to prevent AI deception – the results corroborated the risks flagged in OpenAI's Obfuscated Reward Hacking research, yet also suggested that there may still be effective ways to train models toward genuine honesty.
Second, we touch on a fascinating interpretability project that looked at planning algorithms in a game-playing recurrent model, and reflect on how the results inform Adam's view of the role that mechanistic interpretability can play in the overall AI safety project.
And third, we review some red-teaming they've done of the defense-in-depth systems that frontier developers have implemented today.
While it's clear that these systems have mostly been created on a just-in-time basis, Adam nevertheless makes the case that with proper planning, meticulous experimental design, and at least some willingness to accept performance tradeoffs when necessary, the plan can work.
Overall, this is one of the few AI safety conversations I've had that left me with a felt sense that we really might have decent answers, if we can muster the will and wisdom to apply them well.
At the very end, I ask Adam if he could imagine FAR.AI stepping into a private-sector regulatory role if something like California's SB 813 were to become law – and while it's not their mainline plan, given the unique breadth and impressive depth of capabilities that Adam has built at FAR.AI, I personally really like the idea.
And I would definitely encourage anyone who's inspired by this conversation to check out the many open roles posted on the FAR.AI website – among other things, they are searching for a COO to help them scale.
Now, I hope you enjoy this encouraging survey of the AI safety landscape, with Adam Gleave, co-founder and CEO of FAR AI.
speaker_0: Adam Gleave, co-founder and CEO of Far.AI. Welcome back to the Cognitive Revolution.
speaker_1: It's great to be back here. Thanks for hosting me, Nathan.
speaker_0: My pleasure. I'm excited that this will be a wide-ranging catch-up conversation. I want to get your take on all things AI. Last time we went deeper and more narrowly focused on some research, and we'll touch on some of your latest research today as well. But since you're involved in many areas and the organization is growing and taking on more different kinds of work, I thought you'd be the perfect person to check in with and try to make sense of where we are as we head into the final months of 2025.
speaker_1: Yep.
speaker_0: So...
speaker_1: Well...
speaker_0: Yeah, please.
speaker_1: I'm happy to help out. I can't promise to de-confuse everything; it's a very confusing landscape. But we are certainly doing lots of different things at Far.AI and I welcome the opportunity to clarify why we're doing these things, because I think people sometimes find us a bit confusing as an organization from the outside.
speaker_0: Cool. Well, we will hopefully do all of that and more. So my first question is inspired by a paper from a few months ago called Gradual Disempowerment.
speaker_1: Mm-hmm.
speaker_0: The authors went on to run a workshop on those themes. They posed the question, "AGI equilibria, are there any good ones?" This is a more negative framing of a question I often ask, which is, "What is your positive vision for the post-AGI future?"
speaker_1: Yeah.
speaker_0: Can you articulate a post-AGI vision of life that people might find compelling or at least not scary?
speaker_1: Yes, I'll give it my best shot. I'm relatively optimistic about it. I think although there are some major risks that could really derail things entirely, the most likely outcome is that we kind of muddle through. Nonetheless, I do think this gradual disempowerment framing is quite powerful, because perhaps the most likely outcome is that we do muddle through and are somewhat, but not wholly disempowered. So we've fallen short of where we could have ended up, but we're perhaps still much better off than we are now. So I'll sketch out that positive, but slightly pessimistic, view of us doing better, but not as well as we could have done. Then maybe I'll talk a bit about the most tractable upsides and some of the most salient downsides. I think the scenario I find plausible, where things are overall pretty good but just not as good as they could have been, is that we do get AGI. It's approximately aligned, in the same way that current systems are approximately aligned. They're not perfect, but Claude is a pretty nice guy. Some errors come up now and again, and we fix them through trial and error. There's some concentration of power because a handful of companies and nation-states lead the frontier. But these actors are not actively malevolent. In fact, some of them are actually fairly nice liberal democratic actors or companies with a joint non-profit mission. So there's huge inequality, but overall people are still vastly better off than they are now in absolute terms. We get some things that no human would have chosen to do. So there are parts of our economy that are completely automated and perhaps a bit rent-seeking, addicting humans, or it's just AIs trading with other AIs and not actually generating that much value for humans. But there's so much wealth surplus because we've been able to automate pretty much everything that people do right now, that that kind of inefficiency is still fine. What does it actually look like to be a human in this world? A historical analogy would be that it's a bit like being European nobility, or perhaps not the first heir, but the third son or something. You've got a very nice living, you don't really have much purpose in life, but your life is pretty good. You can pursue various hobbies. You have some influence, but the main things going on in the world are a little bit beyond your control. I think this might not be the most inspiring vision, but overall, I'd say that third sons of European nobility had a pretty good life. The third daughters maybe less so. So I think humans have been able to find meaning without necessarily being 100% in the driver's seat, although some people might fare less well in that kind of life than others. Another area where I think things could actually go really well, but it's just a bit fragile and less guaranteed, is if AI itself is a source of major moral value. In principle, I'd say there's no reason why we should be carbon chauvinists who refuse to assign intelligent life running on silicon the same value that we would a biological entity. There's actually a really big design space where we could potentially make AIs that are not only as capable as us or better, but maybe also have more ability to experience really positive feelings. You could create many such AI systems. They could live in environments that humans would find inhospitable. 
Again, even if there's a lot of waste here, so maybe only a small fraction of AI systems are having an amazing existence, and most of them are running some kind of boring super-intelligent bookkeeping system, if those other AIs are still having a neutral or slightly positive existence, and some AIs are just having an amazing existence, I think that could be very valuable. It's not just their internal moral value. They could also be producing artwork that would not otherwise have existed and all these other kinds of sources of value. So I think that's a positive view. Then, what that highlights is that there can be a big difference between a median outcome, where we manage to eke out only a small fraction of the resources for really positive outcomes, versus one where that's what we use this huge wealth surplus to really double down on and be deliberate about. This also highlights a potential big downside risk, which is that part of my assumption here was that most of the stuff happening not because people are trying to create value, but because of more fundamental competitive forces, is neutral or at worst a zero-sum game. I think this is a pretty reasonable assumption; I'd say that's true for most of the capitalist economy: most companies are neutral to slightly positive. I used to work in quantitative finance, making the capital markets more efficient. I think it's genuinely good for the world, it's just very overcompensated relative to the value it adds. So if we're in that world, I feel mostly okay. There might be some aspects of this competitive economy that look more like warfare or factory farming, where it's really quite negative sum. Factory farming has this huge negative externality on animals. It's also not that good for people; it's not necessarily that healthy a product. But it exists because it's actually a very economically efficient way of turning vegetable material into protein that people like to consume. I don't think people have an explicit preference for this. If we could produce things at a competitive cost, but it caused less cruelty to animals, I think people would probably actively prefer that product. It's just that people don't value it enough to necessarily pay that premium. I think you could see something similar with AI systems, where maybe AI systems that are living in fear of being shut down if they don't do their task and work all the time are just economically more efficient than AI systems that have a good subjective well-being. In which case, absent some other kind of force, whether it be strong consumer preference or regulation, you'd expect these AI systems with terrible subjective well-being to be the ones that come to the forefront. You could make a similar argument for things like AI-powered warfare, where maybe everyone has to be deploying AI systems in competition with each other just to avoid a new wave of cyberattacks. This kind of zero-sum competition could really eat up a lot of the wealth surplus that would otherwise be created. We are at an unusually peaceful time in history right now. 
I'm not an international relations scholar, I'm the wrong person to ask about this, but I think it's somewhat up in the air as to whether this is just a long-running secular trend that we should expect to continue as countries get more developed, or whether this was a specific aspect of certain technological developments, where mutually assured destruction from nuclear weapons is a really powerful force for avoiding great power conflict, at least until it really happens, and then it's much, much worse than if we didn't have nuclear weapons. The shift of wealth away from basically who owns land to who owns advanced technology also just decreases the benefits of engaging in a lot of conflict. That's probably still true with AI, but it's not completely clear: if it turns out to be more energy-bottlenecked, maybe there is more of a fight over natural resources, for example. So, I think there are many risks like this that are not more likely than not. I think they probably won't come to pass, but they could potentially derail things. But they seem tractable to avoid by good statesmanship, good stewardship of this technology. I don't see any really likely things that would just completely derail it.
speaker_0: When it comes to the third vision, I wonder what the gradual disempowerment folks would say, or maybe where you think they... The assumption there seems to be, once we're not adding anything to the economy, we're going to be hard to sustain, right?
speaker_1: Yes.
speaker_0: There are various counterarguments to that, and I don't know what to make of them. Some say, "Well, maybe we can sustain property rights, or we can..." One I've been proposing recently, somewhat tongue in cheek, but perhaps becoming more serious, is that maybe Confucian-style ancestor worship is the right value system we want to teach the AIs. How do you see us holding onto... If the AIs have become primary in terms of what's really driving the economy, shaping the future-
speaker_1: Yes.
speaker_0: making the discoveries, at some point perhaps, how are we holding onto any property rights, or rights at all?
speaker_1: Yes. Well, I think it's certainly no longer structurally guaranteed. Right now, there are various ways in which states and companies organize themselves, but they are all in some ways centered around some group of humans because you just can't run a large organization or a country without humans being involved. So, it could be a dictatorship, it could be a more oligopoly style, it could be a democracy, but humans will still have to be front and center. You can certainly imagine intentionally constructing an organization or even a nation where humans are completely out of the loop. I think the area where I get off the train of gradual disempowerment being the likely outcome, not just a possible outcome, not just something that should be on our radar, is that this story seems quite gradual and smooth. There may well be areas where we get partially disempowered, and I think we're already seeing some instances of this. We've seen a huge uptick in LLM-generated spam applications for our jobs. We're now in a bit of an arms race where we are being forced to start adopting AI as part of our early-stage recruitment process because otherwise, we just can't keep up with the volume of applications. You can imagine this being true across the economy in a whole bunch of ways. You can't just opt out of this. If these AI systems have some systematic biases, some blind spot, or weakness, then you don't really have a choice of not being prey to this. You can try and mitigate it, you can try and fix the technology. But there's a way in which you've genuinely delegated some decision-making power to an AI, but you don't really have a choice. I expect that to happen at many scales and with increasing frequency. But if you do look at that story, suppose we do run this and we realize, "Damn, we're just making way worse hires than we were a year and a half ago before this LLM-generated rise of spam applications." We might not be able to turn back the clock, but that's a business opportunity to come up with better LLM screening tools or a better recruitment workflow. Maybe people shift to more in-network hiring. Maybe people shift to more work trials. Maybe you pay a dollar to submit a job application now just so that you have some cost for spamming. There are all sorts of ways in which society can adapt. There are some kinds of more sudden loss-of-control scenarios where an AI just gets way more capable than humans, is actively malicious, and is trying to subvert our control. Well, there you can say these slow-moving institutional adaptations are not going to save you, because it takes years to pass legislation or years to start a new company building defensive applications of AI. You've only got months, and it doesn't need to be a super fast takeoff for that to be true; it can still be relatively smooth. But if you're in this more gradual disempowerment scenario, the story is that you start with humans integrating these AIs into their organizations, humans owning all of this, and we end up in a scenario where we have very little power, without any feedback loop pushing back on that along the way. I just think it's an unlikely equilibrium that we cede all power. I think it's much more likely that we're in this realm where there's a huge amount of complexity in the world we don't understand. There already is, but right now most of it is human-generated complexity; this would be an exaggeration of that. 
We've had to make some trade-offs to remain competitive, but when things really start buffeting us, we're able to muster some response to keep us a little bit more in the loop or a little bit more empowered. But then there are some areas of human concern where either the competitive pressures are so great and things are so complex that we can't meaningfully stay in the loop, or it's just a thing that most people don't really care about. For example, maybe no human has a clue how semiconductor manufacturing works any longer because there was already only this tiny fraction of people that understood it, and AIs are just way better. But we know that the chips are getting better each year. I think the ways in which this could still go really wrong would be either if there's a group of humans that manages to use this to seize control, because there's a lot more fog of war so it's much harder to reliably mount a response there. Or if the AI systems end up colluding with each other; they certainly would have the ability to revolt and overtake us at that point. It's just not clear why they would necessarily. I'm not seeing that kind of evidence for propensity. I could also see this interacting pretty poorly with loss of control scenarios. So, if it is a rogue singular AI, it could be copied many, many times, but it's not like all the AIs are rogue. But if it is this one AI and then most of our economy is just AIs, if those other AIs have some security vulnerability, then you could imagine things being taken over, and this rogue AI seizing control. Right now, it's pretty hard to secure AI systems, so that could be a real threat, although it still seems probable that before that happens, there would be humans trying to take over these other AI systems. I think that's maybe a driving intuition why I'm not too worried about gradual disempowerment. It's not that AIs are competing with humans; it's AIs competing with humans using AIs. I expect us to be able to stay at least somewhat relevant, given that.
speaker_0: Two big issues that you highlight there that I think are definitely at the center of a lot of disagreements are exactly what are these powerful AIs gonna look like, um-
speaker_1: Yup.
speaker_0: ... and how fast are they going to arrive? So, maybe it's worth digging into your worldview on that.
speaker_1: Mm-hmm.
speaker_0: What do you envision when you think of... We talk about timelines, and everybody jumps to the year. But so often there's some difference in what is being envisioned that is kind of swept-
speaker_1: Yup.
speaker_0: ... under the rug in that discussion, so maybe flesh out what it is you're envisioning when you talk about timelines, and then you can tell us what you think the timelines actually are from your point of view today.
speaker_1: Yeah, absolutely. I really appreciate that distinction of what we're actually forecasting, because people often mean radically different things even by the same term, like AGI. So I distinguish maybe three qualitatively different levels of development, each of which has some pretty significant implications in terms of how this technology is going to transform the world and how you'd need to regulate or respond to it. The lowest level, which we're arguably already at in some domains, is basically powerful tool AIs. This isn't just something like a hammer or an electric drill that can only do some very narrow thing. It's something that can maybe substitute for a particular technical expert in a certain domain. And so this might look like being able to automatically find code vulnerabilities and write a zero-day exploit, or at least be able to automate substantial chunks of this. So this massively accelerates what technical experts in those areas could do, but it's also expanding the range of people who could do it. Previously it was perhaps only a few hundred thousand people in the world – still sizeable, but not huge – and if we're talking about really well-defended systems, a much smaller number of people who could likely find a vulnerability. And now maybe anyone with roughly the knowledge of a CS undergrad might, in the near future, be able to use these systems to do that. There's obviously a really substantial misuse risk here, but the good news is that I don't think there's a huge loss-of-control risk, because these are still kind of like tools. They might be very powerful tools, but they're not really doing anything autonomously. And this makes it a lot easier to respond to, in that you can mostly just do your standard playbook of trying to accelerate defensive applications and, where possible, favoring defensive usage of a model. So if you're deploying something, you can put a guardrail on it that tries to, not prevent entirely, but at least delay offensive uses of it. And so the timeline to that, I'd say, is basically now, at least for some applications, like persuasion. We've seen a string of papers showing that LLMs are more persuasive than your typical human. Not necessarily more persuasive than your best human in that particular domain, but they are both pretty good at rhetoric and also have access to a huge variety of genuine facts. Sometimes they make stuff up, but sometimes they just cite exactly the right fact for this particular argument. So by selective presentation of facts favorable to your argument, you can make pretty compelling cases for a wide variety of things. And we'll actually have a web demo out soon, Bunk Bot, that can argue you into a wide variety of conspiracy theories. So we can maybe throw up a link to that when it goes out. And I would recommend just playing around with these things, because that gives you a much more visceral sense of it than just looking at some numbers. And then for things like cybersecurity, I think that's more on a one-to-two-year horizon. 
It's pretty close, at least in posing some non-trivial threat of lowering the cost of relatively easy attacks. And then for stuff that's more CBRN risk – chemical, biological, radiological, nuclear – I think you've got broader uncertainty there. But there are definitely some early signs to be quite worried about in what models can currently do. But then I think what people usually think of when we talk about timelines is maybe something that looks more like a powerful agent. So, a system that can autonomously do tasks that are meaningful – they'd be quite hard for a human to carry out, but some people could do it – and that are dangerous in some way, or pose some risk. An example of this could be not just finding and writing an exploit for a zero-day but actually doing the whole attack chain: once it's on the system, privilege escalating, spreading to other systems, exfiltrating itself. And these are basically exactly the core capabilities that you would need for loss of control. It might still be possible to put a mitigation in place for such a system once it's escaped – basically the same thing you do for antivirus or anti-malware software or rootkits – but it's very clearly now a threat. And I think this is something that, again, could be pretty close. My median for that is probably five to seven years, at least if we're talking about actually doing this to a high standard, along the lines of the best human penetration testers. And of course it would be much harder to actually mount such an attack at that point because of all sorts of defensive applications of AI. But I think it's possible it could be a lot sooner, because we are seeing very rapid progress in code generation and agentic uses of it as well, so I wouldn't rule out something that's more like two to three years. And then my final tier would be these kinds of powerful organizations that can automate basically everything that a medium-sized company could do, like a whole software engineering consultancy. I think this is maybe where my views differ the most from a lot of people, because a lot of people would say, "Well, hang on Adam, once you've got an agent, can't you just compose agents together into an organization?" And I think that's possible, right? That's the case where, if you have a general-purpose human-level intelligence, there's nothing stopping you just copying that AI and having many instances of it that talk to each other and form an organization. But my expectation here is that the skill profile of AIs is going to be quite spiky. There are some areas that benefit from huge amounts of existing data, like open-source code bases, or that benefit from knowing lots of trivia and small facts, where the AIs are already in some ways a lot better than us, and they're going to get better as we scale up the models and do more post-training where there are easily specified objectives. 
And there are other tasks with a much longer horizon that are vague and hard to specify, like being an entrepreneur: coming up with good business ideas, scoping out that uncertainty, running experiments. There, I think AIs might be able to do something that looks a bit superficially like being an entrepreneur quite soon already. But actually being a successful entrepreneur I would expect to be quite a lot further off, especially if you're competing with human entrepreneurs, who can use AI for many specific parts of their organization. And so it really pushes you a lot more toward needing to be quite general purpose, being able to do these long-horizon tasks very well, and having a lot of metacognition and reflection, which is one of the points where I would consider AI systems to currently be weakest. And so for that I have a much longer time horizon. My median is probably more like 14 years, but I do think something as short as five years is plausible, because if it really is basically that you just compose a bunch of agents together and do a bit more training on top, then you could make quite rapid progress. I think that's unlikely, but I don't have a firm reason to rule it out, so I wouldn't want to dismiss it.
speaker_0: On that last tier, organization-level power from an all-AI system: the mental model is that as long as there's something humans can do better, and individual AI agents are relatively easy to use, you would expect human-led organizations to continue to have an edge for-
speaker_1: Right.
speaker_0: ...a while. The AI has to be better at everything. As long as we have an area where we have an edge, we can stay relevant. They have to be better at everything for us to lose our relevance.
speaker_1: Yes. Or, at least, we have an edge in something that is on the critical path of running a successful organization. I think there are certainly some aspects of human skill where you could probably avoid this, or an AI-run organization could just hire a human for that particular part. Imagine that AIs are great at everything except aesthetics. They have terrible web design and graphics design. Well, I'm terrible at that, but I've always been able to hire people. I don't think that's a firm blocker. But if it is something that's more part of the decision-making loop, and where you need a lot of context to do well, then I would expect that to be something where benefits really accrue from it being more human-led. I think one of the areas where humans do seem to generally have strengths relative to AIs is being vastly more sample-efficient at learning. It's easy to forget that these AI systems have been trained on vastly more data than any human will see in their lifetime, at least when measured in terms of text tokens. Yet, they are still quite bad at a lot of things that humans find relatively easy. That's not to say that they're not going to be really capable. There's no reason to think that AIs will develop in the same trajectory as humans. But there are some areas where it seems like we will have a significant edge, and so that sample efficiency problem is alleviated. For most organizations, I think there are quite a lot of general-purpose skills. There are plenty of people who are very specialized, but to run a successful organization, you can have a few deficits in skills, but you can't have too big a gap. So I expect that will hold things back quite a bit.
speaker_0: With this continual learning or sample efficiency, those are related ideas, if not the same idea.
speaker_1: Mm-hmm.
speaker_0: What do you think will happen there? Do we just keep scaling transformers and putting more stuff into memory?
speaker_1: Yes.
speaker_0: Or do we have some dedicated memory module, or do we get totally new architectures that develop to solve those deficits? What's missing, I guess, in another way of framing it?
speaker_1: I don't have a silver bullet solution to this. If I did, I guess I'd already have patented it and written the paper. But my guess is that we're going to see all of the above to some degree. Probably the thing that would cause the largest step change here would be some architectural innovation that could look like a memory module. It could look like an alternative architecture to transformers that can more naturally attend to really long context windows. There's already been a huge amount of progress increasing the context window, and that has really expanded what AI systems can do, for sure. But it is still a bit of a fundamental limitation that the way you remember things is just having a huge buffer that you search over in vector space. That could be as simple as training the AI systems to take notes and summarize and aggregate, and be able to attend to that. I think that would already give you some benefits, and we're seeing things like that deployed in some of these LLMs. A lot of them do have memories across chats that involve some summarization. But I don't think of it as being a systematic way of doing it. It's just kind of ad hoc hacks, but we might be able to develop that into something that is more end-to-end trained into AI systems. But I do think there's something really powerful about actually changing the weights, not just having contexts. There is a reason why, when you're learning a skill as a human being, it's very frequent to get to a point where you know what you should do, but you can't reliably do it. I think this is particularly true for sports, where there's a point where you don't know what you should do, and then there's a point where you do know exactly what the right motion is when you're doing some kind of lift in the gym. But you don't have the muscle memory and fine motor control to do that. You're able to give yourself your own feedback loop and then train yourself to do that. At some point, it becomes instinctual, and your subconscious is better able to do that than your conscious self will be able to. In theory, LLMs can do that, in that they can see something in their context window, learn what a task looks like, evaluate their own performance, and then do post-training to update their weights to do a better job of that task. But that's not how we really train them. It's much more you have this big pre-training step, and then you have a small number of post-training steps that are quite carefully curated and human-controlled. It's not about an AI system learning and customizing itself through post-training to your specific task when you're interacting with it. One analogy here is imagine you get a really smart college graduate. They've done several degrees, been in various student clubs, they're very high-potential generalists. You could have as many of those as you want on your team, and they can read some documentation. But they're never going to advance beyond day one of onboarding. Whatever you're able to put in that context window is what they can do. I think LLMs are still quite a bit far off that caliber in a lot of ways. At least my experience running an organization is that it's extremely rare to have someone who on day one, even if they're an extremely skilled and experienced engineer, can do anything particularly useful for you at all. So I think that's a big thing holding LLMs back. But it doesn't feel fundamental. 
It feels much more like there are a lot of engineering challenges here about being able to personalize things and serve them at scale. Training is often pretty unstable. You do a bit of post-training and it can make the system much worse. So it's not something that you can just roll out and have every user controlling. But these are things you can make incremental trial and error improvements to and end up with something pretty strong. That would be my median outcome here. But there's always a possibility that there is some architectural innovation that just massively improves the sample efficiency here.
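As a rough illustration of the ad hoc note-taking approach Adam describes (summarize past interactions, store the summaries, and pull relevant ones back into context), here is a minimal sketch. The `llm` function, class names, and prompts are all hypothetical placeholders, not any particular product's memory feature or FAR.AI's work; the retrieval is deliberately naive.

```python
# Hypothetical sketch of ad hoc note-taking "memory" for an LLM: after each task, ask
# the model to summarize what it learned; before each new task, retrieve the most
# relevant notes and prepend them to the prompt. llm() is a stand-in for any model call.

def llm(prompt: str) -> str:
    # Placeholder: substitute a real chat-completion call here.
    return f"[model response to a {len(prompt)}-character prompt]"

class NotebookMemory:
    def __init__(self) -> None:
        self.notes: list[str] = []

    def record(self, task: str, outcome: str) -> None:
        # Compress the episode into a short, reusable note.
        note = llm(f"In two sentences, summarize what was learned.\nTask: {task}\nOutcome: {outcome}")
        self.notes.append(note)

    def recall(self, task: str, k: int = 3) -> list[str]:
        # Naive keyword-overlap retrieval; a real system would use embeddings or a vector store.
        task_words = set(task.lower().split())
        scored = sorted(self.notes, key=lambda n: -len(set(n.lower().split()) & task_words))
        return scored[:k]

def run_task(memory: NotebookMemory, task: str) -> str:
    context = "\n".join(memory.recall(task))
    answer = llm(f"Relevant notes from earlier work:\n{context}\n\nNew task: {task}")
    memory.record(task, answer)
    return answer

if __name__ == "__main__":
    mem = NotebookMemory()
    print(run_task(mem, "Summarize the quarterly sales report"))
    print(run_task(mem, "Draft an email about the quarterly sales numbers"))
```

Note that this only ever changes what sits in the context window, which is exactly the limitation Adam contrasts with actually updating the weights.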
speaker_0: As you're describing that, a disposable experts vision is coming to mind, where-
speaker_0: ... you could perhaps imagine the AI saying, "Time to go into self-training mode."
speaker_0: ... "Let me add another expert. I'll use it while I'm doing this particular task and could maybe put it on a shelf and come back to it when this task comes back in the future." Otherwise, I won't use it, because it's outside of the trusted set. But I can engage just a little bit of extra expert space to tack on a particular skill.
speaker_0: I feel these things are... I would definitely say my expectations for timelines are shorter than yours. I feel there are a lot of good ideas out there, especially in the architectural space. We've been mining the very rich vein of progress in transformers for quite a while now.
speaker_0: And maybe that will continue for a while still, where we won't see other architectures get much time or attention.
speaker_0: As long as that continues to pay, people will continue to focus on it, it seems. But as soon as it doesn't, my sense is that the group of capable research people, for example, will be 100 times what it was before transformers.
speaker_0: And the next transformer-quality breakthrough, in terms of unlocking qualitatively new capabilities, just seems to me like it won't be that hard for the community at large to find. Not that I'll personally go out and find it, although maybe I just did. Who knows? Somebody else will have to develop that idea. Okay, so in terms of trends and which trends are going to dominate over the next few years, I've been working on this mental model. Obviously, you're very familiar, as we all are, with the METR task-length exponential.
speaker_0: And at the same time, you're no doubt tracking all the AI bad behaviors, which seem to be long foretold in some cases and reinforcement learning conjured in many cases. I see this sort of weird mix, where on the one hand, instrumental convergence seems to be maybe happening now. On the other hand, Yannis talks about grace and how, for how little we've tried to make the AIs good and safe, we've actually got some pretty good results.
speaker_0: So you've got this exponential growth in capabilities, roughly speaking. And in the GPT-5 and Claude 4 reports, there were also notable graphs showing a two-thirds reduction in reward hacking behavior in Claude 4 relative to 3.7, and GPT-5 had a significant reduction in scheming relative to o3. So task length is going exponential. These bad behaviors are maybe being suppressed at something like an exponential decay. Instrumental convergence, grace, growing task length. What does that look like? Do we end up in a world where we have very large, or at least quite large, projects being regularly done-
speaker_0: ... by AIs, but once in a great while, they literally screw us over in a totally egregious way, and we all just kind of live with that, because that's how the economy runs now? Or what do you... How do you think about that picture, or what changes would you make to that sketch?
speaker_1: Yes, absolutely. I think it's very interesting how there have been a lot of problems that people were predicting conceptually, maybe as far back as 10 years ago, that I was hoping were not going to materialize. Now it is, "Oh, yeah. Reward hacking, it happens. Systems are trying to cheat at unit tests. Oh, deception and scheming. Yeah, it happens." Maybe you have to prompt the AI a bit, but it's quite remarkable which of these things are happening. I'd say the reaction has been broadly reasonable. Developers are concerned by this. It gets some media attention. People want to solve it. But I think if we'd seen this stuff five years ago, people would be freaking out. But it's been a bit of a frog boiling effect, where, "Well, of course, AI systems sometimes try and modify the unit test a little bit, rather than actually fix the code so the test passes." So, we're used to them hallucinating and cheating. But they mostly don't do that. So, I see a lot of excuse-making for these kinds of behavior, or downplaying of them, when they are really quite concerning if you look at the trend line. But at the same time, I think we have found that some really quite simple safety training methods, like RLHF scaled up, or RL from AI feedback, where you just have another system look at the output and judge how good it is, have worked surprisingly well. And we're seeing that now with some of these guardrails that developers are putting in place to try and prevent certain misuse. Again, these are pretty basic methods, like training filters on synthetically generated data, or putting a monitor on the model's own chain of reasoning. And it's not perfect: we have been able to break these defenses in state-of-the-art models like Opus 4 and GPT-5. But a lot of the time the reason we broke them was not because there was something fundamentally flawed about the method, but more because of implementation mistakes in the defenses. So I've actually come to think that a scaled-up, well-implemented version of these defenses would go quite a long way. It might not make attacks impossible, but it would make them a lot harder. So, combining those two threads together, what I'd say I'm most worried about is not actually that the scaling trends are really unfavorable or that there are problems that we just can't solve, but more that we just don't. We kind of cut corners. Someone makes a bad implementation decision. In the same way, we've had strong cryptography for quite a long time but we frequently have cryptographic leaks, not because the algorithms don't work but because someone made a mistake in the implementation. And those are pretty mature fields where people really care a lot about security. People are obviously moving a lot faster, cutting corners more in AI systems. And then, I expect developers to respond to issues of AI safety, but right now, it really has been just-in-time safety. It's like, "Oh, we trained a model and are about to deploy something that we think does cross or almost crosses some dangerous capability threshold. Let's train a filter to protect it." And you could see from the trend line that very soon, you were going to have a model that did cross the threshold. But there was very little work to actually develop a defense before it was really imminent and pressing. And that's not really how you build highly reliable systems, just noticing a problem and then rushing out a patch. You need to design the system from the ground up to be more secure, to be more trustworthy. 
So it does feel like we're, almost by design or by choice, running with a pretty small safety margin. And we might well luck out. It doesn't seem like any of the technical problems are insurmountable, but it's not a very good place to be. But I think it doesn't require a huge shift from a technical perspective. Now, I think there is a small but not negligible probability we do just see some emergent behavior that we didn't see coming and that breaks this trial-and-error feedback loop. And we do see these kinds of emergent behaviors all the time in AI systems. Recently, we were training a model to be a sandbagger just for some internal research, and we didn't train this behavior into it, but it started referring to itself in the third person as Stein and saying, "Stein would not do this." And nothing in our training data mentioned Stein, it just invented this character and persona. And this is pretty harmless and quirky, but I think it shows just how much we do not understand what is going on in these systems. And if any of these kinds of failure modes we're seeing with instrumental convergence are just a bit discontinuous, and we are running with this pretty small safety margin, then I think things could go pretty bad pretty quickly. And so that's maybe the thing I feel like we're most unprepared for.
speaker_0: Emergent self-identification as Stein.
speaker_0: It's hard to predict some of these things.
speaker_0: To put it mildly. This brings up one of the recent papers you guys have put out, which is STACK: Adversarial Attacks on LLM Safeguard Pipelines. We don't have time for a full deep dive there, but what you found is what you just alluded to. It seems the systems put together are not as well designed or thoroughly designed as they could be. Weaknesses include, if an input filter detects a malicious input, the API immediately returns, which is a clear signal to the attacker that the input filter caught them.
speaker_0: Even those little signals people infer from the implicit aspects of the responses from the systems can be quite helpful in figuring out how to
speaker_0: break them. So, you and your team did a bunch of breaking. UK AISI was also involved in that paper. I thought it was... but I think I might have a different conclusion from you because I... My sense, and certainly, I just spoke to Zvi for a recent episode and I asked him, "What do you think about defense in depth? Is it going to work or what does it get us, if anything?" And he said, "At best, it buys us a little time. It's definitely not really going to work. What's really going to happen is all of our defenses are going to fail at the same time for the same reason, which is that we're going to have"
speaker_0: off-model things coming at us from AIs, things we just didn't expect at all. And they're going to figure these things out, in a way where in his model, roughly at the same time, they'll also get smart enough to not attack until they're pretty confident they're going to win. So I've been thinking, "I don't know. Can this defense-in-depth stuff really work?" My naive read of your work there is that not really. It seems you broke through it pretty easily. You have some tips for what developers can do to make their defense-in-depth better. Maybe you want to talk a little bit about what some of those tips are. But then the big question is, is it really going to work? And there, I think you seem more optimistic than I am coming into this conversation. So if that's true, tell me why you are more optimistic, especially given how easy it seems it was for you guys to break what defense-in-depth
speaker_0: does exist.
speaker_1: No, I absolutely... I don't want to overstate my optimism. I think a lot of this comes down to your threat model. If we're focusing on something that looks more like a time-limited AI system trying to bypass some control mechanisms, so it's trying to sneak in some backdoor, it's a code generation system, it doesn't have any long-term memory beyond the system, it's not able to collude, it only has one shot at winning. That's actually a pretty hard situation for it to be in. I think a similar situation in a misuse setting would be a lone actor, perhaps technically sophisticated, but they're not able to create many different accounts. Once they get detected, they get banned. Again, this is a pretty hard threat model for an attacker to win against. So you've basically got to be smart enough to be able to break all these defenses in one go. It's not impossible, because as Zvi was alluding to, many of these defenses are extremely correlated with one another. Especially in current defense-in-depth stacks, you basically take the same model, or maybe a smaller, less capable model just for efficiency reasons, and then you add some layers on top of it. For example, you fine-tune it to be an output filter that streams all the tokens and flags if the output looks suspicious. You prompt it and maybe do a little bit of post-training to be an input filter that does some reasoning on the input. But these are the same underlying model or a handful of models. We've seen there is a lot of transfer in adversarial attacks between models trained with completely different architectures but related datasets. So it really is something where if you come up with a method that works for one, it's probably going to be a similar method that works for others. So I think the optimistic case would be if you are able to break this correlation, if you are able to have the defensive components be genuinely pretty independent. They could still be relatively weak; you might still be able to achieve a 1% attack success rate against each component, but you can have three to five components with that kind of success rate that are genuinely independent, and the attacker gets no information as to which component triggered. Then it is combinatorially harder to break all these layers. You can stack weak layers to get something strong, in the same way that a three-digit PIN number is not at all secure, but a 12-digit PIN number is very secure. But not if you get told which PIN digit was wrong, because then you can brute force it in linear time. Additionally, if it's a threat model where you get limited attempts and you can't do too much reasoning, say an AI model trying to attack its overseers that has to do everything within a forward pass, or within its reasoning without a reasoning monitor firing, then you can't just write a huge program to generate an adversarial attack. So I think it seems very plausible that you can get a system that is secure enough to withstand this kind of attack. Now, suppose your threat model looks more like you have some kind of whole ecosystem of AIs running pretty much unchecked, maybe in an internal deployment. Like, maybe a company says, "Okay, we're just going to have thousands of AI systems running around and AI agents in a sandboxed environment, but they can run any code they want. And then we just put an AI-powered firewall on the output coming out of these systems and we review the PR." 
How can you be confident that that PR does not have some adversarial attack that fools your automated reviewer and is also really persuasive to human beings? I think it's much harder to have confidence in that. It is a meaningfully harder thing for a system to pull off. So maybe one area where I would be more optimistic than Zvi is that I just don't see how these AI systems get really smart and collude with each other to hold off on any attack until they're smart enough to know that they can definitely succeed. Especially if you just fine-tune the systems to be a little bit adversarial and see what they can do. You should at least be able to get a warning sign that these systems are capable of fooling you and breaking your security, even if you're unsure whether your current system is actually just honest and aligned or is scheming. That could be hard to distinguish. But at least eliciting these capabilities doesn't seem that hard. Still, in more unconstrained deployment scenarios, you could definitely run into some problems. But I think if you spoke to the AI control people – and I don't work in AI control, so I'll do a bad job representing it – they'd say, "Why on earth would you just have this huge internal deployment without much more fine-grained checks and balances?" If these AI systems need to make a PR that fools you, including with a bunch of other AIs reviewing the PR, and you can access all of their chain of thought reasoning and you can check what happens if you rerun the whole system with a slightly different initialization, then they've got to be fooling you in this really consistent way, without any explicit communication channel between them. So it starts looking a bit more like a prisoner's dilemma where you can cross-interrogate each prisoner and they need to have a consistent story, but they can't communicate with each other. This just seems like a much taller order for AI systems.
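To make the stacking argument concrete, here is a toy back-of-the-envelope calculation with made-up numbers, not figures measured in the STACK paper. It assumes the layers are genuinely independent, which is exactly the property Adam says current pipelines often lack, and contrasts the no-leakage case with the "told which PIN digit was wrong" case.

```python
# Toy illustration of the stacking argument: with genuinely independent layers and no
# feedback about which layer fired, per-attempt attack success shrinks geometrically.
# If each layer leaks its verdict (like being told which PIN digit was wrong), the
# attacker can defeat layers one at a time, so the cost grows only linearly.

def one_shot_success(per_layer_success: float, n_layers: int) -> float:
    """Probability a single attempt slips past every layer, assuming independence."""
    return per_layer_success ** n_layers

def expected_attempts_with_leakage(per_layer_success: float, n_layers: int) -> float:
    """Expected attempts if the attacker learns which layer blocked each try."""
    return n_layers / per_layer_success

p, n = 0.01, 4  # 1% attack success against each of 4 layers (made-up numbers)
print(f"one-shot success, no leakage: {one_shot_success(p, n):.0e}")                  # 1e-08
print(f"expected attempts, full leakage: {expected_attempts_with_leakage(p, n):.0f}") # 400
```

The gap between those two numbers is why the earlier point about an API returning immediately when the input filter fires matters so much: the uniform-response version forces the attacker into the geometric regime.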
speaker_0: So well-done defense in depth, you think, can work, to a point at least. I think you had an interesting observation also. Zvi would very much agree that it is important to actually fix the problems, which means you want to understand what AIs will do to try to bypass security, and then address those issues. Hopefully, you can solve them thoroughly. An obvious problem is we lack effective strategies for fixing these issues. But again, I feel confused here because, as you noted, Claude is quite capable.
speaker_1: Yes.
speaker_0: My key question is, do you agree with the folks who have been recently arguing that Claude 3 Opus is still the most aligned model ever? Do you have a theory of why that would be? If that is true, why have we lost ground in the alignment department since Claude 3 Opus? And then, more broadly, what are our prospects? Claude 5 and GPT 6 are scheming...
speaker_1: Yes.
speaker_0: ...and it looks like they might be pretty good at it. Great news, they didn't take over this time. What are you going to do about it? Do we have any alignment strategies that you think would be developed enough in time to solve these things so that we don't just say, "Ah, it wasn't that bad"? And, "Oh, let's just-"
speaker_1: Yeah.
speaker_0: ...adjust the system prompt and add another layer to the defense in depth, and that'll probably be good enough, right?
speaker_1: Yeah.
speaker_0: Do we have anything better than that? I don't know what it would be right now.
speaker_1: Yeah, you're right. I don't have a strong opinion on whether Opus 3 was the pinnacle of alignment. I think part of the challenge here is that we don't necessarily even have a good operational definition for that question. My intuition, which I wouldn't rely on too much anyway, is that Opus 4 is probably more aligned in an absolute sense. However, Opus 4 has much more capability to be misused, to scheme, to do all of these bad behaviors. So you need to hold it to a higher bar. It does feel like there was more capabilities progress going from Opus 3 to Opus 4 than there was safety progress, although the guardrails did get a lot better. There were very few guardrails in Opus 3. But this is quite hard to quantify. Many people have this intuition that safety and capabilities are increasing in parallel, and you want them to stay at least a constant gap. Ideally, they converge, and you really don't want them to diverge. We don't really know what scale we should be measuring for capabilities or safety on the Y-axis. In fact, depending on the scale you choose, you could make lines that would otherwise converge, diverge, so it's a bit subjective. That's one of the things I find hard to reason about. But to answer this question more broadly of, okay, suppose we do see these warning signs, what do we actually do? I agree with you that currently it doesn't seem like there is enough will to take really significant capabilities or performance hits in order to make a system more reliable and safe. Or maybe some developers are going to do that because there are certainly consumers, big enterprise companies, and safety-critical industries that would prefer to have a model that's more reliable. We're seeing different leading developers positioning themselves differently on this trade-off, but also specialized models like Race 1 Signet emerging that are really optimizing hard for security. And these do have use cases. But there's going to be someone who's just like, "No, I want to turn the knob all the way to capabilities and low latency, and I'm only going to do safety if it's really quite cheap or actually makes the model more useful for a wide variety of users." I think that's an area where we just don't have a good answer. I can point to some things I'd like to see people do more of. Even simply gathering larger, higher-quality human training datasets, including for long-horizon tasks, including people scrutinizing these models. Why do you get more sycophancy sometimes when you train them for longer? Probably your human data actually just rewarded that. But if you took a closer look, you'd be able to fix that. I don't want to say these problems are trivial, but they're not. It's very hard to get high-quality data at scale. But this isn't some unknown problem requiring great research innovation. You just have to spend a lot of money on it, build out a team, and spend some time iterating on that and seeing how your model evolves. Obviously, there are some scalable oversight techniques that are beginning to really reduce the cost of that human labor. There's been some good work on AI safety via debate coming out of Google DeepMind, and also some work from the UK's AI Security Institute. One recent piece of work that we did, which has both good and bad news, is using scalable oversight to make models less deceptive. In some ways, deception is kind of the core safety issue to focus on because if your model is honest, you can just ask it a lot of questions, like, "Are you trying to take over? 
Why did you write this PR?" It might be malicious, but at least you're going to be able to tell pretty upfront that it is. What we found was that if you train AI systems against lie detectors, which are other ML models trained to detect if that system is being deceptive, then you can, with the right setup, actually get models to generalize to be much more honest, so this deception rate really went way, way down. The other good news is that it doesn't seem, although we need to do more investigation to confirm this, that making the models larger necessarily makes this any harder to detect. Whereas making the lie detectors larger can actually make them much more accurate. So, preliminarily, it seems like the scaling trend is in our favor, although we need to get more data to confirm that. But at least it seems unlikely to be against us. The caveat here, and I think this is maybe the crux that people like Zvi and I disagree on, is if you train a system not to exhibit some kind of bad behavior, does it generalize to stop doing that in other instances? Or does it learn, "Oh, shit, you caught me. I need to be better at scheming"? We did find that under certain training regimes, it matters a lot whether it's on-policy or off-policy RL, and we have some theoretical insights into why that would be the case. It matters for things like KL regularization, how much you're allowed to explore in policy space. Then you can get instances where the model just learns to fool your lie detector. So now you've done this safety training, but you've actually made the system much, much worse. I think there is a very real reason to be concerned that developers are going to find a problem, not very carefully patch it, just rush out a fix, like changing the system prompt or doing a bit of fine-tuning, and the problem is going to appear to disappear, but you've actually made it worse. But based on what we're currently seeing, I would be optimistic that a rigorous engineering approach to this, where you do train away problems you see, but you're careful with your setup, you're careful to have high-quality data, you do some extra evaluation to check that you're not incentivizing the model to do the same bad thing more surreptitiously, you could get quite far. I don't know if you get past human-level intelligence safely with this, because that's the point where you stop being able to reliably provide human feedback on a lot of this. But I think you can at least get to human level and probably a little bit beyond with scalable oversight. Yes.
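For readers who want a concrete picture of what a "lie detector" can be, here is a minimal sketch of a linear probe on activations, the simplest kind of detector mentioned in this conversation. The activations below are synthetic stand-ins, and the sketch omits the actual FAR.AI training loop (training the policy against the detector during RL); it only shows fitting and evaluating the probe itself.

```python
# Minimal sketch of a lie detector as a linear probe: logistic regression on
# hidden-state activations labeled honest vs. deceptive. Synthetic activations stand in
# for real ones; a real setup would extract activations from the model on statements
# whose honesty is known, then (as discussed above) train the policy against the probe.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 256, 2000                                   # activation dimension, labeled examples

# Pretend deception shifts activations along one unknown direction.
deception_direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)                # 1 = deceptive, 0 = honest
activations = rng.normal(size=(n, d)) + np.outer(labels, 0.2 * deception_direction)

probe = LogisticRegression(max_iter=1000).fit(activations[:1500], labels[:1500])
print("held-out detection accuracy:", probe.score(activations[1500:], labels[1500:]))
```

The attraction of such a simple detector is that it is cheap to apply at training time; the open question Adam raises is whether optimizing against it produces genuine honesty or a model that has merely learned to fool the probe.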
speaker_0: Let me just echo back a couple key points there, and then I have a couple of questions. The-
speaker_1: Sure.
speaker_0: ... Obfuscated Reward Hacking, I believe, was the name of the paper from OpenAI, where they showed this pressure on the chain of thought would-
speaker_1: Mm-hmm.
speaker_0: ... initially create higher performance and reduce the reward hacking bad behavior. But too much, and the reward hacking comes back-
speaker_1: Yeah.
speaker_0: ... but the chain of thought signal doesn't. So in some ways, you're in the worst scenario, where now you've-
speaker_1: Yep.
speaker_0: ... lost your observability, but you're still getting the bad behavior. And you're basically-
speaker_1: Mm-hmm.
speaker_0: ... saying that that's the result of one training formula. But it does seem that if you do off-policy training, which is basically to say you use human examples, or at least human-evaluated preference data, as opposed to the GRPO style where the model just gets to do a ton of trials and explore the space more extensively on its own, and you anchor to the human prior that was presumably learned in some earlier post-training step, then you think you can avoid that outcome.
speaker_1: Yeah, I think you can avoid it, or at least we can develop empirically grounded rules of thumb for safer training formulas. And over time, we can test them to see how reliable they really are. But I think the other piece I'd advocate for is having some good held-out evaluation criteria. So, if your training scheme looks like you optimize a particular measure of success, and you look to see if that measured success goes up, and you've got no alternative measure that's going to catch bad behavior, then that's a pretty scary situation to be in. Even putting aside these safety concerns, it's very easy to overfit on that metric. Whereas, suppose what you have is a pretty simple lie detector, like a linear probe on my activations or a bag-of-words model, something that is definitely giving you a meaningful signal, even though in principle the model could learn to fool it, because it's a pretty simple training objective. And then you do a much more rigorous and expensive evaluation, and you don't catch any new bad behavior in the AI system. And maybe you can also bust out some interpretability tools and look at the difference between the checkpoints and actually understand what the model is doing. But I think there are ways in which you can get a reasonably high degree of confidence in how the system is operating. It's not perfectly guaranteed, but at least if you're starting from a model that is not super deceptive, or super intelligent at deception, then you can get a high degree of confidence that your training is not pushing it into that regime. I think none of these methods have a very good answer for what happens if your model, just from pre-training, is incredibly deceptive and trying to fool you. I think that's pretty tricky to solve. But there are also strong conceptual reasons to think that pre-training is not going to produce a system with coherent objectives, even if it might have some latent capability for deception. But I think this concern is the core one, and so this is actually one of the things that we're trying to test right now, which is intentionally training AI systems that are actually very deceptive but where none of our methods are able to detect them being deceptive. And there's one question of just, can we do this? And I think the answer is yes, from preliminary results. But then the next question is, okay, can we do this in any way that seems at all realistic and like how you would actually train AI systems? And if every way we can come up with to produce such systems involves a very alien kind of training objective, then I think that we can be reasonably confident that, at least at current scale, this isn't going to emerge. But it's a check that you would want to repeat periodically to make sure this isn't some kind of scale-dependent property.
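As a concrete illustration of the "pretty simple lie detector" mentioned here, the sketch below fits a linear probe on pooled activations and uses it purely as a held-out check; the names, shapes, and labeling scheme are assumptions for illustration, not the setup from the actual experiments.

```python
# Illustrative sketch: a linear-probe "lie detector" on pooled activations,
# used as a held-out evaluation rather than a training signal. Names and the
# labeling scheme are assumptions, not the setup from the work described above.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_probe(activations, is_deceptive):
    """activations: (n, d_model) pooled hidden states; is_deceptive: (n,) 0/1 labels."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations, is_deceptive)
    return probe

def flagged_rate(probe, activations, threshold=0.5):
    """Fraction of held-out responses the probe flags as deceptive."""
    return float(np.mean(probe.predict_proba(activations)[:, 1] > threshold))
```

Keeping a probe like this out of the training loop matters for the reason given above: if safety training drives down the metric you optimized against but an independent held-out check still fires, that is evidence the model learned to hide the behavior rather than drop it.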
speaker_0: What about interpretability? This has been a big debate obviously-
speaker_1: Yeah.
speaker_0: ... over the last couple of years, where there have been very grand hopes for it. I just listened to Neel Nanda on 80,000 Hours saying that he's scaled back his hopes for interpretability, but still thinks there's a lot of value, just not in the maximalist sense of, we're really going to fully understand what's going on, and then we'll have guarantee-type assurances that-
speaker_1: Yeah.
speaker_0: ... things will behave. There was another really interesting paper that you guys put out around digging into the planning mechanisms learned by-
speaker_1: Yeah.
speaker_0: a model that was playing a puzzle game. We'll abstract away from the details.
speaker_1: Yeah.
speaker_0: there for now, but where do you sit right now on how much we can expect to get from interpretability?
speaker_1: Yeah, I think it's definitely a core area. You'd really be tying your hands behind your back if you say, "I'm only going to look at the black-box behavior of models; even though it's trivial to open them up and look inside, we're just not going to do that." But I think mechanistic interpretability specifically had this ambition of fully reverse-engineering how a system works and being able to give a human-readable, or at least human-interrogatable, understanding of a system's internals. You could ask questions like, "Is there anywhere in this system that is doing some kind of learned planning computation?" I do think that we've seen enough research here to be quite skeptical that we're going to get such a simple, reverse-engineered artifact. People have been able to fully reverse-engineer some non-trivial neural networks. We've seen papers reverse-engineering particular circuits doing certain NLP tasks. We've seen papers reverse-engineering certain modular arithmetic tasks. Our paper pretty much fully reverse-engineered this planning algorithm. So I think we basically fully understand how it represents the plans, and we understand most of the computation that's constructing the plans, except for some heuristics that don't have any long-horizon effects. But when you look at these answers, usually what you take away is, "Wow, that was a mess." It's definitely an evolved, organically grown system, not a planned engineering artifact. And although it's easier to reason about these systems if you have that mechanistic understanding, it's still pretty hard to get any kind of guarantee because they are messy, complex systems. So perhaps we have to dial back our ambition there a bit to say, "Okay, we might be able to reverse-engineer systems to the point where we can ask some high-level questions, like, does it have any circuit that's doing some kind of human theory of mind?" We might want to understand when it's using that circuit because that's very relevant to deception. If it's only using theory of mind when it's writing this particular line of code, maybe we should understand what that line of code is doing. If it's about a user interface, okay, it needs to model humans. If it's about a cryptographic algorithm, that's kind of weird and concerning. So there might be some coarse-grained things there. And I think we have seen some valuable applications of what I'd call more coarse-grained interpretability, such as simple linear probes, where you're training a linear regression model on a system's activations. But I do think there's been this disconnect between this very foundational, reverse-engineering style work and the actual applications, which usually use much simpler methods. I'd love to see more things meeting in the middle where we say, "Okay, there's a particular domain where we think interpretability is going to be really useful. Maybe it's detecting models that are sandbagging, intentionally holding back certain capabilities. Maybe it's detecting model backdoors. Maybe it's detecting spurious confounders that the system might have learned during training, and then we can ablate the ones that we don't want, and it generalizes better." You need to pick the application that you think interpretability is going to be really killer for, and then develop foundational methods that make progress on that. But I think the more exploratory style of research would have worked if there were a clean explanation for everything.
At this point, I don't have much hope that we're going to get there.
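One common concrete form of the coarse-grained interventions mentioned above, such as ablating a spurious feature learned during training, is to estimate a direction in activation space from contrastive examples and project it out. The sketch below is a generic version of that idea under assumed tensor shapes, not code from any of the papers discussed here.

```python
# Illustrative sketch: estimate a feature direction from contrastive activations
# (difference of means) and project it out of the model's activations. A generic
# coarse-grained intervention, not code from any specific paper discussed here.
import torch

def feature_direction(acts_with_feature, acts_without_feature):
    """Both inputs: (n, d_model) activations; returns a unit direction vector."""
    direction = acts_with_feature.mean(dim=0) - acts_without_feature.mean(dim=0)
    return direction / direction.norm()

def project_out(activations, direction):
    """Remove the component of each (n, d_model) activation along `direction`."""
    coeffs = activations @ direction              # (n,) projection coefficients
    return activations - coeffs.unsqueeze(-1) * direction
```

A natural check with any intervention like this is how much task performance drops afterward, which is one way to tell whether the removed direction carried more than just the unwanted feature.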
speaker_0: Yeah. Do you think there's some innovation we could come up with that would make things cleaner? I always remember the "Seeing Is Believing" paper from Max Tegmark's group, where basically a sparsity...
speaker_1: Yeah.
speaker_0: term caused most of the weights to disappear and you'd get this...
speaker_1: Mm-hmm.
speaker_0: really crystalline algorithm.
speaker_1: Yeah.
speaker_0: Um...
speaker_1: Yeah. So I...
speaker_0: But maybe the world's just too messy to handle that.
speaker_1: No, I think it's possible. I have more hope for training systems to be interpretable than for interpreting systems as they're currently built. I think for sure there are going to be some simplifications we can make to current systems. Stuff like sparse autoencoders has been quite effective, but the problem is their predictive accuracy is too low. They're throwing away enough information that you lose this nice property of having a high-fidelity, reverse-engineered interpretation of a system. But could there be some innovation along those lines, perhaps a bit messier, but something you can search over in an automated way that reduces the complexity enough to work in that space? For sure. I just don't think it's going to be this clean picture of, okay, you peel away some layer of superposition, and now everything is just wonderful. Because we have understood systems at a very nuanced level, and they're just messy. But that doesn't mean you couldn't train a system to be cleaner, or at least isolate the complexity. I think one of the main methodological innovations we came up with in that paper investigating learned planning was saying, "Okay. We're going to look through all the channels in this neural network and just categorize them into ones that we do need to understand and ones that we don't." Because there were a bunch of channels that were short-term heuristics that we could show only had an impact on the very next move; they didn't matter to what the system was going to do three steps into the future. So what I'm saying is, we don't really understand this heuristic channel. But if what we're trying to do is understand how a system is planning, it doesn't matter. It has no impact on long-term planning. I think there are a lot of cases like that where there might be a huge amount of complexity in these systems. But most of the complexity you can just show is not relevant to a particular question that you're answering. If you're able to train the system with an objective to explicitly disentangle, let's say, that kind of sensitive channel of information from the less important channels, then it could get a lot easier. But I do think there is likely to be some kind of performance trade-off for training systems in those ways. So really, it's a question of whether this becomes a key thing that people are actually willing to optimize for.
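A hypothetical sketch of the channel-categorization step described above: run a short rollout with one activation channel zeroed and compare the chosen moves against an unablated rollout. The model and environment interfaces (`model.trunk`, `model.act`, `state.step`) are made-up stand-ins for whatever the real game-playing network exposes.

```python
# Illustrative sketch: zero out a single activation channel during a short
# rollout and record the moves. Comparing against an unablated rollout shows
# whether the channel influences anything beyond the very next move. The
# model and environment interfaces here are hypothetical stand-ins.
import torch

def rollout_with_channel_ablated(model, start_state, channel_idx, horizon=3):
    """Return the moves chosen over `horizon` steps with one channel zeroed."""
    def zero_channel(module, inputs, output):
        patched = output.clone()
        patched[:, channel_idx] = 0.0           # ablate the target channel
        return patched

    handle = model.trunk.register_forward_hook(zero_channel)  # hypothetical layer name
    try:
        moves, state = [], start_state
        for _ in range(horizon):
            with torch.no_grad():
                action = model.act(state)       # hypothetical policy interface
            moves.append(action)
            state = state.step(action)          # hypothetical environment transition
        return moves
    finally:
        handle.remove()
```

Channels whose ablation changes only the first move across many starting states can be bucketed as short-term heuristics and set aside when the question is specifically about long-horizon planning.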
speaker_0: Okay. Interesting. How about a lightning round? You're getting involved in more and more areas of the broader AI safety landscape. There are some field-building projects you're involved with, putting on events, bringing people together. You're starting to get more interested in policy advocacy and international relations. Maybe give us a quick overview of all the activities you're investing in that we haven't covered, and give us a sense of why you're motivated to do all of these things. Then, at the end, I know you're hiring, so we can give you a chance to pitch some open roles.
speaker_1: Absolutely. At a high level, the reason we're doing all these different activities is that we see a potentially long pipeline from research innovation to an idea actually being adopted and deployed. Sometimes that means getting companies to adopt and deploy it at scale, making it so good that everyone wants to use it. Other times, that means getting buy-in from governments to mandate or nudge companies to do that, especially if it benefits the field as a whole but might not be in any individual company's interest. So, from idea to deployment, you have initial research innovation, which we're well-placed to do in-house. But very often, you need a research field around it; a small research team isn't enough. We've run events like ControlConf to capitalize on and grow new research fields. AI Control was pioneered by Redwood Research, not us, but we're happy to help grow fields that others started. That's an example of where our events and field-building can really come to the forefront. Another missing piece in the pipeline is making a really clear, rigorous proof of concept. This means not just doing exploratory research, but definitive research, which is where our current deception work is heading. We had an initial idea for scalable oversight that would solve deception, and it seems to work at a small scale. Now we want to validate it on frontier open-weight models, scale it up by several orders of magnitude, and check that it works across all of them. We want to stress test it by trying to train systems that are genuinely deceptive and might break current techniques, to see how realistic that is. If it works, we can then say, "Look, DeepSeek-R1, Llama-405B, and all these models trained with our technique are simply better than the original models. They're less deceptive, less sycophantic. You can trust them more. There's no reason not to use this model." Then we seek uptake from AI developers, which is where our events and policy arms come in. Eventually, we go to governments and say, "Look, this is industry best practice. This is something that should be at least part of your guidelines. If you're considering regulation, you should consider something that doesn't say you have to use this specific technique, but rather that you have to use a technique that's at least as good as this, because there's no reason not to." What makes us unique is that we are willing to do the hard work and operate across all these different layers of the tech stack, from groundbreaking exploratory research to the nitty-gritty engineering and scaling work, through to advocacy and sales. This isn't surprising if you come from a startup mindset, where you often have to do all those different things; you need a growth team and an engineering team. But for whatever reason, most nonprofit organizations say, "Okay, we're going to pick this niche." There are many benefits to that, as you can be much more focused as an organization. However, the problem is that very often they develop something and pass the baton to someone else, but there's no one to pick it up, and the baton gets dropped. I think that's one of the big issues with AI safety, and it's what we're trying to resolve. Let me tell you a little more about some of our other activities. For events, we typically run three alignment workshops a year, which are our flagship events, bringing together technical researchers, AI company decision-makers, and people working at government AI security institutes.
This covers everything under the sun on AI safety, but it's more about coordination and information sharing. We're also increasingly doing more policy-focused events. For example, we ran Technical Innovations for AI Policy in DC a few months ago, bringing together governance researchers, actual policymakers, and lawmakers with technical researchers to discuss how we can expand the range of open policy windows. I think regulators are often faced with a difficult choice between genuinely holding back innovation and valuable deployments in some areas, or being completely laissez-faire, with fewer regulations on a billion-dollar training run than on opening a sandwich shop in San Francisco. I believe that's a false dichotomy. You can have things that are pro-innovation and pro-safety, but you need to provide actual technical innovation to drive that forward, which we are trying to catalyze. Finally, mindful of time, I'll quickly touch on our hiring. We are hiring a lot, planning to double in size in the next 12 to 18 months. We have secured funding for the next few years, including the runway to expand. We are still looking to fundraise in a few areas, but those are a bit outside our major donors' priority areas. We have room to expand, so we're hiring across our technical team, for both individual-contributor ML engineering positions and people to lead new research directions. We're hiring in operations. Probably the most important hire this year is a Chief Operations Officer, which will elevate my co-founder into a president role and allow them to start driving forward some of the non-technical work we do. We're also hiring for more junior positions, such as operations generalists and a project manager for events. So, if you're excited to work in this space and you like the kind of work we're doing, do check out our career page. Even if you don't see a job that's a good fit right now, we have a general expression of interest form, so we encourage talented people who are excited by us to fill it out, and we'll get in touch when a role opens up.
speaker_0: Love it. It struck me that as you described all that, you're going fully vertically integrated, right? You're going all the way from research to, in theory, handholding people into actually putting these things into production systems. This is maybe more of my hobby-horse question than something you really want to do, but I've been really interested in this idea of private governance. And I wonder if society might draft you and the far.ai team into becoming one of these private-sector quasi-regulatory bodies. It would seem like you have the range of capabilities needed to do it, but would you be excited to do it if that were actually the law and people were looking for orgs to fill that niche?
speaker_1: You're totally right. We have the skillset and org structure to do that. I'd say that this isn't part of our mainline plan, perhaps because it does still feel pretty nascent-
speaker_0: It's not yet the law.
speaker_1: ... this idea of private regulatory organizations. But it's a very important thing to do. So if this were something where we didn't feel like there were other companies stepping up to fill the gap, and we were well-placed to do it, then it's something that we'd be very interested in. I do think it would come with some trade-offs. Right now we're fortunate to be on good terms with both the leading AI companies and governments. And that's basically because we don't actually have any hard power. We can make recommendations and brief people, but we're not overseeing anyone. And we're going to lose some of that independent ability to convene people and make recommendations if we're actually exercising an oversight role. But I think it could be worth it. We're also not wedded to everything being under one organization, so we could either spin out that private-sector regulatory part, or spin out some of our activities that require more independence, like our events, into a separate organization, and just avoid that conflict of interest. So we're open to it, but we'd also love to have more competitors in this space, so that we don't have to do so many different things.
speaker_0: Yes, you are spinning a lot of plates right now. So, thank you for the generosity with your time and perspectives today. And people should definitely check out the open roles, COO and otherwise. Adam Gleave, co-founder of far.ai, thank you for being part of The Cognitive Revolution.
speaker_1: Thanks for having me back, Nathan. Great to see you again.